Discover how a SOC helps you fight intruders from the confines of a safe house.

By Jay Milne

No SOC is complete without a place to analyze suspicious programs and test possible solutions. This is especially true for organizations unlucky enough to be hit by a zero-day or unknown exploit, worm or virus. Most of the time, even unknowns are a variant of some known worm. The variant may use a different payload or a new propagation method. Still, it may be difficult to determine what's occurring and what you must do to stop it. Enter the sandbox.

A sandbox is a set of systems--or even a single system--that's separate from any other network in your organization and used to analyze the behavior of suspicious applications. You can also use a single system running a virtual machine application, such as VMware. Set up multiple VMware guest OSs to share a private network segment--for instance, you can have Windows System A as the infected system and Windows System B as the target system. Run a packet-capture tool, such as Ethereal, on System A to capture all packets going in and out of the system, then run Ethereal on System B and compare the two captures. Other useful tools for sandboxing, including Regmon, Filemon and Diskmon, are available from Sysinternals. Of course, similar tools can be found for your favorite Unix OS, too.
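The comparison of the two captures can be sketched in a few lines. This is only an illustration: it assumes each capture has already been reduced to a list of flow tuples (source, destination, protocol, destination port), and the function name `compare_flows` is hypothetical rather than part of any capture tool.

```python
def compare_flows(flows_a, flows_b):
    """Compare flow summaries taken on System A and System B.

    Each flow is assumed to be a (src_ip, dst_ip, protocol, dst_port) tuple
    exported from a packet capture; this sketch does not parse capture files.
    """
    seen_on_a, seen_on_b = set(flows_a), set(flows_b)
    return {
        "both": seen_on_a & seen_on_b,    # traffic observed on both systems
        "only_a": seen_on_a - seen_on_b,  # sent by A but never seen on B
        "only_b": seen_on_b - seen_on_a,  # seen on B but not captured on A
    }
```

Flows that show up only on the infected system point to traffic aimed at hosts other than the designated target, which is often the first hint of how the program tries to spread.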

A SOC also demands a large display to monitor the health of network traffic and load, routers and firewalls, Active Directory, DNS, production-application servers and other key items.

A Series of Unfortunate Events

Security events can be generated from many devices and applications, including firewalls, routers, VPN concentrators, Web and antivirus applications, and IDS/IPS devices. The sheer number of events that many companies must collect, analyze and store can be staggering. To help, many SOC teams are turning to SEM (security event management) tools. ArcSight, GuardedNet and Network Intelligence are the leaders in this area. It's not uncommon to hear that some SEM tools process several thousand events per second.

But not every event must be captured and analyzed. To reduce the volume of data, aggregate duplicate events and select which types of events get captured. You can limit your analysis to events coming from critical servers and systems, for instance. A SEM product working with a vulnerability assessment (VA) tool can help you identify the events that target systems actually vulnerable to those attacks. Make sure your VA data is up-to-date and scans are done frequently. This can be challenging when dynamic IP addressing is used, because many VA products use only the host IP address as the unique identifier. Thus, if a workstation's IP address changes over time, the VA solution may have difficulty tracking it and will include redundant scan information.
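As a rough illustration of the filtering, aggregation and VA correlation described above, consider the following sketch. The `Event` fields, the host inventory and the vulnerability map are all hypothetical stand-ins; a real SEM product would pull them from its own collectors and from the VA tool's scan results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source_ip: str   # originating address
    target_ip: str   # system the event was reported against
    signature: str   # for example, an IDS rule name


# Hypothetical inventory of critical systems and VA findings, keyed by IP.
CRITICAL_HOSTS = {"10.0.0.5", "10.0.0.6"}
VULNERABLE_TO = {"10.0.0.5": {"sql-injection"}}


def aggregate(events):
    """Keep only events aimed at critical hosts and count the duplicates."""
    kept = (e for e in events if e.target_ip in CRITICAL_HOSTS)
    return Counter((e.target_ip, e.signature) for e in kept)


def prioritized(counts):
    """Return aggregated (target, signature) pairs the target is vulnerable to."""
    return [
        (ip, sig) for (ip, sig) in counts
        if sig in VULNERABLE_TO.get(ip, set())
    ]
```

Because the vulnerability map is keyed by IP address, this sketch inherits the dynamic-addressing weakness noted above; keying on a stable asset identifier would avoid it.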

As for storing the data generated from these events, the length of time to keep the logs depends on various government and industry regulations and standards, such as HIPAA, SOX and ISO 17799. Some of these don't define a retention time, but others specify a minimum of seven years. NIST, in its "Computer Security Incident Handling Guide, Special Publication 800-61," recommends retaining at least several months of data. The North American Electric Reliability Council says audit logs must be kept for three years. To help reduce log storage, some SEM tools can apply a different retention period to each type of log data. You may want to retain your IDS data for 60 days, for instance, but keep VPN logs for 120 days and application logs for years.
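A per-type retention policy like the one in that example could be expressed as a simple lookup. This is a minimal sketch: the 60- and 120-day figures come from the example above, while the three-year value for application logs and the function name `is_expired` are assumptions made only for illustration.

```python
from datetime import datetime, timedelta

# Retention periods in days. The IDS and VPN values mirror the example in the
# text; three years for application logs is an assumed stand-in for "years".
RETENTION_DAYS = {"ids": 60, "vpn": 120, "application": 3 * 365}


def is_expired(log_type: str, written_at: datetime, now: datetime) -> bool:
    """True when a record is older than the retention period for its type."""
    return now - written_at > timedelta(days=RETENTION_DAYS[log_type])
```

A purge job would call `is_expired` for each stored record and delete or archive only those that return True, so short-lived IDS data never forces long-lived application logs out with it.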

Any SEM tool--or any event collection/aggregation tool--must be able to keep working during abnormally high event counts. IDS/IPS devices and firewalls can generate an enormous number of events during a virus or worm outbreak. You can increase the SEM's availability with multiple horizontally partitioned databases. One database can be used for events generated recently while another database is used for historical analysis. The database used for correlation and reporting may be different from the database used for the short-term events. If everything is stored on a single database, the additional load on the server will hamper its ability to run reports.
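One way to picture the split between a short-term database and a historical one is to route events by age. The sketch below is an assumption-laden illustration: the seven-day window and the store names are invented here, and real SEM products partition their data in their own ways.

```python
from datetime import datetime, timedelta

RECENT_WINDOW = timedelta(days=7)  # assumed cutoff, not taken from the article


def pick_store(event_time: datetime, now: datetime) -> str:
    """Name the store that should hold an event of the given age."""
    return "recent" if now - event_time <= RECENT_WINDOW else "historical"


def partition(events, now):
    """Split (timestamp, payload) pairs into one list per store."""
    stores = {"recent": [], "historical": []}
    for timestamp, payload in events:
        stores[pick_store(timestamp, now)].append(payload)
    return stores
```

Keeping the recent store small means a burst of new events during an outbreak competes less with long-running reports, which read mostly from the historical store.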

You also can ensure information availability using other methods. A fault-tolerant design, for instance, keeps your system operational in the event of a network or physical infrastructure failure. With a fully redundant design, events from each device are sent to primary and secondary event collectors, which then send data to aggregation engines. Ideally, from a network-design perspective, these should all be located near the event source. Additionally, you could have two event-management databases--one at a primary SOC and the other at a secondary SOC.

Log Jam

Most security devices can send their logs to a syslog server, which acts as an event collector for these various devices. The real challenge comes when integrating incident logs with application logs, such as those from a Web server or Web application (for example, a record of someone logging in). The vendors of enterprise-class SEMs offer agents or wizards to help simplify this integration, but you must decide which data elements should be captured and written to logs. Some key data to include are date and time, IP address, action performed and user name.
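For an in-house application, writing those four data elements in a consistent, machine-readable form makes later integration easier. The following is a minimal sketch; the JSON layout, the field names and the function `format_log_record` are assumptions chosen for illustration, not a format required by any SEM vendor.

```python
import json
from datetime import datetime, timezone


def format_log_record(ip_address, user_name, action, when=None):
    """Build one log line carrying date/time, IP address, action and user name."""
    when = when or datetime.now(timezone.utc)  # default to the current UTC time
    record = {
        "timestamp": when.isoformat(),
        "ip_address": ip_address,
        "user_name": user_name,
        "action": action,
    }
    # sort_keys keeps the field order stable from one record to the next
    return json.dumps(record, sort_keys=True)
```

Recording timestamps in UTC, as this sketch does by default, avoids ambiguity when logs from systems in different time zones are correlated.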

Information on an event must be analyzed before the event can be categorized as an incident. Once an event becomes an incident, the ability to track, monitor and document it becomes critical. You can record all this information on a spreadsheet or in an Access database, but these aren't scalable. Much better is a robust incident-management program from the likes of Remedy.

Policies and Processes

Because the SOC will be the first place security events are collected and managed, there must be a clear and consistent set of processes through which to execute your organization's policies and procedures. (This consistent methodology will satisfy documenting requirements specified by SOX and other regulations.)

Incident-response guidelines dictate how incidents are prioritized, who performs what tasks and when to declare an enterprise-level incident. There are many models a company can use to build an incident-response process, including those from SANS, NIST, Carnegie Mellon University and the IETF (RFC 2350).

The basic components of an incident-response process are identification, containment, removal, recovery and post-incident analysis. Most companies tend to do the first four steps very well, but truncate the last or never bother to implement the fixes recommended by post-incident analysis. Thus, companies will find themselves dealing with the same problems again and again.

Part of the process is identifying and training an incident-response team, which will usually consist of dedicated and nondedicated personnel. Having an incident-response team reduces chaos, provides consistency and ensures that all process elements, especially notification and documentation, are followed. In addition to the usual techies, such as firewall and server administrators and network operators, include a representative to communicate with the business side. This helps ensure that accurate and timely information is delivered to system users so backup and contingency plans can be implemented.

Jay Milne, CISSP, is a senior security consultant for a large healthcare provider in Northern California. He has 15 years of IT security experience and has consulted for companies such as Chevron and Genentech. Write to him at jmilne@nwc.com