Data Mining in Cyber Operations

Data mining models describe dynamic behavior of attacks and failures, enabling defenders to detect and differentiate simultaneous attacks on a target network.

Defending cyberspace is a complex, broadly scoped challenge that must account for emerging threats to security in space, on land, and at sea. The global cyber infrastructure is difficult to defend because of its complexity and the massive amount of information transferred across the global network daily. This infrastructure is made up of the data resources, network protocols, computing platforms, and computational services that bring people, information, and computational tools together.

The Knowledge Discovery from Data process allows for the “mining” of valuable knowledge from vast amounts of data, just as a miner mines for gold.

Data mining is the process of discovering interesting patterns in large amounts of data, which is often a challenge in cyber operations. To gain a tactical edge, a warfighter must be able to apply data mining techniques in order to maneuver in cyberspace. Maneuverability in cyberspace allows attackers and defenders to simultaneously conduct actions across multiple systems at multiple levels of warfare. For defenders, this can mean hardening multiple systems simultaneously when new threats are discovered, killing multiple access points during attacks, collecting and correlating data from multiple sensors in parallel, or taking other defensive actions.

Defensive operators must be vigilant to identify new attack vectors, real-time attacks as they happen, and signs of attacks that have gotten through the security perimeter. This means that defenders must continuously sift through vast amounts of sensor data, a task that advances in data mining techniques could make more efficient by helping to accurately map the attack surface, collect and integrate data, synchronize time, select features, develop models, extract knowledge, and produce useful visualizations. Effective techniques would enable models that describe the dynamic behavior of complicated attacks and failures, and allow defenders to detect and differentiate simultaneous sophisticated attacks on a target network.
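The data integration and time synchronization steps named above can be illustrated with a minimal sketch. It assumes each sensor reports events as (timestamp, sensor, message) tuples that are already sorted by time, and that each sensor's clock offset is known in advance; the function names, sensor names, and sample events are hypothetical and chosen only for illustration.

```python
import heapq

def normalize(events, clock_offset):
    """Shift one sensor's timestamps by its known clock offset (in seconds)."""
    return [(ts - clock_offset, sensor, msg) for ts, sensor, msg in events]

def merge_timelines(*streams):
    """Merge per-sensor streams (each already sorted) into one time-ordered list."""
    return list(heapq.merge(*streams))

# Hypothetical sample data: the host sensor's clock is assumed to run 2 s fast.
firewall = [(100.0, "firewall", "deny tcp/22"), (105.0, "firewall", "deny tcp/23")]
host = [(103.5, "host", "failed login"), (104.5, "host", "failed login")]

timeline = merge_timelines(normalize(firewall, 0.0), normalize(host, 2.0))
for ts, sensor, msg in timeline:
    print(f"{ts:7.1f}  {sensor:8s}  {msg}")
```

Once events share a common time base, later steps such as feature selection and model development can treat the merged timeline as a single data source. In practice, clock offsets are rarely known exactly, so a real pipeline would need to estimate them.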

Before one attempts to extract useful knowledge from data, it is important to understand the steps in the data mining process. Simply knowing many algorithms used for data analysis is not sufficient for successful data mining (DM). The figure outlines the process of mining data that leads to knowledge discovery.

The traditional approach to understanding and protecting the cyber domain is a highly manual and human-intensive process. It is growing increasingly difficult for these manual processes to keep up with both the massive amount of data and the quickly changing landscape of the cyber domain. It has become necessary to use automated techniques to maintain situational awareness and effective offensive and defensive strategies in the cyber realm. Data mining within cyber operations provides some techniques to address these challenges. Through the data mining process, one can find hidden patterns, interesting data, or relevant correlations within large datasets. It provides techniques to automate the discovery of structure or patterns that would otherwise be out of reach of human analysts. This analysis is typically performed in an automated process with a variable amount of human interaction, depending on the application.

Intrusion Detection and Prevention Systems (IDPS) are automated software systems designed to monitor traffic or mine through select data sources in search of evidence of an intruder attempting to compromise the network. An IDPS is created to monitor characteristics of a host, the network, or a combination of both. IDPS use three basic types of detection to discover intrusions: signature-based detection, anomaly-based detection, and stateful protocol analysis.

Signature-based IDPS compare observed event patterns against signatures (patterns known to indicate a threat) in order to identify a current threat. A signature-based IDPS is used in firewalls as a first line of defense, as it can efficiently identify precisely defined, common threats and act before damage is done. A disadvantage of this approach is that it relies entirely on a database of known attack signatures to compare against the current network activity. Data mining may be applied to a signature-based IDPS by observing and analyzing known and suspected attacks to discover new signatures and patterns indicative of an intrusion.
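As a rough illustration of signature matching, the sketch below checks each observed event against a small table of known-bad patterns. The signature names, regular expressions, and sample events are hypothetical and far simpler than the rule sets used by a production IDPS.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Signature:
    name: str
    pattern: "re.Pattern"

# Hypothetical signatures, for illustration only.
SIGNATURES = [
    Signature("path-traversal", re.compile(r"\.\./\.\./")),
    Signature("sql-injection", re.compile(r"(?i)union\s+select")),
]

def match_signatures(event: str, signatures=SIGNATURES) -> List[str]:
    """Return the names of all signatures whose pattern occurs in the event."""
    return [s.name for s in signatures if s.pattern.search(event)]

# Hypothetical observed events (for example, HTTP request lines).
events = [
    "GET /index.html",
    "GET /../../etc/passwd",
    "GET /item?id=1 UNION SELECT password FROM users",
]

for event in events:
    hits = match_signatures(event)
    print("ALERT" if hits else "ok   ", event, hits)
```

The sketch also makes the stated disadvantage concrete: an attack that matches no entry in the signature table passes through without an alert.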

Anomaly-based detection depends on understanding normal patterns of network activity and looking for activity that appears abnormal by comparison. An anomaly-based IDPS can be successful in detecting attacks that are novel or that vary too far from a signature to be detectable by a signature-based IDPS. Data mining is well suited to this approach, as anomaly detection relies entirely on defining a baseline of normalcy. Various data mining techniques may be effectively used to learn a meaningful definition of normalcy based on known benign network connections.
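One simple way to picture a learned baseline of normalcy is a z-score test on a single numeric feature. The sketch below is an assumption-laden illustration, not the method used by any particular IDPS: the feature (bytes per connection), the sample values, and the threshold of three standard deviations are all hypothetical.

```python
from statistics import mean, stdev

def fit_baseline(benign_values):
    """Learn a baseline (mean, standard deviation) from known benign traffic."""
    return mean(benign_values), stdev(benign_values)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical feature: bytes transferred per connection on known benign traffic.
benign_bytes = [480, 510, 495, 505, 520, 490, 500, 515]
baseline = fit_baseline(benign_bytes)

for observed in (505, 5000):
    label = "anomalous" if is_anomalous(observed, baseline) else "normal"
    print(observed, label)
```

Real deployments typically model many features at once with techniques such as clustering or classification, but the principle is the same: the quality of detection depends on how well the baseline captures benign behavior.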

Stateful protocol analysis also looks at behavior outside of known signature patterns, relying on a precise understanding of how protocols are designed to be used and what the protocol creators expect to see when those protocols are used. The key is not only finding anomalous behavior in general, but also finding behavior that departs from what is typical for a specific network activity. Again, data mining proves useful for defining what constitutes normal use based on previous network activity.
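The idea of tracking protocol state can be sketched as a small state machine that records which commands are expected in each state and reports any command that arrives out of order. The protocol below is a simplified, hypothetical login-then-transfer exchange invented for illustration; it is not a faithful model of any real protocol.

```python
# Hypothetical protocol model: for each state, the commands that are
# allowed and the state each one leads to.
ALLOWED = {
    "START": {"USER": "USER_SENT"},
    "USER_SENT": {"PASS": "AUTHENTICATED"},
    "AUTHENTICATED": {
        "LIST": "AUTHENTICATED",
        "RETR": "AUTHENTICATED",
        "QUIT": "CLOSED",
    },
    "CLOSED": {},
}

def find_violations(commands):
    """Return (index, state, command) for each command not allowed in its state."""
    state = "START"
    violations = []
    for index, command in enumerate(commands):
        next_state = ALLOWED[state].get(command)
        if next_state is None:
            violations.append((index, state, command))  # state is left unchanged
        else:
            state = next_state
    return violations

print(find_violations(["USER", "PASS", "LIST", "QUIT"]))  # well-formed session
print(find_violations(["USER", "RETR", "PASS"]))          # RETR before login completes
```

Here the transition table is written by hand. As the text suggests, data mining could instead help derive which command sequences constitute normal use from previously observed network activity.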

This work was done by Misty Blowers, Stefan Fernandez, Brandon Froberg, and Jonathan Williams of the Air Force Research Laboratory; and George Corbin and Kevin Nelson of BAE Systems. AFRL-0238