Developing and Validating Statistical Cyber Defenses

The development and validation of advanced cyber security technology frequently relies on data that captures normal and suspicious activities at various system layers.

Enterprise business processes are more connected than ever before, driven by the ability to share the right information with the right partners at the right time. While this interconnectedness and situational awareness is crucial to success, it also opens the possibility that sophisticated adversaries will misuse the same capabilities to spread attacks and corrupt critical, sensitive information. This is particularly true for insider threat scenarios, in which adversaries have legitimate access to some resources and unauthorized access to other resources that is not directly controlled by fine-grained policies.

Behavior-Based Access Control (BBAC) is a data-intensive system that turns real-time feeds into actionable information through a combination of unsupervised and supervised machine learning (clustering and support vector machines). BBAC augments existing authorization mechanisms such as firewalls, HTTP proxies, and application-level Attribute-Based Access Control to provide a layered defense in depth. Its specific focus is to analyze the behaviors of actors and assess the trustworthiness of information through machine learning. BBAC uses statistical anomaly detection techniques to make predictions about the intent behind creating new TCP connections, issuing HTTP requests, sending emails, or making changes to documents. By focusing on behaviors that are nominally allowed by static access control policies but that look suspicious upon closer investigation, BBAC aims to detect targeted attacks that currently go unnoticed for extended periods, often months, before defenders become aware of them.
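
The combination of unsupervised and supervised learning can be illustrated with a minimal sketch. The feature set, the toy data, the labels, and the use of scikit-learn below are illustrative assumptions, not the BBAC implementation: behaviors are first clustered, the cluster id is appended as a derived feature, and an SVM is then trained on the labeled, enriched vectors.

```python
# Illustrative sketch (not BBAC's actual code): cluster behavioral feature
# vectors to characterize groups of activity, then train an SVM to classify
# new observations as benign or suspicious.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical per-connection features: bytes out, bytes in,
# duration (s), distinct destination ports in the last hour.
X = np.array([
    [1200.0,  800.0,  0.5,  3],
    [1500.0,  900.0,  0.7,  2],
    [90000.0, 200.0, 12.0, 45],   # bulk, many-port behavior
    [1100.0,  750.0,  0.4,  4],
    [85000.0, 150.0, 10.0, 50],
    [1300.0,  820.0,  0.6,  3],
])
y = np.array([0, 0, 1, 0, 1, 0])  # 0 = benign, 1 = suspicious (labels assumed)

scaler = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X))

def enrich(features):
    """Scale raw features and append the behavioral cluster id as a derived feature."""
    scaled = scaler.transform(features)
    return np.column_stack([scaled, kmeans.predict(scaled)])

# Supervised step: SVM trained on the enriched, labeled feature vectors.
clf = SVC(kernel="rbf").fit(enrich(X), y)

# Classify a new, unlabeled observation; on this toy data it is likely flagged as suspicious.
print(clf.predict(enrich([[80000.0, 300.0, 11.0, 40]])))
```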

The figure shows a high-level diagram of the processing flow in BBAC, together with the various data sets involved. As shown at the bottom, BBAC ingests a large variety of data from real-time feeds through a feature extraction process; during online use, this data is used for classification. After parsing the raw observables, BBAC moves into a feature enrichment phase, where aggregate statistics are computed and information from multiple feeds is merged into a consistent representation. At this stage, BBAC must manage the intermediate state required for more complex enrichment functions, e.g., calculating the periodicity of events.
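
As a concrete example of such an enrichment function, the sketch below scores how periodic a stream of events is. The specific measure (coefficient of variation of inter-arrival times) and the per-key state layout are illustrative assumptions, not necessarily the ones BBAC uses; highly regular intervals often indicate automated beaconing rather than human-driven activity.

```python
# Illustrative enrichment function (an assumption, not BBAC's actual code):
# given event timestamps for one actor/resource pair, score how periodic
# the events are.
from statistics import mean, pstdev

def periodicity_score(timestamps):
    """Coefficient of variation of inter-arrival times.

    Values near 0 mean near-perfectly periodic events; larger values mean
    irregular, bursty activity. Returns None if there are fewer than three
    events (no meaningful intervals to compare).
    """
    if len(timestamps) < 3:
        return None
    ts = sorted(timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(intervals)
    if avg == 0:
        return 0.0
    return pstdev(intervals) / avg

# Intermediate state kept per (source, destination) key during enrichment.
seen = {"10.1.2.3->update.example.com": [0, 60, 120, 181, 240, 300]}
for key, times in seen.items():
    print(key, periodicity_score(times))   # ~0.01: near-periodic, worth flagging
```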

BBAC is a data-intensive system whose successful execution hinges on (a) access to a large amount of external data and (b) efficient management of internal data. Specifically, meaningful data sets are needed to develop and validate the accuracy, precision, and latency overhead of the BBAC algorithms and prototypes. BBAC's analysis techniques work best with data that has a rich context and feature space; statistical inference requires a large amount of granular data. Obtaining more granular information generally means installing software on end systems or even recompiling applications (to map memory regions, etc.), both of which raise practical concerns. To address these granularity issues, BBAC focuses its analysis on data that is easily observable without installing new software or modifying end systems.
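
One example of such easily observable data is a web proxy log collected at the network edge, which requires no agent on end systems. The sketch below turns one log line into a flat feature record; the log format, field positions, and feature names are assumptions made for illustration.

```python
# Sketch of extracting features from data already visible at the network
# edge (a squid-style "timestamp client method url status bytes" proxy line),
# with no software installed on end systems. Format is assumed for illustration.
from urllib.parse import urlparse

def features_from_proxy_line(line):
    """Turn one proxy log line into a flat feature dict for later enrichment."""
    ts, client, method, url, status, size = line.split()
    parsed = urlparse(url)
    return {
        "timestamp": float(ts),
        "client": client,
        "method": method,
        "host": parsed.hostname,
        "path_depth": parsed.path.count("/"),
        "query_length": len(parsed.query),
        "status": int(status),
        "bytes": int(size),
    }

line = "1325376000.123 10.1.2.3 GET http://example.com/a/b?q=1 200 5120"
print(features_from_proxy_line(line))
```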

Since BBAC performs analysis at multiple system layers, it not only needs access to data from sensors at these layers, but the data from each layer must also be linked to the other layers to form a consistent picture of observables. To address the problem of independence between data sets, BBAC uses an approach for injecting malicious URLs into the request streams of benign hosts: known-bad HTTP requests are retrieved from blacklists and intelligently inserted into existing connection patterns. It is important to keep the ratio of normal to abnormal traffic roughly equal, so that the resulting classifier can make decisions based on both known proper behavior and known improper behavior.
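
The sketch below illustrates this merging idea under stated assumptions: the record structure, the blacklist source, and the sampling logic are hypothetical, not BBAC's actual tooling. Benign requests keep their real timing and client context, a sampled subset has its URL swapped for a known-bad one, and the labels come out roughly balanced.

```python
# Illustrative sketch (details assumed, not BBAC's actual tooling): take HTTP
# requests captured from benign hosts, rewrite a sampled subset with known-bad
# URLs from a blacklist, and label the result so the classes stay roughly equal.
import random

def inject_malicious_urls(benign_requests, blacklist, ratio=0.5, seed=0):
    """Return labeled requests with roughly `ratio` of them rewritten as malicious.

    benign_requests: list of dicts with at least a "url" key, taken from real
                     (benign) connection patterns so timing and client context
                     stay realistic.
    blacklist:       list of known-bad URLs (e.g., from a public blacklist).
    """
    rng = random.Random(seed)
    labeled = []
    for req in benign_requests:
        record = dict(req)
        if rng.random() < ratio:
            record["url"] = rng.choice(blacklist)   # keep context, swap the URL
            record["label"] = "malicious"
        else:
            record["label"] = "benign"
        labeled.append(record)
    return labeled

benign = [{"client": "10.1.2.3", "timestamp": 1000.0 + i, "url": "http://intranet/page"}
          for i in range(6)]
blacklist = ["http://bad.example/exploit.js", "http://evil.example/drop.php"]
for rec in inject_malicious_urls(benign, blacklist):
    print(rec["timestamp"], rec["label"], rec["url"])
```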

Development and validation of statistical cyber defenses requires well-labeled, appropriately sized, and readily available relevant data sets to make innovative progress, yet too few such data sets are available today. Agile project management techniques help deliver innovative technology in this difficult-to-work-in, data-intensive environment.

This work was done by Michael Jay Mayhew of the Air Force Research Laboratory, Michael Atighetchi and Aaron Adler of Raytheon BBN Technologies, and Rachel Greenstadt of Drexel University. AFRL-0231