Evolutionary Data Mining Approach to Creating Digital Logic

Genetic program-based data mining is used for automated reverse engineering of a system.

When required to reverse-engineer a product, engineers often do not have design specifications for the system, and the machine may not be disassembled or invasively examined. The engineer might attempt to find the correct signal through trial and error, but this would be very time-consuming, and access to experimental resources is very expensive. To deal with this problem, a genetic program (GP)-based data mining (DM) procedure has been invented.

The first digital logic (DL) to be reverse-engineered using the Genetic Program (GP)-Based Data Mining procedure. This DL is not known to the GP. The GP only has access to a database of input signals to the DL and measured output, as well as a database of rules provided by experts for building the DL.
A genetic program is an algorithm based on the theory of evolution that automatically evolves populations of computer programs or mathematical expressions, eventually selecting one that is optimal in the sense it maximizes a measure of effectiveness, referred to as a fitness function. The system to be reverse-engineered is typically a sensor. The sensor is used to create a database of input signals and output measurements. Rules about the likely design properties of the sensor are collected from experts. The rules are used to create a fitness function for the genetic program. Genetic program-based data mining is then conducted. This procedure incorporates not only experts’ rules into the fitness function, but also the information in the database. The information extracted through this process is the internal design specifications of the sensor. The design properties extracted through this process can be used to design a signal that will produce a desired output. Determination of such signals can be essential to ultimate determination of control rules for automatic multiplatform coordination.

GPs require a terminal set and function set as inputs. The terminals are the actual variables of the problem. These can include a variable like “x” used as a symbol in building a polynomial and also real constants. The function set consists of a list of functions that can operate on the variables. When the GP is used as a data mining function, a database of input and output information is required. When the GP is used as a data mining function for evolving digital logic (DL), the database contains inputs to the DL as well as measured outputs. The experts’ opinions are manifested in the selection of the input and associated output to be included in the database. For the DL case, an additional form of input consisting of “rules” about DL construction is included.

Data mining is the efficient extraction of valuable, non-obvious information embedded in a large quantity of data. Data mining consists of three steps: the construction of a database that represents truth; the calling of the data mining function to extract the valuable information, e.g., a clustering algorithm, neural net, genetic algorithm, genetic program, etc; and determining the value of the information extracted in the second step — this generally involves visualization. When used for reverse engineering, the GP typically data mines a database to determine a graph-theoretic structure, e.g., a system’s DL diagram or an algorithm’s flow chart or decision tree. The GP mines the information from a database consisting of input and output values, e.g., a set of inputs to a sensor and its measured outputs.

The DL consists of three input channels, each with a sensor attached. The sensors receive signals from sources one, two, and three. Only measurements from the central source are of interest. Due to the geometry of the sources and properties of the sensors, only sensor two can receive emissions from the central source that are significant. Unfortunately, sensor two’s measurement may be corrupted by emissions from the other two sources. The digital logic is constructed so that if there were significant corruption of sensor two’s measurements, then the final OR-gate returns unity, so the measurements can be ignored.

Genetic program-based data mining has proven effective for reverse engineering the complex digital logic underlying sensor devices (SDs) when the original design specifications for these devices are unavailable and invasive study of the systems is impossible.

The database that was subjected to data mining consisted of known input to the digital logic (DL), the associated measured output, and a set of rules provided by experts relating to their assumptions about the digital logic. It is found that having a set of expert rules in the database is essential; the measured output of the digital logic is rarely sufficient to uniquely reverse-engineer the design.

Experimental observation and theoretical analysis of the effects of uncertainty show that even when there is a significant reduction in the quality of input-output measurement information, the DL map evolved by the GP will still carry enough information for the design of signals with specific properties. The creation of these signals is considered of greater importance than having the exact DL design for the SD. The signals frequently relate to the determination of control rules for platform or multiplatform automation.

This work was done by James F. Smith III and ThanhVu H. Nguyen of the Naval Research Laboratory. NRL-0053