Educating UGVs

Implementing AI Advancements in Thermal Image Training Data Sets

Advancements in artificial intelligence (AI) are accelerating as the technology matures from a research orientation to deployment in a wide range of products and services, such as autonomous vehicles. Although the neural network concepts behind Convolutional Neural Networks (CNNs) date back to the 1950s, the technology remained an academic concept until the arrival of large training data sets and powerful Graphics Processing Units (GPUs), a processor architecture well suited to the heavy computational demands of neural network processing. Once scientists had low-cost, high-performance platforms, commercial use of the technology exploded. Military use is more challenging because large data sets are scarce, but that is changing as well, as data such as thermal imagery begins to be collected and put to use.

FLIR is developing attribute detectors that utilize special networks to distinguish things like clothing type, whether someone is carrying a weapon, and many other elements of interest.

CNNs, shown in Figure 1, are a form of machine learning that mimics the way the brain processes incoming senses such as sight. A collection of “neurons” is arranged in layers with connections between the layers; the term “deep” refers to the number of layers, or depth. The process essentially breaks an image down into edges, and as the information passes to deeper and deeper layers, higher-level elements emerge, such as the shapes of wheels, eyes, noses, mouths, and other features. These features, in combination, are then used to build object detectors.

Figure 1. CNNs break down an image into edges and as the information is transferred to deeper layers, elements are created like the shapes of wheels.
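To make the layered structure concrete, here is a minimal sketch in PyTorch (the framework is an assumption; the article does not name one) of a small CNN whose early convolutional layers respond to edge-like patterns while deeper layers combine them into higher-level features feeding a final classifier. The layer sizes and class count are illustrative only.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative 'deep' network: stacked convolutional layers feeding a classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Early layers tend to learn edge-like filters.
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # Deeper layers combine edges into part-like shapes (wheels, eyes, ...).
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # The combined features are used to score object classes.
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Single-channel input, as a thermal image would be.
scores = TinyCNN()(torch.randn(1, 1, 512, 512))
print(scores.shape)  # torch.Size([1, 10])
```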

Leading research institutions such as Stanford University, along with companies like Google and Facebook, built large image training libraries, often from their users' data, to create object detectors that could identify people, pets, and other familiar objects. Many companies have since extended deep learning to vision, speech recognition, and other data-centric applications. While developers have access to the underlying technology from academia and large technology companies, object detectors for military applications are not open source and must be created. The process, illustrated in Figure 2, is as follows:

Figure 2. Development of training datasets or image libraries is required for algorithm development.
  1. Collect Images
  2. Annotate Images (a minimal annotation sketch follows this list)
  3. Train Network
  4. Test and Optimize
  5. Release Algorithm
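Step 2, annotating images, is commonly captured in a machine-readable format such as the MS COCO JSON convention mentioned below. The record here is a hedged illustration only; the file name, categories, and box coordinates are invented, and FLIR's internal format may differ.

```python
import json

# Hypothetical annotation for one thermal frame, in MS COCO style:
# each object of interest gets a bounding box and a category label.
annotation = {
    "images": [
        {"id": 1, "file_name": "thermal_000001.png", "width": 640, "height": 512}
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 2, "name": "vehicle"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels.
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [212, 98, 34, 87]},
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [401, 260, 120, 64]},
    ],
}

with open("thermal_annotations.json", "w") as f:
    json.dump(annotation, f, indent=2)
```

Each bounding box ties a region of the frame to an object class, which is exactly what the training step consumes.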

Most deep learning developers use open-source training sets such as ImageNet and MS COCO because, between them, they offer well over a million images and more than 1,000 annotated object classes. However, these training sets have neither the relevant images nor the object classes needed to create AI capabilities for military use cases. FLIR has focused on creating training data sets containing thermal as well as visible images and on applying deep learning techniques to solve challenging defense problems such as drone detection, intrusion detection, vehicle tracking, and many others.

Figure 3. Example CNN hardware

Thermal imagery offers many desirable characteristics. It is completely passive, produces good image quality in all ambient lighting conditions, is effective at seeing through obscurants, and is extremely effective at detecting people and vehicles at very long standoff ranges. The challenge has been how best to generate the hundreds of thousands of images needed to train networks into accurate object detectors for thermal imagery. The logistics of going into the field to gather thermal image data of every object of interest, in every environment of interest, and from multiple perspectives are simply too time-consuming and costly. In addition, some objects and scenarios are very hard to reproduce in the real world. FLIR uses its scale, market penetration, and role as a technology enabler to work with partners to create imagery for even the most difficult scenarios, making it possible to generate thousands of images of nearly any object from nearly any vantage point and to train networks to detect targets of interest. This has led to datasets that include varied weather, changing light conditions, and even unusual system noise, resulting in superior system performance.
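As a rough illustration of how weather, lighting, and noise variation might be injected into a training set in software, the sketch below applies simple gain, offset, and Gaussian-noise perturbations to a single thermal frame with NumPy. The function and its parameter ranges are assumptions for illustration, not FLIR's actual data-generation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_thermal(frame: np.ndarray) -> np.ndarray:
    """Apply illustrative perturbations to a single-channel thermal frame (uint8)."""
    img = frame.astype(np.float32)
    # Global gain/offset changes roughly mimic different ambient conditions.
    img = img * rng.uniform(0.8, 1.2) + rng.uniform(-10, 10)
    # Additive Gaussian noise stands in for sensor/system noise.
    img += rng.normal(0.0, 5.0, size=img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)

frame = rng.integers(0, 256, size=(512, 640), dtype=np.uint8)  # stand-in frame
augmented = augment_thermal(frame)
```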

It is worth noting that these networks process one image at a time, so incoming video is handled frame by frame. The required frame rate therefore dictates the processing power needed to achieve the desired system response.
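A hedged sketch of what frame-by-frame processing looks like in practice: OpenCV pulls frames from a video source, a placeholder detect() function stands in for the trained detector, and the achieved frame rate falls directly out of the per-frame processing time. The file name and the detect() stub are assumptions for illustration.

```python
import time
import cv2  # OpenCV for video capture

def detect(frame):
    """Placeholder for a trained object detector; returns a list of detections."""
    return []

cap = cv2.VideoCapture("thermal_clip.mp4")  # hypothetical recorded video
frames, start = 0, time.time()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    detections = detect(frame)   # each frame is processed independently
    frames += 1
cap.release()
elapsed = time.time() - start
print(f"processed {frames} frames at {frames / max(elapsed, 1e-6):.1f} fps")
```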

CNNs process images at fixed input resolutions; popular networks such as Inception and SSD variants accept inputs on the order of 300 x 300 or 512 x 512 pixels. Does that mean much of the resolution coming from today's high-resolution sensors is thrown away? FLIR takes a different approach to maximizing pixels on target: the image is first run through a CNN called a single-shot detector (SSD) that finds objects of interest, extracts each object, and passes it to an object classifier that processes 512 x 512 inputs. The benefits of this approach are significant. Speed is maintained, and keeping as many pixels as possible on each target improves system accuracy. This is especially important in surveillance applications that aim to identify objects at the greatest possible standoff distance.
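The following sketch illustrates the pixels-on-target idea under stated assumptions: a detector runs on a downscaled copy of the frame, and each detected region is cropped from the original full-resolution frame before being resized to the classifier's fixed 512 x 512 input. The detect_regions() and classify() stubs are placeholders, not FLIR's networks.

```python
import cv2
import numpy as np

DETECTOR_SIZE = 300     # fixed input of a single-shot detector (SSD-style)
CLASSIFIER_SIZE = 512   # fixed input of the downstream classifier

def detect_regions(small_frame):
    """Placeholder detector: returns boxes as (x, y, w, h) fractions of the image."""
    return [(0.40, 0.30, 0.10, 0.20)]

def classify(chip):
    """Placeholder classifier operating on a 512 x 512 chip."""
    return "person"

frame = np.zeros((2048, 2560), dtype=np.uint8)            # full-resolution sensor frame
small = cv2.resize(frame, (DETECTOR_SIZE, DETECTOR_SIZE))  # detector sees a downscaled copy

for fx, fy, fw, fh in detect_regions(small):
    h, w = frame.shape[:2]
    x, y = int(fx * w), int(fy * h)
    bw, bh = int(fw * w), int(fh * h)
    chip = frame[y:y + bh, x:x + bw]                       # crop from the ORIGINAL resolution
    chip = cv2.resize(chip, (CLASSIFIER_SIZE, CLASSIFIER_SIZE))
    print(classify(chip))
```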

There are many difficulties in applying vision systems to military missions. Where an operator is viewing a scene directly or watching a remote monitor, fatigue, or simply failing to switch these systems on, can lead to potentially dangerous gaps in situational awareness. Computers never tire and cameras are very reliable, so this combination can offer both operator relief and increased awareness.

One major area of concern is the threat that weaponized drones present to military planners. Commercial drones are fast and very small, making them difficult to detect and track. Accurate detection at ranges that allow countermeasures is a capability both military and public-safety stakeholders are working to develop. FLIR is developing a drone detection capability in which radar detects small UAVs at long range and automatically cues long-range visible and thermal cameras onto the target. The images are processed through FLIR's drone object detector for classification, and positive detections are relayed to the system for the appropriate mission response. Once the target is classified, it can be tracked at update rates much higher than most radar systems can deliver.
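Below is a sketch of that cueing flow; every interface in it (RadarTrack, slew_to, grab_frame, classify_drone) is a hypothetical placeholder, not an actual FLIR or radar API.

```python
from dataclasses import dataclass

@dataclass
class RadarTrack:
    azimuth_deg: float
    elevation_deg: float
    range_m: float

def slew_to(az: float, el: float) -> None:
    """Placeholder: command the pan/tilt head toward the cued angles."""
    print(f"slewing camera to az={az:.1f}, el={el:.1f}")

def grab_frame():
    """Placeholder: return the current thermal frame from the camera."""
    return None

def classify_drone(frame) -> bool:
    """Placeholder: run the drone object detector and report a positive detection."""
    return True

def handle_cue(track: RadarTrack) -> None:
    # Radar provides the long-range cue; the camera confirms and then tracks.
    slew_to(track.azimuth_deg, track.elevation_deg)
    frame = grab_frame()
    if classify_drone(frame):
        print(f"confirmed drone at {track.range_m:.0f} m; handing off to camera tracker")

handle_cue(RadarTrack(azimuth_deg=42.0, elevation_deg=8.5, range_m=1500.0))
```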

Thermal imagery is extremely effective at detecting people and vehicles at very long standoff ranges.

Expanding beyond object detection, FLIR is developing attribute detectors, which use specialized networks to detect attributes such as clothing type, carried bags, weapons, and many other elements of interest. There are also networks for “fine grain” classifiers that can distinguish subtle object variations, including the make and model of a car or a person's gender or race.
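One common way to build attribute detectors of this kind is to attach several independent classification heads to a shared feature extractor (multi-label classification). The PyTorch sketch below is a generic illustration under that assumption; the attribute names and feature size are invented and do not describe FLIR's networks.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Shared features feeding independent attribute classifiers (multi-label)."""
    def __init__(self, feature_dim=256, attributes=("long_coat", "carried_bag", "weapon")):
        super().__init__()
        self.heads = nn.ModuleDict({name: nn.Linear(feature_dim, 1) for name in attributes})

    def forward(self, features):
        # One independent probability per attribute.
        return {name: torch.sigmoid(head(features)).squeeze(-1)
                for name, head in self.heads.items()}

features = torch.randn(1, 256)    # stand-in for features extracted from a person chip
print(AttributeHead()(features))  # e.g. {'long_coat': ..., 'carried_bag': ..., 'weapon': ...}
```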

Once the training set is created, the images are processed to train the network. Processing is typically performed on a dedicated computer with powerful GPU cards, such as those from NVIDIA, and can take several days for a full training data set. After processing is complete, the object detection algorithm is ready for run-time testing. Video, live or recorded, is run through the computer, the performance of the object detector is evaluated, and a probability of detection is calculated. To meet real-world requirements, false alarms must be reduced to a small fraction of detections, which requires fine-tuning of the training set. Optimization is accomplished by adding more images to the training data set and applying techniques such as hard negative mining alongside standard back-propagation. In hard negative mining, the network is shown images of things it is told are not the objects it is being trained to detect. An example is teaching the network that a fire hydrant, which has an aspect ratio similar to a person's, is not a person by adding images of fire hydrants labeled as such.
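A minimal sketch of that hard negative mining step: the current detector is run over frames known to contain no targets, and anything it reports is, by definition, a false alarm that can be folded back into the training set as a labeled negative. The run_detector() stub and the data handling are assumptions for illustration.

```python
def run_detector(frame):
    """Placeholder: return detections produced by the current model on one frame."""
    return []

def mine_hard_negatives(negative_frames):
    """Collect false alarms from frames known to contain no targets of interest."""
    hard_negatives = []
    for frame in negative_frames:
        for detection in run_detector(frame):
            # Anything detected here is by definition a false alarm
            # (e.g. a fire hydrant scored as a person); keep it as a negative example.
            hard_negatives.append((frame, detection))
    return hard_negatives

negatives = mine_hard_negatives(negative_frames=[])
print(f"adding {len(negatives)} hard negatives to the training set")
```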

In only a few years, deep learning has established itself as a powerful technology, and research into ever more capable networks is advancing at a furious pace. The next major advancements will come in hardware. GPUs originally developed for gaming have become the processors of choice for running CNNs; however, industry is racing to develop purpose-built CNN processors. Within the next two to three years, we will see processors ten to thirty times more efficient, enabling more CNN computing at the edge. Autonomous military platforms operating at the edge will be critical to bringing AI to the field and applying its full power for the benefit of the warfighter.

This article was written by Arthur Stout, Director of Business Development, FLIR Industrial Business Unit (Wilsonville, OR).