Developing a Multi-Modal UGV Robot Control Interface

Unmanned ground vehicles (UGVs) are currently being used or developed for operational maneuvers (e.g., reconnaissance and IED defeat), maneuver support (e.g., route clearance), and sustainment (e.g., convoy and resupply) missions. Despite the demonstrable benefits of UGVs, significant challenges remain to integrating them effectively into military operations.

Clearpath Husky (left) and Segway RMP (right) platforms equipped with MINOTAUR components.

Currently fielded UGVs require active remote control or teleoperation, even for mundane tasks such as accompanying a soldier or vehicle during maneuvers. While UGVs have proven invaluable in removing warfighters from harm's way, most notably in explosive ordnance disposal missions, the current reliance on teleoperation strongly restricts the number of platforms that can be deployed and the range of missions that can be effectively supported.

Developing an Interface

Charles River Analytics and 5D Robotics, Inc. are developing a Multimodal Interface for Natural Operator Teaming with Autonomous Robots (MINOTAUR) human-machine interface (HMI). The technology is designed to enable a UGV to become a true support agent as part of a human-robot team. The work is particularly focused on developing the autonomy and controls necessary for a UGV to accompany a human operator through challenging outdoor environments much as a fellow squad member would: responding to speech and gesture commands, providing verbal and non-verbal feedback, maintaining formation, and avoiding obstacles along the way. In the near term, the technology is aimed at enabling a UGV to act as a robotic mule that could offload much of the 60-100 lb of equipment and supplies carried by each member of a squad.

An autonomous system needs to be understood and trusted by its users to be successfully integrated into regular operations. If an operator must constantly check in on a UGV, whether via the operator control unit (OCU) or by looking over their shoulder, they may decide to revert to manual control, thereby losing the benefits imparted by autonomy. A primary goal of the program is to develop a simple but highly robust semi-autonomous system that can establish a user's trust in autonomy, and act as a foundation for future manned/unmanned teaming technologies that will incorporate advanced autonomy and reasoning.

Human pose estimation and recognition of a “follow” gesture.

The MINOTAUR system consists of an autonomy payload installed on the host UGV and a wearable operator control unit. The system fuses complementary sensing modalities, including cameras, LIDAR, and ultra-wide band (UWB) beacons, to track a human operator, recognize hand gestures, and detect and avoid obstacles across a range of environments and lighting conditions. The wearable OCU (currently any Android device) provides the operator with at-a-glance UGV status, captures speech commands, and allows the operator to assume manual control if needed. The system implements an open architecture using the robot operating system (ROS), and is currently integrated on board Clearpath Husky and Segway RMP UGV platforms.
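The article does not describe the internal software structure of the payload, but a ROS-based system of this kind is typically organized as nodes exchanging sensor and command topics. The sketch below is a minimal illustration only; the node layout and topic names are assumptions, not the actual MINOTAUR architecture.

#!/usr/bin/env python
# Illustrative ROS node skeleton: subscribes to the kinds of sensor streams
# described above and publishes velocity commands. Topic names are assumed.
import rospy
from sensor_msgs.msg import Image, PointCloud2
from geometry_msgs.msg import Twist, PointStamped

class FollowerNodeSketch(object):
    def __init__(self):
        rospy.init_node('ugv_follower_sketch')
        # Sensor inputs (hypothetical topic names)
        rospy.Subscriber('/camera/image_raw', Image, self.on_image)
        rospy.Subscriber('/velodyne_points', PointCloud2, self.on_lidar)
        rospy.Subscriber('/uwb/operator_position', PointStamped, self.on_uwb)
        # Velocity commands to the UGV base
        self.cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)

    def on_image(self, msg):
        pass  # pedestrian detection and pose estimation would run here

    def on_lidar(self, msg):
        pass  # obstacle detection and range measurement

    def on_uwb(self, msg):
        pass  # beacon-based operator position update

    def spin(self):
        rospy.spin()

if __name__ == '__main__':
    FollowerNodeSketch().spin()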

Person Tracking

Person tracking is achieved by fusing data produced by three complementary modules: a camera-based tracker, a LIDAR-based tracker, and a UWB beacon-based tracker. The camera-based tracker uses a machine learning-based pedestrian detection algorithm coupled with an appearance- and kinematics-based tracker to differentiate between the robot operator and nearby individuals. Accurately tracking humans in video is a long-standing research problem in computer vision that has seen tremendous advances in recent years, leading to software implementations capable of tracking individual pedestrians at camera frame rates.
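The article does not specify which detector or tracker the camera module uses. As an illustration of the general approach only, the sketch below runs OpenCV's stock HOG pedestrian detector and keeps the detection nearest the previously tracked position, a crude stand-in for the appearance- and kinematics-based tracker described above.

# Minimal sketch of camera-based pedestrian detection with nearest-neighbor
# association. Illustrative only; not the MINOTAUR detector or tracker.
import cv2
import numpy as np

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def track_operator(frame, last_center):
    """Detect pedestrians and return the box closest to the last known center."""
    boxes, _ = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    if len(boxes) == 0:
        return None
    centers = np.array([(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in boxes])
    if last_center is None:
        return boxes[0]
    dists = np.linalg.norm(centers - np.asarray(last_center), axis=1)
    return boxes[int(np.argmin(dists))]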

A complementary PulsON UWB tracking component, developed by 5D Robotics, supports operations in degraded or non-line-of-sight environments, enabling the UGV to track the operator through dust, fog, dazzle, lens fouling, and vegetation via a small UWB beacon carried by the operator. The combination of optical and UWB tracking provides a more robust tracking solution than a single-mode approach. The system also incorporates LIDAR (interchangeably a Velodyne VLP-16 PUCK or Hokuyo UTM-30LX), which provides accurate range data to nearby pedestrians and surfaces, and enables safe navigation through cluttered environments.
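The fusion algorithm itself is not described in the article. One common way to combine position estimates of differing reliability is inverse-variance weighting, sketched below; the noise figures and the assumption of independent sources are illustrative only.

# Sketch of inverse-variance fusion of operator position estimates from the
# camera, LIDAR, and UWB trackers. Variances are illustrative assumptions.
import numpy as np

def fuse_estimates(estimates):
    """estimates: list of (position_xy, variance) tuples; returns fused (x, y).

    Sources with no current measurement (e.g., camera dropout in dust) are
    simply omitted from the list, and the remaining sources carry the fix.
    """
    weights = np.array([1.0 / var for _, var in estimates])
    positions = np.array([pos for pos, _ in estimates])
    return (weights[:, None] * positions).sum(axis=0) / weights.sum()

# Example: camera is noisier at range, UWB degrades less in poor visibility.
fused = fuse_estimates([
    (np.array([4.2, 1.1]), 0.25),   # camera-based tracker
    (np.array([4.0, 1.0]), 0.04),   # LIDAR-based tracker
    (np.array([4.3, 0.9]), 0.10),   # UWB beacon tracker
])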

User opportunistically directing the robot across multiple modalities. In this example, the user verbally commands the UGV to go to a vehicle and uses a pointing gesture to provide additional context on where the specified vehicle is located.

Beyond providing an estimate of the operator's position relative to the UGV, the data produced by the camera-based tracker additionally enables estimation of the operator's 3D pose, which is used to recognize gesture commands including “follow me,” “stop,” “go <that way>,” “back off,” and “come closer.” The recovered pose data further enables recognition of important actions or events, such as the operator going prone, which may be interpreted as an implicit “stop” command.
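The article does not detail how gestures are classified from the recovered pose. A fielded system would likely use a learned classifier over pose sequences; the rule-based sketch below only illustrates the idea of mapping keypoints to gesture labels, and the keypoint names and pixel thresholds are arbitrary assumptions.

# Illustrative rule-based gesture classification over 2D pose keypoints.
# Keypoints are assumed to be a dict of name -> (x, y) in image coordinates,
# with y increasing downward; thresholds are arbitrary illustrative values.

def classify_gesture(kp):
    """Return a coarse gesture label from a single pose estimate, or None."""
    left_up = kp['left_wrist'][1] < kp['nose'][1]
    right_up = kp['right_wrist'][1] < kp['nose'][1]

    # "Stop": both wrists raised above the head.
    if left_up and right_up:
        return 'stop'
    # "Follow me": exactly one arm raised overhead.
    if left_up != right_up:
        return 'follow_me'
    # "Go <that way>": arm extended roughly horizontally to one side.
    for side in ('left', 'right'):
        wrist, shoulder = kp[side + '_wrist'], kp[side + '_shoulder']
        if abs(wrist[1] - shoulder[1]) < 20 and abs(wrist[0] - shoulder[0]) > 60:
            return 'go_' + side
    return None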

In addition to gestures, the platform may be controlled via speech commands using a microphone linked to the wearable OCU. The system incorporates natural language processing (NLP) technology to process multi-modal speech and gesture inputs, enabling the recognition of commands such as “drive 20 meters in <that> direction,” or “follow <him/her>.” The large space of possible expressions that can convey the same command poses an ongoing challenge to designing effective language-based human-machine interfaces. Variations in phrasing, synonyms, idioms, abbreviations, and informal or ungrammatical language have historically forced users to learn specific, predefined commands to control a system, requiring extensive training and leading to poor performance in high-stress scenarios. Integration of an appropriate NLP technology enables robust processing of a range of complex expressions based on simple, rapidly-designed grammars. Automatic speech recognition, which has long performed too inconsistently to be viable in most real-world environments, has recently been shown to achieve human parity in conversational speech, paving the way for field-capable speech-based natural interfaces in the near future.
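The specific grammars and NLP toolkit are not identified in the article. The sketch below illustrates, in simplified form, what a small, rapidly designed command grammar with synonym alternation might look like; the command set and phrasings are assumptions, not the MINOTAUR grammar.

# Minimal sketch of grammar-based command parsing with synonym alternation.
import re

GRAMMAR = [
    (re.compile(r'\b(follow|come with|stay with)\b.*\b(me|him|her)\b'), 'FOLLOW'),
    (re.compile(r'\b(stop|halt|hold up|freeze)\b'), 'STOP'),
    (re.compile(r'\b(go|drive|move)\b.*?(\d+)\s*(meters|m)\b'), 'DRIVE_DISTANCE'),
    (re.compile(r'\b(back off|back up|give me space)\b'), 'BACK_OFF'),
    (re.compile(r'\b(come closer|close up)\b'), 'COME_CLOSER'),
]

def parse_command(utterance):
    """Map a free-form utterance to (command, slots); None if unrecognized."""
    text = utterance.lower()
    for pattern, command in GRAMMAR:
        match = pattern.search(text)
        if match:
            slots = {}
            if command == 'DRIVE_DISTANCE':
                slots['distance_m'] = float(match.group(2))
            return command, slots
    return None

# e.g. parse_command("drive 20 meters in that direction")
#      -> ('DRIVE_DISTANCE', {'distance_m': 20.0})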

In addition to processing operator inputs, the NLP component can be applied in reverse, enabling the UGV to provide feedback in a manner that is easily understood by the operator, playing a crucial role in maintaining the operator's awareness of the UGV's current status and plans. Status and acknowledgments are provided verbally or as text outputs via the OCU display or linked speaker/earpiece.
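How the feedback text is produced is not described in the article; a template-based generator is one straightforward possibility, sketched below with assumed status events and message wording.

# Illustrative template-based status feedback for the OCU display or
# text-to-speech. Events and templates are assumptions, not MINOTAUR's.
TEMPLATES = {
    'following': "Following you at {standoff:.0f} meters.",
    'obstacle': "Obstacle ahead, rerouting around it.",
    'lost_track': "I lost track of you; please stop or raise your hand.",
    'ack': "Understood: {command}.",
}

def render_feedback(event, **fields):
    """Fill in a status template for the given event."""
    return TEMPLATES[event].format(**fields)

# e.g. render_feedback('ack', command='follow me') -> "Understood: follow me."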

Designing a wearable OCU capable of providing an operator with sufficient state information and controls for a UGV poses multiple challenges due to limited screen real estate. The Android-based OCU provides a centralized location for the multi-modal interactions with the vehicle. To minimize the amount of “head down” time, the watch-based OCU enables quick control inputs through lightweight interactions as well as at-a-glance information status summaries. This enables operators to quickly understand and modify UGV behavior while maintaining focus on the mission at hand. Furthermore, this approach enables operators to flexibly and opportunistically choose operationally appropriate input modalities and to provide redundant commands across modalities (e.g., a “stop” command simultaneously issued verbally and with a gesture), which promotes robustness in challenging environments and improves command accuracy. This approach also enables operators to leverage the strengths of each modality to provide additional information on base commands, such as giving a verbal command to go to a particular location while providing directional input with a pointing gesture.
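The article describes both redundant commands (the same command issued in two modalities) and complementary ones (a verbal command refined by a pointing gesture), but not how they are combined. The sketch below shows one simple possibility: merging speech and gesture inputs that arrive within a short time window. The field names, window length, and priority rule are assumptions for illustration.

# Sketch of combining speech and gesture inputs observed close together in
# time. Names, window length, and the merge rules are illustrative.
FUSION_WINDOW_S = 2.0

def fuse_inputs(speech, gesture):
    """speech/gesture: dicts with 'command', 'time', optional 'direction'; either may be None."""
    if speech and gesture and abs(speech['time'] - gesture['time']) <= FUSION_WINDOW_S:
        if speech['command'] == gesture['command']:
            # Redundant command (e.g., verbal and gestured "stop"): boost confidence.
            return dict(speech, confidence='high')
        if speech['command'] == 'GO_TO' and 'direction' in gesture:
            # Complementary inputs: verbal intent plus pointing direction.
            return dict(speech, direction=gesture['direction'])
    # Fall back to whichever single modality produced a command.
    return speech or gesture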

Recent advances in machine learning, computer vision, speech recognition, natural language processing, and wearable technologies are paving the way for the development of natural human-robot interfaces that require minimal training and perform robustly in high-stress environments. The development of the first reliable, easy-to-use and easily-understood autonomous ground systems that can demonstrably earn the trust of human operators will mark an important turning point in the adoption and deployment of UGVs as true support agents within human-robot teams.

This article was written by Camille Monnier, Principal Scientist, Charles River Analytics Inc. (Cambridge, MA).