An object-recognition algorithm analyzes data from two-dimensional (2D) images to locate and identify possibly complexly shaped three-dimensional (3D) objects in possibly highly cluttered scenes depicted in the images. More specifically, the algorithm implements a relatively simple, effective, and fast process for recognizing 2D objects that may be partly occluded and that have shapes that can be modeled by use of sets of line segments (see figure). Because the algorithm tolerates a fair amount of perspective distortion, it is also applicable to 3D objects represented by sets of viewpoint- dependent 2D models.

Books in a Test Image were modeled by use of line segments. Then in a process involving correspondences between model lines and lines detected in the image, the modeled books in the image were recognized.

Most object-recognition processes, including the one implemented in this algorithm, involve (1) finding correspondences between features in images and features in models of the objects to be recognized, then (2) computing hypothesized model poses, then (3) searching for additional image features that support these poses. The most challenging part of this process is the identification of corresponding features when the images are affected by clutter, partial occlusion of objects, changes in illumination, and changes in perspective. Indeed, once the feature-correspondence problem is solved, the objectrecognition task becomes almost trivial.

Features that have been employed in prior object-recognition algorithms include points, edges, and textured regions. Each of these types of features offers different advantages and disadvantages in different applications. Line segments were chosen as the features to use in this algorithm because, relative to other features, they are better suited for recognition of partially occluded objects amid clutter.

The algorithm incorporates the assumption that at least one line of a model of an object that one seeks to recognize is detected as an unfragmented line in the image. As used here, “unfragmented” signifies that the affected image line is extracted from the image as a single continuous segment between the two end points of the projected model line. To be unfragmented, the line must also be unoccluded. Additional model lines must be present in the image for verification, but these may be partially occluded or fragmented.

The object-recognition process of this algorithm comprises three stages, following an approach that enables rapid focusing of computational resources on the highest-payoff hypothesized poses, thereby enabling a large reduction in the amount of computation time. In the first stage, approximate model poses are hypothesized and listed. Every pairing of a model line to an image line first contributes a pose hypothesis consisting of a similarity transformation. When a pair of non-parallel model lines and corresponding pairs of image lines form corner-like structures and the angles of the corners are sufficiently similar (within 45°), a pose hypothesis consisting of an affine transformation is added to the list of hypotheses. Typically, each model-to-image line correspondence contributes between one and six hypotheses to the list.

The algorithm uses information inherent in a single line correspondence (position, orientation, and scale) to reduce the number of correspondences that must be examined in order to find an approximately correct pose. For m model lines and n image lines, the number of poses hypothesized is O(mn) [where “O(x)” signifies “of the order of x”]. This number stands in contrast to the generally much larger numbers [O(m3n3) or O(n3)] generated in some prior algorithms. In this algorithm, starting with an approximate pose instead of a precise pose makes it possible to greatly reduce the number of poses that must be examined to ultimately find a correct precise pose.

Most of the hypothesized poses are expected to be inaccurate because most of the feature correspondences used to generate them are incorrect. Accordingly, in the second stage of the process, each hypothesized pose is ranked on the basis of similarity of the corresponding local neighborhoods of lines in the model and image. The similarity measure is largely unaffected by image clutter, partial occlusion, and fragmentation of lines. The ranking of the hypothesized poses is invariant under image translation, scaling, and rotation, and is partially invariant under affine distortion of the image. The combination of (1) the generation of hypothesized poses from image lines that are assumed to be unfragmented with (2) the neighborhood similarity measure makes it possible to quickly generate a ranked list of approximate model poses that is likely to include a number of highly ranked poses that are close to the correct model pose.

In the third stage of the process, a robust pose-estimation subalgorithm effects a subprocess of refinement and verification, starting from the few approximate poses ranked most highly in the second stage. The subalgorithm used in this stage is a modified version of a previously developed algorithm that was selected because it is efficient, is tolerant of clutter and occlusion, and does not make binary correspondence decisions until an optimal pose is found.

This work was done by Philip David of the Army Research Laboratory and Daniel DeMenthon of the University of Maryland. For more information, download the Technical Support Package (free white paper) at  under the Information Sciences category. ARL-0009

This Brief includes a Technical Support Package (TSP).
Shape-Based Recognition of 3D Objects in 2D Images

(reference ARL-0009) is currently available for download from the TSP library.

Don't have an account? Sign up here.