The focus of our work is to achieve a high level of interaction between a real-time vision system capable of tracking moving objects in 3-D and a robot arm equipped with a dexterous hand that can be used to intercept, grasp and pick up a moving object. We are interested in exploring the interplay of hand-eye coordination for dynamic grasping tasks such as grasping parts on a moving conveyor, assembling articulated parts, or grasping from a mobile robotic system. Coordination between an organism's sensing modalities and motor control system is a hallmark of intelligent behavior, and we are pursuing the goal of building an integrated sensing and actuation system that can operate in dynamic as opposed to static environments.
There has been much research in robotics over the last few years that addresses either visual tracking of moving objects or generalized grasping problems. However, there have been few efforts that try to link the two problems. It is quite clear that complex robotic tasks such as automated assembly will need to have integrated systems that use visual feedback to plan, execute and monitor grasping.
The system we have built addresses three distinct problems in robotic hand-eye coordination for grasping moving objects: fast computation of 3-D motion parameters from vision, predictive control of a moving robotic arm to track a moving object, and grasp planning. The system is able to operate at approximately human arm movement rates, using visual feedback to track, stably grasp, and pick up a moving object. The algorithms we have developed that relate sensing to actuation are quite general and applicable to a variety of complex robotic tasks that require visual feedback for arm and hand control.
Our work also addresses a fundamental and limiting problem inherent in building integrated sensing/actuation systems: the integration of subsystems with different sampling and processing rates. Most complex robotic systems are actually amalgams of different processing devices, connected by a variety of methods. For example, our system consists of three separate computational systems: a parallel image processing computer, a host computer that filters, triangulates and predicts 3-D position from the raw vision data, and a separate arm control computer that performs inverse kinematic transformations and joint-level servoing. Each of these systems has its own sampling rate, noise characteristics, and processing delays, which need to be integrated to achieve smooth and stable real-time performance. In our case, this involves overcoming visual processing noise and delays with a predictive filter based upon a probabilistic analysis of the system noise characteristics. In addition, real-time arm control needs to be able to operate at fast servo rates regardless of whether new predictions of object position are available.
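To make the rate-bridging idea concrete, the following is a minimal sketch (in Python) of a fixed-gain predictor of the kind described above: the state is propagated at the servo rate and corrected only when a new, possibly delayed, vision measurement arrives. The gains, rates and names used here are illustrative assumptions, not the values or code of our system.
\begin{verbatim}
# Sketch: fixed-gain (alpha-beta) predictor bridging a slow vision
# stream and a fast servo loop.  Gains and rates are hypothetical.
ALPHA, BETA = 0.85, 0.005     # fixed correction gains (illustrative)
SERVO_DT    = 1.0 / 300.0     # servo update period (illustrative)

class AlphaBetaPredictor:
    def __init__(self, x0=0.0, v0=0.0):
        self.x, self.v = x0, v0      # position and velocity estimates

    def predict(self, dt=SERVO_DT):
        # Called every servo cycle, whether or not a new vision
        # measurement is available.
        self.x += self.v * dt
        return self.x

    def correct(self, z, dt):
        # Called only when a new vision measurement z arrives,
        # dt seconds after the previous one.
        r = z - self.x               # innovation
        self.x += ALPHA * r
        self.v += (BETA / dt) * r
\end{verbatim}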
The system consists of two fixed cameras that can image a scene containing a moving object (Figure~\ref{system}). A PUMA-560 with a parallel jaw gripper attached is used to track and pick up the object as it moves (Figure~\ref{experimentalhardware}). The system operates as follows:
\begin{enumerate}
\item The imaging system performs a stereoscopic optic-flow calculation at each pixel in the image. From these optic-flow fields, a motion energy profile is obtained that forms the basis for a triangulation that can recover the 3-D position of a moving object at video rates.
\item The 3-D position of the moving object computed in step 1 is first smoothed to remove sensor noise; a non-linear filter then recovers the correct trajectory parameters used for forward prediction, and the updated position is sent to the trajectory-planner/arm-control system.
\item The trajectory planner updates the joint-level servos of the arm via kinematic transform equations. An additional fixed-gain filter is used to provide servo-level control in case of missed or delayed communication from the vision and filtering system.
\item Once tracking is stable, the system commands the arm to intercept the moving object and the hand is used to grasp the object stably and pick it up.
\end{enumerate}
\begin{figure}
\psfig{figure=/u/rhythmics/yoshimi/WRITE/papers/track2/pics/exp-setup.ps,height=3in}
\caption{Experimental setup}
\label{system}
\end{figure}
\begin{figure}
\psfig{figure=/u/rhythmics/yoshimi/WRITE/papers/track2/pics/train6.ps,height=4in}
\caption{Experimental hardware}
\label{experimentalhardware}
\end{figure}
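The triangulation in step 1 can be illustrated with a generic two-camera sketch. The version below applies a standard linear (DLT) triangulation to the motion-energy centroids from the two cameras; the projection matrices, function names and numpy-based formulation are assumptions for illustration, not the PIPE-based computation used in our system.
\begin{verbatim}
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    # Linear (DLT) triangulation of one 3-D point from the pixel
    # coordinates uv1, uv2 (motion-energy centroids in each camera)
    # and the 3x4 projection matrices P1, P2.  Generic sketch only.
    u1, v1 = uv1
    u2, v2 = uv2
    A = np.vstack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    # Homogeneous least-squares solution: last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # inhomogeneous 3-D position
\end{verbatim}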
Because we are using a parallel jaw gripper, the jaws must remain aligned with the tangent to the actual trajectory of the moving object. This tangential direction is computed directly from the calculation of the bending parameter $\phi$ during the trajectory modeling phase and is used to rotate joint 6 of the robot so that the gripper remains correctly aligned. This alignment allows grasping to occur at any point along the trajectory.
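As a rough sketch of this alignment step, the tangent direction can be approximated by differencing successive filtered positions in the plane of motion and converting the result to a joint-6 angle. In our system the tangent comes from the bending parameter $\phi$ of the trajectory model, so the finite-difference approximation and the mounting offset below are illustrative assumptions only.
\begin{verbatim}
import math

def joint6_angle_from_tangent(p_prev, p_curr, mount_offset=0.0):
    # Approximate the trajectory tangent from two successive filtered
    # (x, y) positions; the actual system derives it from the bending
    # parameter phi of the trajectory model.
    dx = p_curr[0] - p_prev[0]
    dy = p_curr[1] - p_prev[1]
    tangent = math.atan2(dy, dx)     # direction of travel in the plane
    # mount_offset accounts for how the gripper is mounted on joint 6
    # (a hypothetical parameter in this sketch).
    return tangent + mount_offset
\end{verbatim}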
Figure \ref{graspseq} shows three frames taken from a videotape of the system intercepting, grasping and picking up the object. The system is quite repeatable, and is able to track other arbitrary trajectories in addition to the one shown.
\begin{figure}
\psfig{figure=/u/seful/timcenko/Proj/tracking/mma/dsNew.ps,width=6in}
\caption{Input signal $s_1$ (black) and filtered signal $s_0$ (gray)}
\label{dsfig}
\end{figure}
\begin{figure}
\psfig{figure=pics/xy.ps,height=2in}
\caption{Input trajectory (black) and filtered trajectory (gray)}
\label{xy}
\end{figure}
\begin{figure}
\centerline{\psfig{figure=pics/seq3.ps,height=2.75in}}
\centerline{\psfig{figure=pics/seq4.ps,height=2.75in}}
\centerline{\psfig{figure=pics/seq5.ps,height=2.75in}}
\caption{Intercepting, grasping and picking up the object}
\label{graspseq}
\end{figure}
\vspace{.5in}
The system is robust in a number of ways. The vision system does not require special lighting, object structure or reflectance properties to compute motion, since it is based upon calculating optic-flow fields. The control system copes with the inherent visual sensor noise and triangulation error by using a probabilistic noise model and a local parameterization, from which a non-linear filter is built to extract accurate control parameters. The arm control system copes with the inherent bandwidth mismatch between the vision sampling rate and the servo-update rate by using a fixed-gain predictive filter that allows arm control to function in the occasional absence of a video control signal. Finally, the system is robust enough to repeatedly and stably grasp and pick up a moving object.
We are currently extending this system to other hand-eye coordination tasks. One extension we are pursuing is to implement other grasping strategies. One such strategy is to visually monitor the interception of the object by the hand and use this visual information to update the {\bf Drive} transform at video update rates. This approach is computationally more demanding, since it requires the ability to track multiple moving objects. The vision-based tracking described above can track only a single object; if we visually servo the moving robotic arm toward the moving object, we introduce multiple moving objects into the scene.
We have identified two possible approaches to tracking these multiple objects visually. The first is to use the PIPE's region-of-interest operator, which can effectively ``window'' the visual field and compute a different motion energy in each window concurrently. Each region can be assigned to a different stage of the PIPE and compute its result independently. This approach assumes that the moving objects can be segmented. This is possible since the motion of the hand in 3-D is known: we have commanded it ourselves. Therefore, given the camera parameters and the 3-D position of the hand, we can find the image-space coordinates that correspond to the hand's position. Once these are known, we can form a window centered on this position in the PIPE, and concurrently compute the motion energy of the moving object and the moving hand in each camera. Each of these motion centroids can then be triangulated to find the positions of both the hand and the object and to compute the new {\bf Drive} transform. Both computations must, however, compete for the hardware histogramming capability needed for centroid computation, and this will effectively reduce the bandwidth of position updating by a factor of two.
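A rough outline of this windowed scheme is sketched below, reusing the generic projection and triangulation formulation from the earlier sketch. The window size, the motion-energy weighting, and the use of numpy are illustrative assumptions; the code stands in for the PIPE's region-of-interest and histogramming hardware rather than reproducing it.
\begin{verbatim}
import numpy as np

def project(P, X):
    # Project a 3-D point X through a 3x4 projection matrix P.
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def window_centroid(motion_energy, center, half=16):
    # Motion-energy centroid inside a square window around `center`
    # (pixel coordinates); a software stand-in for the PIPE's
    # region-of-interest / histogramming hardware.
    cx, cy = int(center[0]), int(center[1])
    patch = motion_energy[cy - half:cy + half, cx - half:cx + half]
    ys, xs = np.nonzero(patch)
    if len(xs) == 0:
        return np.array([cx, cy], dtype=float)   # no motion detected
    w = patch[ys, xs].astype(float)
    u = cx - half + np.average(xs, weights=w)
    v = cy - half + np.average(ys, weights=w)
    return np.array([u, v])

# Per frame: one window is centered on the projection of the commanded
# hand position (known), the other on the previous object estimate;
# each centroid pair is then triangulated as in the earlier sketch to
# update both 3-D positions and the Drive transform.
\end{verbatim}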
Another approach is a coarse-fine hierarchical control scheme that uses multiple sensors. As we approach the object for grasping, we can shift the visual attention from the static cameras used in 3-D triangulation to a single camera mounted on the wrist of the robotic hand. Once we have determined that the moving object is in the field of view of this camera, we can use its optic-flow estimates of motion to keep the object to be grasped in the center of the wrist camera's field of view. This control information will be used to compute the {\bf Drive} transform that correctly moves the hand to intercept the object. We have implemented such a tracking system on a different robotic system \cite{alle89z} and can adapt this method to the present task.
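A minimal sketch of the fine stage of this coarse-fine scheme is given below, assuming a simple proportional law that maps the pixel error of the object's optic-flow centroid to a small Cartesian correction of the hand. The gain and the pixel-to-Cartesian mapping are placeholders, since the real mapping depends on the wrist camera's intrinsics and mounting.
\begin{verbatim}
def wrist_centering_correction(centroid_uv, image_size, gain=0.05):
    # Proportional correction that nudges the hand so the object's
    # optic-flow centroid stays at the center of the wrist camera's
    # image; the result would feed the Drive transform update.
    u, v = centroid_uv
    w, h = image_size
    err_u = u - w / 2.0
    err_v = v - h / 2.0
    # gain and the pixel-to-Cartesian mapping are illustrative only.
    return (-gain * err_u, -gain * err_v)
\end{verbatim}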