Our desk plane was a piece of white poster board on a desk, and the object to be scanned was placed on the board. Off to the left and raised a few feet was our light source, a small desk lamp with a circular bulb; it was the closest approximation to a point source we could get. A good quality grey scale camera was mounted on a tripod looking down at the object, with the light source slightly higher than the camera. Our camera (a Sony XC-77) was of higher quality than necessary to implement the shadow scanner; we could have made do with a QuickCam. The camera was connected to a low cost WinTV video capture card (worth about $50), which was mounted in a Pentium II 400 PC running Linux with the V4L drivers installed. The V4L (Video for Linux) package, in addition to providing drivers for the WinTV card, came with a useful program called streamer, which allowed us to take grey scale snapshots at a rate of over 5 frames per second at a resolution of 640x480 pixels.
First, we took a number of images of a checkerboard pattern on the desk plane (each square side was 23 mm). These images were used with the calibration software, explained below, to get an equation for the desk plane.
Next we took at least two images of a pencil (length 76 mm) at different locations on the desk plane. These images were to be passed to the light calibration program to extract the coordinates of the light source.
Finally, we used the streamer program to capture approximately 60 images at a rate of 5 frames per second as one of our group members waved the stick across the object being scanned. Once we captured these images, we then processed them separately (detailed in the following sections).
This setup also allowed us to take two different scans of the same object with different light sources. It is useful to take scans with different light sources since each light source casts a different set of shadows over the object. Since we do not get any data from portions of the object obscured in shadow, the more scans we take with different light sources the more blanks we can fill in. The trick is to take two scans with different light sources but with the camera and object stationary. This way the origin of the system, and the location of the object never change and therefore the point clouds are already registered with each other and can be directly merged.
To take our two-light-source scan, we first set the platform up normally and took our initial light source and camera calibration images. Then we took a series of images of the stick passing over the object; that was the end of the first scan. Next we moved only the light source to a different location and took another series of images of the stick passing over the object. Finally, we removed the object but left the light source in its second position, and took the second set of light calibration images AFTER the second scan. This gives us enough information to compute the two point clouds separately. We can then simply overlay them, filling the holes in one scan with points from the other.
The main goal of ccalib (the camera calibration program) is to automatically find the corner features within the calibration image. After these pixel locations are found they are then associated with some 3D point and given to the Tsai camera calibration routines for the actual camera calibration calculations.
The steps for automatically finding corner features are as follows:
The Edge Detection step is a simple neighborhood averaging, with the pixel values at the edges of the image set to zero. This step also constructs a histogram of the edge image so that a threshold value can be computed for the Hough Transform. This histogram has a known structure. Reading it from left to right (dark pixel values to bright pixel values), it starts with high values for the dark pixels, slopes down very quickly to a local minimum, rises again as the pixel intensity increases until a local maximum is reached, and finally drops to zero. The threshold value that the Hough Transform step is interested in is at the local maximum that appears after the local minimum. This value is found by exploiting the known structure of the histogram and searching from both the left and the right.
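The histogram walk can be sketched as follows. This is illustrative Python, not the actual ccalib code; the function name is ours, and for brevity it walks from the left only rather than from both sides:

```python
import numpy as np

def edge_histogram_threshold(edge_image, bins=256):
    """Walk the edge-image histogram to find the Hough threshold.

    The histogram is assumed to start high at the dark end, fall to a
    local minimum, rise to a local maximum, then drop to zero; the
    threshold is the intensity at that post-minimum peak.  Simplified
    sketch: walks from the left only.
    """
    hist, _ = np.histogram(edge_image, bins=bins, range=(0, bins))
    i = 0
    while i < bins - 1 and hist[i] > hist[i + 1]:
        i += 1                     # descend to the local minimum
    while i < bins - 1 and hist[i] < hist[i + 1]:
        i += 1                     # climb to the local maximum
    return i                       # intensity of the post-minimum peak
```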
The edge image found in the Edge Detection step, along with the threshold value, is then sent to the Hough Transform step. This is the basic Hough Transform algorithm; in fact the code is from Naveed's Computer Vision homework. After the first part of the Hough Transform is computed, an image in the Hough space is created. The bright spots in the Hough image represent lines that exist in the edge image.
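The basic accumulation step can be sketched as below. This is a generic textbook Hough transform, not the actual homework code; the parameter names are ours:

```python
import numpy as np

def hough_lines(edge, threshold, n_theta=180):
    """Basic Hough transform: every edge pixel above `threshold` votes
    for all (rho, theta) lines passing through it.  Bright cells in
    the returned accumulator correspond to lines in the edge image."""
    rows, cols = edge.shape
    diag = int(np.ceil(np.hypot(rows, cols)))        # max possible |rho|
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    ys, xs = np.nonzero(edge > threshold)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1    # one vote per theta
    return acc
```

A vertical line of edge pixels produces one bright accumulator cell at theta = 0 with rho equal to the line's column.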
After the Hough image is computed and labeled, it is given to the step where these regions are transformed back into lines in image space. The center of mass of each labeled region in the Hough image is computed and the end points of the corresponding lines are found. These endpoints are then checked to make sure that they describe real lines; if they do, they are added to the list of lines found. After all of the end points are found, they are checked again to ensure that they are "nice" lines for the rest of the processing; nice lines are ones that run left-to-right or top-to-bottom.
Now that we have a list of lines within the image, we need to find possible corners. The first step is to split the list of lines into horizontal and vertical lines. Then we sort the lines so that the horizontal lines go from top to bottom and the vertical lines go from left to right. After this, the intersections of all of the horizontal and vertical lines are found. These intersections are the possible corners of the calibration pattern.
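The intersection step reduces to intersecting pairs of infinite 2D lines given by their endpoints. A sketch (hypothetical helper, not the ccalib source):

```python
def line_intersection(p1, p2, p3, p4):
    """Intersection of the infinite 2D lines through p1-p2 and p3-p4,
    via Cramer's rule; returns None for (near-)parallel lines.
    Applied to every horizontal/vertical pair, this yields the
    candidate corners of the calibration pattern."""
    (x1, y1), (x2, y2) = p1, p2
    (x3, y3), (x4, y4) = p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-12:
        return None
    a = x1 * y2 - y1 * x2          # cross term of line 1
    b = x3 * y4 - y3 * x4          # cross term of line 2
    px = (a * (x3 - x4) - (x1 - x2) * b) / denom
    py = (a * (y3 - y4) - (y1 - y2) * b) / denom
    return (px, py)
```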
See the Analysis section for a discussion of the nature of these possible corner points and their usage.
find_light (the light calibration program) takes as arguments the Tsai parameters file output by ccalib, the measured height of the calibration pencil in millimeters and a set of two or more light calibration images. The whole pencil shadow (tip and base) must be visible in these images.
For any pair of light calibration input images, this program exactly follows the algorithm described by Jean-Yves Bouguet and Pietro Perona in their paper 3D Photography on your desk. A brief description is as follows: from the camera's extrinsic parameters and the user's clicks on the tip and base of the pencil shadow in image space, we can calculate the actual 3D coordinates of both points by separately intersecting their viewing rays with the desktop plane. With the known height of the pencil and the pencil base point, the 3D location of the physical pencil tip follows trivially. Now, we know that the light must lie on the ray through the tip of the pencil and the tip of its shadow. With two or more such candidate rays, we can find their intersection - this is the location of the light.
When there are more than two such input images, we use the mean of the light locations calculated from each unique pair. This creates N * (N - 1) / 2 pairs and corresponding light locations, where N is the number of input images. The averaging smooths the error associated with human input and also compensates for the fact that the light is not a perfect point source.
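The two geometric steps, intersecting candidate rays and averaging over all unique pairs, can be sketched as follows. This is illustrative Python, not the find_light source; since two measured rays rarely meet exactly, the "intersection" here is the midpoint of their closest approach:

```python
import numpy as np
from itertools import combinations

def closest_approach_midpoint(a0, a1, b0, b1):
    """Midpoint of the shortest segment between the infinite lines
    through a0->a1 and b0->b1 (standard closest-point formula)."""
    a0, b0 = np.asarray(a0, float), np.asarray(b0, float)
    u = np.asarray(a1, float) - a0
    v = np.asarray(b1, float) - b0
    w = a0 - b0
    a, b, c = u @ u, u @ v, v @ v
    d, e = u @ w, v @ w
    denom = a * c - b * b              # zero only for parallel lines
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    return (a0 + s * u + b0 + t * v) / 2

def light_from_rays(rays):
    """Average the pairwise 'intersections' of all candidate rays;
    each ray is a (pencil_tip, shadow_tip) pair of 3D points."""
    pts = [closest_approach_midpoint(*ra, *rb)
           for ra, rb in combinations(rays, 2)]
    return np.mean(pts, axis=0)
```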
This program requires the X11 windowing system to function; the GUI is written using this library.
When an image is presented to the user, they should carefully click first on the top point of the pencil shadow (the tip) and second on the bottom point of the pencil / shadow (slightly ambiguous, but a consistent relative position choice should work well). The interface then proceeds to show the next input image, and so on until there are none left. The same input is expected of the user for each image.
The output of the program is the 3D coordinate of the light. This can be redirected to a file if later use is required (e.g. in the desk_scan program).
The thresholding is divided into two phases:
Computing the shadow plane for each image is accomplished by finding the three points that define the shadow plane. First we have the 3D coordinate of the light source. This has already been computed during the light calibration phase and is actually a command line argument to the program. The other two points are computed from the "reference rows." Near the top and the bottom of the image we pick a pair of rows that never intersect the object that we are scanning. Therefore those rows pass through the projection of the shadow directly on the desk plane. For each image, we find the exact column that the shadow edge (we are tracking the left edge of the shadow) crosses each of these two rows. Since we know that these two points are on the desk plane we can directly apply the transformation given to us by the camera calibration to recover their 3D coordinates. These 3 points are computed for each image (the light coordinate always stays the same, so we are really only computing two points per image).
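Recovering the shadow plane from its three defining points is a small amount of linear algebra. A sketch (hypothetical helper, not the scanner source):

```python
import numpy as np

def plane_from_points(p1, p2, p3):
    """Plane ax + by + cz = d through three 3D points: here p1 would
    be the light source and p2, p3 the reference-row points recovered
    on the desk plane."""
    p1, p2, p3 = (np.asarray(p, float) for p in (p1, p2, p3))
    n = np.cross(p2 - p1, p3 - p1)   # normal vector (a, b, c)
    return n, n @ p1                 # coefficients and d
```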
We use a thresholding technique on the intensities of each pixel along that reference row to determine the exact column where the left edge of the shadow intersects with the reference row. We need to keep track of a number of different intensities (these are grey scale images, so the intensities range from 0, black, to 255, white):

Imax - the maximum intensity observed at a pixel
Imin - the minimum intensity observed at a pixel
Icontrast - the difference Imax - Imin
Ishadow - the shadow threshold, which (following Bouguet and Perona) is the mean of Imax and Imin
Once we have computed all of these values across the reference row, we then go back and scan across the row a second time. We look for the pixel that crosses the Ishadow threshold for the first time. (Pixels at the beginning of the image should be above Ishadow, and we are looking for the first pixel that drops below this threshold.) We then do a simple linear interpolation between the pixel immediately before the threshold crossing and the pixel immediately after it to find the exact column at which the shadow edge crossed the reference row. Since we are interpolating, we get a column value at the sub-pixel level.
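The sub-pixel interpolation along a reference row can be sketched as follows (illustrative code, with names of our own choosing):

```python
def shadow_crossing_column(row, i_shadow):
    """Sub-pixel column where intensity first drops below i_shadow.

    `row` is the sequence of grey values along a reference row; pixels
    at the start are above the threshold, and we linearly interpolate
    between the last pixel above it and the first pixel below it."""
    for c in range(1, len(row)):
        if row[c] < i_shadow <= row[c - 1]:
            frac = (row[c - 1] - i_shadow) / (row[c - 1] - row[c])
            return (c - 1) + frac
    return None  # the shadow edge never crossed this row
```

The identical interpolation, applied along the time axis of a single pixel instead of along a row, yields the sub-frame shadow crossing times used later.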
Computing the time that the shadow edge crosses each pixel is done in a manner very similar to how we compute the reference points. However, in this case we look at a given pixel's value over time, holding the row and column fixed. In the previous case we held the row and time fixed and varied the column.
We again compute the intensities: Imax, Imin, Icontrast, and Ishadow. We get these values for every pixel. If a particular pixel never has an Icontrast value greater than the threshold set, we can say that the pixel has either never been crossed by the shadow, or was always in shadow. In most cases this occurs in pixels that are always in the shadow of the object that we are scanning. No data is computed for pixels that are always in shadow.
After making a first pass over the images and getting the intensity values, we make a second pass and find the first frame in which the pixel has crossed the Ishadow threshold. We then interpolate between that frame and the frame immediately before it to find the exact time (frame number) at which the shadow edge crossed that pixel. We get this value at the sub-frame level.
See the analysis section for a more detailed description of the performance of this method and the problems encountered.
From the camera calibration, we have the normal of the desk plane and the perpendicular distance between the desk plane and the optical center of the camera (the origin of our system). From this we compute the equation of the desk plane in the format: ax + by + cz = d.
To find the 3D coordinate of a pixel in the plane, we look up the shadow crossing time for that pixel (computed above for all pixels). We then go to our list of reference points that we computed for each image and find the reference points for the exact time frame that the shadow edge crossed this pixel. Since the time is at the sub-frame level we do another linear interpolation of the reference points, which were only computed for each whole frame.
These reference points are in 2D image coordinates (row, column). By shooting a ray through each reference point on the image and intersecting that ray with the desk plane, we get the 3D coordinates of the two reference points for the time frame of the pixel that we are trying to compute. These two reference points, along with the light position (computed from the light calibration), define another plane, the shadow plane. We intersect the ray shot through the pixel that we are trying to compute with the shadow plane, and this gives us the 3D coordinate of that pixel. We repeat this process for all pixels in the image to get the point cloud.
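The ray-plane intersections used here (first against the desk plane for the reference points, then against the shadow plane for the scanned pixel) share one primitive. A sketch, assuming rays shot from the camera's optical center at the system origin:

```python
import numpy as np

def ray_plane_intersection(direction, n, d, origin=(0.0, 0.0, 0.0)):
    """Intersect the ray origin + t * direction with the plane
    n . x = d.  The origin defaults to the camera's optical center,
    which is the origin of our coordinate system."""
    origin = np.asarray(origin, float)
    direction = np.asarray(direction, float)
    n = np.asarray(n, float)
    t = (d - n @ origin) / (n @ direction)  # solve n.(o + t*dir) = d
    return origin + t * direction
```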
A sample camera calibration input image and output parameters. The camera parameters output file is in the exact format that is output by the Tsai code and it specifies the intrinsic and extrinsic parameters of the camera.
Coming up with a suitable camera and light setup was one of the biggest challenges of the project, and even now it could still use some improvement. It took us hours to get images of the object with visible shadows on it. In the end we were reduced to scanning in complete darkness (with the exception of the one light source) and to scanning only white objects.
Some analysis of our camera calibration scheme follows, via a discussion of the "Refine the intersections" step listed in the ccalib technical discussion.
The possible corners found are very noisy and need to be refined, which is the next step in the processing. Since we know that these points are close to the true corner, we just want to move them so that they are closer to it. Following Paul's suggestion, we first try to move the point up and down to find the place where neighboring pixels switch colors between black and white. Then the point is moved left and right, again looking for a change in color. If the point was successfully moved to a "better" location it is kept; otherwise it is thrown away. After all of the points have been moved, they are refined further to remove points that are close to each other, so that only one point remains at each corner.
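The up-down / left-right refinement might be sketched like this. The window size and binarization threshold here are illustrative guesses, not the values ccalib uses:

```python
import numpy as np

def refine_corner(img, r, c, search=5, thresh=128):
    """Nudge a candidate corner to the nearby black/white transition.

    Scans vertically, then horizontally, within a small window for the
    place where neighbouring pixels switch colors; returns None (the
    point is thrown away) when no transition is found."""
    rows, cols = img.shape
    black = lambda v: v < thresh
    for rr in range(max(r - search, 0), min(r + search, rows - 1)):
        if black(img[rr, c]) != black(img[rr + 1, c]):
            break                         # vertical transition found
    else:
        return None                       # no vertical transition
    for cc in range(max(c - search, 0), min(c + search, cols - 1)):
        if black(img[rr, cc]) != black(img[rr, cc + 1]):
            return (rr, cc)               # snapped corner location
    return None                           # no horizontal transition
```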
These refined corner points are the points that get sent to the Tsai camera calibration code as the feature points within the calibration image. Since we sorted the lines and computed the intersections in a certain order, we know that in the "best" situation these points lie in a square pattern.
Situations that are bad for the program are:
A future improvement to this automatic camera calibration program would be to fall back to manual corner selection by the user when the automatic version fails. This is simple to implement, but requires user input in the form of clicks on the checkerboard image. It would eliminate the limitation of ccalib failing altogether when bad situations arise with the input images.
The light location is calculated using very simple geometric reasoning. The only real complexity encountered in implementing this is in the method of user input / automation. Automation is inherently difficult in this case, as there is no simple way to obtain the coordinates of the tip and base of the pencil shadow without some human intervention. This may be a future avenue of exploration.
Another major issue is choosing between two or more input calibration images. It turns out that the final point clouds are quite robust to the light location, as long as it is near the correct value; we saw this during testing. However, using the average of many candidate locations helps to smooth out clicking errors and bad input images, giving a better estimate of the light location.
A drawback built into this scheme is that the paper and system rely on the assumption that the light is a point source. This is unrealistic, and the authors have gone on to address the issue in later research. For future implementors / modifiers of this scanning system, this would also be a very good item to look into.
When you look at the final point clouds that we extracted from the data, you will notice that the clouds seem somewhat noisy. Our desk planes have a noticeable thickness to them and our objects have a lot of surface variation. For example, the mesh that we generated from the medicine bottle is very jagged.
We think that this variation is mainly due to flaws in the reference point extraction method.
The temporal analysis worked extremely well, and for the most part, we got very clearly defined shadow edges for each image. This process was good enough, and we don't think that it could account for the noise we saw in our final results.
The spatial analysis for finding the reference points, however, was much more troublesome. If you look at some sample images from our scans (included with the source distribution), you will notice that there is often a very large and faint penumbra at the shadow edge. In many of the frames, the shadow threshold was computed to fall in the middle of that large penumbra. Visually inspecting these reference points showed that they were often found in the middle of the penumbra and were not that close to the well-defined shadow edge. To make things worse, the penumbra apparently changed somewhat in size as the stick moved across the object, probably because the person holding the stick moved it up and down instead of just from left to right. The result was that the reference points did not always move smoothly across the reference line, and therefore the shadow planes were not particularly consistent from one image to the next. This moving shadow plane could easily account for the noise that we saw in our data.
Here is a sample set of images to test it with: