This article presents a method for variably de-focusing a stereoscopic image in 3D space around a desired focal point. The goal is to better simulate a single moment of human vision, which is based on a focal point rather than a focal plane as with conventional photography.
Comparison of de-focusing methods — 1: original, 2: 3D planar focus, 3: 3D point focus
Human Vision vs. Camera Vision
Photographs are actually a fairly poor representation of how a person sees the world with their natural senses. This is often overlooked, however, because photographs have become so commonplace and integral that it's hard to even imagine an alternative. Similarly, the conventional 24 frames-per-second of movies looks "right", though not because it's actually realistic, but because we're used to it. I've previously discussed how chromatic (color) realism can be augmented in some photos using histogram diffusion. In this article, I'll turn my attention to focal realism.
In a single moment, a camera focuses on a focal plane, i.e. the set of all points a certain distance from the lens. In reality, the notion of a "focal plane" is a convenient approximation and "focal sphere" would be more accurate, but the distinction is not important here. This contrasts with human vision, which focuses only on a single focal point at a time. As a simple experiment, try fixing your eyes on a convenient focal point, and "looking" at other parts of the scene away from the focal point, without moving your eyes. You'll notice that your perception becomes progressively more "vague" the farther you look from the focal point. This effect happens simultaneously and seamlessly in all three dimensions: the greater the distance between focal point and observed point, whether up-down, left-right, or front-back, the more vague the impression becomes. In effect, we have permanent tunnel vision, and compensate with rapid eye movements, piecing a scene together from multiple glances. By analogy, you can think of a photograph as a superimposition of an infinite number of such "glanced" images.
Realizing this, I became interested in how to make photos more focally realistic.
Enter Stereoscopic Photography
At a high level, my idea is simple: for each pixel in a given image, blur it in proportion to its 3D distance from a chosen focal point. Since conventional photography captures only a 2D result, lacking the depth information this requires, I turned to stereoscopic photography, which produces pseudo-3D images.
Stereoscopic photography is based on "stereo pairs": two 2D images taken at the same time with a small offset between the lenses in position and sometimes angle, essentially the same arrangement at work in human eyes. A stereo pair is usually captured by synchronizing two identical cameras, by using a purpose-built stereo camera, or less commonly by using mirrors to split a single lens into two sub-lenses. In the absence of special equipment, a crude method involves moving slightly left-right between taking two photos with a single ordinary camera. For the purpose of this article, I'll sidestep the hardware aspect and demonstrate with existing stereo pairs.
Depth Mapping
Stereoscopic depth mapping is the process of matching every pixel in the left image to its displaced position in the right image, and vice versa, to determine its distance from the lens. This provides a third spatial dimension to the image, conventionally called Z. While simple in theory, the practice is complex for several reasons:
- Occlusions, or regions seen by one lens but not the other
- Lens axial misalignment
- Differences in incident light between lenses
- Inexactness of pixel matching
This technique also has subject limitations where pixels are not readily matched 1:1 between left/right images, such as surfaces that are featureless (clear blue sky), noisy (sand), reflective, or clear.
Rather than re-invent the wheel, I decided to use existing depth mapping software: Epipolar Rectification 9b (ER9b) and Depth Map Automatic Generator 5 (DMAG5) courtesy of developer Ugo Capeto. Both are implementations of academic papers, dated 2008 and 2012. I wrote wrappers to use both within MATLAB.
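As a rough illustration, a wrapper can be as thin as writing out a parameter file and shelling out to the executable with MATLAB's system command. The parameter-file layout, argument order, and function name below are placeholders rather than the actual ER9b/DMAG5 interface, which is documented with the software itself.

```matlab
function runDepthMapper(exePath, leftImage, rightImage, minDisp, maxDisp)
% Sketch of a thin MATLAB wrapper around an external depth-mapping
% executable. The parameter-file layout and command-line interface shown
% here are hypothetical placeholders, not the real ER9b/DMAG5 formats.

    % Write a parameter file for the executable to read on startup
    fid = fopen('depth_params.txt', 'w');
    fprintf(fid, '%s\n%s\n', leftImage, rightImage);  % input stereo pair
    fprintf(fid, '%d\n%d\n', minDisp, maxDisp);       % disparity range from rectification
    fclose(fid);

    % Shell out and confirm the executable exited cleanly
    status = system(sprintf('"%s" depth_params.txt', exePath));
    if status ~= 0
        error('Depth mapper returned a non-zero exit code.');
    end
end
```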
ER9b performs sparse, high-confidence pixel matching to transform both images so that all matches occur along horizontal lines, thereby eliminating angular misalignment between lenses. In other words, the images are rectified to simulate idealized coplanarity of the lens axes. This is a prerequisite for effective depth mapping, as lens axial coplanarity is a key assumption in efficient depth mapping algorithms.
Animation showing sparse high-confidence pixel matches; minimal rectification is necessary
DMAG5 takes as input a rectified stereo pair. It performs dense pixel matching, finds occlusions, and attempts to produce a physically realistic depth map for each image. Whereas ER9b requires no input parameters beyond the images themselves, DMAG5 requires roughly a dozen different parameters. These are fairly esoteric, and in practice most can be left at their default values without affecting the result significantly. What is critical, however, is that the minimum and maximum disparity values (an output of ER9b) match the result of the rectification step, so that the depth map is correctly scaled. In practice, I've found that iteratively doubling/halving each parameter, keeping the helpful adjustments and reverting the unhelpful ones, is a decent if inelegant optimization strategy - a sort of poor man's gradient ascent. This black-box approach is improved by using the built-in downsampling functionality to quickly evaluate parameter sets at low resolution before investing the computational time for the full resolution.
Depth map for left image
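As a sketch, the doubling/halving search might look like the following, where runDepthMap and scoreDepthMap are stand-in function handles for generating a (downsampled) depth map and scoring its quality; neither name comes from DMAG5 itself.

```matlab
function params = tuneParams(params, runDepthMap, scoreDepthMap)
% Poor man's coordinate ascent: double or halve each parameter in turn,
% keep the change if the (low-resolution) depth map score improves,
% otherwise discard it. runDepthMap and scoreDepthMap are caller-supplied
% function handles with illustrative names.

    names = fieldnames(params);
    bestScore = scoreDepthMap(runDepthMap(params));

    for i = 1:numel(names)
        for factor = [2, 0.5]                 % try doubling, then halving
            trial = params;
            trial.(names{i}) = trial.(names{i}) * factor;
            score = scoreDepthMap(runDepthMap(trial));
            if score > bestScore              % keep the helpful adjustment
                params = trial;
                bestScore = score;
                break;                        % unhelpful trials are simply discarded
            end
        end
    end
end
```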
Dither Upsampling
At this stage, a shortcoming of stereoscopic depth mapping emerges: its low resolution. Typical disparity ranges are between 20 and 70 pixels, roughly analogous to 5- or 6-bit color, which compares unfavorably to standard 8-bit color. As a result, stereoscopic depth maps often show a "stairstep" effect where smooth transitions artificially show as jagged, similar in appearance to lines of constant elevation on topographical maps.
The disparity range could be increased in hardware by widening the spacing between the lenses, but only at the cost of enlarging the occluded regions, so in practice this doesn't seem to be feasible. Instead, I found Floyd-Steinberg dither upsampling to be an effective strategy to counteract the low-resolution artifacts, albeit at the cost of adding noise. However, this step was slow and turned out not to be necessary for my purposes, so I discarded it.
Comparison of depth map before/after dither upsampling
Note that the dithered depth map appears noisier but smoother, and per the histograms more fully utilizes the available range of values in 8-bit color space.
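For reference, the error-diffusion step itself is compact. The sketch below assumes a floating-point depth map already rescaled to the 0-255 range (for example by smoothing the stairstepped map) and quantizes it to 8 bits using the classic Floyd-Steinberg weights.

```matlab
function out = fsDither(depth)
% Floyd-Steinberg error diffusion: quantize a floating-point depth map
% (values in [0, 255]) to integers while pushing each pixel's rounding
% error onto its unvisited neighbors, trading banding for fine noise.

    depth = double(depth);
    [rows, cols] = size(depth);

    for r = 1:rows
        for c = 1:cols
            old = depth(r, c);
            new = round(old);
            depth(r, c) = new;
            err = old - new;

            % Distribute the error with the standard 7/16, 3/16, 5/16, 1/16 weights
            if c < cols,             depth(r,   c+1) = depth(r,   c+1) + err * 7/16; end
            if r < rows && c > 1,    depth(r+1, c-1) = depth(r+1, c-1) + err * 3/16; end
            if r < rows,             depth(r+1, c  ) = depth(r+1, c  ) + err * 5/16; end
            if r < rows && c < cols, depth(r+1, c+1) = depth(r+1, c+1) + err * 1/16; end
        end
    end

    out = uint8(min(max(depth, 0), 255));
end
```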
Blurring
The last step is blurring the image. Several inputs are required here:
- Focal point - where the eyes are focused
- Depth-to-planar spatial ratio - how shallow or deep the image is
- Minimum and maximum blur radius - how pronounced the blurring effect is
With these inputs defined, I measure the 3D distance from each pixel to the focal point, and map this linearly to the blur radius range. This results in a "radius map" which specifies the blur radius for each pixel. The stairsteps visible in the radius map result from my simplifying decision to support only whole-number blur radii, which obviates the need for sub-pixel calculations.
Blur radius map with eyes focused on foreground facial features
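To make the mapping concrete, here is a minimal sketch of the radius-map calculation under these definitions. The variable names are illustrative: depth is the depth map, focalRC the focal point's pixel coordinates, and depthRatio the depth-to-planar spatial ratio described above.

```matlab
function radiusMap = blurRadiusMap(depth, focalRC, depthRatio, rMin, rMax)
% Map each pixel's 3D distance from the focal point to a whole-number
% blur radius. depth is the depth map (same size as the image), focalRC
% is the [row, col] of the focal point, depthRatio converts depth values
% into the same units as pixel coordinates. Names are illustrative.

    depth = double(depth);
    [rows, cols] = size(depth);
    [C, R] = meshgrid(1:cols, 1:rows);

    % 3D Euclidean distance from every pixel to the focal point
    dz = depthRatio * (depth - depth(focalRC(1), focalRC(2)));
    dist = sqrt((R - focalRC(1)).^2 + (C - focalRC(2)).^2 + dz.^2);

    % Linear map from [0, max distance] to [rMin, rMax], rounded to
    % whole-number radii so no sub-pixel kernels are needed
    radiusMap = round(rMin + (rMax - rMin) * dist / max(dist(:)));
end
```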
I initially defaulted to a Gaussian blur, but this didn't produce good results as the stairstep effect from the depth and radius maps persisted in the resulting image.
Instead, I used a median blur. Rather than setting each pixel's value as a Gaussian-weighted average of its neighbors, the median blur sets each pixel's value as the unweighted median of its radial neighborhood. The median blur preserves edges, which is extremely beneficial in hiding the stairstep effect and smoothly transitioning between varying blur radii. Subjectively, the image also feels more similar to one seen out of the corner of one's eye, whereas a Gaussian blur feels more camera-like. I used a circular neighborhood as it produced the best results, though square and diamond neighborhoods also exist.
Unlike typical fixed-size median filters, mine requires the radius to vary as a function of pixel position; instead of one kernel, there are many. For efficiency, I pre-calculate all the necessary kernels before starting the blur calculation and recall them as needed. This is about 3x faster than generating a kernel from scratch at each pixel.
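A condensed sketch along these lines captures the core of the approach: the circular neighborhood offsets are built once per distinct radius and then looked up per pixel. The names are illustrative and the full implementation differs in details.

```matlab
function out = variableMedianBlur(img, radiusMap)
% Median-blur a grayscale image with a per-pixel radius. Circular
% neighborhood offsets are precomputed once per distinct radius and
% reused, rather than rebuilt at every pixel. Illustrative sketch.

    img = double(img);
    radiusMap = double(radiusMap);
    [rows, cols] = size(img);
    out = img;

    % Precompute offset lists for every radius that actually occurs
    radii = unique(radiusMap(:))';
    offsets = cell(max(radii) + 1, 1);
    for r = radii
        [dc, dr] = meshgrid(-r:r, -r:r);
        keep = (dr.^2 + dc.^2) <= r^2;        % circular kernel
        offsets{r + 1} = [dr(keep), dc(keep)];
    end

    for row = 1:rows
        for col = 1:cols
            r = radiusMap(row, col);
            if r == 0, continue; end          % radius 0: leave pixel unchanged
            nb = offsets{r + 1};
            rr = min(max(row + nb(:, 1), 1), rows);   % clamp at image borders
            cc = min(max(col + nb(:, 2), 1), cols);
            out(row, col) = median(img(sub2ind([rows, cols], rr, cc)));
        end
    end
end
```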
Here are some sample result images, showing how the focal point can be moved to simulate looking at different parts of the scene. Note how the blur radius transitions seamlessly both within and between focal planes!
Demo Gallery
Shown here are several additional images created using the process outlined above, to further demonstrate the effect. These represent "good" results; some images work better than others for a variety of reasons.
Depth Map Gallery
Shown here are the depth maps corresponding to the images above.
Acknowledgements
- Ugo Capeto develops and maintains excellent free depth-mapping software
- 3D Shoot provided most of these stereoscopic images under a non-commercial license
Source Code
Freely available on my GitHub.