An Optimized Software-Based Implementation of a Census-Based Stereo Matching Algorithm

Christian Zinner, Martin Humenberger, Kristian Ambrosch, and Wilfried Kubinger

Austrian Research Centers GmbH – ARC, Donau-City-Str. 1, 1220 Vienna, Austria
{christian.zinner,martin.humenberger,kristian.ambrosch,wilfried.kubinger}@arcs.ac.at

Abstract. This paper presents S³E, a software implementation of a high-quality dense stereo matching algorithm. The algorithm is based on a Census transform with a large mask size. The strength of the system lies in its flexibility in terms of image dimensions, disparity levels, and frame rates. The program runs on standard PC hardware utilizing various SSE instructions. We describe the performance optimization techniques that had a considerably high impact on the run-time performance. Compared to a generic version of the source code, a speedup factor of 112 could be achieved. On input images of 320×240 and a disparity range of 30, S³E achieves 42 fps on an Intel Core 2 Duo CPU running at 2 GHz.

Keywords: Stereo vision, Census transform, real-time, performance optimization, SSE, OpenMP.

1 Introduction

Stereo vision is a well-known sensing technology that is already used in several applications. It is becoming more and more interesting, e.g., in the area of domestic robotics, but also for industrial applications. Stereo matching means solving the correspondence problem between images from a pair of cameras mounted on a common baseline. With a known correspondence, world coordinates can be reconstructed for each pixel by a triangulation process. In this paper we present the Smart Systems Stereo Engine (S³E), a performance-optimized implementation of a stereo matching algorithm based on the Census transform. Stereo matching in general demands a high computational effort. Thus, practically all known software-based real-time stereo systems use matching approaches based on the sum of absolute differences (SAD) or the sum of squared differences (SSD) over a relatively small mask size of 3×3 up to 7×7 and/or a quite limited range of disparities of, e.g., 16. Census-based systems have shown significantly better matching results, especially when using larger mask sizes of up to 15×15 pixels. They are also more robust to real-world illumination conditions [1]. The drawback is that Census-based methods require operations that map poorly to the typical instruction sets of general-purpose CPUs. This is the reason why fast Census-based systems, such as [2], [3], [4] and [5], usually run on dedicated hardware such as FPGAs or ASICs. Therefore, we present a novel approach that realizes such a system in the form of a flexible PC software module able to run on mobile PC platforms at frame rates significantly above 10 fps. We achieve this without any graphics card hardware acceleration, but with intensive use of the Streaming SIMD Extensions (SSE) and the multi-core architectures of state-of-the-art PC CPUs.

The remainder of this paper is outlined as follows. Section 2 gives an overview of related stereo vision systems with a focus on Census-based systems. Section 3 describes the underlying algorithm and the properties of S³E as a software module. The various ideas and methods applied for optimizing the run-time performance of the program are the main part of Sect. 4. In Sect. 5 the impact of the optimizations is summarized and some comments about the future direction of our work are added.

The research leading to these results has received funding from the European Community's Sixth Framework Programme (FP6/2003-2006) under grant agreement # FP6-2006-IST-6-045350 (robots@home).

2 Related Work

The Census transform was introduced by Zabih and Woodfill [6] as a nonparametric local transform to be used as the basis for correlation. This method has been shown to have various advantages when compared to other common correlation methods. Its implementation requires only simple logic elements which can be parallelized, and thus it is well suited for realization on FPGAs and dedicated ASICs. Therefore, most known implementations are of this kind. An early system from 1997 [2] is based on the PARTS engine, which consists of 16 FPGAs. It processes QVGA stereo images with 24 disparities at 42 fps. Ten years later, the authors of [4] presented a low-cost FPGA system with a 13×13 Census mask that is able to process 20 disparities on QVGA images at more than 150 fps. They see their system as a kind of low-cost and high-performance descendant of the PARTS engine. The approach proposed in [3] consists of an ASIC that comprises both a Census-based matching module and an SSD unit and merges the matching results of both units. It processes 256×192 pixel images at more than 50 fps with 25 disparities. The Tyzx DeepSea G2 Vision System [5] is a full-featured embedded stereo system that contains a dedicated correlation ASIC, an FPGA, a signal processor, and a general-purpose CPU running Linux. It provides additional computing power for higher-level applications like tracking or navigation. Stereo matching performance reaches 200 fps on 512×480 pixel images and 52 disparities. A recent Census-based real-time implementation in software is the MESVS-II [7]. It runs on a DSP and focuses on miniaturization of the system. A remarkable frame rate of 20 fps is achieved with 30 disparities, although at a relatively small resolution (160×120) and with a small Census mask size (3×3).

3 Census-Based Stereo Matching

3.1 Algorithm Description

The workflow of the algorithm to be optimized is shown in Fig. 1.

[Figure 1: block diagram of the processing pipeline. The stereo camera delivers the image pair L, R, which is undistorted and rectified (Lrect, Rrect) using the offline camera calibration, Census-transformed with a 16×16 mask (Lcensus, Rcensus), and matched by DSI calculation (Hamming distance, DSI_LR and DSI_RL), cost aggregation, WTA with subpixel refinement (dm_sub,l, dm_sub,r), left/right consistency check, confidence calculation, and confidence thresholding (dm_thresh); the final 3D reconstruction delivers the Z-image and the 3D point cloud, alongside the disparity map and the confidence map.]

Fig. 1. Workflow of the implemented algorithm. L, R stand for left and right image and dm for disparity map.

In a first step, two digital cameras capture a synchronized image pair. The images are undistorted to correct the lens distortion and afterwards rectified to fulfill the constraint of horizontal epipolar lines. The corresponding stereo camera calibration is performed offline. Due to the rectification, the pixel rows of the input images are aligned parallel to the baseline, which is essential for efficient matching (registration). The first step of the registration is the Census transform of the left and right images. Each pixel is converted into a bit string (e.g., 11010110...) representing the intensity relations between the center pixel and its neighborhood. Equations 1 and 2 give a formal description of the transform as introduced by [6] for a mask size of 16×16.

\xi(p_1, p_2) = \begin{cases} 0 & p_1 \le p_2 \\ 1 & p_1 > p_2 \end{cases}   (1)

I_{Census}(x, y) = \bigotimes_{i=-8,\,j=-8}^{i=7,\,j=7} \xi\bigl(I(x, y),\, I(x+i, y+j)\bigr)   (2)

The operator \bigotimes denotes a concatenation. The Census transform converts the input stereo pair into images with 16×16 = 256 bits per pixel.¹

¹ Using an asymmetric 16×16 mask and including the comparison result from the center pixel itself into the bit string gives advantages for an efficient implementation.
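For illustration, a minimal non-optimized C sketch of this transform is given below. The function name, the image layout, and the storage of the 256-bit string as four 64-bit words are assumptions made for this example; border handling is omitted. The optimized version discussed in Sect. 4.5 replaces the scalar comparisons by 16-wide SSE comparisons.

#include <stdint.h>

/* Minimal, non-optimized sketch of the 16x16 Census transform of
   Equ. 1 and 2. Assumptions for this example only: 8-bit grayscale
   input stored row-major with a given stride; the 256-bit Census
   string is kept as four 64-bit words per pixel; border handling is
   omitted. */
static void census_16x16(const uint8_t *img, int stride,
                         int x, int y, uint64_t bits[4])
{
    const uint8_t center = img[y * stride + x];
    int bit = 0;
    bits[0] = bits[1] = bits[2] = bits[3] = 0;
    for (int j = -8; j <= 7; j++) {        /* asymmetric 16x16 mask */
        for (int i = -8; i <= 7; i++) {
            /* xi(center, neighbor): 1 if center > neighbor, else 0 */
            if (center > img[(y + j) * stride + (x + i)])
                bits[bit >> 6] |= 1ULL << (bit & 63);
            bit++;
        }
    }
}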


After that, the main part of the matching, the DSI calculation, follows. DSI stands for disparity space image and is a three-dimensional data structure which holds the costs of each pixel for each possible match. The DSI is computed from right to left and vice versa. The cost of a match between two Census-transformed pixels is their Hamming distance. It is assumed that neighboring pixels, except at disparity discontinuities, have a similar disparity level, so a cost aggregation makes the matches more unique. The aggregation mask size is variable and can be chosen according to the application. The best matches are selected by winner takes all (WTA) and refined with subpixel accuracy. To filter occluded and bad matches, a left/right consistency check is performed. Additionally, a confidence map is calculated from the WTA and left/right check results. The disparity values are thresholded according to the confidence map with an arbitrarily chosen threshold value, which results in the final disparity map. The last step is a 3D reconstruction, which delivers a Z-image and a 3D point in the world coordinate system for each pixel with a valid disparity value.

3.2 Multi-purpose Software Module

The stereo matching algorithm is encapsulated in the form of a C/C++ library. The module is intended to be a subsystem of machine vision systems and thus it will be used by developers or system integrators. In its current stage, S³E is able to deliver the following data as its output.

– A disparity map (sub-pixel accurate, 16-bit values, fixed-point format)
– A Z-image (16 bits per pixel, fixed-point format)
– A confidence map (a measure for the probability of a correct match; 8 bits)
– The input images after rectification (left + right)
– An image that contains the re-projected 3D points (point cloud)

To enable the adaptation of S³E to application-driven demands, we focused not only on the performance optimizations, but also on the flexibility of the system. Thus, the following parameters of the system can be adjusted.

– Input image size and format (8 bit, 16 bit, grayscale, RGB)
– Dimensions for resizing the input images in the preprocessing phase
– Aggregation mask size
– Disparity range (minimum disparity and number of disparities)
– Scaling exponent for the fixed-point representation of computed disparity values with subpixel refinement
– Confidence threshold value to adjust the filtering of unlikely matches
– Further parameters to control the behavior of the Census transform and pre- and postprocessing filtering operations

One of the advantages of a pure software implementation is the ability to provide such a level of flexibility, which makes it possible to run S³E with a large variety of sensor heads. It is our intention to use this software in several in-house projects and applications under a variety of different requirements.
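As a purely hypothetical illustration of what such a configuration interface could look like (the paper does not publish the actual S³E API; all names and types below are invented for this sketch):

#include <stdint.h>

/* Purely hypothetical parameter block mirroring the adjustable
   parameters listed above; the real S3E interface is not published
   here and certainly differs in names and types. */
typedef struct {
    int     input_width, input_height;    /* input image size                  */
    int     input_format;                 /* 8/16 bit, grayscale or RGB        */
    int     resize_width, resize_height;  /* preprocessing resize dimensions   */
    int     aggregation_mask;             /* e.g. 5 for a 5x5 aggregation mask */
    int     min_disparity;                /* start of the disparity range      */
    int     num_disparities;              /* size of the disparity range       */
    int     subpixel_scale_exp;           /* fixed-point scaling exponent      */
    uint8_t confidence_threshold;         /* filtering of unlikely matches     */
} S3EParams;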

4 Performance Optimization

Effective run-time performance optimization is a key requirement for a software-based vision system to be competitive in terms of its cost-performance ratio. The following sections will, after some general explanations, highlight the measures that had the most impact on obtaining a high-performing stereo matching implementation.

4.1 Target Platform

The main target platform during the performance optimization was an Intel Mobile Core 2 Duo processor (model T7200) running at 2 GHz [8]. This CPU model is commonly used in notebook PCs, but also in industrial and so-called “embedded” PCs. We intentionally avoided optimizing our software only for high-end PCs, with a view to a broader field of application in the industrial/mobile domain. The platform chosen supports several generations of Streaming SIMD Extensions (SSE, SSE2, SSE3), which were vital for achieving high performance. One important goal was to obtain an implementation of a stereo matching algorithm that is both capable of running on multiple platforms and delivers the best performance possible on the respective platform. We achieved this by extensively using special multi-platform high-performance imaging libraries. Currently, S³E can be compiled for MS Windows and for Linux on a PC platform with hardware-specific performance optimizations. We expect that an already scheduled migration of the software to the C6000 series of high-end Digital Signal Processors (DSPs) from Texas Instruments [9] will be possible with limited effort. S³E and the libraries used can also be built in a generic mode on any ANSI-C compliant platform, but then with far slower performance.

4.2 Development Environment

The S³E software has been implemented in C with the requirement to keep it portable among MS Windows, Linux and some DSP platforms. We used MSVC 8 on the Windows side, and under Linux the GCC 4.3 compiler was used. Both compilers can deal with basic OpenMP [10] directives, which proved to be sufficient for our requirements. A common guideline of our software design process is to consistently encapsulate all computation-intensive image processing functions into dedicated performance primitives libraries. The remaining part of the program is usually not performance critical and thus can be coded in a portable manner. We use the PfeLib [11] as the back-end for the performance-critical functions, which in turn uses parts of the Intel Performance Primitives for Image Processing (ippIP) [12] whenever possible. Many of the performance optimizations described in the following sections were actually done in functions of the PfeLib. The library has an open architecture for adding new functions, migrating them to other platforms, and optimizing them for each platform. It also provides components for verification and high-resolution performance timing on various platforms.

4.3 The Starting Point

Before optimizing, we define a reference implementation as the starting point for the upcoming work. At this stage the program provides all features of the final version, except that most of the image processing is performed by “generic” code that lacks any platform-specific optimizations. Table 1 shows the timing of the non-optimized program when processing one stereo frame. The profiling data is structured into several program stages. For every stage it provides information about the size of the images processed, how many CPU cycles per pixel (cpp) this process required, how often this image processing function was called, and how many milliseconds the whole task took. It is clear that a processing time of ∼8.5 seconds is unusable for real-time stereo vision, but a priority list for an optimization strategy can be directly deduced from this table.

Table 1. Performance and complexity figures of an initial non-optimized implementation with a 450×375 input image pair and a search range of 50 disparities

Function                            Image dimensions   Cycles per pixel   Function calls   Time (ms)
Hamming distance                    386 × 1            1136.3             36000            7894.94
Census transform                    435 × 360          2138.0             2                334.81
Cost aggregation                    19996 × 1          20.1               712              143.28
WTA minimum search                  382 × 50           11.5               712              78.5
Lens undist. & rectification        450 × 375          204.5              2                34.51
Aggregation: DSI lines shifting     386 × 50           0.7                2880             19.75
Disparity to 3D point cloud calc.   333 × 356          55.9               1                3.32
Disparity to Z-image calc.          450 × 375          27.8               1                2.34
Initialization stage                450 × 375          17.4               1                1.47
LR/RL consistency check             382 × 1            10.5               356              0.71
Confidence thresholding             450 × 375          7.9                1                0.66
Complete program run                450 × 375          100913.6           -                8514.59

4.4 Fast Hamming Distances

Matching costs are obtained by calculating the Hamming distance between two Census-transformed pixels. Simply put, the Hamming distance can be obtained by “XORing” the two binary strings and afterwards counting the number of set bits in the result. This task is the most expensive one of the whole stereo algorithm, because using a 16×16 Census mask results in quite long 256-bit strings. The required number of Hamming distance calculations is

n_{ham} \approx 2\,W H D ,   (3)

where W and H denote the width and height of the input image, and D is the number of disparities. The additional factor of two results from the need of doing an LR/RL consistency check.
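For the configuration of Table 1 (450×375 input images, 50 disparities), Equ. 3 amounts to roughly 2 · 450 · 375 · 50 ≈ 1.7 · 10⁷ Hamming distance calculations per frame.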


Another reason is that general-purpose CPUs hardly provide dedicated instructions for bit counting. This is also the case for SSE up to version 3. The POPCNT instruction for counting set bits will appear with SSE4, but it is not yet available on our target hardware. Starting from a quite cumbersome version that processed bit by bit in a loop and took over 1100 CPU clock cycles for a bit string of 256 bits (Table 1), some experiments using 64k lookup tables reduced the run time to 126 cycles. The final method of choice was inspired by the BitMagic library [13]. Table 2 exemplifies the procedure on 8-bit words. It is easy to extend the method for processing 128-bit wide registers with SSE instructions. Finally, we achieved a speed of 64 cycles for calculating a 256-bit Hamming distance on a single CPU core.

Table 2. Hamming distance calculation scheme

Pseudocode                    Data bits                        Comment
c=xor(a,b)                    c7 c6 c5 c4 c3 c2 c1 c0          get differing bits
d=and(c,01010101₂)            0 c6 0 c4 0 c2 0 c0              1-bit comb mask
e=and(shr(c,1),01010101₂)     0 c7 0 c5 0 c3 0 c1              shift and mask
f=add(d,e)                    c6+c7 c4+c5 c2+c3 c0+c1          add bits
g=and(f,00110011₂)            0 0 c4+c5 0 0 c0+c1              2-bit comb mask
h=and(shr(f,2),00110011₂)     0 0 c6+c7 0 0 c2+c3              shift and mask
i=add(g,h)                    c4+c5+c6+c7 c0+c1+c2+c3          add
j=and(i,00001111₂)            0 0 0 0 c0+c1+c2+c3              4-bit comb mask
k=and(shr(i,4),00001111₂)     0 0 0 0 c4+c5+c6+c7              shift and mask
l=add(j,k)                    c0+c1+c2+c3+c4+c5+c6+c7          Hamming weight
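A possible SSE2 realization of this scheme for one 128-bit block could look as follows. This is only a sketch under the assumption that the Census strings are available in __m128i registers; it is not the actual S³E code. The per-byte counts are summed with _mm_sad_epu8.

#include <emmintrin.h>
#include <stdint.h>

/* Hamming distance of two 128-bit blocks of Census strings, following
   the bit-counting scheme of Table 2 applied byte-wise, then summing
   the byte counts with _mm_sad_epu8. */
static uint32_t hamming128(__m128i a, __m128i b)
{
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0f);
    __m128i c = _mm_xor_si128(a, b);                       /* differing bits  */
    c = _mm_add_epi8(_mm_and_si128(c, m1),                 /* 1-bit comb mask */
                     _mm_and_si128(_mm_srli_epi16(c, 1), m1));
    c = _mm_add_epi8(_mm_and_si128(c, m2),                 /* 2-bit comb mask */
                     _mm_and_si128(_mm_srli_epi16(c, 2), m2));
    c = _mm_add_epi8(_mm_and_si128(c, m4),                 /* 4-bit comb mask */
                     _mm_and_si128(_mm_srli_epi16(c, 4), m4));
    /* sum the 16 byte counts into two 16-bit partial sums and add them */
    c = _mm_sad_epu8(c, _mm_setzero_si128());
    return (uint32_t)(_mm_cvtsi128_si32(c) +
                      _mm_cvtsi128_si32(_mm_srli_si128(c, 8)));
}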

4.5 Fast Census Transform

According to Equ. 2, a Census transform with a large mask requires many comparison operations between pixel values. SSE provides the _mm_cmplt_epi8 intrinsic that compares 16 pairs of signed 8-bit values at once. As pixel values are unsigned, this instruction cannot be used directly. Since negative numbers are represented in the two's complement system, a remedy is to add a constant value of 128 to every 8-bit operand before the comparison. By this we deliberately produce arithmetic overflows, but the result of the signed comparison is the same as that of an unsigned comparison on the original values. Using the unsigned comparison operator ξ as defined in Equ. 1, and introducing ψ as a comparison operator for a signed data type, we can write this equivalence as

\xi(p_1, p_2) \equiv \psi(p_1 \oplus 128,\; p_2 \oplus 128) ,   (4)

where p_x are 8-bit words and ⊕ is an 8-bit addition with overflow. The described 16×16 Census transform produces strings 256 bits long for every pixel. Especially the computation costs for the Hamming distances are, despite intensive SSE optimizations, still quite large. We recently discovered a way to cope with large mask sizes, which we call a “sparse Census transform”,


where only every second pixel of every second row of the 16×16 mask is evaluated. This means that the bit string becomes significantly shorter, namely only 64 bits. Empirical evaluations revealed that the quality loss of the stereo matching is very small, so it was clear to us that using it is an excellent tradeoff between quality and speed. A more detailed analysis and quantification of the quality effects is ongoing work within our group. The initial implementation of the 16×16 Census transform took more than 2100 cpp, while the optimized SSE version runs at up to 135 cpp on a single CPU core. The sparse Census transform brought a further reduction down to 75 cpp. The Hamming distance calculation also profits significantly from the reduction of the bit string length; it takes roughly only a quarter of the time compared to the value stated in Sect. 4.4.
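A minimal sketch of the comparison trick of Equ. 4 for one block of 16 neighbor pixels might look as follows; the function name is an assumption, and the packing of the comparison mask into the Census bit string (e.g., with _mm_movemask_epi8) is omitted.

#include <emmintrin.h>

/* Sketch of the signed-comparison trick of Equ. 4. Both operands are
   biased by 128 (addition with overflow), so that the signed comparison
   _mm_cmplt_epi8 yields the result of an unsigned comparison of the
   original pixel values. */
static __m128i census_compare_16(__m128i center, __m128i neighbors)
{
    const __m128i bias = _mm_set1_epi8(-128);      /* adds 128 mod 256 */
    __m128i c = _mm_add_epi8(center,    bias);
    __m128i n = _mm_add_epi8(neighbors, bias);
    /* 0xFF where neighbor < center, i.e. xi(center, neighbor) = 1 */
    return _mm_cmplt_epi8(n, c);
}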

4.6 Fast Aggregation

Aggregation operates on 16-bit disparity space images (DSI) where each pixel value represents the matching costs for a given input pixel at a certain disparity level. The aggregated cost value for a certain disparity of a pixel is the sum of cost values over a rectangular neighborhood of size {m, n} around the position {x, y}:

C_{agg}(x, y) = \sum_{i=-\frac{n}{2}}^{\frac{n}{2}} \; \sum_{j=-\frac{m}{2}}^{\frac{m}{2}} C_{x+i,\,y+j} .   (5)

Due to the line-by-line processing of the input images, the program computes the DSI layer by layer. We call such a slice through the DSI a Disparity Space Layer (DSL). A DSL comprises the cost values for all disparities of an image line; thus it is a 2D data structure, which in turn can be treated with image processing functions. It is necessary to keep the last n DSLs in memory, cf. Fig. 2(a). The problem is that the y-neighborhood of cost values is actually spread among different images, which makes it hard to use common optimized filter functions. We addressed this by storing the last n DSLs into one single image frame buffer of n-fold height as shown in Fig. 2(b). It would now be convenient to have cost values with equal disparities, but from adjacent y-coordinates, on top of each other. This can easily be achieved by tweaking the image parameters describing width, height, and the number of bytes per line. We finally get a result as shown in Fig. 2(c), which is an image with only n pixels in height, but with D times the width. Notably, we achieve this without any movement of pixel data in memory. Now, aggregation means no more than applying a linear filter with all filter mask coefficients set to one and the common divisor also set to one. The filter kernel is indicated in its start position in Fig. 2(c). The result of the filtering can be transformed back into an “ordinary” DSL in the reverse manner. The implementation of Table 1 used the general linear filter function of the ippIP, and a relatively fast value of 20.1 cycles per pixel was achieved for a 5×5 aggregation mask from the beginning. As a replacement we implemented a dedicated sum filter function with a fixed mask size, making extensive use of SSE intrinsics. We achieved a single-core peak performance of 4.2 cpp for a 5×5 filter.

[Figure 2 illustrates the memory layout with three panels: (a) Conventional method: storing separate DSLs for each line. (b) Storing the last n DSLs into a common frame buffer. (c) The same frame buffer with tweaked image description parameters; common filtering functions are now applicable.]

Fig. 2. Memory layout for efficient cost aggregation (example with a 3×3 mask)
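The following C fragment sketches the descriptor tweak of Fig. 2(c); the CostImage struct and its field names are assumptions made for this illustration and do not correspond to the actual PfeLib or ippIP data structures.

#include <stdint.h>

/* Illustrative image descriptor; the actual library descriptors differ. */
typedef struct {
    uint16_t *data;     /* 16-bit cost values              */
    int       width;    /* pixels per row                  */
    int       height;   /* number of rows                  */
    int       stride;   /* bytes from one row to the next  */
} CostImage;

/* The last n DSLs (each W x D cost values, W = image width, D = number
   of disparities) lie consecutively in one buffer as in Fig. 2(b).
   Reinterpreting that buffer as an image of width W*D and height n,
   with the stride of one complete DSL, places cost values of equal
   (x, d) but adjacent image lines on top of each other as in Fig. 2(c),
   without moving any pixel data. A standard sum filter applied to this
   image then performs the aggregation of Equ. 5. */
static CostImage dsl_as_wide_image(uint16_t *dsl_buffer, int W, int D, int n)
{
    CostImage img;
    img.data   = dsl_buffer;
    img.width  = W * D;
    img.height = n;
    img.stride = W * D * (int)sizeof(uint16_t);
    return img;
}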

4.7 Combined DSL for Left/Right Consistency Check

A left/right consistency check procedure usually implies that the DSL for a certain line of an image pair is calculated separately by horizontally shifting the right image line and matching it against the fixed left image line (RL), and also doing this the opposite way (LR). The resulting DSLs are plotted in Fig. 3(a) and Fig. 3(b). To get the disparities with the smallest costs d_opt(x), the minimum values of each column have to be searched (winner takes all strategy, WTA). If the minima for a certain pixel are located at the same disparities in the LR-DSL as well as in the RL-DSL, the consistency check is passed and the likelihood of a correct match is high. It is obvious that many pairs of pixels are actually matched twice against each other during this procedure, so we analyzed whether the implementation could be improved. The lower part of Fig. 3 points out our approach. When the RL-DSL is skewed as shown in Fig. 3(c), it can be overlaid on the LR-DSL without loss of information. The resulting data structure is shown in Fig. 3(d), which requires only about half of the computations and memory compared to the two separate DSLs. The only difference is that for searching the best LR matches the WTA must operate in a diagonal direction rather than vertically. This technique allows us to reduce the memory usage and calculation effort for the Hamming distances as well as for the cost aggregation by almost one half, so the factor of two in Equ. 3 has effectively disappeared.
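To make the two search directions concrete, a small sketch follows. The array layout cost[d·W + x], holding the cost of left pixel (x + d) against right pixel x, as well as the function names, are assumptions chosen to mirror Fig. 3(d) with d_min = 0; they are not the actual S³E data structures.

#include <stdint.h>

/* WTA search on a combined DSL as in Fig. 3(d): the RL search is a
   vertical minimum search in one column, the LR search runs along a
   diagonal. */
static int wta_right(const uint16_t *cost, int W, int D, int xR)
{
    int best_d = 0;
    uint16_t best = UINT16_MAX;
    for (int d = 0; d < D; d++) {                  /* vertical search */
        uint16_t c = cost[d * W + xR];
        if (c < best) { best = c; best_d = d; }
    }
    return best_d;
}

static int wta_left(const uint16_t *cost, int W, int D, int xL)
{
    int best_d = 0;
    uint16_t best = UINT16_MAX;
    for (int d = 0; d < D && xL - d >= 0; d++) {   /* diagonal search */
        uint16_t c = cost[d * W + (xL - d)];
        if (c < best) { best = c; best_d = d; }
    }
    return best_d;
}

A consistency check based on this layout would accept disparity d for left pixel x_L only if wta_right(x_L − d) yields the same d.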

4.8 Using Multiple Cores with OpenMP

Due to the line-by-line processing of major parts of the stereo matching algorithm, it is quite easy to partition the workload among several threads which run on different CPU cores. We parallelized S³E for a dual-core CPU, which could be accomplished with little effort and resulted in a decrease of the calculation time from 143.8 ms to 75.9 ms. This corresponds to a speedup factor of almost 1.9.
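A minimal sketch of this kind of row-wise parallelization is given below; process_line and match_all_lines are placeholder names for this illustration and not the actual S³E interface.

#include <omp.h>

/* Placeholder for the per-line matching steps (Census transform, DSI
   calculation, aggregation, WTA); not part of the actual S3E code. */
extern void process_line(int y);

/* The line-by-line structure of the algorithm allows a simple
   parallel for over the line index to distribute the work across the
   available CPU cores. */
void match_all_lines(int height)
{
    int y;
    #pragma omp parallel for schedule(static)
    for (y = 0; y < height; y++) {
        process_line(y);
    }
}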

[Figure 3 shows four panels: (a) Separately calculated RL-DSL. (b) Separately calculated LR-DSL. (c) Skewed RL-DSL. (d) Union of LR and skewed RL-DSL.]

Fig. 3. Evolving a combined DSL. The example illustrates the matching of a single 10 pixel wide image line for dmin = 0 and dmax = 2. The entries u/v stand for the matching costs of the u-th pixel from the left line against the v-th pixel from the right line.

5 Summary and Future Work

The performance figures of the final implementation are presented in Table 3. We were able to speed up the system from 8.5 s to 75.9 ms per frame for input image dimensions of 450×375 and 50 disparities on a Core 2 Duo at 2 GHz. This results in a frame rate of 13 fps.

Table 3. Performance of the final, optimized implementation on a Core 2 Duo CPU at 2 GHz. Image dimensions are 450×375 and the disparity search range is 50. Optimization speedup factors are derived from a comparison against Table 1.

Function                            Image dimensions   Cycles per pixel   Function calls   Time (ms)   Speedup factor
Hamming distance                    435 × 50           7.2                364              28.53       276.7
WTA minimum search                  431 × 50           2.9                712              22.07       3.6
Cost aggregation                    22396 × 1          2.7                356              10.82       13.2
Census transform                    435 × 360          38.2               2                5.98        56.0
Lens undistort. + rectification     450 × 375          24.8               2                4.18        8.3
Disparity to Z-image calc.          450 × 375          17.2               1                1.45        1.6
Disparity to 3D point cloud calc.   333 × 356          22.6               1                1.34        2.5
LR/RL consistency check             431 × 1            9.3                356              0.72        1.0
Confidence thresholding             450 × 375          6.0                1                0.51        1.3
Initialization stage                450 × 375          2.3                1                0.2         7.4
Complete program run                450 × 375          899.9              -                75.93       112.1

The run-time per frame of S³E depends on the input image dimensions and on the size of the disparity search range. If the same scene is to be sensed with a stereo vision system at a higher input resolution, it is also necessary to raise the number of disparities in order to keep the depth range of perception constant. In this case we can express the behavior of the run-time per frame t_f according to the input resolution r as t_f ∈ O(r³). The results of test runs over a variety of image dimensions and disparity ranges are depicted in Fig. 4. E.g., on QVGA input images (320×240) and a disparity range of 30, S³E achieves 42 fps on an Intel Core 2 Duo at 2 GHz.

[Figure 4 is a chart of the achieved frame rate (fps) over the input image dimensions (240×180, 320×240, 480×360, 640×480, 800×600) for disparity ranges d = 15, 30, 50, 80, and 120.]

Fig. 4. Frame rates of S³E achieved on a 2 GHz Core 2 Duo CPU at various image dimensions and disparity ranges

The current implementation still leaves some potential for further optimization. For instance, the WTA minimum search, which is now the second most costly function in Table 3, is not yet heavily optimized. Another option is making the Hamming distance calculations faster for CPUs that provide the POPCNT instruction. Using the Intel C/C++ compilers, which are available for Windows and Linux, will probably yield a further speedup. Slight modifications will be necessary to enable the program to use more than two cores efficiently. A big topic will be the planned migration of the software to an embedded DSP platform. We expect that this should be possible with limited additional work because of the extensive use of PfeLib, which is multi-platform capable and already contains many optimized functions for C64x DSPs [14].

References

1. Cyganek, B.: Comparison of Nonparametric Transformations and Bit Vector Matching for Stereo Correlation. In: Klette, R., Žunić, J. (eds.) IWCIA 2004. LNCS, vol. 3322, pp. 534–547. Springer, Heidelberg (2004)


2. Woodfill, J.I., Von Herzen, B.: Real-Time Stereo Vision on the PARTS Reconfigurable Computer. In: Proceedings of the 5th IEEE Symposium on FPGAs for Custom Computing Machines (1997)
3. Kuhn, M., Moser, S., Isler, O., Gurkaynak, F.K., Burg, A., Felber, N., Kaeslin, H., Fichtner, W.: Efficient ASIC Implementation of a Real-Time Depth Mapping Stereo Vision System. In: Proceedings of the 46th IEEE International Midwest Symposium on Circuits and Systems (2004)
4. Murphy, C., Lindquist, D., Rynning, A.M., Cecil, T., Leavitt, S., Chang, M.L.: Low-Cost Stereo Vision on an FPGA. In: Proceedings of the 15th IEEE Symposium on FPGAs for Custom Computing Machines (2007)
5. Woodfill, J.I., Gordon, G., Jurasek, D., Brown, T., Buck, R.: The Tyzx DeepSea G2 Vision System, A Taskable, Embedded Stereo Camera. In: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition - Workshops (2006)
6. Zabih, R., Woodfill, J.I.: Non-parametric Local Transforms for Computing Visual Correspondence. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 151–158. Springer, Heidelberg (1994)
7. Khaleghi, B., Ahuja, S., Wu, Q.M.J.: An Improved Real-Time Miniaturized Embedded Stereo Vision System (MESVS-II). In: Proceedings of the 2008 Conference on Computer Vision and Pattern Recognition - Workshops (2008)
8. Intel Corporation: Intel Core2 Duo Processors and Intel Core2 Extreme Processors for Platforms Based on Mobile Intel 965 Express Chipset Family. Document Number: 316745-005 (January 2008)
9. Texas Instruments: TMS320C6414T, TMS320C6415T, TMS320C6416T Fixed-Point Digital Signal Processors. Lit. Number: SPRS226K, http://www.ti.com
10. OpenMP Architecture Review Board: OpenMP Application Program Interface (May 2008), http://openmp.org
11. Zinner, C., Kubinger, W., Isaacs, R.: PfeLib: A Performance Primitives Library for Embedded Vision. EURASIP J. on Embed. Syst. 2007(1), 14 pages (2007)
12. Intel Corporation: Intel Integrated Performance Primitives for Intel Architecture. Document Number: A70805-021US (2007)
13. Kuznetsov, A.: BitMagic Library: Document about SSE2 Optimization (July 2008), http://bmagic.sourceforge.net/bmsse2opt.html
14. Zinner, C., Kubinger, W.: ROS-DMA: A DMA Double Buffering Method for Embedded Image Processing with Resource Optimized Slicing. In: RTAS 2006: Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2006), pp. 361–372 (2006)