© 2007 Institute for Scientific Computing and Information

INTERNATIONAL JOURNAL OF INFORMATION AND SYSTEMS SCIENCES
Volume 3, Number 3, Pages 349–364

VISUAL INFORMATION PROCESSING AND CONTENT MANAGEMENT: AN OVERVIEW

PHILIP O. OGUNBONA

Abstract. Visual information processing and the management of visual content have become a significant part of the contemporary economy. The visual information processing pipeline is divided into several modules: (i) capture and enhancement, (ii) efficient representation for storage and transmission, (iii) processing for efficient and secure distribution, and (iv) representation for efficient archiving and retrieval. Advances in semiconductor technology and in optimum signal processing models and algorithms provide tools to improve each module of the processing pipeline. Insight from other areas of study, including psychology, augments and informs the models being developed to understand and design efficient visual content management systems. The paper provides a brief overview of the modules in the pipeline in one place for easy reference.

Key Words. active pixel sensor, CMOS image sensor, CCD, image processing, image coding, video coding, image retrieval, image watermarking.

1. Introduction

Visual information capture, processing, storage, transmission and distribution have become a viable commercial workflow due to significant advances in semiconductor technology and the development of intelligent digital signal processing algorithms. The ability to capture and process digital visual information has found application in diverse areas including space exploration, video surveillance, industrial monitoring, medical imaging for diagnosis, mining, advertising, filming, etc. More recently, the ready availability of professional and consumer image and video capture devices has created a community of visual information content creators. At the same time, the digital nature of the visual information, together with the availability of numerous powerful processing algorithms and computers, has placed the power of digital image processing in the hands of professionals and consumers.

In this paper we present an overview of the theoretical principles of visual information processing and the management of visual content. The visual information processing pipeline can be divided into several modules: (i) capture and enhancement, (ii) efficient representation for storage and transmission, (iii) processing for efficient and secure distribution, and (iv) representation for efficient archiving and retrieval. The paper is divided accordingly into four sections. Section 2 explores the principles of image capture and the technical problems associated with the process. We present and review some of the solutions available in the contemporary literature. We also present possible future trends. In Section 3, we proceed to explore the problem of image and video coding. In particular we review some of the classical methods of signal representation and the more contemporary multi-resolution approach.

Received by the editors January 1, 2004 and, in revised form, March 22, 2004.
2000 Mathematics Subject Classification. 35R35, 49J40, 60G40.


Orthogonal transform and wavelet techniques of signal representation are reviewed. In Section 4, we review the problem of visual content distribution from the viewpoint of the enabling technologies. In particular, we review techniques developed for copyright protection, tamper detection, etc. The important problem of image and video archiving and retrieval is reviewed in Section 5. In particular, we present the problem from the viewpoint of a multilevel description that starts at the pixel level and ends at the conceptual or semantic level. The problem of the semantic gap is articulated and techniques available to bridge the gap are reviewed. In the concluding section we summarize the trends that have emerged from this overview.

2. Image Capture and Enhancement

The pipeline of processes in a simplified single sensor digital still camera, as shown in Figure 1, consists of an image sensor followed by a series of image enhancement modules that generate the finished image.

Figure 1. A simplified single sensor camera.

The sensor employs a colour filter array (CFA) to separate incoming light into a specific spatial arrangement of the colour components. One of the possible patterns is the Red-Green-Blue (RGB) Bayer CFA mosaic [1] shown in Figure 2. Colour interpolation is employed to estimate the missing colours in the mosaic, a process referred to as de-mosaicing. The image obtained through the colour filters needs to be corrected for the white-point and colour saturation in order to reproduce the colour of the original scene with high fidelity and the expected human visual response. It is interesting to note that the white-point and colour corrections are dependent on the illuminant. Gamma correction is performed to compensate for the nonlinearity of viewing and printing devices. The automatic exposure module is coupled with the sensor and used to dynamically adjust the integration time of the sensor and produce correctly exposed images.

Figure 2. A Bayer pattern of pixels.

R G R G ···
G B G B ···
R G R G ···
G B G B ···
··· ··· ··· ··· ···
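To make the enhancement chain concrete, the sketch below runs a toy version of it on a synthetic Bayer mosaic: bilinear de-mosaicing by normalised convolution, a diagonal white-balance correction and gamma correction. It is a minimal illustration rather than a production pipeline; the white-balance gains and the gamma value of 2.2 are assumed for the example and are not taken from the text.

```python
import numpy as np
from scipy.signal import convolve2d

def demosaic_bilinear(raw):
    """Bilinear de-mosaicing of an RGGB Bayer mosaic (toy sketch)."""
    h, w = raw.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1
    g_mask = 1.0 - r_mask - b_mask
    k = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]])

    def interp(mask):
        # Normalised convolution: spread the sampled values, then divide by the
        # local density of samples so missing pixels become neighbour averages.
        num = convolve2d(raw * mask, k, mode="same")
        den = convolve2d(mask, k, mode="same")
        est = num / np.maximum(den, 1e-8)
        return raw * mask + est * (1.0 - mask)   # keep measured samples as-is

    return np.dstack([interp(r_mask), interp(g_mask), interp(b_mask)])

def white_balance(rgb, gains=(1.8, 1.0, 1.5)):
    # Diagonal (von Kries style) correction; the gains would normally be
    # estimated from the illuminant, here they are just example values.
    return np.clip(rgb * np.asarray(gains), 0.0, 1.0)

def gamma_correct(rgb, gamma=2.2):
    # Compensate for the nonlinearity of viewing and printing devices.
    return rgb ** (1.0 / gamma)

# Toy usage: a flat grey scene sampled through an RGGB mosaic.
raw = np.full((8, 8), 0.5)
image = gamma_correct(white_balance(demosaic_bilinear(raw)))
print(image.shape)  # (8, 8, 3)
```

In a real camera these stages are implemented in hardware or firmware, and practical de-mosaicing filters are considerably more sophisticated than the bilinear interpolation used here.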

Image or video capture devices are developed using CMOS- or CCD-based image sensor technology. The development of the active pixel sensor (APS) as a replacement for the passive CMOS imager brought CMOS into competition with CCD-based imagers, especially in terms of low power dissipation, scaling and the possibility of camera system integration [5, 6].


In CMOS technology, both the photodiode and the photogate (also the PIN photodiode) have been employed as the photodetector. The pixel consists of the photodetector in addition to the readout, amplification, row select and reset transistors. CCD-based image sensors are typically classified into frame-transfer (FT), interline-transfer (IT), and virtual-phase (VP) architectures [8]. In general the CCD image sensor is made up of an image area, a horizontal CCD, and an output amplifier. The image area includes photodiodes and vertical CCDs that capture the illumination and transfer charges to the horizontal CCD and detecting amplifiers [7]. Essentially the photodetector and the associated electronics are responsible for amplification, analog-to-digital conversion and digital readout. In the photodetector, photons are absorbed in the semiconductor material and the absorbed energy releases an electron-hole pair. The integrated charge at the site of minority carrier collection is proportional to the product of the amount of light falling on the pixel and the exposure time [2].

Advances in both CMOS and CCD based image sensors have been made in terms of improvements in the basic performance indicators. Some of these include dark current noise, conversion gain, optical sensitivity, dynamic range, fixed-pattern noise, quantum efficiency and pixel cross-talk. The improvements have been gained from better circuit design and process innovations [3, 4, 5, 6, 12]. Early passive CMOS sensors were beset with fixed-pattern noise caused by device mismatch. This has been suppressed by the development of the active pixel sensor architecture and the use of correlated double sampling, which suppresses the so-called 1/f and kT/C noise. Correlated double sampling as a signal processing method has been used successfully in both CMOS and CCD readout circuits.

The dark current noise is the statistical fluctuation of the electrons created in the photodetector independent of the light falling on the detector but linearly correlated with the integration time. There is also a portion of the fixed-pattern noise that is attributable to dark current nonuniformity. The dark current noise performance of CMOS sensors is poorer than that of CCDs [5]. Some of the techniques used to improve the dark current noise figure include the use of charge pumping and surface pinning.

The fill factor (FF) is the ratio of the light-sensitive area to the total pixel area. In CMOS active pixel sensors, where there is more circuitry than in CCD-based sensors, achieving a high fill factor is a challenge. The effective quantum efficiency (QE) is reduced by the fill factor, QE_eff = FF × QE. It is expected that the quantum efficiency can be improved as the feature dimensions of CMOS technology shrink and an increased fill factor is attained [5].

Conversion gain refers to the charge-to-voltage conversion efficiency of the sensor; the design of photodiode-based CMOS sensors leads to a lower figure of merit than photogate sensors. In general, photodiode designs are more sensitive to visible light, especially in the short-wavelength region of the spectrum. Photogate devices usually have larger pixel areas, but a lower fill factor and much poorer short-wavelength light response than photodiodes.
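The idea behind correlated double sampling can be illustrated with a short numerical sketch: each pixel is read twice, once just after reset and once after integration, and the difference cancels the offset and reset (kT/C) noise that is common to both samples. The signal model below is a deliberately simplified assumption made for the example, not a circuit description from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 100_000

signal = 0.4                                    # photo-generated signal (arbitrary units)
fixed_offset = rng.normal(0.0, 0.05, n_pixels)  # per-pixel offset (fixed-pattern component)
reset_noise = rng.normal(0.0, 0.02, n_pixels)   # kT/C noise, frozen at reset
read_noise = lambda: rng.normal(0.0, 0.005, n_pixels)

# Two reads of the same pixel: reset level first, then signal level.
sample_reset = fixed_offset + reset_noise + read_noise()
sample_signal = fixed_offset + reset_noise + signal + read_noise()

cds_output = sample_signal - sample_reset       # offset and reset noise cancel

print("std of single read:", sample_signal.std())
print("std of CDS output :", cds_output.std())   # only the uncorrelated read noise remains
print("mean CDS output   :", cds_output.mean())  # close to the true signal
```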
It is possible for photons incident on one pixel to generate carriers that are collected by a different pixel, a phenomenon referred to as pixel cross-talk. The effect of pixel cross-talk is to reduce image sharpness and degrade colorimetric accuracy. The dynamic ranges of CMOS and CCD sensors are determined by different factors. In a CCD, the factors that enter the defining equation include the capacity of the CCD stage, the conversion gain, the dark current noise and the total r.m.s. read noise, whereas the CMOS dynamic range depends on the threshold voltage of the n-device, the combined gain of the pixel and column source followers, and the conversion gain.
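As a rough illustration of how such factors combine, a common first-order figure quotes dynamic range as the ratio of the largest storable charge to the noise floor, expressed in decibels. The numbers below are assumed, representative values chosen only to show the calculation; they are not taken from the text or from any particular sensor.

```python
import math

full_well_electrons = 20_000   # assumed charge capacity of the pixel/CCD stage
noise_floor_electrons = 10.0   # assumed combined dark-current and r.m.s. read noise

# First-order dynamic range estimate in decibels.
dynamic_range_db = 20.0 * math.log10(full_well_electrons / noise_floor_electrons)
print(f"approximate dynamic range: {dynamic_range_db:.1f} dB")  # about 66 dB
```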


3. Efficient Representation for Storage and Transmission

Typically, the spatial resolution of images generated by the capture process is on the order of several million pixels, with each pixel having a depth of 8-12 bits. Images of such size dictate a large storage requirement and a high transfer rate if they are to be useful in a practical imaging system. The theory of source coding provides a rich set of techniques that have led to the development of efficient compression algorithms. In this paper we are interested in source coding techniques that allow lossy compression because they provide high compression ratios. The performance of such algorithms is measured not only in terms of the compression ratio (or bit rate) but also on the basis of some measure of the degree of fidelity retained in the reconstructed image. The rate-distortion theorem gives bounds on the design choices available in selecting the parameters of a practical compression system.

The starting point is the modelling of the captured image, and this can be conveniently achieved by representing it as a sample of a stochastic process. Such a process is completely described by the knowledge of its joint probability density. Among the major stochastic models that have found application in image processing are the covariance models in one and two dimensions [15, 16]. The covariance model has formed the basis of very efficient compression algorithms that employ unitary transforms. We refer to these algorithms as transform coders. For a one-dimensional sequence, {u(n), 0 ≤ n ≤ N − 1}, of image pixels represented as a vector u of size N, a unitary transformation is given by [15],

(1)    v = Au;    v(k) = Σ_{n=0}^{N−1} a(k, n) u(n),    0 ≤ k ≤ N − 1

where the unitary property of the transformation, A, implies A^{−1} = A^{*T}, and thus the inverse transformation is conveniently given as,

(2)    u = A^{*T} v;    u(n) = Σ_{k=0}^{N−1} v(k) a*(k, n),    0 ≤ n ≤ N − 1
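As a concrete instance of Equations (1) and (2), the following sketch builds the orthonormal DCT-II matrix, verifies the unitary property A^{−1} = A^{*T} numerically (the DCT is real, so A^{*T} = A^T), and applies the forward and inverse transforms to a short pixel vector. The pixel values are made up for the example; SciPy's dct routine is used only as a convenient way to form the matrix.

```python
import numpy as np
from scipy.fft import dct, idct

N = 8
# Orthonormal DCT-II matrix: row k holds the basis vector a(k, n).
A = dct(np.eye(N), axis=0, norm="ortho")

# Unitary property: A^{-1} = A^T for this real transform.
assert np.allclose(A.T @ A, np.eye(N))

u = np.array([52, 55, 61, 66, 70, 61, 64, 73], dtype=float)  # example pixel row
v = A @ u              # Equation (1): forward transform
u_rec = A.T @ v        # Equation (2): inverse transform
assert np.allclose(u, u_rec)

# Same result via the library routines.
assert np.allclose(v, dct(u, norm="ortho"))
assert np.allclose(u, idct(v, norm="ortho"))
print(np.round(v, 2))  # most of the energy sits in the first few coefficients
```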

It is the energy compaction property of unitary transforms that makes them very useful for image and video compression. If µ_u and R_u denote the mean and covariance of the vector of image pixels, u, the transformed vector v has mean and covariance given by,

(3)    µ_v = A µ_u

(4)    R_v = A R_u A^{*T}

Furthermore, if the components of the vector are highly correlated, the coefficients of the transformation are approximately uncorrelated and the structure of the covariance matrix is such that the off-diagonal terms are small compared to the diagonal terms. The Karhunen-Loève (KL) transform is optimum in terms of the energy packing property, in that the representation of the vector in the transformed domain has most of the energy packed into the fewest number of coefficients. This representation presents an opportunity for compression, since any truncation of the coefficients yields an optimum representation based on the retained coefficients. Ahmed et al. [17] have shown that the performance of the discrete cosine transform (DCT) compares favourably with that of the Karhunen-Loève transform in terms of energy packing for data modelled as a Markov source with high inter-sample correlation.
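The energy-packing behaviour described above can be checked numerically. The sketch below uses the first-order Markov (AR(1)) covariance model R_u(i, j) = ρ^{|i−j|}, one of the covariance models referred to earlier, transforms it according to Equation (4) with the orthonormal DCT, and compares the resulting coefficient variances with those of the KLT (the eigenvalues of R_u). The correlation value ρ = 0.95 and the block size are assumed example parameters.

```python
import numpy as np
from scipy.fft import dct

N, rho = 16, 0.95
idx = np.arange(N)
R_u = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) / Markov-1 covariance model

A = dct(np.eye(N), axis=0, norm="ortho")           # orthonormal DCT matrix
R_v = A @ R_u @ A.T                                # Equation (4)
dct_var = np.sort(np.diag(R_v))[::-1]              # DCT coefficient variances

klt_var = np.sort(np.linalg.eigvalsh(R_u))[::-1]   # KLT variances = eigenvalues of R_u

# Off-diagonal terms of R_v are small compared to the diagonal ...
off_diag = np.max(np.abs(R_v - np.diag(np.diag(R_v))))
print("max off-diagonal / max variance:", round(off_diag / dct_var[0], 3))

# ... and the DCT packs energy almost as well as the optimal KLT.
for m in (1, 2, 4):
    print(f"energy in first {m} coefficients: "
          f"KLT {klt_var[:m].sum() / klt_var.sum():.3f}  "
          f"DCT {dct_var[:m].sum() / dct_var.sum():.3f}")
```

For highly correlated data the two columns come out very close, which is the observation in [17] that motivated the adoption of the DCT.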


Also, the rate-distortion performances are found to be comparable [17, 15]. In [18] it was also shown that the DCT performs best, compared to the Walsh-Hadamard and Haar transforms, when encoding step changes with arbitrary phase in images. Smooth areas of images have very high correlation and are amenable to high compression because the resulting AC components of the DCT coefficients carry little energy and can be truncated. The edges in images are well represented, and this leads to reasonable reconstructed images at moderate bit rates; there are noticeable ringing artifacts around edges at very low bit rates [18]. The DCT has been successful as the transform of choice in the JPEG, MPEG and H.263 coders because of its good performance at most practical bit rates and the availability of a fast implementation. A typical transform coding system is shown in Figure 3. It is important to stress that the compression is achieved through efficient quantization of the transform coefficients.

Figure 3. A typical transform coding system.

The problem of signal representation for subsequent encoding is related to the efficient representation of the features of the signal. A multiresolution analysis allows the representation of the details of an image at different scales and locations. The development and usefulness of the wavelet transform stem from the inadequacy of analysis techniques based on the Fourier series or Fourier transform. We begin the introduction of the wavelet transform with the idea of a scaling function. We define a set of scaling functions in terms of integer translates of a basic scaling function [19],

(5)    ϕ_k(t) = ϕ(t − k),    k ∈ Z and ϕ ∈ L^2

where L^2 is the space of square integrable functions and Z is the set of integers. For −∞ < k < ∞ these functions generate functions in a subspace of L^2, and in general the scaling functions can be made to span arbitrarily sized spaces by scaling and translating, as ϕ_{j,k}(t) = 2^{j/2} ϕ(2^j t − k). In a multiresolution development the nesting of the spaces spanned by the scaling function gives rise to a scaled representation of the functions [19]. It turns out that the differences between the spaces spanned by the scaling functions can also provide a description of the signal. A wavelet, ψ_{j,k}(t), spans the differences between the spaces spanned by the scaling functions. We note that the wavelets themselves can be described in terms of the scaling functions. Thus the scaling function and the wavelet provide a means of representing any function g(t) ∈ L^2(R) [19],

(6)    g(t) = Σ_{k=−∞}^{∞} c(k) ϕ_k(t) + Σ_{j=0}^{∞} Σ_{k=−∞}^{∞} d(j, k) ψ_{j,k}(t)

where the coefficients c(k) and d(j, k) are appropriately determined by inner products if the expansion of Equation (6) represents an orthonormal basis.
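A single-level discrete wavelet decomposition makes the sparsity of the d(j, k) coefficients tangible. The sketch below uses the Haar filter pair, written directly in NumPy so that no wavelet library is assumed, and applies it to a piecewise-smooth signal: the detail coefficients are essentially zero except near the step, which is the point-singularity behaviour discussed here. The signal itself is made up for the example.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar wavelet transform (orthonormal filter pair)."""
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # scaling (c) coefficients
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # wavelet (d) coefficients
    return approx, detail

# Piecewise-smooth example signal with a single step.
t = np.linspace(0, 1, 64)
signal = np.where(t < 0.48, 0.2, 0.8) + 0.01 * t

approx, detail = haar_dwt_level(signal)
print("number of detail coefficients :", detail.size)
print("detail coefficients above 1e-3:", int(np.sum(np.abs(detail) > 1e-3)))
# Only the coefficient straddling the step is significant; the rest are
# nearly zero, which is the sparsity exploited by coders such as EZW and SPIHT.
```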


There is also an alternative development of wavelets in terms of filterbanks [19, 20, 21], and this has formed the basis of several practical implementations and of the presentation of many properties of the wavelet. The representation given by wavelets forms an unconditional basis and thus leads to a sparse representation in which the coefficients drop off rapidly as j and k increase. It is important to note that the DCT and the wavelet transform capture different features of the image. The DCT is good at capturing the oscillating features of an image while the wavelet is better at capturing point singularities. Wavelets have formed the basis of more recent image compression standards such as JPEG2000 [22]. Several properties of wavelets have been exploited to great advantage to achieve an embedded bitstream presentation of the encoded image and to provide scalability in both spatial resolution and bit rate (signal-to-noise ratio). The fact that wavelets are only able to capture point singularities leaves room for further exploration, and there has been continued activity in the research community [23]-[29] to better capture other features, including lines and curvatures, in the image.

The sparse representation provided by both the DCT and the wavelet family of transforms allows efficient quantization and compression of images. In general the multiresolution representation allows a mapping of the wavelet coefficients into a quadtree structure that captures the respective image features. Wavelet-based compression algorithms such as the embedded zerotree wavelet (EZW) [30], SPIHT [31] and space frequency quantization (SFQ) [32] exploit this sparsity to achieve compression without incurring significant distortion at very low bit rates.

4. Visual Content Distribution

Efficient compression of images has facilitated storage and transmission across heterogeneous networks. In particular, the embedded bitstream of encoded JPEG2000 images and the progressive encoding of JPEG images allow images to be transmitted and viewed over low bandwidth networks. Despite the utility provided by compression, the distribution of visual content for commercial purposes requires guarantees on the preservation of the copyrights of the owners at different levels. Encryption techniques are only useful in guaranteeing the reception of transmitted images by the intended recipient; once received, the images can be easily replicated and re-distributed. We can identify at least three areas of rights management: authentication (or verification), proof of ownership and covert communication (or steganography). Efforts to meet these requirements through the development of various digital watermarking algorithms have engaged the research community over the last decade. It is interesting to note that digital watermarks are inserted in images or video prior to compression and must at least survive various types and levels of compression, in addition to other image processing operations that the image or video might undergo. In [33], a digital watermark was defined as a set of secondary digital data that is embedded into a primary digital image (also called the host signal) by modifying the pixel values of the primary image. Digital watermarks were further classified according to their appearance and application domain.
Three classes were identified [33]: (i) visible, (ii) invisible-robust and (iii) invisible-fragile. Each of these categories imposes some requirements on the process of embedding, on visibility (or imperceptibility) and on robustness. A visible watermark should be obvious in both colour and monochrome images, and even visible to people with colour blindness. However, the watermark should not be such that it obscures the image being watermarked.


Additionally, the watermark must be difficult to remove. Both invisible-robust and invisible-fragile watermarks must not introduce noticeable artifacts into the watermarked images. Robustness to standard image processing operations is a very important requirement for the invisible-robust watermark, while security of the watermark is the most essential requirement in the class of invisible-fragile watermarks. The amount of watermark that can be embedded in an image is related to the robustness and also correlates with the degree of impairment, and consequently with the visibility, of the watermark. These three constraints, viz. robustness, capacity and imperceptibility, have guided the design and evaluation of watermarks. From the viewpoint of extraction we can categorize watermarking into (i) blind, (ii) semi-blind and (iii) non-blind techniques.

Figure 4. A Generic Watermark Encoder [34].

Figure 5. A Generic Watermark Decoder [34].

More formally, digital watermarking (Figure 4) is the embedding of a given digital signal, w, into another signal, C_o, called the cover signal or host signal, such that its presence is imperceptible. A secret key, K, is employed to ensure the security of the output watermarked signal, C_w. The watermark decoder (Figure 5) can employ the marked and possibly manipulated signal, Ĉ_w, the original host signal, C_o, the watermark, w, and the key, K, to produce an estimate of the watermark, w'. The manipulation is often thought of as an attack on the watermark intended to render it useless for its purpose. This formulation can be written as [34],

• E_K : O × K × W → O ;  E_K(c_o, w) = c_w
• D_K : O × K → W
• C_τ : W^2 → {0, 1} ;  C_τ(w', w) = 1 if c ≥ τ, and 0 if c < τ

where c is a measure of the similarity between the extracted watermark w' and the original watermark w, and τ is a detection threshold.
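A minimal numerical sketch of this formulation, under the assumption of a simple additive spread-spectrum style scheme, is given below: the key K seeds a pseudo-random watermark pattern, E_K adds it to the host signal with a small strength, and the detector regenerates the pattern, correlates it with the received signal and compares the correlation c against a threshold τ. It is a blind detector in the terminology above, since it does not use the original host. The embedding strength and threshold are made-up illustrative numbers, and real watermarking systems are considerably more elaborate (perceptual masking, transform-domain embedding, robustness to compression).

```python
import numpy as np

def embed(host, key, strength=0.05):
    """E_K: add a key-dependent pseudo-random pattern to the host signal."""
    w = np.random.default_rng(key).choice([-1.0, 1.0], size=host.shape)
    return host + strength * w

def detect(received, key, tau=0.02):
    """C_tau: correlate with the regenerated pattern and threshold the result."""
    w = np.random.default_rng(key).choice([-1.0, 1.0], size=received.shape)
    c = float(np.mean(received * w))       # correlation-based similarity measure
    return c, int(c >= tau)

rng = np.random.default_rng(1)
host = rng.normal(0.0, 0.1, size=64 * 64)  # stand-in for zero-mean host samples
                                           # (e.g. mid-frequency transform coefficients)
marked = embed(host, key=1234)
noisy = marked + rng.normal(0.0, 0.01, size=marked.shape)  # a mild "attack"

print(detect(noisy, key=1234))  # correlation near the embedding strength -> detected
print(detect(host, key=1234))   # unmarked host -> correlation near zero -> not detected
```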