
Image Computing for Visual Information Systems
Ishwar K. Sethi
Vision and Neural Networks Laboratory, Department of Computer Science
Wayne State University, Detroit, MI 48202, USA

1. Introduction
The last few years have seen a remarkable growth of information in digital form, driven by a combination of technological factors that include the availability of high-speed electronic networks and affordable yet powerful desktop computing. Coupled with advances in scientific and consumer imaging technology (digital radiography, digital still cameras, desktop scanners, and video capture boards), this has created an environment in which a large fraction of digital information is created and disseminated in visual form: graphics, images, and video. This trend towards information in visual form is not surprising, considering that we use vision more than any other sense to acquire and communicate information.
Information in visual form differs from traditional database information in several important ways. First, it requires more storage space. For example, a color image of 640x480 pixels requires almost one million bytes in uncompressed form (640 x 480 x 3 = 921,600 bytes, assuming 24 bits per pixel). Second, visual information is highly unstructured and needs some kind of decoding process to determine its semantic content: a two-dimensional array of pixel values representing an image must be processed through stages of segmentation and recognition to determine its information content [1]. Finally, information in visual form is often ambiguous, difficult to describe, and subject to several interpretations. As a result of these characteristics, the management of visual information poses several challenges with respect to its storage and retrieval, its delivery and presentation, and its querying and search. Various issues related to these challenges are currently the subject of active research, and this trend is expected to continue as more improvements are made in the acquisition, storage, retrieval, and delivery of visual information [2-3].
The incessant growth of visual information systems is affecting research in several fields, and image processing and computer vision are no exception. A large body of current research in image processing and computer vision is driven by the needs of visual information systems. These efforts can be grouped into three distinct categories. The first category, known as manipulation and visual compositing, consists of efforts aimed at manipulating existing images and videos stored in a visual information system to produce a set of new images or videos. The main applications of manipulation and compositing are in advertising, education and training, and entertainment. The second category, content characterization and indexing, consists of efforts aimed at providing suitable means of visual information retrieval, that is, retrieval of images and video clips of specified content. Many of the visual manipulation operations of the first category are also needed here. Content characterization and indexing efforts are fundamental to visual information systems irrespective of the application. The third category consists of the development of new image and video compression techniques. To keep the scope of this paper within manageable limits, only the first two categories are discussed further in this work.

2. Manipulation and Visual Compositing
Manipulation and visual compositing means combining one or more images or videos for the purpose of creating new visual presentations. The basic manipulation and compositing operations are translation, scaling, overlaying (opaque or semitransparent), and filtering operations of various kinds. These operations have traditionally been performed in the spatial domain. However, there is an increasing realization that manipulation and visual compositing, if done on compressed visual data, can offer two important advantages. First, the data rate in the compressed domain is much lower than in the uncompressed domain, and thus the computational cost can be reduced. Second, processing on compressed data avoids the decompression and recompression steps, since visual information is generally stored and transmitted in compressed form. Methods for manipulation and visual compositing using compressed data make use of several interesting properties of the Discrete Cosine Transform (DCT), which is the basis of spatial redundancy removal in compression standards such as JPEG, MPEG, and H.261. In one of the early efforts on processing of compressed visual data, Smith and Rowe [4] developed a set of algorithms that showed computational speedups by a factor of 50 or more over the corresponding processing of the uncompressed data. However, these algorithms were limited to local manipulations and compositions, that is, operations such as contrast manipulation or image subtraction where a pixel value in the output image at position p depends solely on the pixel value at the same position p in the input image. Recently, Smith [5] has extended this work to global manipulations and visual compositions, that is, operations such as low-pass and high-pass filtering where the value of a pixel in the output image is an arbitrary linear combination of pixels in the input image.
Through this kind of processing, he has shown that it is possible to process an MPEG video stream at near real-time rates in software on present-day workstations. In related work, Chang and Messerschmitt [6-7] have developed a complete set of algorithms to manipulate images directly in the compressed domain. Some of the interesting algorithms developed by them include translation of images by arbitrary amounts, rotation, and shearing. These algorithms can be applied to general orthogonal transforms such as the Discrete Fourier Transform (DFT) and the orthogonal Discrete Wavelet Transform (DWT). In addition to the above works, efforts at manipulating compressed data are now ongoing at several places. However, the focus of these efforts is shifting towards feature extraction and change detection. It is expected that these efforts will only grow with the availability of MPEG video capture boards. It is also expected that the advantages of manipulating compressed data will ultimately lead to compression algorithms that are not only good at compression but also provide efficient means of data manipulation in compressed form. From this perspective, wavelet transform-based compression schemes offer great promise.
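The reason local operations such as contrast scaling and image subtraction carry over to the compressed domain is that the DCT is a linear, orthonormal transform. The following minimal sketch illustrates only that property on a single 8x8 block; it is not the algorithm of [4], and it ignores quantization and entropy coding, so it applies to dequantized coefficients. All function names and the test blocks are hypothetical and chosen for illustration.

```python
import math

N = 8  # block size used by JPEG- and MPEG-style coders


def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix C (rows are basis vectors)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


C = dct_matrix()
CT = transpose(C)


def dct2(block):
    """2-D DCT of a block: C * block * C^T."""
    return matmul(matmul(C, block), CT)


def idct2(coeffs):
    """Inverse 2-D DCT: C^T * coeffs * C (valid because C is orthonormal)."""
    return matmul(matmul(CT, coeffs), C)


def combine(a, b, alpha):
    """Element-wise alpha * a - b: a contrast scaling followed by a subtraction."""
    return [[alpha * p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Two arbitrary (hypothetical) 8x8 pixel blocks.
x = [[(7 * r + 3 * c) % 256 for c in range(N)] for r in range(N)]
y = [[(5 * r * c) % 256 for c in range(N)] for r in range(N)]
alpha = 1.5

pixel_domain = dct2(combine(x, y, alpha))             # operate on pixels, then transform
compressed_domain = combine(dct2(x), dct2(y), alpha)  # operate directly on coefficients

# The two results agree up to floating-point rounding error.
print(max_abs_diff(pixel_domain, compressed_domain))
```

Because the same element-wise rule gives the same answer on either side of the transform, a local operation can be applied to the coefficients without ever reconstructing the pixels. Global operations such as filtering remain linear as well, but they mix coefficients within and across blocks, which is why they need the more elaborate treatment described above.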

3. Content Characterization and Indexing
This is the most active area of current image processing and computer vision efforts driven by the growth of visual information systems. Since manual characterization of the information content of images and videos is almost impossible, owing to the sheer amount of effort involved and the subjective bias it introduces, there is a very clear need for automatic or semi-automatic methods to generate content descriptors for images and videos. Another factor against manual characterization is that, to perform it, either a judgement must be made on the nature of all possible future queries or all future queries must be limited to those envisioned at the characterization step. This obviously restricts the use of archived images and videos.
Two kinds of content-based retrieval methods have been proposed: (1) similarity retrieval and (2) semantic retrieval. In similarity retrieval, the user presents an example image in order to retrieve a small set of images that show strong similarity with the example in terms of one or more image characteristics, e.g. shape, texture, and color. This kind of retrieval scheme is also known as retrieval through iconic or pictorial queries. In semantic retrieval, images are retrieved on the basis of the occurrence of certain objects or events. This kind of retrieval implies good pattern recognition or image understanding capabilities on the part of the visual information system. Since current pattern recognition and image understanding methods have only limited capabilities in narrow domains, semantic retrieval of images and videos is possible only in restricted application domains.
Most of the content characterization efforts are aimed at videos, where several research groups have developed techniques for shot isolation [8-10]. While earlier efforts used uncompressed video, there appears to be a shift towards using compressed video for this task, with apparently good reason. One example of this kind of effort is the recent statistical shot isolation approach [11], which looks at the histogram of every I frame in an MPEG video and compares two successive histograms by the chi-square test to determine shot transitions of both types: camera breaks and optical cuts. The characterization of shots in terms of the underlying camera motion has also received considerable attention. There is also some very interesting work on using image-to-image motion information for panoramic view construction as well as for background scene representation to characterize video clips [12].
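The idea of comparing successive histograms with a chi-square statistic can be sketched as follows. This is only an illustration under stated assumptions, not the method of [11]: frames are assumed to be already available as flat lists of 8-bit gray levels (rather than being read from MPEG I frames), the symmetric form of the chi-square statistic is one common choice among several, and the bin count and threshold are hypothetical values that would need tuning on real video.

```python
def histogram(frame, bins=16, max_value=256):
    """Gray-level histogram of a frame given as a flat list of pixel values."""
    counts = [0] * bins
    width = max_value // bins
    for pixel in frame:
        counts[min(pixel // width, bins - 1)] += 1
    return counts


def chi_square(h1, h2):
    """Symmetric chi-square distance between two histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total


def detect_shot_boundaries(frames, threshold):
    """Indices i at which frame i differs strongly from frame i - 1."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if chi_square(hists[i - 1], hists[i]) > threshold]


# Hypothetical toy sequence: two dark frames followed by two bright frames.
dark_a = [10 + (i % 5) for i in range(100)]
dark_b = [12 + (i % 3) for i in range(100)]
bright_a = [200 + (i % 5) for i in range(100)]
bright_b = [202 + (i % 3) for i in range(100)]

boundaries = detect_shot_boundaries([dark_a, dark_b, bright_a, bright_b],
                                    threshold=50.0)
print(boundaries)  # → [2]
```

A single fixed threshold of this kind responds to abrupt changes; gradual transitions spread the change over many frames, so they would typically need comparison over a longer window or a different decision rule.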
In terms of content representation, present efforts are mainly centered on the use of very low-level image features. One major difference between image processing for visual information systems and image processing for automation and perception is the reliance on color information for indexing and retrieval. The use of shape and texture features is currently gaining ground. Overall, the impression is that we must move to high-level image features to obtain true semantic retrieval. This will require domain knowledge models for image characterization [13]. Another interesting approach to characterizing and indexing content is to use the features present in the compressed data itself. Following this approach, the use of vector quantization (VQ) has recently been explored for indexing and retrieval with good success [14]. These kinds of approaches are likely to play a significant role, provided that semantics can be attached to the features in the compressed data.
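One way features in VQ-compressed data can serve as an index is to describe each image by how often it uses each codeword. The sketch below is a generic illustration of that idea and is not claimed to be the scheme of [14]: the codebook, the block vectors, the histogram-intersection similarity, and all function names are assumptions made for the example.

```python
def nearest_codeword(vector, codebook):
    """Index of the codeword closest to the vector in squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def vq_signature(blocks, codebook):
    """Normalized histogram of codeword usage over the blocks of one image."""
    if not blocks:
        raise ValueError("an image must contain at least one block")
    counts = [0] * len(codebook)
    for block in blocks:
        counts[nearest_codeword(block, codebook)] += 1
    return [c / len(blocks) for c in counts]


def intersection(sig_a, sig_b):
    """Histogram intersection: 1.0 for identical signatures, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))


def rank(query_sig, database):
    """Image names ordered from most to least similar to the query signature."""
    return sorted(database,
                  key=lambda name: intersection(query_sig, database[name]),
                  reverse=True)


# Hypothetical 3-entry codebook for 2x2 gray-level blocks (dark, mid, bright).
codebook = [(0, 0, 0, 0), (128, 128, 128, 128), (255, 255, 255, 255)]

database = {
    "A": vq_signature([(5, 5, 5, 5)] * 3 + [(250, 250, 250, 250)], codebook),
    "B": vq_signature([(250, 250, 250, 250)] * 3 + [(130, 130, 130, 130)], codebook),
}
query = vq_signature([(10, 10, 10, 10)] * 4, codebook)

print(rank(query, database))  # → ['A', 'B']
```

In a VQ-compressed archive the codeword indices already exist in the stored data, so a signature of this kind could be computed without decompressing the image. As the text notes, such a signature captures low-level appearance only and carries no semantics by itself.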

4. Conclusion
Being mainly a service technology, the field of image processing and computer vision has always been driven by powerful applications. In the sixties, the driving application was space exploration; this changed to automation in the seventies, followed by autonomous land navigation in the eighties. Now it is the turn of visual information systems. Compared to previous driving forces, the current one has some different characteristics. First, it is interactive by nature. This means that there is scope as well as a need to put an operator in the loop. Second, it is the qualitative characterization of content that is generally needed. It is therefore good enough to determine whether the camera is panning or tracking, rather than to compute its actual motion parameters. Finally, all the image analysis built into a visual information system has to be of a general rather than a specific nature in order to provide answers to a broad range of user queries. I believe that these characteristics make the challenges posed by visual information systems interesting and will require some paradigm shifts on the part of image processing and computer vision researchers.

5. References
1. W.I. Grosky, "Multimedia Information Systems," IEEE Multimedia, Vol. 1, pp. 12-24, Spring 1994.
2. Proceedings of the ACM Multimedia 94, San Francisco, California, 1994.
3. Proceedings of the IS&T/SPIE Symposium on Multimedia Computing and Networking, San Jose, California, 1995.
4. B.C. Smith and L.A. Rowe, "Algorithms for Manipulating Compressed Images," IEEE Computer Graphics & Applications, Vol. 13, pp. 34-42, 1993.
5. B.C. Smith, "Fast Software Processing of Motion JPEG Video," Proc. ACM Multimedia 94, pp. 77-88, San Francisco, California, 1994.
6. Shih-Fu Chang and D.G. Messerschmitt, "Video Compositing in the DCT Domain," IEEE Workshop on Visual Signal Proc. & Comm., pp. 138-143, Raleigh, NC, 1992.
7. Shih-Fu Chang and D.G. Messerschmitt, "A New Approach to Decoding and Compositing Motion Compensated DCT-Based Images," IEEE Int'l Conf. ASSP, pp. V421-V424, Minneapolis, MN, 1993.
8. S. Smoliar and H. Zhang, "Content-Based Video Indexing and Retrieval," IEEE Multimedia, Vol. 1, pp. 62-72, 1994.
9. P. Aigrain and P. Joly, "The Automatic Real-Time Analysis of Film Editing and Transition Effects and Its Applications," Computers and Graphics, Vol. 18, pp. 93-103, 1994.
10. A. Hampapur, R. Jain and T. Weymouth, "Digital Video Segmentation," Proc. ACM Multimedia 94, pp. 357-364, San Francisco, California, 1994.
11. I.K. Sethi and N. Patel, "A Statistical Approach to Scene Change Detection," Proc. IS&T/SPIE Symposium on Storage and Retrieval for Image and Video Databases III, San Jose, CA, 1995.
12. A. Akutsu and Y. Tonomura, "Video Tomography: An Efficient Method for Camerawork Extraction and Motion Analysis," Proc. ACM Multimedia 94, pp. 349-356, San Francisco, California, 1994.
13. D. Swanberg, C.F. Shu and R. Jain, "Knowledge Guided Parsing in Video Databases," Proc. IS&T/SPIE Symposium on Storage and Retrieval for Image and Video Databases, San Jose, CA, 1993.
14. F.M. Idris and S. Panchanathan, "Image Indexing Using Vector Quantization," Proc. IS&T/SPIE Symposium on Storage and Retrieval for Image and Video Databases III, San Jose, CA, 1995.