SUMMARIZING WEARABLE VIDEO

Kiyoharu Aizawa, Ken-Ichiro Ishijima, Makoto Shiina
Dept. of Elec. Eng., University of Tokyo
7-3-1 Hongo, Bunkyo, Tokyo, 113-8656
[email protected]
tel: +81-3-5841-6651 fax: +81-3-5841-6693

ABSTRACT

"We want to record our entire life by video" is the motivation of this research. Rapidly developing wearable devices and huge storage devices will make it possible to keep an entire life on video. We could capture 70 years of our life; the problem, however, is how to handle such huge data. Automatic summarization based on personal interest is therefore required. In this paper we propose an approach to the automatic structuring and summarization of wearable video. (Wearable video is our abbreviation of "video captured by a wearable camera".) In our approach, we make use of a wearable camera and a brain wave sensor. The video is first structured using objective visual features, and the shots are then rated by a subjective measure based on brain waves. The approach was very successful in real-world experiments, automatically extracting all the events that the subjects reported they had found interesting.

1. INTRODUCTION

Personal experiences are usually maintained by such media as diaries, pictures and movies. For example, we record our daily experiences in a diary, and we use photos and movies to keep special events such as travels. However, these existing media keep only a small part of our life. Even when we use photos and movies, we often miss the best moments because the camera is not always ready.

Significant progress is being made in digital imaging technologies: cameras, displays, compression, storage, etc. Miniaturization is also advancing, so that computational and imaging devices are becoming as intimate as wearable computers and wearable cameras [1, 2]. These small wearable devices will provide fully personal information gathering and processing environments. We believe that, by using such wearable devices, an entire personal life will be able to be imaged and reproduced as video. We will then never miss the moments that we want to keep forever.


Such long-term imaging requires automatic summarization. Imagine we have a year-long or much longer video recording: how do we handle such a huge volume of data in order to watch the scenes that we found interesting when they were captured? Manual handling is out of the question, because manual operation takes longer than the recording itself. We would like to extract the scenes during which we felt something; summarization should therefore be based on our subjective sensation.

In previous work, automatic summarization has been applied to broadcast TV programs (e.g. [3]). That summarization was based on objective visual and audio features of the video. The methodology for summarizing wearable video has to differ, because it needs to take the person's subjective feeling into account. In this paper, we first discuss the potential and feasibility of life-long imaging, and then propose an approach to automatic summarization that segments the video based on objective visual features and evaluates the shots using brain waves. Physiological signals have also been used with wearable cameras before: [4] used skin conductivity and heart rate to turn a wearable camera on and off. We use brain waves, which clearly show whether the person is paying attention or not.

2. FEASIBILITY AND POTENTIAL OF LIFE-LONG IMAGING

Imagine we wore a single camera and constantly recorded what we see: how large would the amount of images captured over 70 years be? (The captured video would first be stored on the wearable device and then occasionally moved to a large storage system.) The video quality depends on the compression. Assuming 16 hours per day are captured for 70 years, the amounts of video data are listed below.

quality                  rate       data size for 70 years
TV phone quality         64 kbps    11 Tbytes
VCR quality              1 Mbps     183 Tbytes
Broadcasting quality     4 Mbps     736 Tbytes

Let us look at TV phone quality: it needs only 11 Tbytes for recording 70 years. Even today, a lunch-box-sized 10 GB HDD is available for less than $100; if we had a thousand of them, their capacity would be almost enough for 70 years! HDD capacity is improving very quickly, and in the not too distant future it will be feasible to hold 70 years of video on a single lunch-box-sized HDD. As for sensing devices, CCD and CMOS cameras are getting smaller too. A glass-type wearable CCD camera has already been introduced to the market (the glass-type CCD camera used in our experiments is shown in fig. 1). The progress of wearable computers will drive imaging devices to become smaller still.
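As a rough check of these figures (a sketch only: the table's values are rounded, and the exact result depends on whether leap days and decimal or binary terabytes are used):

```python
# Storage needed for 70 years of video at 16 hours/day, for each bitrate.
SECONDS_PER_DAY = 16 * 3600
DAYS = 70 * 365                      # leap days ignored

def storage_tbytes(bitrate_bps):
    """Total storage in terabytes (10**12 bytes)."""
    total_bits = bitrate_bps * SECONDS_PER_DAY * DAYS
    return total_bits / 8 / 1e12

for label, rate in [("TV phone quality (64 kbps)", 64e3),
                    ("VCR quality (1 Mbps)", 1e6),
                    ("Broadcasting quality (4 Mbps)", 4e6)]:
    print(f"{label}: about {storage_tbytes(rate):.0f} Tbytes")

# The TV-phone figure corresponds to on the order of a thousand of the
# 10 GB disks mentioned above.
print("10 GB drives needed:", round(storage_tbytes(64e3) * 1e12 / 10e9))
```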

From the hardware point of view, then, in terms of both storage and sensing, we believe that in the not too remote future life-long video will be able to be captured and maintained in a personal environment. The potential advantages and disadvantages are listed below.

Advantages:
- We can catch the best moments that we want to keep forever.
- We can vividly reproduce and recollect our experiences as video.
- We can remember what we have almost forgotten.
- We can see what we did not see.
- We can prove what we did not do.

Disadvantages:
- We may see what we do not want to remember.
- We may violate the privacy of other people.

3. SUMMARIZATION OF WEARABLE VIDEO

As described in the previous sections, constant recording will be feasible from the hardware point of view. The most difficult issue is how to extract and reproduce what we want to see. The data size is huge, so automatic summarization is the most critical problem. Almost all previous work deals with TV programs and motion films [3, 5], making use of objective visual and audio features of the video. In our wearable application, however, we also have to take into account the degree of subjective interest the person felt at the very moment the scene was captured; the video should therefore also be rated by a subjective measure. In order to measure the person's interest, we use a physiological signal: brain waves, captured simultaneously with the video. The sensor is wearable as well. As shown in fig. 2, the sensor we use is the size of a head band. (It is still noticeable, but it could be made much smaller.) The data from the sensor are recorded on one of the audio channels of the VCR of the wearable camera. We utilize the α and β waves of the brain waves.

Fig. 1. A glass-type CCD camera. The CCD and an audio microphone are embedded.

Fig. 2. A sensor to capture brain waves.

Our approach to automatic summarization thus comprises two major steps: in the first step, the video stream is analyzed and segmented based on objective visual features; in the next step, the video segments are evaluated based on the subjective signals. The whole framework is shown in fig. 3. First, the video stream is segmented into shots using motion information, and the shots are integrated into scenes using color information. The scenes are then evaluated based on the status of the α or β waves, and the summary is finally produced from the rated segments.

Fig. 3. The framework of summarization of wearable video.
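As a purely illustrative sketch of this two-step framework (the function names, motion labels, and threshold below are ours, not the paper's, and the color-based scene grouping is omitted):

```python
from dataclasses import dataclass

@dataclass
class Shot:
    start: int             # first frame index
    end: int                # last frame index, inclusive
    interest: float = 0.0   # filled in later from the brain wave rating (section 5)

def segment_into_shots(motion_labels):
    """Step 1: cut the stream wherever the per-frame global motion label
    ('still', 'forward', 'turn_left', 'turn_right') changes."""
    shots, start = [], 0
    for i in range(1, len(motion_labels) + 1):
        if i == len(motion_labels) or motion_labels[i] != motion_labels[start]:
            shots.append(Shot(start, i - 1))
            start = i
    return shots

def summarize(shots, min_interest):
    """Step 2 (after rating): keep only the shots whose interest is high enough."""
    return [s for s in shots if s.interest >= min_interest]

# Toy run: four frames of walking forward, then six frames standing still.
labels = ["forward"] * 4 + ["still"] * 6
print(segment_into_shots(labels))   # two shots: frames 0-3 and frames 4-9
```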


4. VIDEO SEGMENTATION INTO SHOTS

Video segmentation using visual features differs considerably from the ordinary segmentation applied to TV programs, because the video stream captured by a wearable camera is continuous and has no explicit shot changes. However, the motion information reflects the user's movement through the environment, and the color information reflects the environment itself. Thus, we first estimate the global motion of the images, which shows the person's movement: whether he stands still, moves forward/backward, or turns left/right. The global motion model comprises zoom Z, pan H and tilt V. These are estimated from the motion vectors (u, v) determined for the blocks of the image by block matching. The estimation is done in two steps: first, the global motion parameters are estimated by least squares from all the motion vectors, using the model of eq. (1); second, the motion vectors that differ too much from those implied by the estimated parameters are excluded, and the parameters are re-estimated from the remaining vectors. The estimated parameters are also smoothed by median filtering over 50 frames, because their variation is large.

\[
\begin{pmatrix} u \\ v \end{pmatrix}
= Z \begin{pmatrix} x - x_0 \\ y - y_0 \end{pmatrix}
+ \begin{pmatrix} H \\ V \end{pmatrix}
\tag{1}
\]
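A sketch of this two-pass least-squares fit of eq. (1) (our illustration: the block matching is assumed to have produced the motion vectors already, and the outlier rejection radius is a placeholder value, not the paper's):

```python
import numpy as np

def fit_global_motion(points, vectors, center):
    """Least-squares fit of eq. (1): (u, v) = Z*(x - x0, y - y0) + (H, V).

    points  : (N, 2) array of block centers (x, y)
    vectors : (N, 2) array of block motion vectors (u, v)
    center  : (x0, y0)
    Returns the array [Z, H, V].
    """
    dx = points[:, 0] - center[0]
    dy = points[:, 1] - center[1]
    # Interleave the u- and v-equations into one linear system in (Z, H, V).
    A = np.zeros((2 * len(points), 3))
    A[0::2, 0], A[0::2, 1] = dx, 1.0   # u = Z*dx + H
    A[1::2, 0], A[1::2, 2] = dy, 1.0   # v = Z*dy + V
    b = vectors.reshape(-1)            # [u0, v0, u1, v1, ...]
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params

def robust_global_motion(points, vectors, center, reject_px=2.0):
    """Two-pass estimate: fit, drop vectors that deviate too much from the
    predicted global motion, then refit with the remaining vectors."""
    Z, H, V = fit_global_motion(points, vectors, center)
    pred = np.stack([Z * (points[:, 0] - center[0]) + H,
                     Z * (points[:, 1] - center[1]) + V], axis=1)
    inliers = np.linalg.norm(vectors - pred, axis=1) < reject_px
    return fit_global_motion(points[inliers], vectors[inliers], center)

# The per-frame (Z, H, V) estimates would then be median-filtered over about
# 50 frames, as described above, to suppress frame-to-frame variation.
```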

Fig. 4. Estimation of the zoom parameter: changes of 1/Z are plotted.

The zoom factor changes as shown in fig. 4. Note that 1/Z is plotted, so the peaks show when the person stands almost still. The pan and tilt parameters show the horizontal and vertical changes of the direction of the person's head. Because the person often looks up and down, the tilt parameter changes too frequently; we therefore use the zoom and pan parameters to segment the video into shots. An example of shot segmentation is shown in fig. 6, where the white, gray, striped and black areas correspond to 'still', 'move forward', 'turn right' and 'turn left', respectively.

Fig. 5. Estimation of the pan parameter.

Fig. 6. Segmentation results: the video was captured by a wearable camera while the person walked around the university.

The shots are finally grouped into scenes using color information; shot grouping is not described here. In the next section, the shots are rated by the status of the brain waves. The use of additional signals such as audio is under investigation.

5. SUBJECTIVE RATING OF THE SHOTS: USE OF BRAIN WAVES

The goal of this research is to extract the video shots that are supposed to have attracted much of the person's interest when he saw the scene. Brain waves work very effectively for this. According to the psychological literature, the α wave attenuates or the β wave shows continuous activation when the person feels awake, interested or excited; the status of the α and β waves is therefore a good index of the arousal of the subject. In our experiments, the power of the α or β waves was simply thresholded. The duration of the attenuation or activation was also taken into account in order to reduce artifacts. The system we developed is shown in fig. 7. It shows the video and the brain waves on the screen: the top left is the image window, and the plots show the brain waves. The bottom plot is the α wave (7.5 Hz) and the others are β waves (15 Hz, 22.5 Hz and 30 Hz). Fig. 7 shows the moment when the person entered the paddock at the race track. The α wave (bottom) shows a quick attenuation (a change from high to low), and it is clearly visible that the subject paid high attention to the scene.
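As a rough sketch of this thresholding (our illustration, not the authors' implementation: the sampling rate, band edges, threshold and minimum-duration values are placeholders), the recorded signal could be reduced to per-second band powers and then gated by a duration-aware threshold:

```python
import numpy as np

def band_power(window, fs, f_lo, f_hi):
    """Mean spectral power of one signal window within [f_lo, f_hi] Hz."""
    spectrum = np.abs(np.fft.rfft(window)) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return spectrum[mask].mean()

def attention_mask(alpha_power, threshold, min_len):
    """True where alpha power stays below `threshold` for at least `min_len`
    consecutive samples; shorter dips are ignored as artifacts."""
    below = np.asarray(alpha_power) < threshold
    keep = np.zeros(len(below), dtype=bool)
    start = None
    for i, flag in enumerate(below):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                keep[start:i] = True
            start = None
    if start is not None and len(below) - start >= min_len:
        keep[start:] = True
    return keep

# Example with placeholder values: a 256 Hz signal, alpha taken as 7.5-12 Hz,
# and attention declared only when alpha stays low for at least 3 seconds.
fs = 256
signal = np.random.randn(60 * fs)          # stand-in for one minute of EEG
windows = signal.reshape(-1, fs)           # one-second windows
alpha = np.array([band_power(w, fs, 7.5, 12.0) for w in windows])
interesting = attention_mask(alpha, threshold=np.median(alpha), min_len=3)
```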



Fig. 7. The system showing the video and the brain waves.

Fig. 8. The shots rated by the α wave: darker means more interest, whiter means less interest.

In the experiments, the subjects wore the camera (fig. 1) and the simple brain wave sensor (fig. 2). They walked around the university and visited an amusement park, a horse race track, etc., and reported what they had found interesting or impressive during the experiments.

In the first experiment, groups of frames were extracted using only the brain waves. As one example from the amusement park, the original video stream was about an hour long. On the basis of the α wave, the β wave, or both, the numbers of extracted groups of frames were 347, 217 and 334, respectively. The summarized video was about 8 minutes long, and the scenes that the subjects had reported as interesting were all extracted.

In the second experiment, the video stream was first segmented into shots based on the global motion, as described earlier. The α wave was then averaged within each shot, and the shots were rated on five levels according to the averaged α value, which reflects the degree of interest the person felt; a smaller value means more interest. The result for the race track experiment is shown in fig. 8, which displays the shots with the five levels of the subject's interest: darker means more interest, whiter means less interest. In fig. 8 the lengths of the shots are normalized so that they have the same width. Roughly speaking, a darker region appears four times in fig. 8: the first is when the experiment started and the subject was uncomfortable (he gradually got accustomed to the worn devices); the second is when he bought a ticket; the third is when he was in the paddock; and the fourth is when the races started. The evaluation appears to reflect the subjective status well. The shots rated level three or higher were extracted and are displayed in fig. 9: among the 175 shots of the hour-long video, only 35 shots were extracted.
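A minimal sketch of this per-shot rating (our reading of the description; the paper does not give the exact level boundaries, so the rank-based quantization below is an assumption):

```python
import numpy as np

def rate_shots_by_alpha(shot_bounds, alpha_power, levels=5):
    """Average the alpha power within each shot and map the averages to
    interest levels 1..levels, where a smaller average alpha means more
    interest (and hence a higher level)."""
    alpha_power = np.asarray(alpha_power, dtype=float)
    means = np.array([alpha_power[s:e + 1].mean() for s, e in shot_bounds])
    order = means.argsort()              # most interesting (smallest alpha) first
    ratings = np.empty(len(means), dtype=int)
    ratings[order] = levels - (np.arange(len(means)) * levels // len(means))
    return ratings

def select_summary(shot_bounds, ratings, min_level=3):
    """Keep the shots rated `min_level` or higher, as in fig. 9."""
    return [sb for sb, r in zip(shot_bounds, ratings) if r >= min_level]

# Toy example: three shots over a 30-sample alpha trace.
alpha = np.concatenate([np.full(10, 5.0), np.full(10, 1.0), np.full(10, 3.0)])
shots = [(0, 9), (10, 19), (20, 29)]
levels = rate_shots_by_alpha(shots, alpha)
print(levels, select_summary(shots, levels))
```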


Fig. 9. The shots extracted: level three or higher

6. CONCLUSION

In this paper, we proposed a novel approach to summarizing video captured by a wearable camera. We discussed the potential and feasibility of life-long wearable imaging. Our summarization proceeds in two steps: in the first step, the video stream is segmented using objective visual features such as motion; in the second step, the shots are evaluated using physiological signals, namely brain waves. In the experiments, the proposed method worked well on wearable video.

7. REFERENCES

[1] S. Mann, "WearCam (The Wearable Camera)," Proc. ISWC, pp. 124-131, 1998.
[2] "The PC goes ready-to-wear," IEEE Spectrum, pp. 34-39, Oct. 2000.
[3] M. A. Smith and T. Kanade, "Video skimming and characterization through the combination of image and language understanding techniques," Proc. CVPR, pp. 775-781, 1997.
[4] J. Healey and R. W. Picard, "StartleCam: A cybernetic wearable camera," Proc. ISWC, pp. 42-49, 1998.
[5] A. W. M. Smeulders et al., "Content-based image retrieval at the end of the early years," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 22, No. 12, pp. 1349-1380, Dec. 2000.