Photo Stream Alignment for Collaborative Photo Collection and Sharing in Social Media

Jianchao Yang, University of Illinois at Urbana-Champaign, Urbana, IL, USA ([email protected])
Jiebo Luo and Jie Yu, Kodak Research Labs, Rochester, New York, USA ([email protected])
Thomas Huang, University of Illinois at Urbana-Champaign, Urbana, IL, USA ([email protected])

ABSTRACT

With the popularity of digital cameras and camera phones, it is common for different people, who may or may not know each other, to attend the same event and take pictures and videos from different spatial or personal perspectives. Within the realm of social media, it is desirable to enable these people to share their pictures and videos in order to enrich memories and facilitate social networking. However, it is cumbersome to manually manage photos from different cameras, whose clock settings are often not calibrated. In this paper, we propose an automatic algorithm that accurately aligns photo streams, i.e. photo sequences from different photographers covering the same event, in chronological order on a common timeline, while respecting the time constraints within each photo stream. Given the preferred similarity measure (e.g. visual, temporal, and spatial similarities), our algorithm performs photo stream alignment via matching on a sparse representation graph that forces the data connections to be sparse in an explicit fashion. We evaluate our algorithm on real-world personal online albums for thirty-six events and demonstrate its efficacy in automatically facilitating collaborative photo collection and sharing.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval Models; I.4 [Image Processing and Computer Vision]: Feature Measurement, Image Representation

General Terms

Algorithms, Experimentation

Keywords

Sparse representation graph, kernel sparse representation, photo stream alignment, photo sharing, graph matching, collaborative media collection.

1. INTRODUCTION

Today, millions and millions of users worldwide capture images and videos to record various events in their lives. Such image data capturing is done for two important reasons: first, for the participants themselves to relive the events of their lives at later points in time; and second, for them to share these events with friends and family who were not present at the events but are still interested in knowing how the event (e.g. the vacation, the wedding, or the trip) went. When it comes to sharing, it is also common for people who attended the same event to share pictures taken by different people with different viewpoints, timing, or subjects. The importance of sharing is underscored by the billions of image uploads per month on social media sites such as Facebook, Flickr, and Picasa.

Figure 1: Collaborative photo collection and sharing is common in social media websites such as Facebook and Picasa.

In this study, we consider a very common scenario for many photo-worthy events: for example, you and several friends took a trip to Yellowstone National Park, and each of you took your own camera to record the trip. At the end, you collectively ended up with several photo albums, each created by a different camera and composed of hundreds or even thousands of photos. A natural problem arises: how do you share these photo albums among the friends in an effective and organized manner? Such a scenario occurs often for events that involve many people, such as trips, excursions, sports activities, concerts and shows, graduations, weddings, and picnics. Many photo sharing sites now provide functions or apps to facilitate photo sharing. As shown in Figure 1, for example, Picasa now allows people one shares photos with to contribute to one's album, while a Facebook app called Friends Photos searches one's network to present an overview of the friends' photo albums.

Figure 2: Overview of our collaborative photo collection and sharing system.

However, these functions do not automatically align photos of the same event from different contributors. Currently, people must view the individual photo collections separately, and few are willing to invest the time to augment their own collections with photos taken by others. Simply merging all the photos from different albums into one super collection is clearly not a solution: because different albums use different photo naming conventions, putting the photos together results either in ordering them into disjoint groups bearing no semantic meaning or, worse, in disorder due to naming conflicts. Nor can one merge these photos based on their timestamps. Within each photo collection from one camera, the photos can be arranged in chronological order based on their timestamps, which forms the photo stream. However, since people rarely bother to calibrate their camera clocks before taking photos, the timestamps from different cameras are typically out of sync and unreliable for aligning the photos in different streams. Camera clocks can be offset by minutes, hours, or even days when people travel through different time zones. In fact, this was the case for all the real-world photo collections gathered for the experiments in this work. Therefore, it is desirable to develop an automatic algorithm that helps different people, who may or may not know each other but attend the same event and take photos from different spatial or personal perspectives, to share their pictures and videos effectively, especially in online albums, in order to enrich memories and promote social networking.

In recent years, the explosion of consumer digital photos has drawn growing research interest in the organization and sharing of community photo collections, owing to the popularity of web-based user-centric multimedia social networks such as Facebook, Flickr, Picasa, and YouTube. While many efforts have been devoted to photo organization [12], annotation [6], [11], [18], [19], [8], summarization [14], [5], browsing [9], [15], and search [10], [13], little has been done to relate media collections that are about the same event for effective sharing. As a practical need we encounter every day, intelligent photo collection and sharing is on many people's wish lists.

This paper addresses the problem raised above and develops an automatic algorithm to facilitate the sharing of multiple photo albums captured for the same event. Figure 2 illustrates our system, where multiple albums are aligned in the chronological order of the event to create a master stream, which captures the integrity of the whole event for sharing among different users. Our algorithm relies on a kernel sparse representation graph constructed by explicitly sparsifying the photo connections using ℓ1-norm minimization, generalizing the ℓ1-graph [3], [4] to the kernel space for broader applicability. In consumer albums, most photos from different cameras are visually uncorrelated, but some of them overlap in content and thus form the basis for aligning different photo streams. Therefore, the visual correlation links between different photo streams are sparse. As we will see, by explicitly accounting for this sparseness in the graph construction via ℓ1-norm minimization, our matching algorithm is more robust than conventional methods.

The remainder of the paper is organized as follows. Section 2 describes the kernel sparse representation graph, given the preferred photo similarity measures. Tailored to our problem, Section 3 introduces a specific sparse bipartite graph for robust alignment of multiple photo streams. In Section 4 we report experimental results on a total of 36 real-world photo datasets, each of which involves two or more cameras and was collected from Picasa Web Albums. Finally, Section 5 concludes the paper with future work.

2. KERNEL SPARSE GRAPH CONSTRUCTION

In many computer vision and machine learning tasks, finding the correct data relationship, typically represented as a graph, is essential to the success of the algorithm. Due to the limitations of existing similarity measures, sparse graphs usually offer certain advantages: they reduce spurious connections between data points and thus tend to exhibit high robustness [20]. Recently, Cheng et al. [3] proposed a new graph construction method via ℓ1-norm minimization, where graph connections are established based on the sparse representation coefficients of the current datum in terms of the remaining data points. Robust to noise and adaptive in neighbor selection, the ℓ1-graph demonstrates substantial improvements in clustering and subspace learning over conventional graph construction methods such as kNN and ϵ-ball graphs. However, the method in [3] is limited to applications where data can be roughly aligned, e.g. faces and digits. In this section, we generalize the concept of the ℓ1-graph for exploring data relationships in a general kernel space, making the new graph applicable to a much broader range of applications.

2.1 Similarity Measure

To construct the graph, we first define the similarity measure for photos. With the associated meta-data of consumer photos, we represent each photo as {x, g}, where x denotes the image itself and g its geo-location. To keep the notation uncluttered, we simply write x instead of the pair in the following presentation. We define the photo similarity as

$$S(x_i, x_j) = S_v(x_i, x_j) \cdot S_g(x_i, x_j), \qquad (1)$$

where S_v and S_g are the visual and geo-location similarities between photos x_i and x_j, respectively. Other information, e.g. photo tags of online albums, can also be incorporated if available. Visual similarity S_v is the most important cue for our task. In this paper, we choose the following three visual features to compute the visual similarity, owing to their simplicity and effectiveness:

1. Color Histogram, an evidently important cue for consumer photos;
2. GIST [16], a simple and popular feature that captures the global visual shape;
3. LLC [17], a state-of-the-art appearance feature for image classification.

We concatenate these features with equal weights, normalize the result to unit length, and simply use inner products as our visual similarity. Photos taken close in location are probably about the same content. Given the geo-locations g_i and g_j for photos x_i and x_j, a geo-location similarity can be defined, e.g., using a Gaussian kernel:

$$S_g(x_i, x_j) = \exp\left(-\frac{\|g_i - g_j\|_2^2}{\sigma}\right). \qquad (2)$$

Finally, we assume that the similarity measure S defines a valid kernel κ(·, ·) with Φ(·) being the implicit feature mapping function, i.e.

$$\kappa(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = S(x_i, x_j). \qquad (3)$$

In this paper, we mainly rely on the visual similarity (geo-location similarity is not used even when GPS information is recorded; this is left for future work).
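To make the measure concrete, here is a minimal Python sketch of Eqns. (1)-(3); this is our illustration, not the authors' code. It assumes the color histogram, GIST, and LLC features have already been extracted and concatenated elsewhere, and all function names are hypothetical.

```python
import numpy as np

def visual_similarity(f_i, f_j):
    """S_v: inner product of unit-normalized, concatenated feature vectors.

    f_i, f_j: 1-D arrays, each the concatenation of the color histogram,
    GIST, and LLC features of a photo (feature extraction assumed done
    elsewhere).
    """
    f_i = f_i / np.linalg.norm(f_i)
    f_j = f_j / np.linalg.norm(f_j)
    return float(f_i @ f_j)

def geo_similarity(g_i, g_j, sigma=1.0):
    """S_g: Gaussian kernel on geo-locations, Eqn. (2)."""
    return float(np.exp(-np.sum((g_i - g_j) ** 2) / sigma))

def photo_similarity(f_i, g_i, f_j, g_j, sigma=1.0):
    """Combined similarity S = S_v * S_g, Eqn. (1).

    By Eqn. (3) this also serves directly as the kernel value
    kappa(x_i, x_j).
    """
    return visual_similarity(f_i, f_j) * geo_similarity(g_i, g_j, sigma)
```

Note that with unit-normalized features, S(x, x) = 1, which is consistent with the constant 1 appearing in the kernelized constraint of Eqn. (6) below.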

2.2 Graph Construction

The basic question of graph construction is, given one datum x_t, how to connect it with the other data points {x_i}_{i=1}^n based on some given similarity measure. Graph construction based on sparse representation is formulated as

$$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|\Phi(x_t) - D\alpha\|_2^2 \leq \epsilon, \qquad (4)$$

where D = [Φ(x_1), Φ(x_2), ..., Φ(x_n)] serves as the dictionary to represent Φ(x_t). The connection between x_t and each x_i is determined by the solution α*: if α*(i) = 0, there is no edge between them; otherwise, the edge weight is defined as |α*(i)|. Eqn. (4) is a combinatorial NP-hard problem, whose tightest convex relaxation is ℓ1-norm minimization [2],

$$\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2 \quad \text{s.t.} \quad \|\Phi(x_t) - D\alpha\|_2^2 \leq \epsilon, \qquad (5)$$

where we further add a small ℓ2-norm regularization term to stabilize the solution [21]. In many scenarios, we can easily define a similarity measure between data points even though the explicit feature mapping Φ(·) is unavailable, i.e. we only have the kernel function κ(·, ·). Eqn. (5) can then be solved implicitly in the kernel space by expanding the constraint [7],

$$\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2 \quad \text{s.t.} \quad 1 + \alpha^T \kappa(D, D)\alpha - 2\kappa(x_t, D)\alpha \leq \epsilon, \qquad (6)$$

where κ(D, D) is a matrix with (i, j)-th entry κ(D(:, i), D(:, j)) = S(x_i, x_j), and κ(x_t, D) is a vector with k-th entry κ(x_t, D(:, k)) = S(x_t, x_k). Eqn. (6) can be solved efficiently in a similar way to Eqn. (5).

Co-event photo collections from different cameras usually cover a wide variety of content with some amount of redundancy: most photos are visually uncorrelated, while some overlap in content because different photographers captured correlated content at the same times. Consequently, the visual correlation links between photo streams are sparse with respect to all possible edges between photos. By explicitly incorporating this sparsity constraint into the graph construction process, we can adaptively select the most visually correlated photos for each node in the graph and thus discover the most informative links between different photo streams of the same event. Building on this kernel sparse graph construction procedure, we propose a principled approach for aligning photo streams based on a sparse bipartite graph in Section 3.2.
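Before moving on, we note that Eqn. (6) can be handed to any elastic-net-style solver that operates on Gram matrices. As one concrete possibility, the sketch below runs ISTA (proximal gradient) on a Lagrangian form of Eqn. (6), dropping the constant 1; the sparsity weight `lam`, standing in for the constraint level ϵ, and the solver choice itself are our assumptions, not the authors' implementation.

```python
import numpy as np

def kernel_sparse_code(K, k_t, lam=0.1, beta=0.01, n_iter=500):
    """Kernel sparse coding via ISTA on a Lagrangian form of Eqn. (6):

        min_a  a^T K a - 2 k_t^T a + beta * ||a||_2^2 + lam * ||a||_1

    K    : (n, n) kernel matrix kappa(D, D), entries S(x_i, x_j)
    k_t  : (n,)  kernel vector kappa(x_t, D)
    lam  : sparsity weight (a stand-in for the constraint level eps)

    Returns the sparse coefficient vector alpha.
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    # Step size from the Lipschitz constant of the smooth part.
    L = 2.0 * (np.linalg.eigvalsh(K).max() + beta)
    for _ in range(n_iter):
        grad = 2.0 * (K @ alpha) - 2.0 * k_t + 2.0 * beta * alpha
        z = alpha - grad / L
        # Soft-thresholding: proximal operator of the l1 norm.
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return alpha
```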

3. PHOTO STREAM ALIGNMENT

In this section, we describe our approach for aligning multiple photo streams from different cameras whose time settings are not calibrated. For each pair of photo streams, the alignment is based on matching over a bipartite graph built from the kernel sparse representation graph discussed in the previous section. A max linkage selection procedure is further introduced to let candidate photo links compete, yielding robust matching.

3.1 Problem Statement

Suppose we are given two photo streams X^1 = [x^1_1, x^1_2, ..., x^1_m] and X^2 = [x^2_1, x^2_2, ..., x^2_n] about the same event, with their own camera timestamps T_1 = [t^1_1, t^1_2, ..., t^1_m] and T_2 = [t^2_1, t^2_2, ..., t^2_n], where x^j_i denotes the i-th photo in stream j ∈ {1, 2} and t^j_i its camera timestamp. In most cases, we can assume that the relative time within T_1 and within T_2 is correct, but the relative time shift between T_1 and T_2 is unknown. Our goal is to estimate the correct time shift ∆T between the two time sequences. To make accurate photo stream alignment possible, we make the following assumption:

Assumption 1. The photo streams to be aligned contain a certain amount of temporal-visual correlations.

By finding such temporal-visual correlations between photos from different streams, though in many cases sparse, we can align the two photo streams in chronological order to describe the complete event. Although there is only one parameter ∆T to infer, robust and accurate alignment turns out to be nontrivial for the following reasons:

1. Limited effectiveness of the visual features: semantically similar photos may be distant under the visual similarity, and vice versa. For example, in Figure 3, the left two photos have low visual similarity, yet they relate to the same moment and scene of the event.

2. Photos are not taken deliberately to facilitate alignment: different photographers may capture largely different contents. This is very common, since different photographers have different spatial and personal perspectives on the same event.

3. Misleading data may exist: similar scenes may be captured at different times. For example, in Figure 3, the right two photos are about the same scene, but they were taken at different times by the two photographers.

As such, consumer photo streams are extremely noisy for accurate alignment, and decisions made on an isolated pair of images can be incorrect without the proper context of the corresponding photo streams. In fact, as discussed in our experiments in Section 4, heuristic approaches are not reliable and often run into contradictions. We therefore propose a principled approach for robust and accurate alignment by matching each pair of photo streams on a sparse bipartite graph constructed via kernel sparse representation.

Figure 3: The left two photos are visually distant, but semantically they are about the same scene (from the Horse Trail dataset). The right two photos are about the same scene, but were taken at different times (from the Lijiang dataset).

3.2 Sparse Bipartite Graph

Different photo streams for the same event usually share some similar photo contents. If we can build a bipartite graph G = (X^1, X^2, E) linking the informative pairs (the distinctive photo pairs from the two streams that share large visual similarities), then, with Assumption 1, we will be able to find the correct ∆T. Consumer albums typically contain photos diverse in content and appearance, so the informative pairs are few compared with the album sizes. Therefore, the bipartite graph G, which includes the links of informative pairs between photo streams as its edges, should be sparse, i.e. |E| ≪ |X^1| · |X^2|. In this case, only the visual information and possibly GPS information can be used for measuring photo similarities. Based on the basic technique presented in Section 2, Algorithm 1 shows the procedure for constructing the bipartite graph between two photo streams X^1 and X^2, where E^{12} records the directed bipartite graph edges from X^1 to X^2, and E^{21} records the reverse edges. The final affinity matrix simply averages the two directed affinity matrices:

$$E_{ij} = \frac{1}{2}\left(E^{12}_{ij} + E^{21}_{ji}\right). \qquad (7)$$

Averaging the two directed edge weights makes the bipartite graph linkage more distinctive. If E^{12}_{ij} and E^{21}_{ji} are both nonzero, i.e. x^1_i and x^2_j each choose the other as one of its informative neighbors among many candidates, then x^1_i and x^2_j are strongly connected and more likely to be an informative pair desired for the alignment task.

Algorithm 1 Sparse Bipartite Graph Construction
1: Input: photo streams X^1 and X^2, kernel function κ.
2: for each x^1_i ∈ X^1 do
3:   Solve α^1_i = argmin_α ||α||_1 + β||α||_2^2  s.t.  1 + α^T κ(X^2, X^2)α − 2κ(x^1_i, X^2)α ≤ ϵ.   (8)
4:   Assign E^{12}_{ij} = |α^1_i(j)| for j = 1, 2, ..., n.
5: end for
6: for each x^2_j ∈ X^2 do
7:   Solve α^2_j = argmin_α ||α||_1 + β||α||_2^2  s.t.  1 + α^T κ(X^1, X^1)α − 2κ(x^2_j, X^1)α ≤ ϵ.
8:   Assign E^{21}_{ji} = |α^2_j(i)| for i = 1, 2, ..., m.
9: end for
10: Output: sparse bipartite graph affinity matrix E = [E^{12} + (E^{21})^T]/2.
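With such a solver in hand, Algorithm 1 reduces to two loops of kernel sparse coding, one per direction. Below is a minimal sketch reusing the hypothetical `kernel_sparse_code` routine from Section 2.2; parameter values are illustrative.

```python
import numpy as np

def sparse_bipartite_graph(F1, F2, kernel, lam=0.1, beta=0.01):
    """Algorithm 1: sparse bipartite graph between streams X^1 and X^2.

    F1 : (m, d) photo features of stream 1;  F2 : (n, d) of stream 2.
    kernel(A, B) -> (|A|, |B|) matrix of similarities S (Eqn. 3).
    Returns the (m, n) affinity matrix E of Eqn. (7).
    """
    m, n = F1.shape[0], F2.shape[0]
    K11, K22 = kernel(F1, F1), kernel(F2, F2)
    K12 = kernel(F1, F2)                 # row i: kappa(x^1_i, X^2)
    E12 = np.zeros((m, n))
    E21 = np.zeros((n, m))
    for i in range(m):                   # steps 2-5: code x^1_i over X^2
        E12[i] = np.abs(kernel_sparse_code(K22, K12[i], lam, beta))
    for j in range(n):                   # steps 6-9: code x^2_j over X^1
        E21[j] = np.abs(kernel_sparse_code(K11, K12[:, j], lam, beta))
    return 0.5 * (E12 + E21.T)           # step 10, Eqn. (7)
```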

3.3 Max Linkage Selection for Robust Matching

The above sparse bipartite graph construction is based on the similarity measure alone, without respecting the chronological order constraint within each photo stream. These sparse links provide the candidate photo matches critical for alignment; however, due to the limitations of the photo similarity measures, the candidates may be too spurious for precise alignment. We propose a procedure called max linkage selection to prune the candidate matches: if a photo has multiple links to other nodes, we keep only the edge with maximum weight and break the rest. In this way, the remaining matched pairs are more informative for the alignment task, as verified by our experiments. Note that max linkage selection is not equivalent to simply finding the most similar photo in the first place: 1) nearest-neighbor matching has no competing procedure, so max linkage selection is still needed to prune false matches; 2) nearest-neighbor matching leaves open how to assign edge weights, which is essential for robust matching. Denote the set of pruned matched pairs as

$$\mathcal{M} = \left\{ (x^1_i, t^1_i;\, x^2_j, t^2_j) \mid E_{ij} \neq 0 \right\}. \qquad (9)$$

The correct time shift ∆T (in seconds) is found by

$$\Delta T = \arg\max_{\Delta t} \sum_{(i,j) \in \mathcal{M}} E_{ij}\, \delta\left(|t^1_i - t^2_j - \Delta t| \leq \tau\right), \qquad (10)$$

where δ is the indicator function, and τ is a small time displacement tolerance for reliable matching (chosen as 60s in our experiments). Once we have ∆T for each pair of photo streams, we can merge multiple streams into a master photo stream in chronological order for sharing among different users.
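The following sketch implements one plausible reading of max linkage selection together with the vote of Eqn. (10). The paper does not spell out how the candidate shifts ∆t are enumerated; here we assume each matched pair proposes its own timestamp difference as a candidate, which is a natural discretization.

```python
import numpy as np

def max_linkage(E):
    """Keep, per row and then per column of E, only the max-weight edge."""
    pruned = np.zeros_like(E)
    for i in range(E.shape[0]):
        j = np.argmax(E[i])
        if E[i, j] > 0:
            pruned[i, j] = E[i, j]
    # Enforce the same competition along columns.
    for j in range(E.shape[1]):
        i = np.argmax(pruned[:, j])
        keep = pruned[i, j]
        pruned[:, j] = 0.0
        pruned[i, j] = keep
    return pruned

def estimate_time_shift(E, t1, t2, tau=60.0):
    """Eqn. (10): weighted vote over candidate shifts from matched pairs.

    t1, t2 : numpy arrays of per-photo timestamps (seconds).
    """
    E = max_linkage(E)
    ii, jj = np.nonzero(E)           # the pruned pair set M, Eqn. (9)
    candidates = t1[ii] - t2[jj]     # each pair proposes a shift
    best_shift, best_score = 0.0, -np.inf
    for dt in candidates:
        score = sum(E[i, j] for i, j in zip(ii, jj)
                    if abs(t1[i] - t2[j] - dt) <= tau)
        if score > best_score:
            best_shift, best_score = dt, score
    return best_shift
```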

3.4 Multiple Sequence Adjustment

In practice, we usually have more than two photo streams, which provide complementary visual matching information for alignment. Since pair-wise stream matching does not ensure time consistency as a whole, we need to combine the matching results of multiple stream pairs. Suppose we have s streams in total. For each pair of matched photo streams, we have the matched photo pair set M*_pq, 1 ≤ p, q ≤ s, found by Eqn. (10). Let T*_p and T*_q denote the timestamp sequences of the matched photo pair set, and w_pq their matching scores. Our goal is, for a chosen reference timestamp sequence T*_ref, to infer ∆T_p for each timestamp sequence T*_p, so that the multiple photo streams are mapped onto a common time axis. We define the matching error for two sequences as

$$\epsilon_{pq} = h\left(w_{pq}\left(T^*_p + \Delta T_p - T^*_q - \Delta T_q\right)\right), \qquad (11)$$

where h is the Huber function, used to tolerate matching outliers. The consistent time alignments can thus be found by

$$\min_{\{\Delta T_l\}_{l=1}^{s}} \; \sum_{p=1}^{s} \sum_{q \neq p} \epsilon_{pq}. \qquad (12)$$
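Eqns. (11) and (12) amount to a small robust regression over the s stream offsets. The sketch below pins the reference stream's offset at zero and minimizes the Huber objective with a generic SciPy optimizer; the data layout and the solver choice are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=60.0):
    """Huber function h of Eqn. (11), robust to matching outliers."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def adjust_streams(pairs, s, ref=0):
    """Eqn. (12): jointly infer per-stream offsets Delta_T.

    pairs : list of (p, q, Tp, Tq, w), where Tp and Tq are the matched
            timestamp arrays of streams p and q (from Eqn. 9) and w is
            the corresponding match weight(s).
    s     : number of streams; ref : reference stream, offset fixed to 0.
    """
    def objective(free_shifts):
        shifts = np.insert(free_shifts, ref, 0.0)  # pin the reference
        total = 0.0
        for p, q, Tp, Tq, w in pairs:
            r = w * (Tp + shifts[p] - Tq - shifts[q])
            total += huber(r).sum()
        return total

    res = minimize(objective, np.zeros(s - 1), method="Nelder-Mead")
    return np.insert(res.x, ref, 0.0)
```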

4. EXPERIMENTAL EVALUATION

To evaluate the performance of our algorithm, we collected a total of 36 real-world consumer photo datasets, each corresponding to one event and containing several personal photo albums. The photographers were not aware of this project at the time they created their photo albums; the photos were therefore taken without alignment in mind, avoiding bias in the later evaluation. The number of photos in each dataset ranges from several dozen to several hundred, or even over a thousand. The entire collection of datasets is rather diverse: the content ranges from traveling (numerous natural and urban scenes) to social events (weddings, car racing, sports, stage shows, etc.); tens of photographers were involved and tens of different camera models were used. In the following, we present our photo stream alignment results and compare them with several baseline algorithms. For all the photo datasets we collected, the time settings of the cameras were not calibrated with each other in situ, and therefore we do not have the absolute ground truth for the camera clocks.

Table 1: Photo stream alignment accuracy on the 36 photo datasets with different algorithms.

Alg.   DNN     SIFT    kNN     R-kNN   SRG     R-SRG
Acc.   25/36   25/36   27/36   29/36   32/36   34/36

However, we obtained a ground truth accurate enough to reflect the correct sequential order of the merged photo stream through verification with the first-party photographers. This ground truth is sufficient for evaluating the algorithms, as one only cares about the sequential order of the photos in the event.

Figure 4: Example photos from the "soccer" dataset, which exhibits hardly any informative visual-temporal correlations.

4.1 Alignment Results

The key to photo stream alignment is finding the informative photo pairs. One can come up with many heuristic approaches for this problem. However, heuristic approaches often run into contradictions and fail to produce robust and accurate alignment, suggesting that a principled approach is needed. In the following, we describe and compare with the three best-performing heuristic methods among those we tried.

1. Distinctive nearest neighbor search (DNN). For photo x^1 from the first stream, x^2 in the second stream is its distinctive nearest neighbor if the similarity between x^1 and x^2 is at least r (r > 1) times larger than that between x^1 and any other photo in the second stream; otherwise, there is no match for photo x^1. There are other ways to define DNN, e.g. linking only those nearest neighbors with similarities above some threshold µ. However, we find our definition more robust across datasets, since it introduces a competing procedure instead of relying on a fixed threshold (a minimal sketch appears at the end of this subsection).

2. SIFT feature matching. Another straightforward way to find the informative pairs is to use near-duplicate detection techniques, such as local SIFT feature matching with RANSAC [1]. However, on one hand, SIFT feature matching tends to miss many visually similar but not quite duplicate photos, leading in some cases to too few detections of informative photo pairs. On the other hand, it tends to be misled by strong outliers, e.g. near-duplicate scenes that actually occurred at different times; after all, photographers do not always walk in lockstep as they take pictures. In practice, this approach also runs slowly.

3. R-kNN graph matching. Instead of the proposed sparse graph, one can use the conventional kNN graph to establish the sparse links and assign the edge weights from the calculated similarities between photo nodes. To reject spurious links, one can also apply the max linkage selection procedure, as in our algorithm; we refer to this variant as R-kNN.

We evaluate the alignment results by checking whether the merged super stream is in the same sequential order as the verified ground truth: if so, we count it as correct; otherwise, we count it as a failure regardless of the actual alignment error. In Table 1, we list the alignment accuracies of the different algorithms. With the proposed max linkage selection procedure, R-kNN performs better than plain kNN graph matching thanks to spurious linkage rejection. Directly using the sparse representation graph (SRG) already outperforms all the heuristic methods, and it is further improved by the competing procedure of max linkage selection (R-SRG).

Overall, our algorithm achieves excellent alignment results in merging different photo streams in chronological order on most of the datasets, in a fashion comparable with human observers. On the few datasets (2 out of 36) where Assumption 1 is violated, such as the "soccer" dataset, our algorithm fails, as do unrelated human observers. In the "soccer" dataset, the photographers merely sat around the same location, taking photos that were visually very similar but at different times. Figure 4 shows some example photos from two streams (indicated by different border colors) in this dataset, where the photos are visually similar across different times. Alignment on this dataset is also very challenging for a human: only from a single pair of photos could a very observant human roughly align the two streams, after careful examination of the semantic content of the photos (i.e. the positions and moving directions of the soccer players).

Figure 5: Alignment example by our proposed algorithm on the Lijiang Trip dataset. Photos from different cameras are indicated by different border colors.

Figure 5 shows an alignment example for the "Lijiang" trip photo dataset. This dataset poses one particular difficulty: visually similar photos do not always occur at the same time, which causes a problem for SIFT feature matching. The SIFT matching method strongly links the photo pairs connected by the yellow dotted lines in the figure and eventually produces an incorrect alignment. In contrast, by utilizing more graph links from other photos, our method ultimately identifies the correct time shift.

Figure 6 shows another alignment example for the "Wedding" event, which took place in a courthouse. Many of those photos are visually very similar (same persons with the same backgrounds). All the baseline algorithms fail in this case, since many of the matched pairs they find are false links. By adaptively selecting the most relevant photos via the sparsity constraint, followed by the competing procedure of max linkage selection, our algorithm effectively rejects those spurious links and correctly identifies the true time shift.

Figure 7 shows the matching score as a function of the time shift on three of the datasets. For the first two cases, our algorithm successfully locates the accurate time shift ∆T by picking the sharp peak of the curve. For the third, the "soccer" dataset, the algorithm cannot locate a clear peak: compared with the previous two cases, the curve has high entropy and multiple peaks, which are strong indications of poor matching.
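For reference, the DNN baseline above reduces to a ratio test between the best and second-best similarity. A minimal sketch follows; the ratio r = 1.2 is illustrative, as the paper does not report its value.

```python
import numpy as np

def distinctive_nn(sim_row, r=1.2):
    """DNN baseline: photo x^1 matches its nearest neighbor in stream 2
    only if that neighbor is at least r times more similar than every
    other photo in the stream; otherwise no match.

    sim_row : (n,) similarities between x^1 and all photos of stream 2.
    Returns the matched index, or None. r = 1.2 is an illustrative
    value, not the paper's setting.
    """
    order = np.argsort(sim_row)[::-1]   # best match first
    best, second = order[0], order[1]
    if sim_row[best] >= r * sim_row[second]:
        return int(best)
    return None
```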

Figure 6: Alignment examples by our proposed algorithm on the Wedding dataset. Photos from different cameras are indicated by different border colors.


Figure 7: The matching score vs. time shift. Left: Grand Canyon; middle: Lijiang day 4; right: Soccer.

Finally, we note that the proposed algorithm is also computationally as efficient as the heuristic methods DNN and kNN (R-kNN), and much faster than SIFT matching.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we address the practical problem of photo stream alignment for collaborative photo collection and sharing in social media. Since people have similar photo-taking interests and viewpoints, photos with overlapping visual content arise when several cameras (photographers) capture the same event. Based on such visual information overlap, we align multiple photo streams along a common chronological timeline of the event, employing a sparse bipartite graph to find the informative photo pairs and a max linkage selection competing procedure to prune false links. Compared with several baseline algorithms, our alignment algorithm achieves satisfactory results that are comparable to human performance. The proposed framework also lends itself to many other applications, such as geo-tag or user-tag transfer between the aligned photo streams, and photo summarization on the master stream of the event, which we will investigate in future work.

Acknowledgement

This work is supported in part by Eastman Kodak Research, the U.S. Army Research Laboratory, and the U.S. Army Research Office under grant number W911NF-09-1-0383.

6. REFERENCES

[1] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74:59-73, 2007.
[2] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489-509, Feb. 2006.
[3] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang. Learning with ℓ1-graph for image analysis. IEEE Transactions on Image Processing (TIP), 19(4):858-866, 2010.
[4] H. Cheng, Z. Liu, and J. Yang. Sparsity induced similarity measure for label propagation. In IEEE International Conference on Computer Vision, 2009.
[5] W. T. Chu and C.-H. Lin. Automatic summarization of travel photos using near-duplicate detection and feature filtering. In Proceedings of the ACM International Conference on Multimedia, 2009.
[6] S. Gammeter, L. Bossard, T. Quack, and L. V. Gool. I know what you did last summer: object-level auto-annotation of holiday snaps. In IEEE International Conference on Computer Vision, pages 614-621, 2009.
[7] S. Gao, I. W.-H. Tsang, and L.-T. Chia. Kernel sparse representation for image classification and face recognition. In European Conference on Computer Vision (ECCV), 2010.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In IEEE International Conference on Computer Vision, 2009.
[9] D. Huynh, S. Drucker, P. Baudisch, and C. Wong. Time quilt: scaling up zoomable photo browsers for large, unstructured photo collections. In SIGCHI Conference on Human Factors in Computing Systems, pages 1937-1940, 2005.
[10] D. Kirk, A. Sellen, C. Rother, and K. Wood. Understanding photowork. In SIGCHI Conference on Human Factors in Computing Systems, 2006.
[11] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In IEEE International Conference on Computer Vision, pages 2036-2043, 2009.
[12] A. C. Loui and A. Savakis. Automated event clustering and quality screening of consumer pictures for digital albuming. IEEE Transactions on Multimedia, 2003.
[13] T. Quack, B. Leibe, and L. V. Gool. World-scale mining of objects and events from community photo collections. In International Conference on Content-based Image and Video Retrieval, 2008.
[14] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In IEEE 11th International Conference on Computer Vision, pages 1-8, 2007.
[15] G. Strong and M. Gong. Organizing and browsing photos using different feature vectors and their evaluations. In ACM International Conference on Image and Video Retrieval, 2009.
[16] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In Proceedings of the International Conference on Computer Vision, 2003.
[17] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[18] X.-J. Wang, L. Zhang, M. Liu, Y. Li, and W.-Y. Ma. ARISTA: image search to annotation on billions of web photos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2987-2994, 2010.
[19] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. N. Metaxas. Automatic image annotation using group sparsity. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3312-3319, 2010.
[20] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison, 2008.
[21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301-320, 2005.

[5] W. T. Chu and C.-H. Lin. Automatic summarization of travel photos using near-duplicate detection and feature filtering. In Proceedings of the ACM International Conference on Multimedia, 2009. [6] S. Gammeter, L. Boassard, T. Quack, and L. V. Gool. I know what you did last summer: object-level auto-annotation of holiday snaps. In IEEE International Conference on Computer Vision, pages 614–621, 2009. [7] S. Gao, I. W.-H. Tsang, and L.-T. Chia. Kernel sparse representation for image classification and face recognition. In European Conference on Computer Vision (ECCV), 2010. [8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In IEEE International Conference on Computer Vision, 2009. [9] D. Huynh, S. Drucker, P. Baudisch, and C. Wong. Time quilt: scaling up zoomable photo browsers for large, unstructured photo collections. In SIGCHI Conference on Human factors in computing systems, pages 1937–1940, 2005. [10] D. Kirk, A. Sellen, C. Rother, and K. Wood. Understanding photowork. In SIGCHI Conference on Human factors in computing systems, 2006. [11] L.-J. Li, S. R., and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In IEEE International Conference on Computer Vision, pages 2036–2043, 2009. [12] A. C. Loui and A. Savakis. Automated event clustering and quality screening of consuer pictures for digital albuming. IEEE Transactions on Multimedia, 2003. [13] T. Quack, B. Leibe, and L. V. Gool. World-scale mining of objects and events from community photo collections. In International Conference on Content-based Image and Video Retrieval, 2008. [14] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In IEEE 11th International Conference on Computer Vision, pages 1–8, 2007. [15] G. Strong and M. Gong. Organizing and browsing photos using different feature vectors and their evaluations. In ACM International Conference on Image and Video Retrieval, 2009. [16] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In Proceedings of International Conference on Computer Vision, 2003. [17] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010. [18] X.-J. Wang, L. Zhang, M. Liu, Y. Li, and W.-Y. Ma. Arista-image search to annotation on billions of web photos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2987–2994, 2010. [19] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and M. D.N. Antomatic image annotation using group sparsity. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3312–3319, 2010. [20] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin Madison, 2008. [21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.