1. INTRODUCTION
The proliferation of capture devices and the explosive growth of online social media have led to countless private image and video collections on local computing devices such as personal computers, cell phones, and personal digital assistants, as well as huge and ever-growing public media collections on the Internet [Boll 2007]. For example, Flickr, the most popular photo sharing site [Flickr], reached five billion photo uploads in 2011, with 3-5 million new photos uploaded daily [Kennedy et al. 2007]. Facebook held more than 60 billion photos shared by its communities as of 2011 [Facebook], while YouTube streams more than one billion videos per day worldwide [YouTube].
Such explosive growth and widespread accessibility of visual content have led to a surge of research activity in visual search. The key problem is retrieving visual documents (such as images, video clips, and Web pages containing images or videos) that are relevant to a given query or user search intention from a large-scale database. In the last decade, visual search has attracted a great deal of attention,
though it has been studied since the early 1990s (referred to as content-based image/video retrieval [Lew et al. 2006], [Li et al. 2007], [Rui et al. 1999]). Many research demonstrations and commercial applications have been developed. Owing to the great success of text search, most popular image and video search engines, such as Google [Google], Bing [Bing], and Yahoo! [Yahoo!], build upon text search techniques by using the non-visual information (such as surrounding text and user-provided tags) associated with visual content. This kind of multimedia search approach cannot always achieve satisfactory results, as it entirely ignores the visual content as a ranking signal [Chang et al. 2007], [Datta et al. 2008], [Hauptmann et al. 2008a], [Hsu et al. 2007], [Snoek and Worring 2009].
To address the problems of such visual search approaches, multimedia search re-ranking has received increasing attention in recent years. It is defined as the reordering of visual documents, based on the information manifested in the initial search results or a knowledge base, to improve search performance. This information consists of multimodal cues, which can be knowledge or specific patterns mined from the initial ranked list, query examples, or any available auxiliary knowledge. From another perspective, re-ranking can be viewed as a post-process of core search. Figure 1 shows a general process for multimedia search re-ranking. A visual document might be an image, a video clip, or a Web page containing images or videos. Given an initial ranked list of visual documents returned by any search approach, visual search re-ranking improves search performance by reordering these visual documents based on multimodal cues. For example, in a real image search engine, the initial text search results can be re-ranked according to visual similarity to a given example [Bing], [Google] or to color style (e.g., color or grayscale). Another example, in an image retrieval system, is to re-rank or filter the images according to some predefined categories [Fergus et al. 2004], [Wnuk and Soatto 2008]. In the setting of object retrieval, a geometric verification step is usually introduced to re-rank the results returned by the bag-of-words (BoW) model by checking spatial configurations [Philbin et al. 2007], [Jegou et al. 2010]. The challenges associated with multimedia search re-ranking can be attributed to several factors, discussed below.
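As a concrete illustration of the first example mentioned above, re-ranking initial text results by visual similarity to a user-provided example, the following minimal sketch reorders a toy ranked list by cosine similarity of feature vectors. The document IDs and feature values are invented for illustration; real engines use learned visual descriptors rather than these toy vectors.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank_by_visual_similarity(initial_list, features, query_example):
    # Reorder the initial ranked list by descending visual
    # similarity to the query example's feature vector.
    return sorted(initial_list,
                  key=lambda doc: cosine(features[doc], query_example),
                  reverse=True)

# Toy data: three documents from an initial text-based ranking.
initial = ["doc_a", "doc_b", "doc_c"]
feats = {"doc_a": [0.1, 0.9], "doc_b": [0.8, 0.2], "doc_c": [0.5, 0.5]}
query = [0.9, 0.1]  # feature vector of the user-provided example

reranked = rerank_by_visual_similarity(initial, feats, query)
```

Here the document closest to the example in feature space moves to the top, regardless of its position in the initial text-based list.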
Many existing commercial search engines have developed different re-ranking schemes to improve the search experience. Figure 2 shows some examples of re-ranking designs. For example, Google [Google], [Jing and Baluja 2008a], [Jing and Baluja 2008b] and Bing [Bing], [Cui et al. 2008a], [Cui et al. 2008b] support retrieving similar or near-duplicate objects in their text-only image search results. Yahoo! [Yahoo!] and Bing [Bing], [Wang and Hua 2011] integrate a set of content filters (also called attributes), such as “image size,” “layout,” “style,” and “popular queries,” in their image searches, while Videosurf [VIDEOSURF] uses a face detector as the filter for re-ranking keyframes. Although these content filters can facilitate some specific search tasks, most of them rely on simple features and do not directly represent the relevant information associated with the query. The key problem—the relevance between the search results and the query—remains a challenging and open research problem. There is no generic approach that can deal with all kinds of queries and search intents. On the other hand, “type less and search more” is the most desired feature in most search engines; it is thus not practical to let users spend considerable time performing several rounds of interaction to find desired search results. Therefore, re-ranking visual search results according to the query and user intent remains an open issue. In other words, there is a gap between the user search intent and the results from existing search systems.
While numerous approaches have been proposed for multimedia search re-ranking, we are unaware of any survey on this particular topic. Clearly, multimedia search re-ranking has been an important and hot topic in both academia and industry. We have observed that almost all notebook papers in recent TRECVID proceedings have adopted re-ranking techniques for improving search task performance [TRECVID], not to mention the many works on image and object retrieval in the computer vision community, where re-ranking has become a key post-processing step [Philbin et al. 2007], [Jegou et al. 2010], [Chum et al. 2011]. It is worthwhile to revisit and categorize the current techniques and to explore the potential benefits and challenges that both the multimedia and vision communities offer to visual search re-ranking, and vice versa.
The research on visual search re-ranking has proceeded along four paradigms, from the perspective of the knowledge exploited for mining relevant information: 1) self-re-ranking, which uses only the initial search results; 2) example-based-re-ranking, which leverages user-provided query examples; 3) crowd-re-ranking, which exploits the crowdsourced knowledge available on the Internet, e.g., multiple image and video search engines or sites, user-contributed online encyclopedias such as Wikipedia [Wikipedia], and so on; and 4) interactive-re-ranking, which involves user interaction to guide the re-ranking process. The scope of this survey is as follows. We first introduce typical multimedia information retrieval systems and the role of re-ranking in these systems. We then review the re-ranking methodologies along these dimensions. Since these data-driven methods rely heavily on data sets and the knowledge mined from them, we also discuss the data sets that are suitable for visual re-ranking. A related problem is how to evaluate the performance of a re-ranking method; we therefore review the performance metrics and evaluations for visual search re-ranking. This survey covers approaches inspired by multiple research fields, including computer vision, machine learning, text retrieval, and human-computer interaction.
As most re-ranking methods for visual search are strongly motivated by the re-ranking and rank aggregation methods in the text domain, we first discuss the related techniques in text search, which provide a comparative analysis of text and multimedia. However, as this paper focuses on the visual domain, we only briefly introduce some representative methods.
1.1 Re-ranking in Text Domain
Similar to visual search re-ranking, the research on text search re-ranking can also be categorized into the following paradigms.
—Self-re-ranking. Analogous to multimedia re-ranking, self-re-ranking methods for text search include: 1) clustering-based methods [Lee et al. 2001], [Na et al. 2008], where the initially retrieved documents are grouped into different clusters; 2) pseudo relevance feedback [Cao et al. 2007], [Tseng et al. 2008], where the top-ranked documents are regarded as “positive” when learning a ranking model; and 3) graph-based methods [Bendersky and Kurland 2008], [Brin and Page 1998a], [Deng et al. 2009], [Kurland and Lee 2005], [Lin 2008], where a graph is built locally from the initial top-ranked documents or globally from the entire document collection.
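The pseudo relevance feedback idea in item 2 can be sketched as follows: the top-ranked documents are assumed relevant, their centroid forms a feedback query model, and the whole list is re-scored against it. This is a toy centroid-based sketch with invented term-weight vectors, not the specific models of the works cited above.

```python
def centroid(vectors):
    # Element-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def pseudo_relevance_feedback(ranked_docs, doc_vectors, top_k=2):
    # Treat the top_k initially ranked documents as pseudo-positive,
    # build a feedback query from their centroid, and re-rank the
    # whole list by dot-product similarity to that centroid.
    feedback = centroid([doc_vectors[d] for d in ranked_docs[:top_k]])
    score = lambda d: sum(x * y for x, y in zip(doc_vectors[d], feedback))
    return sorted(ranked_docs, key=score, reverse=True)

# Toy term-weight vectors for documents in an initial ranking.
docs = ["d1", "d2", "d3", "d4"]
vecs = {"d1": [1.0, 0.0, 0.5],
        "d2": [0.8, 0.1, 0.6],
        "d3": [0.0, 1.0, 0.0],
        "d4": [0.9, 0.0, 0.4]}

reranked = pseudo_relevance_feedback(docs, vecs, top_k=2)
```

Documents resembling the assumed-relevant top results are promoted, while outliers such as d3 sink toward the bottom.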
—Example-based-re-ranking [Bogers and Bosch 2009], [Zloof 1975a], [Zloof 1975b]. Query-by-Example (QBE) was first proposed by Zloof in the 1970s [Zloof 1975a], [Zloof 1975b]. The motivation is to parse a user’s query into a structured statement expressed in a database manipulation language. Later, researchers investigated ways to understand the query provided by users using accompanying examples. For example, Bogers et al. propose dividing IR test collections into different subcollections and applying a linear fusion of the search results from the disparate baselines [Bogers and Bosch 2009]. The weights for fusion are determined by authoritativeness scores, which reflect the expertise of authors on certain topics.
—Crowd-re-ranking [Carbonell 1997], [Chen et al. 2011], [Dwork et al. 2001], [Liu et al. 2007], [Renda and Straccia 2003], [SavvySearch], [White et al. 2008]. For example, translingual information retrieval (TIR) is characterized by providing a query in one language and searching documents in one or more different languages [Carbonell 1997]. This is similar to the setting of TRECVID search tasks, where a video document may be associated with several machine-translated languages [TRECVID]. Metasearch is a prominent approach that combines the search result lists returned by multiple search engines [Dwork et al. 2001], [Renda and Straccia 2003], [SavvySearch]. Each document in a returned list is ordered with respect to a search engine and a relevance score. Liu et al. propose leveraging user-labeled data to perform metasearch in a supervised manner [Liu et al. 2007], while Chen et al. suggest a semi-supervised approach to rank aggregation that leverages large amounts of unlabeled data [Chen et al. 2011]. Different from metasearch, where the final search results are a combination of multiple lists, White et al. propose providing only the best single list from multiple search engines [White et al. 2008]; a learning-based approach is adopted to support switching between search engines.
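Metasearch rank aggregation can be illustrated with the classic Borda count, a simple positional voting scheme (used here purely as an illustration; it is not the specific method of the works cited above). Each engine's list awards points by position, and documents are reordered by total points.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    # Classic Borda-count rank aggregation: each list awards a document
    # (L - position) points, where L is that list's length; documents
    # missing from a list receive zero points from it.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        length = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] += length - pos
    # Sort by total points, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy result lists from three hypothetical search engines.
engine_a = ["x", "y", "z"]
engine_b = ["y", "x", "w"]
engine_c = ["y", "z", "x"]

merged = borda_aggregate([engine_a, engine_b, engine_c])
```

Document "y" wins because it ranks highly in all three lists, even though it is not first everywhere; this robustness to any single engine's errors is the usual motivation for aggregation.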
—Interactive-re-ranking [Rohini and Varma 2007], [Yamamoto et al. 2007]. For example, in [Yamamoto et al. 2007], a user can edit part of the search results (i.e., via delete and emphasize operations); the operations are then propagated to all of the results to re-rank them. Rohini et al. propose learning user profiles with machine learning techniques over past browsing histories, and then re-ranking the results using collaborative filtering techniques [Rohini and Varma 2007].
In addition, text search re-ranking also involves the analysis of query logs [Teevan et al. 2007], [Zhuang and Cucerzan 2006], as well as consideration of the diversity of search results [Carbonell and Goldstein 1998]. A more comprehensive study of combination approaches for information retrieval can be found in [Croft 2000].
With the aim of providing a comprehensive and critical survey of current approaches to visual search re-ranking, this paper is organized as follows. Section 2 gives a detailed review of techniques for visual search re-ranking, including the general visual search framework, an overview of re-ranking from the perspective of a Bayesian formulation, and the re-ranking methodologies. Moreover, we offer a brief survey of re-ranking for text search, by which visual search re-ranking is motivated. Section 3 discusses benchmark datasets and evaluation criteria. We conclude the paper with a discussion of several promising directions in Section 4.
2. MULTIMEDIA SEARCH RE-RANKING: METHODOLOGIES
In this section, we review existing approaches to visual search re-ranking. We first position re-ranking as a key component of a typical visual search system and provide a Bayesian formulation as an overview of the re-ranking problem. We then classify re-ranking approaches into four categories and discuss each of them in detail. We also discuss other recent techniques related to re-ranking, such as query suggestion and user interfaces, as well as re-ranking methods in the text domain.
2.1 The Role of Re-ranking in Multimedia Search
A typical visual search system consists of several components, including query analysis, an index module, uni-modal search (e.g., text, visual, and concept searches), and re-ranking. Figure 3 shows a generic visual search framework. Usually, the query in a visual search system consists of a textual query (e.g., “find shots in which a boat moves past”) and possibly a set of query examples (e.g., objects, images, or video keyframes or clips). Via query analysis, the meaningful or important keywords and their expanded related terms are obtained from the textual query. Meanwhile, the visual examples can be mapped to relevant high-level concepts by the pre-trained classifiers for
concept-based search (e.g., boat, water, and outdoor). Specifically, the confidence scores from those classifiers can be treated as the weights for the corresponding concepts (i.e., hidden text) and further used in a text-like search (e.g., an inverted index based on term and document frequency), or used as a feature vector in a concept space for searching via query-by-example (QBE). Moreover, a set of low-level visual features (global and local) is extracted to represent the query examples for visual-based search. These multimodal queries are fed into individual search models, such as text-, concept-, and visual-based searches, respectively. For example, a text-based search may use the speech recognition transcript, the closed captions available from the program channel, or captions embedded in video frames recognized through Optical Character Recognition (OCR). The confidence vectors from concept detectors, or low-level feature vectors, can be used in the same way as in the QBE or tf-idf scheme [Baeza-Yates and Ribeiro-Neto 1999] for searching. More comprehensive introductions to content-based image and video retrieval can be found in [Datta et al. 2008], [Kennedy et al. 2008a], [Lew 2000], [Lew et al. 2006], [Smeulders et al. 2000], [Snoek and Worring 2009], [Yan and Hauptmann 2007b], [Philbin et al. 2007]. Based on these initial search results, as well as some knowledge, a re-ranking module is applied to aggregate the search results and reorder the initial document list to improve search performance. In this paper, re-ranking is defined as improving the initial text-based search accuracy by considering the other search results and prior knowledge. We can observe from Figure 3 that re-ranking plays a key role in the visual search framework in improving the initial search performance.
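To make the "hidden text" idea concrete, the following toy sketch treats detected concepts as terms: classifier confidences serve as term weights, and documents are scored against the query's concept weights with a tf-idf-style weighting. The concept names and confidence values are invented for illustration, and real systems use far larger concept vocabularies.

```python
import math

def idf(term, collection):
    # Inverse document frequency over the concept "vocabulary":
    # rare concepts are more discriminative.
    df = sum(1 for doc in collection.values() if term in doc)
    return math.log(len(collection) / df) if df else 0.0

def concept_score(query_concepts, doc_concepts, collection):
    # tf-idf-style match between query concept weights (classifier
    # confidences) and a document's detected concept weights.
    return sum(w * doc_concepts.get(c, 0.0) * idf(c, collection)
               for c, w in query_concepts.items())

# Toy collection: per-document concept confidences ("hidden text").
collection = {
    "v1": {"boat": 0.9, "water": 0.8},
    "v2": {"water": 0.7, "outdoor": 0.6},
    "v3": {"indoor": 0.9},
}
# Query example mapped to concepts by hypothetical pre-trained classifiers.
query = {"boat": 0.8, "water": 0.5, "outdoor": 0.3}

ranked = sorted(collection,
                key=lambda d: -concept_score(query, collection[d], collection))
```

The rare "boat" concept carries the highest idf, so the boat video tops the list even though "water" appears in more documents.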
2.2 A Bayesian View of Multimedia Search Re-ranking
The visual re-ranking problem can be traced back to the late 1990s, when researchers started focusing on improving content-based image retrieval (CBIR) results via relevance feedback techniques [Benitez et al. 1998], [Rui et al. 1998], [Zhou and Huang 2002], [Zhou and Huang 2003]. It emerged as an independent research topic and attracted increasing attention beginning in the early 2000s. In the most common formulation, the re-ranking problem reduces to re-estimating the relevance of each document that has been ranked in an initial search result (e.g., a ranked list of documents retrieved by a text-only approach, a ranked list of objects returned by a bag-of-words model, etc.). Intuitively, this estimation is usually based on knowledge mined from the initial search results or the queries, prior knowledge from the Web, domain-specific knowledge bases, or the interactions of users. Once we can estimate the relevance of each document, we can re-rank the documents according to their relevance scores and obtain the best ranked list.
More formally, the re-ranking problem can be formulated as finding the optimal ranked list from a Bayesian perspective, as follows. Let D = {di} (i = 1, . . . , N) denote the collection of documents to be ranked or re-ranked, where di is a single document (such as an object, an image, a video, or a clip) and N is the number of documents in D. Let R0 denote the initial ranked list and R the best ranked list. In these lists, each document di has a corresponding ranking order or relevance score ri with respect to the query Q. Note that Q may be a set of terms, or visual examples (i.e., a set of objects/images or video clips), or any combination of them. Let R denote the set of all possible ranked lists (R0, R ∈ R) and R = [r1, r2, . . . , rN]T, where ri (0 ≤ ri ≤ 1) is the relevance score of the i-th visual document di. If we only consider the rank order of each document in R, then the size of R is N!, which is huge for large N. Let K denote the knowledge guiding the re-ranking process. From a probabilistic view, given the initial ranked list R0 and prior knowledge K, re-ranking can be formulated as deriving the optimal list R∗ with the maximum a posteriori probability,