Motivation
Map-based street-level imagery, such as Google Street View, provides a comprehensive visual record of many cities worldwide. For example, the visual appearance of the city of Paris is captured by almost 100,000 publicly available Street View images [1]. We estimate there are 60 million Street View images in France alone, covering all major cities. Additional visual sensors are likely to become widespread in the near future: cameras will be built into most manufactured cars [2], and (some) people will continuously capture their daily visual experience using wearable mobile devices such as Google Glass [3]. Together, these data will provide a large-scale, comprehensive and dynamically updated visual record of urban environments.
Goals
The goal of this effort is to develop automatic tools for large-scale quantitative analysis of such dynamic visual data. The aim is to provide quantitative answers to questions like: What are the typical architectural elements (e.g., different types of windows or balconies) that characterize the visual style of a city district? What is their geo-spatial distribution (see Figure 1)? How does the visual style of a geo-spatial area evolve over time? What are the boundaries between visually coherent areas in a city? Other interesting questions concern the distribution of people and their activities: How do the number of people at a particular place and the nature of their activities evolve during a day, across seasons, or over years? Are there tourists sightseeing, urban dwellers shopping, elderly people walking dogs, or children playing in the street? What are the major causes of bicycle accidents?
State-of-the-art and challenges
Computer vision research has largely focused on 3D reconstruction from ground-level or aerial imagery [Musialski12]. In visual recognition, the typical goal is to name and localize a set of predefined objects (e.g., a car or a bicycle) in images, using local and/or global distributions of simple image features as models [Felzenszwalb10]. Current visual recognition models require accurate, and hence time-consuming, manual annotation of large amounts of training data. Impressive demonstrations of large-scale data analysis have started to appear in other scientific domains. In natural language processing, an analysis of a corpus of more than 5 million books published between 1800 and 2000 has revealed interesting linguistic, sociological and cultural patterns [Michel11]. In speech processing, a temporal analysis of several years of data recorded in a home environment has demonstrated the possibility of predicting a child's age of word acquisition from caregiver word usage [Roy06]. However, a similar large-scale analysis has yet to be demonstrated in the visual domain. As visual data and computational resources become widely available, the main scientific challenge now lies in developing powerful models for spatio-temporal, distributed and dynamic visual data. For example, while the vocabulary and grammar of natural text are rather well defined, there is no accepted visual equivalent that captures subtle but important differences in architectural styles, or that differentiates fine changes in human behavior leading to vastly different scene interpretations.
Methodology
This project will build on the considerable progress in visual object, scene and human action recognition achieved in the last ten years, as well as the recent advances in large-scale machine learning that enable optimization of complex structured models on massive data sets. The project will develop a general framework for finding, describing and quantifying dynamic visual patterns, such as architectural styles or human behaviors, distributed across many dynamic scenes from urban environments. The models will be learnt automatically from visual data together with different forms of readily available but noisy and incomplete metadata such as text, geotags, or publicly available map-based information (e.g., the type or use of buildings). Our initial results in this direction on static Street View images have been published in [Doersch12] and are illustrated in Figure 1.
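To make this concrete, the sketch below illustrates one possible instantiation of such weakly supervised learning, in the spirit of the discriminative visual element mining of [Doersch12]: candidate image patches from a target city are iteratively re-selected so that a linear classifier separates them from patches collected in other cities, with geotags as the only supervision. The feature extraction step, the variable names (patch_features, is_target_city) and the scikit-learn-based implementation are illustrative assumptions, not the project's final pipeline.

# Minimal sketch (assumed setup, not the project's actual pipeline) of mining
# one discriminative visual element from geotagged street-level image patches.
import numpy as np
from sklearn.svm import LinearSVC

def mine_visual_element(patch_features, is_target_city, seed_idx,
                        n_iterations=5, cluster_size=50):
    """Grow one candidate visual element from a seed cluster of patches.

    patch_features : (N, D) array of patch descriptors (assumed precomputed, e.g., HOG).
    is_target_city : (N,) boolean array, True for patches geotagged in the target city.
    seed_idx       : indices of the initial positive cluster.
    Returns the indices of the final cluster and the trained linear detector.
    """
    positives = np.asarray(seed_idx)
    negatives = np.flatnonzero(~is_target_city)   # patches from other cities
    target = np.flatnonzero(is_target_city)       # candidate pool in the target city

    clf = LinearSVC(C=0.1)
    for _ in range(n_iterations):
        # Train a linear detector: current cluster vs. patches from other cities.
        X = np.vstack([patch_features[positives], patch_features[negatives]])
        y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
        clf.fit(X, y)

        # Re-select the top-scoring target-city patches as the new cluster;
        # tests of discriminativeness and geographic consistency are omitted here.
        scores = clf.decision_function(patch_features[target])
        positives = target[np.argsort(scores)[::-1][:cluster_size]]

    return positives, clf

In the full framework, each surviving element would additionally be scored by how frequent it is in the target city relative to other cities and by how its detections distribute geographically, which is what produces maps such as the one in Figure 1b.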
Figure 1: Quantitative visual analysis of urban environments from street-view imagery.
a: Examples of architectural visual elements characteristic of Paris, Prague and London, learnt automatically by analyzing thousands of Street View images. b: An example of a geographic pattern (shown as red dots on the map of Paris) for one visual element. Here, balconies with cast-iron railings are concentrated along the main boulevards. Figures from [Doersch12].
[2]kschang.hubpages.com/hub/How-Many-Cameras-Will-Your-Next-Car-Have-How-Cameras-are-Making-Cars-Smarter
[3]http://en.wikipedia.org/wiki/Project_Glass
References
[Doersch12] C. Doersch, S. Singh, A. Gupta, J. Sivic, A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics (SIGGRAPH 2012).
[Felzenszwalb10] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 32(9). 2010.
[Michel11] J.B. Michel et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014). 2011.
[Musialski12] P. Musialski et al. A Survey of Urban Reconstruction. Eurographics 2012-State of the Art Reports. 2012.
[Roy06] D. Roy et al. The human speechome project. Symbol Grounding and Beyond. 2006.