Lluís Castrejón*, Yusuf Aytar*, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
Massachusetts Institute of Technology

CMPlaces is designed to train and evaluate cross-modal scene recognition models. It covers five different modalities: natural images, sketches, clip-art, text descriptions, and spatial text images. Each example in the dataset is annotated with one of the 205 scene labels from Places, which is one of the largest scene datasets available today. Hence, the examples in our dataset span a large number of natural situations. Examples in our dataset are not paired between modalities, encouraging researchers to develop methods that learn strong alignments from weakly aligned data.

We chose these modalities for two reasons. Firstly, since the goal of the dataset is to study transfer across significantly different modalities, we seek modalities whose statistics differ from those of natural images (such as line drawings and text). Secondly, these modalities are easier to generate than real images, which is relevant to applications such as image retrieval. In total, CMPlaces contains more than 1 million images comprising 205 unique scene categories and 5 modalities.

Download our paper

Please cite one of the following papers if you use this service:

Learning Aligned Cross-Modal Representations from Weakly Aligned Data

Ll. Castrejón*, Y. Aytar*, C. Vondrick, H. Pirsiavash and A. Torralba
CVPR 2016

Cross-Modal Scene Networks

Y. Aytar*, Ll. Castrejón*, C. Vondrick, H. Pirsiavash and A. Torralba
In Submission

Try our demo

Cross-Modal Retrieval Demo

Notice: Please do not overload our server by querying repeatedly in a short period of time. This is a free service for academic research and education purposes only. It comes with no guarantee of any kind. For any questions or comments regarding this demo, please contact the authors.


We thank TIG for managing our computer cluster. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. This work was supported by NSF grant IIS-1524817, by a Google faculty research award to A.T., and by a Google Ph.D. fellowship to C.V.