See, Hear, and Read: Deep Aligned Representations

Yusuf Aytar, Carl Vondrick, Antonio Torralba
Massachusetts Institute of Technology

We capitalize on large amounts of readily available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.

Hidden Unit Visualizations

We visualize some units in the upper shared layers to analyze what they respond to. We find several units that automatically emerge to activate on particular objects, independent of the modality.

The videos show images, texts, and sounds that activate one particular hidden unit. Click them to hear their sounds!

Baby-like

Church-like

Dog-like

Engine-like

Music-like

Sports-like

Water-like

Wind-like
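
One common way to build visualizations like these is to rank inputs from each modality by how strongly they activate a chosen unit and keep the top few. The sketch below illustrates that idea only; it is not taken from the paper's code. The function name `top_activating`, the activation matrices, and all numbers are hypothetical, and it assumes the shared-layer activations have already been computed for each modality.

```python
import numpy as np

def top_activating(activations, unit, k=3):
    """Return indices of the k inputs that most strongly activate `unit`.

    activations: array of shape (num_inputs, num_units) holding
    hypothetical shared-layer activations for a single modality.
    """
    scores = activations[:, unit]
    # Sort in descending order of activation; stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Toy activations with made-up values: 3 inputs per modality, 2 hidden units.
toy_activations = {
    "image": np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
    "sound": np.array([[0.5, 0.1], [0.2, 0.6], [0.9, 0.4]]),
    "text":  np.array([[0.4, 0.3], [0.7, 0.8], [0.1, 0.2]]),
}

# For unit 0, collect the two strongest inputs from every modality.
top_per_modality = {
    name: top_activating(acts, unit=0, k=2)
    for name, acts in toy_activations.items()
}
print(top_per_modality)
```

If a unit's top-ranked images, sounds, and sentences all depict the same concept, that suggests the unit responds to the concept rather than to one modality.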


Example Retrievals

We show examples of cross-modal retrieval between sounds, images, and text. At the top of each panel, we show the input, and below it we show the top retrieved outputs in each modality.

Although our network was trained using only image/sound and image/text pairs, the alignment that emerges is strong enough for it to transfer between sound and text.
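
Retrieval in a shared representation space can be sketched as a nearest-neighbor search: embed the query, compare it with the embeddings of candidates from another modality, and return the closest ones. The example below is a minimal illustration under stated assumptions, not the authors' implementation. Cosine similarity is an assumed choice of similarity measure, and the function names and toy vectors are hypothetical stand-ins for real network outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve(query, gallery, k=4):
    """Return (indices, similarities) of the k gallery items closest to query.

    query:   vector of shape (dim,), an embedding from one modality.
    gallery: array of shape (num_items, dim), embeddings from any modality.
    Similarity is cosine similarity (an assumption made for this sketch).
    """
    q = l2_normalize(np.asarray(query, dtype=float))
    g = l2_normalize(np.asarray(gallery, dtype=float))
    sims = g @ q  # cosine similarity of each gallery item with the query
    order = np.argsort(-sims, kind="stable")[:k]
    return order.tolist(), sims[order].tolist()

# Made-up embeddings: one sound query and four text candidates.
sound_query = np.array([1.0, 0.0, 0.0])
text_gallery = np.array([
    [0.9, 0.1, 0.0],   # nearly the same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.7, 0.7, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # opposite direction
])

idx, scores = retrieve(sound_query, text_gallery, k=2)
print(idx, scores)
```

Because every modality maps into the same space, the same `retrieve` call works for any pairing of query and gallery, including sound and text, which were never paired during training.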

Retrieval 1

Sound Input


Sound Output


Image Output


Text Output

  • a green soccer field
  • A soccer field.
  • the soccer field
  • big soccer field covered in grass

Retrieval 2

Sound Input


Sound Output


Image Output


Text Output

  • Water is crashing down
  • water is color green
  • Waterfall
  • this is a nature setting

Retrieval 3

Sound Input


Sound Output


Image Output


Text Output

  • people in a concert
  • he holds on to the bar
  • people at a concert event
  • two men at night

Retrieval 4

Sound Input


Sound Output


Image Output


Text Output

  • three walk signals are green
  • the blurred people in the audience
  • a big screen on stage
  • group posing at night time

Retrieval 5

Sound Input


Sound Output


Image Output


Text Output

  • a man in a suit stands near a table with computer equipment on it
  • a man in a suit standing next to a control board and computer
  • man in suit giving a speech
  • african american male speaking at a press conference

Retrieval 6

Text Input

  • the choppy water the man is riding

Sound Output


Image Output


Text Output

  • a person stands on water skis in the water
  • a couple of kayakers paddling through the water
  • a paddle boarder is navigating his paddle board through the surf
  • a man in the water with a sail and skis

Retrieval 7

Text Input

  • an airplane wing in the air with a lot of clouds

Sound Output


Image Output


Text Output

  • photo of airplane wing as the plane flies in the blue sky
  • looking out from a jet plane over the mountains
  • an airplane ascends through a cloud into a blue sky
  • a cloudy sky and an airplane wing with its tip up is flying by

Retrieval 8

Text Input

  • some baseball players a pitcher catcher and an umpire

Sound Output


Image Output


Text Output

  • three baseball players a batter a catcher and an umpire
  • a batter swinging at a ball with a catcher and umpire behind the plate
  • a pitcher is throwing the ball in a baseball game
  • a baseball player swinging at a pitch with catcher and umpire behind him

Retrieval 9

Text Input

  • brown bear in pool of water

Sound Output


Image Output


Text Output

  • a brown bear in the water
  • brown bear in the water
  • two bears are in the water
  • A brown bear in the water

Retrieval 10

Image Input


Sound Output


Image Output


Text Output

  • darkness of night sky
  • white and orange curtains behind the stage
  • Fireworks in the night sky
  • It is rather dark room.

Retrieval 11

Image Input


Sound Output


Image Output


Text Output

  • The dog belongs to the homeowner
  • the dog is on the beach
  • The dog is guarding the house
  • The dog is guarding the house

Retrieval 12

Image Input


Sound Output


Image Output


Text Output

  • A baby chewing on a tooth brush
  • A baby chewing on a tooth brush
  • A baby chewing on a tooth brush
  • a young baby chewing on a yellow hair brush