What’s in a picture? Google algorithms attempt to describe real-world photos in plain English


posted Monday, November 24, 2014 at 4:37 PM EST


It's the holy grail of image databases, and it looks like Google might be getting closer to making a breakthrough. For years now, researchers have attempted to tackle the thorny task of creating an algorithm that can identify (and then describe) the contents of a photograph without user intervention. It's one of those tasks that is so easy for people that we almost don't think about it, and yet it turns out to be incredibly challenging for a computer.

Consider, for example, the image below, courtesy of Google. For most people living in the Western world, it takes no thought at all to see that you're looking at a half-finished glass of wine and slices of three separate pizzas, sitting somewhat jumbled on top of a Frigidaire gas stovetop. And with only a little consideration, you could probably make a reasonable guess as to the ingredients of those pizzas, not to mention identifying the salt, sunflower oil, and other condiments around the stove.

Google algorithms can now automatically describe this image in English with reasonable accuracy.

For a computer, though, describing an image with this kind of detail is next to impossible. In fact, even giving a vague approximation of a photo's contents -- with the exception of faces looking near the camera, and certain other more easily recognized features -- is surprisingly difficult. Yet courtesy of Google research, a computer has managed to describe the image above with a reasonable level of detail: "Two pizzas sitting on top of a stove top oven."

And it's not just pizzas and stoves the algorithm can recognize. From sports and nature to photos around the house, it's managing to describe images with fair accuracy at least some of the time, as you can see in the examples below. It does, however, sometimes get things badly wrong, as the columns at the right of the image show.

More examples -- good and bad -- of Google's image description algorithms at work.

Algorithms like these, if they can be made accurate most of the time, could be absolutely revolutionary for image databases, both in the cloud and potentially on the desktop as well. Imagine getting home from a shoot and having all your images tagged and captioned automatically, freeing up your time to work on the photos in other ways, shoot more photos, or perhaps just put your feet up and relax for a bit! There's obviously still some way to go, but it's thrilling to see the progress that has already been made.

For more details on how it all works (warning: it's pretty technical), read the post "A picture is worth a thousand (coherent) words: building a natural description of images" on the Google Research blog.