Don’t Laugh: Yahoo’s Open Source AI Has a Secret Weapon
Yet another tech giant is sharing its artificial intelligence know-how with the world. Today Yahoo published the source code to its CaffeOnSpark AI engine so that anyone from academic researchers to big corporations can use or modify it.
Yahoo may not be known as much for its technological prowess these days. But it did incubate Hadoop, an open source, wildly popular data crunching platform used by Facebook, Twitter and scores of other companies. And when it comes to AI, it has a unique asset. When training artificial intelligence systems, the data matters as much as the algorithms. And Yahoo has one of the more interesting data sets around in the form of Yahoo-owned photo site Flickr.
Like so many other new open source AI projects, CaffeOnSpark is based on deep learning, a branch of artificial intelligence particularly useful in helping machines recognize human speech, or the contents of a photo or video. Yahoo, for example, uses it to improve search results on Flickr by determining the contents of different photos. Instead of relying on the descriptions and keywords entered by the people who upload photos to the site, Yahoo teaches its computers to recognize certain characteristics of a photo, such as specific colors or even objects and animals.
In recent months Google has open sourced its deep learning framework TensorFlow, Microsoft opened up its similar framework CNTK, Facebook shared its AI hardware designs, and Chinese search giant Baidu unveiled its deep learning training software.
Each of these pieces of open source technology scratches a different itch. For Yahoo, it’s the desire to run deep learning processes on existing systems without the need to move data from one place to another. Training a deep learning system to recognize images requires huge amounts of data, Yahoo vice president of architecture Andy Feng explains. You feed an algorithm as many examples as you can of, say, a cat, and eventually the machine will “learn” the common features of cats and be able to tell photos that include cats from those that don’t.
Flickr hosts billions of photos, a plentiful selection of images with which to train an AI. But the team didn’t want to have to copy all those images from the primary Flickr servers to a new cluster of servers running deep learning software. So they invented a way to run deep learning software on their existing infrastructure.
CaffeOnSpark, as the name suggests, combines two existing technologies: the popular deep learning framework Caffe and the up-and-coming data-crunching system Spark, which can run on top of the even more popular big data platform Hadoop. What Yahoo did was simply create a way to run Caffe atop Spark clusters. It can be run either on Spark alone or atop Hadoop. Besides making it easy for AI developers to use familiar tools and avoid moving data around, CaffeOnSpark also makes it relatively easy to distribute deep learning processes across multiple servers, Feng says, something that the open source version of Google’s TensorFlow can’t yet do.
Feng says a number of companies asked Yahoo to open source CaffeOnSpark last year after the team published a blog post about the software. It turns out a lot of organizations have data sitting on server clusters that they don’t want to move around.