
Yelp’s mrJob: Powering Recommendations and now Open Source

Yelp has a few nifty features on its network that give it that special sauce. It’s what you see with most world-class social networks. The features provide context and allow for discovery. They also make the service simple to use: review highlights, autocomplete, spelling suggestions and top searches.

“People Who Viewed this Also Viewed…” is one of its popular features. It shows you photos by other people who have similar viewing habits.

Take the King Burrito page on Yelp. It is a favorite Mexican spot in North Portland. The food rocks. On Yelp, the sidebar shows what visitors to the King Burrito page are also viewing.



Yelp once used its own Hadoop cluster to power these types of services. But they had a few issues. Now they use what they call mrJob.

On Friday, they opened the distributed computing service for anyone to use.

According to the information on GitHub, mrJob “supports Amazon’s Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.”
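To make the MapReduce model a little more concrete, here is a rough sketch of how a feature like “People Who Viewed this Also Viewed…” could be computed as a map step and a reduce step. It is plain Python that simulates the steps locally. It is not Yelp’s actual algorithm and it does not use the mrJob API. The view log, the user IDs, the business names other than King Burrito, and the function names are all made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical view log of (user_id, business) pairs. Invented data.
VIEWS = [
    ("u1", "King Burrito"),
    ("u1", "Taco Stand"),
    ("u2", "King Burrito"),
    ("u2", "Taco Stand"),
    ("u2", "Pizza Place"),
    ("u3", "King Burrito"),
    ("u3", "Pizza Place"),
    ("u4", "King Burrito"),
    ("u4", "Taco Stand"),
]


def map_views(user_pages):
    """Map step: for one user's pages, emit ((page, other_page), 1)."""
    for page, other in permutations(sorted(set(user_pages)), 2):
        yield (page, other), 1


def reduce_counts(pair, counts):
    """Reduce step: sum every count emitted for one (page, other_page) key."""
    return pair, sum(counts)


def also_viewed(views):
    """Run the map and reduce steps in-process and rank the results."""
    # Group the raw log by user so each mapper call sees one user's pages.
    by_user = defaultdict(list)
    for user, page in views:
        by_user[user].append(page)

    # Shuffle: collect mapper output by key, as a MapReduce framework would.
    shuffled = defaultdict(list)
    for pages in by_user.values():
        for key, value in map_views(pages):
            shuffled[key].append(value)

    # Reduce each key, then regroup into page -> [(other_page, count), ...].
    result = defaultdict(list)
    for key, values in shuffled.items():
        (page, other), total = reduce_counts(key, values)
        result[page].append((other, total))

    # Most co-viewed first; ties broken alphabetically.
    for page in result:
        result[page].sort(key=lambda item: (-item[1], item[0]))
    return dict(result)


if __name__ == "__main__":
    for other, count in also_viewed(VIEWS)["King Burrito"]:
        print(other, count)
```

On a real cluster the grouping and shuffle are handled by Hadoop itself, so only the map and reduce logic would need to be written. That is the part a framework like mrJob asks the developer to supply.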

MrJob emerged after Yelp had issues with its own Hadoop cluster. A big job would sometimes get in the way of other jobs. From the Yelp engineering blog:

“We had a dozen or so machines that we otherwise would have gotten rid of, and whenever we pushed our code to our webservers, we’d push it to the Hadoop machines.

This was kind of cool, in that our jobs could reference any other code in our code base.

It was also not so cool. You couldn’t really tell if a job was going to work at all until you pushed it to production. But the worst part was, most of the time our cluster would sit idle, and then every once in a while, a really beefy job would come along and tie up all of our nodes, and all the other jobs would have to wait.”

The Yelp team heard about EMR and decided to move off its own Hadoop cluster and onto the AWS platform. It took some time to move the code base, but in May the team retired its Hadoop cluster and switched production to AWS. The framework built for that move became mrJob.

The Yelp team is encouraging developers to give mrJob a try. Details can be found on the Yelp Engineering blog.

Hadoop is built for big data, but a fixed in-house cluster offers little elasticity compared to what AWS can provide. Yelp’s move shows the value of a service like AWS when requirements go well beyond what an enterprise data center can handle efficiently.


Posted in General, Technology, Web.
