Chaos Monkey: How Netflix Uses Random Failure to Ensure Success

In a post last week about lessons learned using Amazon Web Services, Netflix’s John Ciancutti revealed that the company built something called “Chaos Monkey” to ensure that individual components work independently. Chaos Monkey randomly kills instances and services within Netflix’s AWS infrastructure so that developers can verify that each component still returns something even when the systems it depends on aren’t responding.
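The idea of randomly killing instances can be illustrated with a minimal sketch. This is not Netflix’s implementation and makes no AWS calls: the `Instance` class, the `unleash_monkey` function and the `kill_probability` parameter are all hypothetical names invented for illustration, and the "fleet" is just a list of in-memory objects.

```python
import random


class Instance:
    """Hypothetical stand-in for a running service instance (not an AWS object)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def terminate(self):
        self.alive = False


def unleash_monkey(instances, kill_probability, rng=None):
    """Terminate each live instance with the given probability.

    Returns the names of the instances that were terminated on this pass.
    """
    if rng is None:
        rng = random.Random()
    killed = []
    for inst in instances:
        # rng.random() is in [0.0, 1.0), so 0.0 never kills and 1.0 always kills.
        if inst.alive and rng.random() < kill_probability:
            inst.terminate()
            killed.append(inst.name)
    return killed


if __name__ == "__main__":
    fleet = [Instance(f"svc-{i}") for i in range(5)]
    print("terminated:", unleash_monkey(fleet, kill_probability=0.2))
    print("still alive:", [i.name for i in fleet if i.alive])
```

Running a pass like this on a schedule, rather than once, is what turns failure into a routine condition that every dependent component has to handle.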

For example, if the recommendation system is down, Netflix will display popular titles instead of personalized picks. The quality of the response is degraded, but at least there is a response. Ciancutti explains it this way: “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
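The fallback described above might look something like the following sketch. It assumes a hypothetical `fetch_personalized` callable that raises a hypothetical `ServiceUnavailable` exception when the recommendation system is down; the placeholder titles and the `degraded` flag are also invented for illustration and do not come from Netflix.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised when a dependency does not respond."""


# Placeholder data standing in for a precomputed list of popular titles.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def titles_for(user_id, fetch_personalized, fallback=POPULAR_TITLES):
    """Return personalized picks, or popular titles if the dependency is down."""
    try:
        return {"titles": fetch_personalized(user_id), "degraded": False}
    except ServiceUnavailable:
        # Degraded but still useful: the caller always gets something to display.
        return {"titles": list(fallback), "degraded": True}


if __name__ == "__main__":
    def recommendations_down(user_id):
        raise ServiceUnavailable("recommendation system not responding")

    print(titles_for(42, recommendations_down))
```

Catching only the specific failure keeps genuine bugs visible, while the `degraded` flag lets callers log or measure how often the fallback path is taken.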

Here are the lessons that, according to Ciancutti, Netflix has learned:

Dorothy, you’re not in Kansas anymore (“You need to be prepared to unlearn a lot of what you know”)
Co-tenancy is hard
The best way to avoid failure is to fail constantly
Learn with real scale, not toy models
Commit yourself

Chaos Monkey falls under the third lesson: the best way to avoid failure is to fail constantly.

For more advice on migrating to the cloud from Netflix, check out our article Netflix’s Advice on Moving to Amazon Web Services.

Posted in General, Technology, Web.

Tagged with Cloud Computing.

By Klint Finley – December 21, 2010
