
Well I have nothing constructive to add, but it certainly makes me rethink using MongoDB in any sort of serious production environment. I've been hesitant to use MongoDB for this very reason...what happens when things go wrong? I'm certainly not an expert in how to handle those situations, and there aren't very many of those experts out there. Unfortunately for Foursquare, they got caught with their pants down, not having someone on staff who really understands all the ins & outs of a database technology their entire service depends upon.


You can get commercial support from 10gen (the MongoDB developers), as we do at my company. It's just like MySQL providing support for their enterprise DB. Of course MySQL is much more widely used than MongoDB, so there is more community help and knowledge.


Right, I'm aware. Not everybody can afford a contract though, nor should they have to in order to avoid major outages like this.

But I would like to know if Foursquare has a commercial support contract with 10gen. If they didn't, why not? Especially for a service that big? If they did, how was it that 10gen took that long to fix the problem?


They share USV as an investor. I'd be astonished if they didn't.


Heh...that's one hell of a commercial support contract then. Should be interesting to read 10gen's own detailed post-mortem.


If I were 10gen, it would be in my best interest to offer support to 4sq, contract or not. People (like us) are watching.


I'm sure that if you're the size of Foursquare, you can afford the contract.

Then again, it's no excuse for bad software. I've only used Mongo on small sites so far, and have been loving it.


Its all fun and games until someone gets a shard in the eye.


I'm sure it was diagnosed quickly. Sometimes you have to copy data, rebuild indexes, etc.


According to that blog post it hasn't actually been properly diagnosed yet. They see the symptoms but they don't know why they're seeing those symptoms.


Honestly though, would MySQL or PostgreSQL really have helped out in this situation? Sharded or not, there's really not much one can do once a server (or the set of replicating servers comprising the shard) starts to become overloaded. Increasing the capacity of the shard by adding more hardware will induce a significant amount of load by itself. Of course, that's just one piece of the puzzle. We still don't know what actually brought the site down completely; hopefully they'll be able to track it down and fill us in on that.


Here's what I do in these situations (I'm an Oracle DBA, but this should apply to most loaded shards):

1) Use connection pooling at the application layer to prevent overloading the DB on any specific shard. If a shard has 16 CPUs, having 16 connections sounds reasonable; additional connections will not give you more performance. This means you need to queue and throttle requests at the application layer, and with some thought you can probably figure out what to do with the waiting users - show partial results? show a nice whale? A "loading, please wait" sign?
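The throttling idea in #1 can be sketched as a semaphore-capped pool. This is a minimal illustration, not anything 10gen or Foursquare actually ran; the pool size, timeout, and fallback are all placeholder values:

```python
import threading

class BoundedPool:
    """Cap concurrent DB work against one shard: extra requests wait
    briefly for a slot, then get a fallback (partial results, a
    "loading, please wait" page) instead of piling onto the DB."""

    def __init__(self, size, wait_timeout=0.05):
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = wait_timeout

    def run(self, job, fallback):
        # Try to grab one of the N connection slots; give up after timeout.
        if not self._slots.acquire(timeout=self._timeout):
            return fallback()      # queue overflowed: degrade gracefully
        try:
            return job()           # job() would use a real DB connection
        finally:
            self._slots.release()

pool = BoundedPool(size=16)        # e.g. one slot per CPU on the shard
result = pool.run(lambda: "rows", lambda: "please wait")
print(result)
```

The point is that the queueing and the degraded response live in the application, so the shard itself never sees more than `size` concurrent requests.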

2) If you didn't do #1 and the DB is getting overloaded, my normal response is to start shooting down connections. Oracle has a separate unix process per connection; MySQL has its own way of shooting connections down. Put up a small script that will kill the correct percentage of sessions to prevent overload on shared resources. This will generate lots of errors and will cause a percentage of the users to hate you, but you won't be down.
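A sketch of the "kill a percentage of sessions" script in #2. Only the selection step is shown; on MySQL you'd fetch the ids from SHOW PROCESSLIST and issue `KILL <id>` for each victim, and on Oracle you'd signal the per-connection server processes. The id values and protected set here are made up for illustration:

```python
import random

def sessions_to_kill(session_ids, fraction, protect=()):
    """Pick roughly `fraction` of the sessions to terminate, sparing
    anything in `protect` (replication threads, admin connections)."""
    candidates = [s for s in session_ids if s not in protect]
    k = int(len(candidates) * fraction)
    return random.sample(candidates, k)

# Pretend these came from SHOW PROCESSLIST; 100 and 101 are, say,
# the replication and monitoring connections we must not touch.
ids = list(range(100, 120))
victims = sessions_to_kill(ids, fraction=0.25, protect={100, 101})
print(sorted(victims))   # 4 of the 18 killable sessions
```

Shedding a fixed fraction like this keeps the surviving sessions fast, which is usually better than letting every session crawl.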


1) Application connection pooling won't scale. In a scenario like FourSquare, there is likely a 4:1 ratio of app server to DB shard server. Further, connections don't necessarily equal load.

2) This sounds like a great way to create data inconsistencies, unless you've got very tight constraints on your database, which is impossible in a sharded scenario.

I agree though, that ultimately they should have had some way to "fail whale" instead of getting overloaded.


The whole point of MongoDB is that you SHOULDN'T have to worry about a server becoming overloaded! You're giving up a lot for this privilege too, so if that doesn't even work properly then... back to PostgreSQL in my opinion.


MongoDB is designed to help ease the pain of scaling, but what you're asking for is magic. If all of a sudden a large number of requests start coming through that overload a shard (say several very popular users ended up colocated on one shard), how is MongoDB going to anticipate this? Any kind of shard scaling will only work well in scenarios where you have a reasonable increase in load, not a crippling surge.

In addition, it's very, VERY difficult to scale writes against a single object, such as Justin Bieber's profile data, say, if you've got a view counter on it. You can either serialize writes on read like Cassandra does, which has its own drawbacks (the more writers an object has, the more expensive reads become), or you can have single-master-for-an-object sharding like MongoDB employs and most other production sites (Facebook, Flickr, etc.) use.
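One common workaround for that hot-counter problem (not mentioned above, offered only as an illustration of the same write/read tradeoff) is to shard the counter itself: writers scatter across N cells, and the read pays by summing all of them:

```python
import random

class ShardedCounter:
    """Spread increments for one hot object (e.g. a profile view counter)
    across N cells so concurrent writers rarely contend on one row."""

    def __init__(self, n_cells=8):
        self.cells = [0] * n_cells

    def incr(self, amount=1):
        # Each writer touches one randomly chosen cell, not a single hot row.
        self.cells[random.randrange(len(self.cells))] += amount

    def value(self):
        # Reads aggregate every cell -- the more cells you add for
        # write throughput, the more expensive each read becomes.
        return sum(self.cells)

c = ShardedCounter()
for _ in range(1000):
    c.incr()
print(c.value())   # 1000
```

Same shape as the Cassandra tradeoff described above: you buy write concurrency by making reads do the reconciliation work.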


How is this primarily an issue of overload? One particular shard got overloaded, yes, but the real issue is how that brought the whole system down.


There was no sudden crippling surge in this case as far as I can tell. There was no mass-updating of a single object either. It failed even though it fits your ideal situation pretty much perfectly.



