I started to compile a list of guidelines for developing big systems. So, what is a big system?
- "Big" means 5k - 100k concurrent users.
- "Massive" 50k up to 1 Mio. concurrent and
- "Mainstream" 500k up to 10 Mio. concurrent and beyond.
Examples:
- "Big": Xing, major online newspapers, Wikipedia, many browser games,
- "Massive": Second Life, EVE-Online (barely), Runescape (I guess), Habbo, Facebook,
- "Mainstream": WoW (barely), major IM networks (QQ, ICQ, MSN), Google search, mobile networks.
- "Big" ranges from "substantially over the top of your single server" to "the complete cluster would appear in the Top 500 Supercomputers list". EVE-Online is at the top of "Big" by sheer numbers, but it is on the brink to "Massive" because of its complexity. Xing: more than 10k click around on the web site at any time, sometimes twice as much.
- "Massive" is really large. Think about thousands or ten times more servers. That's a decent farm. Second Life is at the lower end. Their server numbers are a bit high for the user count, because of their architecture. Facebook is somwhere in the middle with 50 TB cache memory alone. This number is already 20,000 times more than your PC and growing. We are talking about the large web services, companies with 10 B market capitalization. These are "Massive". The big guys.
- "Mainstream" is unthinkable. Can you imagine half a million servers? If you stack them, they scratch the International Space Station. If you plan a system like this for everyone, then better avoid the hazzle and make it distributed like the Web. It's made for 10% of the world population (registered users not concurrent). There are few such systems on the planet. Few people managed to build in this order of magnitude. I am sure they grew into it with the system. Nothing that you can buy.
But there are also factors, which make the complexity similar or at least in the same order of magnitude.
- As an example, 10k HTTP/HTML readers make more network connections (easily 10k conns per sec.) than 10k players (100 conns per sec. when people log in).
- The data delivered may be in the same order of magnitude. Lets face it: a web site consists of min 100 items and active readers fetch 0,1 pages per second. This makes 10k x 0,1 x 100 = 100k items/sec. I doubt, that MMOGs deliver more than 100 items/sec to clients. This is 10 times more than what the web server does, but most MMOG items are just IDs of data pre-installed at the client. The web-items are all completely transferred. This reduces the difference. SecondLife transfers really many items by data. For a virtual world it is still an exception and it shows in the bad performance when discovering new areas.
- Then, there is a stupid reason, why web servers may feel the same load as MMOG servers: average web server technology is much worse, especially if they use scripting languages. Nobody would make MMOG servers in PHP. They are compiled code, C++, C#, compiled Java, compiled Python. But many "Big" web sites run on (uncompiled) byte code. Better than the script, but still byte code.
- Also, apart from good and bad technologies, there is good and not so good (=stupid) code. This happens to all types of systems and can make an MMOG more responsive than a web site at the same number of users.
- Across all types of services, some are sharded, some are unique worlds. EVE-Online is unique, which means everyone can play with everyone. WoW is sharded into 1,000 (?) servers of up to 3,000 (?) concurrent players. Your friend is on a different server? you do not even chat to him. Sharding largely reduces communication and DB load. If shard=host then communication overhead disappears. A DB per shard allows for 100 or 1000 times less DB load per instance. This helps a lot.
- After all, each category covers an order of magnitude. There is much room for "small" or "large" inside a category.
Distributed systems do not really count here. This is about operating the resources required to support concurrent users. E.g. the Web is not operated by a single entity (though you could count the DNS). The Skype network has a (cool) architecture, that lets Skype, the company manage the network without the need to provide all required resources. Skype is also not an issue here. But most systems are operated by someone with a server farm and this is about what these people do.
So, these lessons are for "Big" systems. Actually primarily for web servers but also some messaging issues and general remarks. Some lessons are obvious, some are obvious in hindsight, some are not so obvious, some may be interesting even if you are one of the few (thousand) architects of big systems worldwide. Read the (far from complete but growing) list of lessons here.