DistributedHyperActive
Version 9 (yossarian, 04/23/2009 11:46 pm)
| 1 | 1 | h1. DistributedHyperActive |
|
|---|---|---|---|
| 2 | 1 | ||
| 3 | 5 | yossarian | For a while now, we've been quietly talking about ways to make Hyperactive run as a distributed application - that is, to make it run across more than one computer in more than one data center. There are a lot of possible approaches to making this happen, and each of them has its own strengths and weaknesses. |
| 4 | 5 | yossarian | |
| 5 | 1 | h2. Why? |
|
| 6 | 1 | ||
| 7 | 5 | yossarian | Basically, we need resilience and reliability. What happens if a server gets taken away or goes offline? Can people still read content? Can they add and edit content? How much work is it to get the site up and running again? Can we keep it working under heavy load by spreading that load across multiple computers? |
| 8 | 2 | mish | |
| 9 | 1 | h2. How? |
|
| 10 | 5 | yossarian | |
| 11 | 6 | yossarian | This is where things get complex. There are quite a number of different ways to distribute a web application across multiple servers. For the purposes of news production in an increasingly hostile legal environment, we cannot put all our boxes in the same datacenter, which is how many "distributed" setups run. We assume that a warrant can be served on a datacenter rather than on a single machine, and in fact that it will be generally desirable to have boxes in multiple legal jurisdictions. This means that we need to be able to run Hyperactive, or at least some important parts of it, on machines which are physically distant from each other, separated by the Internet. |
| 12 | 5 | yossarian | |
| 13 | 1 | ||
| 14 | 7 | yossarian | h3. Static HTML producer and rsync |
| 15 | 1 | ||
| 16 | 7 | yossarian | This is the way Mir works - the production (publish) server generates static HTML for every page, which can then be copied to other "mirror" servers. These mirror servers can do most of the work of serving the content, but can't update it. If the publish server goes down, nothing new can be published until a replacement publish server is set up. Mirror servers also get copies of all uploaded media files (photos, videos, audio) and serve those too. |
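A minimal sketch of the "copy to mirrors" step, assuming the static HTML and media live in one local directory and each mirror is reachable over SSH. The host names, paths, and helper names here are hypothetical; only the rsync flags (@-a@, @-z@, @--delete@) are standard.

```ruby
# Hypothetical mirror list; replace with the real mirror hosts.
MIRRORS = ["mirror1.example.org", "mirror2.example.org"].freeze

# Build the rsync argument list that pushes a local directory to one mirror.
# -a preserves permissions and timestamps, -z compresses data in transit,
# --delete removes files on the mirror that no longer exist on the producer.
def rsync_command(source_dir, host, remote_dir)
  # A trailing slash makes rsync copy the directory's contents, not the directory itself.
  source = source_dir.end_with?("/") ? source_dir : "#{source_dir}/"
  ["rsync", "-az", "--delete", source, "#{host}:#{remote_dir}"]
end

# Push to every mirror and return a hash of host => true/false (success).
# The runner is injectable so the function can be dry-run without a network.
def push_to_mirrors(source_dir, remote_dir, mirrors: MIRRORS, runner: ->(cmd) { system(*cmd) })
  mirrors.each_with_object({}) do |host, results|
    results[host] = runner.call(rsync_command(source_dir, host, remote_dir)) ? true : false
  end
end
```

Calling @push_to_mirrors("public", "/var/www/site")@ from a cron job or after each publish would keep the mirrors in step with the producer. A mirror that fails shows up as @false@ in the result and can be retried.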
| 17 | 1 | ||
| 18 | 8 | yossarian | *Current status:* Hyperactive is already designed with easy cacheability of HTML pages in mind, to the extent that during the G20 summit demonstrations in London, 97.7% of all requests were served as static HTML (via Apache) rather than as dynamic requests requiring the Ruby executable and Rails framework to be loaded. This approach has the disadvantage of constraining us slightly in our user interface design, and the advantage that we can actually set the site up with a static HTML producer very easily. The site has extremely good performance on crappy hardware. |
| 19 | 1 | ||
| 20 | 8 | yossarian | *What it would take to make it:* We'd probably need to override some methods in the normal Sweeper classes, which are part of the Rails framework. Currently Hyperactive uses the normal Rails full-page caching mechanism, which works more or less as follows. Let's take a published article as an example. |
| 21 | 8 | yossarian | |
| 22 | 8 | yossarian | # A user publishes an article. The title, body, and other necessary stuff get saved to the database. The user is happy. |
| 23 | 8 | yossarian | # A (potentially different) user views the article. Since there is no HTML page existing on disk to show this user, Rails fires up, grabs the data, formats the page, and sends it to the user's browser. *As a byproduct of this*, Rails also saves the generated HTML output as a file on disk. |
| 24 | 8 | yossarian | # Another user views the article. Because there is a cached HTML file on disk, the web server will give that back to the user without firing up Rails at all. It should be noted that serving a page as a static file like this is roughly 100 times faster than hitting Rails for the same thing. Put another way, the same server is likely to be able to handle 100x as many viewers using static HTML as using a Rails application without this sort of caching. |
| 25 | 9 | yossarian | # If the article is edited by someone (let's say a site administrator turns it into a feature), the act of saving the page destroys the cached HTML file on disk. |
| 26 | 9 | yossarian | # The next time it gets viewed, the article will again be cached as an HTML file on disk. |
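The lifecycle in the steps above can be modelled in a few lines of plain Ruby. This is only an illustration of the behaviour, not the Rails implementation: the class name is hypothetical and an in-memory hash stands in for the HTML files on disk.

```ruby
# Toy model of view-time ("lazy") page caching.
class LazyPageCache
  attr_reader :render_count

  # The block plays the role of Rails rendering a page from the database.
  def initialize(&renderer)
    @renderer = renderer
    @pages = {}          # stands in for cached HTML files on disk
    @render_count = 0    # how many times the slow path was taken
  end

  # Steps 2 and 3: return the cached copy if there is one; otherwise
  # render the page and store the result as a byproduct.
  def view(id)
    @pages.fetch(id) do
      @render_count += 1
      @pages[id] = @renderer.call(id)
    end
  end

  # Step 4: saving an edit throws the cached copy away.
  def expire(id)
    @pages.delete(id)
  end

  def cached?(id)
    @pages.key?(id)
  end
end
```

Note that after @expire@ nothing is cached until somebody calls @view@ again, which is step 5: the page only comes back into existence when it is next requested.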
| 27 | 9 | yossarian | |
| 28 | 9 | yossarian | The main problem with all of this from the standpoint of using Hyperactive as a static HTML producer is that the HTML caching comes at the wrong point for our purposes. Ideally, we'd like the HTML file to be generated *whenever the article is saved*, rather than *when somebody views the article*. |
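As a rough sketch of what "generate on save" could look like, here is a framework-free Ruby version. It does not use the Rails Sweeper API. The @Article@ struct, @StaticPageWriter@ class, @save_article@ helper, and the @articles/ID.html@ layout are all assumptions made for illustration.

```ruby
require "erb"
require "fileutils"

# Hypothetical stand-in for a Hyperactive article record.
Article = Struct.new(:id, :title, :body)

# Renders an article to a static HTML file under output_dir.
class StaticPageWriter
  TEMPLATE = ERB.new(<<~HTML)
    <html>
      <head><title><%= ERB::Util.html_escape(article.title) %></title></head>
      <body>
        <h1><%= ERB::Util.html_escape(article.title) %></h1>
        <div><%= ERB::Util.html_escape(article.body) %></div>
      </body>
    </html>
  HTML

  def initialize(output_dir)
    @output_dir = output_dir
  end

  def path_for(article)
    File.join(@output_dir, "articles", "#{article.id}.html")
  end

  # Render the page and write it to disk, overwriting any older copy.
  def write(article)
    path = path_for(article)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, TEMPLATE.result_with_hash(article: article))
    path
  end
end

# Hypothetical save hook: the HTML is regenerated at save time,
# so the file exists before anybody has viewed the article.
def save_article(article, store, writer)
  store[article.id] = article   # stands in for the database write
  writer.write(article)
end
```

With this shape, every save leaves an up-to-date file on disk that a separate sync job can copy to the mirrors, without the publish server ever needing to receive a page view.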
| 29 | 7 | yossarian | |
| 30 | 7 | yossarian | h3. Master-Slave MySQL replication |
| 31 | 7 | yossarian | |
| 32 | 2 | mish | This would have multiple servers able to act as the publish server, though only one (the master) does so at any one time. The master sends all changes to the database to the others (the slaves). Then if the master goes offline, one of the slaves can be promoted to master and off we go again. |
| 33 | 2 | mish | |
| 34 | 1 | Anything not stored in the database would need to be distributed in another way - e.g. rsync of media files. |
|
| 35 | 1 | ||
| 36 | 2 | mish | (though I wonder if the 'no slaves, no masters' crowd would object ;) |
| 37 | 7 | yossarian | |
| 38 | 7 | yossarian | h3. Master-Master MySQL replication |
| 39 | 2 | mish | |
| 40 | 1 | h3. Reverse proxies |
|
| 41 | 1 | ||
| 42 | 2 | mish | Other servers can be set up to serve the content, spreading the load in times of high usage. The first time the proxy is asked for a page, it asks the publish server. After that it just returns its own copy of the page until the expiry time has passed, at which point it asks the publish server again. |
| 43 | 2 | mish | |
| 44 | 2 | mish | It spreads the load, but does not provide a full copy of the original. |
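The fetch-once-then-serve-until-expiry behaviour can be illustrated with a toy cache in Ruby. In practice this job belongs to proxy software rather than application code; the class name, the @ttl@ parameter, and the injectable clock are all hypothetical and exist only to make the behaviour visible.

```ruby
# Toy model of a caching reverse proxy with a fixed expiry time.
class ExpiringProxyCache
  Entry = Struct.new(:body, :fetched_at)

  attr_reader :origin_requests

  # ttl:    how long (in seconds) a cached copy stays valid
  # origin: any callable that takes a path and returns the page body
  #         (stands in for a request to the publish server)
  # clock:  callable returning the current time; injectable for testing
  def initialize(ttl:, origin:, clock: -> { Time.now })
    @ttl = ttl
    @origin = origin
    @clock = clock
    @entries = {}
    @origin_requests = 0
  end

  # Return the cached body, going back to the origin only when there is
  # no copy yet or the copy is older than ttl.
  def get(path)
    now = @clock.call
    entry = @entries[path]
    if entry.nil? || now - entry.fetched_at >= @ttl
      @origin_requests += 1
      entry = Entry.new(@origin.call(path), now)
      @entries[path] = entry
    end
    entry.body
  end
end
```

The cache only ever holds pages that somebody has asked for, which is why a proxy spreads load but never amounts to a full copy of the site. It also means an edited page can be served stale for up to @ttl@ seconds.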
| 45 | 2 | mish | |
| 46 | 1 | h3. CouchDB - distributed database |
|
| 47 | 1 | ||
| 48 | 2 | mish | Having a distributed database such as "CouchDB":http://couchdb.apache.org/ means that the Rails code can run on multiple servers. It can be used from Rails via "activecouch":http://github.com/arunthampi/activecouch/tree/master - see "these tutorials":http://barkingiguana.com/tag/couchdb/ for details. |
| 49 | 2 | mish | |
| 50 | 1 | h3. Distributed filesystem |
|
| 51 | 2 | mish | |
| 52 | 2 | mish | One example is "mogilefs":http://danga.com/mogilefs/ - it stores files across multiple computers and could be used in combination with a distributed database or MySQL replication to provide a full copy. Some points about mogilefs: |
| 53 | 2 | mish | |
| 54 | 2 | mish | * From the website: "It's meant for archiving write-once files and doing only sequential reads. (though you can modify a file by way of overwriting it with a new version)" |
| 55 | 2 | mish | * It does have a "ruby library":http://seattlerb.rubyforge.org/mogilefs-client/ |
| 56 | 2 | mish | * Found a "tutorial for storing uploaded images":http://barkingiguana.com/2008/10/31/scaling-using-mogilefs-for-storing-uploaded-images |
| 57 | 3 | mish | |
| 58 | 3 | mish | It should be fairly easy to adapt the tutorial's approach to include audio and video uploads. Adding the cached HTML files would be more difficult though, as the caching is done within the depths of Rails, and the way mogilefs works is quite different from a normal file system. But then again the cached files are less of an issue - they are pretty cheap to regenerate. |
| 59 | 3 | mish | |
| 60 | 3 | mish | We should look into whether it works well across the Internet, or whether it uses a lot of bandwidth and should only be used within a single datacentre. |
| 61 | 4 | mish | |
| 62 | 4 | mish | Setup guides |
| 63 | 4 | mish | * http://www.imvu.com/blogs/index.php?blog=12&title=how_to_setup_mogilefs&more=1&c=1&tb=1&pb=1 |
| 64 | 4 | mish | * http://mogilefs.pbwiki.com/Another+How+to+Install+MogileFS+-+Debian+Sarge |