July 2008 Archives
The one thing that has always puzzled me when I've tried to figure out how the Google search engine works is results ordering. Everyone knows that Google works with massive clusters of low-end hardware, and that their algorithms are uniquely designed for this architecture. That's the big clue, and I can figure out how (at a conceptual level) every aspect of a search can be parallelized except for the final and crucial result-ordering step. It's been bothering me.
But today in the course of doing something entirely unrelated I stumbled over a layman-level summary of the MapReduce algorithm that somehow made the whole thing click in my head. What's funny there is that I've read about MapReduce plenty in the past; I just never made the connection until reading Wired's decidedly nontechnical summary.
So, long story short, I think I have a general idea how Google works now. I'm never really sure I understand something until I can explain it though, so in a fit of irony I'm going to explain my guesses about Google on a page that will probably only be read by Google's indexing bot. So hurray for antiphrasis!
But first, some context is in order:
Calling what you do with Google a 'search' is somewhat misleading; it evokes the image of Google's servers actively navigating the internet looking for you content. With a little thought it becomes obvious that that's not how it works; not only would individual Google searches take hours, but the traffic generated by even a small subset of Google's normal users would grind the entire internet to a standstill.
Someone far more clever than I once pointed out that "almost all programming can be viewed as an exercise in caching", and modern search has taken that maxim to heart. Just about every search engine -- from Google all the way down to Spotlight -- works by generating an index of the content and conducting actual queries by examining the index. Think of the index as a great big map (or hash or dictionary, whatever term you prefer) between tokens and sets of documents. If you look in the index under 'marklar' you'll get back a set of every document containing that 'word'.
Notice that I'm using the word 'set' instead of 'list'. There are two reasons for that -- first, it is indeed a set rather than a list, in the strict mathematical sense that each item (a document in this case, or more accurately a URL, ID, GUID, or other form of document identifier) appears in the collection exactly zero or one times (five is right out!). The second reason I'm being careful to say 'set' has to do with complex queries.
So if you search for 'marklar' you look in the index for the set of documents containing that term, you order them (more on that later!) and you show them to the user. Bada-bing, bada-boom, Bob's your uncle, and so on and so forth. But what if you want to search for 'smurfing marklar'? Ideally that would show you every document that contains both 'smurfing' and 'marklar', right? Well, if our index can give us the result sets for each individual term, then we can find the desired combined result set by taking the intersection of those two sets.
Finding the intersection of two lists is a pain in the butt -- O(n*m) in general, O(n+m) if you can guarantee both are sorted, but either way not the sort of thing that scales well to searching large numbers of documents (such as, for instance, the entire freaking internet). But if you store your sets in something clever called a treap (which is, exactly as it sounds, the bastard child of a tree and a heap) you can get it to work in O(m log n/m), and you can even parallelize things to work in O(log m/n) if you happen to have m cores lying around, which is, in a word, fast as hell. So: Sets? Important.
Once you look at search as an exercise in set combination, you start to see that the search term is itself a statement in a very loose and forgiving programming language: "smurfing marklar OR foo NOT bar" can be parsed into a syntax tree which can then be expressed as a series of set operations: ( (smurfing n marklar) u foo ) n ~bar, for instance. If you know the performance characteristics well enough to devise appropriate heuristics, you can even write an optimizer to tweak the abstract syntax tree for your query into a more performant form. Yet another victory for the thesis that everything in computer science is a compiler, I guess.
There's a problem there, though: how do you distribute it? Merging sets can actually be parallelized pretty nicely on a single machine (like I said, m cores get you O(log m/n) time), but there's no good way to find the intersection of a set on machine A and a set on machine B; the communication overhead would dwarf any gain you got by distributing the problem. So you really need to have the full index available on the same machine to run a query, which might not work so well when you're indexing, say, the freaking entire internet. It also means your scaling solution is going to be 'adding more and more powerful machines' instead of 'adding lots of wimpy machines', so even if that approach were viable it's not explaining Google. I'll come back to this in a few paragraphs. Wait for it.
So anyway, now you've got a set: a great big pile of documents that match your query. That's actually the 'easy' part -- Yahoo was doing that long before Google showed up and ate their lunch, and they'll probably just keep on doing it until the end of time if Microsoft doesn't buy them and turn them into Facebook 2 (subtitle: 'The Facebookening') first. The really tricky part, and the part that Google got so right, is putting that pile of search results into a meaningful order. There are traditional ways of calculating the relevance of a search result, and most of them involve looking at the incidence of a term within a document and trying to calculate how 'important' the word is to that semantic meaning of the document, and then making the assumption that if 'marklar' is heavily important to document A (40% of the tokens in the document are marklar or related terms) and minimally important to document B (the word 'marklar' only shows up once) then A is a more relevant result than B (there's more to it, but that's the conceptual gist, at least as I choose to understand it).
There are two big problems with that approach.
Problem one is that it just doesn't work very well; it's easy to game the heuristic by just repeating the target keyword over and over again through the page, a manipulation that spawned the field of search engine 'optimization' and lead to a cartload of annoying crap on the internet. You can make your heuristic smarter, but ultimately you're in an arms race against a whole host of people with strong economic incentive to break your system, and you're not going to win.
The second big problem with the sets-then-sorts approach is performance. To do this effectively you need to complete the entire set operation phase prior to sorting and displaying results, and you need to sort the entire result set before you can reliably return the first page of results to the user. That's clearly suboptimal; ideally you'd like to front-load the effort and get the first page of results out there ASAP, and then finish page 2 at your leisure while the user looks through page 1. That's the part that's always given me a headache, because I know (or at least, think I know) that Google doesn't compute the whole result set before I see page one; if that were case you'd see a vast performance disparity between loading page 1 and page 2 of the results; the first page would take longer, and subsequent pages would be almost instant, which I certainly haven't ever noticed. It would also be the case that Google knew on page 1 exactly how many pages of results there would be, which often isn't the case (more on this later, too!).
So how do I think Google does it? They cheat! Sort of.
I don't think Google actually uses MapReduce for searching (it's used for indexing, but not searching), but the design of MapReduce is the key; it's a two-stage algorithm. As the first step each distributed node produces an intermediate result set, in the form of a map. That's the important bit -- the first phase produces not a set, not a list, but a map. So back to searching: What if each page had an absolute relevance score that has nothing to do with the search, but is in fact and intrinsic property of the page -- some sort of a Page Rank, let's say. That way instead of generating a result set as one big set, you could create a bunch of different sets: the rank 1 set, the rank 0.9 set, and so on. Just break things down by rank ranges and throw results into buckets.
What does that get you? Distribute your index to multiple machines, such that every document is on exactly one node (in reality there'd be redundant, backup and failover nodes, but that's not relevant to the core algorithm). Further, have each node maintain multiple indices, one for all the rank 1 pages, one for all the rank 0.9 pages, et cetera (this could also be done by having multiple indices to the same data, but I don't think it makes a conceptual difference). Then sprinkle in a handful of master/controller nodes to manage all these index nodes. When a master node gets a search request, broadcast the search down to every slave and let them get started.
Program slaves to front-load the high-ranking working, and have them signal the controller as each bucket is filled. When a controller has heard a completion signal for rank 1 pages from every slave node, have it pull back just those rank 1 nodes, order them based on old-fashioned relevance ranking, and feed them back to the user. Meanwhile the slave nodes can keep chugging away on the rank 0.9 - 0.1 indices.
This would mean a couple of things: firstly, you wouldn't know exactly how many total results you had when your first page of results were ready; you could approximate it by checking the number of matches vs. documents checked on each node (node A has found 12 matches after scanning 10,000 pages, so since node A contains a total of 100,000 results we probably have about 120 matches on node A). If that guess is off your pager will be incorrect; a user might click page seven only to find himself on page five, with no indication six or seven exist, or a user might click the last link in the pager to find there are suddenly many more pages. Secondly, this model would mean that relevance is a second-order consideration when ranking results; a highly-ranked page that mentions 'marklar' in passing might outrank a minimally-linked post on a free hosting site that happens to be entirely about marklars. Both of those predicted outcomes match behavior I've seen from Google.
So is this how Google works? Probably not. And even if I got some of the stuff right at a high level, there are doubtless innumerable details I've missed out on.
But it was fun to think about.
 Antiphrasis is a word that means irony; of course my usage here is not according to the standard definition of irony (as Bender explained, "The use of words expressing something/Other than their literal intention") but according to the far-more-common yet far-less-correct Alanis Morissette meaning ("It's like rain/On your wedding day"). Ironically enough (Alanis), that means that my incorrect usage of the word antiphrasis here is in fact ironic (Bender). So by using it incorrectly, I'm using it correctly. Noodle on that one for a while, Gödel.
 Terje Mathisen. He doesn't seem to have a blog or home page, or I'd link to him. Sorry, semantic web. I'll never do it again.