Matrices and Google

Or, how the biggest company in the world is built on matrices.

As with most Flying Colours Maths Blog articles, I don’t claim any kind of historical accuracy. The details are likely to be wrong, but I’m not one to let the truth get in the way of the story; all the same, feel free to correct anything I have wrong. —

One day in late 1998, two students at Stanford — Sergei Brin and Larry Page — came up with an idea for a class project in Linear Algebra.

Those were dark days, my friends. If you wanted to find something on the internet, you had to use portals like AltaVista and Yahoo — there wasn’t a Google — until Page and Brin thought “You know what? Search sucks — can we make it better with maths ((They’d probably have said math. It’s not a direct quote.))?”

Spoiler alert: the answer was yes.

Their idea was this: to rank all of the pages on the internet by how likely you’d be to land on them by randomly surfing around. Important pages, like BBC and the UN and Flying Colours Maths ((Actually, FCM didn’t exist at that time. I was studying Linear Algebra in French, and it’s depressing to think that these Californian clowns were only a couple of years ahead of me.)), would have lots of links pointing to them, so you’d be more likely to land on them; rubbish pages like MyMaths and the Daily Mail wouldn’t have so many, so you’d be less likely to end up there.

As ideas go, it’s not bad. They could probably have earned some money from it, frankly. The question is: where do the matrices come in?

Brin and Page borrowed an idea from graph theory: something called the adjacency matrix. It’s a big, square matrix, with every column representing a page a surfer is on, and every row representing a page they might end up on — and every entry in the grid representing a link. If one page contains a link to another, there’d be a 1 in the appropriate column and row — and if not, there would be a 0. That’s a bit tricky to see at first, but let’s have an example: how about Alice’s site links to Bob’s and Colin’s, Bob’s links to Colin’s and Colin’s links to Alice’s. The matrix would look like this:

$\\left( \\begin{array}{cccc} & \\textrm{From A} & \\textrm{From B} & \\textrm{From C} \\\\ \\textrm{To A} & 0 & 0 & 1 \\\\ \\textrm{To B} & 1 & 0 & 0 \\\\ \\textrm{To C} & 1 & 1 & 0 \\\\ \\end{array} \\right)$

However, we’re dealing with probabilities, so each ‘from’ column needs to add up to 1. Happily, Bob’s and Colin’s already do, but we need to divide Alice’s column by 2.

$\\left( \\begin{array}{cccc} & \\textrm{From A} & \\textrm{From B} & \\textrm{From C} \\\\ \\textrm{To A} & 0 & 0 & 1 \\\\ \\textrm{To B} & \\frac12 & 0 & 0 \\\\ \\textrm{To C} & \\frac12 & 1 & 0 \\\\ \\end{array} \\right)$ How does the matrix help?

Well, here’s how. If you start on a random one of these sites, your probability of being on A is $\frac{1}{3}$, same for B, same for C — which you can write as a column matrix with three elements. If you multiply the adjacency matrix by the column, that gives you the new probability for being on each site. By hand:

You’ve got a $\frac{1}{3}$ chance of starting at A, which turns into $\frac{1}{6}$ for B and a $\frac{1}{6}$ for C. You’ve got a $\frac{1}{3}$ chance of starting at B, after which you’re certain to go to Colin’s, making an extra $\frac{1}{3}$ for Colin — who’s now up to $\frac{1}{2}$. You’ve got a $\frac{1}{3}$ chance of starting at C, after which you’re certain to go to Alice’s, making a total of $\frac{1}{3}$ for Alice.

You can check, if you like, that

$\\left( \\begin{array}{ccc} 0 & 0 & 1 \\\\ \\frac12 & 0 & 0 \\\\ \\frac12 & 1 & 0 \\\\ \\end{array} \\right) \\left( \\begin{array}{c} \\frac13 \\\\ \\frac13 \\\\ \\frac13 \\end{array} \\right) = \\left( \\begin{array}{c} \\frac13 \\\\ \\frac16 \\\\ \\frac12 \\end{array} \\right)$

So far, so ingenious. The thing is, though, we want to know where the surfer ends up in the long term. And that’s where the matrix gets powerful. Instead of working out the sums by hand each time, you just need to keep multiplying the matrix by itself until it settles down ((And, with this kind of matrix, it’s bound to settle down: it’s something called Markov chain theory and I don’t fully understand it. But I embrace its results.)). You’d use a computer for that.

For this matrix, you’d work out

$\\left( \\begin{array}{ccc} 0 & 0 & 1 \\\\ \\frac12 & 0 & 0 \\\\ \\frac12 & 1 & 0 \\\\ \\end{array} \\right)^N \\left( \\begin{array}{c} \\frac13 \\\\ \\frac13 \\\\ \\frac13 \\end{array} \\right)$ for a large number $N$ — say, 1000. It turns out to be $(\frac25, \frac15, \frac25)^T$ — which tells you that (according to Google), Alice’s and Colin’s pages are about as important as each other, and Bob’s is half as important ((These numbers form an eigenvector. At some point in the future, someone in your LinAlg class will ask ‘what are eigenvectors for?’ and you’ll have a ready answer.)).

Even though Colin’s page has more inbound links than Alice’s, one of those is from Bob’s relatively lacklustre page, so it counts for less that Alice’s sole inbound link from Colin’s. That’s a headache if you’re just looking at the links (even with a simple network of three pages), but this algorithm — the PageRank algorithm — works like an aspirin, even on a practically infinite network like the internet.

And that’s what the original version of Google was: a massive matrix calculation that said “page A is about as important as page C, and twice as important as page B”. It’s developed a wee bit since then, but one of the biggest companies on the planet is built completely on matrices.

So that’s what they’re for ((among other things.))!

* Edited 14/12/2013 for formatting.

Or, how the biggest company in the world is built on matrices.

A selection of other posts