This morning there was an intriguing post on
arXiv/math.RA
entitled A Note on
the Eigenvalues of the Google Matrix. At first I thought it was a
joke but a quick Google revealed that the
PageRank algorithm really
is at the heart of Google technology, so I simply had to find out more
about it. An extremely readable account of it can be found in
The PageRank Citation Ranking: Bringing Order to the Web which is really the
start of Google. It is coauthored by the two founders :
Larry Page and
Sergey Brin. A quote from the introduction
“To test the utility of PageRank for search, we built a web search engine called Google (Section 5)”
Here is an intuitive idea of PageRank : a page has high rank if the sum of the ranks of its backlinks (that is, pages linking to the page in question) is high and it is computed by the Random Surfer Model (see sections 2.5 and 2.6 of the paper). More formally (at least from my quick browsing of some papers, maybe the following account is slightly erroneous and I’ll have to spend some more time reading) let N be the number of webpages (estimated between 3 and 4 billion) and consider the N x N matrix A the so called GoogleMatrix where
A = cP + (1-c)(v x
vec(1)) where P is the column-stochastic matrix (meaning : all entries are zero or positive and the sum of all entries in each column adds up to 1) with entries
P(i,j) = 1/N(i) if i->j and 0
otherwise where i and j are webpages and i->j denotes that page i has a link to page j and where N(i) is the total number of pages linked to in page i (all this information is available once we download page i). c is a constant 0 < c < 1 and corresponds to the fraction of webpages containing an outlink (that is, a link to another page) by all webpages (it seems that Google uses c=0.85 as an estimate). Finally, v is a column vector with zero or positive numbers adding up to 1 and vec(1) is the constant row vector (1,…,1). The idea behind this term is that in the _Random Surfer Model to compute the PageRank the Googlebot (normally following links randomly in pages it enters) jumps every (1-c)x100% links randomly to an entirely different webpage where the chance that it will end up at page i is given by the i-th entry of v (this is to avoid being trapped in a web-loop). So, in Googles model the bot teleports itself randomly every 6th link or so. Now, the PageRank is a column-eigenvector for the GoogleMatrix A with eigenvalue 1 which can be approximated by the RandomSurfer model and the rate of convergence of this process depends on the second largest eigenvalue for A (the largest being 1). Now, in the paper posted this morning a simple proof is given that this eigenvalue is c (because the matrix P has multiple eigenvalues equal to 1). According to a previous paper on the subject The Second Eigenvalue of the Google Matrix, this statement has implications for the convergence rate of the standard PageRank algorithm as the web scales, for the stability of PageRank to perturbations to the link structure of the web, for the detection of Google spammers, and for the design of algorithms to speed up PageRank. But I’ll have to read more to understand the Google spammers bit…
arxiv, google
2 comments
Posted in general
Written on Fri, 16 January 2004 at 12:55 pm
Tags: arxiv, google
If you liked this post, then consider subscribing to our full RSS feed.
December 28th, 2007 at 1:09 pm
[...] the GoogleMatrix I tried to understand the concept of the PageRank algorithm that Google uses to list pages [...]
September 1st, 2008 at 4:40 am
Hi, I’m trying to find a sample of the Google adjacency matrix, in order to do my thesis, but I haven’t found anything; I don’t know if you could know where I could get it.
Thanks,
Alejandro