
Tag: arxiv

arxiv RSS feeds available

If you are interested in getting daily RSS feeds of one (or more) of the
following arXiv sections: math.RA, math.AG, math.QA and math.RT, you can
point your news-aggregator to www.matrix.ua.ac.be. I explained most of
the solution to my first Perl-exercise yesterday, but the current
program has a few changes. My first idea was to scrape the recent-files
from the arXiv; for example, for math.RA I would get
http://www.arxiv.org/list/math.RA/recent, but this contains only titles,
authors and links and no abstracts of the papers. So I thought I had to
scrape the URLs of these papers and then download each of the
abstract-files. Fortunately, I found a way around this. There is a
lesser-known way to get at all math abstracts of the current day (or the
last few days) by using the Catchup interface. The syntax of this
interface is as follows: for example, to get all math-papers with
abstracts posted on April 2, 2004 you have to get the page with URL

http://www.arxiv.org/catchup?smonth=04&sday=02&num=50&archive=math&method=with&syear=2004

so in order to use it I had to find a way to turn the present day into
a numeric day, month, year format. This is quite easy as there is the
very well documented Date::Manip-module in Perl. Another problem with
the arXiv is that there are no posts in the weekend. I worked around
this by requesting the Catchup starting from the previous business day
(an option of the DateCalc-function). This means that over the weekend I
get the RSS feeds of papers posted on Friday, on Monday I’ll get those
of Friday and Monday, and on all other days I’ll get those of today and
yesterday. But it is easy to change the script to allow for a longer
period, so please tell me if you want to have RSS feeds for the last 3
or 4 days. Also, feeds for other sections can easily be added, so tell
me if you need them.
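
The date logic described above can be sketched roughly as follows. This is a minimal Python sketch, not the Perl script itself: it uses the standard datetime module instead of Date::Manip, the function names previous_business_day and catchup_url are made up for illustration, and the query parameters are simply copied from the example URL above (whether the arXiv still serves this interface is not checked here).

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def previous_business_day(day):
    """Return the most recent weekday (Monday to Friday) strictly before `day`."""
    day -= timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def catchup_url(today, archive="math", num=50):
    """Build a Catchup URL starting from the business day before `today`.

    Parameter names and their order mirror the example URL in the post;
    they are assumptions about the interface, not a documented API.
    """
    start = previous_business_day(today)
    query = urlencode({
        "smonth": f"{start.month:02d}",
        "sday": f"{start.day:02d}",
        "num": num,
        "archive": archive,
        "method": "with",  # "with" asks for abstracts, as in the post
        "syear": start.year,
    })
    return "http://www.arxiv.org/catchup?" + query


if __name__ == "__main__":
    print(catchup_url(date.today()))
```

Called for Monday, April 5, 2004, catchup_url reproduces the example URL above, because the previous business day is Friday, April 2. On a Saturday or Sunday the start date is also the Friday before, and on any other day it is simply yesterday.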
Here are the URLs to give to your news-aggregator for these sections:

math.RA at
http://www.matrix.ua.ac.be/arxivRSS/mathRA/
math.QA at
http://www.matrix.ua.ac.be/arxivRSS/mathQA/
math.RT at
http://www.matrix.ua.ac.be/arxivRSS/mathRT/
math.AG at
http://www.matrix.ua.ac.be/arxivRSS/mathAG/

If your news-aggregator is not clever then you may have to add an
additional index.xml at the end. If you would like to use these feeds on
a Mac, a good free news-aggregator is NetNewsWire Lite. To get at the
above feeds, click on the Subscribe button and copy one of the above
links into the pop-up window. I don’t think my Perl-script breaks the
Robots Beware rule of the arXiv. All it does is download one page a day
using their Catchup-method. I still have to set up a cron-job to do this
daily, but I have to find out at which (local) time at night the arXiv
refreshes its pages…


my first scraper

As far as I know (but I am fairly ignorant) the arXiv does not provide
RSS feeds for a particular section, say math.RA. Still, such feeds would
be a good idea for anyone who uses a news aggregator to follow weblogs
and news-channels with RSS syndication. So I decided to write one as my
first Perl-exercise and, to my own surprise, after a few hours of work I
have a prototype-scraper for math.RA. It is not yet perfect: I still
have to convert the local URLs to global URLs so that they can be
clicked, and at the moment I have only collected the titles, authors and
abstract-links whereas it would make more sense to include the full
abstract in the RSS feed, but give me a few more days…
The basic idea is fairly simple (and based on an O’Reilly hack). One
uses the Template::Extract module to extract the goodies from the
arXiv’s template HTML. Maybe I am still not used to Perl-documentation,
but it was hard for me to work out how to do this in detail from either
the hack or the online module-documentation. Fortunately there is a good
Perl Advent Calendar page giving me the details that I needed. Once one
has this info one can turn it into a proper RSS-page using the
XML::RSS-module.
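
To make the last step concrete, here is a minimal Python sketch of turning already-extracted paper data into an RSS 2.0 page, including the conversion of local URLs to global ones. It is not the Perl script and does not use Template::Extract or XML::RSS: it relies only on the Python standard library, the function name build_rss and the dictionary keys are invented for illustration, and the extraction step itself is assumed to have happened already.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def build_rss(title, link, description, papers, base_url):
    """Return an RSS 2.0 document (as a string) for a list of papers.

    `papers` is a list of dicts with the hypothetical keys "title",
    "authors" and "link"; a relative "link" is resolved against
    `base_url`, the page the data was scraped from.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for paper in papers:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = paper["title"]
        # Local URL -> global URL, so that the link can be clicked.
        ET.SubElement(item, "link").text = urljoin(base_url, paper["link"])
        ET.SubElement(item, "description").text = paper["authors"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Placeholder data, not a real paper.
    example = [{"title": "A hypothetical paper",
                "authors": "A. Author",
                "link": "/abs/math.RA/0000000"}]
    print(build_rss("math.RA", "http://www.arxiv.org/list/math.RA/recent",
                    "Recent math.RA papers", example,
                    "http://www.arxiv.org/list/math.RA/recent"))
```

Putting the authors in the description field is just one possible choice; once full abstracts are available they would more naturally go there.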
In fact, I spent far more time trying to get XML::RSS installed under
OS X than writing the code. The usual method, that is via

iMacLieven:~ lieven$ sudo /usr/bin/perl -MCPAN -e shell
Terminal does not support AddHistory.
cpan shell -- CPAN exploration and modules installation (v1.76)
ReadLine support available (try 'install Bundle::CPAN')
cpan> install XML::RSS

failed, and even a manual install, for which the drill is: download the
package from CPAN, go to the extracted directory and give the commands

sudo /usr/bin/perl Makefile.PL
sudo make
sudo make test
sudo make install

failed. A Google search didn’t give immediate results either, until I
found this ADC page which set me on the right track. It seems that the
problem is in installing XML::Parser, for which one first needs expat to
be installed. Now, the generic sourceforge page contains a version for
Linux, but fortunately it is also part of the Fink project so I did a

sudo fink install expat

which worked
without problems but afterwards I still was not able to install
XML::Parser because Fink installs everything in the /sw
tree. But after

sudo perl Makefile.PL EXPATLIBPATH=/sw/lib EXPATINCPATH=/sw/include

I finally got the manual installation
going. I will try to tidy up the script over the weekend…


robots.txt

I just finished the formal lecture-part of the course Projects in
non-commutative geometry (btw. I am completely exhausted after this
afternoon’s session but hopeful that some students may actually do
something with my crazy ideas), springtime seems to have arrived and
next week the Easter vacation starts, so it may be time to have some fun
like making a new webpage (yes, again…). At the moment the main
matrix.ua.ac.be page is not really up to standards, and Raf and Hans
will be using it soon for the information about the Liegrits-project (at
the moment they just have a beautiful logo). My aim is to make the main
page the starting page of the geoMetry site (guess what M stands for?)
on which I want to collect as much information as possible on
non-commutative geometry. To get at that info I plan to set some spiders
or bots or scrapers loose on the web (this is just an excuse to force
myself to learn Perl). But it seems one has to follow strict ethical
guidelines in doing so. One of the first sites I want to spider is
clearly the arXiv, but they have a scary Robots Beware page! I don’t
know whether their robots.txt file will allow me to get at any of their
goodies. In a robots.txt file the webmaster can list the directories on
his/her site which are off limits to robots, and as I don’t want to do
anything that may make the arXiv unavailable to me (or even worse, to
the whole department) I had better follow these guidelines. First site
on my list to study tomorrow will be The Web Robots Pages.
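
As an illustration of how a bot can respect such a file, here is a small Python sketch using the standard urllib.robotparser module (a Perl script would need its own equivalent). The robots.txt content, the bot name and the example.org URLs below are all made up for the example; they are not the arXiv's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is kept out of /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/private/page.html",
                "http://example.org/list/math.RA/recent"):
        print(url, allowed(EXAMPLE_ROBOTS_TXT, "examplebot", url))
```

A real spider would download the site's own robots.txt first (RobotFileParser can also do that via set_url and read) and skip every URL for which the check fails.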


Borcherds’ monster papers


Yesterday morning I thought that I could use some discussions I had a
week before with Markus Reineke to begin to make sense of one
sentence in Kontsevich’ Arbeitstagung talk Non-commutative smooth
spaces :

It seems plausible that Borcherds’ infinite rank
algebras with Monstrous symmetry can be realized inside Hall-Ringel
algebras for some small smooth noncommutative
spaces

However, as I’m running on a 68K RAM-memory, I
didn’t recall the fine details of all connections between the monster,
moonshine, vertex algebras and the like. Fortunately, there is the vast
amount of knowledge buried in the arXiv and a quick search on Borcherds gave me a
list of 17 papers. Among
these there are some delightful short (3 to 8 pages) expository papers
that gave me a quick recap on things I once must have read but forgot.
Moreover, Richard Borcherds has the gift of writing papers that are at
the same time readable and informative. If you want to get to the essence of
things in 15 minutes I can recommend What
is a vertex algebra?
(“The answer to the question in the title is
that a vertex algebra is really a sort of commutative ring.”), What
is moonshine?
(“At the time he discovered these relations, several
people thought it so unlikely that there could be a relation between the
monster and the elliptic modular function that they politely told McKay
that he was talking nonsense.”) and What
is the monster?
(“3. It is the automorphism group of the monster
vertex algebra. (This is probably the best answer.)”). Borcherds also
maintains a homepage on which I found a few more (longer) expository
papers: Problems in moonshine and Automorphic forms and Lie algebras.
After these preliminaries it was time for the real goodies such as The
fake monster formal group, Quantum vertex algebras and the like.
After a day of enjoyable reading I think I’m again ‘a point’
wrt. vertex algebras. Unfortunately, I completely forgot what all this
could have to do with Kontsevich’ remark…


the google matrix

This morning there was an intriguing post on arXiv/math.RA entitled
A Note on the Eigenvalues of the Google Matrix. At first I thought it was a
joke but a quick Google revealed that the PageRank algorithm really
is at the heart of Google technology, so I simply had to find out more
about it. An extremely readable account of it can be found in The PageRank Citation Ranking: Bringing Order to the Web which is really the
start of Google. It is coauthored by the two founders : Larry Page and
Sergey Brin. A quote from the introduction

“To test the utility of PageRank for search, we built a web
search engine called Google (Section 5)”

Here is an intuitive idea of
_PageRank_ : a page has high rank if the sum of the ranks of its
_backlinks_ (that is, pages linking to the page in question) is
high and it is computed by the _Random Surfer Model_ (see
sections 2.5 and 2.6 of the paper). More formally (at least from my
quick browsing of some papers, maybe the following account is slightly
erroneous and I’ll have to spend some more time reading) let
N be the number of webpages (estimated between 3 and 4
billion) and consider the N x N matrix
A the so called GoogleMatrix where

A = cP + (1-c)(v x vec(1))

where P is the
column-stochastic matrix (meaning : all entries are zero or positive and
the sum of all entries in each column adds up to 1) with
entries

P(i,j) = 1/N(j) if j->i and 0 otherwise

where i and j are webpages, i->j denotes that page i has a link to
page j, and N(i) is the total number of pages linked to in page i (all
this information is available once we download page i). c is a constant
0 < c < 1 and corresponds to the fraction of all webpages that contain
an _outlink_ (that is, a link to another page); it seems that Google
uses c=0.85 as an estimate. Finally, v is a column vector with zero or
positive entries adding up to 1 and vec(1) is the constant row vector
(1,…,1). The idea behind this term is that in the _Random Surfer Model_
used to compute the PageRank, the Googlebot (normally following links
randomly in the pages it enters) jumps with probability 1-c to an
entirely different webpage, where the chance that it will end up at
page i is given by the i-th entry of v (this is to avoid being trapped
in a web-loop). So, in Google’s model the bot _teleports_ itself
randomly every 6th link or so. Now, the PageRank is a
column-eigenvector for the GoogleMatrix A with eigenvalue 1 which can be
approximated by the RandomSurfer model and the rate of convergence of
this process depends on the _second_ largest eigenvalue for A
(the largest being 1). Now, in the paper posted this morning a simple
proof is given that this eigenvalue is c (because the matrix P has
multiple eigenvalues equal to 1). According to a previous paper on the
subject, The Second Eigenvalue of the Google Matrix, this statement has
implications for the convergence rate of the standard PageRank algorithm
as the web scales, for the stability of PageRank to perturbations to the
link structure of the web, for the detection of Google spammers, and for
the design of algorithms to speed up PageRank. But I’ll have to
read more to understand the Google spammers bit…
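
To see the construction at work on a toy example, here is a small pure-Python sketch. It is only an illustration under stated assumptions: it takes the column-stochastic convention (entry (i,j) of P is 1/N(j) when page j links to page i), assumes every page has at least one outlink and lists each target once, uses the uniform vector for v unless told otherwise, and the function names and the three-page web are made up.

```python
def google_matrix(links, c=0.85, v=None):
    """Return A = cP + (1-c)(v x vec(1)) as a list of rows.

    links[j] lists the pages that page j links to. Every page is assumed
    to have at least one outlink, with no target listed twice.
    """
    n = len(links)
    v = v or [1.0 / n] * n
    # The teleport term: entry (i, j) of v x vec(1) is v[i].
    A = [[(1 - c) * v[i] for _ in range(n)] for i in range(n)]
    for j, targets in enumerate(links):
        for i in targets:
            A[i][j] += c / len(targets)  # c * P(i,j) with P(i,j) = 1/N(j)
    return A


def pagerank(A, tol=1e-12, max_iter=1000):
    """Approximate the eigenvector for eigenvalue 1 by repeated multiplication."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


if __name__ == "__main__":
    # Toy web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
    toy_links = [[1, 2], [2], [0]]
    print(pagerank(google_matrix(toy_links)))
```

In this toy web page 2 has two backlinks and ends up with the highest rank, page 1 with the lowest. Since each column of A sums to 1, the entries of the rank vector keep summing to 1 throughout the iteration.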
