Skip to content →

my first scraper

As
far as I know (but I am fairly ignorant) the arXiv does not
provide RSS feeds for a particular section, say mathRA. Still it would be a good idea for anyone
having a news aggregator to follows some weblogs and
news-channels having RSS syndication. So I decided to write one as my
first Perl-exercise and to my own surprise I have after a few hours work
a prototype-scraper for math.RA. It is not yet perfect, I still
have to convert the local URLs to global URLs so that they can be
clicked and at the moment I have only collected the titles, authors and
abstract-links whereas it would make more sense to include the full
abstract in the RSS feed, but give me a few more days…
The
basic idea is fairly simple (and based on an O\’Reilly hack).
One uses the Template::Extract module to
extract the goodies from the arXiv\’s template HTML. Maybe I am still
not used to Perl-documentation but it was hard for me to work out how to
do this in detail either from the hack or the online
module-documentation. Fortunately there is a good Perl Advent
Calendar
page giving me the details that I needed. Once one has this
info one can turn it into a proper RSS-page using the XML::RSS-module.
In fact, I spend far
more time trying to get XML::RSS installed under OS X than
writing the code. The usual method, that is via

iMacLieven:~
lieven$ sudo /usr/bin/perl -MCPAN -e shell Terminal does not support
AddHistory. cpan shell -- CPAN exploration and modules installation
(v1.76) ReadLine support available (try \'install
Bundle::CPAN\') cpan> install XML::RSS 

failed and even a
manual install for which the drill is : download the package from CPAN, go to the
extracted directory and give the commands

sudo /usr/bin/perl
Makefile.pl sudo make sudo make test sudo make
install

failed. Also a Google didn\’t give immediate results until
I did find this ADC page which set me on the right track.
It seems that the problem is in installing the XML::Parser for which one first need expat
to be installed. Now, the generic sourceforge page contains a
version for Linux but fortunately it is also part of the Fink
project
so I did a

sudo fink install expat

which worked
without problems but afterwards I still was not able to install
XML::Parser because Fink installs everything in the /sw
tree. But after

sudo perl Makefile.pl EXPATLIBPATH=/sw/lib
EXPATINCPATH=/sw/include

I finally got the manual installation
going. I will try to tidy up the script over the weekend…

Published in web

One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.