on April 2, 2004 by lieven in general

my first scraper

As far as I know (but I am fairly ignorant) the arXiv does not provide RSS feeds for a particular section, say math.RA. Still, such a feed would be a good idea for anyone who uses a news aggregator to follow weblogs and news channels with RSS syndication. So I decided to write one as my first Perl exercise and, to my own surprise, after a few hours' work I have a prototype scraper for math.RA. It is not yet perfect: I still have to convert the local URLs to global URLs so that they can be clicked, and at the moment I have only collected the titles, authors and abstract links, whereas it would make more sense to include the full abstract in the RSS feed. But give me a few more days…
The basic idea is fairly simple (and based on an O'Reilly hack). One uses the Template::Extract module to extract the goodies from the arXiv's template HTML. Maybe I am still not used to Perl documentation, but it was hard for me to work out how to do this in detail from either the hack or the online module documentation. Fortunately there is a good Perl Advent Calendar page that gave me the details I needed. Once one has this info, one can turn it into a proper RSS page using the XML::RSS module.
In fact, I spent far more time trying to get XML::RSS installed under OS X than writing the code. The usual method, that is via
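To make the extract-then-syndicate idea concrete, here is a minimal sketch of the same pipeline. It is not the Perl script described in this post: it swaps in Python's standard library (a regular expression in place of Template::Extract, `xml.etree.ElementTree` in place of XML::RSS). The sample markup, the base URL and the helper names `extract_entries` and `build_rss` are all assumptions made up for illustration, and the real arXiv listing HTML will look different. The sketch also shows the local-to-global URL conversion mentioned above, using `urljoin`.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical base URL and listing markup; the real arXiv HTML differs.
BASE_URL = "http://arxiv.org/list/math.RA/recent"

SAMPLE_HTML = """
<dl>
<dt><a href="/abs/math.RA/0404001">math.RA/0404001</a></dt>
<dd><b>Title:</b> A first example title<br>
<b>Authors:</b> A. Author, B. Author</dd>
<dt><a href="/abs/math.RA/0404002">math.RA/0404002</a></dt>
<dd><b>Title:</b> A second example title<br>
<b>Authors:</b> C. Author</dd>
</dl>
"""

# One pattern per listing entry: abstract link, title, authors.
# This plays the role of the extraction template.
ENTRY = re.compile(
    r'<dt><a href="(?P<link>[^"]+)">[^<]*</a></dt>\s*'
    r'<dd><b>Title:</b>\s*(?P<title>.*?)<br>\s*'
    r'<b>Authors:</b>\s*(?P<authors>.*?)</dd>',
    re.DOTALL,
)


def extract_entries(html, base_url):
    """Return a list of dicts with title, authors and an absolute link."""
    entries = []
    for match in ENTRY.finditer(html):
        entries.append({
            "title": match.group("title").strip(),
            "authors": match.group("authors").strip(),
            # Convert the local URL into a global, clickable one.
            "link": urljoin(base_url, match.group("link")),
        })
    return entries


def build_rss(entries, channel_title, channel_link):
    """Serialize the entries as a bare-bones RSS 2.0 document (a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Scraped listing (illustrative)"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "description").text = entry["authors"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_rss(extract_entries(SAMPLE_HTML, BASE_URL), "math.RA", BASE_URL))
```

A regular expression is brittle against markup changes, which is presumably part of the appeal of a template-based extractor; the sketch only illustrates the flow from HTML to feed items.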

iMacLieven:~ lieven$ sudo /usr/bin/perl -MCPAN -e shell
Terminal does not support AddHistory.
cpan shell -- CPAN exploration and modules installation (v1.76)
ReadLine support available (try 'install Bundle::CPAN')
cpan> install XML::RSS
failed, and even a manual install, for which the drill is: download the package from CPAN, go to the extracted directory and give the commands
sudo /usr/bin/perl Makefile.PL
sudo make
sudo make test
sudo make install
failed. A Google search didn't give immediate results either, until I found this ADC page, which set me on the right track. It seems that the problem lies in installing XML::Parser, for which one first needs expat to be installed. Now, the generic SourceForge page contains a version for Linux, but fortunately expat is also part of the Fink project, so I did a
sudo fink install expat
which worked without problems but afterwards I still was not able to install XML::Parser because Fink installs everything in the /sw tree. But after
sudo perl Makefile.PL EXPATLIBPATH=/sw/lib EXPATINCPATH=/sw/include
I finally got the manual installation going. I will try to tidy up the script over the weekend…
