Posted on April 2, 2004 by lieven in general
my first scraper
As far as I know (but I am fairly ignorant) the arXiv does not
provide RSS feeds for a particular section, say math.RA. Still, such a
feed would be a good idea for anyone who uses a news aggregator to follow
weblogs and news channels with RSS syndication. So I decided to write one
as my first Perl exercise and, to my own surprise, after a few hours of
work I have a prototype scraper for math.RA. It is not yet perfect: I
still have to convert the local URLs to global URLs so that they can be
clicked, and at the moment I have only collected the titles, authors and
abstract links, whereas it would make more sense to include the full
abstract in the RSS feed. But give me a few more days…
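For the local-to-global URL conversion mentioned above, one possible approach (not necessarily what the final script will do) is the URI module's new_abs constructor, which resolves a relative link against the address of the page it came from. In this minimal sketch the base address and the two sample links are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Assumed address of the listing page the links were scraped from.
my $base = 'http://arxiv.org/list/math.RA/recent';

# Made-up examples of site-relative links as they might appear in the page.
my @local_links = ('/abs/math.RA/0404001', '/abs/math.RA/0404002');

# URI->new_abs resolves each relative reference against the base address.
my @global_links = map { URI->new_abs($_, $base)->as_string } @local_links;

print "$_\n" for @global_links;
```

Because the resolution follows the usual relative-URL rules, the same call should also handle links written relative to the current directory rather than to the site root.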
The basic idea is fairly simple (and based on an O'Reilly hack).
One uses the Template::Extract module to
extract the goodies from the arXiv's template HTML. Maybe I am still
not used to Perl documentation, but it was hard for me to work out how to
do this in detail from either the hack or the online
module documentation. Fortunately there is a good Perl Advent
Calendar page that gave me the details I needed. Once one has this
info, one can turn it into a proper RSS page using the XML::RSS module.
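The two steps just described can be sketched roughly as follows. This is not the actual scraper: the HTML fragment is invented (the real arXiv listing markup differs, and its links are local rather than absolute), and the names papers, link, id, title and authors are placeholders chosen for the illustration. Only the general pattern is meant: Template::Extract pulls records out of the page, XML::RSS turns them into a feed.

```perl
use strict;
use warnings;
use Template::Extract;
use XML::RSS;

# Invented stand-in for a listing page; the real arXiv HTML looks different.
my $html = <<'HTML';
<dl>
<dt><a href="http://arxiv.org/abs/math.RA/0404001">math.RA/0404001</a></dt>
<dd><b>Title:</b> A first sample title<br><b>Authors:</b> A. Author</dd>
<dt><a href="http://arxiv.org/abs/math.RA/0404002">math.RA/0404002</a></dt>
<dd><b>Title:</b> A second sample title<br><b>Authors:</b> B. Author, C. Author</dd>
</dl>
HTML

# Template Toolkit style template describing the page layout.
# Each [% name %] marks a field to capture; FOREACH marks a repeated record.
my $template = <<'TEMPLATE';
<dl>[% FOREACH papers %]
<dt><a href="[% link %]">[% id %]</a></dt>
<dd><b>Title:</b> [% title %]<br><b>Authors:</b> [% authors %]</dd>[% END %]
</dl>
TEMPLATE

# Run the template "in reverse" against the document to recover the data.
my $data = Template::Extract->new->extract($template, $html)
    or die "the template did not match the document\n";

# Build an RSS 1.0 feed with one item per extracted record.
my $rss = XML::RSS->new(version => '1.0');
$rss->channel(
    title       => 'arXiv math.RA (unofficial sketch)',
    link        => 'http://arxiv.org/list/math.RA/recent',
    description => 'Recent submissions scraped from the listing page',
);
for my $paper (@{ $data->{papers} }) {
    $rss->add_item(
        title       => $paper->{title},
        link        => $paper->{link},
        description => "Authors: $paper->{authors}",
    );
}

my $feed = $rss->as_string;
print $feed;
```

The template has to follow the page's whitespace and markup closely, which is why a small change in the source HTML can make the extraction come back empty.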
In fact, I spent far more time trying to get XML::RSS installed under
OS X than writing the code. The usual method, that is via

    iMacLieven:~ lieven$ sudo /usr/bin/perl -MCPAN -e shell
    Terminal does not support AddHistory.
    cpan shell -- CPAN exploration and modules installation (v1.76)
    ReadLine support available (try 'install Bundle::CPAN')
    cpan> install XML::RSS

failed, and so did a manual install, for which the drill is: download the package from CPAN, go to the extracted directory and give the commands

    sudo /usr/bin/perl Makefile.PL
    sudo make
    sudo make test
    sudo make install

A Google search did not give immediate results either, until I found this ADC page, which set me on the right track. It seems that the problem lies in installing XML::Parser, for which one first needs expat to be installed. The generic SourceForge page contains a version for Linux, but fortunately expat is also part of the Fink project, so I did a

    sudo fink install expat

which worked without problems. Afterwards I was still not able to install XML::Parser, because Fink installs everything in the /sw tree. But after

    sudo perl Makefile.PL EXPATLIBPATH=/sw/lib EXPATINCPATH=/sw/include

I finally got the manual installation going. I will try to tidy up the script over the weekend…
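After an installation saga like this, a quick way to check that the modules can actually be loaded is a small script along the following lines. It is only a sketch; the helper name module_version is made up here and is not part of any of the modules.

```perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if the module cannot be loaded at all.
sub module_version {
    my ($module) = @_;
    return undef unless eval "require $module; 1";
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

# The modules the scraper depends on, in dependency order.
for my $module (qw(XML::Parser XML::RSS Template::Extract)) {
    my $version = module_version($module);
    if (defined $version) {
        print "$module $version is installed\n";
    }
    else {
        print "$module could not be loaded\n";
    }
}
```

If XML::Parser is reported as missing, the expat step above is the first thing to recheck, since XML::RSS depends on it.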







