my first scraper

As far as I know (but I am fairly ignorant) the arXiv does not provide RSS feeds for a particular section, say math.RA. Still, such a feed would be useful to anyone who uses a news aggregator to follow weblogs and news channels with RSS syndication. So I decided to write one as my first Perl exercise and, to my own surprise, after a few hours' work I have a prototype scraper for math.RA. It is not yet perfect: I still have to convert the local URLs to global URLs so that they can be clicked, and at the moment I have only collected the titles, authors and abstract links, whereas it would make more sense to include the full abstract in the RSS feed. But give me a few more days...

The basic idea is fairly simple (and based on an O'Reilly hack). One uses the Template::Extract module to extract the goodies from the arXiv's template HTML. Maybe I am still not used to Perl documentation, but it was hard for me to work out how to do this in detail either from the hack or from the online module documentation. Fortunately there is a good Perl Advent Calendar page giving me the details that I needed. Once one has this info one can turn it into a proper RSS page using the XML::RSS module.

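
For readers who want to see the shape of such a scraper, here is a minimal sketch of the extract-then-syndicate idea. It is not the actual script. The HTML snippet, the template, the field names (paper, link, title, authors) and the base URL are all made up for illustration, and the real arXiv listing markup is different. It also uses the URI module to turn local links into global ones, which is one possible way to handle that step.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Template::Extract;
use URI;
use XML::RSS;

# Assumed base URL of the listing page; used to resolve local links.
my $base = 'http://arxiv.org/list/math.RA/recent';

# Hypothetical listing markup (NOT the real arXiv HTML).
my $document = <<'HTML';
<dl>
<dt><a href="/abs/0000.0001">abstract</a></dt>
<dd><b>First made-up title</b> by <i>A. Author</i></dd>
<dt><a href="/abs/0000.0002">abstract</a></dt>
<dd><b>Second made-up title</b> by <i>B. Author</i></dd>
</dl>
HTML

# The template mirrors the document, with [% ... %] slots where data goes.
my $template = <<'TEMPLATE';
<dl>[% FOREACH paper %]
<dt><a href="[% link %]">abstract</a></dt>
<dd><b>[% title %]</b> by <i>[% authors %]</i></dd>[% END %]
</dl>
TEMPLATE

# extract() returns a hash reference, or undef if the template does not match.
my $data = Template::Extract->new->extract($template, $document)
    or die "template did not match the document\n";

my $rss = XML::RSS->new(version => '1.0');
$rss->channel(
    title       => 'math.RA (prototype feed)',
    link        => $base,
    description => 'Titles, authors and abstract links scraped from the listing',
);

for my $paper (@{ $data->{paper} }) {
    $rss->add_item(
        title       => $paper->{title},
        # Convert the local URL into a global, clickable one.
        link        => URI->new_abs($paper->{link}, $base)->as_string,
        description => $paper->{authors},
    );
}

print $rss->as_string;
```

The template has to follow the page's markup very closely, whitespace included, so in practice most of the work goes into copying a chunk of the real page and replacing the interesting bits with slots.
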
In fact, I spent far more time trying to get XML::RSS installed under OS X than writing the code. The usual method, that is via

iMacLieven:~ lieven$ sudo /usr/bin/perl -MCPAN -e shell
Terminal does not support AddHistory.
cpan shell -- CPAN exploration and modules installation (v1.76)
ReadLine support available (try 'install Bundle::CPAN')
cpan> install XML::RSS

failed, and even a manual install, for which the drill is: download the package from CPAN, go to the extracted directory and give the commands

sudo /usr/bin/perl Makefile.PL
sudo make
sudo make test
sudo make install

failed too. A Google search didn't give immediate results either, until I found this ADC page, which set me on the right track.

It seems that the problem is in installing XML::Parser, for which one first needs expat to be installed. Now, the generic SourceForge page contains a version for Linux, but fortunately expat is also part of the Fink project, so I did a

sudo fink install expat

which worked without problems. Afterwards I still was not able to install XML::Parser, because Fink installs everything in the /sw tree. But after

sudo perl Makefile.PL EXPATLIBPATH=/sw/lib EXPATINCPATH=/sw/include

I finally got the manual installation going. I will try to tidy up the script over the weekend...
