CS382:Scraper

From Earlham CS Department
Revision as of 17:33, 29 March 2009 by Edlefma

NAME

mwscraper - MediaWiki scraper tool that builds dynamic MediaWiki pages from template files, which scrape data from other wiki pages.

SYNOPSIS

mwscraper [--host URL] [--user USERNAME] [--password PASSWORD] [--print | --stdout] TEMPLATES...

DESCRIPTION

mwscraper allows you to build dynamic MediaWiki pages using an expressive template language with built-ins for parsing information out of other wiki pages.

You can pass a host URL for the wiki with --host; if none is given, wiki.cs.earlham.edu is used. If no username or password is provided, you will be prompted for them. By default, the page generated from each template is uploaded to the wiki; if --print or --stdout is specified, it is printed to stdout instead. If a template does not specify a title for its generated page, you will be prompted for one.

DEPENDENCIES

  • Term::ReadKey
  • MediaWiki::API
  • Template

TEMPLATES

The template format is that of the Perl Template module (Template Toolkit). Here is an example:

software.tt:

[% title('CS382:Software') -%]
= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','\| *\[\[([^|]*)\|') # looks for links -%]
[% name = scrape(title, '= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)={1,4}') -%]
== [[[% title %]|[% name %]]] ==
[% software %]
[% END -%]

Template code is placed inside [% ... %] blocks and allows a wide range of functionality. Anything outside a code block is printed verbatim. To let a code block sit on a line by itself without producing a newline, end the block with -%] instead of %].

The first line tells the scraper what the title of the generated page should be. title() is one of a handful of functions provided by the scraper. A block containing just a single word is replaced by the value of the variable with that name. Likewise, a block of the form [% word = expression %] assigns a value to a variable so that it can be retrieved later.

The only other major block type is FOREACH. FOREACH has the following form: [% FOREACH word IN expression %] stuff... [% END %]

This assigns each value returned by expression to word in turn and, for each one, evaluates the template text up to the matching END block.
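
To see what the regular expressions in software.tt pick out on each pass of the FOREACH loop, the sketch below runs the same three patterns through Python's re module against a small made-up wiki. This is not how mwscraper is implemented (the tool is written in Perl). The page contents in PAGES and the helper names first_capture and build_sections are invented for illustration, and the use of re.MULTILINE (so that ^ matches at the start of any line) is an assumption about how the tool applies its patterns.

```python
import re

# Hypothetical wiki contents, keyed by page title (invented for this sketch).
PAGES = {
    "CS382:Topics Matrix": (
        "{|\n"
        "| [[CS382:Fire|Fire]] || week 1\n"
        "|-\n"
        "| [[CS382:Flocking|Flocking]] || week 2\n"
        "|}\n"
    ),
    "CS382:Fire": (
        "= Fire Model =\n"
        "==== Software ====\n"
        "NetLogo\n"
        "==== Notes ====\n"
        "None yet.\n"
    ),
    "CS382:Flocking": (
        "= Flocking Model =\n"
        "==== Software ====\n"
        "Python\n"
        "==== Notes ====\n"
    ),
}

# The three patterns from software.tt, copied unchanged.
LINK = r"\| *\[\[([^|]*)\|"            # link target in a table cell
HEADER = r"= *(.*?) *="                # text of the first header
SOFTWARE = r"^==== *Software *==== *\n((?:.|\n)*?)={1,4}"  # body of the Software section


def first_capture(text, pattern):
    """Return the first capture group of the first match, or '' if none."""
    m = re.search(pattern, text, re.MULTILINE)  # MULTILINE is an assumption
    return m.group(1) if m else ""


def build_sections():
    """Mimic the FOREACH loop: one (title, name, software) tuple per link."""
    sections = []
    for title in re.findall(LINK, PAGES["CS382:Topics Matrix"]):
        name = first_capture(PAGES[title], HEADER)
        software = first_capture(PAGES[title], SOFTWARE)
        sections.append((title, name, software))
    return sections


if __name__ == "__main__":
    for title, name, software in build_sections():
        print(f"== [[{title}|{name}]] ==")
        print(software)
```

Each tuple corresponds to one pass through the loop body: title is the link target found on the matrix page, name is that page's first header, and software is the text between its Software header and the next header.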

FUNCTIONS

The following functions are provided by the scraper:

title( TITLE )
Tells the scraper which wiki title the generated page should be uploaded to. If it is not called at least once in the template, you will be prompted for a title.
prompt( STRING )
Prompts the user in the form 'STRING: ' and then returns the next entered line without the trailing newline.
scrape( TITLE, REGEX )
Finds the first match of REGEX inside the page called TITLE in the wiki and returns the regex captures.
scrape_all( TITLE, REGEX )
Finds all matches of REGEX inside the page called TITLE in the wiki and returns the regex captures.
subsection( TITLE, START_REGEX, END_REGEX )
Sections off the portion of the page that runs from the first match of START_REGEX to the first match of END_REGEX after it. Returns a quasi-title that can be used anywhere a title can be used.
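
The relationship between scrape, scrape_all and subsection can be modelled with a short Python sketch. It is only an illustration of the behaviour described above, not the tool's Perl code. The WIKI dictionary, the Subsection class and the page_text helper are hypothetical, and several details are assumptions that the descriptions above leave open: patterns are applied with re.MULTILINE, captures are returned as flat lists, and a subsection starts at the beginning of the START_REGEX match and stops just before the END_REGEX match.

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical in-memory stand-in for the wiki; the real tool talks to a
# MediaWiki server instead.
WIKI = {
    "Demo": "== A ==\nx=1\ny=2\n== B ==\nx=3\n== C ==\nx=4\n",
}


@dataclass(frozen=True)
class Subsection:
    """Quasi-title: a parent title plus the regexes bounding a slice of it."""
    parent: "Title"
    start_regex: str
    end_regex: str


Title = Union[str, Subsection]


def page_text(title: Title) -> str:
    """Resolve a title or quasi-title to the text it designates."""
    if isinstance(title, str):
        return WIKI[title]
    text = page_text(title.parent)
    start = re.search(title.start_regex, text, re.MULTILINE)
    if start is None:
        return ""
    # Look for the end pattern only after the start match.
    end = re.compile(title.end_regex, re.MULTILINE).search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop]


def scrape(title: Title, regex: str) -> list:
    """Captures of the first match only (empty list if nothing matches)."""
    m = re.search(regex, page_text(title), re.MULTILINE)
    return list(m.groups()) if m else []


def scrape_all(title: Title, regex: str) -> list:
    """Captures of every match, flattened into one list."""
    captures = []
    for m in re.finditer(regex, page_text(title), re.MULTILINE):
        captures.extend(m.groups())
    return captures


def subsection(title: Title, start_regex: str, end_regex: str) -> Subsection:
    """Build a quasi-title; nothing is read until it is scraped."""
    return Subsection(title, start_regex, end_regex)
```

Because a Subsection is accepted wherever a plain title is, a call such as scrape_all(subsection("Demo", r"^== B ==", r"^== "), r"x=(\d)") only sees the text of section B in this model, and subsections can be nested inside one another.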

COPYRIGHT

2009, Matt Edlefsen

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.