CS382:Scraper
Revision as of 12:02, 8 April 2009
Documentation
NAME
mwscraper - MediaWiki scraper tool that builds dynamic MediaWiki pages from template files that scrape data from other pages.
SYNOPSIS
mwscraper --help|--usage|--version
mwscraper [--host URL] [--login] [--username USERNAME] [--password PASSWORD] [--upload] TEMPLATES...
DESCRIPTION
mwscraper allows you to build dynamic MediaWiki pages using an expressive template language with built-ins for parsing information out of other wiki pages.
The templates are built using the Template Toolkit. In addition to the normal functionality provided by that package, a number of functions have been provided. See the FUNCTIONS section.
By default the generated page will be printed on stdout. If you wish the page to be uploaded directly to the wiki you may use the --upload option.
DEPENDENCIES
- Term::ReadKey
- For prompting.
- MediaWiki::API
- To interact with the Wiki.
- Template
- Provides the template language.
- Crypt::SSLeay
- Allows https URIs for the wiki.
- Compress::Zlib
- Needed to decompress responses from the wiki.
OPTIONS
- -h|--host
- Sets the URL of the wiki to use. Should start with either http:// or https://.
- -l|--login
- Tells the scraper to attempt to log in if it needs to access the wiki (either for reading or uploading). If a username or password is not given, it will be prompted for.
- -u|--username
- Provides a username to be used when logging in.
- -p|--password
- Provides a password to be used when logging in. This option only works if --username or MWSCRAPER_USERNAME is set.
- -e|--edit|--upload
- Tells the scraper to upload the generated pages to the wiki. The page name will be prompted for unless the template specifies one using the title function. If the page does not already exist, a new one will be created.
- --help, --usage, --version
- Displays help/usage/version information.
ENVIRONMENT
- MWSCRAPER_HOST
- Provides a default host if none is provided on the command line.
- MWSCRAPER_LOGIN
- Tells the scraper to log in if it needs to access the wiki.
- MWSCRAPER_USERNAME
- Provides a default username if none is provided on the command line. This variable must be set for MWSCRAPER_PASSWORD to be used.
- MWSCRAPER_PASSWORD
- Provides a password to be used with MWSCRAPER_USERNAME. If MWSCRAPER_USERNAME is not set this value will be ignored.
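The interaction between the options and the environment variables above can be sketched as follows. This is a Python illustration of one reading of the rules, not the tool's own Perl code: the function name resolve_settings and the dictionary keys are hypothetical, any non-empty MWSCRAPER_LOGIN value is assumed to count as "set", and MWSCRAPER_PASSWORD is assumed to apply only when the username itself comes from MWSCRAPER_USERNAME.

```python
def resolve_settings(cli, env):
    """Merge command-line values with MWSCRAPER_* environment variables.

    `cli` and `env` are plain dicts; both names are hypothetical.
    """
    # The command line wins; the environment only supplies defaults.
    host = cli.get("host") or env.get("MWSCRAPER_HOST")
    # Assumption: any non-empty MWSCRAPER_LOGIN value enables login.
    login = bool(cli.get("login")) or bool(env.get("MWSCRAPER_LOGIN"))

    env_user = env.get("MWSCRAPER_USERNAME")
    username = cli.get("username") or env_user

    password = None
    if username and cli.get("password"):
        # --password only works if a username is known from either source.
        password = cli["password"]
    elif env_user and not cli.get("username"):
        # Assumption: MWSCRAPER_PASSWORD pairs only with MWSCRAPER_USERNAME.
        password = env.get("MWSCRAPER_PASSWORD")

    return {"host": host, "login": login,
            "username": username, "password": password}


if __name__ == "__main__":
    env = {"MWSCRAPER_HOST": "https://wiki.example.org",
           "MWSCRAPER_USERNAME": "alice",
           "MWSCRAPER_PASSWORD": "secret"}
    print(resolve_settings({"login": True}, env))
```

Anything left as None after this step would be prompted for by the real tool when it is needed.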
TEMPLATES
The template format is that of the Template Perl module. See the Template perldoc page for more details.
Template code is placed between [% %] blocks and allows a wide range of functionality. The most basic form is inserting the value of a variable. The form of this is simply
[% variable_name %]
You can also assign a value to a variable using the = operator.
[% variable_name = expression %]
EXAMPLE
software.tt:
[% title('CS382:Software') ~%]
[% BLOCK unit_section -%]
[% name = scrape(title, '= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)={1,4}') ~%]
== [[[% title %]|[% name %]]] ==
[% software %]
[%- END ~%]
= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','\| *\[\[([^|]*)\|') # looks for links -%]
[% INCLUDE unit_section %]
[% END -%]
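To see what the three regular expressions in software.tt capture, the following Python sketch applies them to made-up wikitext. The page contents are hypothetical, Python's re module stands in for Perl's regex engine, and enabling multi-line matching for the ^ anchor is an assumption about how the scraper compiles its patterns.

```python
import re

# Hypothetical wikitext standing in for the 'CS382:Topics Matrix' page.
MATRIX = (
    '{| class="wikitable"\n'
    "|-\n"
    "| [[CS382:Unit-fire|Fire]] || notes\n"
    "|-\n"
    "| [[CS382:Unit-water|Water]] || notes\n"
    "|}\n"
)

# Hypothetical wikitext standing in for one unit page.
UNIT = (
    "= Fire Unit =\n"
    "==== Software ====\n"
    "* NetLogo\n"
    "==== Hardware ====\n"
    "* None\n"
)

# The patterns are copied from the template above.
LINK_RE = r"\| *\[\[([^|]*)\|"
HEADER_RE = r"= *(.*?) *="
SOFTWARE_RE = r"^==== *Software *==== *\n((?:.|\n)*?)={1,4}"

# scrape_all: every link target found in a table cell.
titles = re.findall(LINK_RE, MATRIX)
# scrape: the text of the first header on the unit page.
name = re.search(HEADER_RE, UNIT).group(1)
# Assumption: multi-line mode is on, so ^ matches at the start of a line.
software = re.search(SOFTWARE_RE, UNIT, re.MULTILINE).group(1)

print(titles)
print(repr(name))
print(repr(software))
```

With this input, the link pattern yields the two page titles, the header pattern yields the unit name, and the last pattern yields everything between the Software heading and the next heading.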
COMMANDS
There are also a number of special commands. Here are a few common ones.
- FOREACH
- Allows you to repeat a section of a template for each value in an array.
[% FOREACH var IN expression %]
[% var %]
stuff...
[% END %]
- BLOCK
- Allows you to create a named section of a template to be included elsewhere.
[% BLOCK blockname %]
stuff...
[% END %]
- INCLUDE
- Inserts the contents of a block (or external template file).
[% INCLUDE block/filename %]
WHITESPACE
Anything outside code blocks is printed verbatim. This means that any whitespace surrounding template code, including newlines, is kept. To remove whitespace around template code, add either - or ~ immediately inside the % on the side you want to trim. The - removes whitespace up to and including the nearest newline on that side. The ~ removes all adjacent whitespace on that side, including newlines.
For example:
[% var = "Hey There" -%]
[% var %]
This will just print "Hey There\n", with the newline after the first line removed.
[% var = "Hey There" %]
[%~ var %]
This will also just print "Hey There\n", because all whitespace before the second directive is removed.
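The two trimming flags can be pictured as rewrites of the literal text around each directive. The Python sketch below is a simplified model under stated assumptions, not Template Toolkit's parser: the function name apply_chomp is hypothetical, and the four regular expressions only approximate the - and ~ behavior described above.

```python
import re


def apply_chomp(template):
    """Strip whitespace around directives according to their - and ~ flags.

    Directives are left in place (without the flags) so the effect on the
    surrounding literal text is easy to see.
    """
    text = template
    # [%~ ... : remove all whitespace before the directive.
    text = re.sub(r"\s*\[%~", "[%", text)
    # ... ~%] : remove all whitespace after the directive.
    text = re.sub(r"~%\]\s*", "%]", text)
    # [%- ... : remove whitespace back to and including the previous newline.
    text = re.sub(r"\n[ \t]*\[%-", "[%", text)
    # ... -%] : remove whitespace up to and including the next newline.
    text = re.sub(r"-%\][ \t]*\n", "%]", text)
    return text


if __name__ == "__main__":
    print(repr(apply_chomp('[% var = "Hey There" -%]\n[% var %]\n')))
    print(repr(apply_chomp('[% var = "Hey There" %]\n[%~ var %]\n')))
```

In both cases the newline between the two directives disappears, which is why the two examples above produce the same output.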
FUNCTIONS
The following functions are provided by the scraper:
- title( TITLE )
- Tells the scraper what title to upload the generated page to in the wiki. If it's not called at least once in the template, a title will be prompted for.
- prompt( STRING )
- Prompts the user in the form 'STRING: ' and then returns the next entered line without the trailing newline.
- login( [USERNAME [, PASSWORD ] ] )
- Tells the scraper to attempt to log in if it needs to access the wiki (either to scrape or upload). If the username, the password, or both aren't provided, they will be prompted for.
- cat( STRING... )
- Concatenates all passed strings together and returns the result.
- scrape( TITLE, REGEX )
- Finds the first match of REGEX inside the page called TITLE in the wiki and returns the regex captures.
- scrape_next( TITLE, REGEX )
- Finds the next match of REGEX inside the page called TITLE in the wiki and returns the regex captures. Starts searching directly after the position of the last match.
- scrape_all( TITLE, REGEX )
- Finds all matches of REGEX inside the page called TITLE in the wiki and returns the regex captures.
- subsection( TITLE, START_REGEX, END_REGEX )
- Sections off the portion of the page designated by the first match of START_REGEX and the first match of END_REGEX after START_REGEX. Returns a quasi-title that can be used anywhere a title can be used.
- subsection_next( TITLE, START_REGEX, END_REGEX )
- Sections off the next portion of the page designated by the first match of START_REGEX and the first match of END_REGEX after START_REGEX. Returns a quasi-title that can be used anywhere a title can be used. Starts searching directly after the position of the last match.
- subsection_all( TITLE, START_REGEX, END_REGEX )
- Sections off all portions of the page designated by matches of START_REGEX and the first match of END_REGEX after each match of START_REGEX. Returns a list of quasi-titles that can be used anywhere a title can be used.
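The scraping functions above can be modelled with a small in-memory stand-in for the wiki. This Python sketch is illustrative only and does not reflect the tool's Perl implementation. The class name ToyWiki, the per-title search position, the empty-string result on a failed match, the multi-line regex flag, and the decision to start a subsection at the beginning of the START_REGEX match are all assumptions.

```python
import re

FLAGS = re.MULTILINE  # assumption: the real tool's regex flags are not documented here


class ToyWiki:
    """In-memory stand-in for a wiki: maps page titles to wikitext."""

    def __init__(self, pages):
        self.pages = dict(pages)
        self.last_end = {}  # title -> offset just past the most recent match

    @staticmethod
    def _captures(match):
        # Assumption: a single capture group comes back as a plain string.
        groups = match.groups()
        return groups[0] if len(groups) == 1 else groups

    def _search(self, title, regex, start):
        match = re.compile(regex, FLAGS).search(self.pages[title], start)
        if match is None:
            return ""  # assumption: a failed scrape yields an empty string
        self.last_end[title] = match.end()
        return self._captures(match)

    def scrape(self, title, regex):
        """First match, searching from the top of the page."""
        return self._search(title, regex, 0)

    def scrape_next(self, title, regex):
        """Next match, searching directly after the previous match."""
        return self._search(title, regex, self.last_end.get(title, 0))

    def scrape_all(self, title, regex):
        """Captures of every match on the page."""
        text = self.pages[title]
        return [self._captures(m) for m in re.finditer(regex, text, FLAGS)]

    def subsection(self, title, start_regex, end_regex):
        """Store a slice of the page under a quasi-title and return that title."""
        text = self.pages[title]
        start = re.compile(start_regex, FLAGS).search(text)
        if start is None:
            return ""
        end = re.compile(end_regex, FLAGS).search(text, start.end())
        stop = end.start() if end else len(text)
        quasi_title = f"{title}#section@{start.start()}"
        self.pages[quasi_title] = text[start.start():stop]
        return quasi_title


if __name__ == "__main__":
    wiki = ToyWiki({"Topics": "| [[A|Alpha]]\n| [[B|Beta]]\n"})
    print(wiki.scrape("Topics", r"\[\[([^|]*)\|"))
    print(wiki.scrape_next("Topics", r"\[\[([^|]*)\|"))
    print(wiki.scrape_all("Topics", r"\[\[([^|]*)\|"))
```

Because subsection stores its slice under a new key, the returned quasi-title can be passed straight back into scrape, which mirrors the statement that a quasi-title works anywhere a title does.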
COPYRIGHT
2009, Matt Edlefsen
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Examples
software.tt
[% title('CS382:Software') -%]
= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','^\| *\[\[([^|]*)\|') # looks for links -%]
[% name = scrape(title, '^= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)^={1,4}') -%]
== [[[% title %]|[% name %]]] ==
[% software %]
[% END -%]
geneds.tt
[% title('CS382:GenEds') -%]
[% pages = scrape_all('CS382:Topics Matrix','\| \[\[([^|]*)\|') ~%]
[% BLOCK UnitLink -%]
[% linkname = scrape(page, '^= *(.*?) *=') -%]
[% IF linkname.length == 0 -%]
[% linkname = page -%]
[% END -%]
[[[% page %][% anchor %]|[% linkname %]]]
[%- END ~%]
[% BLOCK GenEd -%]
* ''[% name %]''
[% FOREACH page IN pages -%]
** [% INCLUDE UnitLink %]: [% scrape(page, cat( name.replace('-','.') , '.*?^\** (.*?)$')) %]
[% END -%]
[% END ~%]
[% BLOCK GenEdRow -%]
[% FOREACH page IN pages -%]
| [% scrape(page, "${name.replace('-','.')}.*?^\\** (.{1,15}?)\\.") %]
[% END -%]
[% END ~%]
== General Education Alignment ==
<center>
{| class="wikitable" border="1"
|+ Helpful Total Geneds Coverage Table, Fig. 18c.
|-
! Unit
[% FOREACH page IN pages %]
! [% INCLUDE UnitLink anchor = '#General Education Alignment' %]
[% END %]
|-
| ARa
[% INCLUDE GenEdRow name = 'They focus substantially on properties of classes of abstract models and operations that apply to them.' %]
|-
| ARb
[% INCLUDE GenEdRow name = 'They provide experience in generalizing from specific instances to appropriate classes of abstract models.' %]
|-
| ARc
[% INCLUDE GenEdRow name = 'They provide experience in solving concrete problems by a process of abstraction and manipulation at the abstract level. Typically this experience is provided by word problems which require students to formalize real-world problems in abstract terms, to solve them with techniques that apply at that abstract level, and to convert the solutions back into concrete results.' %]
|-
| QRa
[% INCLUDE GenEdRow name = 'Using and interpreting formulas, graphs and tables.' %]
|-
| QRb
[% INCLUDE GenEdRow name = 'Representing mathematical ideas symbolically, graphically, numerically and verbally.' %]
|-
| QRc
[% INCLUDE GenEdRow name = 'Using mathematical and statistical ideas to solve problems in a variety of contexts.' %]
|-
| QRd
[% INCLUDE GenEdRow name = 'Using simple models such as linear dependence, exponential growth or decay, or normal distribution.' %]
|-
| QRe
[% INCLUDE GenEdRow name = 'Understanding basic statistical ideas such as averages, variability and probability.' %]
|-
| QRf
[% INCLUDE GenEdRow name = 'Making estimates and checking the reasonableness of answers.' %]
|-
| QRg
[% INCLUDE GenEdRow name = 'Recognizing the limitations of mathematical and statistical methods.' %]
|-
| SIa
[% INCLUDE GenEdRow name = 'Develops students\' understanding of the natural world.' %]
|-
| SIb
[% INCLUDE GenEdRow name = 'Strengthens students\' knowledge of the scientific way of knowing - the use of systematic observation and experimentation to develop theories and test hypotheses.' %]
|-
| SIc
[% INCLUDE GenEdRow name = 'Emphasizes and provides first-hand experience with both theoretical analysis and the collection of empirical data.' %]
|}
</center>
=== Analytical Reasoning Requirement ===
==== Abstract Reasoning ====
From the [[http://www.earlham.edu/curriculumguide/academics/analytical.html Catalog Description]]
''Courses qualifying for credit in Abstract Reasoning typically share these characteristics:''
[% INCLUDE GenEd name = 'They focus substantially on properties of classes of abstract models and operations that apply to them.' anchor = '#Abstract Reasoning' %]
[% INCLUDE GenEd name = 'They provide experience in generalizing from specific instances to appropriate classes of abstract models.' anchor = '#Abstract Reasoning' %]
[% INCLUDE GenEd name = 'They provide experience in solving concrete problems by a process of abstraction and manipulation at the abstract level. Typically this experience is provided by word problems which require students to formalize real-world problems in abstract terms, to solve them with techniques that apply at that abstract level, and to convert the solutions back into concrete results.' anchor = '#Abstract Reasoning' %]
==== Quantitative Reasoning ====
From the [[http://www.earlham.edu/curriculumguide/academics/analytical.html Catalog Description]]
''General Education courses in Quantitative Reasoning foster students' abilities to generate, interpret and evaluate quantitative information. In particular, Quantitative Reasoning courses help students develop abilities in such areas as:''
[% INCLUDE GenEd name = 'Using and interpreting formulas, graphs and tables.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Representing mathematical ideas symbolically, graphically, numerically and verbally.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Using mathematical and statistical ideas to solve problems in a variety of contexts.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Using simple models such as linear dependence, exponential growth or decay, or normal distribution.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Understanding basic statistical ideas such as averages, variability and probability.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Making estimates and checking the reasonableness of answers.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Recognizing the limitations of mathematical and statistical methods.' anchor = '#Quantitative Reasoning' %]
=== Scientific Inquiry Requirement ===
From the [[http://www.earlham.edu/curriculumguide/academics/scientific.html Catalog Description]]
''Scientific inquiry:''
[% INCLUDE GenEd name = 'Develops students\' understanding of the natural world.' anchor = '#Scientific Inquiry Requirement' %]
[% INCLUDE GenEd name = 'Strengthens students\' knowledge of the scientific way of knowing - the use of systematic observation and experimentation to develop theories and test hypotheses.' anchor = '#Scientific Inquiry Requirement' %]
[% INCLUDE GenEd name = 'Emphasizes and provides first-hand experience with both theoretical analysis and the collection of empirical data.' anchor = '#Scientific Inquiry Requirement' %]