CS382:Scraper
Documentation
NAME
mwscraper - MediaWiki scraper tool that builds dynamic MediaWiki pages from template files that scrape data from other pages.
SYNOPSIS
mwscraper --help
mwscraper [--host URL] [--login [USERNAME] ] [--password PASSWORD] [--upload] TEMPLATES...
OPTIONS
- -h|--host
- Sets the URL of the wiki to use. Should start with either http:// or https://.
- -l|--login
- Tells the scraper to attempt to log in if it needs to access the wiki (either for reading or uploading). If a username is not given, it will be prompted for. If a template uses the login function, it will override this option.
- -p|--password
- Provides a password to be used when logging in. This option only works if --login was specified with a username.
- -u|--upload
- Tells the scraper to upload the generated pages to the wiki. The page name will be prompted for unless the template specifies one using the title function.
DESCRIPTION
mwscraper allows you to build dynamic MediaWiki pages using an expressive template language with built-ins for parsing information out of other wiki pages.
You can pass it a host URL for the wiki; by default it uses wiki.cs.earlham.edu. If no username or password is provided, they will be prompted for. By default, the generated page is printed to stdout, but if the --upload flag is provided the page is uploaded directly to the wiki under the title specified by the title function. If the template does not specify a title for the generated page, one will be prompted for.
You will want to look at CS382:Scraper_Recipes to get the most out of this tool.
DEPENDENCIES
- Term::ReadKey
- MediaWiki::API
- Template
- Crypt::SSLeay (for https)
TEMPLATES
The template format is that of the Perl Template module. Here is an example:
software.tt:
[% title('CS382:Software') -%]
= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','\| *\[\[([^|]*)\|') # looks for links -%]
[% name = scrape(title, '= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)={1,4}') -%]
== [[[% title %]|[% name %]]] ==
[% software %]
[% END -%]
Template code is placed between [% %] blocks and allows a wide range of functionality. Anything not within code blocks is printed verbatim. To allow a code block to exist by itself on a line without a newline being produced, end the block with -%] instead.
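The difference between %] and -%] can be pictured with a small Python sketch. It only illustrates the newline behaviour described above and is not how the Template module is implemented: the render helper is hypothetical, and it simply discards every code block instead of evaluating it.

```python
import re

# One code block, an optional '-' just before the closing tag,
# and the newline (if any) that directly follows the block.
_BLOCK = re.compile(r"\[%(.*?)(-?)%\](\n?)", re.S)

def render(template: str) -> str:
    """Toy renderer (hypothetical): drops code blocks, honours '-%]'."""
    def replace(match):
        chomp, newline = match.group(2), match.group(3)
        # With '-%]' the following newline is swallowed; otherwise it is kept.
        return "" if chomp else newline
    return _BLOCK.sub(replace, template)

plain = "[% x = 1 %]\nhello\n"
chomped = "[% x = 1 -%]\nhello\n"

print(repr(render(plain)))    # an empty line is left where the block was
print(repr(render(chomped)))  # the block's line disappears entirely
```

In a wiki page the stray empty lines matter, because blank lines start new paragraphs and can break lists, which is presumably why the example templates end most blocks with -%].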
The first line tells the scraper what the title of the generated page should be. This is one of a handful of functions provided by the scraper. A block containing just a single word is replaced by the value of that variable. Likewise, a block of the form [% word = expression %] assigns a value to a variable to be retrieved later.
The only other major block type is FOREACH. FOREACH has the following form:
[% FOREACH word IN expression %]
stuff...
[% END %]
The effect of this will be to assign each value returned by expression to word in turn and then evaluate the template up until the END block.
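In Python terms, a FOREACH block behaves roughly like a for loop that renders its body once per value and concatenates the results. The sketch below is only an analogy: foreach, body and the page titles are made-up names, with body standing in for the template text between FOREACH and END.

```python
def foreach(values, body):
    """Render `body` once per value and join the results (hypothetical helper)."""
    return "".join(body(value) for value in values)

# Made-up page titles standing in for the values returned by `expression`.
titles = ["CS382:Unit A", "CS382:Unit B"]

# The lambda plays the role of the template text up to the END block.
output = foreach(titles, lambda title: f"== [[{title}]] ==\n")
print(output, end="")
```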
FUNCTIONS
The following functions are provided by the scraper:
- title( TITLE )
- Tells the scraper which wiki page title to upload the generated page to. If it is not called at least once in the template, a title will be prompted for.
- prompt( STRING )
- Prompts the user in the form 'STRING: ' and then returns the next entered line without the trailing newline.
- login( [USERNAME [, PASSWORD ] ] )
- Logs in to the wiki using USERNAME and PASSWORD. If one or both aren't provided they will be prompted for.
- cat( STRING... )
- Concatenates all passed strings together and returns the result.
- scrape( TITLE, REGEX )
- Finds the first match of REGEX inside the page called TITLE in the wiki and returns the regex captures.
- scrape_next( TITLE, REGEX )
- Finds the next match of REGEX inside the page called TITLE in the wiki and returns the regex captures. Starts searching directly after the position of the last match.
- scrape_all( TITLE, REGEX )
- Finds all matches of REGEX inside the page called TITLE in the wiki and returns the regex captures.
- subsection( TITLE, START_REGEX, END_REGEX )
- Sections off the portion of the page designated by the first match of START_REGEX and the first match of END_REGEX after START_REGEX. Returns a quasi-title that can be used anywhere a title can be used.
- subsection_next( TITLE, START_REGEX, END_REGEX )
- Sections off the next portion of the page designated by the first match of START_REGEX and the first match of END_REGEX after START_REGEX. Returns a quasi-title that can be used anywhere a title can be used. Starts searching directly after the position of the last match.
- subsection_all( TITLE, START_REGEX, END_REGEX )
- Sections off all portions of the page designated by matches of START_REGEX and the first match of END_REGEX after each match of START_REGEX. Returns a list of quasi-titles that can be used anywhere a title can be used.
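The relationship between scrape, scrape_next, scrape_all and subsection can be sketched in Python against an in-memory page. This is an illustration of the descriptions above, not the scraper's Perl code, and it rests on several assumptions: the Page class is hypothetical, ^ and $ are taken to match at line boundaries (the example templates suggest this but the text does not say so), captures are returned as tuples, and a subsection is taken to run from the start of the START_REGEX match up to the start of the END_REGEX match.

```python
import re

class Page:
    """In-memory stand-in for a wiki page (hypothetical; no network access)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # end of the most recent match, used by scrape_next

    def scrape(self, regex):
        """Captures of the first match, or None if nothing matches."""
        self.pos = 0
        return self.scrape_next(regex)

    def scrape_next(self, regex):
        """Captures of the next match, searching just after the last one."""
        match = re.compile(regex, re.M).search(self.text, self.pos)
        if match is None:
            return None
        self.pos = match.end()
        return match.groups()

    def scrape_all(self, regex):
        """Captures of every match, in page order."""
        return [m.groups() for m in re.finditer(regex, self.text, re.M)]

    def subsection(self, start_regex, end_regex):
        """New Page covering the first start match up to the next end match."""
        start = re.compile(start_regex, re.M).search(self.text)
        if start is None:
            return None
        end = re.compile(end_regex, re.M).search(self.text, start.end())
        stop = end.start() if end else len(self.text)
        return Page(self.text[start.start():stop])

page = Page(
    "= Unit A =\n"
    "==== Software ====\n"
    "* gcc\n"
    "* make\n"
    "==== Reading ====\n"
    "* chapter 1\n"
)

item = r"^\* (.*)$"
first = page.scrape(item)        # first list item on the page
second = page.scrape_next(item)  # continues after the first match
everything = page.scrape_all(item)
software = page.subsection(r"^==== Software ====\n", r"^====")
software_items = software.scrape_all(item)  # a subsection is scraped like a page
print(first, second, everything, software_items)
```

Returning a Page from subsection mirrors the idea of a quasi-title: the result can be passed to the same scraping calls as a whole page, which is how the scope of a regex can be narrowed to one section.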
COPYRIGHT
2009, Matt Edlefsen
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Examples
software.tt
[% title('CS382:Software') -%]
= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','^\| *\[\[([^|]*)\|') # looks for links -%]
[% name = scrape(title, '^= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)^={1,4}') -%]
== [[[% title %]|[% name %]]] ==
[% software %]
[% END -%]
geneds.tt
[% title('CS382:GenEds') -%]
[% pages = scrape_all('CS382:Topics Matrix','\| \[\[([^|]*)\|') ~%]
[% BLOCK UnitLink -%]
[% linkname = scrape(page, '^= *(.*?) *=') -%]
[% IF linkname.length == 0 -%]
[% linkname = page -%]
[% END -%]
[[[% page %][% anchor %]|[% linkname %]]]
[%- END ~%]
[% BLOCK GenEd -%]
* ''[% name %]''
[% FOREACH page IN pages -%]
** [% INCLUDE UnitLink %]: [% scrape(page, cat( name.replace('-','.') , '.*?^\** (.*?)$')) %]
[% END -%]
[% END ~%]
[% BLOCK GenEdRow -%]
[% FOREACH page IN pages -%]
| [% scrape(page, "${name.replace('-','.')}.*?^\\** (.{1,15}?)\\.") %]
[% END -%]
[% END ~%]
== General Education Alignment ==
<center>
{| class="wikitable" border="1"
|+ Helpful Total Geneds Coverage Table, Fig. 18c.
|-
! Unit
[% FOREACH page IN pages %]
! [% INCLUDE UnitLink anchor = '#General Education Alignment' %]
[% END %]
|-
| ARa
[% INCLUDE GenEdRow name = 'They focus substantially on properties of classes of abstract models and operations that apply to them.' %]
|-
| ARb
[% INCLUDE GenEdRow name = 'They provide experience in generalizing from specific instances to appropriate classes of abstract models.' %]
|-
| ARc
[% INCLUDE GenEdRow name = 'They provide experience in solving concrete problems by a process of abstraction and manipulation at the abstract level. Typically this experience is provided by word problems which require students to formalize real-world problems in abstract terms, to solve them with techniques that apply at that abstract level, and to convert the solutions back into concrete results.' %]
|-
| QRa
[% INCLUDE GenEdRow name = 'Using and interpreting formulas, graphs and tables.' %]
|-
| QRb
[% INCLUDE GenEdRow name = 'Representing mathematical ideas symbolically, graphically, numerically and verbally.' %]
|-
| QRc
[% INCLUDE GenEdRow name = 'Using mathematical and statistical ideas to solve problems in a variety of contexts.' %]
|-
| QRd
[% INCLUDE GenEdRow name = 'Using simple models such as linear dependence, exponential growth or decay, or normal distribution.' %]
|-
| QRe
[% INCLUDE GenEdRow name = 'Understanding basic statistical ideas such as averages, variability and probability.' %]
|-
| QRf
[% INCLUDE GenEdRow name = 'Making estimates and checking the reasonableness of answers.' %]
|-
| QRg
[% INCLUDE GenEdRow name = 'Recognizing the limitations of mathematical and statistical methods.' %]
|-
| SIa
[% INCLUDE GenEdRow name = 'Develops students\' understanding of the natural world.' %]
|-
| SIb
[% INCLUDE GenEdRow name = 'Strengthens students\' knowledge of the scientific way of knowing - the use of systematic observation and experimentation to develop theories and test hypotheses.' %]
|-
| SIc
[% INCLUDE GenEdRow name = 'Emphasizes and provides first-hand experience with both theoretical analysis and the collection of empirical data.' %]
|}
</center>
=== Analytical Reasoning Requirement ===
==== Abstract Reasoning ====
From the [[http://www.earlham.edu/curriculumguide/academics/analytical.html Catalog Description]]
''Courses qualifying for credit in Abstract Reasoning typically share these characteristics:''
[% INCLUDE GenEd name = 'They focus substantially on properties of classes of abstract models and operations that apply to them.' anchor = '#Abstract Reasoning' %]
[% INCLUDE GenEd name = 'They provide experience in generalizing from specific instances to appropriate classes of abstract models.' anchor = '#Abstract Reasoning' %]
[% INCLUDE GenEd name = 'They provide experience in solving concrete problems by a process of abstraction and manipulation at the abstract level. Typically this experience is provided by word problems which require students to formalize real-world problems in abstract terms, to solve them with techniques that apply at that abstract level, and to convert the solutions back into concrete results.' anchor = '#Abstract Reasoning' %]
==== Quantitative Reasoning ====
From the [[http://www.earlham.edu/curriculumguide/academics/analytical.html Catalog Description]]
''General Education courses in Quantitative Reasoning foster students' abilities to generate, interpret and evaluate quantitative information. In particular, Quantitative Reasoning courses help students develop abilities in such areas as:''
[% INCLUDE GenEd name = 'Using and interpreting formulas, graphs and tables.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Representing mathematical ideas symbolically, graphically, numerically and verbally.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Using mathematical and statistical ideas to solve problems in a variety of contexts.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Using simple models such as linear dependence, exponential growth or decay, or normal distribution.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Understanding basic statistical ideas such as averages, variability and probability.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Making estimates and checking the reasonableness of answers.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Recognizing the limitations of mathematical and statistical methods.' anchor = '#Quantitative Reasoning' %]
=== Scientific Inquiry Requirement ===
From the [[http://www.earlham.edu/curriculumguide/academics/scientific.html Catalog Description]]
''Scientific inquiry:''
[% INCLUDE GenEd name = 'Develops students\' understanding of the natural world.' anchor = '#Scientific Inquiry Requirement' %]
[% INCLUDE GenEd name = 'Strengthens students\' knowledge of the scientific way of knowing - the use of systematic observation and experimentation to develop theories and test hypotheses.' anchor = '#Scientific Inquiry Requirement' %]
[% INCLUDE GenEd name = 'Emphasizes and provides first-hand experience with both theoretical analysis and the collection of empirical data.' anchor = '#Scientific Inquiry Requirement' %]