Difference between revisions of "CS382:Scraper"

From Earlham CS Department

Revision as of 21:35, 7 April 2009

Documentation

NAME

mwscraper - a MediaWiki scraper tool that builds dynamic MediaWiki pages from template files that scrape data from other pages.

SYNOPSIS

mwscraper --help|--usage

mwscraper [--host URL] [--login] [--username USERNAME] [--password PASSWORD] [--upload] TEMPLATE

DESCRIPTION

mwscraper allows you to build dynamic MediaWiki pages using an expressive template language with built-ins for parsing information out of other wiki pages.

The templates are built using the Template Toolkit. In addition to the normal functionality provided by that package, a number of functions have been provided. See the FUNCTIONS section.

By default, the generated page is printed to stdout. To upload the page directly to the wiki, use the --upload option.

DEPENDENCIES

Term::ReadKey
For prompting.
MediaWiki::API
To interact with the Wiki.
Template
Provides the template language.
Crypt::SSLeay
Allows https URIs for the wiki.

OPTIONS

-h|--host
Sets the URL of the wiki to use. Should start with either http:// or https://.
-l|--login
Tells the scraper to attempt to login if it needs to access the wiki (either for reading or uploading). If a username or password is not given it will be prompted for.
-u|--username
Provides a username to be used when logging in.
-p|--password
Provides a password to be used when logging in. This option only works if --username or MWSCRAPER_USERNAME is set.
-e|--edit|--upload
Tells the scraper to upload the generated pages to the wiki. The page name will be prompted for unless the template specifies one using the title function. If the page does not already exist, a new one will be created.
--help, --usage
Displays help/usage information.

ENVIRONMENT

MWSCRAPER_HOST
Provides a default host if none is provided on the command line.
MWSCRAPER_LOGIN
Tells the scraper to login if it needs to access the wiki.
MWSCRAPER_USERNAME
Provides a default username if none is provided on the command line. This variable must be set for MWSCRAPER_PASSWORD to be used.
MWSCRAPER_PASSWORD
Provides a password to be used with MWSCRAPER_USERNAME. If MWSCRAPER_USERNAME is not set this value will be ignored.
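
The interaction between the options and the environment variables can be summarised in a short Python sketch. This is only a model of the rules described above, not the tool's own code: the function name resolve_settings and the dictionary keys are hypothetical, and two points are assumptions because the text does not state them (that any non-empty MWSCRAPER_LOGIN value enables login, and that --password takes precedence over MWSCRAPER_PASSWORD).

```python
def resolve_settings(cli: dict, env: dict) -> dict:
    """Combine command-line options with MWSCRAPER_* environment variables.

    `cli` holds parsed options (hypothetical keys: host, login, username,
    password); `env` is a mapping such as os.environ.
    """
    # Command-line values win; the environment only supplies defaults.
    host = cli.get("host") or env.get("MWSCRAPER_HOST")
    username = cli.get("username") or env.get("MWSCRAPER_USERNAME")

    # Assumption: any non-empty MWSCRAPER_LOGIN value turns login on.
    login = bool(cli.get("login")) or bool(env.get("MWSCRAPER_LOGIN"))

    # A password is only honoured when a username is known;
    # otherwise it is ignored and would have to be prompted for.
    password = None
    if username:
        # Assumption: --password overrides MWSCRAPER_PASSWORD.
        password = cli.get("password") or env.get("MWSCRAPER_PASSWORD")

    return {"host": host, "login": login,
            "username": username, "password": password}


if __name__ == "__main__":
    # Password without a username is dropped.
    print(resolve_settings({}, {"MWSCRAPER_PASSWORD": "s3cret"}))
    # Username on the command line unlocks the environment password.
    print(resolve_settings({"username": "alice"},
                           {"MWSCRAPER_PASSWORD": "s3cret"}))
```

Any value that ends up as None here corresponds to something the scraper would still have to prompt for.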

TEMPLATES

The template format is that of the Template perl module. See the Template Perldoc page for more details.

Template code is placed between [% %] blocks and allows a wide range of functionality. The most basic form inserts the value of a variable: [% variable_name %]

You can also assign a value to a variable using the = operator: [% variable_name = expression %]

EXAMPLE

software.tt:

[% title('CS382:Software') ~%]

[% BLOCK unit_section -%]
[% name = scrape(title, '= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)={1,4}') ~%]

== [[[% title %]|[% name %]]] ==
[% software %]

[%- END ~%]

= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','\| *\[\[([^|]*)\|') # looks for links -%]
[% INCLUDE unit_section %]
[% END -%]
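
To make the data flow of software.tt easier to follow, here is a rough Python sketch of the same steps run against an in-memory stand-in for the wiki. Everything in it is illustrative: the PAGES dictionary, the unit page names, and the helpers first_capture and all_captures are hypothetical, and the use of multi-line matching for ^ is an assumption about how the scraper applies its regular expressions.

```python
import re

# Hypothetical stand-in for the wiki: page title -> wikitext.
PAGES = {
    "CS382:Topics Matrix": (
        "{|\n"
        "| [[CS382:Unit-sorting|Sorting]]\n"
        "| [[CS382:Unit-search|Search]]\n"
        "|}\n"
    ),
    "CS382:Unit-sorting": (
        "= Sorting =\n==== Software ====\n* sort(1)\n==== Readings ====\n"
    ),
    "CS382:Unit-search": (
        "= Search =\n==== Software ====\n* grep(1)\n==== Readings ====\n"
    ),
}


def first_capture(title: str, regex: str) -> str:
    """Rough analogue of scrape(): first capture of the first match."""
    match = re.search(regex, PAGES[title], re.MULTILINE)
    return match.group(1) if match else ""


def all_captures(title: str, regex: str) -> list:
    """Rough analogue of scrape_all(): the capture from every match."""
    return re.findall(regex, PAGES[title], re.MULTILINE)


def build_software_page() -> str:
    lines = ["= Software ="]
    # FOREACH title IN scrape_all(...): one pass per linked unit page.
    for title in all_captures("CS382:Topics Matrix", r"\| *\[\[([^|]*)\|"):
        # Body of the unit_section block.
        name = first_capture(title, r"= *(.*?) *=")
        software = first_capture(
            title, r"^==== *Software *==== *\n((?:.|\n)*?)={1,4}")
        lines.append(f"== [[{title}|{name}]] ==")
        lines.append(software.rstrip("\n"))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_software_page(), end="")
```

The loop corresponds to the FOREACH directive and its body to the unit_section block; whitespace handling in the real template is governed by the - and ~ markers rather than by rstrip.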

COMMANDS

There are also a number of special commands. Here are a few common ones.

FOREACH
Allows you to repeat a section of a template for each value in an array.

[% FOREACH var IN expression %]
[% var %]
stuff...
[% END %]

BLOCK
Allows you to create a named section of a template to be included elsewhere.

[% BLOCK blockname %]
stuff...
[% END %]

INCLUDE
Inserts the contents of a block (or external template file).

[% INCLUDE block/filename %]

WHITESPACE

Anything not within code blocks is printed verbatim. This means that any whitespace surrounding template code, including newlines, is kept in the output. To remove whitespace around template code, add either - or ~ just inside the % on the side you want to trim. The - removes whitespace up to and including the nearest newline it encounters. The ~ removes all adjacent whitespace on that side, including newlines.

For example:

[% var = "Hey There" -%] 
[% var %]

This prints just "Hey There\n", because the newline after the first line is removed.

[% var = "Hey There" %]

[%~ var %]

This also prints just "Hey There\n", because all whitespace before the third line is removed.
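
The two trimming rules can be modelled in a few lines of Python. This is a sketch of the behaviour as described in this section, not Template Toolkit's implementation; the helper names chomp_one_after and chomp_all_before are hypothetical.

```python
import re


def chomp_one_after(text: str) -> str:
    """Model of '-%]': drop spaces/tabs up to and including the next newline."""
    return re.sub(r"\A[ \t]*\n", "", text, count=1)


def chomp_all_before(text: str) -> str:
    """Model of '[%~': drop all whitespace, newlines included, before the tag."""
    return re.sub(r"\s+\Z", "", text)


value = "Hey There"

# First example: the assignment tag prints nothing, and the literal text
# that follows it is a single newline, which '-%]' removes.
first = chomp_one_after("\n") + value + "\n"

# Second example: two newlines sit between the tags, and '[%~' removes both.
second = chomp_all_before("\n\n") + value + "\n"

if __name__ == "__main__":
    print(repr(first))
    print(repr(second))
```

In both cases only the newline after the final [% var %] survives, which matches the outputs described above.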

FUNCTIONS

The following functions are provided by the scraper:

title( TITLE )
Tells the scraper what title to upload the generated page to in the wiki. If it's not called at least once in the template, a title will be prompted for.
prompt( STRING )
Prompts the user in the form 'STRING: ' and then returns the next entered line without the trailing newline.
login( [USERNAME [, PASSWORD ] ] )
Tells the scraper to attempt to login if it needs to access the wiki (either to scrape or upload). If either or both username and password aren't provided they will be prompted for.
cat( STRING... )
Concatenates all passed strings together and returns the result.
scrape( TITLE, REGEX )
Finds the first match of REGEX inside the page called TITLE in the wiki and returns the regex captures.
scrape_next( TITLE, REGEX )
Finds the next match of REGEX inside the page called TITLE in the wiki and returns the regex captures. Starts searching directly after the position of the last match.
scrape_all( TITLE, REGEX )
Finds all matches of REGEX inside the page called TITLE in the wiki and returns the regex captures.
subsection( TITLE, START_REGEX, END_REGEX )
Sections off the portion of the page designated by the first match of START_REGEX and the first match of END_REGEX after START_REGEX. Returns a quasi-title that can be used anywhere a title can be used.
subsection_next( TITLE, START_REGEX, END_REGEX )
Sections off the next portion of the page designated by the first match of START_REGEX and the first match of END_REGEX after START_REGEX. Starts searching directly after the position of the last match. Returns a quasi-title that can be used anywhere a title can be used.
subsection_all( TITLE, START_REGEX, END_REGEX )
Sections off all portions of the page designated by matches of START_REGEX and the first match of END_REGEX after each match of START_REGEX. Returns a list of quasi-titles that can be used anywhere a title can be used.
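
The difference between scrape, scrape_next and scrape_all is easiest to see on a small piece of text. The Python class below is a toy model of the behaviour described above, working on a string instead of a wiki page. The class name PageScraper and its bookkeeping are hypothetical, and three details are assumptions: that ^ and $ match at line boundaries, that a failed match yields an empty string, and that scrape resets the position used by scrape_next.

```python
import re


class PageScraper:
    """Toy model of scrape / scrape_next / scrape_all over one page's text."""

    def __init__(self, text: str):
        self.text = text
        self.pos = 0  # end offset of the last match (assumed bookkeeping)

    def _search(self, regex: str, start: int):
        match = re.compile(regex, re.MULTILINE).search(self.text, start)
        if match is None:
            return ""  # assumption: no match yields an empty string
        self.pos = match.end()
        groups = match.groups()
        # One capture group: return the string itself, otherwise a tuple.
        return groups[0] if len(groups) == 1 else groups

    def scrape(self, regex: str):
        """First match in the page, searching from the start."""
        return self._search(regex, 0)

    def scrape_next(self, regex: str):
        """Next match, searching directly after the previous one."""
        return self._search(regex, self.pos)

    def scrape_all(self, regex: str):
        """Captures from every match in the page."""
        return re.findall(regex, self.text, re.MULTILINE)


if __name__ == "__main__":
    page = PageScraper("* alpha\n* beta\n* gamma\n")
    item = r"^\* (.*)$"
    print(page.scrape(item))
    print(page.scrape_next(item))
    print(page.scrape_next(item))
    print(page.scrape_all(item))
```

Repeated scrape_next calls walk through the list items one at a time, while scrape_all returns them all at once, which is the form a FOREACH loop consumes.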

COPYRIGHT

2009, Matt Edlefsen

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Examples

software.tt

[% title('CS382:Software') -%]
= Software =
[% FOREACH title IN scrape_all('CS382:Topics Matrix','^\| *\[\[([^|]*)\|') # looks for links -%]
[% name = scrape(title, '^= *(.*?) *=') # grab first header -%]
[% software = scrape(title,'^==== *Software *==== *\n((?:.|\n)*?)^={1,4}') -%]
== [[[% title %]|[% name %]]] ==
[% software %]
[% END -%]

geneds.tt

[% title('CS382:GenEds') -%]
[% pages = scrape_all('CS382:Topics Matrix','\| \[\[([^|]*)\|') ~%]

[% BLOCK UnitLink -%]
[% linkname = scrape(page, '^= *(.*?) *=') -%]
[% IF linkname.length == 0 -%]
[% linkname = page -%]
[% END -%]
[[[% page %][% anchor %]|[% linkname %]]]
[%- END ~%]

[% BLOCK GenEd -%]
* ''[% name %]''
[% FOREACH page IN pages -%]
** [% INCLUDE UnitLink %]: [% scrape(page, cat( name.replace('-','.') , '.*?^\** (.*?)$')) %]
[% END -%]
[% END ~%]

[% BLOCK GenEdRow -%]
[% FOREACH page IN pages -%]
| [% scrape(page, "${name.replace('-','.')}.*?^\\** (.{1,15}?)\\.") %]
[% END -%]
[% END ~%]

== General Education Alignment ==

<center>
{| class="wikitable" border="1"
|+ Helpful Total Geneds Coverage Table, Fig. 18c.
|-
! Unit
[% FOREACH page IN pages  %]
! [% INCLUDE UnitLink anchor = '#General Education Alignment' %]
[% END %]
|-

| ARa
[% INCLUDE GenEdRow name = 'They focus substantially on properties of classes of abstract models and operations that apply to them.' %]
|-
| ARb
[% INCLUDE GenEdRow name = 'They provide experience in generalizing from specific instances to appropriate classes of abstract models.' %]
|-
| ARc
[% INCLUDE GenEdRow name = 'They provide experience in solving concrete problems by a process of abstraction and manipulation at the abstract level. Typically this experience is provided by word problems which require students to formalize real-world problems in abstract terms, to solve them with techniques that apply at that abstract level, and to convert the solutions back into concrete results.' %]
|-
| QRa
[% INCLUDE GenEdRow name = 'Using and interpreting formulas, graphs and tables.' %]
|-
| QRb
[% INCLUDE GenEdRow name = 'Representing mathematical ideas symbolically, graphically, numerically and verbally.' %]
|-
| QRc
[% INCLUDE GenEdRow name = 'Using mathematical and statistical ideas to solve problems in a variety of contexts.' %]
|-
| QRd
[% INCLUDE GenEdRow name = 'Using simple models such as linear dependence, exponential growth or decay, or normal distribution.' %]
|-
| QRe
[% INCLUDE GenEdRow name = 'Understanding basic statistical ideas such as averages, variability and probability.' %]
|-
| QRf
[% INCLUDE GenEdRow name = 'Making estimates and checking the reasonableness of answers.' %]
|-
| QRg
[% INCLUDE GenEdRow name = 'Recognizing the limitations of mathematical and statistical methods.' %]
|-
| SIa
[% INCLUDE GenEdRow name = 'Develops students\' understanding of the natural world.' %]
|-
| SIb
[% INCLUDE GenEdRow name = 'Strengthens students\' knowledge of the scientific way of knowing - the use of systematic observation and experimentation to develop theories and test hypotheses.' %]
|-
| SIc
[% INCLUDE GenEdRow name = 'Emphasizes and provides first-hand experience with both theoretical analysis and the collection of empirical data.' %]
|}
</center>


=== Analytical Reasoning Requirement ===
==== Abstract Reasoning ====
From the [[http://www.earlham.edu/curriculumguide/academics/analytical.html Catalog Description]] ''Courses qualifying for credit in Abstract Reasoning typically share these characteristics:''
[% INCLUDE GenEd name = 'They focus substantially on properties of classes of abstract models and operations that apply to them.' anchor = '#Abstract Reasoning' %]
[% INCLUDE GenEd name = 'They provide experience in generalizing from specific instances to appropriate classes of abstract models.' anchor = '#Abstract Reasoning' %]
[% INCLUDE GenEd name = 'They provide experience in solving concrete problems by a process of abstraction and manipulation at the abstract level. Typically this experience is provided by word problems which require students to formalize real-world problems in abstract terms, to solve them with techniques that apply at that abstract level, and to convert the solutions back into concrete results.' anchor = '#Abstract Reasoning' %]

==== Quantitative Reasoning ====
From the [[http://www.earlham.edu/curriculumguide/academics/analytical.html Catalog Description]] ''General Education courses in Quantitative Reasoning foster students' abilities to generate, interpret and evaluate quantitative information. In particular, Quantitative Reasoning courses help students develop abilities in such areas as:''
[% INCLUDE GenEd name = 'Using and interpreting formulas, graphs and tables.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Representing mathematical ideas symbolically, graphically, numerically and verbally.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Using mathematical and statistical ideas to solve problems in a variety of contexts.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Using simple models such as linear dependence, exponential growth or decay, or normal distribution.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Understanding basic statistical ideas such as averages, variability and probability.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Making estimates and checking the reasonableness of answers.' anchor = '#Quantitative Reasoning' %]
[% INCLUDE GenEd name = 'Recognizing the limitations of mathematical and statistical methods.' anchor = '#Quantitative Reasoning' %]

=== Scientific Inquiry Requirement ===
From the [[http://www.earlham.edu/curriculumguide/academics/scientific.html Catalog Description]] ''Scientific inquiry:''
[% INCLUDE GenEd name = 'Develops students\' understanding of the natural world.' anchor = '#Scientific Inquiry Requirement' %]
[% INCLUDE GenEd name = 'Strengthens students\' knowledge of the scientific way of knowing - the use of systematic observation and experimentation to develop theories and test hypotheses.' anchor = '#Scientific Inquiry Requirement' %]
[% INCLUDE GenEd name = 'Emphasizes and provides first-hand experience with both theoretical analysis and the collection of empirical data.' anchor = '#Scientific Inquiry Requirement' %]