Difference between revisions of "SE2006:group bar:scrapi"

From Earlham CS Department

Revision as of 16:49, 17 March 2006

Defining Sources

Start at the bottom of the scrape, with the page that holds the data you ultimately want. Create a Schema from a regular expression that matches that data.

Schema theaterInfoSchema = 
 new Schema("<B>@@@name(string)@@@</B><BR><FONT .*?>@@@street_address(string)@@@<BR>@@@city(string)@@@,[ ]?......");

Note that the pattern uses variable names and variable types, written as @@@name(type)@@@. The variable names are the eventual keys in the hash returned by the scrape. Each placeholder is replaced by the regular expression associated with its type. Through its type, each variable is also associated with a SQL type.
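The placeholder expansion described above can be sketched roughly as follows. This is only an illustration, not the framework's code: the class PlaceholderExpander and its expand method are hypothetical names, and the sketch assumes every placeholder has the form @@@name(type)@@@ with a type that has already been registered.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the framework): expands
// @@@name(type)@@@ placeholders into the regular expression for the type.
class PlaceholderExpander {
    // Captures the variable name (group 1) and the type name (group 2).
    private static final Pattern PLACEHOLDER =
        Pattern.compile("@@@(\\w+)\\((\\w+)\\)@@@");

    static String expand(String template, Map<String, String> typeRegExps) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String regExp = typeRegExps.get(m.group(2));
            if (regExp == null) {
                throw new IllegalArgumentException("Unknown type: " + m.group(2));
            }
            // quoteReplacement keeps backslashes and '$' in regExp literal.
            m.appendReplacement(out, Matcher.quoteReplacement(regExp));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        types.put("string", "(.*?)");
        // prints <B>(.*?)</B>
        System.out.println(expand("<B>@@@name(string)@@@</B>", types));
    }
}
```

A real implementation would also need to remember the order of the variable names so that each captured group can be stored under the right key.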

Next, create a Source from a URL and this Schema.

Source theaterInfo = 
 new Source("http://www.kerasotes.com/Showtimes.aspx?SearchString=&TheaterSearch=@@@theater_id@@@&OptionTheater=++Go++",theaterInfoSchema);

Note that the URL contains a reference to @@@theater_id@@@. This variable will be inherited from the parent Source and expanded based on the regExp associated with that variable's type. The parent Source will create a new Source for each regexp match it makes on its URL. The parent is defined with its subSource as an extra argument to the constructor.

Schema kerasotesHome = 
 new Schema("<OPTION value=\"@@@theater_id(theater_num)@@@\"[^>]*>.......");
Source kerasotes = 
 new Source("http://www.kerasotes.com/Home.aspx",kerasotesHome,theaterInfo);

Here you can see how the Source picks off the @@@theater_id@@@ variable and passes it into its subSource, theaterInfo.
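As a rough sketch of that hand-off, the code below builds one child URL per match on the parent page. It is an assumption about how the mechanism could work, not the framework's implementation: SubSourceUrls and childUrls are hypothetical names, the sample page and the example.invalid URL are made up, and the variable is assumed to be the first capturing group of the parent's regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the framework).
class SubSourceUrls {
    // For every match of parentRegExp in parentPage, substitutes the first
    // captured group for @@@variable@@@ in the child URL template.
    static List<String> childUrls(String parentPage, String parentRegExp,
                                  String variable, String childUrlTemplate) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile(parentRegExp).matcher(parentPage);
        while (m.find()) {
            // String.replace does a literal (non-regex) substitution.
            urls.add(childUrlTemplate.replace("@@@" + variable + "@@@", m.group(1)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up parent page with two theater options.
        String page = "<OPTION value=\"12\">A</OPTION><OPTION value=\"305\">B</OPTION>";
        // The theater_num type from createTypes(), already expanded.
        String regExp = "<OPTION value=\"([^0].*?)\"[^>]*>";
        String template = "http://example.invalid/show?id=@@@theater_id@@@";
        for (String url : childUrls(page, regExp, "theater_id", template)) {
            System.out.println(url);
        }
    }
}
```

Each resulting URL would then back its own child Source, which is scraped with the child's Schema.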

Adding New Types

Creating new types means modifying the Schema class. Every time a Schema is instantiated, the constructor calls createTypes(), which fills in the typeNameSqlTypes and typeNameRegExps HashMaps. createTypes() in turn calls addType(), which takes as arguments the type name, the SQL data type, and the regular expression that the type will be expanded to.

/* add to me */
private void createTypes() {
 addType("href_tag","text","(<A href.*?>)");
 addType("string","string","(.*?)");
 addType("int","integer","(\\d*)");
 addType("state_abrev", "string","(\\w\\w)");
 addType("theater_num","integer","([^0].*?)");
}
	
private void addType(String typeName,String sqlType, String RegExp) { .. }
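For orientation, here is a minimal sketch of how the two maps might be filled and how a new type could be registered. The body of addType is elided above, so what follows is a guess at its behavior rather than the actual code; the TypeRegistry class and the zip_code type are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the type bookkeeping inside Schema.
class TypeRegistry {
    final Map<String, String> typeNameSqlTypes = new HashMap<>();
    final Map<String, String> typeNameRegExps = new HashMap<>();

    // Assumed behavior: record the SQL type and the regular expression
    // under the same type name.
    void addType(String typeName, String sqlType, String regExp) {
        typeNameSqlTypes.put(typeName, sqlType);
        typeNameRegExps.put(typeName, regExp);
    }

    public static void main(String[] args) {
        TypeRegistry registry = new TypeRegistry();
        // zip_code is an invented example type: exactly five digits.
        registry.addType("zip_code", "string", "(\\d{5})");
        System.out.println(registry.typeNameRegExps.get("zip_code"));
    }
}
```

With such a type registered, a pattern could refer to it as @@@zip(zip_code)@@@. Note that the regular expression is wrapped in exactly one group, matching the existing types.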

Skipping Groups

Note that currently the only way to have a group in your regular expression without breaking the framework is to have it matched as a variable. This means that you will have to define a type for the group you want and use that type in your regular expression. We have been using the variable name skip for such groups, which we would not pass on into the database. If you have more than one skip variable of different types, you will need to give them different names, e.g. skip_n.
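One way to keep skip variables out of the database is to filter them by name before storing a row. The sketch below assumes the naming convention described above (skip, or skip_ followed by a suffix); SkipFilter and its methods are hypothetical and not part of the framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the framework).
class SkipFilter {
    // True for "skip" and for names such as "skip_1" or "skip_n".
    static boolean isSkip(String variableName) {
        return variableName.equals("skip") || variableName.startsWith("skip_");
    }

    // Returns a copy of the scraped row without the skip variables,
    // preserving the order of the remaining keys.
    static Map<String, String> withoutSkips(Map<String, String> row) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : row.entrySet()) {
            if (!isSkip(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "Example Theater");
        row.put("skip", "<FONT size=2>");
        row.put("skip_1", "42");
        System.out.println(withoutSkips(row).keySet());
    }
}
```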