Difference between revisions of "SE2006:group bar:scrapi"
Revision as of 16:26, 17 March 2006
Defining Sources
Start at the bottom of the scrape, with the innermost page to be scraped, and create a Schema from a regular expression that matches it.
Schema theaterInfoSchema = new Schema("<B>@@@name(string)@@@</B><BR><FONT .*?>@@@street_address(string)@@@<BR>@@@city(string)@@@,[ ]?......");
Note that the regular expression uses placeholders of the form @@@name(type)@@@, which give each variable a name and a type. The variable names become the keys of the hash returned by the scrape. Each placeholder is replaced by the regular expression associated with its type, and through its type each variable is also associated with a SQL type.
Next, create a Source object from a URL and this Schema.
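The expansion described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the framework's actual code: the class PlaceholderSketch, the methods expand and scrape, the TYPE_REGEXPS table and the seats variable are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSketch {
    // Hypothetical type table: type name -> capturing regular expression.
    static final Map<String, String> TYPE_REGEXPS = new HashMap<>();
    static {
        TYPE_REGEXPS.put("string", "(.*?)");
        TYPE_REGEXPS.put("int", "(\\d*)");
    }

    // Matches placeholders of the form @@@name(type)@@@.
    static final Pattern PLACEHOLDER = Pattern.compile("@@@(\\w+)\\((\\w+)\\)@@@");

    // Replaces each placeholder with its type's regexp and records the
    // variable names in the order their groups appear.
    static String expand(String template, List<String> namesOut) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String regexp = TYPE_REGEXPS.get(m.group(2));
            if (regexp == null) {
                throw new IllegalArgumentException("unknown type: " + m.group(2));
            }
            namesOut.add(m.group(1));
            m.appendReplacement(sb, Matcher.quoteReplacement(regexp));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Runs the expanded pattern against a page and returns name -> captured text
    // for the first match (an empty map if nothing matches).
    static Map<String, String> scrape(String template, String page) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile(expand(template, names)).matcher(page);
        Map<String, String> row = new HashMap<>();
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                row.put(names.get(i), m.group(i + 1));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        String template = "<B>@@@name(string)@@@</B><BR>@@@seats(int)@@@ seats";
        System.out.println(scrape(template, "<B>Showplace 12</B><BR>240 seats"));
    }
}
```

Because every placeholder becomes exactly one capturing group, the n-th variable name lines up with group n of the match.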
Source theaterInfo = new Source("http://www.kerasotes.com/Showtimes.aspx?SearchString=&TheaterSearch=@@@theater_id@@@&OptionTheater=++Go++",theaterInfoSchema);
Note that here we have a reference to @@@theater_id@@@ in the URL. This variable will be inherited from the parent Source and expanded based on the regExp associated with that variable's type. The parent Source will create a new Source for each regexp match it makes at its URL. The parent is defined with its subSource as an extra argument to the constructor.
Schema kerasotesHome = new Schema("<OPTION value=\"@@@theater_id(theater_num)@@@\"[^>]*>.......");
Source kerasotes = new Source("http://www.kerasotes.com/Home.aspx",kerasotesHome,theaterInfo);
Here you can see how the Source will pick off the @@@theater_id@@@ variable and pass it into its subSource, theaterInfo.
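The hand-off from parent to subSource might look roughly like the sketch below. It works on an in-memory page instead of fetching a URL, and everything in it is an assumption made for illustration: the class SubSourceSketch, the method childUrls, the sample HTML and the example.com URL are hypothetical and not part of the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubSourceSketch {
    // Returns one child URL per match of parentRegexp in parentPage.
    // Group 1 of each match is substituted for @@@variable@@@ in the template.
    static List<String> childUrls(String parentPage, String parentRegexp,
                                  String variable, String childUrlTemplate) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile(parentRegexp).matcher(parentPage);
        while (m.find()) {
            urls.add(childUrlTemplate.replace("@@@" + variable + "@@@", m.group(1)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical parent page with two theater options.
        String page = "<OPTION value=\"12\">A</OPTION><OPTION value=\"34\" selected>B</OPTION>";
        // The parent regexp after its theater_id placeholder has been expanded.
        String regexp = "<OPTION value=\"([^0].*?)\"[^>]*>";
        String template = "http://www.example.com/Showtimes.aspx?TheaterSearch=@@@theater_id@@@";
        for (String url : childUrls(page, regexp, "theater_id", template)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way would correspond to one new subSource that is then scraped with its own Schema.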
Adding New Types
Creating new types means modifying the Schema class. Every time a Schema is instantiated, the constructor calls createTypes(), which fills in the typeNameSqlTypes and typeNameRegExps HashMaps. createTypes() in turn calls addType(), which takes the type name, the SQL data type, and the regular expression that the type expands to as arguments.
/* add to me */
private void createTypes() {
    addType("href_tag", "text", "(<A href.*?>)");
    addType("string", "string", "(.*?)");
    addType("int", "integer", "(\\d*)");
    addType("state_abrev", "string", "(\\w\\w)");   // two word characters
    addType("theater_num", "integer", "([^0].*?)"); // must not start with 0
}

private void addType(String typeName, String sqlType, String RegExp) { .. }
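As a rough sketch of what adding a type involves, the class below keeps the two maps named above and registers one extra type. The body of addType is an assumption (the original elides it), and the class name TypeTableSketch and the zip_code type are hypothetical examples.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TypeTableSketch {
    // Type name -> SQL type, and type name -> capturing regular expression.
    final Map<String, String> typeNameSqlTypes = new HashMap<>();
    final Map<String, String> typeNameRegExps = new HashMap<>();

    TypeTableSketch() {
        createTypes();
    }

    private void createTypes() {
        addType("string", "string", "(.*?)");
        addType("int", "integer", "(\\d*)");
        // New, hypothetical type: a five-digit US ZIP code.
        addType("zip_code", "string", "(\\d{5})");
    }

    // Assumed behaviour: record the SQL type and the regexp under the type name.
    private void addType(String typeName, String sqlType, String regExp) {
        typeNameSqlTypes.put(typeName, sqlType);
        typeNameRegExps.put(typeName, regExp);
    }

    public static void main(String[] args) {
        TypeTableSketch types = new TypeTableSketch();
        String zipRegExp = types.typeNameRegExps.get("zip_code");
        System.out.println(Pattern.matches(zipRegExp, "61801"));
        System.out.println(types.typeNameSqlTypes.get("zip_code"));
    }
}
```

Note that the regular expression of every type is wrapped in exactly one pair of parentheses, so each variable contributes a single group.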
Skipping Groups
Note that currently the only way to have a group in your regular expression without breaking the framework is to have it matched as a variable. This means that you have to define a type for the group you want and use that type in your regular expression. We have been using the variable name skip for such groups; its value is not meant to be passed on to the database. If you have more than one skip variable and they have different types, you will need to give them different names, i.e. skip_n.
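One way the skip convention could be honoured before rows reach the database is sketched below. This is an assumption about how such filtering might be done, not a description of the framework; the class SkipSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SkipSketch {
    // Treats "skip" and any "skip_<suffix>" name as a throwaway variable.
    static boolean isSkip(String name) {
        return name.equals("skip") || name.startsWith("skip_");
    }

    // Returns a copy of the scraped row without the skip variables.
    static Map<String, String> withoutSkips(Map<String, String> row) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (!isSkip(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "Showplace 12");
        row.put("skip_1", "<A href=\"#\">");
        row.put("city", "Springfield");
        row.put("skip_2", "IL");
        System.out.println(withoutSkips(row).keySet());
    }
}
```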