SE2006:group bar:scrapi
Defining Sources
Begin at the bottom of the scrape, with the innermost page that holds the data you ultimately want. Create a Schema from a regular expression that describes that page.
Schema theaterInfoSchema = new Schema("<B>@@@name(string)@@@</B><BR><FONT .*?>@@@street_address(string)@@@<BR>@@@city(string)@@@,[ ]?......");
Note that each placeholder has the form @@@name(type)@@@, giving a variable name and a variable type. The variable names are the eventual keys in the hash returned by the scrape and thus the eventual column names in the database. Each placeholder is replaced by the regular expression associated with its type. The type is also associated with a SQL type, which becomes the type of the variable's column.
Next, create a Source from this Schema and a URL.
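The expansion described above can be sketched roughly as follows. This is a minimal illustration, not scrapi's actual Schema code: the class SchemaExpansionSketch, its methods expand and scrape, the two-entry type table, and the sample page text are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the real Schema class, only an illustration of the idea.
public class SchemaExpansionSketch {
    // Assumed type table: type name -> capturing regular expression.
    static final Map<String, String> TYPE_REGEXPS = new HashMap<>();
    static {
        TYPE_REGEXPS.put("string", "([\\s\\S]*?)");
        TYPE_REGEXPS.put("int", "(\\d*)");
    }

    // Matches a placeholder such as @@@name(string)@@@.
    static final Pattern PLACEHOLDER = Pattern.compile("@@@(\\w+)\\((\\w+)\\)@@@");

    // Replaces each placeholder with its type's regular expression and
    // appends the variable names to namesOut in order of appearance.
    static String expand(String template, List<String> namesOut) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String typeRegexp = TYPE_REGEXPS.get(m.group(2));
            if (typeRegexp == null) {
                throw new IllegalArgumentException("Unknown type: " + m.group(2));
            }
            namesOut.add(m.group(1));
            m.appendReplacement(sb, Matcher.quoteReplacement(typeRegexp));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Returns variable name -> matched text for the first match on the page.
    // Assumes the template contains no capturing groups of its own.
    static Map<String, String> scrape(String template, String page) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile(expand(template, names)).matcher(page);
        Map<String, String> row = new HashMap<>();
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                row.put(names.get(i), m.group(i + 1)); // group numbers start at 1
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up template and page text, for illustration only.
        String template = "<B>@@@name(string)@@@</B><BR>@@@seats(int)@@@ seats";
        String page = "<B>Showplace 12</B><BR>250 seats";
        System.out.println(scrape(template, page));
    }
}
```

The keys of the returned map are the variable names, which is why they can double as column names. The sketch relies on the n-th placeholder being the n-th capturing group, so it only works if the template adds no capturing groups of its own.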
Source theaterInfo = new Source("http://www.kerasotes.com/Showtimes.aspx?SearchString=&TheaterSearch=@@@theater_id@@@&OptionTheater=++Go++",theaterInfoSchema);
Note that the URL contains a reference to @@@theater_id@@@. This value will be inherited from the parent Source and filled into the URL. The parent Source creates a new Source for each regexp match it makes on its own URL. The parent is defined by passing its subSource as an extra argument to the constructor.
Schema kerasotesHome = new Schema("<OPTION value=\"@@@theater_id(theater_num)@@@\"[^>]*>.......");
Source kerasotes = new Source("http://www.kerasotes.com/Home.aspx",kerasotesHome,theaterInfo);
Here you can see how the Source picks off the @@@theater_id@@@ variable and passes it into its subSource, theaterInfo. Any variables that appear in the subSource's URL are expanded using the values it inherits from its parent.
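The URL expansion might look something like the sketch below. The class UrlTemplateSketch, the method fillUrl, and the theater id 42 are hypothetical and exist only for this example. How the real Source performs the substitution is not shown on this page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of filling @@@variable@@@ references in a subSource URL.
public class UrlTemplateSketch {
    // Matches a URL variable such as @@@theater_id@@@ (no type in parentheses).
    static final Pattern URL_VARIABLE = Pattern.compile("@@@(\\w+)@@@");

    // Replaces every @@@name@@@ with the value inherited from the parent match.
    static String fillUrl(String urlTemplate, Map<String, String> inherited) {
        Matcher m = URL_VARIABLE.matcher(urlTemplate);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = inherited.get(m.group(1));
            if (value == null) {
                throw new IllegalStateException("No inherited value for " + m.group(1));
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // One parent match yields one set of values; 42 is a made-up id.
        Map<String, String> fromParent = new HashMap<>();
        fromParent.put("theater_id", "42");
        String template = "http://www.kerasotes.com/Showtimes.aspx?SearchString="
                + "&TheaterSearch=@@@theater_id@@@&OptionTheater=++Go++";
        // prints http://www.kerasotes.com/Showtimes.aspx?SearchString=&TheaterSearch=42&OptionTheater=++Go++
        System.out.println(fillUrl(template, fromParent));
    }
}
```

Calling fillUrl once per parent match gives one concrete URL per theater, which corresponds to the parent creating a new Source for each match. The sketch does not URL-encode the inherited values.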
Adding New Types
Creating new types means modifying the Schema class. Every time a Schema is instantiated, the constructor calls createTypes(), which fills in the typeNameSqlTypes and typeNameRegExps HashMaps. createTypes() in turn calls addType(), which takes the type name, the SQL data type, and the regular expression that the type expands to as arguments.
/* add to me */
private void createTypes() {
    // addType(type name, SQL type, capturing regular expression)
    addType("href_tag", "text", "(<A href.*?>)");
    addType("string", "text", "([\\s\\S]*?)");
    addType("int", "integer", "(\\d*)");
    addType("state_abrev", "varchar(2)", "(\\w\\w)");
    addType("state_census", "varchar(9)", "(04000US\\d\\d)");
    addType("theater_num", "integer", "([^0].*?)");
    addType("two_char", "varchar(2)", "(\\w\\w)");
    addType("one_char", "varchar(1)", "(\\w)");
    addType("float", "float", "([-+]?[0-9]*\\.?[0-9]+)");
    addType("big_int", "int", "([0-9,]*?)");
}

private void addType(String typeName, String sqlType, String RegExp) { .. }
Skipping Groups
Note that every type, when expanded to a regular expression, becomes a capturing group. We use the order of these groups in the regular expression to map back-reference (group) numbers to variable names. Any group you add to a regular expression yourself, outside of what a variable expands to, must therefore be non-capturing. Otherwise it will be picked up as the value of a variable and shift the numbering of the groups after it. Put '?:' as the first characters in that group, e.g., (?:.+), and use that form in your regular expression.
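The effect on group numbering can be checked with plain java.util.regex, where a non-capturing group is written (?:...). The class name NonCapturingGroupSketch, its helper methods, and the sample input are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of how an extra capturing group shifts group numbers,
// and how a non-capturing group (?:...) avoids that.
public class NonCapturingGroupSketch {
    // Returns the text of group 1 for the first match, or null if no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Returns the number of capturing groups in the pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String input = "<BR>Springfield, IL"; // made-up sample text

        // The second group stands in for an expanded @@@city(string)@@@.
        String capturing = "(<BR>|<P>)([\\s\\S]*?),";
        String nonCapturing = "(?:<BR>|<P>)([\\s\\S]*?),";

        // With a capturing helper group, group 1 is the tag, not the city.
        System.out.println(firstGroup(capturing, input));    // prints <BR>
        // With a non-capturing helper group, group 1 is the city.
        System.out.println(firstGroup(nonCapturing, input)); // prints Springfield

        System.out.println(groupCount(capturing));    // prints 2
        System.out.println(groupCount(nonCapturing)); // prints 1
    }
}
```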