I am after someone who has experience writing custom Nutch plugins.
Project details will be given only to those who meet the first-round requirements. You must have solid experience with Nutch and be able to demonstrate it.
I am not after someone to write me a parser for a particular site.
I am after someone who can write a custom parser based on ANY DOM TREE STRUCTURE!
If you don't understand what that means: after Nutch crawls a page, I want every piece of data on it stored as a field and named automatically.
E.g. if the page contains <Div>Someinfohere</Div>, then the field that holds the extracted data is called <fieldname Div1>Someinfohere</fieldname>
That's the first step: creating order from HTML.
The second step is an easy way for me to map <fieldname Div1> to a Solr schema field.
To do that, I think the best way would be to store the data in a database of some description and build a simple GUI so that I can easily map <fieldname Div1> to <solr schema field>.
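To show the kind of mapping I mean, here is a rough sketch in Python. It is only an illustration, not the required implementation: an in-memory SQLite table stands in for the MySQL table the GUI would edit, the table and function names are made up, the Solr field names are placeholders, and keeping one mapping per site is my assumption.

```python
import sqlite3


def create_mapping_store():
    """In-memory SQLite stands in for the MySQL table a GUI would edit."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE field_map ("
        " site TEXT NOT NULL,"
        " extracted_field TEXT NOT NULL,"
        " solr_field TEXT NOT NULL,"
        " PRIMARY KEY (site, extracted_field))"
    )
    return db


def set_mapping(db, site, extracted_field, solr_field):
    # INSERT OR REPLACE lets a field be re-mapped without a separate update.
    db.execute(
        "INSERT OR REPLACE INTO field_map VALUES (?, ?, ?)",
        (site, extracted_field, solr_field),
    )


def to_solr_document(db, site, extracted):
    """Rename extracted fields to Solr fields; unmapped fields are dropped."""
    rows = db.execute(
        "SELECT extracted_field, solr_field FROM field_map WHERE site = ?",
        (site,),
    )
    mapping = dict(rows)
    return {mapping[name]: value
            for name, value in extracted.items() if name in mapping}
```

For example, after set_mapping(db, "example.com", "Div1", "title"), calling to_solr_document(db, "example.com", {"Div1": "Someinfohere", "Div2": "other"}) returns {"title": "Someinfohere"}, because Div2 has no mapping yet.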
Choice of technology is yours as long as it runs on a LAMP stack. PHP and MySQL preferred.
I will be crawling approx. 10,000 sites, so this thing will have to handle any HTML template I throw at it. If there are multiple Divs on a page, call them Div1, Div2, etc.
The DOM structure will be your guide, e.g. HTML|DIV|TABLE|TR|TD| Some info here
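To make the naming idea concrete, here is a minimal sketch in Python using only the standard library. It is not Nutch plugin code (a real plugin would be written in Java), and the class and function names are made up. It also assumes one particular reading of the numbering rule: a counter per tag name that increases each time a text node is found directly inside that tag.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these tags is code or styling, not page data.
SKIP_TAGS = {"script", "style"}


class FieldExtractor(HTMLParser):
    """Collects text nodes as (field_name, dom_path, text) tuples."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # currently open tags, outermost first
        self.counters = defaultdict(int)  # per-tag counters: Div1, Div2, ...
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: pop back to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack or self.stack[-1] in SKIP_TAGS:
            return
        tag = self.stack[-1]
        self.counters[tag] += 1
        name = f"{tag.capitalize()}{self.counters[tag]}"
        path = "|".join(t.upper() for t in self.stack)
        self.fields.append((name, path, text))


def extract_fields(html):
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Feeding it a page with two Divs that contain text and a table cell inside the body gives fields named Div1, Div2 and Td1, each with its DOM path (for example HTML|BODY|TABLE|TR|TD) and its text. Whether counters should run per tag name, per DOM path, or per element rather than per text node is open for discussion.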
I'm on a tight budget, so don't go crazy with your bid.