This project can be coded in PHP, C++, or Java, depending on which language you already have crawler code written in and where your
expertise lies. If you do not have the experience to do this job or cannot commit to it reliably, please do not bid.
The server is LAMP, so the database used must be MySQL. The ability to schedule repeat crawls/info updates as a cron job would be nice.
CRAWLER PART 1
Crawl Yellow Pages/business listings to retrieve company names/info.
Retrieve any of the following found:
business name (required), address, category, email addresses, phone numbers, websites,
logos, rating. Record the URL of the page the listing was found on.
If a business logo is found, it needs to be downloaded and matched (renamed) to the recorded listing.
The regular expressions used to find/retrieve the information on the sites need to be
modifiable so they can be altered and reused for other directories, so they should
be saved to a MySQL table.
Because the crawler will be used on sites in different
languages, the language words in the regular expressions need to be
interchangeable. If the language word 'dog' is in the expression and
I choose to crawl using Punjabi, 'dog' needs to be replaced with 'gutta'.
The language word/phrase lists don't need to be supplied now; only the ability to supply them
at a later time is necessary.
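A minimal sketch of how this could work in Java, assuming hypothetical regex_templates and language_words tables (the table and column names are placeholders, not a required schema): patterns are stored with {token} placeholders, and each token is swapped for the chosen language's word before the pattern is compiled.

    import java.sql.*;
    import java.util.regex.*;

    // Sketch only: regex templates are stored with {token} placeholders, and a
    // language_words table maps each token to its word in a given language.
    // Table layouts (regex_templates, language_words) are assumptions, not a spec.
    public class RegexLoader {
        public static Pattern loadPattern(Connection db, int templateId, String language) throws SQLException {
            String template = null;
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT pattern FROM regex_templates WHERE id = ?")) {
                ps.setInt(1, templateId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) template = rs.getString(1);
                }
            }
            if (template == null) throw new IllegalArgumentException("No template " + templateId);

            // Replace each {token} with the word for the chosen language,
            // e.g. {dog} -> gutta when crawling with the Punjabi word list.
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT token, word FROM language_words WHERE language = ?")) {
                ps.setString(1, language);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        template = template.replace("{" + rs.getString("token") + "}",
                                Pattern.quote(rs.getString("word")));
                    }
                }
            }
            return Pattern.compile(template, Pattern.CASE_INSENSITIVE);
        }
    }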
CRAWLER PART 2
For listings from CRAWLER PART 1 that didn't return web addresses,
an automated website finder (using Google?) will search for a business's website using the business's name,
email addresses, phone numbers, and other data.
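One possible sketch of that finder in Java, assuming Google's Custom Search JSON API is acceptable; the API key and search-engine ID are placeholders, and quota handling and proper JSON parsing (with a real JSON library) are left out.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    // Rough sketch of the "website finder" idea using Google's Custom Search JSON API.
    // API_KEY and SEARCH_ENGINE_ID are placeholders supplied by the client.
    public class WebsiteFinder {
        private static final String API_KEY = "YOUR_API_KEY";          // assumption
        private static final String SEARCH_ENGINE_ID = "YOUR_CX_ID";   // assumption

        public static String search(String businessName, String phone) throws Exception {
            String query = URLEncoder.encode(businessName + " " + phone, StandardCharsets.UTF_8);
            String url = "https://www.googleapis.com/customsearch/v1?key=" + API_KEY
                    + "&cx=" + SEARCH_ENGINE_ID + "&q=" + query;
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.body(); // JSON; the first item's "link" is the candidate website
        }
    }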
CRAWLER PART 3
Crawl a list of websites and locate the customer contact information page.
When it is found, parse it and record the email address, phone number, and contact name;
or, if a submit-a-question form exists, record that a form was found.
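A rough Java sketch of that parsing step; the email and phone patterns are loose placeholders that would need tuning per country, and form detection here is just a substring check.

    import java.util.regex.*;
    import java.util.*;

    // Sketch of parsing a fetched contact page: pull anything that looks like an
    // email address or phone number, and note whether a <form> element is present.
    public class ContactParser {
        private static final Pattern EMAIL = Pattern.compile(
                "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
        private static final Pattern PHONE = Pattern.compile(
                "\\+?\\d[\\d\\s().-]{6,}\\d");

        public static Map<String, Object> parse(String html) {
            Map<String, Object> result = new HashMap<>();
            List<String> emails = new ArrayList<>();
            Matcher m = EMAIL.matcher(html);
            while (m.find()) emails.add(m.group());
            List<String> phones = new ArrayList<>();
            m = PHONE.matcher(html);
            while (m.find()) phones.add(m.group());
            result.put("emails", emails);
            result.put("phones", phones);
            result.put("hasContactForm", html.toLowerCase().contains("<form"));
            return result;
        }
    }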
CRAWLER PART 4
Crawl a list of websites looking for product information and prices for specific categories (Bank, Insurance, Pension, Mobile, Mortgage, and others).
CRAWLER PART 5
Search forums by thread and try to determine what each thread is about using sets of keywords with associated weights.
Thus a thread containing words like summer, bike, and canoe could be weighted as something about leisure.
The keywords and their weights have to be easily configurable.
If a thread has enough weight to fall into a category, then it will be recorded.
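A small Java sketch of the weighting idea, assuming the keyword/weight maps are loaded from MySQL into memory; the names and the threshold parameter are illustrative only.

    import java.util.*;

    // Sketch: each category has a map of keyword -> weight, a thread's text is
    // scored per category, and it is recorded only if the best score reaches a
    // configurable threshold.
    public class ThreadClassifier {
        public static Optional<String> classify(String threadText,
                                                Map<String, Map<String, Double>> categories,
                                                double threshold) {
            String text = threadText.toLowerCase();
            String bestCategory = null;
            double bestScore = 0;
            for (Map.Entry<String, Map<String, Double>> category : categories.entrySet()) {
                double score = 0;
                for (Map.Entry<String, Double> keyword : category.getValue().entrySet()) {
                    if (text.contains(keyword.getKey())) score += keyword.getValue();
                }
                if (score > bestScore) {
                    bestScore = score;
                    bestCategory = category.getKey();
                }
            }
            return bestScore >= threshold ? Optional.ofNullable(bestCategory) : Optional.empty();
        }
    }

For example, a 'leisure' map of {summer: 1.0, bike: 1.5, canoe: 2.0} with a threshold of 2.0 would record a thread that mentions both bike and canoe.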
The software must have a management console enabling the following functions:
Must be accessible using Windows (preferably web based so our Mac/Linux systems can use it too)
Must be able to modify the list of sites to crawl
Must be able to change the regexes used for parsing
Must be able to switch the language for the regexes
Must be able to start/stop the crawler
1. Which solution do you recommend?
2. Do you have any comments on the above?
3. Can you provide a demo of the crawlers that you find suitable for this project?
4. Can you be available in the future for paid maintenance and further development?
I'd like to automate gathering contact information from businesses, initially building lists of businesses in various countries/regions by scanning yellow pages and directories. After the lists are built, the businesses' websites need to be checked for their contact pages/contact info. Since it needs to work for different languages, the regexes need to be easily configurable so the language words/phrases can be changed.
The methods for different sites will vary, but that's why a customizable crawler is required. The first steps can all be the same: following links inside the directory site, looking for pages with keywords like category, phone, and address, and checking whether those words occur multiple times on a page or whether groups of them appear together.
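As a rough illustration of that heuristic (the keyword list and thresholds are placeholders to be tuned per directory), something like the following Java could decide whether a crawled page looks like a listing page worth parsing.

    import java.util.*;

    // Sketch of the page-scoring heuristic described above: count how often each
    // directory keyword ("category", "phone", "address", ...) occurs on a page and
    // how many distinct keywords appear, so likely listing pages can be prioritised.
    public class PageScorer {
        public static boolean looksLikeListingPage(String pageText, List<String> keywords,
                                                   int minTotalHits, int minDistinctKeywords) {
            String text = pageText.toLowerCase();
            int totalHits = 0, distinct = 0;
            for (String keyword : keywords) {
                int hits = 0, from = 0;
                while ((from = text.indexOf(keyword.toLowerCase(), from)) != -1) {
                    hits++;
                    from += keyword.length();
                }
                if (hits > 0) distinct++;
                totalHits += hits;
            }
            return totalHits >= minTotalHits && distinct >= minDistinctKeywords;
        }
    }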
An example is [url removed, login to view]; the directory has the names of the businesses and their phone numbers, but not entries that are easily recognizable to a bot that wasn't written for YellowPages specifically. For a page like that, the crawler would try the normal contact keywords, and once that has failed it could look at the number of phone numbers occurring on the page and cut it up into phone-number-to-phone-number chunks (stripping extra HTML but leaving tokens to denote font size, etc.). Later that bulk data can be viewed and separated through custom methods.
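A hedged Java sketch of that fallback; the phone-number pattern and the set of tags kept as size/emphasis hints are assumptions that would need per-site adjustment.

    import java.util.*;
    import java.util.regex.*;

    // Sketch of the fallback described above: when the usual contact keywords fail,
    // find the phone numbers on the page and split the (lightly stripped) HTML into
    // chunks running from one phone number to the next, for later custom parsing.
    public class ListingChunker {
        private static final Pattern PHONE = Pattern.compile("\\+?\\d[\\d\\s().-]{6,}\\d");

        public static List<String> chunkByPhoneNumber(String html) {
            // Strip tags except a few kept as hints about emphasis/size (b, strong, h1-h6).
            String text = html.replaceAll("(?i)</?(?!b\\b|strong\\b|h[1-6]\\b)[a-z][^>]*>", " ");
            List<Integer> starts = new ArrayList<>();
            Matcher m = PHONE.matcher(text);
            while (m.find()) starts.add(m.start());
            List<String> chunks = new ArrayList<>();
            for (int i = 0; i < starts.size(); i++) {
                int end = (i + 1 < starts.size()) ? starts.get(i + 1) : text.length();
                chunks.add(text.substring(starts.get(i), end).trim());
            }
            return chunks;
        }
    }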
The accuracy won't be as good as it would be if the crawler were created for a specific site, but the ability to tweak the regexes will allow more precision. Even items like finding which link goes to the next page in a directory can be customizable.
Many directories have similar titles ('email', 'address', 'phone'), so they'll be easier to lock onto.
Later it'll need an automated mailer to email the sites' contact addresses and read the responses to see whether the address is the approved one for normal contact.
The other crawler is to crawl forums and build a knowledge base of information on various pre-determined subjects, looking for quality information over quantity.
Sample db code is included to give a basic idea of what I'm looking for.