I am looking for a developer to build a web crawler and data-extraction spider. I am only interested in providers with experience in crawler development.
The spider needs to perform these functions:
1) Search for all sites in a particular country that meet the specific subject/search terms. This could be achieved by querying Google to get a list of sites to crawl. This will then create a list of sites to crawl regularly.
2) Crawl the list of sites from step 1 and search for specific types of items on the crawled web sites in the list.
3) If the item type is found on the web site, extract the data from the web pages as cleanly as possible. This process will remove as many HTML, CSS and other tags as possible so that the acquired data is relevant and as free of distracting tags as possible.
4) Write the extracted content to a table in a MySQL database.
All code is to be PHP 5+ compatible and to run as a Windows executable or as a PHP script.
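For illustration only, step 3 (stripping HTML/CSS to get clean text) could be sketched as below. Python is used here purely as a sketch; the deliverable itself is PHP per the requirement above, and all function and class names are placeholders of my own.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a script/style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

def clean_html(html):
    """Return the visible text of an HTML fragment, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser._chunks)

# Example: clean_html("<style>p{color:red}</style><p>Hello <b>world</b></p>")
# -> "Hello world"
```

The cleaned string would then be the value written to the MySQL table in step 4.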
There will be other projects for the developer with the right skill sets, experience and sheer ability to deliver high quality systems at a reasonable price.
1. The crawler is to crawl yellow-pages-type sites (indexes and directories) in 20-30 different countries and grab information about company names, emails, addresses, phones, logos, etc. The grabbed information must be structured/indexed in categories as in a phone book, if possible.
2. The crawler identifies the customer care email and the email address where a new customer can ask a question or inquire about becoming a customer
3. The crawler sends an email with a question to see if the email gets answered
4. The crawler finds the url/web address of the companies from step 1
5. The crawler goes to the company sites and searches for prices and products in selected categories: Bank, Insurance, Pension, Mobile, Mortgage and others.
The software must have a management console enabling the following functions:
1. Must be able to deal with any site and extract any type of information.
2. Must be able to customize regional options so it can be told to crawl sites in a specific domain using the words for the domain.
3. Automated scheduling of the crawler per target site (hourly, daily, weekly, bi-weekly, etc.).
Reporting of crawl progress, results, log.
Exception handling – providing details of items not crawled.
Duplicate email address handling IS paramount: duplicate listings/email addresses must be deleted.
4. Deletion handling: recognize that previously crawled listings/email addresses are no longer listed on the target site and handle these accordingly by moving them into an inactive or archive table separate from the main listings
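The deletion-handling logic in point 4 amounts to comparing the previous crawl's listings with the current one. A minimal sketch (Python for illustration only; the deliverable is PHP, and the function name is hypothetical):

```python
def reconcile(previous, current):
    """Split email listings into still-active, newly found, and delisted.

    previous: emails recorded on the last crawl of the target site.
    current:  emails found on this crawl.
    """
    prev, curr = set(previous), set(current)
    return {
        "active": sorted(prev & curr),    # still listed: keep in main table
        "new": sorted(curr - prev),       # newly listed: insert into main table
        "archive": sorted(prev - curr),   # no longer on the site: move to archive table
    }
```

The "archive" bucket is what gets moved into the separate inactive/archive table rather than deleted outright.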
5. Backup functions to enable the entire database to be backed up.
6. The ability to easily search for, edit and remove email address records.
7. It should be possible to configure the crawler for different sites. Ease-of-use in this process is important.
8. The crawler should be fast; performance must not be degraded by inefficient implementation.
9. The web crawler must be able to put all of the companies into categories.
10. The crawler should be able to access sites through different proxies, so that the target sites do not detect suspicious behavior.
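Point 10 typically means rotating requests across a proxy pool. A round-robin sketch (Python for illustration; proxy URLs and names are placeholders, not part of the spec):

```python
from itertools import cycle

# Hypothetical proxy pool; in practice this would be configurable.
PROXIES = cycle([
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
    "http://proxy3.example:8080",
])

def next_proxy():
    """Round-robin proxy selection so consecutive requests use different exit points."""
    return next(PROXIES)
```

Each HTTP request made by the crawler would be routed through `next_proxy()`, cycling back to the first proxy after the pool is exhausted.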
1. Which solution do you recommend?
2. Do you have any comments on the above?
3. Can you provide a demo of the crawlers that you find suitable for this project?
4. Can you be available in the future for paid maintenance and further development?
For scanning forums:
Search forums by thread and try to determine what each thread is about using sets of keywords with associated weights. Thus a thread containing words like summer, bike and canoe could be weighted as something about leisure. The keywords and their weights have to be easily configurable. If a thread has enough weight to be considered fully about a category, then it is to be recorded in a table under that category.
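The keyword-weighting scheme described above could be sketched as follows. Python is used for illustration only; the weight table, categories and threshold are invented examples of what the configurable data might look like, not values from the spec.

```python
# Hypothetical configurable weight table (in practice stored in MySQL).
WEIGHTS = {
    "leisure": {"summer": 2, "bike": 3, "canoe": 3},
    "finance": {"mortgage": 4, "pension": 4, "insurance": 3},
}
THRESHOLD = 5  # minimum total weight to assign a category

def categorize(thread_text, weights=WEIGHTS, threshold=THRESHOLD):
    """Score a thread against each category; return the best category
    if its total keyword weight reaches the threshold, else None."""
    words = thread_text.lower().split()
    scores = {cat: sum(w for kw, w in kws.items() if kw in words)
              for cat, kws in weights.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

A thread containing "summer", "bike" and "canoe" would score 2+3+3 = 8 for leisure, clearing the threshold, and would be recorded under that category.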
Here is a simpler breakdown of the project:
CRAWLER PART 1
Crawl Yellow Pages/business listings to retrieve company names/info.
Retrieve any of the following found:
business name(required), address, category, emails, phone nums, websites,
logos, rating. Record the URL of the page the listing was found on.
If a business logo is found, it needs to be downloaded and matched (renamed) to the recorded listing.
The regular expressions used to find/retrieve the information on the sites need to be
modifiable so they can be altered and reused for other directories; they should
therefore be saved to a MySQL table.
Because the crawler will be used on sites in different
languages, the language words in the regular expressions need to be
inter-changeable. If the language word 'dog' is in the expression and
I choose to crawl using Punjabi, 'dog' needs to be replaced with 'gutta'.
The language word/phrase lists don't need to be supplied, just the ability to supply them
at a later time is necessary.
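The interchangeable-language-word requirement could be handled by storing the regexes as templates with named placeholders and filling them from a per-language word table. A sketch (Python for illustration; the word lists and function name are placeholders, since the spec says the lists will be supplied later):

```python
import re

# Hypothetical per-language word tables, to be supplied later per the spec.
LANG_WORDS = {
    "en": {"dog": "dog"},
    "pa": {"dog": "gutta"},  # the example word given in the spec
}

def build_pattern(template, lang, tables=LANG_WORDS):
    """Fill {word} placeholders in a stored regex template for a language."""
    return re.compile(template.format(**tables[lang]))

# A template stored in MySQL might look like r"{dog}s?\s+for\s+sale";
# compiled for "pa" it matches "guttas for sale".
```

The same stored template is then reusable for every language for which a word list exists.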
CRAWLER PART 2
For listings from CRAWLER PART 1 which didn't return web addresses,
an automated website finder (using Google?) will search for a business' website using the business' name,
email addresses, phone numbers and other data.
CRAWLER PART 3
Crawl a list of websites and locate the customer contact information page.
When it is found, parse it and record the email address/phone number and contact name;
or, if a submit-question form exists, record that a form was found.
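The parsing step for CRAWLER PART 3 could look like the sketch below (Python for illustration; the email regex is a deliberately simple placeholder, and the function name is my own):

```python
import re

# Simplified email pattern for illustration; a production pattern would be stricter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def parse_contact_page(html):
    """Extract email addresses from a contact page and flag whether
    a submit-question form is present."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "has_form": "<form" in html.lower(),
    }
```

The returned record (emails plus a form-found flag) is what would be written to the MySQL listings table.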
CRAWLER PART 4
Crawl a list of websites looking for product information and prices for specific categories (Bank, Insurance, Pension, Mobile, Mortgage and others)
CRAWLER PART 5
Search forums by thread and try to determine what the thread is about using sets of keywords with associated weights.
Thus a thread containing words like summer, bike and canoe could be weighted as something about leisure.
The keywords and their weights have to be easily configurable.
If a thread has enough weight to fall into a category then it will be recorded.
Must be able to modify the list of sites to crawl,
Must be able to change regex for parsing,
Must be able to switch the language for the regexes,
Must be able to start/stop crawler
Please respond with any questions you may have