I need two data scrapers for the following sites:
www.aziende.it
login.cercaziende.it
The scrapers need to collect the following information:
- category (e.g. plumbers, etc.)
- Business Name
- description (id="textDescriptor")
- All phone & fax numbers
- website address
- email address
A business may have more than one phone number; these should be broken into the following fields:
- AH Contact
I also need the address broken into separate fields:
- Street number and name
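As a rough sketch of the address split, assuming an Italian-style format such as "Via Roma 12, 20100 Milano" (the regex and the field names are my own assumptions, not a specification; real listings will need more cases):

```python
import re

def split_address(raw):
    # Best-effort split of an Italian-style address into
    # street / number / postcode / city. The pattern is an
    # assumption and will need tuning against real data.
    m = re.match(
        r"(?P<street>.+?)\s+(?P<number>\d+\w*),\s*(?P<zip>\d{5})\s+(?P<city>.+)",
        raw,
    )
    if m:
        return m.groupdict()
    # Fall back to leaving the whole string in the street field.
    return {"street": raw, "number": "", "zip": "", "city": ""}

parts = split_address("Via Roma 12, 20100 Milano")
```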
The script must be able to:
- use a proxy server list in round-robin fashion, rotating proxies every 20 or 50 requests
- take a file with the URL list as input
- export the data to a CSV file
- provide a simple interface that lets me start/stop the script and gives basic progress feedback
- automatically extract the data from the continuation pages (i.e. 2, 3, 4 onwards) to collect the full data set
- let me specify the maximum number of records to retrieve and the retrieval speed (delay between requests)
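The round-robin proxy rotation described above could be sketched like this (the proxy addresses and the `rotate_every` value are placeholders; the real script would also hand the chosen proxy to the HTTP client and sleep for the configured delay between requests):

```python
import itertools

def rotating_proxies(proxy_list, rotate_every=20):
    """Yield the same proxy for `rotate_every` consecutive requests,
    then advance to the next one, cycling round robin forever."""
    for proxy in itertools.cycle(proxy_list):
        for _ in range(rotate_every):
            yield proxy

# Placeholder proxies; the real list would be loaded from a file.
proxies = ["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"]

picker = rotating_proxies(proxies, rotate_every=2)
first_five = [next(picker) for _ in range(5)]
```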
For the first website, the URLs that contain the links to the information are:
http://www.aziende.it/abbigliamento/[url removed, login to view]
http://www.aziende.it/casa-e-giardino/[url removed, login to view]
and so on.
For the second website, the URLs look like:
http://login.cercaziende.it/category/abbigliamento
http://login.cercaziende.it/category/auto-e-moto
and so on.
For this site the information is all on the listing page; you do not have to follow any other links besides the paging.
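The paging and CSV-export requirements for this site could be sketched as follows. The `?pagina=N` continuation pattern is an assumption that must be confirmed against the live site, and the column names are my own naming of the fields requested above:

```python
import csv
import io

# Assumed column names for the requested fields.
FIELDS = ["category", "business_name", "description", "phone", "fax",
          "website", "email", "street"]

def page_urls(base_url, max_pages):
    # Yield the category page followed by its continuation pages.
    # The "?pagina=N" pattern is an assumption, not confirmed.
    yield base_url
    for n in range(2, max_pages + 1):
        yield f"{base_url}?pagina={n}"

def export_csv(rows, handle):
    # Missing keys are written as empty cells (DictWriter's restval default).
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

urls = list(page_urls("http://login.cercaziende.it/category/abbigliamento", 3))

buf = io.StringIO()
export_csv([{"category": "abbigliamento", "business_name": "Example Srl"}], buf)
```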