I require a Java-based web scraper that does the following.
It must be possible to enter or import a list of URLs to be scraped.
The list will be in the format id,website (e.g. 1, [url removed, login to view])
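As a rough illustration of the input format, the sketch below reads id,website lines into a list of targets. The class name UrlListParser, the Target holder, and the choice to prepend http:// to bare hostnames are assumptions made for this example, not part of the requirement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses "id,website" lines into crawl targets.
final class UrlListParser {
    /** One input row: the caller-supplied id and the website to crawl. */
    static final class Target {
        final String id;
        final String website;
        Target(String id, String website) {
            this.id = id;
            this.website = website;
        }
    }

    static List<Target> parse(List<String> lines) {
        List<Target> targets = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            // Split on the first comma only, so commas inside the URL survive.
            int comma = trimmed.indexOf(',');
            if (comma < 0) continue; // malformed line: no separator
            String id = trimmed.substring(0, comma).trim();
            String site = trimmed.substring(comma + 1).trim();
            if (id.isEmpty() || site.isEmpty()) continue;
            // Assumption: entries without a scheme are treated as http://
            if (!site.matches("(?i)^https?://.*")) site = "http://" + site;
            targets.add(new Target(id, site));
        }
        return targets;
    }
}
```

Malformed lines are skipped silently here; a real tool would probably report them instead.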
The scraper should look at each provided URL and:
1. follow redirects (up to 10) to the final URL.
2. extract and store the page title (<title>)
3. extract and store any email addresses on the page.
4. get the links to the contact us and privacy pages, if any
5. get the link to a Facebook page, if any
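Steps 2 to 5 could be sketched roughly as follows. This version uses only regular expressions so that it has no dependencies; a production scraper would more likely use a real HTML parser. The class name PageExtractor, the email pattern, and the keyword matching for links are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the title, emails and selected links out of raw HTML.
final class PageExtractor {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Deliberately simple email pattern; it will miss some valid addresses.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the page title with whitespace collapsed, or "" if there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim().replaceAll("\\s+", " ") : "";
    }

    /** Returns the distinct lower-cased emails in page order, joined by ';'. */
    static String emails(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(html);
        while (m.find()) found.add(m.group().toLowerCase());
        return String.join(";", found);
    }

    /** Returns the href of the first link whose href or text contains the keyword. */
    static String firstLink(String html, String keyword) {
        String k = keyword.toLowerCase();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            String text = m.group(2).toLowerCase();
            if (href.toLowerCase().contains(k) || text.contains(k)) return href;
        }
        return "";
    }
}
```

For example, firstLink(html, "contact") would look for the contact us page, firstLink(html, "privacy") for the privacy page, and firstLink(html, "facebook.com") for the Facebook page.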
Go to the contact us page URL (if any) and search for email addresses. Extract and store them.
Go to the privacy page URL (if any) and search for email addresses. Extract and store them.
Go to the Facebook page URL (if any) and search for email addresses. Extract and store them.
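Visiting the contact, privacy and Facebook pages usually means turning a relative link into an absolute URL first, and the same resolution applies to redirect targets. The sketch below shows one possible way to do both. The class name LinkResolver is hypothetical, and the locationOf function is a stand-in for a real HTTP request that returns the redirect target, or null when the URL is final.

```java
import java.net.URI;
import java.util.function.Function;

// Hypothetical helper: resolves links and follows a bounded redirect chain.
final class LinkResolver {
    /** Resolves href against baseUrl; returns "" if either is not a valid URI. */
    static String resolve(String baseUrl, String href) {
        try {
            URI base = URI.create(baseUrl);
            // A base without a path (http://example.com) is given "/" first,
            // otherwise java.net.URI glues the relative path onto the host.
            if (base.getPath() == null || base.getPath().isEmpty()) {
                base = base.resolve("/");
            }
            return base.resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    /**
     * Follows at most maxHops redirects starting from startUrl.
     * locationOf returns the redirect target of a URL, or null if it is final.
     */
    static String followRedirects(String startUrl, int maxHops,
                                  Function<String, String> locationOf) {
        String current = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            String location = locationOf.apply(current);
            if (location == null) return current; // no further redirect
            String next = resolve(current, location);
            if (next.isEmpty()) return current;   // unusable redirect target
            current = next;
        }
        return current; // hop limit reached, report the last URL seen
    }
}
```

With maxHops set to 10 this matches the redirect limit in point 1; what to report when the limit is hit is left open here.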
The crawler should produce a CSV file with the following columns:
id, original website, website crawled (final URL from point 1), title, emails from website (delimited by ;), contact url, emails from contact url, privacy url, emails from privacy url, facebook url, emails from facebook url.
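Because page titles can contain commas and quotes, each field needs to be escaped before it is written. A minimal sketch of one row writer follows; the class name CsvRow is hypothetical and the quoting rules are the common double-quote convention, which is an assumption about the expected CSV dialect.

```java
import java.util.List;

// Hypothetical helper: builds one CSV line from a list of field values.
final class CsvRow {
    /** Quotes a field if it contains a comma, quote or line break. */
    static String escape(String field) {
        if (field == null) return "";
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        // Embedded quotes are doubled, as in the usual CSV convention.
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Joins the escaped fields with commas; null fields become empty cells. */
    static String join(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }
}
```

Emails joined with ; need no quoting under these rules, so the email columns stay readable in the output.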
The crawler should be able to run multiple crawls at the same time. I need to be able to enter the number of concurrent crawls.
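One possible way to support a configurable number of concurrent crawls is a fixed-size thread pool. In the sketch below the class name ConcurrentCrawler is hypothetical, and crawlOne stands in for whatever function crawls a single URL and returns its report row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs crawlOne over all URLs with a bounded thread pool.
final class ConcurrentCrawler {
    /** Returns one result per URL, in the same order as the input list. */
    static List<String> crawlAll(List<String> urls, int concurrency,
                                 Function<String, String> crawlOne) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> crawlOne.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // A failed crawl becomes an error row instead of aborting the run.
                    results.add("error: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add("error: interrupted");
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The concurrency argument is the user-entered number of concurrent crawls; collecting futures in submission order keeps the report rows aligned with the input list.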
I must be able to enter a list of proxies to be used for scraping.
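A proxy list could be handled with the standard java.net.Proxy class and simple round-robin rotation, as sketched below. The host:port input format, the HTTP proxy type, and the class name ProxyRotator are assumptions; the posting does not say how proxies are supplied or whether they need authentication.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands out proxies from a "host:port" list in turn.
final class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    ProxyRotator(List<String> hostPortLines) {
        for (String line : hostPortLines) {
            String entry = line.trim();
            int colon = entry.lastIndexOf(':');
            if (colon <= 0) continue; // no host or no port separator
            try {
                int port = Integer.parseInt(entry.substring(colon + 1));
                String host = entry.substring(0, colon);
                proxies.add(new Proxy(Proxy.Type.HTTP,
                        InetSocketAddress.createUnresolved(host, port)));
            } catch (IllegalArgumentException e) {
                // Covers a non-numeric port and a port outside 0..65535; skip the entry.
            }
        }
    }

    /** Returns the next proxy, or a direct connection if the list is empty. */
    Proxy next() {
        if (proxies.isEmpty()) return Proxy.NO_PROXY;
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

A crawl thread could then open its connection with url.openConnection(rotator.next()); the atomic counter keeps the rotation safe when several crawls run at once.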
The crawler must be able to parse HTML entities correctly. If you don't know what I mean, ask me.
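To illustrate what entity handling involves, the sketch below decodes numeric references and a small subset of named entities, so that text such as info&#64;example.com becomes a usable email address. The class name HtmlEntities and the short entity table are assumptions for this example; a complete solution would need the full named-entity table, typically from an HTML library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes numeric and a few named HTML entities.
final class HtmlEntities {
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");
    // Only a small illustrative subset of the named entities.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("copy", "\u00A9");
    }

    /** Decodes entities in a single pass; unknown entities are left untouched. */
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: keep the original text
            if (body.charAt(0) == '#') {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                try {
                    int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
                    if (Character.isValidCodePoint(codePoint)) {
                        replacement = new String(Character.toChars(codePoint));
                    }
                } catch (NumberFormatException e) {
                    // Number too large to be a code point; keep the original text.
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Decoding in a single pass matters: &amp;lt; should become the literal text &lt; rather than being decoded twice into <.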
Some websites will not resolve in DNS. These websites should appear in the report as: id, website, dead
Some websites will return 404. These websites should appear in the report as: id, website, 404
Some websites will be suspended/parked. These websites should appear in the report as: id, website, suspended
If you have any questions, don't hesitate to ask me.
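The three failure cases could be told apart roughly as sketched below. DNS failure surfaces in Java as java.net.UnknownHostException and 404 is an HTTP status code, but there is no standard signal for a suspended or parked site, so the marker phrases here are purely hypothetical examples and would need tuning against real pages. The class name SiteStatus is also an assumption.

```java
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical helper: maps a fetch outcome to the report status column.
final class SiteStatus {
    // Illustrative phrases only; real parked/suspended pages vary widely.
    private static final String[] PARKED_MARKERS = {
        "account suspended",
        "this domain is parked",
        "domain is for sale"
    };

    /**
     * error is the exception thrown while fetching (or null),
     * httpStatus is the response code (ignored when error is set),
     * body is the page HTML (may be null).
     */
    static String classify(Exception error, int httpStatus, String body) {
        if (error instanceof UnknownHostException) return "dead";
        if (httpStatus == 404) return "404";
        String lower = body == null ? "" : body.toLowerCase(Locale.ROOT);
        for (String marker : PARKED_MARKERS) {
            if (lower.contains(marker)) return "suspended";
        }
        return "ok";
    }
}
```

Rows classified as dead, 404 or suspended would then be written as id, website, status, and only rows classified as ok would carry the full set of columns.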
20 freelancers are bidding an average of $126 for this job
Hi. I have a lot of experience in web scraping and can complete this project. We can have a chat. Thanks. Relevant Skills and Experience: web scraping. Proposed Milestones: $250 USD - all