I am looking for a solid web crawler that has one task, and one task only:
identify the different page layouts on a site.
Some sites, especially webshops, have category pages, subcategory pages, product pages, checkout pages, and so on.
This crawler should not identify the purpose of a page, but it should be able to take a site with 500,000 pages and identify how many different page layouts there are.
In the end, it should produce a list of every URL with a layout ID attached (XML).
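To make the expected output concrete, here is a minimal sketch of what such an XML list could look like when written from Python. The element name `page`, the attribute name `layout`, the root name `pages` and the function `build_layout_report` are placeholders I made up for illustration; the final schema is open for discussion.

```python
import xml.etree.ElementTree as ET


def build_layout_report(url_to_layout):
    """Serialize a {url: layout_id} mapping to an XML string.

    The element and attribute names are hypothetical, not a fixed schema.
    """
    root = ET.Element("pages")
    # Sort by URL so the report is stable between runs.
    for url, layout_id in sorted(url_to_layout.items()):
        page = ET.SubElement(root, "page", layout=str(layout_id))
        page.text = url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example URLs are invented for the demo.
    print(build_layout_report({
        "https://example.com/shoes": 1,
        "https://example.com/shoes/red-sneaker": 2,
    }))
```

For a site with several hundred thousand URLs, the same idea would probably need to stream the elements to a file instead of building the whole tree in memory.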
Performance and speed, as well as how intelligently the scraper tells one page apart from another, are the main ingredients of this scraper.
Some sites have very similar pages. Having the scraper identify an element as a menu, submenu or navigation, and then ignore that element, is very much wanted.
I don't want to scrape a site with 200,000 pages and have the scraper come up with 110,000 different categories of pages.
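One possible way to approach the grouping, shown only as a rough sketch and not as a required design: reduce each page to the set of distinct tag paths in its HTML, skip everything inside navigation-like elements, and hash the result. Pages that differ only in text, menu contents or the number of repeated items then get the same layout ID. The `IGNORED` tag list, the class `LayoutParser` and the functions `layout_fingerprint` and `assign_layout_ids` are all assumptions for illustration. It uses only the Python standard library and assumes reasonably well-formed HTML (unclosed tags such as a bare `<p>` would distort the paths).

```python
import hashlib
from html.parser import HTMLParser

# Assumption: these tags mark navigation or site chrome and should be ignored.
IGNORED = {"nav", "header", "footer", "script", "style"}
# Void elements have no closing tag, so they must not change the nesting.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LayoutParser(HTMLParser):
    """Collects the distinct tag paths of a page, skipping ignored subtrees."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.paths = set()  # distinct paths such as "html/body/div/h1"

    def handle_starttag(self, tag, attrs):
        inside_ignored = tag in IGNORED or any(t in IGNORED for t in self.stack)
        if not inside_ignored:
            self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def layout_fingerprint(html):
    """Return a short hash of the page structure, independent of text content."""
    parser = LayoutParser()
    parser.feed(html)
    parser.close()
    joined = "\n".join(sorted(parser.paths))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def assign_layout_ids(pages):
    """Map each URL in {url: html} to a small integer layout ID."""
    ids = {}
    result = {}
    for url, html in pages.items():
        fingerprint = layout_fingerprint(html)
        result[url] = ids.setdefault(fingerprint, len(ids) + 1)
    return result
```

Using a set of paths rather than the full tag sequence is a deliberate choice here: a category page with 10 products and one with 50 products end up with the same fingerprint. Exact matching may still split layouts too finely on real sites, so some similarity threshold between path sets would likely be needed on top of this.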
7 freelancers are bidding an average of $427 for this job
I have been working as a .NET developer for the last six years. I also have experience with SharePoint. I think I am well suited for this work. My core skill set includes C#, SQL Server, .NET Framework, SharePoint and HTML.