A C++ crawler able to index/reindex pages and download content, producing an XML file for each page.
Here are the main requirements:
* Can be scheduled
* The agent can accept multiple crawl start locations per website
* Support for [login to view URL]
* Forbidden strings in URLs (for example, do not follow URLs containing ?, %, or a given keyword; see the filter sketch after this list)
* Can leave domain / do not leave domain
* Max pages per domain (user input)
* The agent can support exclusions of files beyond those in the server's standard [login to view URL]
* Specify how many levels deep to follow links from each starting location
* Multi-Threaded for Concurrent Scans
* Reindexing of new or modified files only (see the reindexing sketch after this list)
* Complete Cache Management
* Download to specific storage (web, news)
* Download title, description, keywords, and page content, and add the following fields: date indexed, page size, URL
* Produce an XML file for each downloaded page with the info above (see the XML writer sketch after this list)
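
As a sketch of how the crawl-scope requirements above (forbidden strings, domain restriction, max pages per domain, depth limit) might fit together, here is a minimal C++ link filter. The struct and function names are illustrative assumptions, not part of the spec:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical crawl-scope settings; names are illustrative, not from the posting.
struct CrawlScope {
    std::vector<std::string> forbiddenSubstrings; // e.g. "?", "%", or a keyword
    std::string domain;                           // domain of the start location
    bool stayInDomain = true;                     // "do not leave domain"
    int maxPagesPerDomain = 1000;                 // user input
    int maxDepth = 3;                             // levels deep from the start URL
};

// Very rough host extraction ("http://host/path" -> "host"); a real crawler
// would use a proper URL parser.
std::string hostOf(const std::string& url) {
    auto pos = url.find("://");
    std::string rest = (pos == std::string::npos) ? url : url.substr(pos + 3);
    return rest.substr(0, rest.find('/'));
}

// Decide whether a discovered link should be followed.
bool shouldFollow(const CrawlScope& s, const std::string& url,
                  int depth, int pagesSeenInDomain) {
    if (depth > s.maxDepth) return false;
    if (pagesSeenInDomain >= s.maxPagesPerDomain) return false;
    if (s.stayInDomain && hostOf(url) != s.domain) return false;
    for (const auto& bad : s.forbiddenSubstrings)
        if (url.find(bad) != std::string::npos) return false;
    return true;
}

int main() {
    CrawlScope scope{{"?", "%"}, "example.com", true, 1000, 3};
    std::cout << shouldFollow(scope, "http://example.com/a.html", 1, 10) << '\n';   // 1
    std::cout << shouldFollow(scope, "http://example.com/a.php?id=2", 1, 10) << '\n'; // 0: forbidden "?"
    std::cout << shouldFollow(scope, "http://other.org/b.html", 1, 10) << '\n';    // 0: leaves domain
}
```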
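For the "modified files only" reindexing, one possible approach is to keep a fingerprint per URL and skip pages whose content is unchanged; a real implementation would likely also use HTTP ETag / Last-Modified headers and persist the index between runs. This class is an assumed design, not a stated requirement:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Minimal "reindex new or modified files only" check.
class ReindexCache {
public:
    // Returns true if the page is new or its content changed since last crawl.
    bool needsReindex(const std::string& url, const std::string& body) {
        std::size_t fp = std::hash<std::string>{}(body);
        auto it = fingerprints_.find(url);
        if (it != fingerprints_.end() && it->second == fp) return false;
        fingerprints_[url] = fp;
        return true;
    }
private:
    std::unordered_map<std::string, std::size_t> fingerprints_;
};

int main() {
    ReindexCache cache;
    std::cout << cache.needsReindex("http://example.com/", "v1") << '\n'; // 1: new
    std::cout << cache.needsReindex("http://example.com/", "v1") << '\n'; // 0: unchanged
    std::cout << cache.needsReindex("http://example.com/", "v2") << '\n'; // 1: modified
}
```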
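And a minimal sketch of the per-page XML output, assuming a flat element layout (the posting does not fix a schema); the field names simply mirror the list above:

```cpp
#include <cstddef>
#include <ctime>
#include <fstream>
#include <string>

// Fields the posting asks for; the struct name is illustrative.
struct PageRecord {
    std::string url, title, description, keywords, content;
    std::size_t pageSizeBytes = 0;
};

// Escape the five characters XML reserves in text content.
std::string xmlEscape(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Write one XML file per downloaded page with the fields listed above.
void writePageXml(const PageRecord& p, const std::string& path) {
    std::time_t now = std::time(nullptr);
    char date[32];
    std::strftime(date, sizeof date, "%Y-%m-%d %H:%M:%S", std::localtime(&now));

    std::ofstream f(path);
    f << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<page>\n"
      << "  <url>" << xmlEscape(p.url) << "</url>\n"
      << "  <title>" << xmlEscape(p.title) << "</title>\n"
      << "  <description>" << xmlEscape(p.description) << "</description>\n"
      << "  <keywords>" << xmlEscape(p.keywords) << "</keywords>\n"
      << "  <dateIndexed>" << date << "</dateIndexed>\n"
      << "  <pageSize>" << p.pageSizeBytes << "</pageSize>\n"
      << "  <content>" << xmlEscape(p.content) << "</content>\n"
      << "</page>\n";
}

int main() {
    writePageXml({"http://example.com/", "Example", "Demo page",
                  "demo,example", "<p>Hello</p>", 512}, "page0001.xml");
}
```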
-------------------------------------------------------------------
* Web-based administration
* List of URLs to crawl
* Start/Stop/Hold/Continue
* Scheduled index/reindex times for a specific storage and list of sites (see the scheduler sketch after this list)
* File types: HTML-based (html, htm, php, asp, js, do, ...)
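
A minimal sketch of the scheduling requirement, assuming interval-based runs per storage and site list (the posting does not specify the scheduling model, so the job structure and names here are assumptions):

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// One scheduled job: reindex a list of sites into a named storage at a fixed
// interval. A production scheduler would likely support cron-style times.
struct ScheduledJob {
    std::string storage;             // e.g. "web" or "news"
    std::vector<std::string> sites;  // the list of URLs to crawl
    std::chrono::seconds interval;   // how often to run
};

void runScheduler(const ScheduledJob& job,
                  const std::function<void(const ScheduledJob&)>& crawl) {
    auto next = std::chrono::steady_clock::now();
    for (int run = 0; run < 3; ++run) {  // bounded here so the demo terminates
        std::this_thread::sleep_until(next);
        crawl(job);
        next += job.interval;
    }
}

int main() {
    ScheduledJob job{"web", {"http://example.com/"}, std::chrono::seconds(2)};
    runScheduler(job, [](const ScheduledJob& j) {
        std::cout << "reindexing " << j.sites.size()
                  << " site(s) into storage '" << j.storage << "'\n";
    });
}
```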