I am looking for one or more developers who can build a web crawler (spider) in PHP or with a PHP framework.
The end product will only require root domain names as input (e.g. [url removed, login to view]), which are stored in a table, and it will go through them one by one.
The purpose of the spider is to find all working links within each domain.
It would be nice to have an option that lets it start at a subfolder (e.g. [url removed, login to view]\sub folder).
The discovered links will be stored in a table.
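To illustrate the kind of link discovery I have in mind, here is a rough sketch, not a required implementation. The function name extractInternalLinks is made up for this example. It assumes that www.domain.com counts as the same site as domain.com, and that https links are stored in the http://domain.com/<whatever> form. Page-relative links such as "about.html" are skipped here for brevity; the real spider would need to resolve them against the current page.

```php
<?php
// Hypothetical helper (name and behaviour are illustrative assumptions):
// returns the unique same-domain links in $html as http://<domain>/<path>.
function extractInternalLinks(string $html, string $domain): array
{
    if ($html === '') {
        return [];
    }
    // Real-world HTML is often malformed; collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $parts = parse_url(trim($anchor->getAttribute('href')));
        if ($parts === false) {
            continue; // seriously malformed URL
        }
        if (!isset($parts['host']) && !isset($parts['path'])) {
            continue; // fragment-only or query-only link
        }
        if (isset($parts['scheme'])
            && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
            continue; // mailto:, javascript:, tel:, ...
        }
        if (isset($parts['host'])) {
            $host = strtolower($parts['host']);
            // Assumption: the www. variant belongs to the same site.
            if ($host !== $domain && $host !== 'www.' . $domain) {
                continue; // external link
            }
        }
        $path = $parts['path'] ?? '/';
        if ($path === '' || $path[0] !== '/') {
            continue; // page-relative link, not resolved in this sketch
        }
        $url = 'http://' . $domain . $path;
        if (isset($parts['query'])) {
            $url .= '?' . $parts['query'];
        }
        $links[$url] = true; // array keys de-duplicate
    }
    return array_keys($links);
}
```

Each link found this way would then be fetched in turn to check that it works and to discover further links.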
It should know when it fails on a site and continue where it left off.
It should write error logs if it encounters an issue.
It should work on any domain.
It needs to be efficient.
The database used is MySQL.
The end product should work with PHP [url removed, login to view]
This can be billed hourly, but I would need a quote for how many hours it will take and at what rate.
It can also be done on a per-project basis.
This is one part of a larger development effort, and I would like to establish a long-term relationship.
More thorough specification:
Assuming we are crawling for domain.com...
- It looks for all pages within this website; 100% of the pages in domain.com must be accounted for.
- All found links must be in the format of http://domain.com/<whatever>
- It needs to be able to handle multiple domains. I must be able to add domains, and it will crawl them the next time the cycle runs.
- It needs to crawl multiple domains at once, meaning they should not be run one after another. It should start multiple threads (like 3-4 at once), or preferably give me some way to control how many threads it starts.
- It needs to know when it fails and when it completes. That is, it should not stop in the middle of a crawl. If it does, it must time out, move on to the next site or link, and retry next time.
- The found links will be inserted into a new table, with columns as follows:
2: Domain name - [domain the link belongs to]
3: Link itself - [link]
4: Date active - [date] (the first date the link was found. When the same script is run again, it should not update this date.)
5: Date inactive - [date] (if the script is run again and this link cannot be found, insert a date)
- The script will be run from crontab; if that is not suitable, you can suggest an alternative.
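
To make the Date active / Date inactive rules concrete, here is a small sketch of the intended bookkeeping. It works on plain arrays rather than MySQL. The function name reconcileLinks and the keys domain, link, date_active and date_inactive are placeholders I made up for this example, not final column names. The spec above does not say what should happen to Date inactive when a link reappears, so this sketch leaves such rows unchanged.

```php
<?php
// Hypothetical in-memory model of the links table. Each row is
// ['domain' => ..., 'link' => ..., 'date_active' => ..., 'date_inactive' => ...].
// Returns the rows as they should look after one crawl of $domain.
function reconcileLinks(array $rows, string $domain, array $foundLinks, string $today): array
{
    $found = array_flip($foundLinks); // link => index, for fast lookup
    $known = [];

    foreach ($rows as $i => $row) {
        if ($row['domain'] !== $domain) {
            continue; // rows of other domains are not touched
        }
        $known[$row['link']] = true;
        if (!isset($found[$row['link']]) && $row['date_inactive'] === null) {
            // Link was known but is missing from this crawl: mark inactive.
            $rows[$i]['date_inactive'] = $today;
        }
        // Links that were found again keep their original date_active.
    }

    foreach ($foundLinks as $link) {
        if (!isset($known[$link])) {
            // First time this link is seen: record the date it became active.
            $rows[] = [
                'domain'        => $domain,
                'link'          => $link,
                'date_active'   => $today,
                'date_inactive' => null,
            ];
            $known[$link] = true;
        }
    }
    return $rows;
}
```

In the real script the same rules would translate into an insert for new links and an update of Date inactive for links that were not seen in the current run.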