I want to realise a crawling project with an additional administration tool and some other features.
We need a fully scalable crawling system with an administration frontend, an observer for the crawlers, a database, Death By Captcha support and proxy server support.
The crawl jobs are based on article lists (name, EAN) from a MySQL database, and
there are several sites to crawl (Amazon DE, Google Shopping DE and several German price comparison sites).
The complete crawl system needs to be scalable: I need to be able to add as many crawlers to one crawl job as needed, based on
the average runtime of an article crawl. (Example: if one crawl run on [url removed, login to view] needs more than 5 seconds, the system
automatically adds more crawlers to the crawl job.)
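A minimal sketch of how such a scaling rule could look, assuming Python. The function name, the proportional scaling formula and the max_crawlers cap are illustrative assumptions; only the 5-second threshold comes from the example above.

```python
import math


def crawlers_needed(avg_runtime_s: float, current_crawlers: int,
                    target_runtime_s: float = 5.0,
                    max_crawlers: int = 50) -> int:
    """Return how many crawlers a crawl job should run with.

    Assumption: when the average article crawl is slower than the target,
    the crawler count is scaled up in proportion to the slowdown so that
    overall throughput stays roughly constant. max_crawlers is a
    hypothetical safety cap.
    """
    if current_crawlers < 1:
        raise ValueError("a crawl job needs at least one crawler")
    if avg_runtime_s <= target_runtime_s:
        # Fast enough: keep the current number of crawlers.
        return current_crawlers
    scaled = math.ceil(current_crawlers * avg_runtime_s / target_runtime_s)
    return min(scaled, max_crawlers)
```

With these assumptions, a job running 3 crawlers at an average of 10 seconds per article would be scaled to 6 crawlers. Adding crawlers raises the request rate per site, so the rule would have to be balanced against ban prevention.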
Because of this, the system needs ban prevention too.
The next point is full support for proxy servers (the proxy IP, port, username and password are stored in the MySQL DB)
with rotation of the proxy IPs after a defined number of articles.
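The rotation could be sketched roughly as follows, assuming Python. The class and field names are hypothetical, and the proxies are passed in as a list here. In the real system they would be read from the MySQL table.

```python
from dataclasses import dataclass
from itertools import cycle
from urllib.parse import quote


@dataclass(frozen=True)
class Proxy:
    host: str
    port: int
    username: str
    password: str

    def url(self) -> str:
        # Percent-encode the credentials so special characters are safe in a URL.
        user = quote(self.username, safe="")
        pwd = quote(self.password, safe="")
        return f"http://{user}:{pwd}@{self.host}:{self.port}"


class ProxyRotator:
    """Hand out the same proxy for a fixed number of articles, then rotate."""

    def __init__(self, proxies, articles_per_proxy: int):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if articles_per_proxy < 1:
            raise ValueError("articles_per_proxy must be >= 1")
        self._pool = cycle(proxies)      # endless round-robin over the list
        self._limit = articles_per_proxy
        self._used = 0
        self._current = next(self._pool)

    def proxy_for_next_article(self) -> Proxy:
        if self._used >= self._limit:
            # Defined number of articles reached: switch to the next proxy.
            self._current = next(self._pool)
            self._used = 0
        self._used += 1
        return self._current
```

With two proxies and a limit of 2 articles, five consecutive calls would return the first proxy twice, the second proxy twice, and then the first proxy again.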
For Google Shopping and some of the German price comparison sites, the system needs full
decaptcha support (Death By Captcha or similar) so that reCAPTCHAs can be solved via the decaptcha API.
The observer supervises the crawlers and the runtime of each article. (Because I want to crawl between 250,000 and 2,000,000 articles from
each source site, the runtime and making sure the crawlers do not get banned from the site are the most important points.)
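A rough sketch of what the observer could track per crawler, assuming Python. The class name, the rolling window and the ban heuristic (a streak of HTTP 403 or 429 responses) are assumptions for illustration. Real ban detection would probably also need to look at captcha pages and redirects.

```python
from collections import deque
from statistics import mean


class CrawlObserver:
    """Track recent article runtimes and a simple ban signal for one crawler."""

    def __init__(self, window: int = 100, ban_statuses=(403, 429),
                 ban_streak: int = 3):
        self._runtimes = deque(maxlen=window)  # keeps only the last `window` runs
        self._ban_statuses = set(ban_statuses)
        self._ban_streak = ban_streak
        self._streak = 0

    def record(self, runtime_s: float, http_status: int) -> None:
        """Store one article crawl result."""
        self._runtimes.append(runtime_s)
        if http_status in self._ban_statuses:
            self._streak += 1
        else:
            self._streak = 0  # any normal response resets the streak

    def average_runtime(self) -> float:
        """Average runtime over the rolling window, 0.0 if nothing recorded."""
        return mean(self._runtimes) if self._runtimes else 0.0

    def looks_banned(self) -> bool:
        """True after `ban_streak` suspicious responses in a row."""
        return self._streak >= self._ban_streak
```

The average runtime from such an observer is the value a scaling rule would compare against the 5-second threshold, and looks_banned() could trigger a proxy change or a pause for that crawler.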
Full and clean code documentation is a must-have.
The complete system needs to be configurable from a MySQL table.
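One way the configuration table could be mapped into the crawler is sketched below, assuming Python and a simple table with one setting per row (name, value). The setting names and default values are hypothetical. The rows would come from a MySQL query in the real system and are passed in as plain tuples here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    # Hypothetical settings; the defaults apply when a row is missing.
    target_runtime_s: float = 5.0
    articles_per_proxy: int = 100
    max_crawlers: int = 10


def config_from_rows(rows) -> CrawlerConfig:
    """Build a CrawlerConfig from (name, value) string pairs.

    Unknown setting names are ignored; known ones are cast to their type.
    """
    casts = {
        "target_runtime_s": float,
        "articles_per_proxy": int,
        "max_crawlers": int,
    }
    values = {name: casts[name](value) for name, value in rows if name in casts}
    return CrawlerConfig(**values)
```

For example, rows such as ("articles_per_proxy", "250") and ("target_runtime_s", "5") would produce a configuration with those two values set and max_crawlers left at its default.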
The full system needs to be web frontend ready. (All information from every crawl is saved to MySQL.)
(Maybe iMacros Enterprise + players and a self-coded administration tool is an option.)
In the first step, the source sites are Germany-based sites (Amazon Germany, Google Shopping Germany and various German price comparison sites).