proxylist - ip:port, one proxy per line; the number of proxies will differ from query to query
urllist - list of URLs to be crawled, one URL per line; it should work whether or not the URL includes the http:// scheme
threads - number of threads to run at the same time
timeout - timeout per proxy / timeout per request - not decided yet whether these should be the same value or separate ones; measured in seconds
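Since the URL list may or may not include the scheme, a small helper can normalize each line before crawling; this is a sketch, and the assumption that a missing scheme means plain http:// is mine:

```python
def normalize_url(url):
    # The URL list should work "both with http or not";
    # assume plain http:// when no scheme is present (my assumption).
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url
```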
the script shall run like this:
./crawlerpython -proxy proxylist -url urllist -threads 5 -timeout 3
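The invocation above could be parsed with argparse; this is a hypothetical sketch, and the flag names and defaults are assumptions taken from the example command line:

```python
import argparse

def build_parser():
    # Hypothetical argument parser matching the invocation above;
    # exact flag names and defaults are assumptions, not final.
    parser = argparse.ArgumentParser(prog="crawlerpython")
    parser.add_argument("-proxy", required=True, help="file with ip:port proxies, one per line")
    parser.add_argument("-url", required=True, help="file with URLs to crawl, one per line")
    parser.add_argument("-threads", type=int, default=5, help="number of parallel threads")
    parser.add_argument("-timeout", type=float, default=3.0, help="timeout in seconds")
    return parser

args = build_parser().parse_args(
    ["-proxy", "proxylist", "-url", "urllist", "-threads", "5", "-timeout", "3"]
)
```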
- first the script shall verify the whole proxy list on multiple threads (number of threads = number of proxies, but no more than 30 parallel threads) in order to find out how many working proxies there are. This step shall respect the timeout attribute.
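The verification step above can be sketched with a thread pool; `check_one` is a hypothetical callable standing in for the actual proxy probe (e.g. a test request through the proxy):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHECK_THREADS = 30

def check_proxies(proxies, check_one, timeout):
    # Verify all proxies in parallel: one thread per proxy, capped at 30.
    # check_one(proxy, timeout) is a hypothetical callable that returns
    # True when the proxy answers within the timeout.
    if not proxies:
        return []
    workers = min(len(proxies), MAX_CHECK_THREADS)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(lambda p: check_one(p, timeout), proxies))
    return [p for p, ok in zip(proxies, flags) if ok]
```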
the number of proxies can be bigger or smaller than the number of URLs.
ideally each URL is crawled with a different proxy, but if we have fewer proxies than URLs we can crawl several URLs through the same proxy. because of this it may also be good to have an automatic mode for choosing the number of parallel threads. the algorithm should work like this:
- if the number of working proxies is bigger than the number of URLs, crawl each URL with a different proxy.
- if the number of working proxies is smaller than the number of URLs, the number of parallel threads should be equal to the number of working proxies.
we need to implement this algorithm so that the sites we are crawling do not ban us.
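The automatic thread-count choice described by the two rules above can be sketched as (capping at the URL count when proxies are plentiful is my assumption, since more threads than URLs would be idle):

```python
def pick_thread_count(num_working_proxies, num_urls):
    # Automatic mode from the two rules above: one proxy per URL when
    # there are enough working proxies, otherwise as many threads as
    # there are working proxies.
    if num_working_proxies >= num_urls:
        return num_urls
    return num_working_proxies
```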
- take a proxy (from the working-proxy list)
- run a query against the allocated URL
- get the HTML response, respecting the timeout attribute. if the response does not arrive within the specified time, retry the same query with another working proxy. we shall try at most 5 different working proxies, because timeout problems can have causes other than a bad proxy.
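The retry loop above can be sketched as follows; `fetch_one` is a hypothetical callable standing in for the actual HTTP request through a proxy:

```python
MAX_PROXY_TRIES = 5

def fetch_with_retries(url, working_proxies, fetch_one):
    # Try the allocated URL through up to 5 different working proxies.
    # fetch_one(url, proxy) is a hypothetical callable that returns the
    # HTML on success and raises (e.g. on timeout) otherwise.
    last_error = None
    for proxy in working_proxies[:MAX_PROXY_TRIES]:
        try:
            return fetch_one(url, proxy), None
        except Exception as exc:
            last_error = exc
    # All tries failed: report the last error for this URL.
    return None, last_error
```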
the result should be a multidimensional array that contains:
- for each URL, the exact HTML code received while browsing, or the error received after trying the maximum of 5 different proxies
- the time spent by each thread ...
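One possible shape for a row of that result structure, with per-URL timing; `fetch` is a hypothetical callable wrapping the proxy-retry logic, and the field names are my assumptions:

```python
import time

def timed_result(url, fetch):
    # One row of the result structure: the exact HTML (or the final
    # error after all proxy retries) plus the time spent on this URL.
    # fetch(url) is a hypothetical callable wrapping the retries.
    start = time.monotonic()
    try:
        html, error = fetch(url), None
    except Exception as exc:
        html, error = None, str(exc)
    return {"url": url, "html": html, "error": error,
            "seconds": time.monotonic() - start}
```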
an optimization should be applied to the first step, proxy checking:
- the number of proxies checked from the proxy list should not exceed 1.5 * the number of URLs.
ex: if the number of URLs is 1 and the number of proxies is 30, there is no need to check all 30 proxies before running the process.
it is also best that, after checking, the script takes random proxies from the working-proxy list, not in the original list order.
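Both optimizations above can be sketched as two small helpers (rounding 1.5 * URLs up to a whole number is my assumption):

```python
import math
import random

def proxies_to_check(proxies, num_urls):
    # Check at most 1.5 * the number of URLs, per the optimization above.
    limit = math.ceil(1.5 * num_urls)
    return proxies[:limit]

def shuffled(working_proxies):
    # Take working proxies in random order, not in list order.
    result = list(working_proxies)
    random.shuffle(result)
    return result
```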