Python CURL Project

Script inputs:

proxylist - ip:port, one per line; the number of proxies will differ for each query

urllist - list of URLs to be crawled, one per line; it should work whether or not the URL includes the http:// scheme

threads - number of threads to run at the same time

timeout - timeout per proxy and timeout per request (not yet decided whether they should be the same or different), measured in seconds

The script shall be invoked like this:

./crawlerpython -proxy proxylist -url urllist -threads 5 -timeout 3
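The command line above could be parsed with `argparse`; a minimal sketch, assuming the single-dash flag names shown in the invocation (the defaults are assumptions, not part of the spec):

```python
import argparse

def parse_args(argv=None):
    # Flag names follow the example invocation; defaults are assumed.
    parser = argparse.ArgumentParser(description="Proxy-based URL crawler")
    parser.add_argument("-proxy", required=True, help="file with one ip:port per line")
    parser.add_argument("-url", required=True, help="file with one URL per line")
    parser.add_argument("-threads", type=int, default=5, help="parallel threads")
    parser.add_argument("-timeout", type=float, default=3.0, help="timeout in seconds")
    return parser.parse_args(argv)
```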

- First, the script shall verify the whole proxy list on multiple threads (number of threads = number of proxies, but no more than 30 parallel threads) in order to determine the number of working proxies. This check shall respect the timeout attribute.
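The verification step above could look like the sketch below, using a thread pool capped at 30 workers. The names (`check_proxy`, `working_proxies`) and the test URL are assumptions; the `probe` parameter exists only so the pool logic can be exercised without network access:

```python
import concurrent.futures
import urllib.request

def check_proxy(proxy, timeout, test_url="http://example.com"):
    # Returns True if the proxy answers the test URL within `timeout` seconds.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        return False

def working_proxies(proxies, timeout, probe=check_proxy):
    # One thread per proxy, capped at 30 parallel checks as the spec requires.
    workers = max(1, min(len(proxies), 30))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: probe(p, timeout), proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```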


The number of proxies can be bigger or smaller than the number of URLs.

Ideally each URL is crawled with a different proxy, but if there are fewer proxies than URLs, several URLs can be crawled through the same proxy. Because of this, it may be good to also have an automatic mode for choosing the number of parallel threads. The algorithm should work like this:

- If the number of working proxies is bigger than the number of URLs, crawl each URL with a different proxy.

- If the number of working proxies is smaller than the number of URLs, the number of parallel threads should be equal to the number of working proxies.

We need to implement this algorithm in order not to be banned by the sites we are crawling.
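The two rules above reduce to taking the smaller of the two counts; a one-line sketch (the function name is an assumption):

```python
def pick_thread_count(n_working_proxies, n_urls):
    # More proxies than URLs: one distinct proxy per URL, so n_urls threads.
    # Fewer proxies than URLs: never run more threads than working proxies,
    # so each proxy serves at most one request at a time.
    return min(n_working_proxies, n_urls)
```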


Thread process:

- Take a proxy (from the working-proxies list).

- Run a query against the allocated URL.

- Get the HTML response, respecting the timeout attribute. If the response does not arrive within the specified time, retry the same query with another working proxy. We shall try at most 5 different working proxies, because timeout problems can have different causes, not only a bad proxy.


The result should be a multidimensional array that contains:

- for each URL, the exact HTML code received, or the error reported after trying the maximum of 5 different proxies

- the time spent by each thread ...
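The per-thread retry loop and the per-URL result described above could be sketched as follows. `fetch_with_retries` and `http_fetch` are assumed names; the injectable `fetch` parameter is there only so the retry logic can be tested without network access, and `random.sample` doubles as the "random proxy" selection:

```python
import random
import time
import urllib.request

def http_fetch(url, proxy, timeout):
    # Fetch `url` through `proxy`, raising on timeout or any HTTP error.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_with_retries(url, proxies, timeout, fetch=http_fetch, max_attempts=5):
    # Try up to `max_attempts` different proxies; return (result, elapsed)
    # where `result` is the HTML body on success or the last error message.
    start = time.monotonic()
    result = "error: no working proxies"
    for proxy in random.sample(proxies, min(max_attempts, len(proxies))):
        try:
            result = fetch(url, proxy, timeout)
            break
        except Exception as exc:
            result = "error: %s" % exc
    return result, time.monotonic() - start
```

Collecting `{url: (result, elapsed)}` over all URLs then yields the multidimensional result structure the spec asks for.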


One optimization concerns the first step, the proxy check:

From the proxy list, the number of proxies to be checked should not exceed 1.5 × the number of URLs.

For example, if the number of URLs is 1 and the number of proxies is 30, there is no need to check all 30 proxies before running the process.
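The 1.5× cap could be applied like this; rounding up is an assumption (the spec does not say how to handle fractional results), as is the function name:

```python
import math

def proxies_to_check(proxies, n_urls):
    # Probe at most 1.5x as many proxies as there are URLs; the rest are
    # only checked later if needed. Rounding up is assumed, so with 1 URL
    # and 30 proxies only ceil(1.5) = 2 candidates get probed up front.
    cap = math.ceil(1.5 * n_urls)
    return proxies[:cap]
```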


It is also best that, after the check, the script takes random proxies from the working-proxy list rather than using them in the same order.
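Randomizing the order can be a single shuffle of a copy of the list; a minimal sketch (the function name and the optional seed, included only for reproducible tests, are assumptions):

```python
import random

def randomized(working, seed=None):
    # Shuffle a copy so proxies are consumed in random order rather than
    # the order in which they passed the check.
    rng = random.Random(seed)
    shuffled = list(working)
    rng.shuffle(shuffled)
    return shuffled
```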

Skills: Python


About the Employer:
( 40 reviews ) Ostratu / Corbeanca, Romania

Project ID: #1539508

Awarded to:


Thank you!

$350 USD in 3 days
(2 Reviews)