I have a list of about 10k URLs that I need to validate. All I have to do is scrape the home page of each website that responds, and discard the others.
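To illustrate the idea, a minimal sketch of such a spider could look like this (the "urls.txt" file name and the yielded fields are placeholders, not the exact code from the attached zip):

```python
import scrapy

class HomePageCheckSpider(scrapy.Spider):
    """Sketch: request each home page, keep the URLs that respond, discard the rest."""
    name = "homepage_check"

    def start_requests(self):
        # "urls.txt" is a placeholder for the real 10k-URL list
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        # The site responded: keep it
        yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        # Timeout / DNS error / connection refused: log and discard
        self.logger.info("Discarding %s (%s)", failure.request.url, failure.value)
```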
However, I've noticed that when running the spider on Scrapinghub several times in a row, I get inconsistent results, i.e. not the same number of scraped items. The main difference is usually in the number of timed-out URLs.
I have set DOWNLOAD_TIMEOUT to 300 (with RETRY_ENABLED set to False), but I still get a bunch of errors like "[login to view URL] [login to view URL]: User timeout caused connection failure: Getting [login to view URL] took longer than 300.0 seconds."
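Concretely, the relevant part of the settings is just this (a sketch of the timeout/retry-related settings only; everything else is left at the Scrapy defaults):

```python
# settings.py (relevant excerpt)
DOWNLOAD_TIMEOUT = 300   # seconds before a request is considered timed out
RETRY_ENABLED = False    # do not retry failed or timed-out requests
```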
I have tried some of the 'slowest' websites (those with a request duration > 50 seconds) in the browser and they work fine. Even when running the scraping on a single website on my local machine, it works fine and loads quickly (in less than 2-3 seconds).
Looking at the request logs, I found 300 URLs with a request duration of more than 50 seconds; yet whenever I browse those websites, or launch a spider on only one of those URLs, they respond quickly.
I have now isolated the 100 slowest requests (50 seconds or more) and created a new spider with just those URLs.
Looking at this spider's request logs, I see that the request durations are not the same at all; they follow a pattern, going from about 200 ms for the first request up to around 2000 ms for the last one.
So my final question is: how can I avoid this 'instability'? I need to run these spiders regularly in order to maintain a list of working URLs, and I can't afford to have missing items.
I have attached a zip file ([login to view URL]) with all the files supporting my investigation:
- [login to view URL] : here you can see 4 identical spiders giving different results
- [login to view URL] : the [login to view URL] file
- [login to view URL] : an overview of the spider
- [login to view URL] : the 4 spiders' stats, showing a big difference in the timeouts and HTTP statuses
- [login to view URL] : the request logs of the 8830 URLs (see how the request duration gradually increases, then cycles back)
- [login to view URL] : an extract of the 100 slowest requests from [login to view URL]
- [login to view URL] : the same spider running on the 100 'slowest' URLs taken from [login to view URL] (see how the request duration gradually increases)