I have a huge list of domains from which we need to extract all of the sitemap data.
I’ll provide a CSV of all the domains. You may need to normalize them (checking the http/https protocol and whether or not the www prefix is needed).
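To give a sense of what I mean by normalizing, something roughly like this would do (just a sketch; it assumes the sitemap sits at /sitemap.xml, and in practice robots.txt should also be checked for a Sitemap: directive):

```python
import requests

def find_sitemap_url(domain, timeout=10):
    """Probe https/http and www/non-www variants for a working sitemap URL."""
    for scheme in ("https", "http"):
        for host in (domain, "www." + domain):
            url = f"{scheme}://{host}/sitemap.xml"
            try:
                resp = requests.head(url, timeout=timeout, allow_redirects=True)
                if resp.status_code == 200:
                    return resp.url  # the "proper" URL after any redirects
            except requests.RequestException:
                continue
    return None
```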
We need two outputs:
Summary CSV with the following columns:
Proper URL to the sitemap | total pages in the sitemap | one column per date over the last year with the count of pages updated on that date.
So the CSV will have 367 columns (sitemap URL + total + 365 daily counts). A rough sketch of how I picture both outputs being produced follows the second output below.
The second output I need:
Hit the sitemap for each site and dump a CSV file per domain. The CSV should contain the sitemap data:
URL / modified
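Roughly what I have in mind for both outputs, per domain (a sketch only, assuming a plain urlset sitemap with <loc> and optional <lastmod> entries; sitemap index files and gzipped sitemaps would also need to be handled):

```python
import csv
from collections import Counter
from datetime import date, timedelta
from xml.etree import ElementTree as ET

import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def process_sitemap(sitemap_url, domain):
    """Write the per-domain CSV and return the 367-column summary row."""
    xml = requests.get(sitemap_url, timeout=30).content
    root = ET.fromstring(xml)

    # Collect (url, lastmod) pairs from the sitemap.
    entries = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url_el.findtext("sm:lastmod", default="", namespaces=NS)
        entries.append((loc, lastmod))

    # Output 2: one CSV per domain with url / modified.
    with open(f"{domain}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "modified"])
        writer.writerows(entries)

    # Output 1: sitemap URL, total pages, then 365 daily update counts.
    counts = Counter(lastmod[:10] for _, lastmod in entries if lastmod)
    today = date.today()
    last_year = [(today - timedelta(days=i)).isoformat() for i in range(365)]
    return [sitemap_url, len(entries)] + [counts.get(d, 0) for d in last_year]
```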
I have about 160k domains that we need to process for this.
I’ll provide an Ubuntu AWS machine to run your solution on. I’m thinking Scrapy or similar, running for a few days.
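If Scrapy ends up being the framework, a bare-bones spider along these lines could pull the entries without ever fetching the pages themselves (again just a sketch; the start URLs, sitemap index handling, gzip support, throttling, and the CSV export pipeline are simplified or omitted):

```python
import scrapy
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

class SitemapEntriesSpider(scrapy.Spider):
    name = "sitemap_entries"
    # In practice start_urls would be generated from the normalized domain list.
    start_urls = ["https://example.com/sitemap.xml"]

    def parse(self, response):
        root = ET.fromstring(response.body)

        # If this is a sitemap index file, follow each nested sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            yield scrapy.Request(loc.text.strip(), callback=self.parse)

        # Otherwise yield one item per <url> entry: page URL plus lastmod.
        for url_el in root.findall("sm:url", NS):
            yield {
                "url": url_el.findtext("sm:loc", default="", namespaces=NS),
                "modified": url_el.findtext("sm:lastmod", default="", namespaces=NS),
            }
```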
To apply for this job, your proposal must include the following:
2- What framework will your solution use?
3- Ballpark, how much time will it take to get the solution running?
4- How many domains per 10 seconds do you think we can process?