1- I need the textual content (plain text, not HTML) of 10,000 websites, crawled up to depth level 3 for every website.
3- Include the textual content of downloadable files such as PDF and Word documents.
4- I have completed the code in Scrapy and you can utilize it, but for some websites it returns
[[login to view URL]] INFO: Ignoring response : HTTP status code is not handled or not allowed... Therefore, you can fix it and start from it.
5- Example output is attached. Where the type is a file, I did not attach the content, but I downloaded it to my computer.
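
The requirements above (depth 3, text instead of HTML, and the "HTTP status code is not handled or not allowed" message) could be approached roughly as in the sketch below. `DEPTH_LIMIT`, `HTTPERROR_ALLOWED_CODES` and `USER_AGENT` are real Scrapy setting names, but the values are assumptions: the log excerpt does not show which status code was returned, so 403 and 404 are only guesses. The names `CRAWL_SETTINGS` and `html_to_text` are hypothetical, and the text extractor uses only the standard library rather than the poster's existing code.

```python
from html.parser import HTMLParser

# Scrapy setting names are real; the values below are assumptions.
CRAWL_SETTINGS = {
    "DEPTH_LIMIT": 3,  # do not follow links deeper than 3 levels
    # Let these statuses reach the spider callback instead of being
    # dropped with "HTTP status code is not handled or not allowed".
    "HTTPERROR_ALLOWED_CODES": [403, 404],  # guessed codes
    "USER_AGENT": "Mozilla/5.0 (compatible; text-crawler)",  # hypothetical
}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text of an HTML page with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

In a Scrapy spider, a dictionary like this would typically be assigned to the `custom_settings` class attribute, and `html_to_text(response.text)` could be called inside the parse callback. PDF and Word files would need a separate extraction step that is not covered here.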
20 freelancers are bidding an average of $273 for this job
I have 3+ years of experience in Scrapy and have worked on more than 600 URLs. The kinds of sites and formats I have worked with include HTML, CSV, Excel, XML, JSON, CAPTCHA, and PDF. I can deliver the output within the timeline.