Cancelled

Scrape website aggregators.

We would like you to gather all structured data relating to individual websites from sites that have aggregated from many sources.

We have a list of 15 million(!) domains we would like queried.

You can use any of these:

* [url removed, login to view]

* [url removed, login to view]

* [url removed, login to view]

* [url removed, login to view]

* [url removed, login to view]

* [url removed, login to view]

If one source has no information on a domain, or only one or two sources' worth of it, and another of these sites does have it, use that other site.
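A minimal sketch of this fallback rule, to show the behaviour we have in mind. The names `best_record`, `Fetcher` and `MIN_FIELDS` are hypothetical, and treating a record with fewer than three fields as too thin is only an assumed reading of "one or two sources' worth of info"; each fetcher stands in for a lookup against one aggregator's already-crawled data.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a record with fewer than this many fields counts as "thin".
MIN_FIELDS = 3

# A fetcher takes a domain and returns that aggregator's record, or None.
Fetcher = Callable[[str], Optional[Dict[str, str]]]


def best_record(
    domain: str,
    fetchers: List[Fetcher],
    min_fields: int = MIN_FIELDS,
) -> Optional[Dict[str, str]]:
    """Try each aggregator in order.

    Returns the first record with at least `min_fields` fields. If none is
    that rich, returns the richest thin record seen, or None if no
    aggregator knows the domain.
    """
    best: Optional[Dict[str, str]] = None
    for fetch in fetchers:
        record = fetch(domain)
        if not record:
            continue  # this aggregator has nothing on the domain
        if len(record) >= min_fields:
            return record  # rich enough, stop looking
        if best is None or len(record) > len(best):
            best = record  # remember the richest thin record so far
    return best
```

A bidder may prefer to merge records from several aggregators rather than pick one; the sketch only shows the simplest reading of the rule.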

Choose any technology you like as long as it runs on Linux.

Deliverables:

D1. You must build a simple API which accepts a domain name as input.

It will then return the data as BSON, with all values encoded as UTF-8.

This API will be used to automatically check that your program is getting the right data. PM me for the testing code we have developed.

We will add more tests to it as we go along. The API must not pass the query on to the sites live; it must pull the data from the database you have built from the crawl.

D2. All the code you used to generate the content. The code must be rerunnable automatically without manual intervention.

D3. In M3, the full BSON dump of all data.
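As a rough illustration of D1, here is a sketch of a lookup that answers from a local database only and returns BSON. Everything in it is an assumption for illustration: the SQLite table `records(domain, data)`, the JSON text stored in `data`, and the function names `bson_encode` and `lookup`. The hand-written encoder covers only strings and nested documents; a real implementation would more likely use an existing BSON library.

```python
import json
import sqlite3
import struct


def bson_encode(doc: dict) -> bytes:
    """Encode a dict as a BSON document.

    Only two BSON types are produced: UTF-8 string (0x02) and embedded
    document (0x03). Any other value is converted with str() first.
    """
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # element name as a cstring
        if isinstance(value, dict):
            body += b"\x03" + name + bson_encode(value)
        else:
            text = str(value).encode("utf-8") + b"\x00"
            # string length prefix counts the trailing NUL byte
            body += b"\x02" + name + struct.pack("<i", len(text)) + text
    # total size = 4-byte length prefix + body + 1-byte terminator
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


def lookup(conn: sqlite3.Connection, domain: str) -> bytes:
    """Return the stored record for `domain` as BSON.

    Reads only from the local database; it never queries the source sites.
    An unknown domain yields a document with an empty `data` field.
    """
    key = domain.strip().lower()
    row = conn.execute(
        "SELECT data FROM records WHERE domain = ?", (key,)
    ).fetchone()
    record = json.loads(row[0]) if row else {}
    return bson_encode({"domain": key, "data": record})


if __name__ == "__main__":
    # Demo with an in-memory database standing in for the crawl output.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (domain TEXT PRIMARY KEY, data TEXT)")
    conn.execute(
        "INSERT INTO records VALUES (?, ?)",
        ("example.com", json.dumps({"title": "Example Domain"})),
    )
    print(lookup(conn, "Example.com").hex())
```

Any HTTP or socket layer can wrap `lookup`; the point of the sketch is that the request path touches the database and nothing else.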

Milestones:

M1. (10% of project value) Everything for an initial list of 100k domains that we provide.

M2. (20% of project value) Everything for a subsequent list of 1 million domains that we provide.

M3. Everything for the rest of the dataset: another 15 million domains.

In your bid please include:

B1. How long it will take to complete each milestone

B2. What sites you've crawled in the past and what data you got out of them

B3. What technologies you will use.

B4. What server resources you need us to provide.

Skills: Linux, Script Install, Software Engineering, Website Design

About the Employer:
(91 reviews) Cairo, Egypt

Project ID: #1566369

3 freelancers are bidding an average of $400 for this job

SigmaVisual

We can help in your project, please check PMB and our ratings/reviews to get idea of our experience.

$250 USD in 5 days
(75 reviews)
6.9
EvanKos

Hi, i can help you with that, check pm.

$250 USD in 10 days
(3 reviews)
3.8
samiullah67

Dear hiring manager! I have experience of web scraping(automated application). I have completed number of projects. My past samples are attached in PM. I will provide you efficient solution. Please refer to PM for m

$700 USD in 60 days
(0 reviews)
0.0