Completed

linux craigslist crawler / scraper / harvester

I want this script done in Linux, to be run at the command prompt. No GUI is needed; I won't be running it from a web browser, just through shell access. You can make recommendations as to which programming language you feel would be best.

I have a .csv file of URLs on Craigslist that I need scraped and parsed. The script will parse the email address, city, subject line of the ad, and the date the ad was posted. I need the ability to specify a date range for the script to scrape from, as well as the option to scrape everything. If you go to any of the links in the file, there is usually a link at the bottom that says "next 100 postings" ([url removed, login to view] is an example - just scroll down to the bottom); when the script encounters this, it will automatically follow that link and continue on to the next page, until no more of these links are found. That behaviour only applies when I have selected to scrape everything. If I am only scraping a specific date range, the script will still have to use the 'next 100 postings' link at times, but it won't need to keep going until there are no more of those links.
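To make the pagination behaviour concrete, here is a rough sketch of what I have in mind, assuming Python with the requests and beautifulsoup4 packages (since the real URLs are removed above, the "next 100 postings" link text and markup handling here are assumptions, not a spec):

```python
# Minimal pagination sketch; the listing markup and link text are assumptions.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def iter_listing_pages(start_url):
    """Yield the HTML of each listing page, following 'next 100 postings' links."""
    url = start_url
    while url:
        html = requests.get(url, timeout=30).text
        yield html
        soup = BeautifulSoup(html, "html.parser")
        # Find the pagination link by its visible text (an assumption).
        next_link = soup.find(
            "a", string=lambda s: s and "next 100 postings" in s.lower()
        )
        url = urljoin(url, next_link["href"]) if next_link else None
```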

The script must be multi-threaded (it must be able to handle up to 500 simultaneous threads) and must support HTTP/HTTPS/SOCKS4/SOCKS5 proxies. I will have a text file of proxies, and the script will randomly grab a proxy for each URL that it scrapes.
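For the threading and proxy requirement, something along these lines would be fine - a sketch only, assuming Python's standard thread pool and the requests package (SOCKS support needs the optional requests[socks] extra; the proxies.txt file name is just an example):

```python
# Thread-pool / random-proxy sketch; file name and proxy format are examples.
import random
from concurrent.futures import ThreadPoolExecutor

import requests

with open("proxies.txt") as f:  # one proxy per line, e.g. socks5://1.2.3.4:1080
    PROXIES = [line.strip() for line in f if line.strip()]

def fetch(url):
    """Fetch one URL through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30).text

def fetch_all(urls):
    # Up to 500 simultaneous workers, one task per URL.
    with ThreadPoolExecutor(max_workers=500) as pool:
        return list(pool.map(fetch, urls))
```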

The .csv file will have 3 columns in it:

1. The URL to begin scraping

2. The Country that is being scraped

3. The City that is being scraped

The script will use the country value to place the data scraped from that country into its own folder, and it will use the city value in the .csv files that it outputs after it parses each page. As an example:

[url removed, login to view],USA,Austin

[url removed, login to view],Canada,Vancouver

[url removed, login to view],Australia,Canberra

[url removed, login to view],UK,Cambridge

In this sample, the script will go to [url removed, login to view], and it will see numerous posts. If I have it set to only scrape a specific date range, it will only parse the URLs that are in that date range. If not, it will parse all of those URLs, as well as go to the 'next 100 postings' link and do the same, etc.
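To illustrate how the input file and output folders might be handled, here is a small sketch (the actual output file name was removed above, so "<city>.csv" inside the country folder is an assumption, as is the input file name urls.csv):

```python
# Input/output layout sketch; "urls.csv" and "<city>.csv" are assumed names.
import csv
import os

def load_jobs(path="urls.csv"):
    """Yield (start_url, country, city) rows from the 3-column input file."""
    with open(path, newline="") as f:
        for start_url, country, city in csv.reader(f):
            yield start_url.strip(), country.strip(), city.strip()

def output_path(country, city):
    """Return e.g. 'USA/austin.csv', creating the country folder if needed."""
    os.makedirs(country, exist_ok=True)
    return os.path.join(country, f"{city.lower()}.csv")
```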

As of the time I wrote this, the very first link to be parsed is the "Expanding Firm Hiring - Marketing & Management" link - [url removed, login to view] The script will parse this link and save the data to a .csv file called [url removed, login to view], in a folder called USA. This is what the output of the [url removed, login to view] file will look like, just from scraping that link:

email_address_here,Austin,Expanding Firm Hiring - Marketing & Management (AUSTIN),9/23/2009

I know that the date is shown on the page as 2009-09-23, but whatever format the date is in, I need it reformatted as in the example above (month/day/year).
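For the date conversion, a small helper like this is all I mean (assuming the on-page date really is in YYYY-MM-DD form, as in the example):

```python
# Date-format sketch: "2009-09-23" -> "9/23/2009" (no leading zeros).
from datetime import datetime

def reformat_date(raw):
    d = datetime.strptime(raw.strip(), "%Y-%m-%d")
    return f"{d.month}/{d.day}/{d.year}"

# reformat_date("2009-09-23") -> "9/23/2009"
```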

I also need the option to scrape either all countries or just certain countries - for instance, just the USA, or the USA, Canada, and Australia, etc.
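For selecting countries and the date range, simple command-line options would be fine; the flag names in this sketch are only suggestions, not requirements:

```python
# Command-line option sketch; flag names are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Craigslist scraper")
parser.add_argument("--countries", nargs="*", default=None,
                    help="Countries to scrape, e.g. USA Canada; omit to scrape all")
parser.add_argument("--start-date", help="Earliest post date, YYYY-MM-DD")
parser.add_argument("--end-date", help="Latest post date, YYYY-MM-DD")
args = parser.parse_args()

def country_selected(country):
    """A row from the input .csv is skipped unless its country is selected."""
    return args.countries is None or country in args.countries
```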

The script will do the exact same thing for the other 3 examples, in Canada, Australia, and the UK.

I will own the exclusive rights to this script; you will not be able to re-sell it, and I will obtain full rights to it.

If you have any questions, please don't hesitate to ask.

Skills: Linux, PHP, Web Scraping


About the Employer:
( 32 reviews ) Doral, Costa Rica

Project ID: #520096

Awarded to:

zeke

Dear Customer! This is my favourite kind of project and I have a lot of experience writing crawlers/scrapers/web bots/etc. Please see PMB for examples of my previous work in this field. Ready to start right now and f...

$300 USD in 3 days
(152 Reviews)
6.8

13 freelancers are bidding on average $355 for this job

srinichal

I can do this in bash using wget

$220 USD in 3 days
(100 Reviews)
7.1
pgcoding

Please check PMB.

$400 USD in 15 days
(25 Reviews)
6.2
LanceGuru

Hi, Please see the private message. Thank You

$400 USD in 3 days
(23 Reviews)
6.2
rapuk

Hi, Steve. I'm very much interested in this project. I'll get this job done and meet all your requirements. If you want, I can make a demo. I prefer to use Java for this scraper.

$250 USD in 5 days
(30 Reviews)
5.1
Scorpio1987

Please check PM..

$350 USD in 4 days
(11 Reviews)
4.5
ukshumi

Please check PM; I already have something.

$250 USD in 4 days
(10 Reviews)
3.7
AlexeyKaplin

I can do it in Perl.

$0 USD in 0 days
(2 Reviews)
0.8
Sapron

Please check your PMB.

$200 USD in 5 days
(0 Reviews)
0.0
tech2trade

Hi, We have already done a similar crawler for citysearch's web site using Microsoft technologies, and have all the data from it. Please feel free to call me on 001 408 218 8015 or mail me your contact information to ...

$900 USD in 20 days
(0 Reviews)
0.0
bubble1000

Hi, I read your requirements carefully. I have such experience and can take this job. Thanks.

$600 USD in 14 days
(0 Reviews)
0.0
ppan279

Hi, I have no reviews to show as I have registered recently, but I have about 4 years of rich scraping experience, in which I have scraped not less than 500 sites of all hue and [login to view URL] me with this work a...

$350 USD in 7 days
(0 Reviews)
0.0
badousoft

Please see PM.

$400 USD in 7 days
(0 Reviews)
0.0