Website scraping scripts & optimization

I need a script that scrapes several airline websites for airfare information.

This is a pilot project (we may hire you again if we're happy with your service) to assess the feasibility of a particular scraping technique.

The main goal is to find a FAST way to scrape the sites. The script will eventually run on Amazon's EC2.

Project description

The project consists of two parts:

1. Scraping scripts

2. Speed optimization

I'll now explain both a bit further:

1. The scripts

The websites are:

- [url removed, login to view]

- [url removed, login to view]

- [url removed, login to view]

- [url removed, login to view]

Some of these sites rely on AJAX, so they should be scraped through a real Firefox browser (controlled via JSSh).

I've already done some tests with FireWatir and scRUBYt (version [url removed, login to view]).

The script will be used as a web service, so it should return XML.

Input variables:

- departure airport (three-letter IATA code; for example: AMS)

- destination airport (IATA code as well)

- trip type (round-trip or one-way)

- departure date

- return date (round-trip searches only)

- number of persons (age > 11 yrs.)

- number of children (age < 12 yrs.)
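The input variables above could be collected and checked before any site is visited. The following is only a minimal sketch: the `SearchRequest` struct, its field names, and the error messages are assumptions made for illustration, not part of the existing test scripts.

```ruby
require 'date'

IATA_CODE = /\A[A-Z]{3}\z/ # three uppercase letters, e.g. "AMS"

# Hypothetical container for the input variables; field names are assumptions.
SearchRequest = Struct.new(:origin, :destination, :round_trip,
                           :departure_date, :return_date,
                           :adults, :children, keyword_init: true) do
  # Returns a list of problems; the list is empty when the request is usable.
  def errors
    problems = []
    problems << "invalid departure airport" unless IATA_CODE.match?(origin.to_s)
    problems << "invalid destination airport" unless IATA_CODE.match?(destination.to_s)
    problems << "departure date missing" unless departure_date.is_a?(Date)
    if round_trip
      if !return_date.is_a?(Date)
        problems << "return date required for a round trip"
      elsif departure_date.is_a?(Date) && return_date < departure_date
        problems << "return date is before departure date"
      end
    end
    problems << "at least one passenger required" if adults.to_i + children.to_i < 1
    problems
  end
end

request = SearchRequest.new(origin: "AMS", destination: "LHR", round_trip: true,
                            departure_date: Date.new(2008, 7, 15),
                            return_date: Date.new(2008, 7, 20),
                            adults: 2, children: 1)
puts request.errors.empty? ? "request ok" : request.errors.join(", ")
```

Rejecting a bad request up front avoids spending a browser session on a search that cannot succeed.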


Output variables:

- flight numbers, with for each flight:

- fare (without fees)

- fees & taxes

- crawl/scrape time (per site/flight).

The crawler (scraper) should always select the cheapest available (economy) fare that the site has on offer. If the site offers several (available) flights on the same day, then a <flight> result should be added to the output for each of those flights.
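The selection rule above might be sketched as follows. The `FareOffer` struct, the flight numbers, and the amounts are made up for illustration, and the sketch assumes that "cheapest" means the lowest fare plus fees; if only the base fare should be compared, the `min_by` block would change accordingly.

```ruby
# Hypothetical record for one fare offer scraped from a results page.
# Amounts are in cents to avoid floating-point rounding.
FareOffer = Struct.new(:flight_number, :cabin, :available,
                       :fare_cents, :fees_cents, keyword_init: true)

# For every flight, keep only the cheapest available economy offer.
# Assumption: "cheapest" is judged on fare plus fees.
def cheapest_economy_per_flight(offers)
  offers
    .select { |offer| offer.available && offer.cabin == :economy }
    .group_by(&:flight_number)
    .map { |_number, group| group.min_by { |offer| offer.fare_cents + offer.fees_cents } }
end

offers = [
  FareOffer.new(flight_number: "XX100", cabin: :economy,  available: true,  fare_cents: 8900, fees_cents: 3500),
  FareOffer.new(flight_number: "XX100", cabin: :economy,  available: true,  fare_cents: 7900, fees_cents: 3500),
  FareOffer.new(flight_number: "XX100", cabin: :economy,  available: false, fare_cents: 4900, fees_cents: 3500),
  FareOffer.new(flight_number: "XX200", cabin: :business, available: true,  fare_cents: 5000, fees_cents: 3500),
  FareOffer.new(flight_number: "XX200", cabin: :economy,  available: true,  fare_cents: 9900, fees_cents: 3500)
]

cheapest_economy_per_flight(offers).each do |offer|
  puts "#{offer.flight_number}: #{offer.fare_cents + offer.fees_cents} cents"
end
```

Each element of the returned list would then become one <flight> result in the output.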

The output should be like this:



<departure time="2008-07-15T13:15:00">
<arrival time="2008-07-15T14:22:00">



<fare currency="EUR" crawled-at="2008-07-09T14:02:03" elapsed-crawl-time="00:00:08" source="www.klm.com">
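One way a script could emit such output is sketched below. Only the attribute names on <departure>, <arrival> and <fare> are taken from the example; the element nesting, the `number` attribute, the <price> and <fees> children, the flight number, and the amounts are assumptions for illustration.

```ruby
# Format a duration in whole seconds as HH:MM:SS, as in elapsed-crawl-time.
def format_elapsed(seconds)
  format("%02d:%02d:%02d", seconds / 3600, (seconds % 3600) / 60, seconds % 60)
end

# Render a hash as key="value" pairs; String#encode escapes and quotes values.
def attrs(hash)
  hash.map { |key, value| "#{key}=#{value.to_s.encode(xml: :attr)}" }.join(" ")
end

# Build one <flight> element. The nesting and child elements are assumptions.
def flight_xml(flight)
  timestamp = "%Y-%m-%dT%H:%M:%S"
  fare_attrs = attrs(
    "currency" => flight[:currency],
    "crawled-at" => flight[:crawled_at].strftime(timestamp),
    "elapsed-crawl-time" => format_elapsed(flight[:elapsed_seconds]),
    "source" => flight[:source]
  )
  <<~XML
    <flight #{attrs("number" => flight[:number])}>
      <departure #{attrs("time" => flight[:departs_at].strftime(timestamp))}/>
      <arrival #{attrs("time" => flight[:arrives_at].strftime(timestamp))}/>
      <fare #{fare_attrs}>
        <price>#{flight[:fare]}</price>
        <fees>#{flight[:fees]}</fees>
      </fare>
    </flight>
  XML
end

sample = {
  number: "XX100",                 # placeholder flight number
  departs_at: Time.new(2008, 7, 15, 13, 15, 0),
  arrives_at: Time.new(2008, 7, 15, 14, 22, 0),
  currency: "EUR",
  crawled_at: Time.new(2008, 7, 9, 14, 2, 3),
  elapsed_seconds: 8,
  source: "www.klm.com",
  fare: "79.00",                   # placeholder amounts
  fees: "35.00"
}
puts flight_xml(sample)
```

A real implementation would more likely use an XML library; plain string building is used here only to keep the sketch free of dependencies.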






2. Speed optimization

The test scripts that I wrote myself (attached) run pretty fast (about 5 seconds) on an Apple MacBook Pro [url removed, login to view], 2GB RAM, but they're a lot slower (about 20 seconds) on an Amazon EC2 [url removed, login to view] instance, mostly because the script is really slow at filling out the form fields there.

I need a solution for this, or at least a clear explanation of the differences in performance.
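To explain the difference, it may help to time each step of a scrape separately on both machines and compare the reports. The sketch below is one possible approach: the `StepTimer` class and the step names are hypothetical, and the `sleep` calls merely stand in for real browser actions.

```ruby
# Times named steps so that two environments can be compared step by step.
class StepTimer
  attr_reader :timings

  def initialize
    @timings = {}
  end

  # Run the block, record its wall-clock duration under the label,
  # and pass the block's return value through.
  def measure(label)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @timings[label] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end

  # Slowest step first, so the bottleneck is at the top of the report.
  def report
    @timings.sort_by { |_label, seconds| -seconds }
            .map { |label, seconds| format("%-12s %.3fs", label, seconds) }
  end
end

timer = StepTimer.new
# The sleeps are placeholders for real browser calls.
timer.measure(:load_page) { sleep 0.02 }
timer.measure(:fill_form) { sleep 0.05 }
timer.measure(:submit)    { sleep 0.01 }
puts timer.report
```

Running the same instrumented script on the MacBook Pro and on the EC2 instance would show whether form filling alone accounts for the extra time or whether page loading contributes as well.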

Skills & Experience

I'm looking for someone who has experience with (AJAX) scraping, preferably with the technologies that are mentioned in the above project description (most importantly: Ruby programming and screen scraping).

Please provide more information on relevant experience.

If you need any more information, do not hesitate to get in touch.

Skills: .NET, C Programming, Java, Perl, Ruby on Rails


About the Employer:
( 0 reviews ) Amsterdam, Netherlands

Project ID: #284800

4 freelancers are bidding on average $228 for this job


I can do this job for you. See PM for details.

$160 USD in 3 days
(182 Reviews)

Dear sir, thank you very much for giving us the opportunity to participate in your project. We have 5 years of experience in such work. Please check the PM for more details. We provide 100% perfect results. We look ...

$250 USD in 5 days
(0 Reviews)

Quality work. Proven experience.

$250 USD in 7 days
(0 Reviews)

I have great experience in .NET/AJAX. If you want this project in .NET and AJAX, then I am the person who can provide you with high performance. Thank you, Piyush Trivedi

$250 USD in 10 days
(0 Reviews)