language web crawler

$100-500 USD

Completed
Posted over 20 years ago

Paid on delivery
We want to crawl the web to get: 1) lists of the words used in different languages on the web, and 2) a count of the number of times each word is found in each language UNTIL WE HAVE A STATISTICALLY SIGNIFICANT SAMPLE. Maybe 1000 pages of each language?

We do not have a list of URLs we want to use. All that matters is that we do not count the same page twice. Other than that, ANY 1000 pages of each language will be fine.

I imagine that the program will crawl pages by charset, CHECK that the page is in the "correct" language (per the charset tag) by looking for the simplest words in that language (see CHECK below), count the words on the page, note which page it is so it does not get counted again, and move on.

CHECK: Because charset tags are not always reliable, we would pick 20 (or so) really common words that are unique to each language. An English example: the, an, in, are, is, and, to, on, this, a, by, that, were, have, been, will, of ... and then look for a meaningful subset of them to appear on a page before deciding what language it is. (Obviously, we would test the search mechanism "by hand" first to be sure it worked in each language.) Note: I will identify the "check" words for each language, and will accordingly be responsible for the quality of this language filter.

The app will place the words and counts into an Excel spreadsheet (one sheet per language). As an example, after using this tool in English (and sorting by frequency within Excel) there would be a VERY long list of words, each with a number next to it (indicating how many times it was found), like:

the 9,323,343
of 9,028,282
and 9,003,939
a 8,757,232
etc.

The languages of interest are: Afrikaans, Arabic, Bulgarian, Catalan, Pinyin (Chinese), Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese.

## Deliverables
1) Complete and fully functional working program(s) in executable form, as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Exclusive and complete copyrights to all work purchased. (No GPL, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site.)

## Platform
We are running Windows 2000, IE 6, and Excel 2002.
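The CHECK step, the per-page word count, and the "do not count the same page twice" rule from the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not the delivered program: the names `CHECK_WORDS`, `passes_check`, `LanguageCounter`, and the `min_hits=5` threshold are hypothetical, the real check words are to be supplied by the buyer, and fetching pages and writing the Excel sheets are left out.

```python
import re
from collections import Counter

# Hypothetical check-word table; the posting says the buyer will supply
# the real list for each language. Only the English example is shown.
CHECK_WORDS = {
    "english": {"the", "an", "in", "are", "is", "and", "to", "on", "this",
                "a", "by", "that", "were", "have", "been", "will", "of"},
}

# Runs of Unicode letters. Apostrophes and hyphens split words, and scripts
# written without spaces (e.g. Thai, Japanese) would need a real segmenter.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lower-case word tokens found in the page text."""
    return [w.lower() for w in WORD_RE.findall(text)]


def passes_check(words, language, min_hits=5):
    """True when at least min_hits distinct check words occur on the page.

    min_hits stands in for the posting's "meaningful subset"; the value 5
    is an assumption, not something the posting specifies.
    """
    return len(CHECK_WORDS[language] & set(words)) >= min_hits


class LanguageCounter:
    """Accumulates word counts for one language, one page at a time."""

    def __init__(self, language):
        self.language = language
        self.seen_urls = set()   # pages already counted
        self.counts = Counter()  # word -> number of occurrences

    def add_page(self, url, text):
        """Count the page's words once; return True if the page was counted."""
        if url in self.seen_urls:
            return False
        words = tokenize(text)
        if not passes_check(words, self.language):
            return False
        self.seen_urls.add(url)
        self.counts.update(words)
        return True
```

Calling `counter.counts.most_common()` on a `LanguageCounter` then yields the words sorted by frequency, which is the order the spreadsheet example shows. Stopping at roughly 1000 pages per language would amount to checking `len(counter.seen_urls)` after each page.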
Project ID: 3000944

About the project

5 proposals
Remote project
Active 20 years ago

Awarded to:
See private message.
$85 USD in 25 days
5.0 (644 reviews)
7.9
5 freelancers bid an average of $357 USD for this job
See private message.
$425 USD in 25 days
5.0 (103 reviews)
7.0
See private message.
$425 USD in 25 days
5.0 (10 reviews)
4.7
See private message.
$425 USD in 25 days
2.4 (6 reviews)
3.7
See private message.
$425 USD in 25 days
0.0 (0 reviews)
0.0

About the client

United States
5.0
9
Member since Nov 2, 2003

Client verification

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)