In Progress

Words used in Wikipedia

Using the latest full English Wikipedia database dump, write a program to generate a frequency-ranked, case-sensitive list of the words used in the main entry pages. The list should include single words and groups of up to four words (hyphen- or space-separated), drawn only from the text (not Wiki tags) and taken from the middle of sentences (not the first word of each sentence, so that all are correctly capitalized).
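
As a rough illustration only (not a required approach), the counting step could be sketched in Python along these lines. The names `count_ngrams`, `WORD` and `SENTENCE_END` are hypothetical, the sentence splitter is deliberately naive (it will mis-split on abbreviations), and the input is assumed to be plain text with Wiki markup already stripped.

```python
import re
from collections import Counter

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A word is a run of letters/digits, optionally with internal apostrophes.
WORD = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def count_ngrams(text, max_n=4):
    """Count case-sensitive words and groups of up to max_n words.

    The first word of every sentence is dropped, and a group may only
    span words separated by exactly one space or one hyphen.
    """
    counts = Counter()
    for sentence in SENTENCE_END.split(text):
        words = list(WORD.finditer(sentence))[1:]  # skip sentence-initial word
        for i in range(len(words)):
            for j in range(i, min(i + max_n, len(words))):
                if j > i:
                    gap = sentence[words[j - 1].end():words[j].start()]
                    if gap not in (" ", "-"):
                        break  # punctuation ends the group
                counts[sentence[words[i].start():words[j].end()]] += 1
    return counts
```

Because groups are cut from the original sentence, a hyphenated group such as "well-known" keeps its hyphen in the output list.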

Provide a list of all words and word groups that appear at least 10 times in Wikipedia, along with a file containing, for each word or group, ten complete sentences in which it appears and the name of the wiki page on which each sentence appears, e.g.


[page: Prion]

Prions are hypothesized to infect and propagate by refolding abnormally into a structure which is able to convert normal molecules of the protein into the abnormally structured form.

[page: Mars_Ocean_Hypothesis]

The blue region of low topography in the Martian northern hemisphere is hypothesized to be the site of a primordial ocean of liquid water.
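
One possible way to produce output in the shape shown above is sketched below. The helper names `record` and `write_report` and the tab-separated layout are assumptions, since the exact format is left open. Keeping examples for every term in memory is unlikely to scale to the full database, so a real run would probably need two passes or on-disk storage.

```python
from collections import Counter, defaultdict

MIN_COUNT = 10     # a term must appear at least this many times
MAX_EXAMPLES = 10  # sentences to keep per term


def record(term, page, sentence, counts, examples):
    """Count one occurrence and keep the sentence if fewer than ten are stored."""
    counts[term] += 1
    if len(examples[term]) < MAX_EXAMPLES:
        examples[term].append((page, sentence))


def write_report(counts, examples, out):
    """Write terms by descending frequency, each followed by its examples."""
    for term, n in counts.most_common():
        if n < MIN_COUNT:
            break  # most_common() is sorted, so the rest are rarer
        out.write(f"{term}\t{n}\n")
        for page, sentence in examples[term]:
            out.write(f"[page: {page}]\n{sentence}\n")
```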


I'm flexible about the exact format in which the data is provided, and you can skip groups starting and ending with common stop words (a, the, etc.).
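
A stop-word filter for groups might look like the sketch below. The `STOP_WORDS` set is a short illustrative sample rather than an agreed list, and `is_skippable` is a hypothetical name. By default it skips a group when either its first or its last word is a stop word; pass `require_both=True` for the stricter reading where both must be stop words. Single words are always kept.

```python
import re

# Illustrative sample only; the real stop-word list is not specified.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "in", "to", "is"})


def is_skippable(group, require_both=False):
    """Return True if a multi-word group should be left out of the list."""
    words = re.split(r"[ -]", group)  # groups are space- or hyphen-separated
    if len(words) < 2:
        return False  # the rule applies to groups, not single words
    first = words[0].lower() in STOP_WORDS
    last = words[-1].lower() in STOP_WORDS
    return (first and last) if require_both else (first or last)
```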

The main objective is the result, so you can write the program in any language you like. You'll need to download the Wikipedia database from [url removed, login to view]; the project is very straightforward, but the database is quite large.
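
Since the database is large, one option is to stream it rather than load it whole. The sketch below assumes the download is a bzip2-compressed MediaWiki XML export in which each `<page>` carries `<title>`, `<ns>` and `<text>` elements, with main entry pages in namespace 0. The function names and the path in the usage comment are hypothetical, and the yielded text is still raw Wiki markup that needs stripping separately.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_main_pages(stream):
    """Yield (title, wikitext) for namespace-0 pages, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = ns = text = None
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text
        if ns == "0" and title is not None:
            yield title, text or ""
        elem.clear()  # release the page's children to limit memory use


def open_dump(path):
    """Open a .bz2 dump for streaming (path is supplied by the caller)."""
    return bz2.open(path, "rb")

# Usage, with a placeholder path:
#     with open_dump("dump.xml.bz2") as f:
#         for title, wikitext in iter_main_pages(f):
#             ...
```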

Skills: C Programming, PHP, Python, XML


About the Employer:
(76 reviews) Brighton, United Kingdom

Project ID: #387392