Text repair based on n-gram language model

This project received 3 bids from talented freelancers, with an average bid price of $218 USD.

Project Budget
$30 - $250 USD
Project Description

My problem is that I have some texts that contain mistakes or are missing words. I would like a tool that repairs them automatically using an n-gram language model in ARPA format. The text is in UTF-8, one sentence per line. The tool must run on Ubuntu Linux - I am using version 16.04.
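To make the model format concrete, here is a minimal sketch of reading an ARPA file into memory. It assumes the common layout of one n-gram per line with tab-separated fields (log10 probability, the words, and an optional backoff weight); the function name `read_arpa` and the tiny `SAMPLE` model are made up for illustration and are not part of the project.

```python
import io

def read_arpa(lines):
    """Parse ARPA text into {ngram_tuple: (log10_prob, log10_backoff)}."""
    model = {}
    order = 0  # 0 means we are outside any "\N-grams:" section
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            if line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
            else:
                order = 0
            continue
        if order == 0:
            continue  # header counts such as "ngram 1=3"
        fields = line.split("\t")
        logprob = float(fields[0])
        words = tuple(fields[1].split(" "))
        # A missing backoff weight is treated as 0.0 (no penalty).
        backoff = float(fields[2]) if len(fields) > 2 else 0.0
        model[words] = (logprob, backoff)
    return model

# Hypothetical toy model, only to show the expected shape of the file.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-0.5\tala\t-0.2
-0.6\tma\t-0.1
-0.9\tkota

\\2-grams:
-0.3\tala ma

\\end\\
"""

model = read_arpa(io.StringIO(SAMPLE))
```

A plain dictionary like this is fine for a small test model; for 7-gram models of realistic size, a compact binary representation would probably be needed.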

The text may contain the following errors:

- a missing word or words in a sentence (marked with the <unk> symbol): e.g. "Ala ma <unk> i dwa psy." should be corrected to "Ala ma kota i dwa psy.". NOTE: there can be many <unk> symbols in one sentence.

- a wrong word form or inflection: e.g. "Ala kupił zielony samochód." should be "Ala kupiła zielony samochód."

- (very rare) a completely wrong word in the text: e.g. "Rąb ma pole o wymiarach." should be "Romb ma pole o wymiarach."
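As an illustration of the first error type, the sketch below fills each <unk> by trying every vocabulary word and keeping the best-scoring candidates (a small beam search). Everything here is an assumption made for the example: the `BIGRAMS` table is a hypothetical toy stand-in for the real ARPA model, `FLOOR` is an arbitrary penalty for unseen bigrams, and the sentences are lowercased with punctuation removed.

```python
# Hypothetical toy bigram log10 probabilities; a real tool would query the ARPA model.
BIGRAMS = {
    ("ala", "ma"): -0.3,
    ("ma", "kota"): -0.4,
    ("ma", "psa"): -0.9,
    ("kota", "i"): -0.5,
    ("i", "dwa"): -0.4,
    ("dwa", "psy"): -0.3,
}
VOCAB = sorted({w for pair in BIGRAMS for w in pair})
FLOOR = -5.0  # assumed penalty for a bigram the toy model has never seen

def score(words):
    """Sum of bigram log10 probabilities over the word sequence."""
    return sum(BIGRAMS.get(pair, FLOOR) for pair in zip(words, words[1:]))

def fill_unk(words, beam_width=5):
    """Replace each <unk> left to right, keeping the beam_width best partial fills."""
    beams = [list(words)]
    for pos, word in enumerate(words):
        if word != "<unk>":
            continue
        expanded = []
        for beam in beams:
            for cand in VOCAB:
                expanded.append(beam[:pos] + [cand] + beam[pos + 1:])
        # Rank on the prefix that ends one word after the filled slot,
        # so both the left and the right neighbour influence the choice.
        expanded.sort(key=lambda s: score(s[:pos + 2]), reverse=True)
        beams = expanded[:beam_width]
    return max(beams, key=score)

repaired = fill_unk("ala ma <unk> i dwa psy".split())
print(" ".join(repaired))
```

The same search could in principle cover the other two error types if each suspicious word is given a list of candidate replacements (for example other inflected forms, or similar-sounding words), but where those candidates come from is left open here.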

All of this should be repaired based on an n-gram language model (that is, a statistical model). Such a model stores a probability for each n-gram; these probabilities are usually calculated as conditional ones, as described here [url removed, login to view] and here [url removed, login to view]. So in fact the goal is to find the most probable sentence given the words that are present, work out the missing content, and insert it.
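The idea of "most probable sentence" can be sketched as follows: the sentence score is the sum of the conditional log probabilities of each word given its preceding words, and when an n-gram is missing the model backs off to a shorter context, as ARPA models with backoff weights are normally evaluated. The model layout `{ngram_tuple: (log10_prob, log10_backoff)}`, the `TOY` values, and the -99.0 fallback for unknown words are assumptions for this example only.

```python
def cond_logprob(model, context, word):
    """log10 P(word | context), backing off to shorter contexts when needed."""
    ngram = context + (word,)
    if ngram in model:
        return model[ngram][0]
    if not context:
        # Word unknown to the model: use <unk> if present, else a fixed floor (assumption).
        return model.get(("<unk>",), (-99.0, 0.0))[0]
    # Back off: add the context's backoff weight, then drop the oldest context word.
    backoff = model.get(context, (0.0, 0.0))[1]
    return backoff + cond_logprob(model, context[1:], word)

def sentence_logprob(model, words, order):
    """Sum of conditional log10 probabilities, with <s> and </s> added."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - order + 1):i])
        total += cond_logprob(model, context, tokens[i])
    return total

# Hypothetical toy bigram model, not taken from the sample LM.
TOY = {
    ("<s>",): (-99.0, -0.3),
    ("</s>",): (-1.0, 0.0),
    ("ala",): (-1.0, -0.3),
    ("ma",): (-1.0, -0.3),
    ("kota",): (-1.2, -0.3),
    ("psa",): (-1.2, -0.3),
    ("<s>", "ala"): (-0.3, 0.0),
    ("ma", "kota"): (-0.2, 0.0),
}

good = sentence_logprob(TOY, ["ala", "ma", "kota"], order=2)
bad = sentence_logprob(TOY, ["ala", "ma", "psa"], order=2)
print(good > bad)
```

Under this toy model the candidate with "kota" scores higher because the bigram "ma kota" is listed, while "ma psa" has to back off to the unigram.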

I will use 3-gram, 4-gram, 5-gram, 6-gram and 7-gram models. For each of these models, the tool should be able to produce results based on all the n-gram orders the model contains.

Here is a sample n-gram model that you can use for testing: [url removed, login to view]!th83lRII!ZS4Mr3NOxrMst941yQXpDuobHcA6yUgJJKMNu9DUBJE

This is a very small LM, so the tool should run fast with it. The tool should also work in parallel, on input files of hundreds of GB.
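One possible way to handle very large inputs is to cut the file into byte ranges that fall on line boundaries and let each worker process stream through its own range. The sketch below only illustrates that idea: `repair_line` is a placeholder for the real per-sentence repair, and the `.part<offset>` output naming is an arbitrary choice.

```python
import os
from multiprocessing import Pool

def chunk_offsets(path, n_chunks):
    """Split a file into byte ranges that start and end on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for k in range(1, n_chunks):
            f.seek(size * k // n_chunks)
            f.readline()  # move forward to the start of the next line
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts, cuts[1:]))

def repair_line(line):
    """Placeholder for the real per-sentence repair (assumption)."""
    return line

def process_chunk(args):
    """Repair the lines in one byte range and write them to a part file."""
    path, start, end = args
    out_path = f"{path}.part{start}"
    with open(path, "rb") as src, open(out_path, "w", encoding="utf-8") as dst:
        src.seek(start)
        while src.tell() < end:
            line = src.readline().decode("utf-8").rstrip("\n")
            dst.write(repair_line(line) + "\n")
    return out_path

def repair_parallel(path, workers=4):
    """Process all chunks in a worker pool; returns part files in input order."""
    tasks = [(path, s, e) for s, e in chunk_offsets(path, workers * 4)]
    with Pool(workers) as pool:
        return pool.map(process_chunk, tasks)
```

Calling `repair_parallel("input.txt", workers=8)` from a script's main section would return the part files in input order, ready to be concatenated. Each worker reads line by line, so memory use does not depend on the file size; loading the language model once per worker is not shown.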
