Well, my problem is that I have some texts that contain mistakes or are missing some words. I would like a tool that repairs them automatically using an n-gram language model in ARPA format. My text is in UTF-8, one sentence per line. The tool has to work under Ubuntu Linux; I am using version 16.04.
The text may contain the following errors:
- a missing word or words in a sentence (marked with the <unk> symbol), e.g. "Ala ma <unk> i dwa psy." should be corrected to "Ala ma kota i dwa psy.". NOTE: there can be many <unk> symbols in one sentence.
- a wrong word form or inflection, e.g. "Ala kupił zielony samochód." should be "Ala kupiła zielony samochód."
- (very rare) a completely wrong word, e.g. "Rąb ma pole o wymiarach." should be "Romb ma pole o wymiarach."
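To make the first error type concrete, here is a minimal sketch of filling <unk> positions with a left-to-right beam search. Everything in it is an assumption for illustration only: the toy bigram scores, the vocabulary, the flat penalty for unseen bigrams, and the names `fill_unk` and `bigram_score`. A real tool would take its scores from the ARPA model and would also need to handle punctuation, which is dropped here.

```python
# Hypothetical toy bigram log10-probabilities; a real run would query the ARPA model.
BIGRAMS = {
    ("<s>", "Ala"): -0.2, ("Ala", "ma"): -0.3, ("ma", "kota"): -0.4,
    ("ma", "psa"): -0.9, ("kota", "i"): -0.3, ("psa", "i"): -0.6,
    ("i", "dwa"): -0.3, ("dwa", "psy"): -0.2, ("psy", "</s>"): -0.2,
}
VOCAB = ["Ala", "ma", "kota", "psa", "i", "dwa", "psy"]
UNSEEN = -5.0  # flat penalty for unseen bigrams (a stand-in for real backoff)


def bigram_score(prev, word):
    """Return the toy log10 P(word | prev)."""
    return BIGRAMS.get((prev, word), UNSEEN)


def fill_unk(words, vocab=VOCAB, beam_width=5):
    """Replace every <unk> left to right, keeping the beam_width best prefixes."""
    beam = [(0.0, ["<s>"])]
    for word in words + ["</s>"]:
        # A known word has one option; an <unk> may become any vocabulary word.
        options = vocab if word == "<unk>" else [word]
        expanded = [
            (score + bigram_score(prefix[-1], option), prefix + [option])
            for score, prefix in beam
            for option in options
        ]
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    best_score, best = beam[0]
    return best[1:-1], best_score  # drop the <s> and </s> markers


repaired, score = fill_unk("Ala ma <unk> i dwa psy".split())
print(" ".join(repaired))
```

The same search could in principle cover the other two error types by also offering similar vocabulary words as options at positions that are not <unk>, at the cost of a much larger search space.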
All of this should be repaired based on an n-gram language model (that is, a statistical model). Such a model stores a probability for each n-gram; these probabilities are usually computed as conditional probabilities, as described here [url removed, login to view] and here [url removed, login to view]. So in fact the goal is to find the most probable sentence given the words that are present, and to find the missing content and insert it.
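As a rough illustration of what "most probable sentence" means here: the sentence log-probability is the sum of log P(w_i | previous n-1 words) over all words, and when the full n-gram is not in the model, ARPA-style models back off to a shorter context and add that context's backoff weight. The sketch below assumes the usual ARPA layout (log10 probability, the n-gram, an optional log10 backoff weight). The tiny model, the -99 fallback for unknown words, and the function names are made up for the example.

```python
TOY_ARPA = r"""
\data\
ngram 1=6
ngram 2=4

\1-grams:
-1.0 <s> -0.5
-1.0 </s>
-0.7 Ala -0.3
-0.7 ma -0.3
-0.9 kota -0.2
-1.2 psa -0.2

\2-grams:
-0.2 <s> Ala
-0.3 Ala ma
-0.4 ma kota
-0.5 kota </s>

\end\
"""


def load_arpa(lines):
    """Parse ARPA text into {ngram_tuple: (log10_prob, log10_backoff)}."""
    table, n = {}, 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # "\2-grams:" opens a section of order 2; "\data\" and "\end\" do not.
            n = int(line[1:line.index("-")]) if line.endswith("-grams:") else 0
            continue
        if n == 0:
            continue  # header lines such as "ngram 1=6"
        parts = line.split()
        backoff = float(parts[1 + n]) if len(parts) > 1 + n else 0.0
        table[tuple(parts[1:1 + n])] = (float(parts[0]), backoff)
    return table


def log_prob(table, word, context):
    """log10 P(word | context), backing off to shorter contexts when needed."""
    ngram = tuple(context) + (word,)
    if ngram in table:
        return table[ngram][0]
    if not context:
        # Unknown word: use the model's <unk> entry if any (assumed -99 otherwise).
        return table.get(("<unk>",), (-99.0, 0.0))[0]
    backoff = table.get(tuple(context), (0.0, 0.0))[1]
    return backoff + log_prob(table, word, tuple(context)[1:])


def sentence_log_prob(table, words, order):
    """Sum the conditional log10 probabilities over the sentence plus </s>."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - order + 1):i])
        total += log_prob(table, tokens[i], context)
    return total


model = load_arpa(TOY_ARPA.splitlines())
for candidate in ("Ala ma kota", "Ala ma psa"):
    print(candidate, sentence_log_prob(model, candidate.split(), order=2))
```

With this toy model "Ala ma kota" scores higher than "Ala ma psa", because the second sentence has to back off to unigrams twice. For real models an existing library such as KenLM would likely be a better fit than a pure-Python dictionary.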
I will use 3-gram, 4-gram, 5-gram, 6-gram and 7-gram models. For each of these models, the tool should be able to produce results using all the n-grams available in the model.
Here is a sample n-gram model that you can use for testing: [url removed, login to view]!th83lRII!ZS4Mr3NOxrMst941yQXpDuobHcA6yUgJJKMNu9DUBJE
This is a very small LM, so the tool should be fast with it. It should also work in parallel on files as large as hundreds of GB.
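One possible shape for the parallel part, given that the input is one sentence per line, is to stream the file in batches of lines and hand the batches to worker processes, writing the results back in input order. This is only a sketch under assumptions: `repair_line` is a placeholder for the real LM-based repair, and the batch size, function names and the use of Python's `multiprocessing.Pool` are illustrative choices, not requirements.

```python
import itertools
import multiprocessing


def repair_line(line):
    """Placeholder for the real LM-based repair; here it only strips whitespace."""
    return line.strip()


def repair_batch(lines):
    """Repair one batch of lines; this is the unit of work sent to a worker."""
    return [repair_line(line) for line in lines]


def batches(stream, batch_size):
    """Yield lists of up to batch_size lines without loading the whole file."""
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch


def repair_stream(src, dst, batch_size=10000, map_fn=map):
    """Repair src batch by batch via map_fn and write the lines in input order."""
    for repaired in map_fn(repair_batch, batches(src, batch_size)):
        for line in repaired:
            dst.write(line + "\n")


def main(in_path, out_path, workers=None):
    """Run the repair with a process pool (workers=None uses all CPU cores)."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst, \
         multiprocessing.Pool(workers) as pool:
        # pool.imap keeps results in input order, unlike imap_unordered.
        repair_stream(src, dst, map_fn=pool.imap)
```

`main` would be called from under an `if __name__ == "__main__":` guard. For inputs of hundreds of GB, limiting how many batches are in flight at once may need extra care, and each worker would have to load the language model once at startup rather than once per batch.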