This is a project for a machine learning course. The goal is to design a system that corrects mistyped letters in a document using a Hidden Markov Model (HMM). The hidden states represent the letters that were intended (the correct letters), and the outputs (observations) represent the letters that were actually typed. Given a misspelled text (the observed sequence), the Viterbi algorithm will be used to find the most likely sequence of intended letters (the hidden states). The document at the link below may help in understanding the project.
[url removed, login to view]
1. Training: The following quantities will be computed from the training documents, in which the correct spelling of each word is known.
a) For each state s[i], the probability that a word starts with s[i].
b) The probability of a transition from state s[i] to state s[j].
c) The probability that each character is observed when the model is in state s[i].
These quantities will be used to build the parameters below.
Initial State Probabilities - I: N is the number of distinct states. I[s] is the probability that letter s is the first letter of a correctly spelled word.
State Transition Probability Matrix - A: A is an N x N matrix in which A[i][j] is the probability of a transition from state i to state j; in other words, the probability that letter j follows letter i in correctly spelled words.
Output Probability Matrix - B: B gives the probability of seeing a given letter in the misspelled word when the corresponding letter in the correctly spelled word is known.
M is the number of distinct letters in the misspelled words and N is the number of distinct letters in the correctly spelled words. B is an M x N matrix in which B[o][s] is the probability of observing output letter o in state s.
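The parameters I, A, and B can be estimated by counting, as in the rough sketch below. It is not a prescribed solution and rests on several assumptions that the brief does not state: the alphabet is lowercase English letters, only word pairs of equal length are used (so only letter substitutions are modelled), and add-one smoothing keeps every probability above zero. The names `train_hmm` and `normalize` are made up for illustration. The emission table is stored state-first as `B[s][o]`, which is the transpose of the B[o][s] notation used above.

```python
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase English letters only


def normalize(counts, keys, smoothing=1.0):
    """Turn raw counts into probabilities using additive smoothing."""
    total = sum(counts.get(k, 0) for k in keys) + smoothing * len(keys)
    return {k: (counts.get(k, 0) + smoothing) / total for k in keys}


def train_hmm(pairs, alphabet=ALPHABET, smoothing=1.0):
    """Estimate I, A, B from (correct_word, typed_word) pairs."""
    initial = defaultdict(float)
    transition = {s: defaultdict(float) for s in alphabet}
    emission = {s: defaultdict(float) for s in alphabet}

    for correct, typed in pairs:
        # Assumption: skip pairs that cannot be aligned letter by letter.
        if not correct or len(correct) != len(typed):
            continue
        if any(ch not in alphabet for ch in correct + typed):
            continue
        initial[correct[0]] += 1                      # first letter of a correct word
        for prev, nxt in zip(correct, correct[1:]):   # letter-to-letter transitions
            transition[prev][nxt] += 1
        for state, output in zip(correct, typed):     # intended letter -> typed letter
            emission[state][output] += 1

    I = normalize(initial, alphabet, smoothing)
    A = {s: normalize(transition[s], alphabet, smoothing) for s in alphabet}
    B = {s: normalize(emission[s], alphabet, smoothing) for s in alphabet}
    return I, A, B
```

Each of `I`, every row `A[s]`, and every row `B[s]` sums to 1. Without smoothing, any letter pair never seen in training would get probability zero and block otherwise plausible corrections.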
2. Testing: The most likely sequence of letters for each word in the given test document will be obtained using the Viterbi algorithm.
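A minimal sketch of Viterbi decoding for a single word is shown below. It assumes the parameters are passed as nested dictionaries (`I[s]`, `A[prev][s]`, and state-first `B[s][o]`) and that every probability is nonzero, for example because of smoothing. It works in log space to avoid numerical underflow on long words. The function name and argument layout are illustrative choices and are not part of the brief.

```python
import math


def viterbi(observed, states, I, A, B):
    """Return the most likely hidden letter sequence for the observed letters.

    I[s], A[prev][s] and B[s][o] must all be nonzero probabilities,
    because their logarithms are taken.
    """
    if not observed:
        return ""

    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(I[s]) + math.log(B[s][observed[0]]) for s in states}
    backpointers = []

    for o in observed[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(states, key=lambda p: best[p] + math.log(A[p][s]))
            new_best[s] = best[prev] + math.log(A[prev][s]) + math.log(B[s][o])
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))
```

With a toy two-letter model in which "a" and "b" strongly alternate, the observed string "abb" is decoded as "aba": the transition probabilities outweigh the evidence from the last typed letter.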
Document Examples: English example documents are found in the docs data file. The first 20,000 characters of the file are reserved for testing and the rest will be used for training. In the data file, the first column contains the correct words and the second column contains the misspelled words. For testing, the misspelled words are used as input to generate the corrected words.
Example from the document:
Do not use training examples for testing.
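One possible way to separate test data from training data is sketched below. The brief does not specify the column separator or exactly how the 20,000 characters are counted, so this sketch assumes whitespace-separated columns and assigns a line to the test set if it starts within the first 20,000 characters of the raw text, newlines included. The function name `split_pairs` is hypothetical.

```python
def split_pairs(text, test_chars=20_000):
    """Split raw two-column text into (test_pairs, train_pairs).

    Each pair is (correct_word, misspelled_word).
    Assumption: a line belongs to the test set if it begins before
    the `test_chars` character offset of the raw text.
    """
    test, train = [], []
    consumed = 0  # number of characters read before the current line
    for line in text.splitlines(keepends=True):
        target = test if consumed < test_chars else train
        consumed += len(line)
        fields = line.split()
        if len(fields) >= 2:  # ignore blank or malformed lines
            target.append((fields[0], fields[1]))
    return test, train
```

Because every line goes to exactly one of the two lists, no training example can leak into the test set.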
The application should compute and report the following:
a) The number of misspelled words and misspelled letters in the test data.
b) The number of misspelled words and misspelled letters that the program corrected.
c) The number of correct words and correct letters that the program changed into incorrect ones.
d) The success rate as a percentage, based on the values above.
e) For 50 words of your choice, the misspelled form and the corrected form of each word.
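The counts in items a) to d) could be gathered roughly as follows. This sketch assumes each test item is a triple (correct word, typed word, predicted word) of equal length, so that letters can be compared position by position. The brief does not define "success rate", so the percentage of letters that are correct after correction is used here as one possible definition. The function name `evaluate` and the dictionary keys are illustrative.

```python
def evaluate(triples):
    """Count misspelled, fixed, and broken words and letters.

    triples: iterable of (correct, typed, predicted) strings.
    """
    stats = {
        "wrong_words": 0, "wrong_letters": 0,    # a) errors in the test data
        "fixed_words": 0, "fixed_letters": 0,    # b) errors the program corrected
        "broken_words": 0, "broken_letters": 0,  # c) correct items the program damaged
    }
    total_letters = 0
    correct_letters_after = 0

    for correct, typed, predicted in triples:
        # Assumption: only equal-length triples can be compared per letter.
        if not (len(correct) == len(typed) == len(predicted)):
            continue
        if typed != correct:
            stats["wrong_words"] += 1
            if predicted == correct:
                stats["fixed_words"] += 1
        elif predicted != correct:
            stats["broken_words"] += 1

        for c, t, p in zip(correct, typed, predicted):
            total_letters += 1
            correct_letters_after += (p == c)
            if t != c:
                stats["wrong_letters"] += 1
                stats["fixed_letters"] += (p == c)
            elif p != c:
                stats["broken_letters"] += 1

    # d) One possible success rate: letter accuracy after correction.
    stats["letter_accuracy_percent"] = (
        100.0 * correct_letters_after / total_letters if total_letters else 0.0
    )
    return stats
```

Comparing `fixed_*` against `broken_*` shows whether the corrector helps more than it harms. A model that changes nothing scores zero on both.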