We are currently using tesseract to OCR certain images. To improve the overall accuracy, we need to train the language model for the specific input to improve the overall accuraccy.
The process is described here
[login to view URL]
We provide images and the ground truth text on the image.
The tasks involves:
- Setting up an enviroment in MS Azure
- Preparing the training data in the tesseract specific format
- Fine tune the existing model to improve the overall accuracy
- Handle errors and instability during the training
- Report the finally achieved error rate