We have an existing program (with source code) that reads PDF files and sends their text to MySQL. This is how it currently works:
1. Splits a multi-page PDF into single-page PDF files.
2. Converts each single-page PDF into a JPG file.
3. OCRs each JPG into a TXT file.
4. Sends the text data to a MySQL database.
5. Deletes the original PDF, the split PDFs, the JPGs, and the TXT files.
6. Moves on to the next PDF to OCR.
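The flow above could be sketched roughly as follows. Python is used purely for illustration, since the language of the existing program is not stated here. The four step functions are hypothetical placeholders supplied by the caller; they stand in for whatever PDF, image, OCR, and MySQL code the real program uses.

```python
from pathlib import Path
from typing import Callable, List


def process_pdf(
    pdf: Path,
    split_pages: Callable[[Path], List[Path]],  # hypothetical: PDF -> single-page PDFs
    page_to_jpg: Callable[[Path], Path],        # hypothetical: single-page PDF -> JPG
    ocr_jpg: Callable[[Path], Path],            # hypothetical: JPG -> TXT file
    store_text: Callable[[str], None],          # hypothetical: insert text into MySQL
) -> int:
    """Run steps 1-5 for one PDF and return the number of pages stored."""
    temp_files: List[Path] = []
    pages_stored = 0
    try:
        for page in split_pages(pdf):            # step 1
            temp_files.append(page)
            jpg = page_to_jpg(page)              # step 2
            temp_files.append(jpg)
            txt = ocr_jpg(jpg)                   # step 3
            temp_files.append(txt)
            store_text(txt.read_text(encoding="utf-8"))  # step 4
            pages_stored += 1
    finally:
        # Step 5 (intermediates): always clean up split PDFs, JPGs, and TXT files.
        for f in temp_files:
            f.unlink(missing_ok=True)
    # Step 5 (original): only reached if every page was processed without error.
    pdf.unlink()
    return pages_stored
```

Step 6 then amounts to calling `process_pdf` once per PDF in the folder. In this sketch the original PDF is kept when a step fails, which is a design assumption and may differ from what the existing program does.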
Things to modify:
1. Have it start automatically when the computer reboots.
2. Continuously monitor the folder and start processing when new PDFs are added.
3. Skip OCR for a page if its JPG is larger than X size.
4. Allow the user to specify the MySQL database name.
5. Allow the user to save the above settings so they load the next time the program starts.
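One possible shape for items 2 through 5 is sketched below in Python, as an illustration under assumptions rather than the program's actual code. The settings field names, the default values, and the JSON file format are all hypothetical, and the size limit X is a placeholder because no value is given. Folder monitoring is shown as simple polling with the standard library only. Auto-start on reboot (item 1) is normally configured in the operating system, for example as a scheduled task or a service, so it is not shown.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Set


@dataclass
class Settings:
    watch_folder: str = "incoming"       # hypothetical default folder to monitor
    max_jpg_bytes: int = 5_000_000       # the "X size" limit; placeholder value
    mysql_database: str = "ocr_results"  # hypothetical default database name


def save_settings(settings: Settings, path: Path) -> None:
    """Item 5: write the settings to a JSON file."""
    path.write_text(json.dumps(asdict(settings), indent=2), encoding="utf-8")


def load_settings(path: Path) -> Settings:
    """Item 5: load saved settings, or fall back to defaults if none exist."""
    if not path.exists():
        return Settings()
    return Settings(**json.loads(path.read_text(encoding="utf-8")))


def should_skip_ocr(jpg: Path, settings: Settings) -> bool:
    """Item 3: skip a page when its JPG is larger than the configured limit."""
    return jpg.stat().st_size > settings.max_jpg_bytes


def find_new_pdfs(folder: Path, seen: Set[Path]) -> List[Path]:
    """Item 2: return PDFs not seen on earlier calls and remember them."""
    new = sorted(p for p in folder.glob("*.pdf") if p not in seen)
    seen.update(new)
    return new
```

A monitoring loop would call `find_new_pdfs` repeatedly with a short sleep between calls and hand each returned file to the existing pipeline. Polling is chosen here only because it needs no extra dependencies; an event-based file watcher would be a reasonable alternative.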
Here is a mockup draft of the final product: [login to view URL]