I need a solution/utility, or a set of utilities, that can:
a) index a large set of PDF files (a combination of single-page and multi-page files)
b) dump the indexed information into a data structure, ideally a MySQL/Access/MSSQL/PostgreSQL database
c) allow manipulation of the collected data
d) generate JPEGs and image maps based on the manipulated data
e) generate new PDFs based on the manipulated data
The script/utility would normally be an iterative solution that does the following on a set of PDF files (approx. 400 PDFs):
1. Read the PDF files
2. Get all text from the PDF into a database
3. Extract all images from the PDF into separate files and note them in an Excel/MySQL DB
4. Get all web hyperlinks from the PDF
5. Get all email hyperlinks from the PDF
6. Convert each page of the PDF into a JPEG
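As an illustration of steps 4 and 5: hyperlinks in a PDF are typically stored as URI strings, and email links normally use the mailto: scheme, so the two kinds can be separated after extraction. The sketch below assumes the URIs have already been pulled out of the PDF by some parser; the function name classify_link and the three category labels are hypothetical, not part of any existing tool.

```python
from typing import Tuple
from urllib.parse import urlsplit, unquote


def classify_link(uri: str) -> Tuple[str, str]:
    """Return (kind, target), where kind is 'email', 'web' or 'other'.

    Hypothetical helper: it only inspects the URI scheme and does not
    check that the address or URL is actually reachable.
    """
    cleaned = uri.strip()
    parts = urlsplit(cleaned)
    scheme = parts.scheme.lower()
    if scheme == "mailto":
        # The address is the path part; a trailing "?subject=..." is dropped.
        return "email", unquote(parts.path)
    if scheme in ("http", "https"):
        return "web", cleaned
    # Anything else (ftp:, file:, internal jumps) is left for manual review.
    return "other", cleaned
```

A call such as classify_link("mailto:info@example.com?subject=Hi") would yield the kind "email" with the bare address, which could then be stored in its own column or table.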
For each data item (text, hyperlink, image) the database should hold information regarding:
1. The page (PDF page) and file in which the data item is found.
2. The position (starting and ending coordinates) and size (width/height) of each data item on the PDF page.
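One possible shape for such a table is sketched below. It uses SQLite from the Python standard library purely so the example runs without a server; the same columns would carry over to MySQL or PostgreSQL. The table name data_item, the column names and the helper functions are all assumptions made for illustration. Width and height are derived from the stored corner coordinates rather than stored twice.

```python
import sqlite3

# Hypothetical schema: one row per text block, hyperlink or image.
SCHEMA = """
CREATE TABLE IF NOT EXISTS data_item (
    id          INTEGER PRIMARY KEY,
    file_name   TEXT    NOT NULL,
    page_number INTEGER NOT NULL,
    item_type   TEXT    NOT NULL
                CHECK (item_type IN ('text', 'hyperlink', 'image')),
    content     TEXT,              -- the text, the URI, or the image file path
    x0 REAL NOT NULL, y0 REAL NOT NULL,   -- starting corner
    x1 REAL NOT NULL, y1 REAL NOT NULL    -- ending corner
);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, file_name, page_number, item_type, content, box):
    """Insert one data item; box is (x0, y0, x1, y1). Returns the new row id."""
    x0, y0, x1, y1 = box
    cur = conn.execute(
        "INSERT INTO data_item "
        "(file_name, page_number, item_type, content, x0, y0, x1, y1) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (file_name, page_number, item_type, content, x0, y0, x1, y1),
    )
    conn.commit()
    return cur.lastrowid


def items_on_page(conn, file_name, page_number):
    """Return (type, content, x0, y0, width, height) for every item on a page."""
    return conn.execute(
        "SELECT item_type, content, x0, y0, x1 - x0, y1 - y0 "
        "FROM data_item WHERE file_name = ? AND page_number = ? ORDER BY id",
        (file_name, page_number),
    ).fetchall()
```

Keeping file name and page number on every row makes it straightforward to pull all items for one page when the JPEG or image map for that page is built.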
Based on the information stored in the database (the parser results), the following should be possible, either through the same utility or through a separate one:
1. enter additional customizable information connected to data items, possibly in another table or database (comments such as enabled, category, section, location, address, etc.)
2. based on that additional information, create image maps on the JPEGs or generate new PDFs.
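To show how stored coordinates could drive an image map, here is a minimal sketch. It assumes the boxes are stored in PDF points (1/72 inch) with the origin at the bottom-left of the page, which is the default PDF coordinate system, and that each page JPEG was rendered at a known DPI. Some extraction libraries already report a top-left origin, in which case the vertical flip would not be needed. The function names and the (box, href) item format are hypothetical.

```python
from html import escape

POINTS_PER_INCH = 72.0  # default PDF user-space unit


def pdf_box_to_pixels(box, page_height_pt, dpi):
    """Convert (x0, y0, x1, y1) in PDF points, bottom-left origin,
    to (left, top, right, bottom) in image pixels, top-left origin."""
    x0, y0, x1, y1 = box
    scale = dpi / POINTS_PER_INCH
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the vertical axis: the top of the box is the larger PDF y value.
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


def build_image_map(name, jpeg_src, items, page_height_pt, dpi):
    """Return HTML for an <img> plus a <map> with one rectangular area per item.

    items is an iterable of (box, href) pairs, e.g. rows selected from the
    database after filtering on the additional information (enabled, category).
    """
    lines = [
        f'<img src="{escape(jpeg_src)}" usemap="#{escape(name)}" alt="">',
        f'<map name="{escape(name)}">',
    ]
    for box, href in items:
        left, top, right, bottom = pdf_box_to_pixels(box, page_height_pt, dpi)
        lines.append(
            f'  <area shape="rect" coords="{left},{top},{right},{bottom}" '
            f'href="{escape(href)}" alt="">'
        )
    lines.append("</map>")
    return "\n".join(lines)
```

For a US Letter page (792 points high) rendered at 144 DPI, every point maps to two pixels, so a one-inch-wide link box becomes a 144-pixel-wide clickable area.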
The solution should be very generic and reusable. It could be a server-based solution or executable solution.
The solution needs to be open source, because further development (working out how best to use the software in different scenarios and assessing its limitations) will build on the script's architecture and structure. Usage of the code will be agreed with the solution provider.
This is phase one of the project and is the framework for future work (normally much easier than this phase).
The solution provider is expected to fully understand the script developed, as the same provider will be kept for future enhancements and add-ons (planned: phases 2, 3 and 4).