I need Python code that extracts information from PDF and Word documents saved in a folder. The result should be a Python dictionary with key:value pairs for each document, as below:
mainTitle : "main title of the document"
numPages : "number of pages in document"
numPara : "total number of paragraphs"
subTitle1 : "1st sub-title"
para1.0 : "1st paragraph under sub-title"
para1.1 : "2nd paragraph under sub-title"
subTitle2 : "2nd sub-title"
para2.0 : "1st paragraph under sub-title"
para2.1 : "2nd paragraph under sub-title"
content of 2nd document...
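As a rough sketch of the key layout above, assuming the title, page count, and sections have already been extracted by some other step. The function name `build_doc_dict` and the `sections` input shape are hypothetical, chosen only for illustration:

```python
def build_doc_dict(main_title, num_pages, sections):
    """Build the per-document dictionary.

    `sections` is assumed to be a list of (sub_title, [paragraph, ...])
    tuples in document order; this input shape is an assumption.
    """
    result = {
        "mainTitle": main_title,
        "numPages": num_pages,
        # Total paragraphs across all sub-titles.
        "numPara": sum(len(paras) for _, paras in sections),
    }
    # Sub-titles are numbered from 1, paragraphs under each from 0,
    # matching the subTitle1 / para1.0 / para1.1 pattern in the request.
    for i, (sub_title, paras) in enumerate(sections, start=1):
        result[f"subTitle{i}"] = sub_title
        for j, para in enumerate(paras):
            result[f"para{i}.{j}"] = para
    return result


doc = build_doc_dict(
    "Annual Report",
    12,
    [("Overview", ["First block.", "Second block."]),
     ("Methods", ["Only block."])],
)
print(list(doc))
```

One dictionary per document could then be collected into a list or into an outer dictionary keyed by file name; which of the two is wanted is not stated here.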
Paragraphs are blocks of text under a title. If a paragraph (block) is too long, say more than 150 words, it should be split in two at the sentence-ending dot (.) that is closest to the middle.
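A minimal sketch of the splitting rule, under a few assumptions: "middle" is measured in characters rather than words, a sentence end is a dot followed by whitespace, and the paragraph is split only once. The function name `split_long_paragraph` is hypothetical:

```python
import re


def split_long_paragraph(text, max_words=150):
    """Split an over-long paragraph in two at the sentence end nearest the middle."""
    if len(text.split()) <= max_words:
        return [text]
    # Candidate cut points: just after a dot that is followed by whitespace.
    # The final dot of the paragraph is not a candidate, so neither half is empty.
    ends = [m.end() for m in re.finditer(r"\.(?=\s)", text)]
    if not ends:
        return [text]  # no sentence boundary found, leave the block whole
    middle = len(text) / 2
    cut = min(ends, key=lambda pos: abs(pos - middle))
    return [text[:cut].strip(), text[cut:].strip()]


sentence = "alpha beta gamma delta epsilon."
long_text = " ".join([sentence] * 40)  # 200 words
parts = split_long_paragraph(long_text)
print(len(parts), [len(p.split()) for p in parts])
```

Abbreviations such as "e.g. " would be treated as sentence ends by this pattern, so real documents may need a stricter boundary rule.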
The table of contents and other irrelevant information should be ignored.
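What counts as "irrelevant" is not defined here, so the following is only a heuristic sketch. It assumes table-of-contents entries use dotted leaders ending in a page number, and that bare page numbers and a "Contents" heading can be dropped. The patterns and the name `drop_irrelevant_lines` are assumptions:

```python
import re

# Heuristic patterns (assumptions, not a complete definition of "irrelevant"):
PATTERNS = [
    re.compile(r"(\.\s?){3,}\s*\d+\s*$"),                         # "Intro ..... 3"
    re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE),          # bare page number
    re.compile(r"^\s*(table of )?contents\s*$", re.IGNORECASE),   # TOC heading
]


def drop_irrelevant_lines(lines):
    """Keep only non-empty lines that match none of the heuristic patterns."""
    return [
        line for line in lines
        if line.strip() and not any(p.search(line) for p in PATTERNS)
    ]


sample = [
    "Table of Contents",
    "Introduction ........ 3",
    "1.2 Methods . . . . 12",
    "7",
    "Introduction",
    "This report covers 3 topics.",
]
kept = drop_irrelevant_lines(sample)
print(kept)
```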
An example document is attached.
32 freelancers are bidding an average of $118 for this job
⭐⭐⭐⭐⭐ Okay. I have extensive experience working on projects like this and will give you 100% accurate work. If you need sample work, just send me a message. Waiting for your quick reply.
I completed my bachelor's degree in computer science with a gold medal. I am very careful and complete tasks accurately. I assure you that you will be happy if you choose [login to view URL] you