We use OpenSemanticSearch (OSS) for full-text fuzzy search in 3GPP documents (e.g. from ftp://[url removed, login to view]).
Most of the documents are in MS Word format and contain semi-structured metadata that can be extracted, for example, with regular expressions.
We are looking for an experienced Solr/OSS developer to develop the "data enrichment" modules (presumably regex-based steps in the OSS data processing flow, to be confirmed) that extract the following information from the Word documents (see attached example):
1) Document number (also the filename), e.g.: R1-1706960
2) Source, e.g.: Huawei
3) Title, e.g.: PUCCH resource allocation for HARQ-ACK and SR
4) Meeting name, e.g.: 3GPP TSG RAN WG1 Meeting #89
5) Meeting place, e.g.: Hangzhou, China
6) Meeting date, e.g.: 15 May 2017
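To illustrate the kind of extraction we have in mind, here is a minimal Python sketch of regex-based extraction of the six fields. It is not an OSS plugin and does not use any OSS API. The header layout in SAMPLE_HEADER is an assumption about how the text might look after conversion from Word (the attached example is the reference), and the field names and patterns are hypothetical and would need tuning against real documents.

```python
import re

# Hypothetical header text, as it might look after the Word file is
# converted to plain text. The real layout must be checked against the
# attached example document.
SAMPLE_HEADER = (
    "3GPP TSG RAN WG1 Meeting #89\tR1-1706960\n"
    "Hangzhou, China, 15 May 2017\n"
    "Source: Huawei\n"
    "Title: PUCCH resource allocation for HARQ-ACK and SR\n"
)

# A date such as "15 May 2017"; spaces/tabs only, so it never spans lines.
DATE = r"\d{1,2}[ \t]+[A-Z][a-z]+[ \t]+\d{4}"

# One pattern per field; group 1 holds the extracted value.
# Field names are illustrative, not existing Solr/OSS field names.
PATTERNS = {
    "document_number": re.compile(r"\b(R\d-\d{7})\b"),
    "source": re.compile(r"^Source:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
    "title": re.compile(r"^Title:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
    "meeting_name": re.compile(r"^(3GPP TSG .+? Meeting #\d+\w*)", re.MULTILINE),
    "meeting_place": re.compile(r"^(.+?),[ \t]*" + DATE + r"[ \t]*$", re.MULTILINE),
    "meeting_date": re.compile(r"(" + DATE + r")[ \t]*$", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict of field name -> extracted value (None if not found)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


if __name__ == "__main__":
    for name, value in extract_fields(SAMPLE_HEADER).items():
        print(f"{name}: {value}")
```

Fields that cannot be found are returned as None, so a missing header line does not block indexing of the other fields.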
These 6 fields shall be visible in the Solr search result snippets, and fields 2), 4) and 6) shall be available as filter criteria (facets). Refer to the attached screenshot for the current results display.
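As a rough way to check the result on the Solr side, the sketch below builds a select URL that returns the six fields and facets on source, meeting name and meeting date, using the standard Solr `q`, `fl`, `facet` and `facet.field` parameters. The base URL, core name and field names (`source_s` etc.) are hypothetical placeholders; the actual names depend on the delivered schema configuration.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, display_fields, facet_fields):
    """Build a Solr /select URL with returned fields and field facets."""
    params = [
        ("q", text),                       # full-text query
        ("fl", ",".join(display_fields)),  # fields returned per result
        ("facet", "true"),                 # enable faceting
    ]
    # Solr accepts one facet.field parameter per faceted field.
    params += [("facet.field", name) for name in facet_fields]
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical core URL and field names, for illustration only.
    url = build_facet_query(
        "http://localhost:8983/solr/core1",
        "PUCCH",
        ["document_number_s", "source_s", "title_s",
         "meeting_name_s", "meeting_place_s", "meeting_date_s"],
        ["source_s", "meeting_name_s", "meeting_date_s"],
    )
    print(url)
```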
Please provide the OSS/Solr/Tika configuration files needed once they are tested and validated in your environment, and allow a few days for reproduction and milestone payment.