With the help of the Lucene library, you will implement an information retrieval system. The user will be able to formulate queries through a graphical environment, and the system will display the results in order of relevance, with the most relevant first.
To estimate the performance of the system, we will use the CACM collection ([url removed, login to view] file). CACM includes 3204 research articles published in the reputable journal Communications of the ACM. Each text includes a group of fields, among which the most important are the title of the article, the authors, and the summary of the article. The collection also contains 64 information requests (queries) about the content of the collection texts ([url removed, login to view] file), and the relevant texts for each query have been specified ([url removed, login to view] file). Using this information, you will estimate the performance of the system by constructing a precision-recall chart for each query. The graphical presentation of the performance results can be produced either by the system itself or with the help of an external program (e.g., Excel).
For this collection, you will rely solely on the summaries, titles, and author names of the texts, ignoring all other fields. During this process, you should ignore the words contained in the file common_words.txt. You will also apply the Porter stemming algorithm. The index and the search functions of the retrieval system will be implemented with the Lucene library classes.
The system will also enable the user to improve the results through successive rounds of relevance feedback. In principle, the initial query will be improved automatically by the system.
More specifically, the basic steps are as follows:
1. Select the first n texts from the ranked list of results for the initial query.
2. For each of these texts, the k most frequent terms in the text are selected.
3. All of these terms are added to the initial query, i.e., the k terms from each of the n texts.
4. Any additional terms that already exist in the original query are ignored.
5. The query is resubmitted.
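The expansion steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name FeedbackExpander and its methods are hypothetical, each text is assumed to be already tokenized, stop-word filtered and stemmed, and ties between equally frequent terms are broken alphabetically (the description does not say how to break them). In the real system the term lists would come from the Lucene index, and the expanded term list would be turned back into a Lucene query before resubmission.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class FeedbackExpander {

    /** Step 2: the k most frequent terms of one text (ties broken alphabetically, an assumption). */
    static List<String> topTerms(List<String> docTerms, int k) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String term : docTerms) {
            freq.merge(term, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Steps 1, 3 and 4: expand the query with the top-k terms of the first n ranked texts. */
    static List<String> expand(List<String> queryTerms, List<List<String>> rankedDocs, int n, int k) {
        // The set keeps insertion order and silently ignores terms that are
        // already in the query (step 4) or were added from an earlier text.
        Set<String> expanded = new LinkedHashSet<>(queryTerms);
        for (int i = 0; i < Math.min(n, rankedDocs.size()); i++) {
            expanded.addAll(topTerms(rankedDocs.get(i), k));
        }
        return new ArrayList<>(expanded);
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("retriev", "system");
        List<List<String>> ranked = Arrays.asList(
                Arrays.asList("index", "index", "retriev", "retriev", "search"),
                Arrays.asList("rank", "rank", "index"));
        // Step 5 would resubmit this expanded term list as the new query.
        System.out.println(expand(query, ranked, 2, 2));
    }
}
```

Using an insertion-ordered set keeps the original query terms first and makes the duplicate check of step 4 implicit. A term that is among the top k of several texts is added only once here, which is one possible reading of step 3.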
Parameters n and k can be changed dynamically by the user.
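For the evaluation part of the task, the points of a precision-recall chart for a single query could be computed along the following lines. This is a sketch, not a prescribed method: the class name PrecisionRecall and the use of integer text identifiers are assumptions, and it records one point at each rank where a relevant text is retrieved, which is one common convention among several (interpolation at fixed recall levels is another).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class PrecisionRecall {

    /**
     * Returns one {recall, precision} pair for each rank at which a relevant
     * text appears in the ranked result list of a query.
     */
    static List<double[]> curve(List<Integer> rankedIds, Set<Integer> relevantIds) {
        List<double[]> points = new ArrayList<>();
        if (relevantIds.isEmpty()) {
            return points; // no relevant texts listed for this query: nothing to plot
        }
        int hits = 0;
        for (int rank = 1; rank <= rankedIds.size(); rank++) {
            if (relevantIds.contains(rankedIds.get(rank - 1))) {
                hits++;
                double recall = (double) hits / relevantIds.size();
                double precision = (double) hits / rank;
                points.add(new double[] {recall, precision});
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // Illustrative identifiers only, not taken from the CACM files.
        List<Integer> ranked = Arrays.asList(10, 20, 30, 40, 50);
        Set<Integer> relevant = new HashSet<>(Arrays.asList(10, 30, 99));
        for (double[] p : curve(ranked, relevant)) {
            System.out.printf("recall=%.3f precision=%.3f%n", p[0], p[1]);
        }
    }
}
```

The resulting pairs can be written to a CSV file and plotted in an external program such as Excel, or drawn directly by the system.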