This project is strictly for people who are highly skilled in Nutch, Hadoop, and Solr; for someone who knows these tools well, integrating the three should not take more than an hour. After this, I will have more work with respect to search engine development, as I plan to run large-scale searches.
For now, I need a Nutch, Solr, and Hadoop integration that meets the following requirements:
1. Hadoop will be configured on more than two machines, and it should be easy to add another machine to scale out the existing configuration.
2. Nutch will handle crawling and indexing: it will pick up URLs from a flat seed file, read its configuration from a central settings file, and start indexing, using Hadoop to distribute the work across the other machines. It must be configured so that URLs that have already been indexed are not fetched again unless a reindex flag is set in the settings file.
3. Nutch's output will go to Solr, and I should be able to search the indexed websites through Solr. Solr will likewise be integrated with the cluster so that searches can be distributed.
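For item 1, scaling out a Hadoop 3.x cluster is usually a matter of listing worker hostnames in a single file and starting the daemons on the new machine. A minimal sketch (the hostnames are placeholders, not part of the spec above):

```text
# $HADOOP_HOME/etc/hadoop/workers -- one worker hostname per line
subserver1
subserver2
# To expand the cluster: append the new hostname here, copy the same
# Hadoop configuration directory to that machine, then start its
# daemons on it:
#   hdfs --daemon start datanode
#   yarn --daemon start nodemanager
```

Because every worker shares the same configuration directory, adding a machine does not require touching the jobs themselves.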
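For item 2, Nutch 1.x reads seed URLs from a flat file (e.g. `urls/seed.txt`, injected with `bin/nutch inject crawl/crawldb urls`) and controls re-fetching through its fetch interval. The "reindex flag" is approximated below by the standard `db.fetch.interval.default` property, which is an assumption about how you would want it implemented, not the only option:

```xml
<!-- conf/nutch-site.xml: a URL that has already been fetched is not
     eligible for fetching again until this many seconds have passed
     (2592000 s = 30 days, the Nutch default). A "reindex" run could
     lower this value, or force early re-fetch with generate -adddays. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>
```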
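For item 3, Nutch 1.x ships a Solr indexing job, and searches then go through Solr's HTTP API. A command sketch, where host, port, and core name are placeholders:

```shell
# Push crawled segments into Solr (Nutch 1.x solrindex job)
bin/nutch solrindex http://centralserver:8983/solr/nutch \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

# Query the index over HTTP
curl "http://centralserver:8983/solr/nutch/select?q=content:hadoop&wt=json"
```

One caveat worth noting in proposals: distributed search in Solr is normally handled by Solr's own sharding/SolrCloud rather than by Hadoop itself; Hadoop distributes the crawl and indexing jobs.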
Initially, we will have one central server and two sub-servers across which we can distribute search and indexing.
If you can also suggest ways to adjust result ranking dynamically, I would be interested.
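On dynamic ranking: Solr supports query-time boosting, so ranking can be changed per request without reindexing. A sketch using the edismax parser's `bq` and `boost` parameters; the field names (`host`, `tstamp`) are assumptions about what the Nutch schema provides:

```shell
# Boost results from a given host and favor recently fetched pages
# (recip(ms(NOW,tstamp),...) is Solr's usual recency-decay function)
curl "http://centralserver:8983/solr/nutch/select" \
  --data-urlencode "q=hadoop" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "bq=host:example.com^2.0" \
  --data-urlencode "boost=recip(ms(NOW,tstamp),3.16e-11,1,1)"
```

Because these are request parameters, the weighting can be driven from the central settings file and changed at any time.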
Let me know.