I need a "Similar Pages" script/tool to use on my site. My site is an article directory with over 160K articles. I already have a "Duplicate Content" tool, so the new tool will compare articles to flush out those that are similar: articles that have been rewritten but are essentially the same.
My existing duplicate content tool can flush out 100% duplicate content, but it can't spot articles that are 95% or even 70% duplicate/similar.
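A minimal sketch of one common way to score partial similarity, assuming word shingles compared with Jaccard similarity. This is an illustration only, not a required approach: the function names, the shingle size `k=3`, and any cutoff such as 0.7 are assumptions to be tuned on real articles.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Two articles would then be flagged when `similarity(a, b)` is at or above the chosen cutoff. Note that changing a single word affects every shingle that contains it, so the score for a lightly rewritten article can be well below the share of words left unchanged.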
You must have done a similar project before, or have a demo showing that your script works. Also, because of the nature of what this script will do, I expect it will take huge server resources. Your script must be configured to prevent or manage this.
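On the resource side, comparing every pair among 160K articles is roughly 12.8 billion comparisons, so some form of reduction is needed. One possible technique, sketched here under assumptions, is MinHash: each article's shingle set is reduced once to a short fixed-size signature, and signatures are compared instead of full texts. The function names and `num_hashes=64` are hypothetical choices; a production tool would likely add an index (for example locality-sensitive hashing) and run in batches.

```python
import hashlib


def minhash_signature(shingle_set, num_hashes=64):
    """Reduce a set of strings to num_hashes integers (one minimum per salted hash)."""
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives a different hash function.
        salt = seed.to_bytes(8, "little")
        min_value = min(
            (
                int.from_bytes(
                    hashlib.blake2b(
                        s.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for s in shingle_set
            ),
            default=2**64 - 1,  # placeholder value for an empty set
        )
        signature.append(min_value)
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Signatures can be computed once per article and stored, so later comparisons touch only 64 integers per article rather than the full text. The result is an estimate, so borderline pairs may still need an exact check.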
If you already know of a script or tool out there that can be configured for use on my site, that is no problem.
The site is [url removed]