I have attached a document, found online, that explains how Microsoft Outlook 2003 scores the content of a message for spam.
The files that Microsoft uses categorize words or groups of characters and assign a weight to each. The position of the word (body or subject) seems to affect the score as well. The entries in the file are MD5-hashed (MD5 is a hash, not encryption), so the file cannot simply be read. However, if we take a sample word, hash it the same way, and then look up that hash in the files, we should be able to pull out all of the scores.
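To make the "hash a sample word" step concrete, here is a minimal sketch of computing an MD5 digest for a word. It is in Python only for brevity. The helper name `md5_hex`, the UTF-8 encoding, and the absence of any case folding are my assumptions; how the data files actually encode the word before hashing, and how the digest is laid out in the file, still has to be confirmed against the attached document.

```python
import hashlib


def md5_hex(word, encoding="utf-8"):
    """Return the MD5 digest of a word as 32 hex characters.

    Assumption: the word is hashed as UTF-8 bytes with no case folding.
    The real files may use a different encoding or normalisation.
    """
    return hashlib.md5(word.encode(encoding)).hexdigest()
```

The digest would then be the key to search for in the data file, if the file really stores plain MD5 values.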
I'll give more details to anyone interested.
So what is needed is as follows:
* A Windows application built in C# and compiled.
* The application needs a way to specify which .dat and .dll files to use for the process, as Microsoft might release new, updated files.
* Then there needs to be a way to specify a test file. This file will be stored in .txt format and will contain a lot of text. The program needs to parse the file into individual words, using a space as the separator. For example, if a file contains the following:
* <[url removed, login to view]> mike mike,
* then the three words or text strings to run against the program are:
* <[url removed, login to view]>
* Each word that is run against the data files should have two weights returned: one for the subject and one for the body.
* The results should be stored in a DB (MSDE should be fine). The following should be stored:
* the string
* the weight in the subject position
* the weight in the body position
* the most recent date the value was obtained
* Some text strings will not get a score at all, because the data file might not have a score for that word. If that is the case, mark it as 0.00.
* This concludes the parsing and database population phase of the application. The second phase is to generate text files of text strings that can be used to beat the filter. Since the Microsoft filter combines the weight of each word in the body, the program needs to generate a set of text strings that can be used to push down the overall score of the entire message.
* The application needs to have a Generate section.
* In this section the user will specify the number of text strings he wishes to get back.
* There should be an option to include 0.00 words or not.
* The application will then go through the DB and randomly pull out words/strings with a negative body weight, optionally including words that have a score of 0.00.
* The application needs to mark the words as used and place a date stamp in the database.
* The generate function should prefer the words with the oldest last-used date. If no words have a last-used date, the selection will be random. If, say, only 1000 words share the oldest date, it will use those words first.
* We will be constantly filling the db with more words from different languages so it should have lots of words to bring back.
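The parsing, storage, and selection rules above can be sketched roughly as follows. This only illustrates the logic, written in Python with SQLite for brevity; the deliverable itself is still C# with MSDE. The table and column names, the function names `tokenize`, `store`, and `generate`, and the choice to treat all whitespace (not just spaces) as a separator are my assumptions. Looking up weights in the data files is not shown.

```python
import random
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per text string.
SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    string         TEXT PRIMARY KEY,
    subject_weight REAL NOT NULL,
    body_weight    REAL NOT NULL,
    obtained_on    TEXT NOT NULL,
    last_used      TEXT
)
"""


def _now():
    return datetime.now(timezone.utc).isoformat()


def tokenize(text):
    """Split on whitespace; punctuation stays attached, so 'mike' != 'mike,'."""
    return text.split()


def store(conn, word, subject_weight, body_weight, now=None):
    """Insert or refresh a word; a missing score (None) is stored as 0.00."""
    now = now or _now()
    sw = 0.0 if subject_weight is None else subject_weight
    bw = 0.0 if body_weight is None else body_weight
    cur = conn.execute(
        "UPDATE words SET subject_weight = ?, body_weight = ?, obtained_on = ? "
        "WHERE string = ?",
        (sw, bw, now, word),
    )
    if cur.rowcount == 0:  # word not seen before
        conn.execute(
            "INSERT INTO words (string, subject_weight, body_weight, obtained_on) "
            "VALUES (?, ?, ?, ?)",
            (word, sw, bw, now),
        )


def generate(conn, count, include_zero=False, now=None, rng=random):
    """Pick up to `count` strings, oldest last-used first, random among ties."""
    now = now or _now()
    op = "<=" if include_zero else "<"
    rows = conn.execute(
        f"SELECT string, last_used FROM words WHERE body_weight {op} 0"
    ).fetchall()
    rng.shuffle(rows)  # random order; the stable sort below keeps it within ties
    # Never-used rows (last_used IS NULL) come first, then the oldest dates.
    rows.sort(key=lambda r: (r[1] is not None, r[1] or ""))
    picked = [r[0] for r in rows[:count]]
    conn.executemany(
        "UPDATE words SET last_used = ? WHERE string = ?",
        [(now, w) for w in picked],
    )
    return picked
```

Shuffling first and then using a stable sort on the last-used date gives the "oldest first, otherwise random" behaviour without a separate code path for the case where nothing has been used yet.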
Some more considerations...
The attached document refers to Outlook 2003 filtering. We would like to use the most recent version of Outlook to grab the data and DLL files. I just need you to confirm that the format of these files has not changed too drastically.
I can get you the 2003 and 2010 files if you need me to.