Using Python, reproduce a classifier that was built for a research paper: a binary Naive Bayes (NB) classifier for the DMOZ (ODP) dataset (the dataset will be provided), using the BOW toolkit.
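To make the classifier requirement concrete, here is a minimal pure-Python sketch of a binary multinomial Naive Bayes with add-one smoothing. It is a stand-in for illustration only, not the BOW toolkit's API; the class name `BinaryNaiveBayes`, the smoothing choice, and the toy training data are all assumptions.

```python
import math
from collections import Counter


class BinaryNaiveBayes:
    """Two-class multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of feature lists (e.g. URI n-grams); labels: list of bools.
        # Assumes both classes occur at least once in the training data.
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter(labels)
        for feats, label in zip(docs, labels):
            self.counts[label].update(feats)
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        self.totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        return self

    def _log_score(self, feats, c):
        # Log prior plus smoothed log likelihoods; unseen features are skipped.
        score = math.log(self.n_docs[c] / sum(self.n_docs.values()))
        denom = self.totals[c] + len(self.vocab)
        for f in feats:
            if f in self.vocab:
                score += math.log((self.counts[c][f] + 1) / denom)
        return score

    def predict(self, feats):
        # True means "in the topic", False means "not in the topic".
        return self._log_score(feats, True) > self._log_score(feats, False)


# Toy data (hypothetical): True = in topic, False = not in topic.
train_docs = [["news", "ews."], ["news", "dail"], ["shop", "cart"], ["shop", "sale"]]
train_labels = [True, True, False, False]
clf = BinaryNaiveBayes().fit(train_docs, train_labels)
```

One such classifier would be trained per category, with that category's URIs as the positive class.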
The DMOZ dataset contains (category, URI, title, description). Training uses only the category and URI; testing uses only the URI. The URIs should be represented as all-grams, that is, 4-, 5-, 6-, 7-, and 8-grams combined (for more details on all-grams, see the Research Paper). The dataset is in RDF format and can be converted to CSV using the tool [url removed, login to view] found in [url removed, login to view]
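As a rough illustration of the all-gram features, the sketch below combines the 4- to 8-grams of a URI. It assumes the n-grams are character n-grams taken inside each lowercase alphabetic token of the URI; the exact tokenization should be checked against the Research Paper. The function name `all_grams` is hypothetical.

```python
import re


def all_grams(uri, n_min=4, n_max=8):
    """Return the combined 4- to 8-grams of a URI as a list (duplicates kept)."""
    # Assumption: lowercase the URI and treat runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", uri.lower())
    grams = []
    for token in tokens:
        for n in range(n_min, n_max + 1):
            # Slide a window of width n across the token; tokens shorter
            # than n contribute nothing for that n.
            for i in range(len(token) - n + 1):
                grams.append(token[i:i + n])
    return grams
```

For example, the token `sports` yields three 4-grams, two 5-grams, and one 6-gram, while tokens shorter than four letters yield nothing.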
The sizes of the test and training sets follow the Research Paper's method: for testing, 1K URIs for each topic; for training, the same number of positive examples (in the category) and the same number of negative examples drawn from all the other categories (not in the topic). For example, if 1000 are in the news category, we will have to collect 1000/(number of other categories) from each of the other categories. (Note: this can be done easily using a tool called [url removed, login to view], found in [url removed, login to view])
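The balanced sampling described above could be sketched as follows. This is an assumption-laden illustration, not the tool mentioned in the note: the function name `build_training_set`, the fixed random seed, and the choice to drop any remainder when the positives do not divide evenly are all hypothetical.

```python
import random
from collections import defaultdict


def build_training_set(rows, topic, n_positive, seed=0):
    """rows: iterable of (category, uri). Returns a list of (uri, is_positive)."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for category, uri in rows:
        by_cat[category].append(uri)

    # random.sample raises ValueError if a category has too few URIs.
    positives = rng.sample(by_cat[topic], n_positive)
    others = [c for c in sorted(by_cat) if c != topic]
    # Equal integer share per other category; any remainder is dropped.
    per_cat = n_positive // len(others)
    negatives = []
    for c in others:
        negatives.extend(rng.sample(by_cat[c], per_cat))
    return [(u, True) for u in positives] + [(u, False) for u in negatives]
```

The test URIs would need to be set aside before calling this, so that training and test sets do not overlap.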
The result should be a table matching the one on page 10 of the Research Paper: for the ODP dataset, each category has a precision (P), recall (R), and F score, along with the overall average.
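A minimal sketch of how the per-category scores and the average row might be computed is given below. It assumes F is the harmonic mean of P and R (F1) and that the average is the unweighted mean over categories; both assumptions should be checked against the Research Paper. The function names `prf` and `results_table` are hypothetical.

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for one binary classifier (lists of bools)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def results_table(per_category):
    """per_category: dict mapping category -> (y_true, y_pred).

    Returns rows of (category, P, R, F) plus a final "Average" row.
    """
    rows = [(cat, *prf(*per_category[cat])) for cat in sorted(per_category)]
    # Assumption: unweighted (macro) mean over categories.
    avg = tuple(sum(r[i] for r in rows) / len(rows) for i in (1, 2, 3))
    rows.append(("Average", *avg))
    return rows
```

Each row can then be written out with the `csv` module or formatted as text to mirror the layout of the paper's table.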
I will need all the code created for the classifier, along with the results.
Research Paper used is: A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification