In this project, you will predict rating scores (1 to 5) from the text of online reviews. You are given a dataset of reviews and their rating scores. The task is to take the review text as input and predict the rating score for that review. You will experiment with several supervised learning algorithms using this dataset. You must use Jupyter Notebook.
There are 2 files, [login to view URL] and test.txt. Each file contains review-text and the score separated by a tab. There are 10K reviews in [login to view URL] and 1K reviews in test.txt.
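The tab-separated layout described above can be read with the standard library alone. The following is a minimal sketch, assuming one review per line and integer scores; the helper name load_reviews and the two sample lines are made up for illustration and are not taken from the real files.

```python
import io


def load_reviews(handle):
    """Read lines of '<review text><TAB><score>' from an open text file."""
    texts, scores = [], []
    for line in handle:
        # Split on the LAST tab so that a tab inside the review text is kept.
        text, sep, score = line.rstrip("\r\n").rpartition("\t")
        if not sep:
            continue  # skip blank lines or lines without a tab
        texts.append(text)
        scores.append(int(score))  # assumption: scores are integers 1 to 5
    return texts, scores


# Hypothetical sample standing in for the contents of a real file.
sample = io.StringIO(
    "Great food and friendly staff\t5\n"
    "Cold fries, slow service\t2\n"
    "\n"
)
texts, scores = load_reviews(sample)
```

With a real file, the same function would be called as load_reviews(open("test.txt", encoding="utf-8")); the encoding is a guess and may need adjusting.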
You will first need to extract features from the review text by implementing the bag-of-words model. To do this, use the text feature extraction packages in Python's sklearn. These packages convert documents into vectors after automatically pre-processing them (e.g. removing stop words). Use TF-IDF vectorization here: sklearn.feature_extraction.text.TfidfVectorizer. You will then experiment with PCA (dimensionality reduction), performed as a preprocessing step ([login to view URL]), to reduce the number of features.
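As a rough illustration of the feature extraction step, the sketch below vectorizes four made-up reviews with TfidfVectorizer and then reduces them with PCA. The documents, the choice of stop_words="english" and n_components=2 are assumptions for the example only. The TF-IDF matrix is converted to a dense array before PCA so that the sketch does not depend on sparse-input support in PCA; for 10K reviews this can use a lot of memory, and sklearn.decomposition.TruncatedSVD, which accepts sparse input, is a common substitute (a different technique from PCA, since it does not center the data).

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews, used only to show the shapes involved.
docs = [
    "the food was great and the service was great",
    "terrible service and cold food",
    "great place, will come back",
    "cold, bland and overpriced",
]

# Bag-of-words with TF-IDF weights; English stop words are dropped.
vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# Dense conversion is fine for a toy corpus but costly for large ones.
pca = PCA(n_components=2, random_state=0)
X_reduced = pca.fit_transform(X_tfidf.toarray())  # shape (n_docs, 2)
```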
You will experiment with the following learners.
i) Neural networks (MLPClassifier in sklearn)
ii) Naïve Bayes (MultinomialNB in sklearn)
iii) Logistic Regression (LogisticRegression in sklearn)
iv) AdaBoost (AdaBoostClassifier in sklearn)
v) SVM ([login to view URL] in sklearn)
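All five learners share the same fit/predict interface, so they can be swapped inside one pipeline. The sketch below is only an illustration on made-up data: the SVM class is assumed to be sklearn.svm.SVC (the brief does not name it), and the max_iter and random_state values are arbitrary choices, not recommended settings.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

learners = {
    "Neural network": MLPClassifier(max_iter=500, random_state=0),
    "Naive Bayes": MultinomialNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": SVC(),  # assumption: SVC is the intended SVM class
}

# Made-up training reviews: three per score, scores 1 to 5.
words_by_score = {
    1: "terrible awful rude",
    2: "bad slow bland",
    3: "okay average fine",
    4: "good tasty friendly",
    5: "excellent amazing perfect",
}
meals = ["lunch", "dinner", "brunch"]
texts = [f"{words} {meal}" for words in words_by_score.values() for meal in meals]
labels = [score for score in words_by_score for _ in meals]

# Fit each learner on TF-IDF features and predict one unseen review.
predictions = {}
for name, clf in learners.items():
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    model.fit(texts, labels)
    predictions[name] = int(model.predict(["amazing perfect lunch"])[0])
```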
Tasks to perform
i. Run 5-fold Cross Validation on the [login to view URL] using the 5 learning algorithms. Report the average-precision, average-recall and average-F1-scores. Parameters that you should try to change include
a. In neural networks, change the number of hidden layers and the number of units in each layer
b. In SVMs, change the penalty parameter C and the kernel type
c. In AdaBoost, change the number of estimators (n_estimators)
d. In logistic regression, change the penalty: try L1 regularization, which can also perform feature selection, and L2. Also change the regularization strength parameter C (in sklearn, a smaller C means stronger regularization)
ii. Perform dimensionality reduction using PCA and re-run the algorithms with their optimal settings. Compare the results for different values of n_components (number of reduced features) in PCA.
iii. Include some additional knowledge into your model. Specifically, not all words are useful in predicting rating scores. Words that express sentiments are more likely to be useful. Use the sentiment words in [login to view URL] and [login to view URL] to filter words from the review text, and then evaluate the algorithms once again. What changes did you observe?
iv. Perform evaluation on the test dataset using the optimal parameter settings that were obtained from the training set. How did each algorithm perform? Report the precision, recall and F1 score of each. Which types of reviews were the hardest to predict?
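The four tasks above could be wired together roughly as follows. This is a sketch for a single learner on made-up data, not a reference solution. It assumes sklearn.svm.SVC as the SVM, macro averaging over the five scores as the meaning of "average" precision/recall/F1, and a tiny hand-written lexicon in place of the sentiment word files. The helpers build, cv_means and to_dense are names invented here.

```python
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

SCORING = ["precision_macro", "recall_macro", "f1_macro"]


def to_dense(X):
    # PCA gets a dense array so the sketch does not rely on sparse PCA support.
    return X.toarray()


def build(clf, n_components=None, vocabulary=None):
    """TF-IDF, optional PCA, then the classifier."""
    steps = [("tfidf", TfidfVectorizer(vocabulary=vocabulary))]
    if n_components is not None:
        steps.append(("dense", FunctionTransformer(to_dense, accept_sparse=True)))
        steps.append(("pca", PCA(n_components=n_components, random_state=0)))
    steps.append(("clf", clf))
    return Pipeline(steps)


def cv_means(pipe, texts, labels):
    """Mean 5-fold precision, recall and F1 (macro-averaged over scores)."""
    out = cross_validate(pipe, texts, labels, cv=5, scoring=SCORING)
    return {metric: out[f"test_{metric}"].mean() for metric in SCORING}


# Made-up data: six training reviews per score, plus a small test set.
words_by_score = {
    1: "terrible awful rude",
    2: "bad slow bland",
    3: "okay average fine",
    4: "good tasty friendly",
    5: "excellent amazing perfect",
}
meals = ["lunch", "dinner", "brunch", "takeout", "delivery", "breakfast"]
train_texts = [f"{w} {m}" for w in words_by_score.values() for m in meals]
train_labels = [s for s in words_by_score for _ in meals]
test_texts = ["awful rude dinner", "slow bland lunch", "average fine brunch",
              "tasty friendly takeout", "amazing perfect dinner", "okay lunch"]
test_labels = [1, 2, 3, 4, 5, 3]
# Hypothetical lexicon standing in for the positive and negative word lists.
lexicon = sorted(set(" ".join(words_by_score.values()).split()))

# Task i: 5-fold CV over C and kernel, keeping the best macro-F1 setting.
grid = {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}
search = GridSearchCV(build(SVC()), grid, scoring="f1_macro", cv=5)
search.fit(train_texts, train_labels)
best_clf = SVC(**{k.split("__")[1]: v for k, v in search.best_params_.items()})
report = {"tfidf": cv_means(build(clone(best_clf)), train_texts, train_labels)}

# Task ii: same settings, with PCA at several values of n_components.
for k in (2, 5):
    pipe = build(clone(best_clf), n_components=k)
    report[f"pca_{k}"] = cv_means(pipe, train_texts, train_labels)

# Task iii: restrict the TF-IDF vocabulary to the sentiment lexicon.
pipe = build(clone(best_clf), vocabulary=lexicon)
report["lexicon"] = cv_means(pipe, train_texts, train_labels)

# Task iv: refit on all training data, then score the held-out test set.
final = build(clone(best_clf)).fit(train_texts, train_labels)
predicted = final.predict(test_texts)
scores = sorted(words_by_score)
precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, predicted, labels=scores, average=None, zero_division=0)
hardest_score = scores[int(f1.argmin())]  # score with the lowest test F1
```

The same pattern applies to the other learners by swapping the estimator and the parameter grid (for example hidden_layer_sizes for MLPClassifier or n_estimators for AdaBoostClassifier). Two caveats: MultinomialNB rejects negative inputs, so it cannot be fed PCA output directly, and an L1 penalty in LogisticRegression needs a solver that supports it.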