Guidelines:

Coursework 1

1. Suppose you have been retained as a data scientist: you have collected a dataset of already-classified instances and must build a classifier. Consider the following types of classifier:

• Naïve Bayes

• K-nearest neighbour

How will you know how good your classifier is?

2. Task 1 - Dataset selection and evaluation: Select a dataset of your own choice for the purposes of building, training and validating the above types of classifiers 1-3. With the aid of an R package, visualise and justify the properties of the selected dataset.
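As a starting point for Task 1, the following minimal sketch inspects a candidate dataset with base R only. The built-in `iris` data is used purely as a stand-in (an assumption, not the dataset the task requires); the same calls apply to any data frame with numeric features and one class column.

```r
# Illustrative stand-in: the built-in iris data (4 numeric features, 1 class column).
data(iris)

# Structure, missing values and class balance.
str(iris)
n_missing <- colSums(is.na(iris))
class_counts <- table(iris$Species)
print(n_missing)
print(class_counts)

# Per-feature summary statistics (ranges reveal whether scaling is needed for K-NN).
print(summary(iris[, 1:4]))

# Scatterplot matrix coloured by class, to judge how separable the classes are.
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19,
      main = "iris: pairwise feature plots by class")

# Boxplot of one feature per class, to spot scale differences and outliers.
boxplot(Petal.Length ~ Species, data = iris, main = "Petal.Length by class")
```

Class balance, missing values and feature scales are the properties most worth justifying here, since they affect both Naïve Bayes and K-NN.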

3. Task 2 - Formation of training and testing sets: Now that we have one large dataset that is classified, form training and testing sets from this dataset.
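Two common ways of forming the sets are a random hold-out split and m-fold cross-validation. The sketch below shows both in base R; the `iris` stand-in, the 70/30 ratio, the seed and m = 5 are all illustrative assumptions.

```r
set.seed(42)                      # fixed seed so the split is reproducible
data(iris)                        # illustrative stand-in dataset
n <- nrow(iris)

# Hold-out: a random 70/30 split (the 70% figure is an arbitrary choice).
train_idx <- sample(n, size = round(0.7 * n))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# m-fold cross-validation: assign every row to one of m folds at random.
m <- 5
fold_id <- sample(rep(seq_len(m), length.out = n))
folds <- lapply(seq_len(m), function(i) {
  list(train = which(fold_id != i), test = which(fold_id == i))
})

# Each fold's test rows are disjoint from its training rows.
print(sapply(folds, function(f) length(f$test)))
```

Each element of `folds` holds the row indices for one train/test round, so a classifier can be fitted on `iris[f$train, ]` and scored on `iris[f$test, ]`.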

Leave-one-out cross-validation: a limiting case of m-fold cross-validation in which m = N, the number of samples. N different classifiers are trained, each using N–1 samples, and each is tested on the one sample left out of its training set. The N test results are then averaged. Because each classifier is trained on almost the whole dataset, the classifiers are very close to optimal, and all samples are used for testing. The resulting estimate is nearly unbiased; note that its variance can still be high, since the N training sets overlap almost entirely. The method is practical when a fast leave-one-out algorithm is available. Fast algorithms exist for updating the mean and covariance matrix, as well as the inverse and determinant of the covariance matrix, so the method is useful for the Bayes quadratic classifier and for k-nearest neighbour (using Euclidean or Mahalanobis distance).
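The procedure can be sketched for k-nearest neighbour in base R as follows. The `iris` stand-in, k = 5 and the choice of Euclidean distance on standardised features are assumptions made for illustration; the distance matrix is computed once, which is what makes leave-one-out cheap for K-NN.

```r
data(iris)                           # illustrative stand-in dataset
# Standardise features; done on all rows for brevity, which slightly
# leaks the left-out sample's values into the scaling statistics.
X <- scale(as.matrix(iris[, 1:4]))
y <- iris$Species
N <- nrow(X)
k <- 5                               # illustrative number of neighbours

# All pairwise Euclidean distances, computed once and reused N times.
D <- as.matrix(dist(X))

loo_pred <- character(N)
for (i in seq_len(N)) {
  d <- D[i, -i]                      # distances from sample i to the other N-1
  nn <- order(d)[seq_len(k)]         # positions of the k nearest among the others
  votes <- table(y[-i][nn])          # class votes of those neighbours
  loo_pred[i] <- names(votes)[which.max(votes)]  # ties go to the first class
}

# Average of the N single-sample test results.
loo_accuracy <- mean(loo_pred == as.character(y))
print(loo_accuracy)
```

The `class` package also offers `knn.cv()`, which performs leave-one-out K-NN directly and may be preferable to a hand-written loop.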

4. Task 3 - K-NN classifier: Construct, train and test a K-NN type classifier in R. Train and test your K-NN classifier using the training and test sets generated by the methods tried in Task 2.
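A minimal sketch of this task using `knn()` from the `class` package (one of the recommended packages normally installed with R). The dataset, the split, the seed and the candidate values of k are illustrative assumptions.

```r
library(class)                    # provides knn()

set.seed(1)
data(iris)                        # illustrative stand-in dataset
idx <- sample(nrow(iris), 105)    # illustrative 70/30 hold-out split

# Standardise with training-set statistics only, then apply them to the test set.
train_x <- scale(iris[idx, 1:4])
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Try several values of k and record the test accuracy for each.
ks <- c(1, 3, 5, 7, 9)
acc <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
names(acc) <- paste0("k=", ks)
print(acc)
```

K-NN has no separate training step: "training" amounts to storing the standardised training set, so the choice of k and of the scaling is where most of the modelling effort goes.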

5. Task 5 - Measure performance: For each type of classifier, calculate and display the following performance-related metrics in R. Use the ROCR library, loaded with library(ROCR).

• Confusion matrix

• Precision vs. Recall

• Accuracy

• ROC (receiver operating characteristic) curve

• AUC (area under the ROC curve)
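The metrics above can be sketched for a binary problem as follows. The scores and labels are hand-made illustrative values (not real results), and the 0.5 decision threshold is an assumption. The confusion matrix, accuracy, precision, recall and AUC are computed in base R; the ROCR calls for the curves run only if that package is installed.

```r
# Illustrative, hand-made scores and labels (1 = positive class); not real results.
labels <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
pred_class <- as.integer(scores >= 0.5)     # threshold 0.5 is an assumption

# Confusion matrix: rows = predicted, columns = actual.
cm <- table(Predicted = factor(pred_class, levels = c(0, 1)),
            Actual    = factor(labels,     levels = c(0, 1)))
print(cm)
TP <- cm["1", "1"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TN <- cm["0", "0"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)

# AUC via the rank (Mann-Whitney) identity, in base R.
r <- rank(scores)
n_pos <- sum(labels == 1); n_neg <- sum(labels == 0)
auc_base <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# The same quantities as curves with ROCR, if it is installed.
if (requireNamespace("ROCR", quietly = TRUE)) {
  library(ROCR)
  rocr_pred <- prediction(scores, labels)
  plot(performance(rocr_pred, "tpr", "fpr"), main = "ROC curve")
  plot(performance(rocr_pred, "prec", "rec"), main = "Precision vs. recall")
  auc_rocr <- performance(rocr_pred, "auc")@y.values[[1]]
}
```

For a multi-class dataset, ROC and AUC are usually reported one class against the rest, so the same code would be run once per class with binarised labels.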

Coursework 2

1. In this assignment, we consider a set of observations on several silhouettes of different types of vehicles, described by a set of features extracted from each silhouette. The dataset has 846 observations and 19 variables/features: all are numerical except one nominal variable, which defines the class of the objects.

1st Objective (Hierarchical clustering):

• Conduct hierarchical clustering (agglomerative) analysis on the vehicles dataset. Investigate the hclust() function for single, complete & average methods.

• Create visualisations of all methods using dendrograms.

• Look at the cophenetic correlation between each clustering result using cor.dendlist.

• Discuss all the results produced after using the corrplot function.

• In the report, provide the accuracy of the implemented clustering scheme.

• Write code in RStudio to address all the above issues (code and results need to be included in your report).
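The steps of the 1st Objective can be sketched as below. The vehicles file is not part of this brief, so the built-in `iris` data stands in for it (an assumption): replace `X` with the numeric vehicle features and `truth` with the class column. "Accuracy" is computed here by mapping each cluster to its majority class, which is one possible definition and not necessarily the intended one. The `cor.dendlist` and `corrplot` calls run only if the `dendextend` and `corrplot` packages are installed.

```r
# Stand-in data: iris features; replace with the numeric vehicle features.
data(iris)
X <- scale(iris[, 1:4])
truth <- iris$Species
d <- dist(X)                               # Euclidean distances

methods <- c("single", "complete", "average")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# One dendrogram per linkage method.
op <- par(mfrow = c(1, 3))
for (m in methods) plot(fits[[m]], labels = FALSE, main = m, xlab = "", sub = "")
par(op)

# Cophenetic correlation of each tree with the original distances.
coph <- sapply(fits, function(h) cor(d, cophenetic(h)))
print(coph)

# "Accuracy": cut into as many clusters as classes, map each cluster to its majority class.
cluster_accuracy <- function(h, truth) {
  cl <- cutree(h, k = nlevels(truth))
  tab <- table(cl, truth)
  sum(apply(tab, 1, max)) / length(truth)
}
acc <- sapply(fits, cluster_accuracy, truth = truth)
print(acc)

# Tree-to-tree cophenetic correlations, if dendextend and corrplot are installed.
if (requireNamespace("dendextend", quietly = TRUE) &&
    requireNamespace("corrplot", quietly = TRUE)) {
  dl <- do.call(dendextend::dendlist, lapply(fits, as.dendrogram))
  cors <- dendextend::cor.dendlist(dl, method = "cophenetic")
  corrplot::corrplot(cors, method = "pie", type = "lower")
}
```

The first correlation (`coph`) measures how faithfully each tree preserves the original distances, while `cor.dendlist` compares the trees with one another; the two answer different questions and are worth discussing separately.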

2. Time series analysis can be used in a multitude of business applications for forecasting a quantity into the future. Exchange rate prediction is one of the challenging applications of modern time series forecasting and is very important for the success of many businesses and financial institutions. Neural networks (NN) are used here as an alternative approach to forecasting the exchange rate. The exchange rate data was collected from January 2010 to December 2011 and contains a total of 500 data points. Use the first 400 as the training set and the remaining 100 as the testing set. Use the 2nd column, which contains the rates.

2nd Objective (Forecasting):

• Construct MLP neural network for this problem.

• Discuss the input selection problem for time series prediction and propose various input configurations.

• Show the performance of your network both graphically and in terms of the usual statistical indices (MSE, RMSE and MAPE).

• Show all your working steps (code & results, including comparison results from models with different input vectors).

• The input selection problem is very important. Experiment with various options (i.e. how many past values you need to consider as potential network inputs). Full details of your results are needed in your report.
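The workflow for the 2nd Objective can be sketched as below, using `nnet` (a single-hidden-layer MLP from R's recommended packages). Everything specific here is an assumption: the series is a synthetic random walk standing in for the exchange rate file, `fit_lagged_mlp` is a hypothetical helper, and the hidden-layer size, weight decay and candidate lag counts are illustrative choices. Inputs are the previous p values of the series, and forecasts are one step ahead.

```r
library(nnet)                       # single-hidden-layer MLP

set.seed(7)
# Synthetic stand-in for the 500 exchange rates (NOT real data): a random walk near 1.3.
rate <- 1.3 + cumsum(rnorm(500, sd = 0.005))

# Min-max scale using the first 400 points only, to avoid leaking test information.
lo <- min(rate[1:400]); hi <- max(rate[1:400])
z <- (rate - lo) / (hi - lo)

# Hypothetical helper: fit an MLP on p lagged inputs and score it on the test points.
fit_lagged_mlp <- function(z, p, n_train = 400, hidden = 4) {
  E <- embed(z, p + 1)              # each row: z[t], z[t-1], ..., z[t-p]
  y <- E[, 1]; X <- E[, -1, drop = FALSE]
  t_index <- (p + 1):length(z)      # time index of each target value
  tr <- t_index <= n_train; te <- !tr
  net <- nnet(X[tr, , drop = FALSE], y[tr], size = hidden, linout = TRUE,
              decay = 1e-4, maxit = 500, trace = FALSE)
  pred <- as.vector(predict(net, X[te, , drop = FALSE])) * (hi - lo) + lo
  actual <- y[te] * (hi - lo) + lo  # back to the original scale
  err <- actual - pred
  list(pred = pred, actual = actual,
       MSE = mean(err^2), RMSE = sqrt(mean(err^2)),
       MAPE = 100 * mean(abs(err / actual)))
}

# Compare input configurations: 1 to 4 past values as network inputs.
lags <- 1:4
results <- lapply(lags, function(p) fit_lagged_mlp(z, p))
comparison <- data.frame(lags = lags,
                         MSE  = sapply(results, `[[`, "MSE"),
                         RMSE = sapply(results, `[[`, "RMSE"),
                         MAPE = sapply(results, `[[`, "MAPE"))
print(comparison)

# Graphical check for the lowest-RMSE configuration: actual vs. forecast.
best <- results[[which.min(comparison$RMSE)]]
plot(best$actual, type = "l", xlab = "Test point", ylab = "Rate")
lines(best$pred, col = "red", lty = 2)
legend("topleft", c("actual", "forecast"), col = c("black", "red"), lty = 1:2)
```

Choosing the configuration by test-set RMSE, as done here for brevity, is optimistic; holding back part of the 400 training points as a validation set would give a fairer comparison between input vectors.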

