Using machine learning to find the optimal number of clusters in a dataset

We will apply a clustering algorithm to protein sequences. The data are the 33 (α/β)8-barrel proteins belonging to glycoside hydrolase family 2 from the CAZy database. The 33 proteins are divided into five subfamilies, namely Ga (for β-galactosidase), GI (for β-glucuronidase), Cs (for exo-β-D-glucosaminidase), Ma, and Un (for β-mannosidase), and each protein is represented as a sequence of symbols from the alphabet {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, X, Y}. The sequence lengths vary from 598 to 1270. These sequences are multimodular, with various types of catalytic modules, known as "(α/β)8-barrel". The task in this experiment is to identify the correct number of clusters (K = 5) in terms of the structural characteristics hidden in the sequences.
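Because the sequences differ in length (598 to 1270 symbols), most clustering algorithms first need a length-independent representation or a pairwise distance. The sketch below shows one possible choice, not necessarily the one intended for this experiment: each sequence is turned into a profile of overlapping k-mer frequencies, and two profiles are compared with cosine distance. The function names, the choice k = 2, and the toy sequences are assumptions made only for illustration; they are not the CAZy data.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=2):
    """Return the relative frequency of each overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse k-mer profiles."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


# Toy sequences (hypothetical, far shorter than the real proteins).
toy = ["ACDEFGHIK", "ACDEFGHIL", "WYWYWYWYX"]
profiles = [kmer_profile(s) for s in toy]

d_close = cosine_distance(profiles[0], profiles[1])  # share 7 of 8 2-mers
d_far = cosine_distance(profiles[0], profiles[2])    # share no 2-mers

print(d_close, d_far)
```

The resulting pairwise distances can be collected into a matrix and handed to any distance-based clustering algorithm. Alignment-based or model-based similarities are equally valid alternatives to k-mer profiles.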

To validate the quality of a series of clustering results, each generated by the clustering algorithm on the same sequence set S but with a different number of clusters, a cluster validation index is used.
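The validation procedure can be sketched as a loop over candidate values of K: cluster the same data for each K, score each result with the index, and keep the K with the best score. The text does not name a specific index or clustering algorithm, so the code below assumes the mean silhouette width as the index and a simple average-linkage agglomerative clustering as a stand-in algorithm. The distance matrix is built from hypothetical points on a line with three obvious groups; in the real experiment it would hold pairwise distances between the 33 sequences.

```python
def average_linkage(dist, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def silhouette(dist, clusters):
    """Mean silhouette width; requires at least two clusters."""
    scores = []
    for ci, members in enumerate(clusters):
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # convention: singletons score 0
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(dist[i][j] for j in members if j != i) / (len(members) - 1)
            # b: smallest mean distance to any other cluster
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
dist = [[abs(p - q) for q in points] for p in points]

# Score every candidate K on the same data and keep the best one.
scores = {k: silhouette(dist, average_linkage(dist, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)

print(scores)
print("best K:", best_k)
```

On this toy data the index peaks at the number of groups that were built in. For the protein set, the same loop would be run over a range of K that includes 5, and the index is judged by whether it peaks there.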

