MS Symptoms: Clustering on MCA scores

As a follow-up to the notebook "MS Symptoms: Hierarchical Clustering," we cluster the study participants by their symptoms, using MCA factor scores as inputs to the k-means algorithm.

The first stage of the analysis proceeds exactly as in the notebook "MS Symptoms: MCA." As a sanity check, we verify that the same percentage of the variance is explained.

import mca
import pandas as pd

sym_df = pd.read_csv('../data/symptom.csv')
sym_df = sym_df[sym_df.DIAGDIS2].set_index(['SITE_ID','PATIENT_ID'])

sym_columns = [col for col in sym_df.columns if col.startswith('EVER')]
sym_only = sym_df[sym_columns]
sym_dummy = mca.dummy(sym_only)

mca_model = mca.MCA(sym_dummy, ncols=len(sym_only.columns))
# Flip the sign of both P and Q.
mca_model.P = -mca_model.P
mca_model.Q = -mca_model.Q
mca_model.expl_var(greenacre=True).round(3)

array([ 0.926,  0.016,  0.007,  0.003,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ])
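Read cumulatively, the leading entries of this array indicate how much of the corrected explained variance the first few factors retain. The sketch below re-enters the first four printed values by hand (they are copied from the output above, not recomputed from the model) and takes a running total with NumPy.

```python
import numpy as np

# First four Greenacre-corrected proportions, copied by hand from the
# printed output above (not recomputed from mca_model).
explained = np.array([0.926, 0.016, 0.007, 0.003])

# Running total: share retained by the first 1, 2, 3 and 4 factors.
cumulative = np.cumsum(explained)
print(cumulative.round(3))
```

On these rounded figures, the first four factors together account for roughly 95% of the corrected explained variance, almost all of it from the first factor.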

from bokeh import charts
charts.output_notebook()

We compute the top 4 factor scores for the row (subject) profiles. As before, we restrict attention to subjects that have been diagnosed with MS by their neurologist.

scores = mca_model.fs_r(N=4)
scores = pd.DataFrame(scores, columns=['FS1','FS2','FS3','FS4'], index=sym_df.index)
scores = pd.concat([sym_df[['DISEASE2']], scores], axis=1)

# Keep MS patients only; pass axis by keyword (positional axis is rejected by recent pandas).
ms_patient_scores = scores[scores['DISEASE2'] == 'MS'].drop('DISEASE2', axis=1)
ms_patient_scores.head()

We can now apply the standard k-means clustering algorithm to the factor scores. Somewhat arbitrarily, we choose $k=4$ clusters.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0)
clusters = kmeans.fit_predict(ms_patient_scores)
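Since the choice of k is admittedly arbitrary, one common way to examine it is to compare the mean silhouette score over a range of candidate values. The following is a minimal sketch on synthetic data, not on the study's factor scores: the blob centers, the spread, and the candidate range 2 to 6 are all illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the factor scores: three tight, well-separated blobs.
# The centers and spread are illustrative assumptions, not study data.
X, _ = make_blobs(n_samples=300,
                  centers=[(-10, -10), (0, 10), (10, -10)],
                  cluster_std=0.5, random_state=0)

# Fit k-means for each candidate k and record the mean silhouette score.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k, best_k)
```

To try this on the real data, X would be replaced by the factor score columns. A flat profile of scores across k, with no clear peak, would be consistent with an absence of natural clusters.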

The size of each cluster is shown below.

ms_patient_scores['CLUSTER'] = clusters + 1
ms_patient_scores.CLUSTER.value_counts(sort=False)

1    424
2    407
3    331
4    565
Name: CLUSTER, dtype: int64

Let's compare these clusters to the human-created labels for MS subtypes: primary progressive, secondary progressive, etc. In a now familiar step, we merge the two relevant data tables.

ms_type_df = pd.read_csv('../data/diagnosis.csv')
ms_type_df = ms_type_df[~ms_type_df.MSTYPE.isnull()]
ms_type_df = ms_type_df.set_index(['SITE_ID','PATIENT_ID'])
merged_df = pd.merge(ms_patient_scores, ms_type_df[['MSTYPE']],
                     left_index=True, right_index=True)
merged_df.head()

Finally, we display the row profiles in a scatter plot, using the color of each point to distinguish the MS types and the style of the marker to distinguish the clusters.

scatter = charts.Scatter(merged_df, x='FS1', y='FS2', color='MSTYPE', marker='CLUSTER',
                         title="K-means clustering on MCA factor scores",
                         plot_width=750, plot_height=500)
charts.show(scatter)

At least as far as the first two factor scores are concerned, there is no discernible cluster structure. This is reflected in the clustering produced by k-means, which does little more than chop the point cloud into four adjacent regions along the axis of the first factor score. Meanwhile, the MS types are spread across the point cloud, with the relapsing-remitting type being especially well dispersed. (This explains why the corresponding averaged point RLP in the original MCA notebook is near the center of its respective plot.) An exception is the isolated syndrome type, which is mostly confined to the left-most cluster, associated with mild symptoms. The overall impression produced by this plot is not too different from that of the PCA plot in our first analysis of this data.
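One way to put numbers on this visual impression is a contingency table of cluster against MS type. The sketch below uses a small hand-made table whose rows are invented purely for illustration; only the column names CLUSTER and MSTYPE and the type codes RLP, PRIM and SCND come from this notebook.

```python
import pandas as pd

# Toy data: the rows are invented for illustration, not taken from the study.
toy = pd.DataFrame({
    'CLUSTER': [1, 1, 2, 2, 3, 3, 4, 4],
    'MSTYPE': ['RLP', 'PRIM', 'RLP', 'SCND', 'RLP', 'RLP', 'RLP', 'SCND'],
})

# Raw counts of each MS type within each cluster.
counts = pd.crosstab(toy['CLUSTER'], toy['MSTYPE'])

# The same table as within-cluster proportions (each row sums to 1).
shares = pd.crosstab(toy['CLUSTER'], toy['MSTYPE'], normalize='index')

print(counts)
print(shares.round(2))
```

Applied to merged_df, rows with similar proportions across clusters would indicate that the clusters carry little information about MS type.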

While this exercise in clustering has not been successful in its own right, it tells us something valuable about the symptoms of MS patients: they do not seem to fall into natural disjoint clusters. In particular, MS patients of types RLP, PRIM, and SCND can experience symptoms across the severity scale (even if the progressive forms of MS are more associated with severe symptoms).

Output of ms_patient_scores.head():

                          FS1        FS2        FS3        FS4
SITE_ID PATIENT_ID
1       2            0.081593   0.081203   0.005208  -0.009286
        4           -0.400544   0.018177   0.007113   0.000632
        5            0.112070   0.023265  -0.006493  -0.018136
        8            0.338281   0.020394  -0.006947   0.008499
        9           -0.015434  -0.010832  -0.018098  -0.015582

Output of merged_df.head():

                          FS1        FS2        FS3        FS4  CLUSTER  MSTYPE
SITE_ID PATIENT_ID
1       2            0.081593   0.081203   0.005208  -0.009286        4     RLP
        4           -0.400544   0.018177   0.007113   0.000632        3     RLP
        5            0.112070   0.023265  -0.006493  -0.018136        4     RLP
        8            0.338281   0.020394  -0.006947   0.008499        2     RLP
        9           -0.015434  -0.010832  -0.018098  -0.015582        1     RLP