MS Symptoms: Clustering on MCA scores

As a follow-up to the notebook "MS Symptoms: Hierarchical Clustering," we cluster the study participants by their symptoms, using MCA factor scores as inputs to the k-means algorithm.

The first stage of the analysis proceeds exactly as in the notebook "MS Symptoms: MCA." As a sanity check, we verify that the same percentage of the variance is explained.

In [1]:
import mca
import pandas as pd

sym_df = pd.read_csv('../data/symptom.csv')
sym_df = sym_df[sym_df.DIAGDIS2].set_index(['SITE_ID','PATIENT_ID'])

sym_columns = [col for col in sym_df.columns if col.startswith('EVER')]
sym_only = sym_df[sym_columns]
sym_dummy = mca.dummy(sym_only)

mca_model = mca.MCA(sym_dummy, ncols=len(sym_only.columns))
mca_model.P = -mca_model.P; mca_model.Q = -mca_model.Q
mca_model.expl_var(greenacre=True).round(3)
Out[1]:
array([ 0.926,  0.016,  0.007,  0.003,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ])
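
As a quick reading of this output, the leading entries can be accumulated to see how much of the (Greenacre-corrected) variance the first few factors carry. The sketch below copies the four nonzero rounded values from the output above; the names `expl` and `cumulative` are illustrative and not part of the original notebook.

```python
import numpy as np

# First four entries of the rounded output above (all later entries round to 0).
expl = np.array([0.926, 0.016, 0.007, 0.003])

# Running total of explained variance for the first 1, 2, 3, 4 factors.
cumulative = np.cumsum(expl)
print(cumulative.round(3))
```

Based on these rounded figures, the first factor alone accounts for 0.926 and the first four together for about 0.952.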
In [2]:
from bokeh import charts
charts.output_notebook()

We compute the top 4 factor scores for the row (subject) profiles. As before, we restrict attention to subjects that have been diagnosed with MS by their neurologist.

In [3]:
scores = mca_model.fs_r(N=4)
scores = pd.DataFrame(scores, columns=['FS1','FS2','FS3','FS4'], index=sym_df.index)
scores = pd.concat([sym_df[['DISEASE2']], scores], axis=1)

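# Use the columns= keyword (the positional axis argument is no longer accepted by recent pandas)
# and take a copy so that a column can be added later without a SettingWithCopyWarning.
ms_patient_scores = scores[scores['DISEASE2'] == 'MS'].drop(columns='DISEASE2').copy()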
ms_patient_scores.head()
Out[3]:
                          FS1        FS2        FS3        FS4
SITE_ID PATIENT_ID
1       2            0.081593   0.081203   0.005208  -0.009286
        4           -0.400544   0.018177   0.007113   0.000632
        5            0.112070   0.023265  -0.006493  -0.018136
        8            0.338281   0.020394  -0.006947   0.008499
        9           -0.015434  -0.010832  -0.018098  -0.015582

We can now apply the standard k-means clustering algorithm to the factor scores. Somewhat arbitrarily, we choose $k=4$ clusters.

In [4]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0)
clusters = kmeans.fit_predict(ms_patient_scores)
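
The choice of $k$ need not be entirely arbitrary. One common check, which is not part of the original analysis, is to compare the mean silhouette score across several values of $k$. The sketch below runs on a synthetic stand-in (`demo_scores`, an assumption made so the example is self-contained) shaped roughly like a single elongated point cloud; it is not the study data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the factor scores (an assumption, not the study data):
# one Gaussian cloud with most of its variance along the first axis.
rng = np.random.default_rng(0)
demo_scores = rng.normal(size=(500, 4)) * np.array([0.3, 0.04, 0.02, 0.01])

# Fit k-means for several k and record the mean silhouette score of each fit.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(demo_scores)
    silhouettes[k] = silhouette_score(demo_scores, labels)

for k, s in silhouettes.items():
    print(f"k={k}: mean silhouette = {s:.3f}")
```

With the real data, one would pass the four factor-score columns of `ms_patient_scores` in place of `demo_scores`. Silhouette values that stay modest for every $k$ would be consistent with a point cloud that has no well-separated groups.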

The size of each cluster is shown below.

In [5]:
ms_patient_scores['CLUSTER'] = clusters + 1
ms_patient_scores.CLUSTER.value_counts(sort=False)
Out[5]:
1    424
2    407
3    331
4    565
Name: CLUSTER, dtype: int64

Let's compare these clusters to the human-created labels for MS subtypes: primary progressive, secondary progressive, etc. In a now-familiar step, we merge the two relevant data tables.

In [6]:
ms_type_df = pd.read_csv('../data/diagnosis.csv')
ms_type_df = ms_type_df[~ms_type_df.MSTYPE.isnull()]
ms_type_df = ms_type_df.set_index(['SITE_ID','PATIENT_ID'])
merged_df = pd.merge(ms_patient_scores, ms_type_df[['MSTYPE']],
                     left_index=True, right_index=True)
merged_df.head()
Out[6]:
                          FS1        FS2        FS3        FS4  CLUSTER  MSTYPE
SITE_ID PATIENT_ID
1       2            0.081593   0.081203   0.005208  -0.009286        4     RLP
        4           -0.400544   0.018177   0.007113   0.000632        3     RLP
        5            0.112070   0.023265  -0.006493  -0.018136        4     RLP
        8            0.338281   0.020394  -0.006947   0.008499        2     RLP
        9           -0.015434  -0.010832  -0.018098  -0.015582        1     RLP
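
Before plotting, the agreement between clusters and MS types could also be summarized numerically. The sketch below is an addition to the original analysis: it builds a row-normalized contingency table and an adjusted Rand index on a small hypothetical table (`toy_df`, with made-up values) that has the same two columns as `merged_df`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy table with the same two columns as merged_df
# (the values are invented purely for illustration).
toy_df = pd.DataFrame({
    'CLUSTER': [1, 1, 2, 2, 3, 3, 4, 4],
    'MSTYPE':  ['RLP', 'PRIM', 'RLP', 'SCND', 'RLP', 'RLP', 'SCND', 'RLP'],
})

# Rows: clusters; columns: MS types; entries: share of each cluster's members.
table = pd.crosstab(toy_df['CLUSTER'], toy_df['MSTYPE'], normalize='index')
print(table)

# Adjusted Rand index: near 0 for chance-level agreement, 1 for identical partitions.
ari = adjusted_rand_score(toy_df['MSTYPE'], toy_df['CLUSTER'])
print(f"adjusted Rand index: {ari:.3f}")
```

Applied to `merged_df` itself, the same two calls would show how each cluster divides among the MS types and give a single number for the overall agreement.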

Finally, we display the row profiles in a scatter plot, using the color of each point to distinguish the MS types and the style of the marker to distinguish the clusters.

In [7]:
scatter = charts.Scatter(merged_df, x='FS1', y='FS2', color='MSTYPE', marker='CLUSTER',
                         title="K-means clustering on MCA factor scores",
                         plot_width=750, plot_height=500)
charts.show(scatter)
Out[7]:

[Figure: scatter plot "K-means clustering on MCA factor scores"; x-axis FS1, y-axis FS2; color = MSTYPE, marker = CLUSTER]

At least as far as the first two factor scores are concerned, there is no discernible cluster structure. This is reflected in the clustering produced by k-means, which does little more than chop the point cloud into four adjacent regions along the axis of the first factor score. Meanwhile, the MS types are spread across the point cloud, with the relapsing-remitting type being especially well dispersed. (This explains why the corresponding averaged point RLP in the original MCA notebook is near the center of its respective plot.) An exception is the isolated syndrome type, which is mostly confined to the leftmost cluster, associated with mild symptoms. The overall impression produced by this plot is not too different from the PCA plot from our first analysis of this data.
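
The slicing behavior described above is what k-means tends to do with any single elongated cloud. As a minimal illustration on synthetic data (an assumption; `cloud` is not the study data), the sketch below fits four clusters to one anisotropic Gaussian blob and compares how far apart the cluster centers are along each axis.

```python
import numpy as np
from sklearn.cluster import KMeans

# One elongated Gaussian blob with no real cluster structure:
# standard deviation 0.3 along the first axis, 0.03 along the second.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(2000, 2)) * np.array([0.3, 0.03])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(cloud)

# Sort the centers by their first coordinate for easier reading.
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
print(centers.round(3))

# Range of the centers along each axis.
spread = centers.max(axis=0) - centers.min(axis=0)
print("center spread per axis:", spread.round(3))
```

In this synthetic case the centers differ almost entirely in their first coordinate, so the four clusters are adjacent slices along the dominant axis, much like the regions along FS1 in the plot.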

While this exercise in clustering has not been successful in its own right, it tells us something valuable about the symptoms of MS patients: they do not seem to fall into natural disjoint clusters. In particular, MS patients of types RLP, PRIM, and SCND can experience symptoms across the severity scale (even if the progressive forms of MS are more associated with severe symptoms).