As a follow-up to the notebook "MS Symptoms: Hierarchical Clustering," we cluster the study participants by their symptoms. We use MCA factor scores as inputs to the k-means algorithm.
The first stage of the analysis proceeds exactly as in the notebook "MS Symptoms: MCA." As a sanity check, we verify that the same percentage of the variance is explained.
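import mca
import pandas as pd

# Load the symptom table and keep only the rows flagged by DIAGDIS2
sym_df = pd.read_csv('../data/symptom.csv')
sym_df = sym_df[sym_df.DIAGDIS2].set_index(['SITE_ID','PATIENT_ID'])

# The symptom indicators are the columns whose names start with 'EVER'
sym_columns = [col for col in sym_df.columns if col.startswith('EVER')]
sym_only = sym_df[sym_columns]

# Fit MCA on the dummy-coded symptom table
sym_dummy = mca.dummy(sym_only)
mca_model = mca.MCA(sym_dummy, ncols=len(sym_only.columns))

# Negate P and Q, as in the MCA notebook
mca_model.P = -mca_model.P
mca_model.Q = -mca_model.Q

mca_model.expl_var(greenacre=True).round(3)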
array([ 0.926, 0.016, 0.007, 0.003, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])
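To see how much of the variance the leading factors capture together, the first four values of the output above can be accumulated. This is only a small sketch: the four numbers are copied by hand from the printed array, and the variable names are illustrative.

```python
from itertools import accumulate

# Explained-variance ratios of the first four factors, copied from the output above.
explained = [0.926, 0.016, 0.007, 0.003]

# Running total of the variance explained as factors are added one by one.
cumulative = [round(total, 3) for total in accumulate(explained)]
print(cumulative)  # [0.926, 0.942, 0.949, 0.952]
```

On these figures, the first four factors together account for roughly 95% of the variance, with the first factor alone contributing almost all of it.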
We compute the top 4 factor scores for the row (subject) profiles. As before, we restrict attention to subjects that have been diagnosed with MS by their neurologist.
scores = mca_model.fs_r(N=4)
scores = pd.DataFrame(scores, columns=['FS1','FS2','FS3','FS4'], index=sym_df.index)
scores = pd.concat([sym_df[['DISEASE2']], scores], axis=1)
# Keep only the subjects diagnosed with MS, then drop the diagnosis column
ms_patient_scores = scores[scores['DISEASE2'] == 'MS'].drop(columns='DISEASE2')
ms_patient_scores.head()
We can now apply the standard k-means clustering algorithm to the factor scores. Somewhat arbitrarily, we choose $k=4$ clusters.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0)
clusters = kmeans.fit_predict(ms_patient_scores)
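For readers who want to see what the k-means step does, here is a minimal pure-Python sketch of the basic Lloyd iteration (assign each point to its nearest centroid, then recompute the centroids). It is not the scikit-learn implementation, which by default uses a more careful initialisation (k-means++). The function name `lloyd_kmeans` and the toy points are hypothetical and unrelated to the study data.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd iteration on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups in two dimensions.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_labels, toy_centroids = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

On well-separated toy data like this, the two groups end up in different clusters. On a single elongated point cloud, the same procedure still returns k clusters, whether or not the data has any natural grouping.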
The size of each cluster is shown below.
# Number the clusters from 1 rather than 0
ms_patient_scores['CLUSTER'] = clusters + 1
ms_patient_scores.CLUSTER.value_counts(sort=False)
1    424
2    407
3    331
4    565
Name: CLUSTER, dtype: int64
Let's compare these clusters to the human-created labels for MS subtypes: primary progressive, secondary progressive, etc. In a now-familiar step, we merge the two relevant data tables.
ms_type_df = pd.read_csv('../data/diagnosis.csv')
# Keep only the rows with a recorded MS type
ms_type_df = ms_type_df[~ms_type_df.MSTYPE.isnull()]
ms_type_df = ms_type_df.set_index(['SITE_ID','PATIENT_ID'])
# Join the factor scores and cluster labels to the MS type on the shared index
merged_df = pd.merge(ms_patient_scores, ms_type_df[['MSTYPE']], left_index=True, right_index=True)
merged_df.head()
Finally, we display the row profiles in a scatter plot, using the color of each point to distinguish the MS types and the style of the marker to distinguish the clusters.
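# Assumed import: the legacy bokeh.charts interface, which is not part of current Bokeh releases
from bokeh import charts

scatter = charts.Scatter(merged_df, x='FS1', y='FS2', color='MSTYPE', marker='CLUSTER',
                         title="K-means clustering on MCA factor scores",
                         plot_width=750, plot_height=500)
charts.show(scatter)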
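[Figure: scatter plot "K-means clustering on MCA factor scores", FS1 on the x-axis and FS2 on the y-axis, colored by MSTYPE, marker style by CLUSTER.]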
At least as far as the first two factor scores are concerned, there is no discernible cluster structure. This is reflected in the clustering produced by k-means, which does little more than chop the point cloud into four adjacent regions along the axis of the first factor score. Meanwhile, the MS types are spread across the point cloud, with the relapsing-remitting type being especially well dispersed. (This explains why the corresponding averaged point RLP in the original MCA notebook is near the center of its respective plot.) An exception is the isolated syndrome type, which is mostly confined to the leftmost cluster, associated with mild symptoms. The overall impression produced by this plot is not too different from the PCA plot from our first analysis of this data.
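The visual comparison of clusters and MS types can be complemented by a table of counts. The sketch below tabulates (cluster, MS type) pairs using only the standard library. The helper `cross_tabulate` and the toy labels are hypothetical and are not taken from the study data. With the merged table at hand, `pd.crosstab(merged_df['CLUSTER'], merged_df['MSTYPE'])` would presumably give the same kind of table directly.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Return {cluster: {ms_type: count}} for an iterable of (cluster, ms_type) pairs."""
    pairs = list(pairs)
    counts = Counter(pairs)
    clusters = sorted({c for c, _ in pairs})
    ms_types = sorted({t for _, t in pairs})
    # Counter returns 0 for combinations that never occur.
    return {c: {t: counts[(c, t)] for t in ms_types} for c in clusters}


# Hypothetical toy labels standing in for the CLUSTER and MSTYPE columns.
toy_pairs = [(1, "RLP"), (1, "RLP"), (2, "RLP"), (2, "SCND"),
             (3, "SCND"), (3, "RLP"), (4, "SCND"), (4, "SCND")]

table = cross_tabulate(toy_pairs)
for cluster, row in table.items():
    total = sum(row.values())
    # Share of each MS type within the cluster.
    shares = {t: round(n / total, 2) for t, n in row.items()}
    print(cluster, row, shares)
```

If each cluster's row were dominated by a single MS type, the clustering would line up with the clinical labels. Rows with mixed types would instead match the dispersed picture described above.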
While this exercise in clustering has not been successful in its own right, it tells us something valuable about the symptoms of MS patients: they do not seem to fall into natural disjoint clusters. In particular, MS patients of types SCND can experience symptoms across the severity scale (even if the progressive forms of MS are more associated with severe symptoms).