K-means Cluster Analysis
Cluster analysis is an unsupervised learning method. The goal of cluster analysis is to group or cluster observations into subsets based on the similarity of responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters.
The end goal is to obtain clusters with low variance within each cluster and high variance between clusters.
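To make the within-cluster versus between-cluster idea concrete, here is a minimal sketch using NumPy only. The function name `variance_decomposition`, the toy points, and both labelings are made up for illustration and are not part of the mall customer data. It measures spread as sums of squared distances, which split exactly into a within-cluster part and a between-cluster part.

```python
import numpy as np

def variance_decomposition(X, labels):
    """Return (within, between, total) sums of squared distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    within = 0.0
    between = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        centroid = members.mean(axis=0)
        # Spread of the members around their own centroid
        within += ((members - centroid) ** 2).sum()
        # Spread of the centroid around the grand mean, weighted by cluster size
        between += len(members) * ((centroid - grand_mean) ** 2).sum()
    total = ((X - grand_mean) ** 2).sum()
    return within, between, total

# Toy data (made up): two tight, well-separated groups of three points each
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the visible groups
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

within_good, between_good, total = variance_decomposition(X_toy, good_labels)
within_bad, between_bad, _ = variance_decomposition(X_toy, bad_labels)
print("good labeling:", within_good, between_good)
print("bad labeling: ", within_bad, between_bad)
```

Because the total is fixed for a given dataset, lowering the within-cluster part necessarily raises the between-cluster part, which is why the two goals go together.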
Let’s look at how to implement k-means clustering in Python step by step. We will use the mall customer dataset, which has the columns Age, Annual Income and Spending Score.
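1. First we load the dataset into a pandas DataFrame and split it into train and test sets.
from sklearn.model_selection import train_test_split
# df is the pandas DataFrame that holds the loaded dataset
clus_train, clus_test = train_test_split(df, test_size=.3, random_state=123)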
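K-means relies on Euclidean distance, so a column measured on a larger scale can dominate the result. A common preprocessing step, which is not part of the original walkthrough, is to standardize the columns. The sketch below assumes a small made-up DataFrame (`df_demo`) with the same column names as the dataset. The values are invented purely for illustration. The scaler is fitted on the training split only and then applied to both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the real data; only the column names follow the text
df_demo = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    "Annual Income": [15, 15, 16, 16, 17, 17, 18, 18, 19, 19],
    "Spending Score": [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})

train_demo, test_demo = train_test_split(df_demo, test_size=0.3, random_state=123)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(train_demo)
train_scaled = pd.DataFrame(scaler.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

print(train_scaled.shape, test_scaled.shape)
```

Fitting the scaler on the training split alone keeps information from the test rows out of the preprocessing step.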
2. Since we do not know in advance how many clusters to use, we apply the elbow method to choose the number of clusters, as below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
clusters = range(1, 10)
meandist = []
# Fit k-means for k = 1..9 and record the average distance to the nearest centroid
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])
# Plot the average distance of observations from their cluster centroid
# to use the elbow method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
In the resulting plot we see an elbow at 3 clusters and another at 6 clusters. The best approach is to run the analysis for both 3 and 6 clusters and, after comparing the results and plots, decide which number of clusters performs best.
For this example we will take 3 clusters.
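When the elbow plot suggests more than one candidate, one common way to compare them is the silhouette score, which ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. This is an alternative check, not the method used in this walkthrough. The sketch below assumes synthetic data from `make_blobs` with three hand-picked centers rather than the mall customer data, so the scores it prints say nothing about the real dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs (centers chosen by hand)
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=1.0,
                       random_state=0)

scores = {}
for k in (3, 6):
    # n_init and random_state are fixed so the run is repeatable
    model_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = silhouette_score(X_demo, model_k.labels_)
    print(f"k={k}: silhouette score = {scores[k]:.3f}")
```

On the real data, the same loop could be run on `clus_train` for the two candidate values of k, and the scores read alongside the scatter plots rather than on their own.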
3. Let’s draw a scatter plot to see how the clusters are distributed.
from sklearn.decomposition import PCA
model = KMeans(n_clusters=3)
model.fit(clus_train)
clusassign = model.predict(clus_train)
# Plot clusters: project the data onto two canonical variables with PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
# Color each point by the cluster label assigned by the fitted model
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
Above is the resulting scatter plot.
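The scatter plot shows a two-dimensional PCA projection of the data, so it is worth checking how much of the original variance the two canonical variables retain. The sketch below uses a made-up three-column array (`X_demo`) in place of `clus_train`. The attribute `explained_variance_ratio_` is part of scikit-learn's `PCA`, while the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 3 columns, where the third column nearly repeats the first
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([base[:, 0],
                          base[:, 1],
                          base[:, 0] + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2)
projected = pca.fit_transform(X_demo)

# Fraction of the total variance kept by the two plotted components
explained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {explained:.1%}")
```

If the retained share is low, clusters that look overlapping in the plot may still be well separated in the full feature space.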














