# Python tutorial: Build a model to categorize customers with SQL machine learning

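This part of the tutorial covers how to:

• Define the number of clusters for a K-means algorithm
• Perform clustering
• Analyze the results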

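## Define the number of clusters

The goal of K-means is to group items into k clusters so that all items in the same cluster are as similar to each other as possible, and as different as possible from items in other clusters.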
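For reference, this goal is commonly written as minimizing the within-cluster sum of squared distances. The notation is introduced here and is not part of the tutorial: $C_j$ is the set of items in cluster $j$, $\mu_j$ is that cluster's centroid, and $x_i$ is the feature vector of item $i$.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller value means the items sit closer to their own centroid. The value never increases at the optimum as k grows, which is why the number of clusters has to be chosen by looking for the point of diminishing returns rather than the minimum.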

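``````
################################################################################################
## Determine number of clusters using the Elbow method
################################################################################################

# Imports used below (aliases match the names the code relies on).
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster

# customer_data is assumed to be the pandas DataFrame loaded earlier in this tutorial series.
cdata = customer_data
K = range(1, 20)
# Fit one K-means model per candidate k (generators, so each is consumed only once).
KM = (sk_cluster.KMeans(n_clusters=k).fit(cdata) for k in K)
centroids = (k.cluster_centers_ for k in KM)

# Distance from every row to every centroid, then keep the distance to the nearest centroid.
D_k = (sci_distance.cdist(cdata, cent, 'euclidean') for cent in centroids)
dist = (np.min(D, axis=1) for D in D_k)
# Note: this is the mean (unsquared) distance to the nearest centroid for each k.
avgWithinSS = [sum(d) / cdata.shape[0] for d in dist]
plt.plot(K, avgWithinSS, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
plt.show()
``````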
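scikit-learn also exposes the within-cluster sum of squared distances directly as the `inertia_` attribute of a fitted `KMeans` model. The sketch below is one possible way to build an elbow curve from that attribute. It runs on synthetic data rather than `customer_data`, and the `elbow_curve` helper, the blob centers, and the range of k are assumptions made for illustration only. Because the block above averages unsquared distances, its values will differ from an inertia-based curve.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for customer_data: three well-separated 2-D blobs (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])


def elbow_curve(data, k_values, random_state=0):
    """Return the within-cluster sum of squares (inertia_) for each k."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data).inertia_
        for k in k_values
    ]


k_values = list(range(1, 7))
wss = elbow_curve(points, k_values)
for k, value in zip(k_values, wss):
    print(f"k={k}: within-cluster sum of squares = {value:.1f}")
```

On this synthetic data the curve should drop sharply up to k=3 and flatten afterwards, which is the "elbow" the method looks for.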

## Perform clustering

``````
################################################################################################
## Perform clustering using Kmeans
################################################################################################

# It looks like k=4 is a good number to use based on the elbow graph.
n_clusters = 4

# Fix random_state so the cluster assignments are reproducible.
means_cluster = sk_cluster.KMeans(n_clusters=n_clusters, random_state=111)
columns = ["orderRatio", "itemsRatio", "monetaryRatio", "frequency"]
est = means_cluster.fit(customer_data[columns])
clusters = est.labels_
customer_data['cluster'] = clusters

# Print some data about the clusters:

# For each cluster, count the members.
for c in range(n_clusters):
    cluster_members = customer_data[customer_data['cluster'] == c][:]
    print('Cluster{}(n={}):'.format(c, len(cluster_members)))
    print('-' * 17)

# Show the per-cluster mean of every column.
print(customer_data.groupby(['cluster']).mean())
``````

## Analyze the results

``````
Cluster0(n=31675):
-------------------
Cluster1(n=4989):
-------------------
Cluster2(n=1):
-------------------
Cluster3(n=671):
-------------------

             customer  orderRatio  itemsRatio  monetaryRatio  frequency
cluster
0        50854.809882    0.000000    0.000000       0.000000   0.000000
1        51332.535779    0.721604    0.453365       0.307721   1.097815
2        57044.000000    1.000000    2.000000     108.719154   1.000000
3        48516.023845    0.136277    0.078346       0.044497   4.271237
``````

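• orderRatio = return order ratio (total number of orders partially or fully returned versus the total number of orders)
• itemsRatio = return item ratio (total number of items returned versus the number of items purchased)
• monetaryRatio = return amount ratio (total monetary amount of items returned versus the amount purchased)
• frequency = return frequency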
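The three ratio definitions above can be sketched as simple column arithmetic in pandas. This is a minimal illustration under stated assumptions: the `totals` DataFrame, its column names, and its values are hypothetical, and the tutorial computes the real features in an earlier step that is not shown here and may differ. `frequency` is left out because the text does not say how it is calculated.

```python
import pandas as pd

# Hypothetical per-customer totals; column names and values are illustrative only.
totals = pd.DataFrame(
    {
        "customer": [1, 2, 3],
        "orders": [10, 4, 5],
        "orders_returned": [2, 0, 5],
        "items": [40, 8, 10],
        "items_returned": [4, 0, 10],
        "amount": [800.0, 120.0, 50.0],
        "amount_returned": [100.0, 0.0, 50.0],
    }
).set_index("customer")

# Each ratio divides the returned quantity by the corresponding purchased quantity.
ratios = pd.DataFrame(
    {
        "orderRatio": totals["orders_returned"] / totals["orders"],
        "itemsRatio": totals["items_returned"] / totals["items"],
        "monetaryRatio": totals["amount_returned"] / totals["amount"],
    }
)
print(ratios)
```

In this toy data, customer 2 never returned anything, so every ratio is zero, while customer 3 returned everything, so every ratio is one.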

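• Cluster 0 seems to be a group of customers who are not active (all return metrics are zero).
• Cluster 3 seems to be a group that stands out in terms of return behavior.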
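Once a cluster looks interesting, its members can be pulled out by filtering on the `cluster` column. The sketch below assumes a DataFrame shaped like `customer_data` after clustering; `sample_data`, its values, and the `customers_in_cluster` helper are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of customer_data after the 'cluster' column has been added.
sample_data = pd.DataFrame(
    {
        "customer": [101, 102, 103, 104, 105],
        "frequency": [0, 1, 5, 0, 4],
        "cluster": [0, 1, 3, 0, 3],
    }
)


def customers_in_cluster(data, cluster_id):
    """Return the customer IDs assigned to the given cluster."""
    return data.loc[data["cluster"] == cluster_id, "customer"].tolist()


print(customers_in_cluster(sample_data, 3))
```

With the sample rows above, this prints the two customers labeled as cluster 3.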

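## Next steps

In this part of the tutorial, you learned how to:

• Define the number of clusters for a K-means algorithm
• Perform clustering
• Analyze the results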