Python k-means algorithm

Question

I am looking for a Python implementation of the k-means algorithm, with examples, to cluster and cache my database of coordinates.

I did a similar implementation for images. You can use 2d arrays instead of RGB values. It's very naive but works for me github.com/keremgocen/pattern-recog-notes. — mass, May 18 '15 at 1:35

theJollySin · Accepted Answer · 2015-08-04 22:50:27Z

up vote 52 down vote accepted

Scipy's clustering implementations work well, and they include a k-means implementation.
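
As a rough illustration of the SciPy route, here is a minimal sketch using scipy.cluster.vq. The synthetic blobs, the variable names, and the choice of three clusters are assumptions made for the example, not part of the original answer.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Synthetic 2-D coordinates: three blobs of 20 points each (illustrative data only).
rng = np.random.RandomState(0)
blob_centers = [(1.0, 1.0), (2.0, 4.0), (3.0, 2.0)]
points = np.vstack([rng.normal(loc=c, scale=0.2, size=(20, 2)) for c in blob_centers])

# kmeans returns the codebook (one centroid per row) and the mean distortion.
# Note: SciPy may return fewer than the requested number of centroids
# if a cluster ends up empty.
centroids, distortion = kmeans(points, 3)

# vq assigns every point to its nearest centroid and reports the distance to it.
labels, distances = vq(points, centroids)

print(centroids)
print(labels)
```

The initial centroids are chosen at random, so the order of the centroids (and therefore the label numbers) can differ between runs.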

There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
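
To show what "not deciding the number of clusters ahead of time" can look like, here is a small sketch with scipy.cluster.hierarchy. The Ward linkage, the synthetic data, and the distance threshold of 2.0 are assumptions picked for this example; a sensible threshold depends on your own data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 2-D coordinates: three tight blobs (illustrative data only).
rng = np.random.RandomState(0)
blob_centers = [(1.0, 1.0), (2.0, 4.0), (3.0, 2.0)]
points = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in blob_centers])

# Build the merge tree bottom-up with Ward linkage (Euclidean distance).
tree = linkage(points, method="ward")

# Cut the tree at a distance threshold instead of fixing the cluster count.
# The value 2.0 is an assumption that suits this synthetic data.
labels = fcluster(tree, t=2.0, criterion="distance")
n_clusters = len(np.unique(labels))

print(n_clusters)
print(labels)
```

The number of clusters falls out of the threshold, so the tuning effort moves from choosing k to choosing a cut distance.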

edited Aug 4 '15 at 22:50

theJollySin

answered Oct 9 '09 at 22:10

tom10

Vebjorn Ljosa · Answer 2 · 2010-02-09 03:31:12Z

up vote 27 down vote

SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.

For now, I would recommend using PyCluster instead. Example usage:

>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean, 
                                                            0.03 * numpy.diag([1,1]),
                                                            20) 
                           for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels  # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error   # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound  # Number of times this solution was found
1

edited Feb 9 '10 at 3:31

answered Feb 8 '10 at 20:03

Vebjorn Ljosa

It also seems that the scipy cluster kmeans function does not accept a distance method and always uses Euclidean. Another reason to use PyCluster? – Sid Feb 22 '12 at 21:04

just hit the error mentioned... I see in your example the cluster groupings, but can you get the cluster "center"? – monkut May 24 '12 at 3:18

@monkut, numpy.vstack([points[labels == i].mean(0) for i in range(labels.max() + 1)]) to get the centers of the clusters. – Vebjorn Ljosa May 24 '12 at 9:04

You can get rid of the error in kmeans2 by using the keyword argument minit='points' – forefinger Aug 27 '14 at 5:44
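
For reference, a call along the lines suggested in this comment might look like the sketch below. The synthetic data and variable names are assumptions for illustration; only kmeans2 and its minit keyword come from the thread.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic 2-D coordinates: three blobs of 20 points each (illustrative data only).
rng = np.random.RandomState(0)
blob_centers = [(1.0, 1.0), (2.0, 4.0), (3.0, 2.0)]
points = np.vstack([rng.normal(loc=c, scale=0.2, size=(20, 2)) for c in blob_centers])

# minit='points' starts from k observations picked from the data itself,
# rather than from centroids drawn from a Gaussian fitted to the data.
centroids, labels = kmeans2(points, 3, minit="points")

print(centroids)  # one row per cluster center
print(labels)     # cluster index for each point
```

kmeans2 returns the cluster centers directly, so they do not have to be recomputed from the labels.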

Nathan · Answer 3 · 2010-04-09 05:21:50Z

For continuous data, k-means is very easy.

You need a list of your means. For each data point, find the mean it is closest to and average the new data point into it. Your means will then represent the recent salient clusters of points in the input data.

I do the averaging continuously, so there is no need to keep the old data to obtain the new average. Given the old average k, the next data point x, and a constant n, which is the number of past data points to keep the average of, the new average is

k*(1-(1/n)) + x*(1/n)

(In the code, param plays the role of 1/n.)

Here is the full code in Python

from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01  # bigger numbers make the means change faster
# must be between 0 and 1

for x in data:
    closest_k = 0
    smallest_error = float('inf')
    # find the mean closest to x
    for i, mean in enumerate(means):
        error = abs(x - mean)
        if error < smallest_error:
            smallest_error = error
            closest_k = i
    # move only the closest mean toward x, once per data point
    # (this update sits outside the inner loop)
    means[closest_k] = means[closest_k]*(1-param) + x*(param)

You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!
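
Since the question is about coordinates, here is a hedged sketch of the same running-average idea extended to 2-D points with NumPy. The function name online_kmeans, the rate parameter (which plays the role of param above), and the random test data are assumptions made for this example.

```python
import numpy as np


def online_kmeans(points, means, rate=0.01):
    """Stream the points once, nudging the nearest mean toward each point."""
    means = np.array(means, dtype=float)  # copy, so the caller's array is untouched
    for x in points:
        # index of the mean closest to x (Euclidean distance)
        nearest = np.argmin(np.linalg.norm(means - x, axis=1))
        # exponential moving average: new = old*(1 - rate) + x*rate
        means[nearest] = means[nearest] * (1.0 - rate) + x * rate
    return means


# Illustrative data: 1000 random 2-D points, five means started on the first five points.
rng = np.random.RandomState(0)
points = rng.uniform(0.0, 1.0, size=(1000, 2))
initial = points[:5].copy()
final_means = online_kmeans(points, initial, rate=0.01)
print(final_means)
```

Each mean is only ever replaced by a weighted average of itself and a data point, so the means stay inside the range of the data.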

This is a great online-learning k-means algorithm! But there is a bug in the last row of the code: one level of indentation should be removed from this row: means[closest_k] = means[closest_k]*(1-param) + x*(param) — lai, Jul 24 '15 at 9:49

Jacob · Answer 4 · 2009-10-09 19:26:39Z

up vote 5 down vote

From Wikipedia, you could use SciPy's K-means clustering and vector quantization.

Or, you could use a Python wrapper for OpenCV, ctypes-opencv.

Or you could use OpenCV's new Python interface, and their kmeans implementation.

edited Oct 9 '09 at 19:26

answered Oct 9 '09 at 19:21

Jacob

Community · Answer 5 · 2017-05-23 10:31:34Z

up vote 5 down vote

(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
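
The linked kmeans.py is not reproduced here. As a rough sketch of the general idea (Lloyd-style iterations where the assignment step uses a metric from scipy.spatial.distance), something like the following could work. The function name kmeans_metric and its parameters are assumptions for illustration and are not the linked code. Note that the centers are still updated with the arithmetic mean, which is only guaranteed to reduce the objective for squared Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans_metric(points, k, metric="cityblock", n_iter=50, seed=0):
    """Lloyd-style k-means whose assignment step uses any cdist metric."""
    rng = np.random.RandomState(seed)
    # start from k distinct observations
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to the nearest center under the chosen metric
        labels = cdist(points, centers, metric=metric).argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment against the final centers
    labels = cdist(points, centers, metric=metric).argmin(axis=1)
    return centers, labels


# Illustrative data: three blobs of 20 points each.
rng = np.random.RandomState(1)
blob_centers = [(1.0, 1.0), (2.0, 4.0), (3.0, 2.0)]
points = np.vstack([rng.normal(loc=c, scale=0.2, size=(20, 2)) for c in blob_centers])
centers, labels = kmeans_metric(points, 3, metric="cityblock")
print(centers)
```

Swapping metric="cityblock" for another name accepted by cdist (for example "cosine" or "chebyshev") changes only the assignment step.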

edited May 23 at 10:31

Community♦

answered Jul 4 '11 at 14:43

denis

George · Answer 6 · 2009-10-09 19:35:19Z

up vote 0 down vote

You can also use GDAL, which has many functions for working with spatial data.

answered Oct 9 '09 at 19:35

George

gsilv · Answer 7 · 2017-02-12 12:45:48Z

SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as: kmeans = KMeans(n_clusters=2, random_state=0).fit(X).

This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])

(courtesy of SciKit Learn's documentation, linked above)
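
The question also asks about caching. One possible approach, sketched here under the assumption that a pickled model is an acceptable cache format, is to serialize the fitted estimator and reuse it later without refitting. The variable names and the in-memory round trip are illustrative; in practice the bytes would be written to a file or a cache of your choice.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# n_init is set explicitly; its default has changed between scikit-learn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Serialize the fitted model; these bytes could be stored on disk or in a cache.
blob = pickle.dumps(model)

# Later (or in another process): restore the model and reuse it without refitting.
restored = pickle.loads(blob)

new_points = np.array([[0.0, 0.0], [4.0, 4.0]])
same = np.array_equal(model.predict(new_points), restored.predict(new_points))
print(same)
print(restored.cluster_centers_)
```

A pickled estimator should be loaded with the same scikit-learn version that wrote it. If only the centroids matter, saving kmeans.cluster_centers_ as a plain array is a lighter alternative.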

Guest · Answer 8 · 2014-09-14 20:52:51Z

up vote -1 down vote

Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.

edited Sep 14 '14 at 20:52

answered Sep 14 '14 at 20:47

Guest

Ali Osman Mollahüseyinoğlu · Answer 9 · 2016-03-29 10:44:55Z

*K-means code in Python*

import math

# "functions" is the author's own helper module; its source is not included in this post.
from functions import functions


class KMEANS:

    @staticmethod
    def KMeans(data, classterCount, globalCounter):
        counter = 0
        classes = []
        cluster = [[]]
        cluster_index = []
        tempClasses = []
        for i in range(0, classterCount):
            globalCounter += 1
            classes.append(cluster)
            cluster_index.append(cluster)
            tempClasses.append(cluster)
        classes2 = classes[:]
        # seed each class with one of the first data points
        for i in range(0, len(classes)):
            globalCounter = 1
            cluster = [data[i]]
            classes[i] = cluster
        functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
        functions.ResetClasterIndex(classes2, classterCount, globalCounter)

        def clusterFills(classeses, globalCounter, counter):
            counter += 1
            combinedOfClasses = functions.CopyTo(classeses)
            functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
            functions.ResetClasterIndex(tempClasses, classterCount, globalCounter)
            # average (centroid) of every current class
            avarage = []
            for k in range(0, len(combinedOfClasses)):
                globalCounter += 1
                avarage.append(functions.GetAvarage(combinedOfClasses[k]))
            # assign every data point to the class with the nearest average
            for i in range(0, len(data)):
                globalCounter += 1
                minimum = 0
                index = 0
                for k in range(0, len(avarage)):
                    total = 0.0
                    for j in range(0, len(avarage[k])):
                        total += (avarage[k][j] - data[i][j]) ** 2
                    tempp = math.sqrt(total)
                    if k == 0:
                        minimum = tempp
                    if tempp <= minimum:
                        minimum = tempp
                        index = k
                tempClasses[index].append(data[i])
                cluster_index[index].append(i)
            # repeat while the assignment is still changing
            if functions.CompareArray(tempClasses, combinedOfClasses) == 1:
                return clusterFills(tempClasses, globalCounter, counter)
            returnArray = []
            returnArray.append(tempClasses)
            returnArray.append(cluster_index)
            returnArray.append(avarage)
            returnArray.append(counter)
            return returnArray

        cdcd = clusterFills(classes, globalCounter, counter)
        if cdcd is not None:
            return cdcd

    @staticmethod
    def KMeansPer(data, classterCount, globalCounter):
        # cluster the first 30% of the data, then use that result as the starting point
        perData = data[0:int(float(len(data)) / 100 * 30)]
        result = KMEANS.KMeans(perData, classterCount, globalCounter)
        cluster_index = []
        tempClasses = []
        classes = []
        cluster = [[]]
        for i in range(0, classterCount):
            globalCounter += 1
            classes.append(cluster)
            cluster_index.append(cluster)
            tempClasses.append(cluster)
        classes2 = classes[:]
        for i in range(0, len(classes)):
            globalCounter = 1
            cluster = [data[i]]
            classes[i] = cluster
        functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
        functions.ResetClasterIndex(classes2, classterCount, globalCounter)
        counter = 0

        def clusterFills(classeses, globalCounter, counter):
            counter += 1
            combinedOfClasses = functions.CopyTo(classeses)
            functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
            functions.ResetClasterIndex(tempClasses, classterCount, globalCounter)
            avarage = []
            for k in range(0, len(combinedOfClasses)):
                globalCounter += 1
                avarage.append(functions.GetAvarage(combinedOfClasses[k]))
            for i in range(0, len(data)):
                globalCounter += 1
                minimum = 0
                index = 0
                for k in range(0, len(avarage)):
                    total = 0.0
                    for j in range(0, len(avarage[k])):
                        total += (avarage[k][j] - data[i][j]) ** 2
                    tempp = math.sqrt(total)
                    if k == 0:
                        minimum = tempp
                    if tempp <= minimum:
                        minimum = tempp
                        index = k
                tempClasses[index].append(data[i])
                cluster_index[index].append(i)
            if functions.CompareArray(tempClasses, combinedOfClasses) == 1:
                return clusterFills(tempClasses, globalCounter, counter)
            returnArray = []
            returnArray.append(tempClasses)
            returnArray.append(cluster_index)
            returnArray.append(avarage)
            returnArray.append(counter)
            return returnArray

        cdcd = clusterFills(result[0], globalCounter, counter)
        if cdcd is not None:
            return cdcd
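
The code above depends on a helper module that is not shown, so it cannot be run as posted. For comparison, here is a self-contained sketch of the same batch (Lloyd-style) loop using only the standard library. It is not the author's implementation; the function name, parameters, and sample data are assumptions for illustration, and math.dist requires Python 3.8 or later.

```python
import math
import random


def kmeans(data, k, max_iter=100, seed=0):
    """Plain batch k-means: assign points to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    # start from k distinct data points
    centers = [list(p) for p in rng.sample(data, k)]
    assignment = None
    for _ in range(max_iter):
        # assignment step: index of the nearest center for every point
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in data
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the algorithm has converged
        assignment = new_assignment
        # update step: each center becomes the average of its members
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Illustrative coordinates only.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centers, assignment = kmeans(data, 2)
print(centers)
print(assignment)
```

Like any k-means, this converges to a local optimum that depends on the initial centers, so running it with several seeds and keeping the best result is a common precaution.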

asked	7 years, 8 months ago
viewed	76562 times
active	3 months ago
