README.md
# KMeansPP [![Build Status](https://img.shields.io/travis/ollie/k_means_pp/master.svg)](https://travis-ci.org/ollie/k_means_pp) [![Code Climate](https://img.shields.io/codeclimate/github/ollie/k_means_pp.svg)](https://codeclimate.com/github/ollie/k_means_pp) [![Gem Version](https://img.shields.io/gem/v/k_means_pp.svg)](https://rubygems.org/gems/k_means_pp)
## What's this?
This is a Ruby implementation of the k-means++ algorithm for data clustering.
In other words: Grouping a bunch of X, Y points into K groups.
The code is a port of the Python version on [rosettacode.org][rosetta].
### K-means++ (from [Wikipedia][kmeans++])
> In data mining, k-means++ is an algorithm for choosing the initial values (or
> "seeds") for the k-means clustering algorithm. It was proposed in 2007 by
> David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the
> NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found
> by the standard k-means algorithm.
>
> [...]
>
> The k-means problem is to find cluster centers that minimize the intra-class
> variance, i.e. the sum of squared distances from each data point being
> clustered to its cluster center (the center that is closest to it). Although
> finding an exact solution to the k-means problem for arbitrary input is
> NP-hard the standard approach to finding an approximate solution (often
> called [Lloyd's algorithm][lloyd] or the k-means algorithm) is used widely and
> frequently finds reasonable solutions quickly.
### K-means (from [Wikipedia][kmeans])
> k-means clustering is a method of vector quantization, originally from signal
> processing, that is popular for cluster analysis in data mining. k-means
> clustering aims to partition n observations into k clusters in which each
> observation belongs to the cluster with the nearest mean, serving as a
> prototype of the cluster. This results in a partitioning of the data space
> into Voronoi cells.
## Usage
See examples, too.
```ruby
points = [
[0.3968, 1.9431],
[9.3348, 6.7843],
[9.2882, 8.1347],
[7.6768, 2.7362],
[3.4434, 4.1910],
[1.8097, 5.0884],
[7.0698, 3.9285],
[9.3820, 7.6790],
[8.6092, 0.9651],
[9.1981, 7.7493]
]
clusters = KMeansPP.clusters(points, 3)
plot clusters
puts clusters
# Cluster (7.785266666666668, 2.5432666666666663): [
# [7.6768, 2.7362],
# [7.0698, 3.9285],
# [8.6092, 0.9651],
# ]
# Cluster (9.300774999999998, 7.586824999999999): [
# [9.3348, 6.7843],
# [9.2882, 8.1347],
# [9.382, 7.679],
# [9.1981, 7.7493],
# ]
# Cluster (1.8833, 3.7408333333333332): [
# [0.3968, 1.9431],
# [3.4434, 4.191],
# [1.8097, 5.0884],
# ]
cluster = clusters.first
p cluster.centroid.x # 7.785266666666668
p cluster.centroid.y # 2.5432666666666663
p cluster.points # [[7.6768, 2.7362], [7.0698, 3.9285], [8.6092, 0.9651]]
```
Or with custom structure:
```ruby
points = [
{ x: 0.3968, y: 1.9431 },
{ x: 9.3348, y: 6.7843 },
{ x: 9.2882, y: 8.1347 },
{ x: 7.6768, y: 2.7362 },
{ x: 3.4434, y: 4.1910 },
{ x: 1.8097, y: 5.0884 },
{ x: 7.0698, y: 3.9285 },
{ x: 9.3820, y: 7.6790 },
{ x: 8.6092, y: 0.9651 },
{ x: 9.1981, y: 7.7493 }
]
clusters = KMeansPP.clusters(points, 3) do |point|
[point[:x], point[:y]]
end
puts clusters
# Cluster (9.300774999999998, 7.586824999999999): [
# {:x=>9.3348, :y=>6.7843},
# {:x=>9.2882, :y=>8.1347},
# {:x=>9.382, :y=>7.679},
# {:x=>9.1981, :y=>7.7493},
# ]
# Cluster (1.8833, 3.7408333333333332): [
# {:x=>0.3968, :y=>1.9431},
# {:x=>3.4434, :y=>4.191},
# {:x=>1.8097, :y=>5.0884},
# ]
# Cluster (7.785266666666668, 2.5432666666666663): [
# {:x=>7.6768, :y=>2.7362},
# {:x=>7.0698, :y=>3.9285},
# {:x=>8.6092, :y=>0.9651},
# ]
```
## Running examples
If you want to run the examples, you will need `gnuplot` library and gem.
Don't forget to add the `--with-x` flag otherwise it won't show anything.
$ brew install gnuplot --with-x # Assuming OS X
$ gem install gnuplot
$ cd examples
$ ruby example_simple.rb
$ ruby example_block.rb
$ ruby example_csv.rb
$ ruby example_huge.rb
$ ruby example_debug.rb # Generates profiler reports
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'k_means_pp'
```
And then execute:
$ bundle
Or install it yourself as:
$ gem install k_means_pp
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release` to create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
## Contributing
1. Fork it (https://github.com/ollie/k_means_pp/fork)
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
[rosetta]: http://rosettacode.org/wiki/K-means%2B%2B_clustering#Python
[kmeans++]: https://en.wikipedia.org/wiki/K-means%2B%2B
[kmeans]: https://en.wikipedia.org/wiki/K-means_clustering
[lloyd]: https://en.wikipedia.org/wiki/Lloyd%27s_algorithm