Speaker: Ryan Curtin (CSIP)
Title: Dual-tree k-means with bounded single-iteration runtime
k-means is a widely used clustering algorithm, but for k clusters and a dataset size of N, each iteration of Lloyd's algorithm costs O(kN) time. Although there are tree-based techniques to accelerate single Lloyd iterations, none of these techniques are tailored to the case of large k, which is increasingly common as dataset sizes grow. We propose a dual-tree algorithm that gives the exact result of Lloyd iterations; when using cover trees, we are able to bound the worst-case runtime of each algorithm as O(N + k log k) (this bound also depends on dataset-dependent quantities). To our knowledge these are the first sub-O(kN) bounds for exact Lloyd iterations. We then show that these theoretically favorable algorithms significantly outperform other approaches in practice, especially for large N and k.
Ryan Curtin is a nearly-finished Ph.D. student in the School of Electrical and Computer Engineering at Georgia Tech; his research focuses on the acceleration of core primitives that make up machine learning algorithms, generally via the use of trees and other hierarchical structures. He is also the primary developer and maintainer of the mlpack machine learning library (http://www.mlpack.org/). He doesn't enjoy writing biographies, and also is a pinball wizard.