Monday, October 22, 2007

Text Clustering Algorithms

1. Hierarchical Agglomerative Clustering

in the very beginning, treat each input as a cluster, in each step, merge two closest clusters

2. K-means

build clusters around k centroids iteratively

define a good similarity measure (such as cosine measure based on the vector space model) is the key of these two algorithms

3. Suffix Tree Clustering

group inputs based on the identical phrases they share, two main phrases: discover base clusters and merge base clusters. the approach is difficult to tune and relevant inputs without common phrases will not be included in that cluster.

4. Semantic Online Hierarchical Clustering

similiar to 3, but, instead of using the whole sentence, a concept named "complete phrase" is defined, three main phrases: discover complete phrases and their frequencies in inputs, discover orthogonal base clusters using "single value decomposition", and merge base clusters into hierarchical structure.

LINGO is a variation of this approach. Most algorithms form clusters first and then figure out what labels to use. However, both LINGO (Carrot2) and Vivisimo try to find meaningful descriptions of clusters first, and determine their contents based on the descriptions.

No comments: