Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.
K-means Clustering
- K-means is the most widely used clustering algorithm and is typically the first method to try for general clustering tasks.
- The scikit-learn KMeans implementation is suitable for small to medium-sized datasets, while Faiss's kmeans scales far more efficiently to very large datasets (both are sketched after this list).
- K-means requires the number of clusters to be specified in advance and relies on the Euclidean distance metric, which performs poorly in high-dimensional spaces.
- When document embeddings have high dimensionality (e.g., 768 dimensions from sentence transformers), K-means becomes less effective due to the limitations of Euclidean distance in such spaces.
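A minimal sketch of both routes, using random vectors as a stand-in for real document embeddings; the cluster count and the Faiss settings are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.cluster import KMeans
import faiss

# Stand-in for real document embeddings (e.g. 768-dim sentence-transformer vectors)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768)).astype("float32")
n_clusters = 8  # K-means needs this chosen up front

# scikit-learn: fine for small/medium datasets
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels_sklearn = km.fit_predict(embeddings)
print("sklearn inertia:", km.inertia_)

# Faiss: scales to much larger collections (optionally GPU-accelerated)
kmeans = faiss.Kmeans(d=embeddings.shape[1], k=n_clusters, niter=20, seed=0)
kmeans.train(embeddings)
_, assignments = kmeans.index.search(embeddings, 1)  # nearest centroid per vector
labels_faiss = assignments.ravel()
```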
Alternatives to K-means for High Dimensions
- For text embeddings with high dimensionality, agglomerative (hierarchical) clustering methods are preferable, particularly because they allow the use of different similarity metrics.
- Agglomerative clustering in scikit-learn accepts a pre-computed cosine similarity matrix, which is more appropriate for natural language processing.
- Constructing the pre-computed distance (or similarity) matrix involves normalizing the vectors and computing dot products, which can be done efficiently with linear algebra libraries like PyTorch (see the example after this list).
- Hierarchical algorithms do not expose an inertia value the way K-means does, so choosing the number of clusters relies on external metrics such as the silhouette score.
- Other clustering algorithms exist, including spectral, mean shift, and affinity propagation, which are not covered in this episode.
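A sketch of the pre-computed cosine route, assuming random vectors in place of real sentence-transformer embeddings; the cluster count and linkage choice are illustrative (newer scikit-learn releases take metric="precomputed", older ones call the argument affinity):

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import AgglomerativeClustering

# Stand-in for sentence-transformer embeddings (real code would use model.encode(...))
emb = torch.randn(500, 768)

# Normalize rows, then one matrix multiply gives all pairwise cosine similarities
emb = F.normalize(emb, p=2, dim=1)
cosine_sim = emb @ emb.T                       # (500, 500) similarity matrix
cosine_dist = (1.0 - cosine_sim).clamp(min=0)  # convert similarity to a distance

clustering = AgglomerativeClustering(
    n_clusters=10,           # illustrative
    metric="precomputed",    # we pass the distance matrix directly
    linkage="average",       # "ward" only works with Euclidean feature vectors
)
labels = clustering.fit_predict(cosine_dist.numpy())
```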
Semantic Search and Vector Indexing
- Libraries such as Faiss, Annoy, and HNSWlib provide approximate nearest neighbor search for efficient semantic search on large-scale vector data.
- These systems create an index of your embeddings to enable rapid similarity search, often with the ability to specify cosine similarity as the metric (sketched below with HNSWlib).
- Sample code using these libraries with sentence transformers can be found in the UKP Lab sentence-transformers examples directory.
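A minimal HNSWlib sketch with a random stand-in corpus; the index parameters (ef_construction, M, ef) are illustrative defaults rather than tuned values:

```python
import numpy as np
import hnswlib

# Stand-in corpus embeddings; in practice these come from a sentence-transformers model
dim, n = 768, 10_000
rng = np.random.default_rng(0)
corpus = rng.normal(size=(n, dim)).astype("float32")

# Build an HNSW index that uses cosine similarity
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(corpus, np.arange(n))
index.set_ef(50)  # query-time accuracy/speed trade-off

# Search: returns ids and cosine distances (1 - similarity) of the top-k neighbors
query = rng.normal(size=(1, dim)).astype("float32")
ids, distances = index.knn_query(query, k=5)
```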
Determining the Optimal Number of Clusters
- Both K-means and agglomerative clustering require a predefined number of clusters, but this is often unknown beforehand.
- The "elbow" method involves running the clustering algorithm with varying cluster counts and plotting the inertia (sum of squared distances within clusters) to visually identify the point of diminishing returns; see kmeans.inertia_.
- The kneed package can automatically detect the "elbow" or "knee" in the inertia plot, eliminating subjective human judgment; sample code available here.
- The silhouette score, calculated via silhouette_score, considers both inter- and intra-cluster distances and allows for direct selection of the number of clusters with the maximum score.
- The silhouette score can be computed using a pre-computed distance matrix (such as from cosine similarities), making it well-suited for applications involving non-Euclidean metrics and hierarchical clustering (see the example after this list).
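A sketch of both approaches on a random stand-in embedding matrix; the candidate range of cluster counts is illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from kneed import KneeLocator

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 768)).astype("float32")  # stand-in embeddings

# Elbow method: compute inertia over a range of k and let kneed find the knee
ks = list(range(2, 15))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
elbow_k = KneeLocator(ks, inertias, curve="convex", direction="decreasing").elbow

# Silhouette method on a pre-computed cosine distance matrix
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
dist = 1.0 - Xn @ Xn.T
np.fill_diagonal(dist, 0.0)
scores = {}
for k in ks:
    labels = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                     linkage="average").fit_predict(dist)
    scores[k] = silhouette_score(dist, labels, metric="precomputed")
best_k = max(scores, key=scores.get)  # pick the k with the highest silhouette score
```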
Density-Based Clustering: DBSCAN and HDBSCAN
- DBSCAN is a density-based clustering method that does not require the number of clusters to be specified in advance; clusters are discovered from regions of high point density, and points in sparse regions are labeled as noise.
- HDBSCAN is a more popular and versatile extension of DBSCAN that builds a hierarchy of density-based clusters and handles a wider range of data without significant parameter tuning.
- DBSCAN and HDBSCAN can be preferable to K-means or agglomerative clustering when automatic determination of the cluster count or robustness to noise is important (a minimal HDBSCAN sketch follows this list).
- However, these algorithms may not perform well with all types of high-dimensional embedding data, as illustrated by the challenges faced when clustering 768-dimensional text embeddings.
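A minimal HDBSCAN sketch on stand-in data; min_cluster_size is an illustrative setting that usually needs adjusting to the corpus:

```python
import numpy as np
import hdbscan

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))  # stand-in embeddings

# HDBSCAN discovers the number of clusters itself; points it cannot
# confidently assign receive the noise label -1.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```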
Summary Recommendations and Links