Clustering algorithms
Clustering algorithms group data points into distinct clusters based on their similarities, enabling unsupervised learning to uncover patterns within the data. To create a clustering model, use the type parameter in the OPTIONS clause to specify the algorithm you want to use for model training. Next, define the relevant parameters as key-value pairs to fine-tune the model.
K-Means
K-Means is a clustering algorithm that partitions data points into a predefined number of clusters (k). It is one of the most commonly used algorithms for clustering due to its simplicity and efficiency.
Parameters
When using K-Means, the following parameters can be set in the OPTIONS clause:
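MAX_ITER: The maximum number of iterations to run. Default: 20.
TOL: The convergence tolerance. Default: 0.0001.
NUM_CLUSTERS: The number of clusters to create (k). Default: 2.
DISTANCE_TYPE: The distance measure used to compare points. Default: euclidean. Possible values: euclidean, cosine.
KMEANS_INIT_METHOD: The initialization algorithm for the cluster centers. Default: k-means||. Possible values: random, k-means|| (a parallel version of k-means++).
INIT_STEPS: The number of steps for the k-means|| initialization mode (applicable only when KMEANS_INIT_METHOD is k-means||). Default: 2.
PREDICTION_COL: The name of the column that stores the predicted cluster. Default: prediction.
SEED: The random seed. Default: -1689246527.
WEIGHT_COL: The name of the column that holds instance weights. Default: not set.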
Example
CREATE MODEL modelname
OPTIONS(
type = 'kmeans',
MAX_ITER = 30,
NUM_CLUSTERS = 4
)
AS SELECT col1, col2, col3 FROM training-dataset;
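The example above sets only two options. As a fuller sketch, the statement below also sets the distance measure, the initialization method, and the seed from the K-Means parameter list. The option values are illustrative only, quoting string values the same way as type is an assumption, and modelname, the column names, and training-dataset are the same placeholders used above.

```sql
-- Sketch only: option values are illustrative, not recommendations.
-- Assumption: string-valued options are quoted the same way as type.
CREATE MODEL modelname
OPTIONS(
  type = 'kmeans',
  NUM_CLUSTERS = 5,
  MAX_ITER = 50,
  TOL = 0.001,
  DISTANCE_TYPE = 'cosine',
  KMEANS_INIT_METHOD = 'k-means||',
  INIT_STEPS = 4,
  SEED = 42
)
AS SELECT col1, col2, col3 FROM training-dataset;
```

Setting SEED explicitly is one way to make repeated training runs comparable, since the initial cluster centers are chosen randomly.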
Bisecting K-means
Bisecting K-means is a hierarchical clustering algorithm that uses a divisive (or “top-down”) approach. All observations start in a single cluster, and splits are performed recursively as the hierarchy is built. Bisecting K-means can often be faster than regular K-means, but it typically produces different cluster results.
Parameters
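MAX_ITER: The maximum number of iterations to run.
WEIGHT_COL: The name of the column that holds instance weights. If it is not set, all instance weights are treated as 1.0.
NUM_CLUSTERS: The desired number of leaf clusters.
SEED: The random seed.
DISTANCE_MEASURE: The distance measure used to compare points. Possible values: euclidean, cosine.
MIN_DIVISIBLE_CLUSTER_SIZE: The minimum number of points (if 1.0 or greater) or the minimum proportion of points (if less than 1.0) that a cluster must contain to be divisible.
PREDICTION_COL: The name of the column that stores the predicted cluster.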
Example
CREATE MODEL modelname
OPTIONS(
type = 'bisecting_kmeans'
)
AS SELECT col1, col2, col3 FROM training-dataset;
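As a hedged sketch of how the Bisecting K-means parameters listed above might be combined, the statement below caps the number of clusters and stops splitting small clusters. The values are illustrative only. It assumes that string values are quoted the same way as type, and that a MIN_DIVISIBLE_CLUSTER_SIZE of 1 or more is read as a minimum point count. The names modelname, col1 to col3, and training-dataset are placeholders.

```sql
-- Sketch only: option values are illustrative, not recommendations.
-- Assumption: MIN_DIVISIBLE_CLUSTER_SIZE >= 1 is interpreted as a point count,
-- so clusters with fewer than 10 points are not split further.
CREATE MODEL modelname
OPTIONS(
  type = 'bisecting_kmeans',
  NUM_CLUSTERS = 6,
  MAX_ITER = 30,
  DISTANCE_MEASURE = 'euclidean',
  MIN_DIVISIBLE_CLUSTER_SIZE = 10,
  SEED = 42
)
AS SELECT col1, col2, col3 FROM training-dataset;
```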
Gaussian Mixture Model
Gaussian Mixture Model represents a composite distribution where data points are drawn from one of k Gaussian sub-distributions, each with its own probability. It is used to model datasets that are assumed to be generated from a mixture of several Gaussian distributions.
Parameters
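MAX_ITER: The maximum number of iterations to run.
WEIGHT_COL: The name of the column that holds instance weights. If it is not set, all instance weights are treated as 1.0.
NUM_CLUSTERS: The number of independent Gaussian distributions in the mixture (k).
SEED: The random seed.
AGGREGATION_DEPTH: The suggested depth for tree aggregation during training.
PROBABILITY_COL: The name of the column that stores the predicted probability of each cluster.
TOL: The convergence tolerance.
PREDICTION_COL: The name of the column that stores the predicted cluster.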
Example
CREATE MODEL modelname
OPTIONS(
type = 'gaussian_mixture'
)
AS SELECT col1, col2, col3 FROM training-dataset;
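The statement below is a minimal sketch of a Gaussian Mixture Model with a few of the listed parameters set. The values are illustrative only. The output column names cluster_id and cluster_probability are hypothetical, and quoting string values the same way as type is an assumption. The names modelname, col1 to col3, and training-dataset are placeholders.

```sql
-- Sketch only: option values are illustrative, not recommendations.
-- Assumption: cluster_id and cluster_probability are hypothetical output column names.
CREATE MODEL modelname
OPTIONS(
  type = 'gaussian_mixture',
  NUM_CLUSTERS = 3,
  MAX_ITER = 100,
  TOL = 0.01,
  PREDICTION_COL = 'cluster_id',
  PROBABILITY_COL = 'cluster_probability',
  SEED = 42
)
AS SELECT col1, col2, col3 FROM training-dataset;
```

Because each point is modeled as drawn from one of k Gaussian sub-distributions, naming PROBABILITY_COL keeps the per-cluster probabilities available alongside the single most likely cluster in PREDICTION_COL.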
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a probabilistic model that captures the underlying topic structure from a collection of documents. It is a three-level hierarchical Bayesian model with word, topic, and document layers. LDA uses these layers, along with the observed documents, to build a latent topic structure.
Parameters
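MAX_ITER: The maximum number of iterations to run.
OPTIMIZER: The optimizer used to estimate the LDA model. Supported options are "online" (Online Variational Bayes) and "em" (Expectation-Maximization). Possible values: online, em (case-insensitive).
NUM_CLUSTERS: The number of topics (clusters) to infer.
CHECKPOINT_INTERVAL: The number of iterations between checkpoints.
DOC_CONCENTRATION: The concentration parameter (alpha) for the prior placed on each document's distribution over topics. For the em optimizer, alpha values should be greater than 1.0 (default: uniformly distributed as (50/k) + 1), ensuring symmetric topic distributions. For the online optimizer, alpha values can be 0 or greater (default: uniformly distributed as 1.0/k), allowing for more flexible topic initialization.
KEEP_LAST_CHECKPOINT: Whether to keep the last checkpoint when using the em optimizer. Deleting the checkpoint can cause failures if a data partition is lost. Checkpoints are automatically removed from storage when they are no longer needed, as determined by reference counting. Default: true. Possible values: true, false.
LEARNING_DECAY: The learning rate for the online optimizer, set as an exponential decay rate in the range (0.5, 1.0].
LEARNING_OFFSET: A learning parameter for the online optimizer that downweights early iterations so that they count less.
SEED: The random seed.
OPTIMIZE_DOC_CONCENTRATION: For the online optimizer: whether to optimize the docConcentration (Dirichlet parameter for document-topic distribution) during training. Default: false. Possible values: true, false.
SUBSAMPLING_RATE: For the online optimizer: the fraction of the corpus sampled and used in each iteration of mini-batch gradient descent, in the range (0, 1].
TOPIC_CONCENTRATION: The concentration parameter for the prior placed on each topic's distribution over terms. For em, values > 1.0 (default = 0.1 + 1). For online, values ≥ 0 (default = 1.0/k).
TOPIC_DISTRIBUTION_COL: The name of the output column that stores the estimated topic mixture distribution for each document.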
Example
CREATE MODEL modelname
OPTIONS(
type = 'lda'
)
AS SELECT col1, col2, col3 FROM training-dataset;
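As a hedged sketch, the statement below selects the online optimizer and sets two of the parameters that apply only to it. The values are illustrative and were chosen to fall inside the documented ranges: LEARNING_DECAY in (0.5, 1.0] and SUBSAMPLING_RATE in (0, 1]. Quoting the optimizer name the same way as type is an assumption. The names modelname, col1 to col3, and training-dataset are placeholders.

```sql
-- Sketch only: option values are illustrative, not recommendations.
-- Assumption: the optimizer name is quoted like the type value.
CREATE MODEL modelname
OPTIONS(
  type = 'lda',
  OPTIMIZER = 'online',
  NUM_CLUSTERS = 10,
  MAX_ITER = 50,
  LEARNING_DECAY = 0.7,
  SUBSAMPLING_RATE = 0.1,
  SEED = 42
)
AS SELECT col1, col2, col3 FROM training-dataset;
```

With a SUBSAMPLING_RATE of 0.1, each iteration would use roughly a tenth of the corpus, so a smaller rate generally calls for more iterations to see the full corpus.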
Next steps
After reading this document, you now know how to configure and use various clustering algorithms. Next, refer to the documents on classification and regression to learn about other types of advanced statistical models.