# Classification algorithms {#classification-algorithms}
This document provides an overview of various classification algorithms, focusing on their configuration, key parameters, and practical usage in advanced statistical models. Classification algorithms assign categories to data points based on input features. Each section includes parameter descriptions and example code to help you implement and optimize algorithms such as decision trees, random forests, and naive Bayes classification.
## Decision Tree Classifier {#decision-tree-classifier}
Decision Tree Classifier is a supervised learning approach used in statistics, data mining, and machine learning. In this approach, a decision tree is used as a predictive model for classification tasks, drawing conclusions from a set of observations.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of a Decision Tree Classifier.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. | 32 | Must be at least 2, and at least the number of categories in any categorical feature. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. | `false` | `true`, `false` |
| `CHECKPOINT_INTERVAL` | Specifies how often to checkpoint the cached node IDs. For example, `10` means that the cache is checkpointed every 10 iterations. | 10 | `-1` disables checkpointing; otherwise any value `>= 1`. |
| `IMPURITY` | The criterion used for information gain calculation (case-insensitive). | `gini` | `entropy`, `gini` |
| `MAX_DEPTH` | The maximum depth of the tree (non-negative). For example, depth `0` means 1 leaf node, and depth `1` means 1 internal node and 2 leaf nodes. | 5 | Integers in the range `[0, 30]`. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. | 0.0 | Any non-negative number. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. A split that leaves either child with less than this fraction of the total weight is discarded. | 0.0 | Values in the range `[0.0, 0.5)`. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. A split that leaves either child with fewer instances is discarded. | 1 | Any integer `>= 1`. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. Larger values allow more nodes to be split per iteration. | 256 | Any positive integer. |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `SEED` | The random seed. | None | Any 64-bit number. |
| `WEIGHT_COL` | The name of the column used for instance weights. If not set or empty, all instance weights are treated as `1.0`. | Not set | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'decision_tree_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
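For a more concrete starting point, the sketch below sets several of the parameters from the table above. It assumes the `OPTIONS` clause accepts these parameters as key-value pairs; the model name, columns, and values are illustrative placeholders, not recommendations.

```sql
-- Hedged tuning sketch: parameter names follow the table above, and the
-- values shown are illustrative.
CREATE MODEL decision_tree_model OPTIONS(
  type = 'decision_tree_classifier',
  MAX_DEPTH = 8,               -- allow deeper trees than the default of 5
  IMPURITY = 'entropy',        -- use entropy instead of gini for information gain
  MIN_INSTANCES_PER_NODE = 5,  -- discard splits that isolate very small groups
  SEED = 42                    -- fix the seed for reproducible training
) AS
SELECT col1, col2, col3 FROM training-dataset
```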
## Factorization Machine Classifier {#factorization-machine-classifier}
The Factorization Machine Classifier is a classification algorithm that supports normal gradient descent and the AdamW solver. The Factorization Machine classification model uses logistic loss, which can be optimized via gradient descent, and often includes regularization terms like L2 to prevent overfitting.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of the Factorization Machine Classifier.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `TOL` | The convergence tolerance for the iterative optimization. Smaller values give higher accuracy at the cost of more iterations. | `1E-6` | Any positive number. |
| `FACTOR_SIZE` | The dimensionality of the factors. | 8 | Any positive integer. |
| `FIT_INTERCEPT` | Indicates whether to fit an intercept term. | `true` | `true`, `false` |
| `FIT_LINEAR` | Indicates whether to fit the linear (1-way) term. | `true` | `true`, `false` |
| `INIT_STD` | The standard deviation used to initialize the coefficients. | 0.01 | Any positive number. |
| `MAX_ITER` | The maximum number of iterations to run. | 100 | Any non-negative integer. |
| `MINI_BATCH_FRACTION` | The fraction of data used in each mini-batch. Must be in the range `(0, 1]`. | 1.0 | Values in the range `(0, 1]`. |
| `REG_PARAM` | The regularization parameter. | 0.0 | Any non-negative number. |
| `SEED` | The random seed. | None | Any 64-bit number. |
| `SOLVER` | The solver algorithm used for optimization. Supported options are `gd` (gradient descent) and `adamW`. | `adamW` | `gd`, `adamW` |
| `STEP_SIZE` | The initial step size (learning rate) for the first optimization step. | 1.0 | Any positive number. |
| `PROBABILITY_COL` | The name of the column used to store predicted class conditional probabilities. | `"probability"` | Any string. |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `RAW_PREDICTION_COL` | The name of the column used to store raw prediction (confidence) values. | `"rawPrediction"` | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'factorization_machines_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
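If you want to tune the optimizer, a hedged sketch follows. It assumes `OPTIONS` accepts the parameters from the table above as key-value pairs; all names and values are illustrative.

```sql
-- Hedged sketch: switches the solver to plain gradient descent and
-- adjusts the factor size; all values are illustrative.
CREATE MODEL fm_model OPTIONS(
  type = 'factorization_machines_classifier',
  FACTOR_SIZE = 16,  -- dimensionality of the pairwise factors
  SOLVER = 'gd',     -- gradient descent instead of the adamW solver
  STEP_SIZE = 0.01,  -- smaller learning rate for stability
  MAX_ITER = 200     -- more iterations to compensate for the smaller step
) AS
SELECT col1, col2, col3 FROM training-dataset
```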
## Gradient Boosted Tree Classifier {#gradient-boosted-tree-classifier}
The Gradient Boosted Tree Classifier uses an ensemble of decision trees to improve the accuracy of classification tasks, combining multiple trees to enhance model performance.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of the Gradient Boosted Tree Classifier.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. | 32 | Must be at least 2, and at least the number of categories in any categorical feature. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. | `false` | `true`, `false` |
| `CHECKPOINT_INTERVAL` | Specifies how often to checkpoint the cached node IDs. For example, `10` means that the cache is checkpointed every 10 iterations. | 10 | `-1` disables checkpointing; otherwise any value `>= 1`. |
| `MAX_DEPTH` | The maximum depth of the tree (non-negative). For example, depth `0` means 1 leaf node, and depth `1` means 1 internal node and 2 leaf nodes. | 5 | Integers in the range `[0, 30]`. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. | 0.0 | Any non-negative number. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. A split that leaves either child with less than this fraction of the total weight is discarded. | 0.0 | Values in the range `[0.0, 0.5)`. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. A split that leaves either child with fewer instances is discarded. | 1 | Any integer `>= 1`. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. Larger values allow more nodes to be split per iteration. | 256 | Any positive integer. |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `VALIDATION_INDICATOR_COL` | The name of the column that indicates whether each row is used for training or validation: `false` indicates training, and `true` indicates validation. If a value is not set, the default value is `None`. | None | Any string. |
| `RAW_PREDICTION_COL` | The name of the column used to store raw prediction (confidence) values. | `"rawPrediction"` | Any string. |
| `LEAF_COL` | The name of the column that stores the predicted leaf index of each instance in each tree. | Not set | Any string. |
| `FEATURE_SUBSET_STRATEGY` | The number of features to consider for splits at each tree node. Supported options are `auto` (automatically determined based on the task), `all` (use all features), `onethird` (use one-third of the features), `sqrt` (use the square root of the number of features), `log2` (use the base-2 logarithm of the number of features), and `n` (where n is either a fraction of the features if in the range `(0, 1]`, or a specific number of features if in the range `[1, total number of features]`). | `auto` | `auto`, `all`, `onethird`, `sqrt`, `log2`, `n` |
| `WEIGHT_COL` | The name of the column used for instance weights. If not set or empty, all instance weights are treated as `1.0`. | Not set | Any string. |
| `LOSS_TYPE` | The loss function the algorithm minimizes (case-insensitive). | `logistic` | `logistic` (case-insensitive) |
| `STEP_SIZE` | The step size (learning rate), in the range `(0, 1]`, used to shrink the contribution of each estimator. | 0.1 | Values in the range `(0, 1]`. |
| `MAX_ITER` | The maximum number of iterations to run. | 20 | Any non-negative integer. |
| `SUBSAMPLING_RATE` | The fraction of the training data used for learning each decision tree, in the range `(0, 1]`. | 1.0 | Values in the range `(0, 1]`. |
| `PROBABILITY_COL` | The name of the column used to store predicted class conditional probabilities. | `"probability"` | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'gradient_boosted_tree_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
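A common way to reduce overfitting in boosted trees is to lower the learning rate and subsample rows and features. The sketch below illustrates this, assuming `OPTIONS` accepts the parameters from the table above; all values are illustrative.

```sql
-- Hedged sketch: slows the learning rate and subsamples rows and
-- features to reduce overfitting; values are illustrative.
CREATE MODEL gbt_model OPTIONS(
  type = 'gradient_boosted_tree_classifier',
  MAX_ITER = 50,                    -- number of boosting iterations (trees)
  STEP_SIZE = 0.05,                 -- shrink each tree's contribution
  SUBSAMPLING_RATE = 0.8,           -- train each tree on 80% of the rows
  FEATURE_SUBSET_STRATEGY = 'sqrt'  -- consider sqrt(num features) per split
) AS
SELECT col1, col2, col3 FROM training-dataset
```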
## Linear Support Vector Classifier (LinearSVC) {#linear-support-vector-classifier}
The Linear Support Vector Classifier (LinearSVC) constructs a hyperplane to classify data in a high-dimensional space. It maximizes the margin between classes to minimize classification errors.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of the Linear Support Vector Classifier (LinearSVC).

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MAX_ITER` | The maximum number of iterations to run. | 100 | Any non-negative integer. |
| `AGGREGATION_DEPTH` | The suggested depth for tree aggregation. | 2 | Any integer `>= 2`. |
| `FIT_INTERCEPT` | Indicates whether to fit an intercept term. | `true` | `true`, `false` |
| `TOL` | The convergence tolerance for the iterative optimization. | `1E-6` | Any positive number. |
| `MAX_BLOCK_SIZE_IN_MB` | The maximum memory, in MB, for stacking input data into blocks. If set to `0`, the optimal value is automatically chosen (usually around 1 MB). | 0.0 | Any non-negative number. |
| `REG_PARAM` | The regularization parameter. | 0.0 | Any non-negative number. |
| `STANDARDIZATION` | Indicates whether to standardize the training features before fitting the model. | `true` | `true`, `false` |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `RAW_PREDICTION_COL` | The name of the column used to store raw prediction (confidence) values. | `"rawPrediction"` | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'linear_svc_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
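To control the margin/error trade-off, you typically adjust the regularization strength. The sketch below assumes `OPTIONS` accepts the parameters from the table above; the values are illustrative.

```sql
-- Hedged sketch: adds regularization and raises the iteration cap;
-- values are illustrative.
CREATE MODEL svc_model OPTIONS(
  type = 'linear_svc_classifier',
  MAX_ITER = 200,         -- allow more iterations to converge
  REG_PARAM = 0.1,        -- regularization strength
  STANDARDIZATION = true  -- standardize features before fitting
) AS
SELECT col1, col2, col3 FROM training-dataset
```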
## Logistic Regression {#logistic-regression}
Logistic Regression is a supervised algorithm used for binary classification tasks. It models the probability that an instance belongs to a class using the logistic function and assigns the instance to the class with the higher probability. This makes it suitable for problems where the goal is to separate data into one of two categories.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of Logistic Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MAX_ITER` | The maximum number of iterations to run. | 100 | Any non-negative integer. |
| `REGPARAM` | The regularization parameter. | 0.0 | Any non-negative number. |
| `ELASTICNETPARAM` | The ElasticNet mixing parameter, which controls the balance between L1 (Lasso) and L2 (Ridge) penalties. A value of `0` applies an L2 penalty (Ridge, which reduces the size of coefficients), while a value of `1` applies an L1 penalty (Lasso, which encourages sparsity by setting some coefficients to zero). | 0.0 | Values in the range `[0, 1]`. |

**Example**
```sql
CREATE MODEL modelname OPTIONS(
  type = 'logistic_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
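To use a blend of the L1 and L2 penalties described above, set the mixing parameter between `0` and `1`. This sketch assumes `OPTIONS` accepts the parameters from the table above; the values are illustrative.

```sql
-- Hedged sketch: an elastic-net blend of L1 and L2 penalties;
-- values are illustrative.
CREATE MODEL logistic_model OPTIONS(
  type = 'logistic_reg',
  MAX_ITER = 100,        -- iteration cap
  REGPARAM = 0.1,        -- overall regularization strength
  ELASTICNETPARAM = 0.5  -- 0 = pure L2 (Ridge), 1 = pure L1 (Lasso)
) AS
SELECT col1, col2, col3 FROM training-dataset
```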
## Multilayer Perceptron Classifier {#multilayer-perceptron-classifier}
The Multilayer Perceptron Classifier (MLPC) is a feedforward artificial neural network classifier. It consists of multiple fully connected layers of nodes, where each node applies a weighted linear combination of inputs, followed by an activation function. MLPC is used for complex classification tasks requiring non-linear decision boundaries.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of the Multilayer Perceptron Classifier.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MAX_ITER` | The maximum number of iterations to run. | 100 | Any non-negative integer. |
| `BLOCK_SIZE` | The block size for stacking input data in matrices within partitions, which can speed up computation. | 128 | Any positive integer. |
| `STEP_SIZE` | The step size (learning rate) used for each iteration of optimization (applies only to solver `gd`). | 0.03 | Any positive number. |
| `TOL` | The convergence tolerance for the iterative optimization. | `1E-6` | Any positive number. |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `SEED` | The random seed. | None | Any 64-bit number. |
| `PROBABILITY_COL` | The name of the column used to store predicted class conditional probabilities. | `"probability"` | Any string. |
| `RAW_PREDICTION_COL` | The name of the column used to store raw prediction (confidence) values. | `"rawPrediction"` | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'multilayer_perceptron_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
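For training runs that need more optimization budget, the sketch below tunes the iteration cap and step size. It assumes `OPTIONS` accepts the parameters from the table above; the values are illustrative.

```sql
-- Hedged sketch: tunes the optimization budget and learning rate;
-- values are illustrative.
CREATE MODEL mlp_model OPTIONS(
  type = 'multilayer_perceptron_classifier',
  MAX_ITER = 200,    -- more optimization iterations
  BLOCK_SIZE = 128,  -- stacking block size within partitions
  STEP_SIZE = 0.03,  -- learning rate (applies to the gd solver)
  SEED = 42          -- reproducible weight initialization
) AS
SELECT col1, col2, col3 FROM training-dataset
```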
## Naive Bayes Classifier {#naive-bayes-classifier}
Naive Bayes Classifier is a simple probabilistic, multiclass classifier based on Bayes' theorem with strong (naive) independence assumptions between features. It trains efficiently, making a single pass over the training data to compute the conditional probability distribution of each feature given each label. For predictions, it applies Bayes' theorem to compute the conditional probability distribution of each label given an observation.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of the Naive Bayes Classifier.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MODEL_TYPE` | The model type. Supported options are `"multinomial"`, `"complement"`, `"bernoulli"`, and `"gaussian"`. Model type is case-sensitive. | `"multinomial"` | `"multinomial"`, `"complement"`, `"bernoulli"`, `"gaussian"` |
| `SMOOTHING` | The smoothing parameter. | 1.0 | Any non-negative number. |
| `PROBABILITY_COL` | The name of the column used to store predicted class conditional probabilities. | `"probability"` | Any string. |
| `WEIGHT_COL` | The name of the column used for instance weights. If not set or empty, all instance weights are treated as `1.0`. | Not set | Any string. |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `RAW_PREDICTION_COL` | The name of the column used to store raw prediction (confidence) values. | `"rawPrediction"` | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'naive_bayes_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
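For binary or boolean features, the Bernoulli variant is a common choice. The sketch below assumes `OPTIONS` accepts the parameters from the table above and that model types are passed as (case-sensitive) string literals; the values are illustrative.

```sql
-- Hedged sketch: Bernoulli variant with Laplace smoothing;
-- values are illustrative.
CREATE MODEL nb_model OPTIONS(
  type = 'naive_bayes_classifier',
  MODEL_TYPE = 'bernoulli',  -- suited to binary/boolean features (case-sensitive)
  SMOOTHING = 1.0            -- Laplace (add-one) smoothing
) AS
SELECT col1, col2, col3 FROM training-dataset
```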
## Random Forest Classifier {#random-forest-classifier}
Random Forest Classifier is an ensemble learning algorithm that builds multiple decision trees during training. It mitigates overfitting by averaging predictions and selecting the class chosen by the majority of trees for classification tasks.
**Parameters**

The table below outlines key parameters for configuring and optimizing the performance of the Random Forest Classifier.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. | 32 | Must be at least 2, and at least the number of categories in any categorical feature. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, speeding up training. | `false` | `true`, `false` |
| `CHECKPOINT_INTERVAL` | Specifies how often to checkpoint the cached node IDs. For example, `10` means that the cache is checkpointed every 10 iterations. | 10 | `-1` disables checkpointing; otherwise any value `>= 1`. |
| `IMPURITY` | The criterion used for information gain calculation (case-insensitive). | `gini` | `entropy`, `gini` |
| `MAX_DEPTH` | The maximum depth of the tree (non-negative). For example, depth `0` means 1 leaf node, and depth `1` means 1 internal node and 2 leaf nodes. | 5 | Integers in the range `[0, 30]`. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. | 0.0 | Any non-negative number. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. A split that leaves either child with less than this fraction of the total weight is discarded. | 0.0 | Values in the range `[0.0, 0.5)`. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. A split that leaves either child with fewer instances is discarded. | 1 | Any integer `>= 1`. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. Larger values allow more nodes to be split per iteration. | 256 | Any positive integer. |
| `PREDICTION_COL` | The name of the column used to store predictions. | `"prediction"` | Any string. |
| `WEIGHT_COL` | The name of the column used for instance weights. If not set or empty, all instance weights are treated as `1.0`. | Not set | Any string. |
| `SEED` | The random seed. | None | Any 64-bit number. |
| `BOOTSTRAP` | Indicates whether bootstrap samples are used when building trees. | `true` | `true`, `false` |
| `NUM_TREES` | The number of trees to train. If `1`, then no bootstrapping is used. If greater than `1`, bootstrapping is applied. | 20 | Any integer `>= 1`. |
| `SUBSAMPLING_RATE` | The fraction of the training data used for learning each decision tree, in the range `(0, 1]`. | 1.0 | Values in the range `(0, 1]`. |
| `LEAF_COL` | The name of the column that stores the predicted leaf index of each instance in each tree. | Not set | Any string. |
| `PROBABILITY_COL` | The name of the column used to store predicted class conditional probabilities. | `"probability"` | Any string. |
| `RAW_PREDICTION_COL` | The name of the column used to store raw prediction (confidence) values. | `"rawPrediction"` | Any string. |
| `ONE_VS_REST` | Indicates whether One-vs-Rest is enabled. | `false` | `true`, `false` |
**Example**

```sql
CREATE MODEL modelname OPTIONS(
  type = 'random_forest_classifier'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
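Since a random forest reduces variance by averaging many trees, the main levers are the number of trees and how each tree samples the data. The sketch below assumes `OPTIONS` accepts the parameters from the table above; the values are illustrative.

```sql
-- Hedged sketch: a larger forest with row subsampling;
-- values are illustrative.
CREATE MODEL rf_model OPTIONS(
  type = 'random_forest_classifier',
  NUM_TREES = 100,         -- more trees reduce prediction variance
  MAX_DEPTH = 10,          -- cap the depth of each tree
  SUBSAMPLING_RATE = 0.8,  -- each tree sees 80% of the training rows
  BOOTSTRAP = true         -- sample rows with replacement
) AS
SELECT col1, col2, col3 FROM training-dataset
```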
## Next steps
After reading this document, you now know how to configure and use various classification algorithms. Next, refer to the documents on regression and clustering to learn about other types of advanced statistical models.