Regression algorithms

This document provides an overview of various regression algorithms, focusing on their configuration, key parameters, and practical usage in advanced statistical models. Regression algorithms model the relationship between dependent and independent variables to predict continuous outcomes from observed data. Each section includes parameter descriptions and example code to help you implement and optimize algorithms such as linear, random forest, and survival regression.

Decision Tree regression

Decision Tree learning is a supervised learning method used in statistics, data mining, and machine learning. In this approach, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of decision tree models.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_BINS | The maximum number of bins used to discretize continuous features and determine splits at each node. More bins result in finer granularity and detail. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
| CACHE_NODE_IDS | Whether to cache node IDs for each instance. If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. Use CHECKPOINT_INTERVAL to configure how often the cache is checkpointed. | false | true, false |
| CHECKPOINT_INTERVAL | How often the cached node IDs are checkpointed. For example, setting it to 10 means the cache is checkpointed every 10 iterations. Applicable only if CACHE_NODE_IDS is true and the checkpoint directory is configured in org.apache.spark.SparkContext. | 10 | (>= 1) |
| IMPURITY | The criterion used for calculating information gain. | gini | entropy, gini |
| MAX_DEPTH | The maximum depth of the tree. For example, a depth of 0 means 1 leaf node, while a depth of 1 means 1 internal node and 2 leaf nodes. | 5 | [0, 30] |
| MIN_INFO_GAIN | The minimum information gain required for a split to be considered valid at a tree node. | 0.0 | (>= 0.0) |
| MIN_WEIGHT_FRACTION_PER_NODE | The minimum fraction of the weighted sample count that each child node must have after a split. If either child node has a fraction less than this value, the split is discarded. | 0.0 | [0.0, 0.5] |
| MIN_INSTANCES_PER_NODE | The minimum number of instances that each child node must have after a split. If a split results in fewer instances than this value, it is discarded as invalid. | 1 | (>= 1) |
| MAX_MEMORY_IN_MB | The maximum memory, in megabytes (MB), allocated for histogram aggregation. If the memory is too small, only one node is split per iteration, and its aggregates may exceed this size. | 256 | Any positive integer |
| PREDICTION_COL | The name of the column used for storing predictions. | "prediction" | Any string |
| SEED | The random seed used in the model. | None | Any 64-bit number |
| WEIGHT_COL | The name of the weight column. If not set or empty, all instance weights are treated as 1.0. | Not set | Any string |

Example

CREATE MODEL modelname OPTIONS(
  type = 'decision_tree_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
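
The parameters in the table above can plausibly be tuned by passing them as additional key-value pairs in the OPTIONS clause. The following sketch assumes that syntax; the model name and parameter values are illustrative only:

CREATE MODEL dt_model OPTIONS(
  type = 'decision_tree_regression',
  MAX_DEPTH = 8,                -- allow deeper splits than the default of 5
  MIN_INSTANCES_PER_NODE = 10,  -- reject splits that leave a child with fewer than 10 rows
  SEED = 42                     -- fix the random seed for reproducible training
) AS
  SELECT col1, col2, col3 FROM training-dataset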

Factorization Machines regression

Factorization Machines is a regression learning algorithm that supports normal gradient descent and the AdamW solver. The algorithm is based on the paper by S. Rendle (2010), “Factorization Machines.”

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Factorization Machines regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| TOL | This parameter specifies the convergence tolerance for the algorithm. Higher values may result in faster convergence but less precision. | 1E-6 | (>= 0) |
| FACTOR_SIZE | This parameter defines the dimensionality of the factors. Higher values increase model complexity. | 8 | (>= 0) |
| FIT_INTERCEPT | This parameter indicates whether the model should include an intercept term. | true | true, false |
| FIT_LINEAR | This parameter specifies whether to include a linear term (also called the 1-way term) in the model. | true | true, false |
| INIT_STD | This parameter defines the standard deviation of the initial coefficients used in the model. | 0.01 | (>= 0) |
| MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run. | 100 | (>= 0) |
| MINI_BATCH_FRACTION | This parameter sets the mini-batch fraction, which determines the portion of data used in each batch. | 1.0 | (0, 1] |
| REG_PARAM | This parameter sets the regularization parameter to prevent overfitting. | 0.0 | (>= 0) |
| SEED | This parameter specifies the random seed used for model initialization. | None | Any 64-bit number |
| SOLVER | This parameter specifies the solver algorithm used for optimization. | "adamW" | gd (gradient descent), adamW |
| STEP_SIZE | This parameter specifies the initial step size (or learning rate) for the first optimization step. | 1.0 | Any positive value |
| PREDICTION_COL | This parameter specifies the name of the column where predictions are stored. | "prediction" | Any string |

Example

CREATE MODEL modelname OPTIONS(
  type = 'factorization_machines_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
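
A hedged sketch showing how the solver and factor settings might be supplied, assuming the OPTIONS clause accepts the parameters listed above as key-value pairs (the model name and values are illustrative):

CREATE MODEL fm_model OPTIONS(
  type = 'factorization_machines_regression',
  FACTOR_SIZE = 16,   -- larger factor dimensionality than the default of 8
  SOLVER = 'gd',      -- plain gradient descent instead of the default adamW
  STEP_SIZE = 0.1,    -- smaller initial learning rate
  REG_PARAM = 0.01    -- light regularization to reduce overfitting
) AS
  SELECT col1, col2, col3 FROM training-dataset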

Generalized Linear regression

Unlike linear regression, which assumes that the outcome follows a normal (Gaussian) distribution, Generalized Linear Models (GLMs) allow the outcome to follow different types of distributions, such as Poisson or binomial, depending on the nature of the data.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Generalized Linear regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_ITER | Sets the maximum number of iterations (applicable when using the solver irls). | 25 | (>= 0) |
| REG_PARAM | The regularization parameter. | Not set | (>= 0) |
| TOL | The convergence tolerance. | 1E-6 | (>= 0) |
| AGGREGATION_DEPTH | The suggested depth for treeAggregate. | 2 | (>= 2) |
| FAMILY | The family parameter, describing the error distribution used in the model. | "gaussian" | gaussian, binomial, poisson, gamma, tweedie |
| FIT_INTERCEPT | Whether to fit an intercept term. | true | true, false |
| LINK | The link function, which defines the relationship between the linear predictor and the mean of the distribution function. | Not set | identity, log, inverse, logit, probit, cloglog, sqrt |
| LINK_POWER | The index in the power link function, applicable only to the Tweedie family. If not set, it defaults to 1 - variancePower, which aligns with the R statmod package. Link powers of 0, 1, -1, and 0.5 correspond to the log, identity, inverse, and sqrt links, respectively. | 1 | Any numeric value |
| SOLVER | The solver algorithm used for optimization. | "irls" | irls (iteratively reweighted least squares) |
| VARIANCE_POWER | The power in the variance function of the Tweedie distribution, defining the relationship between variance and mean. Variance powers of 0, 1, and 2 correspond to the Gaussian, Poisson, and Gamma families, respectively. | 0.0 | 0, [1, inf) |
| LINK_PREDICTION_COL | The link prediction (linear predictor) column name. | Not set | Any string |
| OFFSET_COL | The offset column name. If not set, all instance offsets are treated as 0.0. The offset feature has a constant coefficient of 1.0. | Not set | Any string |
| WEIGHT_COL | The weight column name. If not set or empty, all instance weights are treated as 1.0. In the Binomial family, weights correspond to the number of trials, and non-integer weights are rounded in the AIC calculation. | Not set | Any string |

Example

CREATE MODEL modelname OPTIONS(
  type = 'generalized_linear_reg'
) AS
  SELECT col1, col2, col3 FROM training-dataset
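
For example, count data could be modeled by selecting the poisson family with its canonical log link. The sketch below assumes the table's parameters can be passed through OPTIONS; the model name and values are illustrative:

CREATE MODEL glm_count_model OPTIONS(
  type = 'generalized_linear_reg',
  FAMILY = 'poisson',   -- error distribution suited to count outcomes
  LINK = 'log',         -- canonical link for the poisson family
  MAX_ITER = 50         -- allow more irls iterations than the default of 25
) AS
  SELECT col1, col2, col3 FROM training-dataset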

Gradient Boosted Tree regression

Gradient-boosted trees (GBTs) are an effective method for classification and regression that combines the predictions of multiple decision trees to improve predictive accuracy and model performance.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Gradient Boosted Tree regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_BINS | The maximum number of bins used to divide continuous features into discrete intervals, which helps determine how features are split at each decision tree node. More bins provide higher granularity. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
| CACHE_NODE_IDS | If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. | false | true, false |
| CHECKPOINT_INTERVAL | Specifies how often to checkpoint the cached node IDs. For example, 10 means the cache is checkpointed every 10 iterations. | 10 | (>= 1) |
| MAX_DEPTH | The maximum depth of the tree (non-negative). For example, depth 0 means 1 leaf node, and depth 1 means 1 internal node with 2 leaf nodes. | 5 | (>= 0) |
| MIN_INFO_GAIN | The minimum information gain required for a split to be considered at a tree node. | 0.0 | (>= 0.0) |
| MIN_WEIGHT_FRACTION_PER_NODE | The minimum fraction of the weighted sample count that each child must have after a split. If a split causes the fraction of the total weight in either child to be less than this value, it is discarded. | 0.0 | [0.0, 0.5] |
| MIN_INSTANCES_PER_NODE | The minimum number of instances each child must have after a split. If a split results in fewer instances than this value, the split is discarded. | 1 | (>= 1) |
| MAX_MEMORY_IN_MB | The maximum memory, in MB, allocated to histogram aggregation. If this value is too small, only one node is split per iteration, and its aggregates may exceed this size. | 256 | Any positive integer |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| VALIDATION_INDICATOR_COL | The name of the column indicating whether each row is used for training (false) or validation (true). | Not set | Any string |
| LEAF_COL | The column name for leaf indices: the predicted leaf index of each instance in each tree, generated by preorder traversal. | "" | Any string |
| FEATURE_SUBSET_STRATEGY | The number of features to consider for splits at each tree node. | "auto" | auto, all, onethird, sqrt, log2, or a fraction between 0 and 1.0 |
| SEED | The random seed. | Not set | Any 64-bit number |
| WEIGHT_COL | The column name for instance weights. If not set or empty, all instance weights are treated as 1.0. | Not set | Any string |
| LOSS_TYPE | The loss function that the Gradient Boosted Tree model minimizes. | "squared" | squared (L2), absolute (L1). Values are case-insensitive. |
| STEP_SIZE | The step size (also known as the learning rate), used to shrink the contribution of each estimator. | 0.1 | (0, 1] |
| MAX_ITER | The maximum number of iterations for the algorithm. | 20 | (>= 0) |
| SUBSAMPLING_RATE | The fraction of the training data used to learn each decision tree. | 1.0 | (0, 1] |

Example

CREATE MODEL modelname OPTIONS(
  type = 'gradient_boosted_tree_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
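
The loss function and learning rate are often the first parameters to tune. This sketch assumes the OPTIONS clause accepts the parameters above as key-value pairs; the model name and values are illustrative:

CREATE MODEL gbt_model OPTIONS(
  type = 'gradient_boosted_tree_regression',
  LOSS_TYPE = 'absolute',  -- L1 loss, less sensitive to outliers than the default squared (L2) loss
  STEP_SIZE = 0.05,        -- smaller learning rate to shrink each tree's contribution
  MAX_ITER = 100           -- more boosting iterations to compensate for the smaller step size
) AS
  SELECT col1, col2, col3 FROM training-dataset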

Isotonic regression

Isotonic Regression is an algorithm that fits a free-form, monotonically increasing (or decreasing) function to the data, adjusting fitted values while preserving the relative order of the observations.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Isotonic Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| ISOTONIC | Specifies whether the output sequence should be isotonic (increasing) when true or antitonic (decreasing) when false. | true | true, false |
| WEIGHT_COL | The column name for instance weights. If not set or empty, all instance weights are treated as 1.0. | Not set | Any string |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| FEATURE_INDEX | The index of the feature to use, applicable when featuresCol is a vector column; it has no effect otherwise. | 0 | Any non-negative integer |

Example

CREATE MODEL modelname OPTIONS(
  type = 'isotonic_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
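
To fit a decreasing (antitonic) sequence instead of the default increasing one, the ISOTONIC flag would plausibly be set in OPTIONS, as sketched below with an illustrative model name:

CREATE MODEL antitonic_model OPTIONS(
  type = 'isotonic_regression',
  ISOTONIC = false   -- fit an antitonic (decreasing) sequence instead of an isotonic one
) AS
  SELECT col1, col2, col3 FROM training-dataset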

Linear regression

Linear Regression is a supervised machine learning algorithm that fits a linear equation to data in order to model the relationship between a dependent variable and independent features.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Linear Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_ITER | The maximum number of iterations. | 100 | (>= 0) |
| REGPARAM | The regularization parameter used to control the complexity of the model. | 0.0 | (>= 0) |
| ELASTICNETPARAM | The ElasticNet mixing parameter, which controls the balance between L1 (Lasso) and L2 (Ridge) penalties. A value of 0 applies an L2 penalty, while a value of 1 applies an L1 penalty. | 0.0 | [0, 1] |

Example

CREATE MODEL modelname OPTIONS(
  type = 'linear_reg'
) AS
  SELECT col1, col2, col3 FROM training-dataset
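
To apply an elastic net penalty, combine REGPARAM with ELASTICNETPARAM. The sketch below assumes these can be passed as key-value pairs in OPTIONS; the model name and values are illustrative:

CREATE MODEL elastic_net_model OPTIONS(
  type = 'linear_reg',
  MAX_ITER = 200,         -- more iterations than the default of 100
  REGPARAM = 0.1,         -- overall regularization strength
  ELASTICNETPARAM = 0.5   -- evenly mix the L1 (Lasso) and L2 (Ridge) penalties
) AS
  SELECT col1, col2, col3 FROM training-dataset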

Random Forest Regression

Random Forest Regression is an ensemble algorithm that builds multiple decision trees during training and returns the average prediction of those trees for regression tasks, helping to prevent overfitting.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Random Forest Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_BINS | The maximum number of bins used to discretize continuous features and determine how features are split at each node. More bins provide higher granularity. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
| CACHE_NODE_IDS | If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. | false | true, false |
| CHECKPOINT_INTERVAL | Specifies how often to checkpoint the cached node IDs. For example, 10 means the cache is checkpointed every 10 iterations. | 10 | (>= 1) |
| IMPURITY | The criterion used for information gain calculation (case-insensitive). | "entropy" | entropy, gini |
| MAX_DEPTH | The maximum depth of the tree (non-negative). For example, depth 0 means 1 leaf node, and depth 1 means 1 internal node and 2 leaf nodes. | 5 | Any non-negative integer |
| MIN_INFO_GAIN | The minimum information gain required for a split to be considered at a tree node. | 0.0 | (>= 0.0) |
| MIN_WEIGHT_FRACTION_PER_NODE | The minimum fraction of the weighted sample count that each child must have after a split. If a split causes the fraction of the total weight in either child to be less than this value, it is discarded. | 0.0 | [0.0, 0.5] |
| MIN_INSTANCES_PER_NODE | The minimum number of instances each child must have after a split. If a split results in fewer instances than this value, the split is discarded. | 1 | (>= 1) |
| MAX_MEMORY_IN_MB | The maximum memory, in MB, allocated to histogram aggregation. If this value is too small, only one node is split per iteration, and its aggregates may exceed this size. | 256 | (>= 1) |
| BOOTSTRAP | Whether to use bootstrap samples when building trees. | true | true, false |
| NUM_TREES | The number of trees to train (at least 1). If 1, no bootstrapping is used. If greater than 1, bootstrapping is applied. | 20 | (>= 1) |
| SUBSAMPLING_RATE | The fraction of the training data used to train each decision tree. | 1.0 | (0, 1] |
| LEAF_COL | The column name for leaf indices: the predicted leaf index of each instance in each tree, generated by preorder traversal. | "" | Any string |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| SEED | The random seed. | Not set | Any 64-bit number |
| WEIGHT_COL | The column name for instance weights. If not set or empty, all instance weights are treated as 1.0. | Not set | Any valid column name, or leave empty. |

Example

CREATE MODEL modelname OPTIONS(
  type = 'random_forest_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
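
A common starting point is to increase the number of trees and subsample the training data for each of them. The following sketch assumes the parameters above can be supplied in OPTIONS; the model name and values are illustrative:

CREATE MODEL rf_model OPTIONS(
  type = 'random_forest_regression',
  NUM_TREES = 50,           -- more trees than the default of 20 for a more stable average
  SUBSAMPLING_RATE = 0.8,   -- train each tree on 80% of the data
  MAX_DEPTH = 10            -- allow deeper trees than the default of 5
) AS
  SELECT col1, col2, col3 FROM training-dataset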

Survival Regression

Survival Regression is used to fit a parametric survival regression model, known as the Accelerated Failure Time (AFT) model, based on the Weibull distribution. It can stack instances into blocks for enhanced performance.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Survival Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_ITER | The maximum number of iterations that the algorithm should run. | 100 | (>= 0) |
| TOL | The convergence tolerance. | 1E-6 | (>= 0) |
| AGGREGATION_DEPTH | The suggested depth for treeAggregate. If the feature dimensions or the number of partitions are large, this parameter can be set to a larger value. | 2 | (>= 2) |
| FIT_INTERCEPT | Whether to fit an intercept term. | true | true, false |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| CENSOR_COL | The column name for censoring. A value of 1 in this column indicates that the event has occurred (uncensored), while 0 means the event is censored. | "censor" | Any string (the column's values must be 0 or 1) |
| MAX_BLOCK_SIZE_IN_MB | The maximum memory, in MB, for stacking input data into blocks. If the remaining data size in a partition is smaller, this value is adjusted accordingly. A value of 0 allows automatic adjustment. | 0.0 | (>= 0) |

Example

CREATE MODEL modelname OPTIONS(
  type = 'survival_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
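
Because AFT models need to know which observations are censored, the censoring column is usually configured explicitly. The sketch below assumes CENSOR_COL can be set in OPTIONS and that the selected data contains a 0/1 column named event_observed; all names and values are hypothetical:

CREATE MODEL aft_model OPTIONS(
  type = 'survival_regression',
  CENSOR_COL = 'event_observed',  -- hypothetical 0/1 column: 1 = event occurred, 0 = censored
  MAX_ITER = 200                  -- allow more iterations than the default of 100
) AS
  SELECT col1, col2, col3, event_observed FROM training-dataset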

Next steps

After reading this document, you now know how to configure and use various regression algorithms. Next, refer to the documents on classification and clustering to learn about other types of advanced statistical models.
