Regression algorithms
This document provides an overview of various regression algorithms, focusing on their configuration, key parameters, and practical usage in advanced statistical models. Regression algorithms are used to model the relationship between dependent and independent variables, predicting continuous outcomes based on observed data. Each section includes parameter descriptions and example code to help you implement and optimize algorithms such as linear, random forest, and survival regression.
Decision Tree regression
Decision Tree learning is a supervised learning method used in statistics, data mining, and machine learning. In this approach, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of decision tree models.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
MAX_BINS | This parameter specifies the maximum number of bins used to discretize continuous features and determine splits at each node. More bins result in finer granularity and detail. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
CACHE_NODE_IDS | This parameter determines whether to cache node IDs for each instance. If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. Users can configure how often the cache should be checkpointed or disable it by setting CHECKPOINT_INTERVAL. | false | true or false |
CHECKPOINT_INTERVAL | This parameter specifies how often the cached node IDs should be checkpointed. For example, setting it to 10 means the cache will be checkpointed every 10 iterations. This is only applicable if CACHE_NODE_IDS is set to true and the checkpoint directory is configured in org.apache.spark.SparkContext. | 10 | (>=1) |
IMPURITY | This parameter specifies the criterion used for calculating information gain. Supported values are entropy and gini. | gini | entropy, gini |
MAX_DEPTH | This parameter specifies the maximum depth of the tree. For example, a depth of 0 means 1 leaf node, while a depth of 1 means 1 internal node and 2 leaf nodes. The depth must be within the range [0, 30]. | 5 | [0, 30] |
MIN_INFO_GAIN | This parameter sets the minimum information gain required for a split to be considered valid at a tree node. | 0.0 | (>=0.0) |
MIN_WEIGHT_FRACTION_PER_NODE | This parameter specifies the minimum fraction of the weighted sample count that each child node must have after a split. If either child node has a fraction less than this value, the split will be discarded. | 0.0 | [0.0, 0.5] |
MIN_INSTANCES_PER_NODE | This parameter sets the minimum number of instances that each child node must have after a split. If a split results in fewer instances than this value, the split will be discarded as invalid. | 1 | (>=1) |
MAX_MEMORY_IN_MB | This parameter specifies the maximum memory, in megabytes (MB), allocated for histogram aggregation. If the memory is too small, only one node will be split per iteration, and its aggregates may exceed this size. | 256 | Any positive integer value |
PREDICTION_COL | This parameter specifies the name of the column used for storing predictions. | “prediction” | Any string |
SEED | This parameter sets the random seed used in the model. | None | Any 64-bit number |
WEIGHT_COL | This parameter specifies the name of the weight column. If this parameter is not set or is empty, all instance weights are treated as 1.0. | Not set | Any string |
Example
CREATE MODEL modelname OPTIONS(
type = 'decision_tree_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
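You can also tune the parameters described above in the same statement. The sketch below is illustrative: the key = value syntax inside OPTIONS mirrors the type option, and the parameter values are assumptions chosen for demonstration.
CREATE MODEL modelname OPTIONS(
type = 'decision_tree_regression',
MAX_DEPTH = 8, -- allow a deeper tree for more complex relationships
MIN_INSTANCES_PER_NODE = 5, -- require 5 instances per child node to limit overfitting
SEED = 42 -- fix the random seed for reproducible training
) AS
SELECT col1, col2, col3 FROM training-dataset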
Factorization Machines regression
Factorization Machines is a regression learning algorithm that supports normal gradient descent and the AdamW solver. The algorithm is based on the paper by S. Rendle (2010), “Factorization Machines.”
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Factorization Machines regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
TOL | This parameter specifies the convergence tolerance for the algorithm. Higher values may result in faster convergence but less precision. | 1E-6 | (>= 0) |
FACTOR_SIZE | This parameter defines the dimensionality of the factors. Higher values increase model complexity. | 8 | (>= 0) |
FIT_INTERCEPT | This parameter indicates whether the model should include an intercept term. | true | true , false |
FIT_LINEAR | This parameter specifies whether to include a linear term (also called the 1-way term) in the model. | true | true , false |
INIT_STD | This parameter defines the standard deviation of the initial coefficients used in the model. | 0.01 | (>= 0) |
MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run. | 100 | (>= 0) |
MINI_BATCH_FRACTION | This parameter sets the mini-batch fraction, which determines the portion of data used in each batch. It must be in the range (0, 1]. | 1.0 | (0, 1] |
REG_PARAM | This parameter sets the regularization parameter to prevent overfitting. | 0.0 | (>= 0) |
SEED | This parameter specifies the random seed used for model initialization. | None | Any 64-bit number |
SOLVER | This parameter specifies the solver algorithm used for optimization. | “adamW” | gd (gradient descent), adamW |
STEP_SIZE | This parameter specifies the initial step size (or learning rate) for the first optimization step. | 1.0 | Any positive value |
PREDICTION_COL | This parameter specifies the name of the column where predictions are stored. | “prediction” | Any string |
Example
CREATE MODEL modelname OPTIONS(
type = 'factorization_machines_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
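As a further illustration, the sketch below tunes the factor dimensionality and solver; the key = value OPTIONS syntax and the chosen values are assumptions for demonstration.
CREATE MODEL modelname OPTIONS(
type = 'factorization_machines_regression',
FACTOR_SIZE = 16, -- larger factor dimensionality for a more expressive model
SOLVER = 'adamW', -- use the AdamW solver rather than gradient descent
MAX_ITER = 200 -- allow more iterations to converge
) AS
SELECT col1, col2, col3 FROM training-dataset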
Generalized Linear regression
Unlike linear regression, which assumes that the outcome follows a normal (Gaussian) distribution, Generalized Linear Models (GLMs) allow the outcome to follow different types of distributions, such as Poisson or binomial, depending on the nature of the data.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Generalized Linear regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run (applicable when the solver is irls). | 25 | (>= 0) |
REG_PARAM | This parameter sets the regularization parameter to prevent overfitting. | 0.0 | (>= 0) |
TOL | This parameter specifies the convergence tolerance for the algorithm. | 1E-6 | (>= 0) |
AGGREGATION_DEPTH | This parameter specifies the suggested depth for treeAggregate. | 2 | (>= 2) |
FAMILY | This parameter describes the error distribution used in the model. Supported values are gaussian, binomial, poisson, gamma, and tweedie. | gaussian | gaussian, binomial, poisson, gamma, tweedie |
FIT_INTERCEPT | This parameter indicates whether the model should include an intercept term. | true | true, false |
LINK | This parameter specifies the link function, which defines the relationship between the linear predictor and the mean of the distribution function. Supported values are identity, log, inverse, logit, probit, cloglog, and sqrt. | Not set | identity, log, inverse, logit, probit, cloglog, sqrt |
LINK_POWER | This parameter sets the index in the power link function, applicable only to the tweedie family. Specific link powers of 0, 1, -1, and 0.5 correspond to the Log, Identity, Inverse, and Sqrt links, respectively. | 1 - variancePower, which aligns with the R statmod package | Any numeric value |
SOLVER | This parameter specifies the solver algorithm used for optimization. The supported value is irls (iteratively reweighted least squares). | irls | irls |
VARIANCE_POWER | This parameter sets the power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and the mean, and is applicable only to the tweedie family. Variance powers of 0, 1, and 2 correspond to the Gaussian, Poisson, and Gamma families, respectively. | 0.0 | 0, [1, inf) |
LINK_PREDICTION_COL | This parameter specifies the name of the column used for storing the output of the link function (the linear predictor). | Not set | Any string |
OFFSET_COL | This parameter specifies the name of the offset column. If this parameter is not set or is empty, all instance offsets are treated as 0.0. | Not set | Any string |
WEIGHT_COL | This parameter specifies the name of the weight column. If this parameter is not set or is empty, all instance weights are treated as 1.0. In the Binomial family, weights correspond to the number of trials, and non-integer weights are rounded in the AIC calculation. | Not set | Any string |
Example
CREATE MODEL modelname OPTIONS(
type = 'generalized_linear_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
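For example, to model a Tweedie-distributed outcome, you might combine FAMILY, VARIANCE_POWER, and LINK_POWER as in the sketch below; the values and the key = value OPTIONS syntax are illustrative assumptions.
CREATE MODEL modelname OPTIONS(
type = 'generalized_linear_reg',
FAMILY = 'tweedie', -- Tweedie error distribution
VARIANCE_POWER = 1.5, -- between the Poisson (1) and Gamma (2) families
LINK_POWER = 0 -- a link power of 0 corresponds to the Log link
) AS
SELECT col1, col2, col3 FROM training-dataset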
Gradient Boosted Tree regression
Gradient-boosted trees (GBTs) are an effective method for classification and regression that combines the predictions of multiple decision trees to improve predictive accuracy and model performance.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Gradient Boosted Tree regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
MAX_BINS | This parameter specifies the maximum number of bins used to discretize continuous features and determine splits at each node. More bins result in finer granularity and detail. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
CACHE_NODE_IDS | This parameter determines whether to cache node IDs for each instance. If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance. Caching can speed up training of deeper trees. | false | true, false |
CHECKPOINT_INTERVAL | This parameter specifies how often the cached node IDs should be checkpointed. For example, setting it to 10 means the cache is checkpointed every 10 iterations. | 10 | (>=1) |
MAX_DEPTH | This parameter specifies the maximum depth of the tree. For example, a depth of 0 means 1 leaf node, and a depth of 1 means 1 internal node with 2 leaf nodes. | 5 | [0, 30] |
MIN_INFO_GAIN | This parameter sets the minimum information gain required for a split to be considered valid at a tree node. | 0.0 | (>=0.0) |
MIN_WEIGHT_FRACTION_PER_NODE | This parameter specifies the minimum fraction of the weighted sample count that each child node must have after a split. If either child node has a fraction less than this value, the split will be discarded. | 0.0 | [0.0, 0.5] |
MIN_INSTANCES_PER_NODE | This parameter sets the minimum number of instances that each child node must have after a split. If a split results in fewer instances than this value, the split will be discarded as invalid. | 1 | (>=1) |
MAX_MEMORY_IN_MB | This parameter specifies the maximum memory, in megabytes (MB), allocated for histogram aggregation. | 256 | Any positive integer value |
PREDICTION_COL | This parameter specifies the name of the column used for storing predictions. | “prediction” | Any string |
VALIDATION_INDICATOR_COL | This parameter specifies the name of the column that indicates whether each row is used for training or validation, where false means training and true means validation. | Not set | Any string |
LEAF_COL | This parameter specifies the name of the column used for storing the leaf indices predicted by each tree. | “” (empty string) | Any string |
FEATURE_SUBSET_STRATEGY | This parameter specifies the number of features to consider for splits at each tree node. | all | auto, all, onethird, sqrt, log2, or a fraction between 0 and 1.0 |
SEED | This parameter sets the random seed used in the model. | None | Any 64-bit number |
WEIGHT_COL | This parameter specifies the name of the weight column. If this parameter is not set or is empty, all instance weights are treated as 1.0. | Not set | Any string |
LOSS_TYPE | This parameter specifies the loss function that the algorithm minimizes. Supported values are squared (L2) and absolute (L1). Note: Values are case-insensitive. | squared | squared, absolute |
STEP_SIZE | This parameter specifies the step size (learning rate) in the range (0, 1], used to shrink the contribution of each estimator. | 0.1 | (0, 1] |
MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run. | 20 | (>= 0) |
SUBSAMPLING_RATE | This parameter specifies the fraction of the training data used for learning each decision tree, in the range (0, 1]. | 1.0 | (0, 1] |
Example
CREATE MODEL modelname OPTIONS(
type = 'gradient_boosted_tree_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
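The sketch below tunes the boosting process itself; the parameter values and the key = value OPTIONS syntax are illustrative assumptions.
CREATE MODEL modelname OPTIONS(
type = 'gradient_boosted_tree_regression',
MAX_ITER = 50, -- train 50 boosting iterations
STEP_SIZE = 0.05, -- smaller learning rate to shrink each estimator's contribution
LOSS_TYPE = 'absolute' -- L1 loss is more robust to outliers than squared (L2) loss
) AS
SELECT col1, col2, col3 FROM training-dataset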
Isotonic regression
Isotonic Regression is an algorithm that fits a non-decreasing (or, when configured as antitonic, non-increasing) function to the data while preserving the relative order of the observations.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Isotonic Regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
ISOTONIC | This parameter specifies whether the fitted sequence is isotonic (increasing) when true or antitonic (decreasing) when false. | true | true, false |
WEIGHT_COL | This parameter specifies the name of the weight column. If this parameter is not set or is empty, all instance weights are treated as 1.0. | Not set | Any string |
PREDICTION_COL | This parameter specifies the name of the column used for storing predictions. | “prediction” | Any string |
FEATURE_INDEX | This parameter specifies the index of the feature when featuresCol is a vector column. Otherwise, it has no effect. | 0 | (>= 0) |
Example
CREATE MODEL modelname OPTIONS(
type = 'isotonic_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
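To fit a decreasing rather than increasing sequence, you could set ISOTONIC to false, as in the sketch below; the boolean literal and the key = value OPTIONS syntax are assumptions.
CREATE MODEL modelname OPTIONS(
type = 'isotonic_regression',
ISOTONIC = false -- fit an antitonic (decreasing) sequence instead of an increasing one
) AS
SELECT col1, col2, col3 FROM training-dataset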
Linear regression
Linear Regression is a supervised machine learning algorithm that fits a linear equation to data in order to model the relationship between a dependent variable and independent features.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Linear Regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run. | 100 | (>= 0) |
REGPARAM | This parameter sets the regularization parameter to prevent overfitting. | 0.0 | (>= 0) |
ELASTICNETPARAM | This parameter sets the ElasticNet mixing parameter. A value of 0 applies an L2 penalty, a value of 1 applies an L1 penalty, and values in between blend the two. | 0.0 | [0, 1] |
Example
CREATE MODEL modelname OPTIONS(
type = 'linear_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
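The sketch below adds regularization to the basic statement; the parameter values and the key = value OPTIONS syntax are illustrative assumptions.
CREATE MODEL modelname OPTIONS(
type = 'linear_reg',
MAX_ITER = 200, -- allow more iterations to converge
REGPARAM = 0.1, -- add regularization to reduce overfitting
ELASTICNETPARAM = 0.5 -- blend L1 and L2 penalties equally
) AS
SELECT col1, col2, col3 FROM training-dataset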
Random Forest Regression
Random Forest Regression is an ensemble algorithm that builds multiple decision trees during training and returns the average prediction of those trees for regression tasks, helping to prevent overfitting.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Random Forest Regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
MAX_BINS | This parameter specifies the maximum number of bins used to discretize continuous features and determine splits at each node. More bins result in finer granularity and detail. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
CACHE_NODE_IDS | This parameter determines whether to cache node IDs for each instance. If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. | false | true, false |
CHECKPOINT_INTERVAL | This parameter specifies how often the cached node IDs should be checkpointed. For example, setting it to 10 means the cache is checkpointed every 10 iterations. | 10 | (>=1) |
IMPURITY | This parameter specifies the criterion used for calculating information gain. Supported values are entropy and gini. | gini | entropy, gini |
MAX_DEPTH | This parameter specifies the maximum depth of the tree. For example, a depth of 0 means 1 leaf node, and a depth of 1 means 1 internal node and 2 leaf nodes. | 5 | [0, 30] |
MIN_INFO_GAIN | This parameter sets the minimum information gain required for a split to be considered valid at a tree node. | 0.0 | (>=0.0) |
MIN_WEIGHT_FRACTION_PER_NODE | This parameter specifies the minimum fraction of the weighted sample count that each child node must have after a split. If either child node has a fraction less than this value, the split will be discarded. | 0.0 | [0.0, 0.5] |
MIN_INSTANCES_PER_NODE | This parameter sets the minimum number of instances that each child node must have after a split. If a split results in fewer instances than this value, the split will be discarded as invalid. | 1 | (>=1) |
MAX_MEMORY_IN_MB | This parameter specifies the maximum memory, in megabytes (MB), allocated for histogram aggregation. | 256 | Any positive integer value |
BOOTSTRAP | This parameter specifies whether bootstrap samples are used when building trees. | true | true, false |
NUM_TREES | This parameter sets the number of trees to train. If set to 1, no bootstrapping is used. If greater than 1, bootstrapping is applied. | 20 | (>=1) |
SUBSAMPLING_RATE | This parameter specifies the fraction of the training data used for learning each decision tree, in the range (0, 1]. | 1.0 | (0, 1] |
LEAF_COL | This parameter specifies the name of the column used for storing the leaf indices predicted by each tree. | “” (empty string) | Any string |
PREDICTION_COL | This parameter specifies the name of the column used for storing predictions. | “prediction” | Any string |
SEED | This parameter sets the random seed used in the model. | None | Any 64-bit number |
WEIGHT_COL | This parameter specifies the name of the weight column. If this parameter is not set or is empty, all instance weights are treated as 1.0. | Not set | Any string |
Example
CREATE MODEL modelname OPTIONS(
type = 'random_forest_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
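The sketch below tunes the size and sampling of the forest; the parameter values and the key = value OPTIONS syntax are illustrative assumptions.
CREATE MODEL modelname OPTIONS(
type = 'random_forest_regression',
NUM_TREES = 100, -- more trees yield a more stable averaged prediction
SUBSAMPLING_RATE = 0.8, -- train each tree on 80% of the data
MAX_DEPTH = 10 -- allow deeper individual trees
) AS
SELECT col1, col2, col3 FROM training-dataset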
Survival Regression
Survival Regression is used to fit a parametric survival regression model, known as the Accelerated Failure Time (AFT) model, based on the Weibull distribution. It can stack instances into blocks for enhanced performance.
Parameters
The table below outlines key parameters for configuring and optimizing the performance of Survival Regression.
Parameter | Description | Default value | Possible Values |
---|---|---|---|
MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run. | 100 | (>= 0) |
TOL | This parameter specifies the convergence tolerance for the algorithm. | 1E-6 | (>= 0) |
AGGREGATION_DEPTH | This parameter specifies the suggested depth for treeAggregate. If the feature dimensions or the number of partitions are large, this parameter can be set to a larger value. | 2 | (>= 2) |
FIT_INTERCEPT | This parameter indicates whether the model should include an intercept term. | true | true, false |
PREDICTION_COL | This parameter specifies the name of the column used for storing predictions. | “prediction” | Any string |
CENSOR_COL | This parameter specifies the name of the censor column. A value of 1 indicates that the event has occurred (uncensored), while 0 means the event is censored. | “censor” | Any string |
MAX_BLOCK_SIZE_IN_MB | This parameter specifies the maximum memory, in megabytes (MB), for stacking input data into blocks. Setting the value to 0 allows automatic adjustment. | 0.0 | (>= 0) |
Example
CREATE MODEL modelname OPTIONS(
type = 'survival_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
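The sketch below names a censor column explicitly; the column name censor is a hypothetical placeholder, and the key = value OPTIONS syntax is assumed.
CREATE MODEL modelname OPTIONS(
type = 'survival_regression',
CENSOR_COL = 'censor', -- hypothetical column where 1 = event occurred, 0 = censored
MAX_ITER = 200 -- allow more iterations to converge
) AS
SELECT col1, col2, col3 FROM training-dataset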
Next steps
After reading this document, you now know how to configure and use various regression algorithms. Next, refer to the documents on classification and clustering to learn about other types of advanced statistical models.