# Regression algorithms {#regression-algorithms}
This document provides an overview of various regression algorithms, focusing on their configuration, key parameters, and practical usage in advanced statistical models. Regression algorithms are used to model the relationship between dependent and independent variables, predicting continuous outcomes based on observed data. Each section includes parameter descriptions and example code to help you implement and optimize these algorithms for tasks such as linear, random forest, and survival regression.
## Decision Tree regression {#decision-tree-regression}
Decision Tree learning is a supervised learning method used in statistics, data mining, and machine learning. In this approach, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of decision tree models.

| Parameter | Description |
| --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. Users can configure how often the cache should be checkpointed or disable it by setting `CHECKPOINT_INTERVAL`. Default: `false`. Possible values: `true`, `false`. |
| `CHECKPOINT_INTERVAL` | Specifies how often the node ID cache should be checkpointed. For example, a value of `10` means the cache will be checkpointed every 10 iterations. This is only applicable if `CACHE_NODE_IDS` is set to `true` and the checkpoint directory is configured in `org.apache.spark.SparkContext`. |
| `IMPURITY` | The criterion used for information gain calculation. Supported options are `entropy` and `gini`. Default: `gini`. |
| `MAX_DEPTH` | The maximum depth of the tree (non-negative). A depth of `0` means 1 leaf node, while a depth of `1` means 1 internal node and 2 leaf nodes. The depth must be within the range `[0, 30]`. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. |
| `PREDICTION_COL` | The name of the column used to store predictions. |
| `SEED` | The random seed. |
| `WEIGHT_COL` | The name of the weight column. If not set or empty, all instance weights are treated as `1.0`. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'decision_tree_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
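The parameters from the table above can be tuned in the same statement. The sketch below is illustrative only, assuming additional parameters are passed as key-value pairs in the `OPTIONS` clause; the values shown are hypothetical:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'decision_tree_regression',
  MAX_DEPTH = 8,
  MAX_BINS = 64,
  MIN_INSTANCES_PER_NODE = 5
) AS
SELECT col1, col2, col3 FROM training-dataset
```

Deeper trees can capture more complex relationships but overfit more easily; `MAX_DEPTH` must stay within `[0, 30]`.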
## Factorization Machines regression {#factorization-machines-regression}
Factorization Machines is a regression learning algorithm that supports normal gradient descent and the AdamW solver. The algorithm is based on the paper by S. Rendle (2010), “Factorization Machines.”
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Factorization Machines regression.

| Parameter | Description |
| --- | --- |
| `TOL` | The convergence tolerance for iterative algorithms. Default: `1E-6`. |
| `FACTOR_SIZE` | The dimensionality of the factors. |
| `FIT_INTERCEPT` | Specifies whether to fit an intercept term. Default: `true`. Possible values: `true`, `false`. |
| `FIT_LINEAR` | Specifies whether to fit the linear (1-way) term. Default: `true`. Possible values: `true`, `false`. |
| `INIT_STD` | The standard deviation of the initial coefficients. |
| `MAX_ITER` | The maximum number of iterations. |
| `MINI_BATCH_FRACTION` | The fraction of data used in each mini-batch. Must be in the range `(0, 1]`. |
| `REG_PARAM` | The regularization parameter. |
| `SEED` | The random seed. |
| `SOLVER` | The solver algorithm used for optimization. Supported options are `gd` (gradient descent) and `adamW`. |
| `STEP_SIZE` | The initial step size for the first optimization step. |
| `PREDICTION_COL` | The name of the column used to store predictions. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'factorization_machines_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
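Solver and factor settings would be supplied the same way. This is a sketch, assuming parameters from the table are accepted as key-value pairs in `OPTIONS`; the values are hypothetical:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'factorization_machines_regression',
  SOLVER = 'adamW',
  FACTOR_SIZE = 8,
  STEP_SIZE = 0.01,
  MAX_ITER = 100
) AS
SELECT col1, col2, col3 FROM training-dataset
```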
## Generalized Linear regression {#generalized-linear-regression}
Unlike linear regression, which assumes that the outcome follows a normal (Gaussian) distribution, Generalized Linear Models (GLMs) allow the outcome to follow different types of distributions, such as Poisson or binomial, depending on the nature of the data.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Generalized Linear regression.

| Parameter | Description |
| --- | --- |
| `MAX_ITER` | The maximum number of iterations (applicable when the solver is `irls`). |
| `REG_PARAM` | The regularization parameter. |
| `TOL` | The convergence tolerance for iterative algorithms. Default: `1E-6`. |
| `AGGREGATION_DEPTH` | The suggested depth for `treeAggregate`. |
| `FAMILY` | The name of the family, which describes the error distribution used in the model. Supported options are `gaussian`, `binomial`, `poisson`, `gamma`, and `tweedie`. |
| `FIT_INTERCEPT` | Specifies whether to fit an intercept term. Default: `true`. Possible values: `true`, `false`. |
| `LINK` | The name of the link function, which defines the relationship between the linear predictor and the mean of the distribution function. Supported options are `identity`, `log`, `inverse`, `logit`, `probit`, `cloglog`, and `sqrt`. |
| `LINK_POWER` | The index in the power link function. The default link power is `1 - variancePower`, which aligns with the R `statmod` package. Specific link powers of 0, 1, -1, and 0.5 correspond to the Log, Identity, Inverse, and Sqrt links, respectively. |
| `SOLVER` | The solver algorithm used for optimization. Currently only `irls` (iteratively reweighted least squares) is supported. |
| `VARIANCE_POWER` | The power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and the mean. Supported values lie in `[1, inf)`. Variance powers of 0, 1, and 2 correspond to the Gaussian, Poisson, and Gamma families, respectively. |
| `LINK_PREDICTION_COL` | The name of the link prediction (linear predictor) column. |
| `OFFSET_COL` | The name of the offset column. If not set, all instance offsets are treated as `0.0`. |
| `WEIGHT_COL` | The name of the weight column. If not set or empty, all instance weights are treated as `1.0`. In the Binomial family, weights correspond to the number of trials, and non-integer weights are rounded in the AIC calculation. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'generalized_linear_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
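For count-valued outcomes, the Poisson family with a log link is a common pairing. The following is an illustrative sketch, assuming `FAMILY` and `LINK` are passed as key-value pairs in `OPTIONS`; the values are hypothetical:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'generalized_linear_reg',
  FAMILY = 'poisson',
  LINK = 'log',
  MAX_ITER = 25
) AS
SELECT col1, col2, col3 FROM training-dataset
```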
## Gradient Boosted Tree regression {#gradient-boosted-tree-regression}
Gradient-boosted trees (GBTs) are an effective method for classification and regression that combines the predictions of multiple decision trees to improve predictive accuracy and model performance.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Gradient Boosted Tree regression.

| Parameter | Description |
| --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance. Caching can speed up the training of deeper trees. Default: `false`. Possible values: `true`, `false`. |
| `CHECKPOINT_INTERVAL` | Specifies how often to checkpoint the cached node IDs. For example, a value of `10` means the cache is checkpointed every 10 iterations. |
| `MAX_DEPTH` | The maximum depth of the tree (non-negative). A depth of `0` means 1 leaf node, and a depth of `1` means 1 internal node with 2 leaf nodes. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. |
| `PREDICTION_COL` | The name of the column used to store predictions. |
| `VALIDATION_INDICATOR_COL` | The name of the column that indicates whether each row is used for training or for validation: `false` for training and `true` for validation. |
| `LEAF_COL` | The name of the column for leaf indices, which holds the predicted leaf index of each instance in each tree. |
| `FEATURE_SUBSET_STRATEGY` | The number of features to consider for splits at each tree node. Supported options are `auto`, `all`, `onethird`, `sqrt`, `log2`, or a fraction between 0 and 1.0. |
| `SEED` | The random seed. |
| `WEIGHT_COL` | The name of the weight column. If not set or empty, all instance weights are treated as `1.0`. |
| `LOSS_TYPE` | The loss function to be minimized. Supported options are `squared` (L2) and `absolute` (L1). Note: values are case-insensitive. |
| `STEP_SIZE` | The step size (learning rate) in the range `(0, 1]`, used to shrink the contribution of each estimator. |
| `MAX_ITER` | The maximum number of iterations. |
| `SUBSAMPLING_RATE` | The fraction of the training data used for learning each decision tree, in the range `(0, 1]`. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'gradient_boosted_tree_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
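In boosting, a smaller `STEP_SIZE` usually calls for more iterations. The sketch below is illustrative, assuming these parameters are accepted as key-value pairs in `OPTIONS`; the values are hypothetical:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'gradient_boosted_tree_regression',
  MAX_ITER = 100,
  STEP_SIZE = 0.1,
  LOSS_TYPE = 'squared'
) AS
SELECT col1, col2, col3 FROM training-dataset
```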
## Isotonic regression {#isotonic-regression}
Isotonic Regression is an algorithm used to iteratively adjust distances while preserving the relative order of dissimilarities in the data.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Isotonic Regression.

| Parameter | Description |
| --- | --- |
| `ISOTONIC` | Specifies whether the output sequence should be isotonic (increasing) when `true` or antitonic (decreasing) when `false`. Default: `true`. Possible values: `true`, `false`. |
| `WEIGHT_COL` | The name of the weight column. If not set or empty, all instance weights are treated as `1.0`. |
| `PREDICTION_COL` | The name of the column used to store predictions. |
| `FEATURE_INDEX` | The index of the feature, applicable if `featuresCol` is a vector column. If not set, the default value is `0`. Otherwise, it has no effect. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'isotonic_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
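To fit a decreasing (antitonic) sequence instead of the default increasing one, the `ISOTONIC` flag would be set to `false`. This sketch assumes the parameter is passed as a key-value pair in `OPTIONS`:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'isotonic_regression',
  ISOTONIC = false
) AS
SELECT col1, col2, col3 FROM training-dataset
```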
## Linear regression {#linear-regression}
Linear Regression is a supervised machine learning algorithm that fits a linear equation to data in order to model the relationship between a dependent variable and independent features.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Linear Regression.

| Parameter | Description |
| --- | --- |
| `MAX_ITER` | The maximum number of iterations. |
| `REGPARAM` | The regularization parameter. |
| `ELASTICNETPARAM` | The ElasticNet mixing parameter, in the range `[0, 1]`. A value of `0` applies an L2 penalty; a value of `1` applies an L1 penalty. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'linear_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
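An elastic-net penalty mixes L1 and L2 regularization. The sketch below is illustrative, assuming the parameters are passed as key-value pairs in `OPTIONS`; the values are hypothetical:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'linear_reg',
  MAX_ITER = 50,
  REGPARAM = 0.1,
  ELASTICNETPARAM = 0.5
) AS
SELECT col1, col2, col3 FROM training-dataset
```

Here `ELASTICNETPARAM = 0.5` would weight the L1 and L2 penalties equally.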
## Random Forest Regression {#random-forest-regression}
Random Forest Regression is an ensemble algorithm that builds multiple decision trees during training and returns the average prediction of those trees for regression tasks, helping to prevent overfitting.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Random Forest Regression.

| Parameter | Description |
| --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. Default: `false`. Possible values: `true`, `false`. |
| `CHECKPOINT_INTERVAL` | Specifies how often to checkpoint the cached node IDs. For example, a value of `10` means the cache is checkpointed every 10 iterations. |
| `IMPURITY` | The criterion used for information gain calculation. Possible values: `entropy`, `gini`. |
| `MAX_DEPTH` | The maximum depth of the tree (non-negative). A depth of `0` means 1 leaf node, and a depth of `1` means 1 internal node and 2 leaf nodes. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. |
| `BOOTSTRAP` | Specifies whether bootstrap samples are used when building trees. Possible values: `true`, `false`. |
| `NUM_TREES` | The number of trees to train. If `1`, no bootstrapping is used. If greater than `1`, bootstrapping is applied. |
| `SUBSAMPLING_RATE` | The fraction of the training data used for learning each decision tree, in the range `(0, 1]`. |
| `LEAF_COL` | The name of the column for leaf indices, which holds the predicted leaf index of each instance in each tree. |
| `PREDICTION_COL` | The name of the column used to store predictions. |
| `SEED` | The random seed. |
| `WEIGHT_COL` | The name of the weight column. If not set or empty, all instance weights are treated as `1.0`. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'random_forest_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
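Forest size and per-tree sampling are the usual levers for trading accuracy against training time. The sketch below is illustrative, assuming the parameters are accepted as key-value pairs in `OPTIONS`; the values are hypothetical:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'random_forest_regression',
  NUM_TREES = 100,
  MAX_DEPTH = 10,
  SUBSAMPLING_RATE = 0.8
) AS
SELECT col1, col2, col3 FROM training-dataset
```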
## Survival Regression {#survival-regression}
Survival Regression is used to fit a parametric survival regression model, known as the Accelerated Failure Time (AFT) model, based on the Weibull distribution. It can stack instances into blocks for enhanced performance.
### Parameters

The table below outlines key parameters for configuring and optimizing the performance of Survival Regression.

| Parameter | Description |
| --- | --- |
| `MAX_ITER` | The maximum number of iterations. |
| `TOL` | The convergence tolerance for iterative algorithms. Default: `1E-6`. |
| `AGGREGATION_DEPTH` | The suggested depth for `treeAggregate`. If the feature dimensions or the number of partitions are large, this parameter can be set to a larger value. |
| `FIT_INTERCEPT` | Specifies whether to fit an intercept term. Possible values: `true`, `false`. |
| `PREDICTION_COL` | The name of the column used to store predictions. |
| `CENSOR_COL` | The name of the censor column. A value of `1` indicates that the event has occurred (uncensored), while `0` means the event is censored. |
| `MAX_BLOCK_SIZE_IN_MB` | The maximum memory, in MB, for stacking input data into blocks. A value of `0` allows automatic adjustment. |

### Example

```sql
CREATE MODEL modelname OPTIONS(
  type = 'survival_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
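An AFT model needs to know which rows are censored, so the censor indicator is typically selected alongside the features. The sketch below is illustrative: the `censor` column name is hypothetical, and it assumes `CENSOR_COL` is passed as a key-value pair in `OPTIONS`:

```sql
CREATE MODEL modelname OPTIONS(
  type = 'survival_regression',
  CENSOR_COL = 'censor',
  MAX_ITER = 100
) AS
SELECT col1, col2, col3, censor FROM training-dataset
```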
## Next steps
After reading this document, you now know how to configure and use various regression algorithms. Next, refer to the documents on classification and clustering to learn about other types of advanced statistical models.