Regression algorithms

This document provides an overview of various regression algorithms, focusing on their configuration, key parameters, and practical usage in advanced statistical models. Regression algorithms model the relationship between dependent and independent variables to predict continuous outcomes from observed data. Each section includes parameter descriptions and example code to help you implement and optimize algorithms such as linear, random forest, and survival regression.

Decision Tree regression

Decision Tree learning is a supervised learning method used in statistics, data mining, and machine learning. In this approach, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of decision tree models.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_BINS | The maximum number of bins used to discretize continuous features and determine splits at each node. More bins result in finer granularity and detail. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
| CACHE_NODE_IDS | Whether to cache node IDs for each instance. If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. Use CHECKPOINT_INTERVAL to configure how often the cache is checkpointed. | false | true, false |
| CHECKPOINT_INTERVAL | How often the cached node IDs are checkpointed. For example, setting it to 10 means the cache is checkpointed every 10 iterations. Applicable only if CACHE_NODE_IDS is true and the checkpoint directory is configured in org.apache.spark.SparkContext. | 10 | (>= 1) |
| IMPURITY | The criterion used for calculating information gain. | gini | entropy, gini |
| MAX_DEPTH | The maximum depth of the tree. For example, a depth of 0 means 1 leaf node, while a depth of 1 means 1 internal node and 2 leaf nodes. | 5 | [0, 30] |
| MIN_INFO_GAIN | The minimum information gain required for a split to be considered valid at a tree node. | 0.0 | (>= 0.0) |
| MIN_WEIGHT_FRACTION_PER_NODE | The minimum fraction of the weighted sample count that each child node must have after a split. If either child node has a fraction less than this value, the split is discarded. | 0.0 | [0.0, 0.5] |
| MIN_INSTANCES_PER_NODE | The minimum number of instances that each child node must have after a split. If a split results in fewer instances than this value, it is discarded as invalid. | 1 | (>= 1) |
| MAX_MEMORY_IN_MB | The maximum memory, in megabytes (MB), allocated for histogram aggregation. If the memory is too small, only one node is split per iteration, and its aggregates may exceed this size. | 256 | Any positive integer |
| PREDICTION_COL | The name of the column used for storing predictions. | "prediction" | Any string |
| SEED | The random seed used in the model. | None | Any 64-bit number |
| WEIGHT_COL | The name of the weight column. If not set or empty, all instance weights are treated as 1.0. | Not set | Any string |

Example

CREATE MODEL modelname OPTIONS(
  type = 'decision_tree_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
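
The parameters in the table above can plausibly be tuned by passing them as additional key-value pairs in the OPTIONS clause. The following sketch assumes that syntax; the model name and parameter values are illustrative only:

CREATE MODEL dt_model OPTIONS(
  type = 'decision_tree_regression',
  MAX_DEPTH = 8,                -- allow deeper splits than the default of 5
  MIN_INSTANCES_PER_NODE = 10,  -- reject splits that leave a child with fewer than 10 rows
  SEED = 42                     -- fix the random seed for reproducible training
) AS
  SELECT col1, col2, col3 FROM training-dataset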

Factorization Machines regression

Factorization Machines is a regression learning algorithm that supports normal gradient descent and the AdamW solver. The algorithm is based on the paper by S. Rendle (2010), “Factorization Machines.”

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Factorization Machines regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| TOL | This parameter specifies the convergence tolerance for the algorithm. Higher values may result in faster convergence but less precision. | 1E-6 | (>= 0) |
| FACTOR_SIZE | This parameter defines the dimensionality of the factors. Higher values increase model complexity. | 8 | (>= 0) |
| FIT_INTERCEPT | This parameter indicates whether the model should include an intercept term. | true | true, false |
| FIT_LINEAR | This parameter specifies whether to include a linear term (also called the 1-way term) in the model. | true | true, false |
| INIT_STD | This parameter defines the standard deviation of the initial coefficients used in the model. | 0.01 | (>= 0) |
| MAX_ITER | This parameter specifies the maximum number of iterations for the algorithm to run. | 100 | (>= 0) |
| MINI_BATCH_FRACTION | This parameter sets the mini-batch fraction, which determines the portion of data used in each batch. | 1.0 | (0, 1] |
| REG_PARAM | This parameter sets the regularization parameter to prevent overfitting. | 0.0 | (>= 0) |
| SEED | This parameter specifies the random seed used for model initialization. | None | Any 64-bit number |
| SOLVER | This parameter specifies the solver algorithm used for optimization. | "adamW" | gd (gradient descent), adamW |
| STEP_SIZE | This parameter specifies the initial step size (or learning rate) for the first optimization step. | 1.0 | Any positive value |
| PREDICTION_COL | This parameter specifies the name of the column where predictions are stored. | "prediction" | Any string |

Example

CREATE MODEL modelname OPTIONS(
  type = 'factorization_machines_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
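
A hedged sketch showing how the solver and factor settings might be supplied, assuming the OPTIONS clause accepts the parameters listed above as key-value pairs (the model name and values are illustrative):

CREATE MODEL fm_model OPTIONS(
  type = 'factorization_machines_regression',
  FACTOR_SIZE = 16,   -- larger factor dimensionality than the default of 8
  SOLVER = 'gd',      -- plain gradient descent instead of the default adamW
  STEP_SIZE = 0.1,    -- smaller initial learning rate
  REG_PARAM = 0.01    -- light regularization to reduce overfitting
) AS
  SELECT col1, col2, col3 FROM training-dataset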

Generalized Linear regression

Unlike linear regression, which assumes that the outcome follows a normal (Gaussian) distribution, Generalized Linear Models (GLMs) allow the outcome to follow different types of distributions, such as Poisson or binomial, depending on the nature of the data.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Generalized Linear regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_ITER | Sets the maximum number of iterations (applicable when using the solver irls). | 25 | (>= 0) |
| REG_PARAM | The regularization parameter. | Not set | (>= 0) |
| TOL | The convergence tolerance. | 1E-6 | (>= 0) |
| AGGREGATION_DEPTH | The suggested depth for treeAggregate. | 2 | (>= 2) |
| FAMILY | The family parameter, describing the error distribution used in the model. | "gaussian" | gaussian, binomial, poisson, gamma, tweedie |
| FIT_INTERCEPT | Whether to fit an intercept term. | true | true, false |
| LINK | The link function, which defines the relationship between the linear predictor and the mean of the distribution function. | Not set | identity, log, inverse, logit, probit, cloglog, sqrt |
| LINK_POWER | The index in the power link function, applicable only to the Tweedie family. If not set, it defaults to 1 - variancePower, which aligns with the R statmod package. Link powers of 0, 1, -1, and 0.5 correspond to the log, identity, inverse, and sqrt links, respectively. | 1 | Any numeric value |
| SOLVER | The solver algorithm used for optimization. | "irls" | irls (iteratively reweighted least squares) |
| VARIANCE_POWER | The power in the variance function of the Tweedie distribution, defining the relationship between variance and mean. Variance powers of 0, 1, and 2 correspond to the Gaussian, Poisson, and Gamma families, respectively. | 0.0 | 0, [1, inf) |
| LINK_PREDICTION_COL | The link prediction (linear predictor) column name. | Not set | Any string |
| OFFSET_COL | The offset column name. If not set, all instance offsets are treated as 0.0. The offset feature has a constant coefficient of 1.0. | Not set | Any string |
| WEIGHT_COL | The weight column name. If not set or empty, all instance weights are treated as 1.0. In the Binomial family, weights correspond to the number of trials, and non-integer weights are rounded in the AIC calculation. | Not set | Any string |

Example

CREATE MODEL modelname OPTIONS(
  type = 'generalized_linear_reg'
) AS
  SELECT col1, col2, col3 FROM training-dataset
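
For example, count data could be modeled by selecting the poisson family with its canonical log link. The sketch below assumes the table's parameters can be passed through OPTIONS; the model name and values are illustrative:

CREATE MODEL glm_count_model OPTIONS(
  type = 'generalized_linear_reg',
  FAMILY = 'poisson',   -- error distribution suited to count outcomes
  LINK = 'log',         -- canonical link for the poisson family
  MAX_ITER = 50         -- allow more irls iterations than the default of 25
) AS
  SELECT col1, col2, col3 FROM training-dataset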

Gradient Boosted Tree regression

Gradient-boosted trees (GBTs) are an effective method for classification and regression that combines the predictions of multiple decision trees to improve predictive accuracy and model performance.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Gradient Boosted Tree regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_BINS | The maximum number of bins used to divide continuous features into discrete intervals, which helps determine how features are split at each decision tree node. More bins provide higher granularity. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
| CACHE_NODE_IDS | If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. | false | true, false |
| CHECKPOINT_INTERVAL | Specifies how often to checkpoint the cached node IDs. For example, 10 means the cache is checkpointed every 10 iterations. | 10 | (>= 1) |
| MAX_DEPTH | The maximum depth of the tree (non-negative). For example, depth 0 means 1 leaf node, and depth 1 means 1 internal node with 2 leaf nodes. | 5 | (>= 0) |
| MIN_INFO_GAIN | The minimum information gain required for a split to be considered at a tree node. | 0.0 | (>= 0.0) |
| MIN_WEIGHT_FRACTION_PER_NODE | The minimum fraction of the weighted sample count that each child must have after a split. If a split causes the fraction of the total weight in either child to be less than this value, it is discarded. | 0.0 | [0.0, 0.5] |
| MIN_INSTANCES_PER_NODE | The minimum number of instances each child must have after a split. If a split results in fewer instances than this value, the split is discarded. | 1 | (>= 1) |
| MAX_MEMORY_IN_MB | The maximum memory, in MB, allocated to histogram aggregation. If this value is too small, only one node is split per iteration, and its aggregates may exceed this size. | 256 | Any positive integer |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| VALIDATION_INDICATOR_COL | The name of the column indicating whether each row is used for training (false) or validation (true). | Not set | Any string |
| LEAF_COL | The column name for leaf indices: the predicted leaf index of each instance in each tree, generated by preorder traversal. | "" | Any string |
| FEATURE_SUBSET_STRATEGY | The number of features to consider for splits at each tree node. | "auto" | auto, all, onethird, sqrt, log2, or a fraction between 0 and 1.0 |
| SEED | The random seed. | Not set | Any 64-bit number |
| WEIGHT_COL | The column name for instance weights. If not set or empty, all instance weights are treated as 1.0. | Not set | Any string |
| LOSS_TYPE | The loss function that the Gradient Boosted Tree model minimizes. | "squared" | squared (L2), absolute (L1). Values are case-insensitive. |
| STEP_SIZE | The step size (also known as the learning rate), used to shrink the contribution of each estimator. | 0.1 | (0, 1] |
| MAX_ITER | The maximum number of iterations for the algorithm. | 20 | (>= 0) |
| SUBSAMPLING_RATE | The fraction of the training data used to learn each decision tree. | 1.0 | (0, 1] |

Example

CREATE MODEL modelname OPTIONS(
  type = 'gradient_boosted_tree_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
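
The loss function and learning rate are often the first parameters to tune. This sketch assumes the OPTIONS clause accepts the parameters above as key-value pairs; the model name and values are illustrative:

CREATE MODEL gbt_model OPTIONS(
  type = 'gradient_boosted_tree_regression',
  LOSS_TYPE = 'absolute',  -- L1 loss, less sensitive to outliers than the default squared (L2) loss
  STEP_SIZE = 0.05,        -- smaller learning rate to shrink each tree's contribution
  MAX_ITER = 100           -- more boosting iterations to compensate for the smaller step size
) AS
  SELECT col1, col2, col3 FROM training-dataset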

Isotonic regression

Isotonic Regression is an algorithm that fits a free-form, monotonically increasing (or decreasing) function to the data, adjusting fitted values while preserving the relative order of the observations.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Isotonic Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| ISOTONIC | Specifies whether the output sequence should be isotonic (increasing) when true or antitonic (decreasing) when false. | true | true, false |
| WEIGHT_COL | The column name for instance weights. If not set or empty, all instance weights are treated as 1.0. | Not set | Any string |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| FEATURE_INDEX | The index of the feature to use, applicable when featuresCol is a vector column; it has no effect otherwise. | 0 | Any non-negative integer |

Example

CREATE MODEL modelname OPTIONS(
  type = 'isotonic_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
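
To fit a decreasing (antitonic) sequence instead of the default increasing one, the ISOTONIC flag would plausibly be set in OPTIONS, as sketched below with an illustrative model name:

CREATE MODEL antitonic_model OPTIONS(
  type = 'isotonic_regression',
  ISOTONIC = false   -- fit an antitonic (decreasing) sequence instead of an isotonic one
) AS
  SELECT col1, col2, col3 FROM training-dataset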

Linear regression

Linear Regression is a supervised machine learning algorithm that fits a linear equation to data in order to model the relationship between a dependent variable and independent features.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Linear Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_ITER | The maximum number of iterations. | 100 | (>= 0) |
| REGPARAM | The regularization parameter used to control the complexity of the model. | 0.0 | (>= 0) |
| ELASTICNETPARAM | The ElasticNet mixing parameter, which controls the balance between L1 (Lasso) and L2 (Ridge) penalties. A value of 0 applies an L2 penalty, while a value of 1 applies an L1 penalty. | 0.0 | [0, 1] |

Example

CREATE MODEL modelname OPTIONS(
  type = 'linear_reg'
) AS
  SELECT col1, col2, col3 FROM training-dataset
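
To apply an elastic net penalty, combine REGPARAM with ELASTICNETPARAM. The sketch below assumes these can be passed as key-value pairs in OPTIONS; the model name and values are illustrative:

CREATE MODEL elastic_net_model OPTIONS(
  type = 'linear_reg',
  MAX_ITER = 200,         -- more iterations than the default of 100
  REGPARAM = 0.1,         -- overall regularization strength
  ELASTICNETPARAM = 0.5   -- evenly mix the L1 (Lasso) and L2 (Ridge) penalties
) AS
  SELECT col1, col2, col3 FROM training-dataset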

Random Forest Regression

Random Forest Regression is an ensemble algorithm that builds multiple decision trees during training and returns the average prediction of those trees for regression tasks, helping to prevent overfitting.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Random Forest Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_BINS | The maximum number of bins used to discretize continuous features and determine how features are split at each node. More bins provide higher granularity. | 32 | Must be at least 2 and at least the number of categories in any categorical feature. |
| CACHE_NODE_IDS | If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. | false | true, false |
| CHECKPOINT_INTERVAL | Specifies how often to checkpoint the cached node IDs. For example, 10 means the cache is checkpointed every 10 iterations. | 10 | (>= 1) |
| IMPURITY | The criterion used for information gain calculation (case-insensitive). | "entropy" | entropy, gini |
| MAX_DEPTH | The maximum depth of the tree (non-negative). For example, depth 0 means 1 leaf node, and depth 1 means 1 internal node and 2 leaf nodes. | 5 | Any non-negative integer |
| MIN_INFO_GAIN | The minimum information gain required for a split to be considered at a tree node. | 0.0 | (>= 0.0) |
| MIN_WEIGHT_FRACTION_PER_NODE | The minimum fraction of the weighted sample count that each child must have after a split. If a split causes the fraction of the total weight in either child to be less than this value, it is discarded. | 0.0 | [0.0, 0.5] |
| MIN_INSTANCES_PER_NODE | The minimum number of instances each child must have after a split. If a split results in fewer instances than this value, the split is discarded. | 1 | (>= 1) |
| MAX_MEMORY_IN_MB | The maximum memory, in MB, allocated to histogram aggregation. If this value is too small, only one node is split per iteration, and its aggregates may exceed this size. | 256 | (>= 1) |
| BOOTSTRAP | Whether to use bootstrap samples when building trees. | true | true, false |
| NUM_TREES | The number of trees to train (at least 1). If 1, no bootstrapping is used. If greater than 1, bootstrapping is applied. | 20 | (>= 1) |
| SUBSAMPLING_RATE | The fraction of the training data used to train each decision tree. | 1.0 | (0, 1] |
| LEAF_COL | The column name for leaf indices: the predicted leaf index of each instance in each tree, generated by preorder traversal. | "" | Any string |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| SEED | The random seed. | Not set | Any 64-bit number |
| WEIGHT_COL | The column name for instance weights. If not set or empty, all instance weights are treated as 1.0. | Not set | Any valid column name, or leave empty. |

Example

CREATE MODEL modelname OPTIONS(
  type = 'random_forest_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
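
A common starting point is to increase the number of trees and subsample the training data for each of them. The following sketch assumes the parameters above can be supplied in OPTIONS; the model name and values are illustrative:

CREATE MODEL rf_model OPTIONS(
  type = 'random_forest_regression',
  NUM_TREES = 50,           -- more trees than the default of 20 for a more stable average
  SUBSAMPLING_RATE = 0.8,   -- train each tree on 80% of the data
  MAX_DEPTH = 10            -- allow deeper trees than the default of 5
) AS
  SELECT col1, col2, col3 FROM training-dataset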

Survival Regression

Survival Regression is used to fit a parametric survival regression model, known as the Accelerated Failure Time (AFT) model, based on the Weibull distribution. It can stack instances into blocks for enhanced performance.

Parameters

The table below outlines key parameters for configuring and optimizing the performance of Survival Regression.

| Parameter | Description | Default value | Possible values |
| --- | --- | --- | --- |
| MAX_ITER | The maximum number of iterations that the algorithm should run. | 100 | (>= 0) |
| TOL | The convergence tolerance. | 1E-6 | (>= 0) |
| AGGREGATION_DEPTH | The suggested depth for treeAggregate. If the feature dimensions or the number of partitions are large, this parameter can be set to a larger value. | 2 | (>= 2) |
| FIT_INTERCEPT | Whether to fit an intercept term. | true | true, false |
| PREDICTION_COL | The column name for prediction output. | "prediction" | Any string |
| CENSOR_COL | The column name for censoring. A value of 1 in this column indicates that the event has occurred (uncensored), while 0 means the event is censored. | "censor" | Any string (the column's values must be 0 or 1) |
| MAX_BLOCK_SIZE_IN_MB | The maximum memory, in MB, for stacking input data into blocks. If the remaining data size in a partition is smaller, this value is adjusted accordingly. A value of 0 allows automatic adjustment. | 0.0 | (>= 0) |

Example

CREATE MODEL modelname OPTIONS(
  type = 'survival_regression'
) AS
  SELECT col1, col2, col3 FROM training-dataset
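
Because AFT models need to know which observations are censored, the censoring column is usually configured explicitly. The sketch below assumes CENSOR_COL can be set in OPTIONS and that the selected data contains a 0/1 column named event_observed; all names and values are hypothetical:

CREATE MODEL aft_model OPTIONS(
  type = 'survival_regression',
  CENSOR_COL = 'event_observed',  -- hypothetical 0/1 column: 1 = event occurred, 0 = censored
  MAX_ITER = 200                  -- allow more iterations than the default of 100
) AS
  SELECT col1, col2, col3, event_observed FROM training-dataset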

Next steps

After reading this document, you now know how to configure and use various regression algorithms. Next, refer to the documents on classification and clustering to learn about other types of advanced statistical models.
