# Regression algorithms {#regression-algorithms}
This document provides an overview of various regression algorithms, focusing on their configuration, key parameters, and practical usage in advanced statistical models. Regression algorithms are used to model the relationship between dependent and independent variables, predicting continuous outcomes based on observed data. Each section includes parameter descriptions and example code to help you implement and optimize these algorithms for tasks such as linear, random forest, and survival regression.
## Decision Tree regression {#decision-tree-regression}
Decision Tree learning is a supervised learning method used in statistics, data mining, and machine learning. In this approach, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of decision tree models.
| Parameter | Description |
| --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. More bins give higher granularity. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, which can speed up the training of deeper trees. Users can configure how often the cache is checkpointed, or disable it, by setting `CHECKPOINT_INTERVAL`. Possible values: `true`, `false`. |
| `CHECKPOINT_INTERVAL` | Specifies how often the cached node IDs are checkpointed. For example, `10` means the cache is checkpointed every 10 iterations. This only applies if `CACHE_NODE_IDS` is set to `true` and the checkpoint directory is configured in `org.apache.spark.SparkContext`. |
| `IMPURITY` | The criterion used for information gain calculation. Supported options: `entropy` and `gini`. Default: `gini`. |
| `MAX_DEPTH` | The maximum depth of the tree. A depth of `0` means 1 leaf node, while a depth of `1` means 1 internal node and 2 leaf nodes. The depth must be within the range `[0, 30]`. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. |
| `PREDICTION_COL` | The name of the column used for predictions. |
| `SEED` | The random seed. |
| `WEIGHT_COL` | The name of the weight column. If this is not set or is empty, all instance weights are treated as `1.0`. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'decision_tree_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
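The parameters in the table above are supplied through the same `OPTIONS` clause as `type`. The sketch below is illustrative only: the values are hypothetical (chosen to show the syntax rather than as recommended settings), and it assumes parameters are passed as additional key-value pairs in `OPTIONS`.

```sql
CREATE MODEL dt_model OPTIONS(
  type = 'decision_tree_regression',
  MAX_DEPTH = 8,               -- deeper tree; must stay within [0, 30]
  MIN_INSTANCES_PER_NODE = 5,  -- each child needs at least 5 instances
  CACHE_NODE_IDS = 'true',     -- cache node IDs to speed up deep trees
  CHECKPOINT_INTERVAL = 10     -- checkpoint the cache every 10 iterations
) AS
SELECT col1, col2, col3 FROM training-dataset
```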
## Factorization Machines regression {#factorization-machines-regression}
Factorization Machines is a regression learning algorithm that supports normal gradient descent and the AdamW solver. The algorithm is based on the paper by S. Rendle (2010), “Factorization Machines.”
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Factorization Machines regression.
| Parameter | Description |
| --- | --- |
| `TOL` | The convergence tolerance for the optimization. Default: `1E-6`. |
| `FACTOR_SIZE` | The dimensionality of the factors. |
| `FIT_INTERCEPT` | Specifies whether to fit an intercept term. Default: `true`. Possible values: `true`, `false`. |
| `FIT_LINEAR` | Specifies whether to fit the linear (1-way) term. Default: `true`. Possible values: `true`, `false`. |
| `INIT_STD` | The standard deviation of the initial coefficients. |
| `MAX_ITER` | The maximum number of iterations to run. |
| `MINI_BATCH_FRACTION` | The fraction of data used in each mini-batch. Must be in the range `(0, 1]`. |
| `REG_PARAM` | The regularization parameter. |
| `SEED` | The random seed. |
| `SOLVER` | The solver algorithm used for optimization. Supported options: `gd` (gradient descent) and `adamW`. |
| `STEP_SIZE` | The initial step size (learning rate) for the first optimization step. |
| `PREDICTION_COL` | The name of the column used for predictions. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'factorization_machines_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
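The solver can be tuned through the same `OPTIONS` clause. The values here are purely illustrative, assuming parameters are passed as additional key-value pairs:

```sql
CREATE MODEL fm_model OPTIONS(
  type = 'factorization_machines_regression',
  SOLVER = 'adamW',    -- use the AdamW solver instead of gradient descent
  FACTOR_SIZE = 8,     -- dimensionality of the factors
  STEP_SIZE = 0.01,    -- initial step size (learning rate)
  MAX_ITER = 100       -- cap the number of iterations
) AS
SELECT col1, col2, col3 FROM training-dataset
```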
## Generalized Linear regression {#generalized-linear-regression}
Unlike linear regression, which assumes that the outcome follows a normal (Gaussian) distribution, Generalized Linear Models (GLMs) allow the outcome to follow different types of distributions, such as Poisson or binomial, depending on the nature of the data.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Generalized Linear regression.
| Parameter | Description |
| --- | --- |
| `MAX_ITER` | The maximum number of iterations (applies when the solver is `irls`). |
| `REG_PARAM` | The regularization parameter. |
| `TOL` | The convergence tolerance. Default: `1E-6`. |
| `AGGREGATION_DEPTH` | The suggested depth for `treeAggregate`. |
| `FAMILY` | The name of the family, which describes the error distribution used in the model. Supported options: `gaussian`, `binomial`, `poisson`, `gamma`, and `tweedie`. |
| `FIT_INTERCEPT` | Specifies whether to fit an intercept term. Default: `true`. Possible values: `true`, `false`. |
| `LINK` | The name of the link function, which defines the relationship between the linear predictor and the mean of the distribution function. Supported options: `identity`, `log`, `inverse`, `logit`, `probit`, `cloglog`, and `sqrt`. |
| `LINK_POWER` | The index in the power link function. The default is `1 - variancePower`, which aligns with the R `statmod` package. Link powers of `0`, `1`, `-1`, and `0.5` correspond to the Log, Identity, Inverse, and Sqrt links, respectively. |
| `SOLVER` | The solver algorithm used for optimization. The only supported option is `irls` (iteratively reweighted least squares). |
| `VARIANCE_POWER` | The power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and the mean. Supported values: `0` and `[1, inf)`. Variance powers of `0`, `1`, and `2` correspond to the Gaussian, Poisson, and Gamma families, respectively. |
| `LINK_PREDICTION_COL` | The name of the link prediction (linear predictor) column. |
| `OFFSET_COL` | The name of the offset column. If this is not set, all instance offsets are treated as `0.0`. |
| `WEIGHT_COL` | The name of the weight column. If this is not set or is empty, all instance weights are treated as `1.0`. In the Binomial family, weights correspond to the number of trials, and non-integer weights are rounded in the AIC calculation. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'generalized_linear_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
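For example, a Poisson model with a log link could be declared by adding `FAMILY` and `LINK` to `OPTIONS`. This sketch is illustrative and assumes parameters are passed as additional key-value pairs:

```sql
CREATE MODEL glm_poisson OPTIONS(
  type = 'generalized_linear_reg',
  FAMILY = 'poisson',  -- error distribution suited to count-valued outcomes
  LINK = 'log',        -- link between the linear predictor and the mean
  MAX_ITER = 25        -- iteration cap for the irls solver
) AS
SELECT col1, col2, col3 FROM training-dataset
```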
## Gradient Boosted Tree regression {#gradient-boosted-tree-regression}
Gradient-boosted trees (GBTs) are an effective method for classification and regression that combines the predictions of multiple decision trees to improve predictive accuracy and model performance.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Gradient Boosted Tree regression.
| Parameter | Description |
| --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance. Caching can speed up the training of deeper trees. Default: `false`. Possible values: `true`, `false`. |
| `CHECKPOINT_INTERVAL` | Specifies how often the cached node IDs are checkpointed. For example, `10` means the cache is checkpointed every 10 iterations. |
| `MAX_DEPTH` | The maximum depth of the tree. A depth of `0` means 1 leaf node, and a depth of `1` means 1 internal node with 2 leaf nodes. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. |
| `PREDICTION_COL` | The name of the column used for predictions. |
| `VALIDATION_INDICATOR_COL` | The name of the column used to decide whether each row is used for training or validation: `false` indicates training, and `true` indicates validation. |
| `LEAF_COL` | The name of the column for leaf indices, that is, the predicted leaf index of each instance in each tree. |
| `FEATURE_SUBSET_STRATEGY` | The number of features to consider for splits at each tree node. Supported options: `auto`, `all`, `onethird`, `sqrt`, `log2`, or a fraction between `0` and `1.0`. |
| `SEED` | The random seed. |
| `WEIGHT_COL` | The name of the weight column. If this is not set or is empty, all instance weights are treated as `1.0`. |
| `LOSS_TYPE` | The loss function minimized during optimization. Supported options: `squared` (L2) and `absolute` (L1). Note: values are case-insensitive. |
| `STEP_SIZE` | The step size, also known as the learning rate, in the range `(0, 1]`, used to shrink the contribution of each estimator. |
| `MAX_ITER` | The maximum number of iterations to run. |
| `SUBSAMPLING_RATE` | The fraction of the training data used to learn each decision tree. Must be in the range `(0, 1]`. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'gradient_boosted_tree_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
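As with the other tree-based models, tuning parameters go in the `OPTIONS` clause. The following values are hypothetical, assuming parameters are passed as additional key-value pairs:

```sql
CREATE MODEL gbt_model OPTIONS(
  type = 'gradient_boosted_tree_regression',
  MAX_ITER = 50,           -- number of boosting iterations (trees)
  STEP_SIZE = 0.1,         -- shrink each estimator's contribution; in (0, 1]
  LOSS_TYPE = 'absolute',  -- optimize L1 loss instead of squared (L2) loss
  SUBSAMPLING_RATE = 0.8   -- train each tree on 80% of the data
) AS
SELECT col1, col2, col3 FROM training-dataset
```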
## Isotonic regression {#isotonic-regression}
Isotonic Regression is an algorithm that fits a non-decreasing (or, when configured, non-increasing) sequence to the observations, keeping the fitted values as close as possible to the data while preserving their relative order.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Isotonic Regression.
| Parameter | Description |
| --- | --- |
| `ISOTONIC` | Specifies whether the fitted sequence is isotonic (increasing) when `true` or antitonic (decreasing) when `false`. Default: `true`. Possible values: `true`, `false`. |
| `WEIGHT_COL` | The name of the weight column. If this is not set or is empty, all instance weights are treated as `1.0`. |
| `PREDICTION_COL` | The name of the column used for predictions. |
| `FEATURE_INDEX` | The index of the feature, applicable only if `featuresCol` is a vector column. If not set, the default value is `0`. Otherwise, it has no effect. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'isotonic_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
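An antitonic (decreasing) fit could be requested by setting `ISOTONIC` to `false` in `OPTIONS`. This example is illustrative: it assumes key-value parameter passing, and the `weight` column name is hypothetical.

```sql
CREATE MODEL iso_model OPTIONS(
  type = 'isotonic_regression',
  ISOTONIC = 'false',    -- fit an antitonic (decreasing) sequence
  WEIGHT_COL = 'weight'  -- per-instance weights; unset weights default to 1.0
) AS
SELECT col1, col2, col3 FROM training-dataset
```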
## Linear regression {#linear-regression}
Linear Regression is a supervised machine learning algorithm that fits a linear equation to data in order to model the relationship between a dependent variable and independent features.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Linear Regression.
| Parameter | Description |
| --- | --- |
| `MAX_ITER` | The maximum number of iterations to run. |
| `REGPARAM` | The regularization parameter. |
| `ELASTICNETPARAM` | The ElasticNet mixing parameter, in the range `[0, 1]`. A value of `0` applies an L2 penalty; a value of `1` applies an L1 penalty. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'linear_reg'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
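Regularization is controlled through the parameters above; for instance, an elastic-net penalty might be configured as follows (the values are illustrative, assuming key-value parameter passing):

```sql
CREATE MODEL lr_model OPTIONS(
  type = 'linear_reg',
  MAX_ITER = 100,        -- iteration cap
  REGPARAM = 0.3,        -- overall regularization strength
  ELASTICNETPARAM = 0.5  -- 0 = pure L2, 1 = pure L1; 0.5 mixes both
) AS
SELECT col1, col2, col3 FROM training-dataset
```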
## Random Forest Regression {#random-forest-regression}
Random Forest Regression is an ensemble algorithm that builds multiple decision trees during training and returns the average prediction of those trees for regression tasks, helping to prevent overfitting.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Random Forest Regression.
| Parameter | Description |
| --- | --- |
| `MAX_BINS` | The maximum number of bins used to discretize continuous features and to choose how to split on features at each node. |
| `CACHE_NODE_IDS` | If `false`, the algorithm passes trees to executors to match instances with nodes. If `true`, the algorithm caches node IDs for each instance, speeding up the training of deeper trees. Default: `false`. Possible values: `true`, `false`. |
| `CHECKPOINT_INTERVAL` | Specifies how often the cached node IDs are checkpointed. For example, `10` means the cache is checkpointed every 10 iterations. |
| `IMPURITY` | The criterion used for information gain calculation. Supported options: `entropy`, `gini`. |
| `MAX_DEPTH` | The maximum depth of the tree. A depth of `0` means 1 leaf node, and a depth of `1` means 1 internal node and 2 leaf nodes. |
| `MIN_INFO_GAIN` | The minimum information gain required for a split to be considered at a tree node. |
| `MIN_WEIGHT_FRACTION_PER_NODE` | The minimum fraction of the weighted sample count that each child must have after a split. |
| `MIN_INSTANCES_PER_NODE` | The minimum number of instances each child must have after a split. |
| `MAX_MEMORY_IN_MB` | The maximum memory, in MB, allocated to histogram aggregation. |
| `BOOTSTRAP` | Specifies whether bootstrap samples are used when building trees. Possible values: `true`, `false`. |
| `NUM_TREES` | The number of trees to train. If `1`, no bootstrapping is used. If greater than `1`, bootstrapping is applied. |
| `SUBSAMPLING_RATE` | The fraction of the training data used to learn each decision tree. Must be in the range `(0, 1]`. |
| `LEAF_COL` | The name of the column for leaf indices. |
| `PREDICTION_COL` | The name of the column used for predictions. |
| `SEED` | The random seed. |
| `WEIGHT_COL` | The name of the weight column. If this is not set or is empty, all instance weights are treated as `1.0`. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'random_forest_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
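Forest size and sampling are also set through `OPTIONS`. The values below are hypothetical, assuming parameters are passed as additional key-value pairs:

```sql
CREATE MODEL rf_model OPTIONS(
  type = 'random_forest_regression',
  NUM_TREES = 100,        -- more than 1 tree, so bootstrapping is applied
  MAX_DEPTH = 10,         -- limit the depth of each tree
  SUBSAMPLING_RATE = 0.8  -- fraction of data per tree; must be in (0, 1]
) AS
SELECT col1, col2, col3 FROM training-dataset
```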
## Survival Regression {#survival-regression}
Survival Regression is used to fit a parametric survival regression model, known as the Accelerated Failure Time (AFT) model, based on the Weibull distribution. It can stack instances into blocks for enhanced performance.
### Parameters
The table below outlines key parameters for configuring and optimizing the performance of Survival Regression.
| Parameter | Description |
| --- | --- |
| `MAX_ITER` | The maximum number of iterations to run. |
| `TOL` | The convergence tolerance. Default: `1E-6`. |
| `AGGREGATION_DEPTH` | The suggested depth for `treeAggregate`. If the feature dimensions or the number of partitions are large, this parameter can be set to a larger value. |
| `FIT_INTERCEPT` | Specifies whether to fit an intercept term. Possible values: `true`, `false`. |
| `PREDICTION_COL` | The name of the column used for predictions. |
| `CENSOR_COL` | The name of the censor column. A value of `1` indicates that the event has occurred (uncensored), while `0` means the event is censored. |
| `MAX_BLOCK_SIZE_IN_MB` | The maximum memory, in MB, for stacking input data into blocks. A value of `0` allows the size to be adjusted automatically. |

### Example
```sql
CREATE MODEL modelname OPTIONS(
  type = 'survival_regression'
) AS
SELECT col1, col2, col3 FROM training-dataset
```
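The censor column is central to the AFT model, so a realistic call names it in `OPTIONS`. This sketch is illustrative: it assumes key-value parameter passing, and the `censor` column name is hypothetical.

```sql
CREATE MODEL aft_model OPTIONS(
  type = 'survival_regression',
  CENSOR_COL = 'censor',    -- 1 = event occurred (uncensored), 0 = censored
  MAX_ITER = 100,           -- iteration cap
  MAX_BLOCK_SIZE_IN_MB = 0  -- let the block size be adjusted automatically
) AS
SELECT col1, col2, col3 FROM training-dataset
```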
## Next steps
After reading this document, you now know how to configure and use various regression algorithms. Next, refer to the documents on classification and clustering to learn about other types of advanced statistical models.