Optimizers¶
Overview¶
The scikit-learn library provides functionality for training linear models and a large number of related tools. The present module provides simplified interfaces for various linear model regression methods. These methods are set up to work out of the box for typical problems in cluster expansion and force constant potential construction, which includes slight adjustments to the scikit-learn default values.
If you need more flexibility, extended functionality or the ability to fine-tune parameters that are not included in this interface, it is possible to use scikit-learn directly as demonstrated by an example in the advanced topics section.
The following linear regression methods are available through the hiphive interface:
Ordinary Least Squares (OLS)
Least Absolute Shrinkage and Selection Operator (LASSO)
Adaptive-LASSO
Ridge and Bayesian-ridge
Elasticnet
Recursive Feature Elimination (RFE)
Automatic Relevance Determination Regression (ARDR)
Fitting using Orthogonal Matching Pursuit (OMP)
L1-regularization with split-Bregman
The most commonly used fit methods in the present context are LASSO, automatic relevance determination regression (ARDR), recursive feature elimination with \(\ell_2\)-fitting (RFE-L2) as well as ordinary least-squares optimization (OLS). Their usage and performance are illustrated by the feature selection and learning curve examples in the advanced topics section. Below follows a short summary of the main algorithms. More information about the available linear models can be found in the scikit-learn documentation.
Least-squares¶
Ordinary least-squares (OLS) optimization provides a solution to the linear problem
\[\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y},\]
where \(\boldsymbol{A}\) is the sensing matrix, \(\boldsymbol{y}\) is the vector of target values, and \(\boldsymbol{x}\) is the solution (parameter vector) that one seeks to obtain. The objective is given by
\[\left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert_2^2.\]
The OLS method is chosen by setting the fit_method keyword to least-squares.
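The least-squares objective can be illustrated with plain NumPy. The following is a minimal sketch on synthetic data, not the hiphive implementation; the matrix sizes, the noise level and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_params = 50, 4

A = rng.normal(size=(n_rows, n_params))           # sensing (fit) matrix
x_true = np.array([1.5, -2.0, 0.0, 0.7])          # parameters to recover
y = A @ x_true + 0.01 * rng.normal(size=n_rows)   # noisy target values

# lstsq returns the x that minimizes ||A x - y||_2^2
x_ols, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
print(x_ols)
```

Since there are more rows than parameters and the noise is small, the fitted vector should lie close to `x_true`.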
LASSO¶
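The least absolute shrinkage and selection operator (LASSO) is a method for performing variable selection and regularization in problems in statistics and machine learning. The optimization objective is given by
\[\frac{1}{2N} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert_2^2 + \alpha \left\Vert\boldsymbol{x}\right\Vert_1,\]
where \(N\) is the number of target values (rows of \(\boldsymbol{A}\)).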
While the first term ensures that \(\boldsymbol{x}\) is a solution to the linear problem at hand, the second term introduces regularization and guides the algorithm toward finding sparse solutions, in the spirit of compressive sensing. In general, LASSO is suited for solving strongly underdetermined problems.
The LASSO optimizer is chosen by setting the fit_method keyword to lasso. The \(\alpha\) parameter is set via the alpha keyword. If no value is specified a line scan will be carried out automatically to determine the optimal value.
Parameter | Type | Description | Default |
---|---|---|---|
alpha | float | controls the sparsity of the solution vector | None |
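To make the role of the \(\ell_1\) term concrete, the sketch below minimizes a LASSO-type objective, \(\frac{1}{2N}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2^2 + \alpha\|\boldsymbol{x}\|_1\), with a simple proximal gradient (ISTA) loop. This is an illustration only: it is not the solver used by scikit-learn or hiphive, and the function names, the problem size and the value of `alpha` are assumptions.

```python
import numpy as np

def soft_threshold(values, threshold):
    """Shrink each entry toward zero by `threshold` (proximal map of the l1-norm)."""
    return np.sign(values) * np.maximum(np.abs(values) - threshold, 0.0)

def lasso_objective(A, y, x, alpha):
    """Least-squares term scaled by 1/(2N) plus the l1 penalty."""
    return np.sum((A @ x - y) ** 2) / (2 * len(y)) + alpha * np.sum(np.abs(x))

def lasso_ista(A, y, alpha, n_iters=5000):
    """Proximal gradient descent with a fixed step of 1/L."""
    n_rows, n_params = A.shape
    step = n_rows / np.linalg.norm(A, 2) ** 2   # L = sigma_max(A)^2 / N
    x = np.zeros(n_params)
    for _ in range(n_iters):
        gradient = A.T @ (A @ x - y) / n_rows
        x = soft_threshold(x - step * gradient, step * alpha)
    return x

# Underdetermined toy problem: 30 target values, 60 parameters, 3 of them non-zero.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = A @ x_true

alpha = 0.05
x_lasso = lasso_ista(A, y, alpha)
print(np.count_nonzero(x_lasso), "non-zero parameters out of", len(x_lasso))
```

Larger values of `alpha` push more parameters to exactly zero, which is the sparsity control described above.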
Automatic relevance determination regression (ARDR)¶
Automatic relevance determination regression (ARDR) is an optimization algorithm provided by scikit-learn that is similar to Bayesian Ridge Regression, which provides a probabilistic model of the regression problem at hand. The method is also known as Sparse Bayesian Learning and Relevance Vector Machine.
The ARDR optimizer is chosen by setting the fit_method keyword to ardr. The threshold lambda parameter, which controls the sparsity of the solution vector, is set via the threshold_lambda keyword (default: 1e6).
Parameter | Type | Description | Default |
---|---|---|---|
threshold_lambda | float | controls the sparsity of the solution vector | 1e6 |
split-Bregman¶
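The split-Bregman method [GolOsh09] is designed to solve a broad class of \(\ell_1\)-regularized problems. The solution vector \(\boldsymbol{x}\) is given by
\[\boldsymbol{x} = \arg\min_{\boldsymbol{x}, \boldsymbol{d}} \left\Vert\boldsymbol{d}\right\Vert_1 + \frac{1}{2} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2 + \frac{\lambda}{2} \left\Vert\boldsymbol{d} - \mu \boldsymbol{x}\right\Vert^2,\]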
where \(\boldsymbol{d}\) is an auxiliary quantity, while \(\mu\) and \(\lambda\) are hyperparameters that control the sparseness of the solution and the efficiency of the algorithm.
The approach can be accelerated by adding a preconditioning step [ZhoSadAbe19]. This speed-up enables efficient hyperparameter optimization of the \(\mu\) value. By default, the split-bregman fit method will try a range of \(\mu\) values and choose the optimal one based on cross-validation.
The split-Bregman implementation supports the following keywords.
Parameter | Type | Description | Default |
---|---|---|---|
mu | float | sparseness parameter | |
 | | weight of additional L2-norm in split-Bregman | |
 | | maximal number of split-Bregman iterations | |
 | | convergence criterion of iterative minimization | |
 | | convergence criterion of conjugate gradient step | |
 | | maximal number of conjugate gradient iterations | |
 | | number of CV splits for finding optimal mu value | |
 | | how often to print fitting information to stdout | |
Recursive feature elimination¶
Recursive feature elimination (RFE) is a feature selection algorithm that obtains the optimal features by carrying out a series of fits, starting with the full set of parameters and then iteratively eliminating the less important ones. RFE needs to be combined with a specific fit method. Since RFE may require many hundreds of single fits, it is often advisable to use ordinary least-squares as the training method, which is the default behavior. The present implementation is based on the implementation of feature selection in scikit-learn.
The RFE optimizer is chosen by setting the fit_method keyword to rfe. The n_features keyword allows one to specify the number of features to select. If this parameter is left unspecified RFE with cross-validation will be used to determine the optimal number of features.
After the optimal number of features has been determined the final model is trained. The fit method for the final fit can be controlled via final_estimator. Here, estimator and final_estimator can be set to any of the fit methods described in this section. For example, estimator='lasso' implies that a LASSO-CV scan is carried out for each fit in the RFE algorithm.
Parameter | Type | Description | Default |
---|---|---|---|
n_features | int | number of features to select | |
 | | number of parameters to eliminate | |
 | | percentage of parameters to eliminate | |
 | | number of CV splits (90/10) used when optimizing n_features | |
estimator | str | fit method to be used in RFE algorithm | least-squares |
final_estimator | str | fit method to be used in the final fit | = estimator |
 | | keyword arguments for fit method defined by estimator | |
 | | keyword arguments for fit method defined by final_estimator | |
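The elimination loop can be sketched in a few lines of NumPy. This is a simplified illustration with ordinary least-squares as the fit method and a fixed number of features. The helper `rfe_least_squares`, its `step` argument and the ranking of parameters by absolute value are assumptions for this example; they do not reproduce the scikit-learn or hiphive code.

```python
import numpy as np

def rfe_least_squares(A, y, n_features, step=1):
    """Repeatedly fit with OLS and drop the `step` smallest-magnitude parameters."""
    active = np.arange(A.shape[1])            # column indices still in the model
    while len(active) > n_features:
        x, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        n_remove = min(step, len(active) - n_features)
        keep = np.sort(np.argsort(np.abs(x))[n_remove:])
        active = active[keep]
    # Final fit using only the selected columns.
    x_final, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
    parameters = np.zeros(A.shape[1])
    parameters[active] = x_final
    return parameters

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 10))
x_true = np.zeros(10)
x_true[[1, 4, 8]] = [3.0, -2.0, 1.0]
y = A @ x_true + 0.01 * rng.normal(size=100)

parameters = rfe_least_squares(A, y, n_features=3)
print(np.flatnonzero(parameters))
```

Each round requires a full fit, which is why a cheap fit method such as ordinary least-squares keeps the total cost manageable.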
Note
When running on multi-core systems please be mindful of memory consumption. By default all CPUs will be used (n_jobs=-1), which will duplicate data and can require a lot of memory, potentially giving rise to errors. To prevent this behavior you can set the [n_jobs parameter](https://scikit-learn.org/stable/glossary.html#term-n-jobs) explicitly, which is handed over directly to scikit-learn.
Other methods¶
The optimizers furthermore support the ridge method (ridge), the elastic net method (elasticnet) as well as Bayesian ridge regression (bayesian-ridge).
Optimizer¶
-
class hiphive.fitting.Optimizer(fit_data, fit_method='least-squares', standardize=True, train_size=0.9, test_size=None, train_set=None, test_set=None, check_condition=True, seed=42, **kwargs)
This optimizer finds a solution to the linear \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{y}\) problem.
One has to specify either train_size/test_size or train_set/test_set. If either train_set or test_set (or both) is specified the fractions will be ignored.
Warning
Repeatedly setting up an Optimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple Optimizer instances.
- Parameters
fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
fit_method (str) – method to be used for training; possible choices are “ardr”, “bayesian-ridge”, “elasticnet”, “lasso”, “least-squares”, “omp”, “rfe”, “ridge”, “split-bregman”
standardize (bool) – if True the fit matrix and target values are standardized before fitting, meaning columns in the fit matrix and the target values are rescaled to have a standard deviation of 1.0.
train_size (float or int) – If float represents the fraction of fit_data (rows) to be used for training. If int, represents the absolute number of rows to be used for training.
test_size (float or int) – If float represents the fraction of fit_data (rows) to be used for testing. If int, represents the absolute number of rows to be used for testing.
train_set (tuple or list(int)) – indices of rows of A/y to be used for training
test_set (tuple or list(int)) – indices of rows of A/y to be used for testing
check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)
seed (int) – seed for pseudo random number generator
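The interplay of train_size and seed can be pictured with a small sketch. The helper `split_rows` below is hypothetical and only mimics the idea of a seeded random split of row indices; the actual splitting logic inside hiphive may differ.

```python
import numpy as np

def split_rows(n_rows, train_size=0.9, seed=42):
    """Return sorted train and test row indices from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    if isinstance(train_size, float):
        n_train = int(round(train_size * n_rows))   # fraction of rows
    else:
        n_train = train_size                        # absolute number of rows
    shuffled = rng.permutation(n_rows)
    return np.sort(shuffled[:n_train]), np.sort(shuffled[n_train:])

train_set, test_set = split_rows(20, train_size=0.9, seed=42)
print(train_set, test_set)
```

Calling the helper twice with the same seed returns the same split, which is the reason for the warning above about reusing the seed across several instances.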
-
train_scatter_data
target and predicted value for each row in the training set
- Type
ScatterData
-
test_scatter_data
target and predicted value for each row in the test set
- Type
ScatterData
-
compute_rmse(A, y)
Returns the root mean squared error (RMSE) using \(\boldsymbol{A}\), \(\boldsymbol{y}\), and the vector of fitted parameters \(\boldsymbol{x}\), corresponding to \(\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2 / \sqrt{N}\), where \(N\) is the number of rows of \(\boldsymbol{A}\).
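As a worked example, the RMSE is the square root of the mean squared residual, i.e., the 2-norm of the residual vector divided by the square root of the number of rows. The standalone function below is a sketch of that definition with made-up numbers; it takes the parameter vector explicitly, unlike the method documented here.

```python
import numpy as np

def compute_rmse(A, y, x):
    """Root mean squared error of the predictions A @ x with respect to y."""
    residuals = A @ x - y
    return np.sqrt(np.mean(residuals ** 2))

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0, 3.0, 6.0])   # predictions are [1, 2, 3, 4]

rmse = compute_rmse(A, y, x)          # residuals [0, 0, 0, -2] give sqrt(4 / 4) = 1
print(rmse)
```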
-
property
fit_method
fit method
- Return type
str
-
get_contributions(A)
Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.
-
property
n_nonzero_parameters
number of non-zero parameters
- Return type
int
-
property
n_parameters
number of parameters (=columns in A matrix)
- Return type
int
-
property
n_target_values
number of target values (=rows in A matrix)
- Return type
int
-
property
parameters
copy of parameter vector
- Return type
-
property
parameters_norm
the norm of the parameters
- Return type
float
-
predict(A)
Predicts data given an input matrix \(\boldsymbol{A}\), i.e., \(\boldsymbol{A}\boldsymbol{x}\), where \(\boldsymbol{x}\) is the vector of the fitted parameters. The method returns the vector of predicted values or a float if a single row is provided as input.
-
property
rmse_test
root mean squared error for test set
- Return type
float
-
property
rmse_train
root mean squared error for training set
- Return type
float
-
property
seed
seed used to initialize pseudo random number generator
- Return type
int
-
property
standardize
if True standardize the fit matrix before fitting
- Return type
bool
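The effect of standardization can be sketched as follows: the columns of the fit matrix and the target vector are divided by their standard deviations before fitting, and the fitted parameters are mapped back to the original units afterwards. This is an illustrative NumPy sketch with assumed variable names and synthetic data, not the hiphive code.

```python
import numpy as np

rng = np.random.default_rng(3)
# Columns of A deliberately live on very different scales.
A = rng.normal(size=(60, 3)) * np.array([1.0, 10.0, 100.0])
y = A @ np.array([2.0, 0.3, -0.01]) + 0.01 * rng.normal(size=60)

# Rescale columns of A and the target vector to unit standard deviation.
std_A = A.std(axis=0)
std_y = y.std()
x_scaled, *_ = np.linalg.lstsq(A / std_A, y / std_y, rcond=None)

# Map the parameters back to the original units of A and y.
x = x_scaled * std_y / std_A

# Reference fit without any rescaling.
x_direct, *_ = np.linalg.lstsq(A, y, rcond=None)
print(x, x_direct)
```

For plain least-squares both routes give the same parameters. For regularized methods the rescaling matters, because it changes how strongly each parameter is penalized.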
-
property
summary
comprehensive information about the optimizer
- Return type
Dict[str, Any]
-
property
test_fraction
fraction of rows included in test set
- Return type
float
-
property
test_set
indices of rows included in the test set
- Return type
List[int]
-
property
test_size
number of rows included in test set
- Return type
int
-
train()
Carries out training.
- Return type
None
-
property
train_fraction
fraction of rows included in training set
- Return type
float
-
property
train_set
indices of rows included in the training set
- Return type
List[int]
-
property
train_size
number of rows included in training set
- Return type
int
-
write_summary(fname)
Writes summary dict to file
EnsembleOptimizer¶
-
class hiphive.fitting.EnsembleOptimizer(fit_data, fit_method='least-squares', standardize=True, ensemble_size=50, train_size=1.0, bootstrap=True, check_condition=True, seed=42, **kwargs)
The ensemble optimizer carries out a series of single optimization runs using the Optimizer class in order to solve the linear \(\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y}\) problem. Subsequently, it provides access to various ensemble averaged quantities such as errors and parameters.
Warning
Repeatedly setting up an EnsembleOptimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple EnsembleOptimizer instances.
- Parameters
fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
fit_method (str) – method to be used for training; possible choices are “ardr”, “bayesian-ridge”, “elasticnet”, “lasso”, “least-squares”, “omp”, “rfe”, “ridge”, “split-bregman”
standardize (bool) – if True the fit matrix and target values are standardized before fitting, meaning columns in the fit matrix and the target values are rescaled to have a standard deviation of 1.0.
ensemble_size (int) – number of fits in the ensemble
train_size (float or int) – if float represents the fraction of fit_data (rows) to be used for training; if int, represents the absolute number of rows to be used for training
bootstrap (bool) – if True sampling will be carried out with replacement
check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)
seed (int) – seed for pseudo random number generator
-
property
bootstrap
True if sampling is carried out with replacement
- Return type
bool
-
compute_rmse(A, y)
Returns the root mean squared error (RMSE) using \(\boldsymbol{A}\), \(\boldsymbol{y}\), and the vector of fitted parameters \(\boldsymbol{x}\), corresponding to \(\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2 / \sqrt{N}\), where \(N\) is the number of rows of \(\boldsymbol{A}\).
-
property
ensemble_size
number of train rounds
- Return type
int
-
property
error_matrix
matrix of fit errors with dimensions (N, M), where N is the number of target values and M is the number of fits (i.e., the size of the ensemble)
- Return type
-
property
fit_method
fit method
- Return type
str
-
get_contributions(A)
Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.
-
property
n_nonzero_parameters
number of non-zero parameters
- Return type
int
-
property
n_parameters
number of parameters (=columns in A matrix)
- Return type
int
-
property
n_target_values
number of target values (=rows in A matrix)
- Return type
int
-
property
parameters
copy of parameter vector
- Return type
-
property
parameters_norm
the norm of the parameters
- Return type
float
-
property
parameters_splits
all parameters vectors in the ensemble
- Return type
List[ndarray]
-
property
parameters_std
standard deviation for each parameter
- Return type
-
predict(A, return_std=False)
Predicts data given an input matrix \(\boldsymbol{A}\), i.e., \(\boldsymbol{A}\boldsymbol{x}\), where \(\boldsymbol{x}\) is the vector of the fitted parameters. The method returns the vector of predicted values and optionally also the vector of standard deviations.
By using all parameter vectors in the ensemble a standard deviation of the prediction can be obtained.
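The idea behind the ensemble can be sketched with NumPy: draw rows with replacement, fit each resampled data set, average the parameter vectors, and use the spread of the individual predictions as an uncertainty estimate. The sketch assumes ordinary least-squares fits and synthetic data. The variable names mirror the properties documented here, but the code is not the hiphive implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_params, ensemble_size = 40, 3, 50

A = rng.normal(size=(n_rows, n_params))
x_true = np.array([1.0, -0.5, 2.0])
y = A @ x_true + 0.1 * rng.normal(size=n_rows)

parameters_splits = []
for _ in range(ensemble_size):
    # Bootstrap: sample as many rows as available, with replacement.
    rows = rng.choice(n_rows, size=n_rows, replace=True)
    x, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
    parameters_splits.append(x)
parameters_splits = np.array(parameters_splits)      # shape (ensemble_size, n_params)

parameters = parameters_splits.mean(axis=0)          # ensemble-averaged model
parameters_std = parameters_splits.std(axis=0)       # spread of each parameter

# Prediction with an uncertainty estimate from the ensemble members.
A_new = rng.normal(size=(5, n_params))
predictions = A_new @ parameters
prediction_std = (A_new @ parameters_splits.T).std(axis=1)
print(predictions, prediction_std)
```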
-
property
rmse_test
ensemble average of root mean squared error over test sets
- Return type
float
-
property
rmse_test_ensemble
root mean squared test errors obtained for each fit in the ensemble
- Return type
-
property
rmse_train
ensemble average of root mean squared error over train sets
- Return type
float
-
property
rmse_train_ensemble
root mean squared train errors obtained for each fit in the ensemble
- Return type
-
property
seed
seed used to initialize pseudo random number generator
- Return type
int
-
property
standardize
if True standardize the fit matrix before fitting
- Return type
bool
-
property
summary
comprehensive information about the optimizer
- Return type
Dict[str, Any]
-
train()
Carries out ensemble training and constructs the final model by averaging over all models in the ensemble.
- Return type
None
-
property
train_fraction
fraction of input data used for training; this value can differ slightly from the value set during initialization due to rounding
- Return type
float
-
property
train_size
number of rows included in train sets; note that this will be different from the number of unique rows if bootstrapping is used
- Return type
int
-
write_summary(fname)
Writes summary dict to file
CrossValidationEstimator¶
-
class hiphive.fitting.CrossValidationEstimator(fit_data, fit_method='least-squares', standardize=True, validation_method='k-fold', n_splits=10, check_condition=True, seed=42, **kwargs)
This class provides an optimizer with cross-validation for solving the linear \(\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y}\) problem. Cross-validation (CV) scores are calculated by splitting the available reference data in multiple different ways. The class also produces the finalized model (using the full input data), for which the CV score serves as an estimate of the performance.
Warning
Repeatedly setting up a CrossValidationEstimator and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple CrossValidationEstimator instances.
- Parameters
fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
fit_method (str) – method to be used for training; possible choices are “ardr”, “bayesian-ridge”, “elasticnet”, “lasso”, “least-squares”, “omp”, “rfe”, “ridge”, “split-bregman”
standardize (bool) – if True the fit matrix and target values are standardized before fitting, meaning columns in the fit matrix and the target values are rescaled to have a standard deviation of 1.0.
validation_method (str) – method to use for cross-validation; possible choices are “shuffle-split”, “k-fold”
n_splits (int) – number of times the fit data set will be split for the cross-validation
check_condition (bool) – if True the condition number will be checked (this can be slightly more time consuming for larger matrices)
seed (int) – seed for pseudo random number generator
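A k-fold validation loop can be sketched as follows: the rows are shuffled and divided into n_splits folds, each fold is held out once while the model is fitted to the remaining rows, and the validation RMSE values are averaged. The helper `k_fold_rmse`, the use of ordinary least-squares and the synthetic data are assumptions for this illustration rather than the hiphive implementation.

```python
import numpy as np

def k_fold_rmse(A, y, n_splits=10, seed=42):
    """Return the validation RMSE of an OLS fit for each of the k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    rmse_validation_splits = []
    for validation_rows in folds:
        # Train on all rows that are not part of the current validation fold.
        train_rows = np.setdiff1d(np.arange(len(y)), validation_rows)
        x, *_ = np.linalg.lstsq(A[train_rows], y[train_rows], rcond=None)
        residuals = A[validation_rows] @ x - y[validation_rows]
        rmse_validation_splits.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(rmse_validation_splits)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
y = A @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)

rmse_validation_splits = k_fold_rmse(A, y, n_splits=10)
rmse_validation = rmse_validation_splits.mean()
print(rmse_validation)
```

The averaged value plays the role of the CV score; the final model would then be fitted to all rows.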
-
train_scatter_data
contains target and predicted values from each individual training set in the cross-validation split; ScatterData is a namedtuple.
- Type
ScatterData
-
validation_scatter_data
contains target and predicted values from each individual validation set in the cross-validation split; ScatterData is a namedtuple.
- Type
ScatterData
-
compute_rmse(A, y)
Returns the root mean squared error (RMSE) using \(\boldsymbol{A}\), \(\boldsymbol{y}\), and the vector of fitted parameters \(\boldsymbol{x}\), corresponding to \(\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2 / \sqrt{N}\), where \(N\) is the number of rows of \(\boldsymbol{A}\).
-
property
fit_method
fit method
- Return type
str
-
get_contributions(A)
Returns the average contribution for each row of A to the predicted values from each element of the parameter vector.
-
property
n_nonzero_parameters
number of non-zero parameters
- Return type
int
-
property
n_nonzero_parameters_splits
number of non-zero parameters for each split
- Return type
-
property
n_parameters
number of parameters (=columns in A matrix)
- Return type
int
-
property
n_splits
number of splits (folds) used for cross-validation
- Return type
int
-
property
n_target_values
number of target values (=rows in A matrix)
- Return type
int
-
property
parameters
copy of parameter vector
- Return type
-
property
parameters_norm
the norm of the parameters
- Return type
float
-
property
parameters_splits
all parameters obtained during cross-validation
- Return type
-
predict(A)
Predicts data given an input matrix \(\boldsymbol{A}\), i.e., \(\boldsymbol{A}\boldsymbol{x}\), where \(\boldsymbol{x}\) is the vector of the fitted parameters. The method returns the vector of predicted values or a float if a single row is provided as input.
-
property
rmse_train
average root mean squared training error obtained during cross-validation
- Return type
float
-
property
rmse_train_final
root mean squared error when using the full set of input data
- Return type
float
-
property
rmse_train_splits
root mean squared training errors obtained during cross-validation
- Return type
-
property
rmse_validation
average root mean squared cross-validation error
- Return type
float
-
property
rmse_validation_splits
root mean squared validation errors obtained during cross-validation
- Return type
-
property
seed
seed used to initialize pseudo random number generator
- Return type
int
-
property
standardize
if True standardize the fit matrix before fitting
- Return type
bool
-
property
summary
comprehensive information about the optimizer
- Return type
Dict[str, Any]
-
train()
Constructs the final model using all input data available.
- Return type
None
-
validate()
Runs validation.
- Return type
None
-
property
validation_method
validation method name
- Return type
str
-
write_summary(fname)
Writes summary dict to file