Optimizers¶
Overview¶
The scikit-learn library provides functionality for training linear models as well as a large number of related tools. The present module provides simplified interfaces to various linear regression methods. These interfaces are set up such that they work out of the box for typical problems in cluster expansion and force constant potential construction, including slight adjustments to the scikit-learn default values.
If you need more flexibility, extended functionality, or the ability to fine-tune parameters that are not exposed by this interface, it is possible to use scikit-learn directly, as demonstrated by an example in the advanced topics section.
The most commonly used fit methods in the present context are LASSO, automatic relevance determination regression (ARDR), recursive feature elimination with \(\ell_2\) fitting (RFE-L2), and ordinary least-squares optimization (OLS). Their usage and performance is illustrated by the feature selection and learning curve examples in the advanced topics section. Below follows a short summary of the main algorithms. More information about the available linear models can be found in the scikit-learn documentation.
Least-squares¶
Ordinary least-squares (OLS) optimization provides a solution to the linear problem
\[\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y},\]
where \(\boldsymbol{A}\) is the sensing matrix, \(\boldsymbol{y}\) is the vector of target values, and \(\boldsymbol{x}\) is the solution (parameter vector) that one seeks to obtain. The objective is given by
\[\left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2_2.\]
The OLS optimizer is chosen by setting the fit_method keyword to least-squares.
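For illustration, the OLS solution can be obtained directly with numpy (a minimal sketch, not the hiphive implementation; the toy data are arbitrary):

```python
import numpy as np

# Toy sensing matrix A (N=20 rows, M=3 columns) and noise-free targets y.
rng = np.random.default_rng(42)
A = rng.normal(size=(20, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true

# Ordinary least-squares: minimize ||Ax - y||_2^2.
x_fit, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_fit, x_true))  # True: exact recovery for noise-free, full-rank data
```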
LASSO¶
The least absolute shrinkage and selection operator (LASSO) is a method for performing variable selection and regularization in statistics and machine learning. The optimization objective is given by
\[\frac{1}{2 N} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2_2 + \alpha \left\Vert\boldsymbol{x}\right\Vert_1.\]
While the first term ensures that \(\boldsymbol{x}\) is a solution to the linear problem at hand, the second term introduces regularization and guides the algorithm toward finding sparse solutions, in the spirit of compressive sensing. In general, LASSO is suited for solving strongly underdetermined problems.
The LASSO optimizer is chosen by setting the fit_method keyword to lasso. The \(\alpha\) parameter is set via the alpha keyword. If no value is specified, a line scan will be carried out automatically to determine the optimal value. The training data can be automatically centered by setting the fit_intercept keyword to True.
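As a sketch of the effect of the \(\ell_1\) penalty, one can call scikit-learn's Lasso class directly (scikit-learn is assumed to be available; the toy data and the alpha value are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy problem in which only two of ten features carry signal.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -3.0]
y = A @ x_true

# The l1 penalty (weighted by alpha) drives irrelevant coefficients toward zero.
model = Lasso(alpha=0.1, fit_intercept=False)
model.fit(A, y)
print(model.coef_.round(2))  # the two active coefficients dominate
```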
ARDR¶
Automatic relevance determination regression (ARDR) is an optimization algorithm provided by scikit-learn that is similar to Bayesian ridge regression and provides a probabilistic model of the regression problem at hand. The method is also known as sparse Bayesian learning and relevance vector machine.
The ARDR optimizer is chosen by setting the fit_method keyword to ardr. The threshold lambda parameter that is forwarded to scikit-learn is set via the threshold_lambda keyword (default: 1e6). The training data can be automatically centered by setting the fit_intercept keyword to True.
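A minimal sketch calling scikit-learn's ARDRegression directly, including the threshold_lambda parameter mentioned above (scikit-learn is assumed to be available; the toy data are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Toy problem with two irrelevant features and weak noise.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
x_true = np.array([1.5, 0.0, -2.0, 0.0, 0.7])
y = A @ x_true + 0.01 * rng.normal(size=50)

# threshold_lambda controls when a feature is pruned as irrelevant.
model = ARDRegression(threshold_lambda=1e6, fit_intercept=False)
model.fit(A, y)
print(model.coef_.round(2))
```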
RFE-L2¶
Recursive feature elimination (RFE) with \(\ell_2\) fitting (RFE-L2) first identifies the important features using recursive feature elimination as implemented in scikit-learn and then carries out an ordinary least-squares fit using the selected features.
The RFE-L2 optimizer is chosen by setting the fit_method keyword to rfe-l2. The n_features keyword allows one to specify the number of features to select. If this parameter is left unspecified, RFE with cross-validation will be used to determine the optimal number of features. The number of parameters to eliminate in each iteration is set via the step keyword.
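The two-step procedure can be sketched with scikit-learn's RFE class followed by a plain least-squares fit (a simplified stand-in for the actual implementation; the toy data are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy problem in which three of eight features carry signal.
rng = np.random.default_rng(2)
A = rng.normal(size=(40, 8))
x_true = np.zeros(8)
x_true[[0, 3, 6]] = [1.0, -2.0, 3.0]
y = A @ x_true

# Step 1: recursively eliminate the weakest feature, one per iteration (step=1).
selector = RFE(LinearRegression(fit_intercept=False), n_features_to_select=3, step=1)
selector.fit(A, y)
selected = np.where(selector.support_)[0]

# Step 2: ordinary least-squares fit using only the selected features.
x_sel, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
print(selected, x_sel.round(2))
```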
Other methods¶
Some other optimization methods are also available, including
- elastic net (elasticnet)
- split-Bregman (split-bregman)
- Bayesian ridge regression (bayesian-ridge)
Optimizer¶

class
hiphive.fitting.
Optimizer
(fit_data, fit_method='least-squares', standardize=True, train_size=0.75, test_size=None, train_set=None, test_set=None, check_condition=True, seed=42, **kwargs)[source] Optimizer for a single \(Ax = y\) fit.
One has to specify either train_size/test_size or train_set/test_set. If either train_set or test_set (or both) is specified, the fractions will be ignored.
Warning
Repeatedly setting up an Optimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple Optimizer instances.
Parameters:  fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
 standardize (bool) – whether or not to standardize the fit matrix before fitting
 train_size (float or int) – If float represents the fraction of fit_data (rows) to be used for training. If int, represents the absolute number of rows to be used for training.
 test_size (float or int) – If float represents the fraction of fit_data (rows) to be used for testing. If int, represents the absolute number of rows to be used for testing.
 train_set (tuple or list(int)) – indices of rows of A/y to be used for training
 test_set (tuple or list(int)) – indices of rows of A/y to be used for testing
 check_condition (bool) –
whether or not to carry out a check of the condition number
N.B.: This can be slightly more time consuming for larger matrices.
 seed (int) – seed for pseudo random number generator

train_scatter_data
¶ target and predicted value for each row in the training set
Type: ScatterData object (namedtuple)

test_scatter_data
¶ target and predicted value for each row in the test set
Type: ScatterData object (namedtuple)

compute_rmse
(A, y) Computes the root mean square error using \(A\), \(y\), and the vector of fitted parameters \(x\), corresponding to \(\Vert A x - y \Vert_2\).
Parameters:  A (numpy.ndarray) – fit matrix (N,M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
 y (numpy.ndarray) – vector of target values
Returns: root mean squared error
Return type: float

contributions_test
average contribution to the predicted values for the test set from each parameter
Type: numpy.ndarray

contributions_train
average contribution to the predicted values for the train set from each parameter
Type: numpy.ndarray

fit_method
fit method
Type: str

get_contributions
(A) Computes the average contribution to the predicted values from each element of the parameter vector.
Parameters: A (numpy.ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: average contribution for each row of A from each parameter
Return type: numpy.ndarray

n_nonzero_parameters
number of nonzero parameters
Type: int

n_parameters
number of parameters (=columns in A matrix)
Type: int

n_target_values
number of target values (=rows in A matrix)
Type: int

parameters
copy of parameter vector
Type: numpy.ndarray

predict
(A) Predicts data given an input matrix \(A\), i.e., \(Ax\), where \(x\) is the vector of the fitted parameters.
Parameters: A (numpy.ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: vector of predicted values; float if single row provided as input
Return type: numpy.ndarray or float

rmse_test
root mean squared error for test set
Type: float

rmse_train
root mean squared error for training set
Type: float

seed
seed used to initialize pseudo random number generator
Type: int

standardize
whether or not to standardize the fit matrix before fitting
Type: bool

summary
comprehensive information about the optimizer
Type: dict

test_fraction
fraction of rows included in test set
Type: float

test_set
indices of rows included in the test set
Type: list

test_size
number of rows included in test set
Type: int

train
()[source] Carries out training.

train_fraction
fraction of rows included in training set
Type: float

train_set
indices of rows included in the training set
Type: list

train_size
number of rows included in training set
Type: int
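The train/test workflow of the class above can be summarized by the following plain-numpy sketch (simple_optimizer is a hypothetical helper restricted to the least-squares case, not part of hiphive):

```python
import numpy as np

def simple_optimizer(A, y, train_size=0.75, seed=42):
    """Sketch of the Optimizer workflow: seeded split, fit, train/test RMSE."""
    rng = np.random.default_rng(seed)
    n_rows = A.shape[0]
    train_set = rng.choice(n_rows, size=int(train_size * n_rows), replace=False)
    test_set = np.setdiff1d(np.arange(n_rows), train_set)

    # Fit parameters on the training rows only.
    x, *_ = np.linalg.lstsq(A[train_set], y[train_set], rcond=None)

    def rmse(rows):
        return float(np.sqrt(np.mean((A[rows] @ x - y[rows]) ** 2)))

    return dict(parameters=x, rmse_train=rmse(train_set), rmse_test=rmse(test_set))

# Noise-free toy data: both errors should be (numerically) zero.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 4))
x_true = np.array([1.0, -1.0, 2.0, 0.5])
y = A @ x_true
results = simple_optimizer(A, y)
```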
EnsembleOptimizer¶

class
hiphive.fitting.
EnsembleOptimizer
(fit_data, fit_method='least-squares', standardize=True, ensemble_size=50, train_size=1.0, bootstrap=True, check_condition=True, seed=42, **kwargs)[source] Ensemble optimizer that carries out a series of single optimization runs using the
Optimizer
class and then provides access to various ensemble-averaged quantities, including errors and parameters.
Warning
Repeatedly setting up an EnsembleOptimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple EnsembleOptimizer instances.
Parameters:  fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
 standardize (bool) – whether or not to standardize the fit matrix before fitting
 ensemble_size (int) – number of fits in the ensemble
 train_size (float or int) – If float represents the fraction of fit_data (rows) to be used for training. If int, represents the absolute number of rows to be used for training.
 bootstrap (bool) – if True sampling will be carried out with replacement
 check_condition (bool) –
whether or not to carry out a check of the condition number
N.B.: This can be slightly more time consuming for larger matrices.
 seed (int) – seed for pseudo random number generator

bootstrap
True if sampling is carried out with replacement
Type: bool

compute_rmse
(A, y) Computes the root mean square error using \(A\), \(y\), and the vector of fitted parameters \(x\), corresponding to \(\Vert A x - y \Vert_2\).
Parameters:  A (numpy.ndarray) – fit matrix (N,M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
 y (numpy.ndarray) – vector of target values
Returns: root mean squared error
Return type: float

ensemble_size
number of train rounds
Type: int

error_matrix
matrix of fit errors where N is the number of target values and M is the number of fits (i.e., the size of the ensemble)
Type: numpy.ndarray

fit_method
fit method
Type: str

get_contributions
(A) Computes the average contribution to the predicted values from each element of the parameter vector.
Parameters: A (numpy.ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: average contribution for each row of A from each parameter
Return type: numpy.ndarray

n_nonzero_parameters
number of nonzero parameters
Type: int

n_parameters
number of parameters (=columns in A matrix)
Type: int

n_target_values
number of target values (=rows in A matrix)
Type: int

parameter_vectors
all parameter vectors in the ensemble
Type: list(numpy.ndarray)

parameters
copy of parameter vector
Type: numpy.ndarray

parameters_std
standard deviation for each parameter
Type: numpy.ndarray

predict
(A, return_std=False)[source] Predicts data given an input matrix \(A\), i.e., \(Ax\), where \(x\) is the vector of the fitted parameters.
By using all parameter vectors in the ensemble a standard deviation of the prediction can be obtained.
Parameters:  A (numpy.ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 return_std (bool) – whether or not to return the standard deviation of the prediction
Returns: vector of predicted values, vector of standard deviations
Return type: tuple(numpy.ndarray, numpy.ndarray) or tuple(float, float)

rmse_test
ensemble average of root mean squared error over test sets
Type: float

rmse_test_ensemble
root mean squared test errors obtained for each fit in the ensemble
Type: list(float)

rmse_train
ensemble average of root mean squared error over train sets
Type: float

rmse_train_ensemble
root mean squared train errors obtained for each fit in the ensemble
Type: list(float)

seed
seed used to initialize pseudo random number generator
Type: int

standardize
whether or not to standardize the fit matrix before fitting
Type: bool

summary
comprehensive information about the optimizer
Type: dict

train
()[source] Carries out ensemble training and constructs the final model by averaging over all models in the ensemble.

train_fraction
fraction of input data used for training; this value can differ slightly from the value set during initialization due to rounding
Type: float

train_size
number of rows included in the train sets; note that this will differ from the number of unique rows when bootstrapping
Type: int
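The bootstrap-and-average idea can be summarized by the following plain-numpy sketch (ensemble_fit is a hypothetical helper restricted to least-squares, not part of hiphive):

```python
import numpy as np

def ensemble_fit(A, y, ensemble_size=50, seed=42):
    """Sketch: refit on bootstrap samples, then average over the ensemble."""
    rng = np.random.default_rng(seed)
    n_rows = A.shape[0]
    parameter_vectors = []
    for _ in range(ensemble_size):
        rows = rng.choice(n_rows, size=n_rows, replace=True)  # sampling with replacement
        x, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
        parameter_vectors.append(x)
    parameter_vectors = np.array(parameter_vectors)
    # Final model: ensemble mean; the spread gives a per-parameter uncertainty.
    return parameter_vectors.mean(axis=0), parameter_vectors.std(axis=0)

# Noise-free toy data: every ensemble member recovers the same parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true
mean_params, std_params = ensemble_fit(A, y)
```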
CrossValidationEstimator¶

class
hiphive.fitting.
CrossValidationEstimator
(fit_data, fit_method='least-squares', standardize=True, validation_method='k-fold', n_splits=10, check_condition=True, seed=42, **kwargs)[source] Optimizer with cross-validation.
This optimizer can compute cross-validation (CV) scores in different ways. It can also produce the final model (using the full input data), for which the CV score is an estimate of its performance.
Warning
Repeatedly setting up a CrossValidationEstimator and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple CrossValidationEstimator instances.
Parameters:  fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
 fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
 standardize (bool) – whether or not to standardize the fit matrix before fitting
 validation_method (str) – method to use for cross-validation; possible choices are “shuffle-split”, “k-fold”
 n_splits (int) – number of times the fit data set will be split for the crossvalidation
 check_condition (bool) –
whether or not to carry out a check of the condition number
N.B.: This can be slightly more time consuming for larger matrices.
 seed (int) – seed for pseudo random number generator

train_scatter_data
¶ contains target and predicted values from each individual training set in the cross-validation split;
ScatterData
is a namedtuple.
Type: ScatterData

validation_scatter_data
¶ contains target and predicted values from each individual validation set in the cross-validation split;
ScatterData
is a namedtuple.
Type: ScatterData

compute_rmse
(A, y) Computes the root mean square error using \(A\), \(y\), and the vector of fitted parameters \(x\), corresponding to \(\Vert A x - y \Vert_2\).
Parameters:  A (numpy.ndarray) – fit matrix (N,M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
 y (numpy.ndarray) – vector of target values
Returns: root mean squared error
Return type: float

fit_method
fit method
Type: str

get_contributions
(A) Computes the average contribution to the predicted values from each element of the parameter vector.
Parameters: A (numpy.ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: average contribution for each row of A from each parameter
Return type: numpy.ndarray

n_nonzero_parameters
number of nonzero parameters
Type: int

n_parameters
number of parameters (=columns in A matrix)
Type: int

n_splits
number of splits (folds) used for cross-validation
Type: int

n_target_values
number of target values (=rows in A matrix)
Type: int

parameters
copy of parameter vector
Type: numpy.ndarray

parameters_splits
all parameters obtained during cross-validation
Type: numpy.ndarray

predict
(A) Predicts data given an input matrix \(A\), i.e., \(Ax\), where \(x\) is the vector of the fitted parameters.
Parameters: A (numpy.ndarray) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: vector of predicted values; float if single row provided as input
Return type: numpy.ndarray or float

rmse_train
average root mean squared training error obtained during cross-validation
Type: float

rmse_train_final
root mean squared error when using the full set of input data
Type: float

rmse_train_splits
root mean squared training errors obtained during cross-validation
Type: list(float)

rmse_validation
average root mean squared cross-validation error
Type: float

rmse_validation_splits
root mean squared validation errors obtained during cross-validation
Type: list(float)

seed
seed used to initialize pseudo random number generator
Type: int

standardize
whether or not to standardize the fit matrix before fitting
Type: bool

summary
comprehensive information about the optimizer
Type: dict

train
()[source] Constructs the final model using all input data available.

validate
()[source] Runs validation.

validation_method
validation method name
Type: str
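The k-fold scheme used by this estimator can be summarized by the following plain-numpy sketch (kfold_rmse is a hypothetical helper restricted to least-squares, not part of hiphive):

```python
import numpy as np

def kfold_rmse(A, y, n_splits=10, seed=42):
    """Sketch of k-fold CV: each row serves for validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(A.shape[0]), n_splits)
    rmse_validation_splits = []
    for k in range(n_splits):
        validation = folds[k]
        train = np.concatenate([folds[i] for i in range(n_splits) if i != k])
        x, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        rmse_validation_splits.append(
            np.sqrt(np.mean((A[validation] @ x - y[validation]) ** 2)))
    # The CV score is the average validation error over all splits.
    return float(np.mean(rmse_validation_splits))

# Noise-free toy data: the CV score is (numerically) zero.
rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))
y = A @ np.array([0.5, 1.5, -1.0, 2.0])
score = kfold_rmse(A, y)
```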