# Optimizers

## Overview

The scikit-learn library provides functionality for training linear models and a large number of related tools. The present module provides simplified interfaces for various linear model regression methods. These methods are set up to work out of the box for typical problems in cluster expansion and force constant potential construction, which includes slight adjustments to the scikit-learn default values.

If you need more flexibility, extended functionality or the ability to fine-tune parameters that are not included in this interface, it is possible to use scikit-learn directly as demonstrated by an example in the advanced topics section.

The most commonly used fit methods in the present context are LASSO, automatic relevance determination regression (ARDR), recursive feature elimination with $$\ell_2$$-fitting (RFE-L2), and ordinary least-squares optimization (OLS). Their usage and performance are illustrated by the feature selection and learning curve examples in the advanced topics section. A short summary of the main algorithms follows below. More information about the available linear models can be found in the scikit-learn documentation.

### Least-squares

Ordinary least-squares (OLS) optimization provides a solution to the linear problem

$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{y},$

where $$\boldsymbol{A}$$ is the sensing matrix, $$\boldsymbol{y}$$ is the vector of target values, and $$\boldsymbol{x}$$ is the solution (parameter vector) that one seeks to obtain. The objective is given by

$\left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2_2$

The OLS optimizer is chosen by setting the fit_method keyword to least-squares.
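
As a minimal sketch of the objective above, the following uses NumPy directly rather than the optimizer interface of this module. The matrix, the target vector, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))            # sensing matrix: 20 target values, 3 parameters
x_true = np.array([1.0, -2.0, 0.5])     # illustrative parameter vector
y = A @ x_true                          # noise-free target values

# Minimize ||Ax - y||_2^2 with respect to x
x, *_ = np.linalg.lstsq(A, y, rcond=None)
residual = np.linalg.norm(A @ x - y) ** 2
```

Since the problem is overdetermined and free of noise, the least-squares solution coincides with `x_true` up to numerical precision.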

### LASSO

The least absolute shrinkage and selection operator (LASSO) is a method for performing variable selection and regularization in problems in statistics and machine learning. The optimization objective is given by

$\frac{1}{2 n_\text{samples}} \left\Vert\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\right\Vert^2_2 + \alpha \Vert\boldsymbol{x}\Vert_1.$

While the first term ensures that $$\boldsymbol{x}$$ is a solution to the linear problem at hand, the second term introduces regularization and guides the algorithm toward finding sparse solutions, in the spirit of compressive sensing. In general, LASSO is suited for solving strongly underdetermined problems.

The LASSO optimizer is chosen by setting the fit_method keyword to lasso. The $$\alpha$$ parameter is set via the alpha keyword. If no value is specified, a line scan is carried out automatically to determine the optimal value. The training data can be automatically centered by setting the keyword fit_intercept to True.
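
To make the role of the two terms in the objective concrete, here is a small NumPy sketch that minimizes it with iterative soft-thresholding (ISTA). This is an illustration only: it is not the solver used by scikit-learn (which relies on coordinate descent), and the helper names `lasso_objective`, `soft_threshold`, and `lasso_ista` are hypothetical.

```python
import numpy as np

def lasso_objective(A, y, x, alpha):
    """Evaluate 1/(2 n_samples) ||Ax - y||_2^2 + alpha ||x||_1."""
    n_samples = A.shape[0]
    return np.sum((A @ x - y) ** 2) / (2 * n_samples) + alpha * np.sum(np.abs(x))

def soft_threshold(v, t):
    """Shrink every element of v toward zero by t; small elements become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, alpha, n_iter=500):
    """Illustrative ISTA loop (an assumption, not the scikit-learn algorithm)."""
    n_samples, n_params = A.shape
    # Lipschitz constant of the gradient of the least-squares term
    lipschitz = np.linalg.norm(A, 2) ** 2 / n_samples
    x = np.zeros(n_params)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / n_samples
        x = soft_threshold(x - grad / lipschitz, alpha / lipschitz)
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 30))           # strongly underdetermined: 10 rows, 30 parameters
x_true = np.zeros(30)
x_true[[2, 7]] = [1.5, -2.0]            # only two non-zero parameters
y = A @ x_true
x_lasso = lasso_ista(A, y, alpha=0.1)
```

The soft-thresholding step is where the $$\ell_1$$ term acts: it sets small parameters exactly to zero, which is why larger values of $$\alpha$$ yield sparser solutions.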

### ARDR

Automatic relevance determination regression (ARDR) is an optimization algorithm provided by scikit-learn that is similar to Bayesian Ridge Regression, which provides a probabilistic model of the regression problem at hand. The method is also known as Sparse Bayesian Learning and Relevance Vector Machine.

The ARDR optimizer is chosen by setting the fit_method keyword to ardr. The threshold lambda parameter that is forwarded to scikit-learn is set via the threshold_lambda keyword (default: 1e6). The training data can be automatically centered by setting the keyword fit_intercept to True.

### RFE-L2

Recursive feature elimination (RFE) with $$\ell_2$$-fitting (RFE-L2) combines two steps: first, the important features are selected using RFE as implemented in scikit-learn; then, an ordinary least-squares fit is carried out using only the selected features.

The RFE-L2 optimizer is chosen by setting the fit_method keyword to rfe-l2. The n_features keyword allows one to specify the number of features to select. If this parameter is left unspecified, RFE with cross-validation is used to determine the optimal number of features. The number of parameters to eliminate in each iteration is set via the step keyword.
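
The two-step idea can be sketched in plain NumPy as follows. This is a simplified illustration under stated assumptions, not the scikit-learn implementation: features are ranked by the magnitude of their least-squares coefficients, and the helper `rfe_l2` is hypothetical. Its arguments `n_features` and `step` mirror the keywords described above.

```python
import numpy as np

def rfe_l2(A, y, n_features, step=1):
    """Hypothetical helper: eliminate features recursively, then fit OLS on the rest."""
    selected = np.arange(A.shape[1])
    while len(selected) > n_features:
        x, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
        # Drop up to `step` features with the smallest |coefficient|,
        # without going below the requested number of features
        n_drop = min(step, len(selected) - n_features)
        keep = np.argsort(np.abs(x))[n_drop:]
        selected = np.sort(selected[keep])
    # Final ordinary least-squares fit restricted to the selected features
    x_selected, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
    x_full = np.zeros(A.shape[1])
    x_full[selected] = x_selected
    return x_full, selected

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 6))
x_true = np.array([2.0, 0.0, -3.0, 0.0, 0.0, 1.0])
y = A @ x_true                          # noise-free target values
x_fit, selected = rfe_l2(A, y, n_features=3, step=2)
```

In this noise-free example the three irrelevant columns are removed and the remaining parameters are recovered by the final fit.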

### Other methods

Some other optimization methods are also available, including

## Optimizer

class hiphive.fitting.Optimizer(fit_data, fit_method='least-squares', standardize=True, train_size=0.75, test_size=None, train_set=None, test_set=None, check_condition=True, seed=42, **kwargs)

Optimizer for a single $$Ax = y$$ fit.

One has to specify either train_size/test_size or train_set/test_set. If either train_set or test_set (or both) is specified, the fractions will be ignored.

Warning

Repeatedly setting up an Optimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple Optimizer instances.
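
The effect described in the warning can be illustrated with a seeded split of row indices. The helper `split_rows` below is hypothetical and only mimics the idea of a seeded train/test split; it is not the code used by this class.

```python
import numpy as np

def split_rows(n_rows, train_size, seed):
    """Hypothetical helper: shuffle row indices with a seeded generator and split them."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_rows)
    n_train = int(round(train_size * n_rows))
    return np.sort(permutation[:n_train]), np.sort(permutation[n_train:])

# Same seed: the two splits are identical
train_a, test_a = split_rows(100, train_size=0.75, seed=42)
train_b, test_b = split_rows(100, train_size=0.75, seed=42)

# Different seed: the split will in general differ
train_c, test_c = split_rows(100, train_size=0.75, seed=7)
```

Two fits based on `train_a` and `train_b` therefore see exactly the same rows and do not provide independent information.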

Parameters:

- fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
- fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
- standardize (bool) – whether or not to standardize the fit matrix before fitting
- train_size (float or int) – if float, represents the fraction of fit_data (rows) to be used for training; if int, represents the absolute number of rows to be used for training
- test_size (float or int) – if float, represents the fraction of fit_data (rows) to be used for testing; if int, represents the absolute number of rows to be used for testing
- train_set (tuple or list(int)) – indices of rows of A/y to be used for training
- test_set (tuple or list(int)) – indices of rows of A/y to be used for testing
- check_condition (bool) – whether or not to carry out a check of the condition number; N.B.: this can be slightly more time consuming for larger matrices
- seed (int) – seed for pseudo random number generator

train_scatter_data

target and predicted value for each row in the training set

Type: ScatterData object (namedtuple)
test_scatter_data

target and predicted value for each row in the test set

Type: ScatterData object (namedtuple)
compute_rmse(A, y)

Computes the root mean square error using $$A$$, $$y$$, and the vector of fitted parameters $$x$$, i.e., $$||Ax-y||_2 / \sqrt{N}$$, where $$N$$ is the number of target values.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
- y (numpy.ndarray) – vector of target values

Returns: root mean squared error

Return type: float

contributions_test

average contribution to the predicted values for the test set from each parameter

Type: numpy.ndarray
contributions_train

average contribution to the predicted values for the train set from each parameter

Type: numpy.ndarray
fit_method

fit method

Type: str
get_contributions(A)

Computes the average contribution to the predicted values from each element of the parameter vector.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

Returns: average contribution for each row of A from each parameter

Return type: numpy.ndarray

n_nonzero_parameters

number of non-zero parameters

Type: int
n_parameters

number of parameters (=columns in A matrix)

Type: int
n_target_values

number of target values (=rows in A matrix)

Type: int
parameters

copy of parameter vector

Type: numpy.ndarray
predict(A)

Predicts data given an input matrix $$A$$, i.e., $$Ax$$, where $$x$$ is the vector of the fitted parameters.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

Returns: vector of predicted values; float if single row provided as input

Return type: numpy.ndarray or float

rmse_test

root mean squared error for test set

Type: float
rmse_train

root mean squared error for training set

Type: float
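
The root mean squared errors reported here follow the usual definition, $$||Ax-y||_2 / \sqrt{N}$$. A minimal NumPy sketch with made-up numbers is given below; the function name `rmse` is an illustrative assumption and not part of this module.

```python
import numpy as np

def rmse(A, y, x):
    """Root mean squared error of the prediction Ax with respect to y."""
    delta = A @ x - y
    return np.sqrt(np.mean(delta ** 2))

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
x = np.array([1.0, 2.0])                # parameter vector
y = np.array([1.0, 2.0, 5.0, 2.0])      # target values

error = rmse(A, y, x)
# Equivalent expression via the 2-norm of the residual
error_from_norm = np.linalg.norm(A @ x - y) / np.sqrt(len(y))
```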
seed

seed used to initialize pseudo random number generator

Type: int
standardize

whether or not to standardize the fit matrix before fitting

Type: bool
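
The following sketch illustrates why standardization can help. It assumes that standardizing means dividing each column of the fit matrix by its standard deviation and rescaling the parameters afterwards; the exact procedure used internally may differ, and all data are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
# Columns with very different magnitudes lead to a poorly conditioned fit matrix
A = rng.normal(size=(40, 3)) * np.array([1.0, 100.0, 0.01])
y = A @ np.array([0.5, -0.02, 30.0])

scale = A.std(axis=0)                   # ASSUMPTION: scale by the column standard deviation only
A_scaled = A / scale
x_scaled, *_ = np.linalg.lstsq(A_scaled, y, rcond=None)
x = x_scaled / scale                    # map parameters back to the original columns

# Reference: direct fit without scaling
x_direct, *_ = np.linalg.lstsq(A, y, rcond=None)

condition_raw = np.linalg.cond(A)
condition_scaled = np.linalg.cond(A_scaled)
```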
summary

Type: dict
test_fraction

fraction of rows included in test set

Type: float
test_set

indices of rows included in the test set

Type: list
test_size

number of rows included in test set

Type: int
train()

Carries out training.

train_fraction

fraction of rows included in training set

Type: float
train_set

indices of rows included in the training set

Type: list
train_size

number of rows included in training set

Type: int

## EnsembleOptimizer

class hiphive.fitting.EnsembleOptimizer(fit_data, fit_method='least-squares', standardize=True, ensemble_size=50, train_size=1.0, bootstrap=True, check_condition=True, seed=42, **kwargs)

Ensemble optimizer that carries out a series of single optimization runs using the Optimizer class and then provides access to various ensemble-averaged quantities such as errors and parameters.

Warning

Repeatedly setting up an EnsembleOptimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple EnsembleOptimizer instances.

Parameters:

- fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
- fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
- standardize (bool) – whether or not to standardize the fit matrix before fitting
- ensemble_size (int) – number of fits in the ensemble
- train_size (float or int) – if float, represents the fraction of fit_data (rows) to be used for training; if int, represents the absolute number of rows to be used for training
- bootstrap (bool) – if True, sampling will be carried out with replacement
- check_condition (bool) – whether or not to carry out a check of the condition number; N.B.: this can be slightly more time consuming for larger matrices
- seed (int) – seed for pseudo random number generator

bootstrap

True if sampling is carried out with replacement

Type: bool
compute_rmse(A, y)

Computes the root mean square error using $$A$$, $$y$$, and the vector of fitted parameters $$x$$, i.e., $$||Ax-y||_2 / \sqrt{N}$$, where $$N$$ is the number of target values.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
- y (numpy.ndarray) – vector of target values

Returns: root mean squared error

Return type: float

ensemble_size

number of train rounds

Type: int
error_matrix

matrix of fit errors where N is the number of target values and M is the number of fits (i.e., the size of the ensemble)

Type: numpy.ndarray
fit_method

fit method

Type: str
get_contributions(A)

Computes the average contribution to the predicted values from each element of the parameter vector.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

Returns: average contribution for each row of A from each parameter

Return type: numpy.ndarray

n_nonzero_parameters

number of non-zero parameters

Type: int
n_parameters

number of parameters (=columns in A matrix)

Type: int
n_target_values

number of target values (=rows in A matrix)

Type: int
parameter_vectors

all parameter vectors in the ensemble

Type: list(numpy.ndarray)
parameters

copy of parameter vector

Type: numpy.ndarray
parameters_std

standard deviation for each parameter

Type: numpy.ndarray
predict(A, return_std=False)

Predicts data given an input matrix A, i.e., Ax, where x is the vector of the fitted parameters.

By using all parameter vectors in the ensemble a standard deviation of the prediction can be obtained.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
- return_std (bool) – whether or not to return the standard deviation of the prediction

Returns: vector of predicted values, vector of standard deviations

Return type: tuple(numpy.ndarray, numpy.ndarray) or tuple(float, float)

rmse_test

ensemble average of root mean squared error over test sets

Type: float
rmse_test_ensemble

root mean squared test errors obtained for each fit in the ensemble

Type: list(float)
rmse_train

ensemble average of root mean squared error over train sets

Type: float
rmse_train_ensemble

root mean squared train errors obtained for each fit in the ensemble

Type: list(float)
seed

seed used to initialize pseudo random number generator

Type: int
standardize

whether or not to standardize the fit matrix before fitting

Type: bool
summary

Type: dict
train()

Carries out ensemble training and constructs the final model by averaging over all models in the ensemble.
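
The ensemble procedure can be sketched in NumPy as follows, assuming bootstrap sampling of rows and ordinary least-squares fits for each ensemble member. The variable names mirror the attributes documented here, but the code is an illustration rather than the implementation of this class.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(60, 4))
x_true = np.array([1.0, 0.0, -1.0, 2.0])
y = A @ x_true + 0.05 * rng.normal(size=60)   # target values with some noise

ensemble_size = 50
n_rows = A.shape[0]
parameter_vectors = []
for _ in range(ensemble_size):
    # Bootstrap: draw n_rows row indices with replacement
    rows = rng.choice(n_rows, size=n_rows, replace=True)
    x, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
    parameter_vectors.append(x)
parameter_vectors = np.array(parameter_vectors)   # shape (ensemble_size, n_parameters)

# Final model: average over all models in the ensemble
parameters = parameter_vectors.mean(axis=0)
parameters_std = parameter_vectors.std(axis=0)

# Predictions from every ensemble member give a mean and a spread per row
predictions = A @ parameter_vectors.T             # shape (n_rows, ensemble_size)
prediction_mean = predictions.mean(axis=1)
prediction_std = predictions.std(axis=1)
```

Because the model is linear, the mean of the individual predictions equals the prediction of the averaged parameter vector, while the spread across the ensemble provides an estimate of the uncertainty.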

train_fraction

fraction of input data used for training; this value can differ slightly from the value set during initialization due to rounding

Type: float
train_size

number of rows included in train sets; note that this will differ from the number of unique rows if bootstrapping is used

Type: int

## CrossValidationEstimator

class hiphive.fitting.CrossValidationEstimator(fit_data, fit_method='least-squares', standardize=True, validation_method='k-fold', n_splits=10, check_condition=True, seed=42, **kwargs)

Optimizer with cross validation.

This optimizer can compute cross-validation (CV) scores in different ways. It can also produce the finalized model (using the full input data), whose performance is estimated by the CV score.

Warning

Repeatedly setting up a CrossValidationEstimator and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, please specify a different seed when setting up multiple CrossValidationEstimator instances.

Parameters:

- fit_data (tuple(numpy.ndarray, numpy.ndarray)) – the first element of the tuple represents the fit matrix A (N, M array) while the second element represents the vector of target values y (N array); here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
- fit_method (str) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”, “rfe-l2”, “split-bregman”
- standardize (bool) – whether or not to standardize the fit matrix before fitting
- validation_method (str) – method to use for cross-validation; possible choices are “shuffle-split”, “k-fold”
- n_splits (int) – number of times the fit data set will be split for the cross-validation
- check_condition (bool) – whether or not to carry out a check of the condition number; N.B.: this can be slightly more time consuming for larger matrices
- seed (int) – seed for pseudo random number generator

train_scatter_data

contains target and predicted values from each individual training set in the cross-validation split; ScatterData is a namedtuple.

Type: ScatterData
validation_scatter_data

contains target and predicted values from each individual validation set in the cross-validation split; ScatterData is a namedtuple.

Type: ScatterData
compute_rmse(A, y)

Computes the root mean square error using $$A$$, $$y$$, and the vector of fitted parameters $$x$$, i.e., $$||Ax-y||_2 / \sqrt{N}$$, where $$N$$ is the number of target values.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
- y (numpy.ndarray) – vector of target values

Returns: root mean squared error

Return type: float

fit_method

fit method

Type: str
get_contributions(A)

Computes the average contribution to the predicted values from each element of the parameter vector.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

Returns: average contribution for each row of A from each parameter

Return type: numpy.ndarray

n_nonzero_parameters

number of non-zero parameters

Type: int
n_parameters

number of parameters (=columns in A matrix)

Type: int
n_splits

number of splits (folds) used for cross-validation

Type: int
n_target_values

number of target values (=rows in A matrix)

Type: int
parameters

copy of parameter vector

Type: numpy.ndarray
parameters_splits

all parameters obtained during cross-validation

Type: numpy.ndarray
predict(A)

Predicts data given an input matrix $$A$$, i.e., $$Ax$$, where $$x$$ is the vector of the fitted parameters.

Parameters:

- A (numpy.ndarray) – fit matrix (N, M array) where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters

Returns: vector of predicted values; float if single row provided as input

Return type: numpy.ndarray or float

rmse_train

average root mean squared training error obtained during cross-validation

Type: float
rmse_train_final

root mean squared error when using the full set of input data

Type: float
rmse_train_splits

root mean squared training errors obtained during cross-validation

Type: list(float)
rmse_validation

average root mean squared cross-validation error

Type: float
rmse_validation_splits

root mean squared validation errors obtained during cross-validation

Type: list(float)
seed

seed used to initialize pseudo random number generator

Type: int
standardize

whether or not to standardize the fit matrix before fitting

Type: bool
summary

Type: dict
train()

Constructs the final model using all input data available.

validate()

Runs validation.
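
As a rough sketch of what a k-fold validation run involves, the following NumPy code splits the rows into folds, fits ordinary least squares on all but one fold, and evaluates the error on the held-out fold. The helper `k_fold_rmse` is hypothetical and the data are made up; the returned arrays play the role of rmse_train_splits and rmse_validation_splits.

```python
import numpy as np

def k_fold_rmse(A, y, n_splits=10, seed=42):
    """Hypothetical helper: k-fold CV with OLS fits, returning the RMSE for each split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(A.shape[0]), n_splits)
    rmse_train_splits, rmse_validation_splits = [], []
    for k in range(n_splits):
        validation = folds[k]
        train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        x, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        rmse_train_splits.append(
            np.sqrt(np.mean((A[train] @ x - y[train]) ** 2)))
        rmse_validation_splits.append(
            np.sqrt(np.mean((A[validation] @ x - y[validation]) ** 2)))
    return np.array(rmse_train_splits), np.array(rmse_validation_splits)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
y = A @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)

rmse_train_splits, rmse_validation_splits = k_fold_rmse(A, y)
rmse_train = rmse_train_splits.mean()             # average training error
rmse_validation = rmse_validation_splits.mean()   # CV score
```

The averaged validation error serves as an estimate of how the final model, trained on all rows, performs on unseen data.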

validation_method

validation method name

Type: str