Optimizers

Optimizer

class hiphive.fitting.Optimizer(fit_data, fit_method='least-squares', train_size=0.75, test_size=None, train_set=None, test_set=None, seed=42, **kwargs)

Optimizer for single Ax = y fit.

One has to specify either train_size/test_size or train_set/test_set. If either train_set or test_set (or both) is specified, the fractions will be ignored.

Warning

Repeatedly setting up an Optimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, specify a different seed when setting up multiple Optimizer instances.

Parameters:
  • fit_data (tuple of NumPy (N, M) array and NumPy (N) array) – the first element of the tuple represents the fit matrix A whereas the second element represents the vector of target values y; here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
  • fit_method (string) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”
  • train_size (float or int) – If float, represents the fraction of fit_data (rows) to be used for training. If int, represents the absolute number of rows to be used for training.
  • test_size (float or int) – If float, represents the fraction of fit_data (rows) to be used for testing. If int, represents the absolute number of rows to be used for testing.
  • train_set (tuple/list of ints) – indices of rows of A/y to be used for training
  • test_set (tuple/list of ints) – indices of rows of A/y to be used for testing
  • seed (int) – seed for pseudo random number generator
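
The way train_size and seed interact can be illustrated with a minimal NumPy sketch. This is not hiphive's implementation: the helper name split_rows and the rounding rule are assumptions made for illustration only.

```python
import numpy as np

def split_rows(n_rows, train_size=0.75, seed=42):
    """Hypothetical helper: pick training rows at random, use the rest for testing."""
    # A float is read as a fraction of the rows, an int as an absolute count.
    if isinstance(train_size, float):
        n_train = int(round(train_size * n_rows))  # rounding rule is an assumption
    else:
        n_train = int(train_size)
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)
    train_set = np.sort(shuffled[:n_train])
    test_set = np.sort(shuffled[n_train:])
    return train_set, test_set

# The same seed reproduces the same split; a different seed draws a new one.
train_a, test_a = split_rows(20, train_size=0.75, seed=42)
train_b, test_b = split_rows(20, train_size=0.75, seed=42)
train_c, test_c = split_rows(20, train_size=0.75, seed=7)
```

Because train_a and train_b are identical, two optimizers built this way would be trained on exactly the same rows, which is the situation the warning above describes.
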
train_scatter_data

ScatterData object (namedtuple) – target and predicted value for each row in the training set

test_scatter_data

ScatterData object (namedtuple) – target and predicted value for each row in the test set

compute_rmse(A, y)

Compute the root mean squared error for A and y using the vector of fitted parameters x, i.e., ||Ax-y||_2 / sqrt(N).

Parameters:
  • A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
  • y (NumPy (N) array) – vector of target values
Returns:

root mean squared error

Return type:

float
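
As a sketch of the quantity involved, assuming the standard definition of the root mean squared error: the function name rmse and the explicit x argument below are illustrative and not part of the hiphive API, where x is held by the optimizer.

```python
import numpy as np

def rmse(A, y, x):
    """Root mean squared error of the predictions A @ x against the targets y."""
    residuals = A @ x - y
    return float(np.sqrt(np.mean(residuals ** 2)))

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0, 3.0, 6.0])  # the last target is off by 2

value = rmse(A, y, x)
# The same number follows from the 2-norm of the residual divided by sqrt(N).
value_from_norm = np.linalg.norm(A @ x - y) / np.sqrt(len(y))
```
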

fit_method

string – fit method.

get_contributions(A)

Compute the average contribution to the predicted values from each element of the parameter vector.

Parameters: A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: average contribution for each row of A from each parameter
Return type: NumPy (N, M) array

number_of_parameters

int – number of parameters (=columns in A matrix).

number_of_target_values

int – number of target values (=rows in A matrix).

parameters

NumPy array – copy of parameter vector.

predict(A)

Predict data given an input matrix A, i.e., Ax, where x is the vector of the fitted parameters.

Parameters: A (NumPy (N, M) array or NumPy (M,) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: vector of predicted values
Return type: NumPy (N) array, or float if a single row is provided
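
The two accepted input shapes can be sketched with plain NumPy. The parameter values are made up, and np.dot stands in for the prediction step as an assumption about how Ax is evaluated.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])           # fitted parameters (M = 3), made-up values
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])           # fit matrix with N = 2 rows

y_many = np.dot(A, x)     # (N, M) input gives an array with N predictions
y_one = np.dot(A[1], x)   # (M,) input gives a single number
```
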
rmse_test

float – root mean squared error for test set.

rmse_train

float – root mean squared error for training set.

seed

int – seed used to initialize pseudo random number generator.

summary

dict – Comprehensive information about the optimizer.

test_fraction

float – fraction of rows included in test set.

test_set

list – indices of the rows included in the test set.

test_size

int – number of rows included in test set.

train()

Carry out training.

train_fraction

float – fraction of rows included in training set.

train_set

list – indices of the rows included in the training set.

train_size

int – number of rows included in training set.

EnsembleOptimizer

class hiphive.fitting.EnsembleOptimizer(fit_data, fit_method='least-squares', ensemble_size=50, train_size=1.0, bootstrap=True, seed=42, **kwargs)

Ensemble optimizer that carries out a series of single optimization runs using the Optimizer class and then provides access to various ensemble-averaged quantities including, e.g., errors and parameters.

Warning

Repeatedly setting up an EnsembleOptimizer and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, specify a different seed when setting up multiple EnsembleOptimizer instances.

Parameters:
  • fit_data (tuple of (N, M) NumPy array and (N) NumPy array) – the first element of the tuple represents the fit matrix A whereas the second element represents the vector of target values y; here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
  • fit_method (string) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”
  • ensemble_size (int) – number of fits in the ensemble
  • train_size (float or int) – If float, represents the fraction of fit_data (rows) to be used for training. If int, represents the absolute number of rows to be used for training.
  • bootstrap (boolean) – if True sampling will be carried out with replacement
  • seed (int) – seed for pseudo random number generator
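
The effect of the bootstrap flag can be sketched as follows. This is a NumPy illustration of sampling with and without replacement, not hiphive's own sampling code; the variable names are hypothetical.

```python
import numpy as np

n_rows = 10
n_drawn = 10   # corresponds to train_size=1.0: as many rows are drawn as are available
rng = np.random.RandomState(42)

# bootstrap=True: rows are drawn with replacement, so duplicates can occur
with_replacement = rng.choice(n_rows, size=n_drawn, replace=True)

# bootstrap=False: rows are drawn without replacement, so every index is unique
without_replacement = rng.choice(n_rows, size=n_drawn, replace=False)

n_unique_bootstrap = len(np.unique(with_replacement))
n_unique_plain = len(np.unique(without_replacement))
```

With replacement, the number of unique rows is typically smaller than the number of rows drawn, which is why train_size and the number of unique training rows can differ.
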
bootstrap

boolean – True if sampling is carried out with replacement.

compute_rmse(A, y)

Compute the root mean squared error for A and y using the vector of fitted parameters x, i.e., ||Ax-y||_2 / sqrt(N).

Parameters:
  • A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
  • y (NumPy (N) array) – vector of target values
Returns:

root mean squared error

Return type:

float

ensemble_size

int – number of train rounds.

error_matrix

NumPy (N, M) array – matrix of fit errors where N is the number of target values and M is the number of fits (i.e., the size of the ensemble).

fit_method

string – fit method.

get_contributions(A)

Compute the average contribution to the predicted values from each element of the parameter vector.

Parameters: A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: average contribution for each row of A from each parameter
Return type: NumPy (N, M) array

number_of_parameters

int – number of parameters (=columns in A matrix).

number_of_target_values

int – number of target values (=rows in A matrix).

parameter_vectors

list – all parameter vectors in the ensemble.

parameters

NumPy array – copy of parameter vector.

parameters_std

NumPy array – standard deviation for each parameter.

predict(A, return_std=False)

Predict data given an input matrix A, i.e., Ax, where x is the vector of the fitted parameters.

By using all parameter vectors in the ensemble a standard deviation of the prediction can be obtained.

Parameters:
  • A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
  • return_std (bool) – whether or not to return the standard deviation of the prediction
Returns:

vector of predicted values, vector of standard deviations

Return type:

(NumPy (N) array, NumPy (N) array) or (float, float)
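
How a prediction and its standard deviation can follow from an ensemble is sketched below. The parameter vectors are made up, and the use of the population standard deviation (ddof=0) is an assumption, not a statement about hiphive's internals.

```python
import numpy as np

# Hypothetical ensemble of three parameter vectors (M = 2)
parameter_vectors = np.array([[1.0, 2.0],
                              [1.2, 1.8],
                              [0.8, 2.2]])
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])   # fit matrix with N = 2 rows

# One prediction vector per ensemble member, shape (ensemble_size, N)
all_predictions = parameter_vectors @ A.T

mean_prediction = all_predictions.mean(axis=0)
std_prediction = all_predictions.std(axis=0)
```

For a linear model, the mean of the individual predictions equals the prediction obtained from the averaged parameter vector.
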

rmse_test

float – ensemble average of root mean squared error over test sets.

rmse_test_ensemble

list – root mean squared test errors obtained for each fit in the ensemble.

rmse_train

float – ensemble average of root mean squared error over train sets.

rmse_train_ensemble

list – root mean squared training errors obtained for each fit in the ensemble.

seed

int – seed used to initialize pseudo random number generator.

summary

dict – Comprehensive information about the optimizer.

train()

Carry out ensemble training and construct the final model by averaging over all models in the ensemble.
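
The averaging step can be pictured with a short NumPy sketch. It assumes a plain arithmetic mean over made-up parameter vectors and a population standard deviation; both are illustrative assumptions rather than a description of the actual implementation.

```python
import numpy as np

# Made-up parameter vectors, one per fit in the ensemble
parameter_vectors = [np.array([1.0, 2.0]),
                     np.array([1.2, 1.8]),
                     np.array([0.8, 2.2])]

# Final model: element-wise average over the ensemble
parameters = np.mean(parameter_vectors, axis=0)

# Spread of each parameter across the ensemble
parameters_std = np.std(parameter_vectors, axis=0)
```
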

train_fraction

float – fraction of input data used for training; this value can differ slightly from the value set during initialization due to rounding.

train_size

int – number of rows included in the training sets. Note that this will be different from the number of unique rows when bootstrapping.

CrossValidationEstimator

class hiphive.fitting.CrossValidationEstimator(fit_data, fit_method='least-squares', validation_method='k-fold', number_of_splits=10, seed=42, **kwargs)

Optimizer with cross validation.

This optimizer first computes a cross-validation score and finally generates a model using the full set of input data.

Warning

Repeatedly setting up a CrossValidationEstimator and training without changing the seed for the random number generator will yield identical or correlated results. To avoid this, specify a different seed when setting up multiple CrossValidationEstimator instances.

Parameters:
  • fit_data (tuple of NumPy (N, M) array and NumPy (N) array) – the first element of the tuple represents the fit matrix A whereas the second element represents the vector of target values y; here N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
  • fit_method (string) – method to be used for training; possible choices are “least-squares”, “lasso”, “elasticnet”, “bayesian-ridge”, “ardr”
  • validation_method (string) – method to use for cross-validation; possible choices are “shuffle-split”, “k-fold”
  • number_of_splits (int) – number of times the fit data set will be split for the cross-validation
  • seed (int) – seed for pseudo random number generator
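
What a “k-fold” split with a given number_of_splits looks like can be sketched with NumPy. The helper k_fold_indices is hypothetical and only illustrates the idea; it is not the splitter used by hiphive.

```python
import numpy as np

def k_fold_indices(n_rows, number_of_splits, seed=42):
    """Hypothetical helper: yield (train, validation) index arrays for k-fold CV."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, number_of_splits)
    for k in range(number_of_splits):
        validation = folds[k]
        # all remaining folds together form the training set for this split
        train = np.concatenate([folds[j] for j in range(number_of_splits) if j != k])
        yield train, validation

splits = list(k_fold_indices(n_rows=10, number_of_splits=5))
```

Every row appears in exactly one validation set, so each target value is predicted once by a model that did not see it during training.
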
train_scatter_data

ScatterData object (namedtuple) – contains target and predicted values from each individual training set in the cross-validation split

validation_scatter_data

ScatterData object (namedtuple) – contains target and predicted values from each individual validation set in the cross-validation split

compute_rmse(A, y)

Compute the root mean squared error for A and y using the vector of fitted parameters x, i.e., ||Ax-y||_2 / sqrt(N).

Parameters:
  • A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters (=elements of x)
  • y (NumPy (N) array) – vector of target values
Returns:

root mean squared error

Return type:

float

fit_method

string – fit method.

get_contributions(A)

Compute the average contribution to the predicted values from each element of the parameter vector.

Parameters: A (NumPy (N, M) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: average contribution for each row of A from each parameter
Return type: NumPy (N, M) array

number_of_parameters

int – number of parameters (=columns in A matrix).

number_of_splits

int – number of splits (folds) used for cross-validation.

number_of_target_values

int – number of target values (=rows in A matrix).

parameters

NumPy array – copy of parameter vector.

predict(A)

Predict data given an input matrix A, i.e., Ax, where x is the vector of the fitted parameters.

Parameters: A (NumPy (N, M) array or NumPy (M,) array) – fit matrix where N (=rows of A, elements of y) equals the number of target values and M (=columns of A) equals the number of parameters
Returns: vector of predicted values
Return type: NumPy (N) array, or float if a single row is provided

rmse_train

float – average root mean squared training error obtained during cross-validation.

rmse_train_final

float – root mean squared error when using the full set of input data.

rmse_train_splits

list – root mean squared training errors obtained during cross-validation.

rmse_validation

float – average root mean squared cross-validation error.

rmse_validation_splits

list – root mean squared validation errors obtained during cross-validation.

seed

int – seed used to initialize pseudo random number generator.

summary

dict – Comprehensive information about the optimizer.

train()

Construct the final model using all input data available.

validate()

Run validation.
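
The two-stage workflow (validation on held-out folds, then a final fit on all data) can be sketched with ordinary least squares in NumPy. The synthetic data, the choice of four folds, and the use of np.linalg.lstsq are assumptions made for this illustration; they do not reproduce hiphive's code.

```python
import numpy as np

# Synthetic fit data: 40 target values, 3 parameters, small noise
rng = np.random.RandomState(0)
A = rng.normal(size=(40, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.normal(size=40)

# Validation step: fit on all folds but one, score on the held-out fold
folds = np.array_split(rng.permutation(len(y)), 4)
rmse_validation_splits = []
for k, validation in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != k])
    x_k, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    residuals = A[validation] @ x_k - y[validation]
    rmse_validation_splits.append(float(np.sqrt(np.mean(residuals ** 2))))

# Final step: fit once more using the full set of input data
x_final, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The per-split errors estimate how well the model generalizes, while the final parameters benefit from every available row.
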

validation_method

string – validation method name.