Feature selection

In this tutorial we explore the concept of feature selection while considering three different methods

These methods solve the linear problem

\[\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}.\]

where \(\boldsymbol{A}\) is the fit matrix, \(\boldsymbol{x}\) represents the parameters to be determined, and \(\boldsymbol{b}\) are the target values.

The present tutorial has the objective to showcase the different algorithms and is not indented as a comprehensive introduction. The reader is rather strongly encouraged to turn to the existing literature on feature selection in machine learning. Here, the links include in the list above provide useful starting points. It should also be noted that for instance scikit-learn, which underpins the algorithms employed here, provides further interesting feature selection techniques, such as orthogonal matching pursuit, which can be used with hiPhive with little additional effort.


The least absolute shrinkage and selection operator (LASSO) minimizes the objective function

\[\frac{1}{2 N} || \boldsymbol{A}\boldsymbol{x} - \boldsymbol{b} ||^2_2 + \alpha ||\boldsymbol{x}||_1,\]

where \(\alpha\) is the regularization parameter which determines the sparsity of the solution and \(N\) is the number of rows in \(\boldsymbol{A}\).

Lasso thus aims to achieve sparse solutions by directly minimizing the \(\ell_1\)-norm of the model parameter vector

\[||\boldsymbol{x}||_1 = n^{-1} \sum_i |x_i|,\]

where \(n\) is the number of parameters (or elements of \(\boldsymbol{x}\)).

When fitting with LASSO one sometimes obtains the following warning from scikit-learn:

“ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.”“

This can be solved, as suggested by the warning, by increasing max_iter. Hence in this tutorial max_iter=100000 is used, which makes this tutorial and specifially the LASSO part a bit slow to run. It should be noted that the result will not differ too much when reducing max_iter, however, the convergence warnings from scikit-learn should be taken seriously.


Automatic relevance determination regression (ARDR) is a Bayesian regression techniques and thus seeks to find a solution, which assigns priors to both the model parameters and the hyperparameters. It is also known as sparse Bayesian learning (SBL) or relevance vector machine (RVM).

In ARDR the \(\lambda\)-threshold parameter can be used for removing (pruning) weights with high precision from the computation.


recursive feature elimination (RFE) is as the name suggests a feature selection algorithm. While RFE can be used in combination with any linear regression solver, here we use ordinary least squares. RFE has a hyper parameter n_features, which is the number of features to be selected. First, a fit is constructed with all features; the least important or weakest features are removed and a new fit is constructed. This is repeated until n_features are left.

In our optimizer if n_features is not specified the optimal value of this hyper-parameter will be found by cross-validation analysis.


The feature selection process looks like


Plotting the validation root mean squared error against the number of features reveals that LASSO has its minimum at roughly twice as many parameters compared to the other two algorithms. This may lead to a less transferable model.

It is also noteworthy that for fully dense solutions, ARDR performs better than the other two algorithms.


The obtained force constant models for a few selection parameters are shown below.


Source code

Structure preparation
from ase.io import write
from ase.cluster import Icosahedron
from ase.optimize import BFGS
from ase.calculators.emt import EMT
from hiphive.structure_generation import generate_rattled_structures

# parameters
structures_fname = 'rattled_structures.extxyz'
number_of_structures = 10
particle_size = 3
a0 = 4.05
rattle_std = 0.02

# setup
atoms_ideal = Icosahedron('Al', particle_size, a0)
calc = EMT()

# center atoms and add pbc
atoms_ideal.pbc = True

# relax particle
dyn = BFGS(atoms_ideal)
converged = dyn.run(fmax=0.0001, steps=1000)

# generate structures
structures = generate_rattled_structures(
    atoms_ideal, number_of_structures, rattle_std)
for structure in structures:
    forces = structure.get_forces()

    displacements = structure.positions - atoms_ideal.get_positions()
    structure.new_array('displacements', displacements)
    structure.new_array('forces', forces)

    structure.positions = atoms_ideal.get_positions()
    structure.calc = None

# save structures
write(structures_fname, structures)
Setup of structure container
from ase.io import read
from hiphive import ClusterSpace, StructureContainer

structures = read('rattled_structures.extxyz@:')
cutoffs = [8]

cs = ClusterSpace(structures[0], cutoffs)
sc = StructureContainer(cs)
for structure in structures:

Feature selection
Feature selection

Runs in 10 minutes on an Intel Core i5-4670K CPU.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from hiphive import StructureContainer, ForceConstantPotential
from hiphive.fitting import CrossValidationEstimator

def run_cve(fit_kwargs):
    cve = CrossValidationEstimator(
        sc.get_fit_data(), number_of_splits=n_splits,
        train_size=train_size, test_size=test_size,
        validation_method='shuffle-split', **fit_kwargs)
    nzp = [np.count_nonzero(params) for params in cve.parameters_splits]
    row = fit_kwargs
    row['rmse'] = cve.rmse_validation
    row['rmse_std'] = np.std(cve.rmse_validation_splits)
    row['nzp'] = np.mean(nzp)
    row['nzp_std'] = np.std(nzp)
    row['parameters'] = cve.parameters_splits[0]
    return row

# parameters
n_splits = 5
train_size = 600
fit_methods = ['lasso', 'ardr', 'rfe-l2']

# load SC
sc = StructureContainer.read('structure_container')
cs = sc.cluster_space
n_rows, n_cols = sc.data_shape
test_size = n_rows - train_size

# feature selection parameters
alphas = np.logspace(-5, -3.5, 20)          # LASSO
lambdas = np.logspace(4.3, 6, 20)           # ARDR
n_feature_vals = np.arange(50, n_cols, 20)  # RFE

# indices for which fcs will be plotted
alpha_inds = [4, 15]
lambda_inds = [6, 15]
n_features_inds = [2, 10]

summary_list = []
for alpha in alphas:
    print('lasso', alpha, flush=True)
    kwargs = dict(fit_method='lasso', alpha=alpha, max_iter=100000)
df_lasso = pd.DataFrame(summary_list)

summary_list = []
for threshold_lambda in lambdas:
    print('ardr', threshold_lambda, flush=True)
    kwargs = dict(fit_method='ardr', threshold_lambda=threshold_lambda)
df_ardr = pd.DataFrame(summary_list)

summary_list = []
for n_features in n_feature_vals:
    print('RFE', n_features, flush=True)
    kwargs = dict(fit_method='rfe-l2', n_features=n_features)
df_rfe = pd.DataFrame(summary_list)

# analyze the predicted force constant models for a selection parameters
lasso_orbits = dict()
for ind in alpha_inds:
    fcp = ForceConstantPotential(cs, df_lasso.parameters[ind])
    lasso_orbits[ind] = pd.DataFrame(fcp.orbit_data)

ardr_orbits = dict()
for ind in lambda_inds:
    fcp = ForceConstantPotential(cs, df_ardr.parameters[ind])
    ardr_orbits[ind] = pd.DataFrame(fcp.orbit_data)

rfe_orbits = dict()
for ind in n_features_inds:
    fcp = ForceConstantPotential(cs, df_rfe.parameters[ind])
    rfe_orbits[ind] = pd.DataFrame(fcp.orbit_data)

# plotting feature selection
figsize = (18, 4.8)
fs = 14
ms = 8
lw = 1.5
color1 = '#1f77b4'
color2 = '#ff7f0e'
alpha = 0.5

kwargs_rmse = dict(lw=lw, ms=ms, color=color1, label='RMSE Test')
kwargs_nzp = dict(lw=lw, ms=ms, color=color2, label='N')

fig = plt.figure(figsize=figsize)
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)
ax1_y2 = ax1.twinx()
ax2_y2 = ax2.twinx()
ax3_y2 = ax3.twinx()

ylim1 = [0.008, 0.02]
ylim2 = [0, 1.1*n_cols]

ax1.semilogx(df_lasso.alpha, df_lasso.rmse, '-o', **kwargs_rmse)
ax1.fill_between(df_lasso.alpha, df_lasso.rmse - df_lasso.rmse_std,
                 df_lasso.rmse + df_lasso.rmse_std, color=color1, alpha=alpha)
ax1_y2.semilogx(df_lasso.alpha, df_lasso.nzp, '-o', **kwargs_nzp)
ax1_y2.fill_between(df_lasso.alpha, df_lasso.nzp - df_lasso.nzp_std,
                    df_lasso.nzp + df_lasso.nzp_std, color=color2, alpha=alpha)

# ardr
ax2.semilogx(df_ardr.threshold_lambda, df_ardr.rmse, '-o', **kwargs_rmse)
ax2.fill_between(df_ardr.threshold_lambda, df_ardr.rmse - df_ardr.rmse_std,
                 df_ardr.rmse + df_ardr.rmse_std, color=color1, alpha=alpha)
ax2_y2.semilogx(df_ardr.threshold_lambda, df_ardr.nzp, '-o', **kwargs_nzp)
ax2_y2.fill_between(df_ardr.threshold_lambda, df_ardr.nzp - df_ardr.nzp_std,
                    df_ardr.nzp + df_ardr.nzp_std, color=color2, alpha=alpha)

# rfe
ax3.plot(df_rfe.n_features, df_rfe.rmse, '-o', **kwargs_rmse)
ax3.fill_between(df_rfe.n_features, df_rfe.rmse - df_rfe.rmse_std,
                 df_rfe.rmse + df_rfe.rmse_std, color=color1, alpha=alpha)
ax3_y2.plot(df_rfe.n_features, df_rfe.nzp, '-o', **kwargs_nzp)

ax1.set_xlim([np.min(alphas), np.max(alphas)])
ax2.set_xlim([np.min(lambdas), np.max(lambdas)])
ax3.set_xlim([np.min(n_feature_vals), np.max(n_feature_vals)])

ax1.set_xlabel('Regularization parameter $\\alpha$', fontsize=fs)
ax2.set_xlabel(r'$\lambda$ - Threshold', fontsize=fs)
ax3.set_xlabel('Number of features', fontsize=fs)

ax1.set_title('LASSO', fontsize=fs)
ax2.set_title('automatic relevance determination', fontsize=fs)
ax3.set_title('recursive feature eliminiation', fontsize=fs)

for ax in [ax1, ax2, ax3]:
    ax.set_ylabel('RMSE validation (eV/Å)', color=color1, fontsize=fs)
    ax.tick_params(axis='y', labelcolor=color1)

for ax in [ax1_y2, ax2_y2, ax3_y2]:
    ax.set_ylabel('Number of non zero parameters', color=color2, fontsize=fs)
    ax.tick_params(labelsize=fs, labelcolor=color2)


# rmse - nzp curves
ms = 8
lw = 2.0
fig = plt.figure()
plt.plot(df_lasso.nzp, df_lasso.rmse, '-o', ms=ms, lw=lw, label='Lasso')
plt.plot(df_ardr.nzp, df_ardr.rmse, '-o', ms=ms, lw=lw, label='ARDR')
plt.plot(df_rfe.nzp, df_rfe.rmse, '-o', ms=ms, lw=lw, label='RFE')

plt.xlabel('Number of non zero parameters', fontsize=fs)
plt.ylabel('RMSE validation (eV/Å)', fontsize=fs)
plt.legend(loc=1, fontsize=fs)

# plot resulting models
ms = 10
alpha = 0.8
ylim = [2e-3, 20]
fig = plt.figure(figsize=figsize)
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

for ind, df in lasso_orbits.items():
    ax1.semilogy(df.radius, df.force_constant_norm, 'o', ms=ms, alpha=alpha,
                 label='alpha = {:.1e}'.format(alphas[ind]))

for ind, df in ardr_orbits.items():
    ax2.semilogy(df.radius, df.force_constant_norm, 'o', ms=ms, alpha=alpha,
                 label='f = {:.1e}'.format(lambdas[ind]))

for ind, df in rfe_orbits.items():
    ax3.semilogy(df.radius, df.force_constant_norm, 'o', ms=ms, alpha=alpha,
                 label='n_features = {:d}'.format(n_feature_vals[ind]))

ax1.set_title('LASSO', fontsize=fs)
ax2.set_title('automatic relevance determination', fontsize=fs)
ax3.set_title('recursive feature eliminiation', fontsize=fs)

for ax in [ax1, ax2, ax3]:
    ax.set_xlabel('Pair radius (A)', fontsize=fs)
ax1.set_ylabel('Force constant norm (eV/A$^2$)', fontsize=fs)