BoostedLinearModel with SimplePLS Algorithm Example¶
This example demonstrates the use of the BoostedLinearModel class. BoostedLinearModel is a subclass of BoostedModel that takes advantage of the fact that a sum of linear models is itself a linear model. It also provides additional functionality pertaining to linear models that can be used to help with variable selection.
To demonstrate, the SimplePLS modeling algorithm that is internal to the library is used for boosting. By default, SimplePLS fits a one-variable linear regression to a dataset, where the single feature used is the one with the highest correlation with the target. Refer to the documentation for additional arguments, which allow more than one variable to be selected or allow variables that are weakly correlated with the target (relative to the most correlated feature) to be filtered out. Ill-conditioning due to multicollinearity is not an issue with SimplePLS. Furthermore, looking at the order (i.e., the boosting iteration) in which features enter the model provides a simple way to select features.
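As a rough illustration of the idea only (this is a sketch, not the library's implementation of SimplePLS), the default one-variable fit amounts to picking the feature most correlated with the target and running a univariate least-squares regression on it:
import numpy as np

# Sketch of the SimplePLS default behavior, not the library's implementation:
# select the feature most correlated with the target and fit a one-variable
# least-squares regression on it.
def one_variable_fit_sketch(X, y):
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corrs = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
    )
    j = int(np.argmax(np.abs(corrs)))               # most correlated feature
    beta = (Xc[:, j] @ yc) / (Xc[:, j] @ Xc[:, j])  # univariate OLS slope
    intercept = y.mean() - beta * X[:, j].mean()
    return j, intercept, beta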
Logistic regression is performed in this example using the same dataset as in the Binary Target example. Here, though, shuffling is turned off so that the informative features are placed in the first columns of the returned dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import scale
from genestboost import BoostedLinearModel
from genestboost.weak_learners import SimplePLS
from genestboost.link_functions import LogitLink
from genestboost.loss_functions import LogLoss
%matplotlib inline
Create a Dummy Classification Dataset¶
X, y = make_classification(n_samples=20000,
                           n_features=50,
                           n_informative=20,
                           weights=(0.85, 0.15),
                           random_state=11,
                           shuffle=False)
X = scale(X)
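A quick sanity check (not part of the original notebook) confirms the shape and the class imbalance requested via weights=(0.85, 0.15):
# Roughly 15% of the targets should be positive, per weights=(0.85, 0.15)
print("X shape:", X.shape)
print("Positive rate: {:.3f}".format(y.mean()))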
Fit the Model¶
model = BoostedLinearModel(
    link=LogitLink(),
    loss=LogLoss(),
    model_callback=SimplePLS,
    model_callback_kwargs={},
    alpha=5.0,
    step_type="decaying",
    weights="newton",
    validation_fraction=0.30,
    validation_iter_stop=20,
    validation_stratify=True)
model.fit(X, y, iterations=2000);
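Because a holdout set is split off (validation_fraction=0.30) and boosting stops once the holdout loss fails to improve over validation_iter_stop iterations, the model may stop well short of the 2000-iteration cap. Assuming get_loss_history() returns one entry per completed iteration (it is plotted that way below), the number of iterations actually run can be checked directly:
# Number of boosting iterations actually performed; early stopping on the
# holdout loss can end training before the iterations=2000 cap is reached.
print("Iterations run:", len(model.get_loss_history()))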
Plot the Loss History¶
fig = plt.figure(figsize=(6.5, 3.5), dpi=200)
ax = fig.add_subplot(111)
ax.plot(model.get_loss_history(), label=["Training", "Holdout"])
ax.legend(loc="best");
Plot Coefficient History¶
The coefficients are scaled by the standard deviation of the corresponding features in the dataset to obtain standardized coefficients.
fig = plt.figure(figsize=(6.5, 3.5), dpi=200)
ax = fig.add_subplot(111)
ax.plot(model.get_coefficient_history(scale=X.std(ddof=1, axis=0)),
        label=[f"Var {i:d}" for i in range(X.shape[1])])
ax.legend(loc="upper left", bbox_to_anchor=(1, 1), ncol=2, fontsize=6)
ax.set_xlabel("Boosting Iteration")
ax.set_ylabel("Standardized Coefficient");
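The last row of the coefficient history corresponds to the final fitted model (assuming, as in the plot above, one row per boosting iteration). Taking it gives the final standardized coefficients, which can be ranked by magnitude:
# Final standardized coefficients: last row of the coefficient history
# (assumes one row per boosting iteration, matching the plot above).
final_coefs = model.get_coefficient_history(scale=X.std(ddof=1, axis=0))[-1]
top5 = np.argsort(-np.abs(final_coefs))[:5]
print("Top 5 features by |standardized coefficient|:", list(top5))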
Order that Variables Entered the Model¶
print("Number of Selected Variables in the Model: {:d}".format(len(model.get_coefficient_order())))
model.get_coefficient_order()
Number of Selected Variables in the Model: 19
[8, 18, 3, 14, 5, 0, 1, 6, 19, 17, 10, 11, 16, 4, 2, 13, 9, 7, 12]
# Order by index number - 19 of the first 20 variables are selected (informative features)
sorted(model.get_coefficient_order())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19]
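One simple way to use this ordering for variable selection (a sketch, not a prescribed workflow) is to keep only the first k features that entered the model and build a reduced design matrix for a follow-up fit:
# Keep the first k features to enter the model, in boosting order, and
# build a reduced feature matrix for a subsequent fit.
k = 10
selected = model.get_coefficient_order()[:k]
X_reduced = X[:, selected]
print("Selected columns:", selected)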