genestboost package

Subpackages

genestboost top-level module.

class genestboost.BoostedLinearModel(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]

Bases: genestboost.boosted_model.BoostedModel

BoostedLinearModel class implementation.

__init__(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]

Class initializer.

Parameters
  • link (BaseLink) – Link function to use in boosting iterations.

  • loss (BaseLoss) – Loss function to use in boosting iterations.

  • model_callback (Callable) – A callable that returns a model object that implements fit and predict methods. The model object that is returned must be a linear model that has coef_ and intercept_ attributes.

  • model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.

  • weights (Union[str, WeightsCallback]: str or Callable) – A string specifying the type of weights (one of “none” or “newton”), or a callable of the form ‘lambda yt, yp: weights’, where yt and yp are the actual target and predicted target arrays. These weights are multiplied element-wise by the model gradients to produce the pseudo-residuals that are to be predicted at each model iteration.

  • alpha (float (default=1.0)) – A parameter representing the initial trial learning rate. The learning rate that actually gets used at each iteration depends on the value of step_type.

  • step_type (str (default="decaying")) – One of “decaying”, “constant”, or “best”. For “decaying”, the initial model iteration will start with alpha as the learning parameter. If alpha times step_decay_factor results in a greater reduction in loss than alpha, then use the former. This is repeated until performance does not improve, and the final chosen rate will serve as the ‘initial’ rate for the next boosting iteration. For “constant”, alpha will be used at every boosting iteration. Using step_type “best” implements the same process as “decaying”, except that the next boosting iteration resets the learning rate back to alpha as the initial trial learning rate.

  • step_decay_factor (float (default=0.48)) – The decaying factor to use with step_type “decaying” or “best”.

  • init_type (str (default="mean")) – The type of initial prediction to use. If “mean”, then the initial link prediction (prior to boosting iterations) will be taken as the link of the mean of the non-transformed target. If “residuals” or “zero”, the initial link prediction will be set to 0.0.

  • random_state (int, optional (default=None)) – Random seed to be used for reproducibility when random methods are used internally.

  • validation_fraction (float (default=0.0)) – If 0.0, then no validation set will be used and training will be performed using the full training set. Otherwise, the fraction of observations to use as a holdout set for early stopping.

  • validation_stratify (bool (default=False)) – If True, then stratify the validation holdout set based on the target. Useful when the target is binary.

  • validation_iter_stop (int (default=10)) – Number of iterations to use for early stopping on the validation set. If the holdout loss is greater at the current iteration than validation_iter_stop iterations prior, then stop model fitting.

  • tol (float, optional (default=None)) – Early stopping criterion based on training loss. If training loss fails to improve by at least tol, then stop training. If None, then the training loss criterion is not checked to determine early stopping.
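
The step_type rules above can be sketched in plain Python. This is an illustration of the described search, not the genestboost implementation: the helper names, the toy loss, and the max_shrinks guard are all hypothetical. For “constant”, no search is run at all and alpha is used directly.

```python
from typing import Callable


def choose_learning_rate(
    loss_at: Callable[[float], float],
    initial_rate: float,
    decay_factor: float = 0.48,
    max_shrinks: int = 50,  # hypothetical safeguard, not a genestboost parameter
) -> float:
    """Shrink the trial rate for as long as the shrunken rate lowers the loss."""
    rate = initial_rate
    best_loss = loss_at(rate)
    for _ in range(max_shrinks):
        trial = rate * decay_factor
        trial_loss = loss_at(trial)
        if trial_loss >= best_loss:  # no improvement: keep the current rate
            break
        rate, best_loss = trial, trial_loss
    return rate


def next_initial_rate(step_type: str, alpha: float, chosen_rate: float) -> float:
    """Trial rate from which the following boosting iteration starts."""
    if step_type == "decaying":
        return chosen_rate  # carry the shrunken rate forward
    if step_type in ("best", "constant"):
        return alpha  # start again from alpha
    raise ValueError(f"unknown step_type: {step_type!r}")


def toy_loss(step: float) -> float:
    """Toy loss along the update direction, minimised at a step of 0.2."""
    return (step - 0.2) ** 2


chosen = choose_learning_rate(toy_loss, initial_rate=1.0, decay_factor=0.5)
print(chosen)
```

With this toy loss the search tries 1.0, 0.5, 0.25, and 0.125 and settles on 0.25, because 0.125 is worse than 0.25. Under “decaying” the next iteration would start from 0.25; under “best” it would start from alpha again.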

boost(X, yt, yp, eta_p, model_callback, model_callback_kwargs, weights=None)[source]

Boost by one model iteration.

Creates a model using the model_callback callable with model_callback_kwargs, then fits this model to the pseudo-residuals. The learning rate is determined according to the chosen method, and current predictions are updated and returned such that stopping criteria can be evaluated and boosting continued. The fitted model resulting from the iteration is appended to the underlying model ensemble.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

  • eta_p (np.ndarray) – The current link predictions corresponding to the model iteration. This can be found by transforming yp, but passing this as an argument avoids duplicate computations and improves performance.

  • model_callback (Callable) – A callable that returns a model object that implements fit and predict methods.

  • model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.

  • weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).

Returns

yp_next, eta_p_next – A tuple of the updated target predictions and target prediction links.

Return type

tuple(np.ndarray, np.ndarray)

decision_function(X, model_index=None)[source]

Get the link of computed model predictions.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • model_index (int, optional (default=None)) – If None, then return the full model prediction as the sum of all models in the ensemble plus the initial model prediction. If an int, then return only the predictions from models up to model_index (i.e., [:model_index]).

Returns

The link of the computed model predictions.

Return type

np.ndarray
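
As a rough illustration of the model_index behaviour, the sketch below sums an initial link prediction with the learning-rate-scaled link contributions of the first model_index models. The function and argument names are hypothetical, and the assumption that each model carries its own learning rate is inferred from the apply_learning_rate option of decision_function_single rather than taken from the source code.

```python
from typing import List, Optional, Sequence


def ensemble_link(
    initial_link: Sequence[float],
    model_links: Sequence[Sequence[float]],
    learning_rates: Sequence[float],
    model_index: Optional[int] = None,
) -> List[float]:
    """Initial link plus the scaled links of models [:model_index]."""
    total = list(initial_link)
    pairs = list(zip(model_links, learning_rates))
    # Slicing with None keeps every model, mirroring the default behaviour.
    for links, rate in pairs[:model_index]:
        for i, value in enumerate(links):
            total[i] += rate * value
    return total


initial = [0.5, 0.5]
links = [[1.0, -1.0], [2.0, 0.0]]  # one row of link values per ensemble model
rates = [0.5, 0.25]

full = ensemble_link(initial, links, rates)
first_only = ensemble_link(initial, links, rates, model_index=1)
print(full, first_only)
```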

get_coefficient_history(scale=None)[source]

Get the history of coefficient values.

Returns a matrix of coefficients, where each row is representative of the model coefficients at a specific boosting iteration (i.e., row 1 is after the first round of boosting, etc.)

Parameters

scale (np.ndarray, optional (default=None)) – Vector to scale the coefficients in the history calculation. If None, then coefficients are not scaled (or alternatively, all coefficients are scaled by a factor of 1.0).

Returns

coefficient_history

Return type

np.ndarray [n_boosting_iterations + 1, n_coefs]
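
One plausible reading of the return shape, sketched below, is that the history is a running sum of each iteration's coefficients multiplied by that iteration's learning rate, with a leading all-zero row for the state before boosting (which would account for the extra row). This is an assumption made for illustration; the function and its inputs are hypothetical and do not reproduce the library code.

```python
from typing import List, Optional, Sequence


def coefficient_history(
    iteration_coefs: Sequence[Sequence[float]],
    learning_rates: Sequence[float],
    scale: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Cumulative, learning-rate-scaled coefficients, one row per iteration."""
    n_coefs = len(iteration_coefs[0])
    if scale is None:
        scale = [1.0] * n_coefs  # equivalent to no scaling
    rows = [[0.0] * n_coefs]  # assumed row 0: before any boosting
    for coefs, rate in zip(iteration_coefs, learning_rates):
        previous = rows[-1]
        rows.append([p + rate * c * s for p, c, s in zip(previous, coefs, scale)])
    return rows


coefs_per_iteration = [[2.0, 0.0], [0.0, 4.0], [1.0, 0.0]]
history = coefficient_history(coefs_per_iteration, [0.5, 0.5, 0.5])
print(history)
```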

get_coefficient_order(scale=None)[source]

Get the order that coefficients were selected for the model.

In the case that multiple coefficients were selected for the first time at the same model iteration, the “larger” coefficient will be considered to have been selected first. The scale argument can be used to standardize coefficients if models were fitted in a manner such that coefficients were not standardized.

Parameters

scale (np.ndarray, optional (default=None)) – Vector to scale the coefficients in the ordering calculation. If None, then coefficients are not scaled (or alternatively, all coefficients are scaled by a factor of 1.0).

Returns

coefficient_order – A list of zero-based indexes specifying the order that coefficients entered the boosted model.

Return type

List[int]
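
The ordering rule can be sketched from a coefficient history matrix. The sketch assumes that “selected” means the coefficient first becomes non-zero, that “larger” means larger absolute value, and that coefficients that never become non-zero are left out; all three are interpretations, and the function name is hypothetical.

```python
from typing import List, Sequence


def coefficient_order(history: Sequence[Sequence[float]]) -> List[int]:
    """Order coefficients by first non-zero iteration, ties by magnitude."""
    entries = []
    for j in range(len(history[0])):
        for iteration, row in enumerate(history):
            if row[j] != 0.0:
                # Negate the magnitude so larger coefficients sort first.
                entries.append((iteration, -abs(row[j]), j))
                break
    return [j for _, _, j in sorted(entries)]


history_rows = [
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],   # coefficient 1 enters
    [0.2, 0.5, -0.9],  # coefficients 0 and 2 enter together
    [0.2, 0.7, -0.9],
]
order = coefficient_order(history_rows)
print(order)
```

Coefficient 2 is listed ahead of coefficient 0 because both enter at the same iteration and 0.9 exceeds 0.2 in magnitude.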

get_prediction_var_history(X, groups=None)[source]

Get the history of prediction variance on the model matrix X.

Returns a matrix of prediction variances at each round of boosting for each specified group of model coefficients. If groups is None, then each coefficient is considered separately as its own group.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • groups (List[int], optional (default=None)) – A list of indices representing coefficient groups. Group indices should start at zero and run consecutively up to the number of groups minus one. If None, then each feature is its own group.

Returns

pred_var

Return type

np.ndarray [n_boosting_iterations, n_groups]

initialize_model(X, yt, weights=None)[source]

Initialize the boosted model.

This is called internally within the fit function. If manual boosting is being performed using the ‘boost’ method, then this method should be called before beginning the manual boosting procedure.

Parameters
  • X (np.ndarray [n_samples, n_features]) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • yt (np.ndarray) – Observed target values.

  • weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).

Returns

yp, eta_p – The initial target predictions and link of target prediction arrays.

Return type

tuple(np.ndarray, np.ndarray)
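
A minimal sketch of the “mean” and “zero” initializations, assuming a logit link for a binary target. The helper names are hypothetical, sample weights and the “residuals” option are left out, and the library may differ in detail.

```python
import math
from typing import Callable, List, Sequence, Tuple


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inverse_logit(eta: float) -> float:
    return 1.0 / (1.0 + math.exp(-eta))


def initialize(
    yt: Sequence[float],
    link: Callable[[float], float],
    inverse_link: Callable[[float], float],
    init_type: str = "mean",
) -> Tuple[List[float], List[float]]:
    """Return (yp, eta_p): initial predictions and their links."""
    if init_type == "mean":
        eta0 = link(sum(yt) / len(yt))  # link of the mean target
    elif init_type == "zero":
        eta0 = 0.0
    else:
        raise ValueError(f"init_type not covered by this sketch: {init_type!r}")
    eta_p = [eta0] * len(yt)
    yp = [inverse_link(e) for e in eta_p]
    return yp, eta_p


yp_mean, eta_mean = initialize([0.0, 0.0, 0.0, 1.0], logit, inverse_logit, "mean")
yp_zero, eta_zero = initialize([0.0, 0.0, 0.0, 1.0], logit, inverse_logit, "zero")
print(yp_mean[0], eta_mean[0], yp_zero[0])
```

With “mean”, every initial prediction is close to the target mean of 0.25; with “zero”, the link starts at 0.0, which corresponds to a prediction of 0.5 under the logit link.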

class genestboost.BoostedModel(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]

Bases: object

General boosting model class implementation for any regression model.

class InitialModel(link, init_type='mean')[source]

Bases: object

InitialModel class implementation - internal to BoostedModel.

__init__(link, init_type='mean')[source]

Class initializer.

Parameters
  • link (BaseLink) – Link function to use for initialization.

  • init_type (str (default="mean")) – One of “mean”, “residuals”, or “zero”.

fit(X, yt, weights=None)[source]

Fit the InitialModel object.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • yt (np.ndarray) – Observed target values.

  • weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).

Returns

Return type

self

predict(X)[source]

Compute InitialModel predictions.

Parameters

X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

Returns

predictions

Return type

np.ndarray

__init__(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]

Class initializer.

Parameters
  • link (BaseLink) – Link function to use in boosting iterations.

  • loss (BaseLoss) – Loss function to use in boosting iterations.

  • model_callback (Callable) – A callable that returns a model object that implements fit and predict methods.

  • model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.

  • weights (Union[str, WeightsCallback]: str or Callable) – A string specifying the type of weights (one of “none” or “newton”), or a callable of the form ‘lambda yt, yp: weights’, where yt and yp are the actual target and predicted target arrays. These weights are multiplied element-wise by the model gradients to produce the pseudo-residuals that are to be predicted at each model iteration.

  • alpha (float (default=1.0)) – A parameter representing the initial trial learning rate. The learning rate that actually gets used at each iteration depends on the value of step_type.

  • step_type (str (default="decaying")) – One of “decaying”, “constant”, or “best”. For “decaying”, the initial model iteration will start with alpha as the learning parameter. If alpha times step_decay_factor results in a greater reduction in loss than alpha, then use the former. This is repeated until performance does not improve, and the final chosen rate will serve as the ‘initial’ rate for the next boosting iteration. For “constant”, alpha will be used at every boosting iteration. Using step_type “best” implements the same process as “decaying”, except that the next boosting iteration resets the learning rate back to alpha as the initial trial learning rate.

  • step_decay_factor (float (default=0.48)) – The decaying factor to use with step_type “decaying” or “best”.

  • init_type (str (default="mean")) – The type of initial prediction to use. If “mean”, then the initial link prediction (prior to boosting iterations) will be taken as the link of the mean of the non-transformed target. If “residuals” or “zero”, the initial link prediction will be set to 0.0.

  • random_state (int, optional (default=None)) – Random seed to be used for reproducibility when random methods are used internally.

  • validation_fraction (float (default=0.0)) – If 0.0, then no validation set will be used and training will be performed using the full training set. Otherwise, the fraction of observations to use as a holdout set for early stopping.

  • validation_stratify (bool (default=False)) – If True, then stratify the validation holdout set based on the target. Useful when the target is binary.

  • validation_iter_stop (int (default=10)) – Number of iterations to use for early stopping on the validation set. If the holdout loss is greater at the current iteration than validation_iter_stop iterations prior, then stop model fitting.

  • tol (float, optional (default=None)) – Early stopping criterion based on training loss. If training loss fails to improve by at least tol, then stop training. If None, then the training loss criterion is not checked to determine early stopping.
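
The two early-stopping rules described for validation_iter_stop and tol can be sketched as a single check over loss histories. The function is hypothetical, and comparing consecutive training losses for the tol rule is an assumption about how “improve by at least tol” is measured.

```python
from typing import Optional, Sequence


def should_stop(
    train_loss: Sequence[float],
    holdout_loss: Optional[Sequence[float]],
    validation_iter_stop: int = 10,
    tol: Optional[float] = None,
) -> bool:
    """Return True when either stopping rule fires."""
    # Holdout rule: compare the current loss with the loss
    # validation_iter_stop iterations earlier.
    if holdout_loss is not None and len(holdout_loss) > validation_iter_stop:
        if holdout_loss[-1] > holdout_loss[-1 - validation_iter_stop]:
            return True
    # Training rule: stop when the latest improvement is smaller than tol.
    if tol is not None and len(train_loss) >= 2:
        if train_loss[-2] - train_loss[-1] < tol:
            return True
    return False


holdout = [1.0, 0.8, 0.7, 0.75, 0.9]
train = [1.0, 0.5, 0.4999]

stop_on_holdout = should_stop(train, holdout, validation_iter_stop=2)
stop_on_tol = should_stop(train, None, tol=1e-3)
print(stop_on_holdout, stop_on_tol)
```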

boost(X, yt, yp, eta_p, model_callback, model_callback_kwargs, weights=None)[source]

Boost by one model iteration.

Creates a model using the model_callback callable with model_callback_kwargs, then fits this model to the pseudo-residuals. The learning rate is determined according to the chosen method, and current predictions are updated and returned such that stopping criteria can be evaluated and boosting continued. The fitted model resulting from the iteration is appended to the underlying model ensemble.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

  • eta_p (np.ndarray) – The current link predictions corresponding to the model iteration. This can be found by transforming yp, but passing this as an argument avoids duplicate computations and improves performance.

  • model_callback (Callable) – A callable that returns a model object that implements fit and predict methods.

  • model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.

  • weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).

Returns

yp_next, eta_p_next – A tuple of the updated target predictions and target prediction links.

Return type

tuple(np.ndarray, np.ndarray)
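
The flow of a single boosting iteration can be sketched without the library. The example assumes a log link with squared-error loss, a fixed learning rate instead of the step_type search, and a hypothetical one-feature weak learner; none of these names come from genestboost.

```python
import math
from typing import List, Sequence, Tuple


class SimpleLinearModel:
    """Hypothetical weak learner: one-feature least squares with intercept."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]):
        xs = [row[0] for row in X]
        x_bar = sum(xs) / len(xs)
        y_bar = sum(y) / len(y)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(xs, y))
        self.coef_ = sxy / sxx
        self.intercept_ = y_bar - self.coef_ * x_bar
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [self.intercept_ + self.coef_ * row[0] for row in X]


def boost_once(X, yt, yp, eta_p, model_callback, model_callback_kwargs=None,
               learning_rate=0.1) -> Tuple[List[float], List[float]]:
    """One iteration: fit to pseudo-residuals, then update links and predictions."""
    # Descent direction in link space for L = 0.5 * (yt - yp)**2 with a log
    # link: -dL/d(eta) = (yt - yp) * d(yp)/d(eta), where d(yp)/d(eta) = yp.
    residuals = [(t - p) * p for t, p in zip(yt, yp)]
    model = model_callback(**(model_callback_kwargs or {})).fit(X, residuals)
    update = model.predict(X)
    eta_next = [e + learning_rate * u for e, u in zip(eta_p, update)]
    yp_next = [math.exp(e) for e in eta_next]  # inverse of the log link
    return yp_next, eta_next


def squared_loss(yt, yp) -> float:
    return 0.5 * sum((t - p) ** 2 for t, p in zip(yt, yp))


X = [[0.0], [1.0], [2.0]]
yt = [1.0, 2.0, 3.0]
yp0 = [2.0, 2.0, 2.0]             # initial prediction: mean of yt
eta0 = [math.log(p) for p in yp0]  # its link

yp1, eta1 = boost_once(X, yt, yp0, eta0, SimpleLinearModel)
print(squared_loss(yt, yp0), squared_loss(yt, yp1))
```

After one iteration the loss on this toy data drops below its starting value of 1.0, and the returned pair can be passed straight into the next call.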

compute_gradients(yt, yp)[source]

Compute element-wise gradients.

Parameters
  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

Returns

Return type

np.ndarray

Compute the element-wise link function or inverse link function.

Parameters
  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

Returns

Return type

np.ndarray

compute_loss(yt, yp)[source]

Compute element-wise loss.

Parameters
  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

Returns

Return type

np.ndarray

compute_newton_weights(yt, yp)[source]

Compute Newton weights.

This is a mixture of loss function derivatives and link function derivatives by application of the chain rule. Using Newton weights requires that the second derivatives of the loss and link function be defined. This method uses the computed second derivatives as-is: there are no adjustments to prevent the effects of ill-conditioning (very small second derivatives) or non-positive definiteness (negative second derivatives) on computed pseudo-residuals.

Parameters
  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

Returns

The element-wise reciprocal of the second-derivative of the loss function with respect to the link function.

Return type

np.ndarray
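
As a worked case, consider log loss with a logit link, where the second derivative of the loss with respect to the link is p(1 - p) for a predicted probability p, so the Newton weight is 1 / (p(1 - p)). The sketch below is standalone and hypothetical; the optional floor argument is a caller-side safeguard added for illustration and is not something the method described above applies.

```python
from typing import List, Optional, Sequence


def newton_weights_logloss_logit(
    yp: Sequence[float], floor: Optional[float] = None
) -> List[float]:
    """Reciprocal of d2L/d(eta)2 = p * (1 - p) for log loss with a logit link."""
    weights = []
    for p in yp:
        second_derivative = p * (1.0 - p)
        if floor is not None:
            # Hypothetical guard against very small second derivatives.
            second_derivative = max(second_derivative, floor)
        weights.append(1.0 / second_derivative)
    return weights


raw = newton_weights_logloss_logit([0.5, 0.999])
floored = newton_weights_logloss_logit([0.5, 0.999], floor=0.01)
print(raw, floored)
```

The weight at p = 0.999 is roughly 1000 times the inverse curvature scale at p = 0.5, which shows how predictions near 0 or 1 can dominate the pseudo-residuals when no adjustment is made.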

compute_p_residuals(yt, yp)[source]

Calculate pseudo-residuals.

The pseudo-residuals are taken as the observation gradients times weights that are computed as per the selected weighting scheme (“none”, “newton”, or a callable).

Parameters
  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.

Returns

Return type

np.ndarray

compute_weights(yt, yp)[source]

Compute model weights that will be multiplied by observation gradients.

The final result is the pseudo-residuals to be fit at the next boosting iteration. This essentially serves as a case/switch statement to redirect to the underlying calculation method.

Parameters
  • yt (np.ndarray) – Observed target values.

  • yp (np.ndarray) – Predicted target values.
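
The dispatch between weighting schemes and the resulting pseudo-residuals can be sketched as follows. The functions are standalone and hypothetical, the example uses log loss with a logit link, and the gradient is taken here as the descent direction yt - yp; the sign convention used by the library is an assumption.

```python
from typing import Callable, List, Sequence, Union

ArrayFn = Callable[[Sequence[float], Sequence[float]], Sequence[float]]


def select_weights(yt, yp, weights: Union[str, ArrayFn], newton_fn: ArrayFn) -> List[float]:
    """Case/switch over the weighting scheme."""
    if callable(weights):
        return list(weights(yt, yp))
    if weights == "none":
        return [1.0] * len(yt)
    if weights == "newton":
        return list(newton_fn(yt, yp))
    raise ValueError(f"unknown weights option: {weights!r}")


def pseudo_residuals(yt, yp, gradient_fn: ArrayFn, weights, newton_fn: ArrayFn) -> List[float]:
    """Element-wise product of gradients and weights."""
    gradients = gradient_fn(yt, yp)
    w = select_weights(yt, yp, weights, newton_fn)
    return [g * wi for g, wi in zip(gradients, w)]


def gradient(yt, yp):
    # Descent direction for log loss with a logit link (assumed sign).
    return [t - p for t, p in zip(yt, yp)]


def newton(yt, yp):
    return [1.0 / (p * (1.0 - p)) for p in yp]


yt = [1.0, 0.0]
yp = [0.5, 0.2]

plain = pseudo_residuals(yt, yp, gradient, "none", newton)
scaled = pseudo_residuals(yt, yp, gradient, "newton", newton)
custom = pseudo_residuals(yt, yp, gradient, lambda yt, yp: [2.0] * len(yt), newton)
print(plain, scaled, custom)
```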

decision_function(X, model_index=None)[source]

Get the link of computed model predictions.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • model_index (int, optional (default=None)) – If None, then return the full model prediction as the sum of all models in the ensemble plus the initial model prediction. If an int, then return only the predictions from models up to model_index (i.e., [:model_index]).

Returns

The link of the computed model predictions.

Return type

np.ndarray

decision_function_single(X, model_index=- 1, apply_learning_rate=True)[source]

Compute the link for a specific ensemble model by index.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • model_index (int (default=-1)) – The model iteration for which to compute the decision function. By default, it is -1. This corresponds to the model from the most recent boosting iteration.

  • apply_learning_rate (bool (default=True)) – If True, then the predictions from the selected model on X will be multiplied by the corresponding learning rate. Otherwise if False, the predictions of the selected model will be returned as if the learning rate was equal to 1.0.

Returns

The computed link values for the selected model index.

Return type

np.ndarray

fit(X, yt, iterations=100, weights=None, min_iterations=None)[source]

Fit the boosted model.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • yt (np.ndarray) – Observed target values.

  • iterations (int (default=100)) – The maximum number of boosting iterations to perform.

  • weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).

  • min_iterations (int, optional (default=None)) – The minimum number of boosting iterations to perform. If None (the default), then there is no minimum.

Returns

Return type

self

get_iterations()[source]

Get the current number of model boosting iterations.

get_loss_history()[source]

Get the loss history for the fitted model (training and validation loss).

Returns

A two-column array with training and holdout loss in the first and second columns, respectively.

Return type

np.ndarray

Compute a matrix of model links (without applying learning rates).

Returns a matrix of model links, where each column index corresponds to the same index in the model ensemble.

Parameters

X (np.ndarray [n_samples, n_features]) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

Returns

link_matrix

Return type

np.ndarray [n_samples, n_boosting_iterations]

initialize_model(X, yt, weights=None)[source]

Initialize the boosted model.

This is called internally within the fit function. If manual boosting is being performed using the ‘boost’ method, then this method should be called before beginning the manual boosting procedure.

Parameters
  • X (np.ndarray [n_samples, n_features]) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • yt (np.ndarray) – Observed target values.

  • weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).

Returns

yp, eta_p – The initial target predictions and link of target prediction arrays.

Return type

tuple(np.ndarray, np.ndarray)

predict(X, model_index=None)[source]

Compute model predictions in the original target space.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • model_index (int, optional (default=None)) – If None, then return the full model prediction as the sum of all models in the ensemble plus the initial model prediction. If an int, then return only the predictions from models up to model_index (i.e., [:model_index]).

Returns

predictions

Return type

np.ndarray

prediction_history(X, links=False)[source]

Compute a prediction history.

This computes a matrix of predictions in which each column holds the predictions of the ensemble built up to that column index.

Parameters
  • X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.

  • links (bool (default=False)) – If True, then return the links of the prediction history. Otherwise, return the non-transformed predictions.

reset_model()[source]

Reset the model fit status to False.

This will cause the model to reinitialize if the fit method is called after the reset_model method.

class genestboost.ModelDataSets(X, yt, weights=None, validation_fraction=0.0, validation_stratify=False, random_state=None)[source]

Bases: object

ModelDataSets class for abstracting data set implementation from BoostedModel.

__init__(X, yt, weights=None, validation_fraction=0.0, validation_stratify=False, random_state=None)[source]

Class initializer.

Parameters
  • X (numpy.ndarray, shape (n_samples, n_features)) – Feature matrix of type float.

  • yt (numpy.ndarray, shape (n_samples,)) – Target vector.

  • weights (numpy.ndarray (optional, default=None), shape (n_samples,)) – Sample weights to be used in the fitting process.

  • validation_fraction (float (optional, default=0.0)) – Fraction of dataset to use as validation set for early stopping.

  • validation_stratify (bool (default=False)) – If True, stratify the validation sample and the training sample using the model target. This only makes sense for classification problems.

  • random_state (int (optional, default=None)) – Set the random state of the instance so that the data set split can be reproduced.

has_validation_set()[source]

Return True if the validation fraction is greater than zero.