genestboost package¶
Subpackages¶
genestboost top-level module.
-
class
genestboost.
BoostedLinearModel
(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]¶ Bases:
genestboost.boosted_model.BoostedModel
BoostedLinearModel class implementation.
-
__init__
(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]¶ Class initializer.
- Parameters
link (BaseLink) – Link function to use in boosting iterations.
loss (BaseLoss) – Loss function to use in boosting iterations.
model_callback (Callable) – A callable that returns a model object that implements fit and predict methods. The model object that is returned must be a linear model that has coef_ and intercept_ attributes.
model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.
weights (Union[str, WeightsCallback]: str or Callable) – A string specifying the type of weights (one of “none” or “newton”), or a callable of the form ‘lambda yt, yp: weights’, where yt and yp are the actual target and predicted target arrays. These weights are multiplied element-wise by the model gradients to produce the pseudo-residuals that are to be predicted at each model iteration.
alpha (float (default=1.0)) – A parameter representing the initial trial learning rate. The learning rate that actually gets used at each iteration depends on the value of step_type.
step_type (str (default="decaying")) – One of “decaying”, “constant”, or “best”. For “decaying”, the initial model iteration will start with alpha as the learning rate. If alpha times step_decay_factor results in a greater reduction in loss than alpha, then the former is used. This is repeated until performance does not improve, and the final chosen rate will serve as the ‘initial’ rate for the next boosting iteration. For “constant”, alpha will be used at every boosting iteration. Using step_type “best” follows the same process as “decaying”, except that the next boosting iteration resets the initial trial learning rate back to alpha.
step_decay_factor (float (default=0.48)) – The decaying factor to use with step_type “decaying” or “best”.
init_type (str (default="mean")) – The type of initial prediction to use. If “mean”, then the initial link prediction (prior to boosting iterations) will be taken as the link of the mean of the non-transformed target. If “residuals” or “zero”, the initial link prediction will be set to 0.0.
random_state (int, optional (default=None)) – Random seed to be used for reproducibility when random methods are used internally.
validation_fraction (float (default=0.0)) – If 0.0, then no validation set will be used and training will be performed using the full training set. Otherwise, the fraction of observations to use as a holdout set for early stopping.
validation_stratify (bool (default=False)) – If True, then stratify the validation holdout set based on the target. Useful when the target is binary.
validation_iter_stop (int (default=10)) – Number of iterations to use for early stopping on the validation set. If the holdout loss is greater at the current iteration than validation_iter_stop iterations prior, then stop model fitting.
tol (float, optional (default=None)) – Early stopping criterion based on training loss. If training loss fails to improve by at least tol, then stop training. If None, then the training loss criterion is not checked to determine early stopping.
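The step_type description above can be illustrated with a small, library-independent sketch. It is not genestboost's implementation: choose_step, toy_loss, and max_trials are hypothetical names introduced here, and the loss curve is made up. It only shows the idea of shrinking a trial rate by a decay factor for as long as the loss keeps improving.

```python
def choose_step(loss_at, start_rate, decay, max_trials=50):
    """Shrink the trial rate while the smaller step gives a lower loss.

    loss_at is a hypothetical callable: learning rate -> loss after the step.
    """
    rate, best_loss = start_rate, loss_at(start_rate)
    for _ in range(max_trials):
        trial = rate * decay
        trial_loss = loss_at(trial)
        if trial_loss >= best_loss:
            break  # the smaller step did not improve the loss
        rate, best_loss = trial, trial_loss
    return rate


def toy_loss(rate):
    # Toy loss curve with its minimum at a step of 0.2.
    return (rate - 0.2) ** 2


alpha = 1.0
chosen = choose_step(toy_loss, alpha, decay=0.5)

# "decaying" would start the next iteration from the chosen rate,
# "best" would start again from alpha, "constant" skips the search.
next_start = {"decaying": chosen, "best": alpha, "constant": alpha}
```

In this toy run the search tries 1.0, 0.5, 0.25, and 0.125, and keeps 0.25 because 0.125 gives a higher loss.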
-
boost
(X, yt, yp, eta_p, model_callback, model_callback_kwargs, weights=None)[source]¶ Boost by one model iteration.
Creates a model using the model_callback callable with model_callback_kwargs, then fits this model to the pseudo-residuals. The learning rate is determined according to the chosen method, and current predictions are updated and returned such that stopping criteria can be evaluated and boosting continued. The fitted model resulting from the iteration is appended to the underlying model ensemble.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
eta_p (np.ndarray) – The current link predictions corresponding to the model iteration. This can be found by transforming yp, but passing this as an argument avoids duplicate computations and improves performance.
model_callback (Callable) – A callable that returns a model object that implements fit and predict methods.
model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.
weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).
- Returns
yp_next, eta_p_next – A tuple of the updated target predictions and target prediction links.
- Return type
tuple(np.ndarray, np.ndarray)
-
decision_function
(X, model_index=None)[source]¶ Get the link of computed model predictions.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
model_index (int, optional (default=None)) – If None, then return the full model prediction as the sum of all models in the ensemble plus the initial model prediction. If an int, then return only the predictions from models up to model_index (i.e., [:model_index]).
- Returns
The link of the computed model predictions.
- Return type
np.ndarray
-
get_coefficient_history
(scale=None)[source]¶ Get the history of coefficient values.
Returns a matrix of coefficients, where each row is representative of the model coefficients at a specific boosting iteration (i.e., row 1 is after the first round of boosting, etc.)
- Parameters
scale (np.ndarray, optional (default=None)) – Vector to scale the coefficients in the history calculation. If None, then coefficients are not scaled (or alternatively, all coefficients are scaled by a factor of 1.0).
- Returns
coefficient_history
- Return type
np.ndarray [n_boosting_iterations + 1, n_coefs]
-
get_coefficient_order
(scale=None)[source]¶ Get the order that coefficients were selected for the model.
In the case that multiple coefficients were selected for the first time at the same model iteration, the “larger” coefficient will be considered to have been selected first. The scale argument can be used to standardize coefficients if models were fitted in a manner such that coefficients were not standardized.
- Parameters
scale (np.ndarray, optional (default=None)) – Vector to scale the coefficients in the ordering calculation. If None, then coefficients are not scaled (or alternatively, all coefficients are scaled by a factor of 1.0).
- Returns
coefficient_order – A list of zero-based indexes specifying the order that coefficients entered the boosted model.
- Return type
List[int]
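As a rough illustration of the ordering rule described above, the following sketch derives an entry order from a coefficient history held as plain Python lists. It is an assumption-laden stand-in, not the library's code: coefficient_order here is a hypothetical free function, "larger" is taken to mean larger absolute scaled value, and coefficients that never become nonzero are simply left out.

```python
def coefficient_order(history, scale=None):
    """Order coefficients by the first iteration at which they are nonzero.

    history: list of rows, one per boosting iteration.
    Ties are broken by larger absolute (scaled) value, an assumption.
    """
    n_coefs = len(history[0])
    scale = scale or [1.0] * n_coefs
    entries = []
    for j in range(n_coefs):
        for i, row in enumerate(history):
            if row[j] != 0.0:
                # Sort key: entry iteration, then descending magnitude.
                entries.append((i, -abs(row[j] * scale[j]), j))
                break
    return [j for _, _, j in sorted(entries)]


history = [
    [0.0, 0.3, 0.0],
    [0.1, 0.5, -0.4],
    [0.2, 0.6, -0.7],
]
order = coefficient_order(history)
order_scaled = coefficient_order(history, scale=[10.0, 1.0, 1.0])
```

Coefficient 1 enters first. Coefficients 0 and 2 both enter at the second row, so the unscaled order places 2 ahead of 0, while scaling coefficient 0 by 10 reverses that tie.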
-
get_prediction_var_history
(X, groups=None)[source]¶ Get the history of prediction variance on the model matrix X.
Returns a matrix of prediction variances at each round of boosting for each specified group of model coefficients. If groups is None, then each coefficient is considered separately as its own group.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
groups (List[int], optional (default=None)) – A list of indices assigning each coefficient to a group. Group indices should start at zero and run consecutively up to the number of groups minus one. If None, then each feature is its own group.
- Returns
pred_var
- Return type
np.ndarray [n_boosting_iterations, n_groups]
-
initialize_model
(X, yt, weights=None)[source]¶ Initialize the boosted model.
This is called internally within the fit function. If manual boosting is being performed using the ‘boost’ method, then this method should be called before beginning the manual boosting procedure.
- Parameters
X (np.ndarray [n_samples, n_features]) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
yt (np.ndarray) – Observed target values.
weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).
- Returns
yp, eta_p – The initial target predictions and link of target prediction arrays.
- Return type
tuple(np.ndarray, np.ndarray)
-
-
class
genestboost.
BoostedModel
(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]¶ Bases:
object
General boosting model class implementation for any regression model.
-
class
InitialModel
(link, init_type='mean')[source]¶ Bases:
object
InitialModel class implementation - internal to BoostedModel.
-
__init__
(link, init_type='mean')[source]¶ Class initializer.
- Parameters
link (BaseLink) – Link function to use for initialization.
init_type (str (default="mean")) – One of “mean”, “residuals”, or “zero”.
-
fit
(X, yt, weights=None)[source]¶ Fit the InitialModel object.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
yt (np.ndarray) – Observed target values.
weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).
- Returns
- Return type
self
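The init_type options can be pictured with a short sketch. This is not the InitialModel code: initial_link is a hypothetical helper, the log link is chosen only for illustration, and the use of a weighted mean when weights are given is an assumption.

```python
import math


def initial_link(yt, link, init_type="mean", weights=None):
    """Return a constant starting link value for boosting (sketch)."""
    if init_type in ("residuals", "zero"):
        return 0.0
    if init_type != "mean":
        raise ValueError(f"unknown init_type: {init_type!r}")
    weights = weights or [1.0] * len(yt)
    # Assumption: sample weights enter as a weighted mean of the target.
    mean = sum(w * y for w, y in zip(weights, yt)) / sum(weights)
    return link(mean)


yt = [1.0, 2.0, 3.0]
start_mean = initial_link(yt, math.log)           # link of the mean
start_zero = initial_link(yt, math.log, "zero")   # always 0.0
```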
-
-
__init__
(link, loss, model_callback, model_callback_kwargs=None, weights='none', alpha=1.0, step_type='decaying', step_decay_factor=0.48, init_type='mean', random_state=None, validation_fraction=0.0, validation_stratify=False, validation_iter_stop=10, tol=None)[source]¶ Class initializer.
- Parameters
link (BaseLink) – Link function to use in boosting iterations.
loss (BaseLoss) – Loss function to use in boosting iterations.
model_callback (Callable) – A callable that returns a model object that implements fit and predict methods.
model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.
weights (Union[str, WeightsCallback]: str or Callable) – A string specifying the type of weights (one of “none” or “newton”), or a callable of the form ‘lambda yt, yp: weights’, where yt and yp are the actual target and predicted target arrays. These weights are multiplied element-wise by the model gradients to produce the pseudo-residuals that are to be predicted at each model iteration.
alpha (float (default=1.0)) – A parameter representing the initial trial learning rate. The learning rate that actually gets used at each iteration depends on the value of step_type.
step_type (str (default="decaying")) – One of “decaying”, “constant”, or “best”. For “decaying”, the initial model iteration will start with alpha as the learning rate. If alpha times step_decay_factor results in a greater reduction in loss than alpha, then the former is used. This is repeated until performance does not improve, and the final chosen rate will serve as the ‘initial’ rate for the next boosting iteration. For “constant”, alpha will be used at every boosting iteration. Using step_type “best” follows the same process as “decaying”, except that the next boosting iteration resets the initial trial learning rate back to alpha.
step_decay_factor (float (default=0.48)) – The decaying factor to use with step_type “decaying” or “best”.
init_type (str (default="mean")) – The type of initial prediction to use. If “mean”, then the initial link prediction (prior to boosting iterations) will be taken as the link of the mean of the non-transformed target. If “residuals” or “zero”, the initial link prediction will be set to 0.0.
random_state (int, optional (default=None)) – Random seed to be used for reproducibility when random methods are used internally.
validation_fraction (float (default=0.0)) – If 0.0, then no validation set will be used and training will be performed using the full training set. Otherwise, the fraction of observations to use as a holdout set for early stopping.
validation_stratify (bool (default=False)) – If True, then stratify the validation holdout set based on the target. Useful when the target is binary.
validation_iter_stop (int (default=10)) – Number of iterations to use for early stopping on the validation set. If the holdout loss is greater at the current iteration than validation_iter_stop iterations prior, then stop model fitting.
tol (float, optional (default=None)) – Early stopping criterion based on training loss. If training loss fails to improve by at least tol, then stop training. If None, then the training loss criterion is not checked to determine early stopping.
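The validation_iter_stop rule above compares the current holdout loss with the loss from a fixed number of iterations earlier. A minimal sketch of that comparison follows; should_stop and the loss values are made up for illustration and are not part of genestboost.

```python
def should_stop(holdout_losses, iter_stop):
    """True when the latest holdout loss exceeds the one iter_stop steps back."""
    if len(holdout_losses) <= iter_stop:
        return False  # not enough history yet
    return holdout_losses[-1] > holdout_losses[-1 - iter_stop]


# Hypothetical holdout losses, one per boosting iteration.
losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.72, 0.81]

# Number of completed iterations at which the rule first fires.
stop_at = next(
    (n for n in range(1, len(losses) + 1) if should_stop(losses[:n], iter_stop=3)),
    None,
)
```

With iter_stop=3 the rule first fires after six iterations, when 0.72 exceeds the loss of 0.70 recorded three iterations earlier.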
-
boost
(X, yt, yp, eta_p, model_callback, model_callback_kwargs, weights=None)[source]¶ Boost by one model iteration.
Creates a model using the model_callback callable with model_callback_kwargs, then fits this model to the pseudo-residuals. The learning rate is determined according to the chosen method, and current predictions are updated and returned such that stopping criteria can be evaluated and boosting continued. The fitted model resulting from the iteration is appended to the underlying model ensemble.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
eta_p (np.ndarray) – The current link predictions corresponding to the model iteration. This can be found by transforming yp, but passing this as an argument avoids duplicate computations and improves performance.
model_callback (Callable) – A callable that returns a model object that implements fit and predict methods.
model_callback_kwargs (dict, optional (default=None)) – A dictionary of keyword arguments to pass to model_callback.
weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).
- Returns
yp_next, eta_p_next – A tuple of the updated target predictions and target prediction links.
- Return type
tuple(np.ndarray, np.ndarray)
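To make the flow of a single boosting step concrete, here is a toy, library-independent sketch. It assumes squared-error loss and an identity link (so yp and eta_p coincide), a fixed learning rate, and a one-feature least-squares slope as the weak learner. fit_slope and boost_once are hypothetical helpers, not genestboost functions.

```python
def fit_slope(x, r):
    """Least-squares slope through the origin: a stand-in weak learner."""
    return sum(xi * ri for xi, ri in zip(x, r)) / sum(xi * xi for xi in x)


def boost_once(x, yt, eta_p, rate):
    """One boosting step: fit pseudo-residuals, then update the link."""
    # Negative gradient of 0.5 * (y - eta)^2 with respect to eta.
    residuals = [y - e for y, e in zip(yt, eta_p)]
    slope = fit_slope(x, residuals)
    eta_next = [e + rate * slope * xi for e, xi in zip(eta_p, x)]
    return eta_next, slope


x = [1.0, 2.0, 3.0]
yt = [2.0, 4.0, 6.0]
eta = [0.0, 0.0, 0.0]  # "zero" style initial link
losses = []
for _ in range(5):
    eta, _slope = boost_once(x, yt, eta, rate=0.5)
    losses.append(sum((y - e) ** 2 for y, e in zip(yt, eta)) / len(yt))
```

Each pass halves the remaining residuals, so the recorded losses fall monotonically and the link values approach the targets.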
-
compute_gradients
(yt, yp)[source]¶ Compute element-wise gradients.
- Parameters
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
- Returns
- Return type
np.ndarray
-
compute_link
(yp, inverse=False)[source]¶ Compute the element-wise link function or inverse link function.
- Parameters
yp (np.ndarray) – Predicted target values.
inverse (bool (default=False)) – If True, compute the inverse link function instead of the link function.
- Returns
- Return type
np.ndarray
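A link function and its inverse can be sketched as follows, using the logit link as an example. This is an illustration only: logit_link is a hypothetical function operating on plain lists, not the BaseLink interface.

```python
import math


def logit_link(values, inverse=False):
    """Element-wise logit link, or its inverse (the logistic function)."""
    if inverse:
        return [1.0 / (1.0 + math.exp(-eta)) for eta in values]
    return [math.log(p / (1.0 - p)) for p in values]


yp = [0.1, 0.5, 0.9]
eta = logit_link(yp)                        # target space -> link space
round_trip = logit_link(eta, inverse=True)  # link space -> target space
```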
-
compute_loss
(yt, yp)[source]¶ Compute element-wise loss.
- Parameters
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
- Returns
- Return type
np.ndarray
-
compute_newton_weights
(yt, yp)[source]¶ Compute newton weights.
This is a mixture of loss function derivatives and link function derivatives by application of chain rule. Using newton weights requires that the second derivatives of the loss and link function be defined. This method uses the computed second derivatives as-is - there are no adjustments to prevent the effects of ill-conditioning (very small second derivatives) or non-positive definiteness (negative second derivatives) on computed pseudo residuals.
- Parameters
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
- Returns
The element-wise reciprocal of the second-derivative of the loss function with respect to the link function.
- Return type
np.ndarray
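For a concrete case, consider log loss with a logit link, where the second derivative of the loss with respect to the link is p(1 - p). The sketch below is an illustration under that assumption, with hypothetical helper names and a gradient sign convention that may differ from the library's. It shows both the reciprocal weights and how a prediction near 1 produces a very large weight, which is the ill-conditioning mentioned above.

```python
def gradients(yt, yp):
    """d(log loss)/d(eta) under a logit link: p - y."""
    return [p - y for y, p in zip(yt, yp)]


def newton_weights(yp):
    """Reciprocal of the second derivative p * (1 - p)."""
    return [1.0 / (p * (1.0 - p)) for p in yp]


yt = [1.0, 0.0, 1.0]
yp = [0.5, 0.5, 0.999]

w = newton_weights(yp)
# Gradients multiplied element-wise by the weights.
scaled = [g * wi for g, wi in zip(gradients(yt, yp), w)]
```

The third observation has a second derivative of about 0.001, so its weight is roughly 1000 times larger than it would be without Newton weighting.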
-
compute_p_residuals
(yt, yp)[source]¶ Calculate pseudo-residuals.
The pseudo-residuals are taken as the observation gradients times weights that are computed as per the selected weighting scheme (“none”, “newton”, callable).
- Parameters
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
- Returns
- Return type
np.ndarray
-
compute_weights
(yt, yp)[source]¶ Compute model weights that will be multiplied by observation gradients.
The final result is the pseudo-residuals to be fit at the next boosting iteration. This essentially serves as a case/switch statement to redirect to the underlying calculation method.
- Parameters
yt (np.ndarray) – Observed target values.
yp (np.ndarray) – Predicted target values.
-
decision_function
(X, model_index=None)[source]¶ Get the link of computed model predictions.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
model_index (int, optional (default=None)) – If None, then return the full model prediction as the sum of all models in the ensemble plus the initial model prediction. If an int, then return only the predictions from models up to model_index (i.e., [:model_index]).
- Returns
The link of the computed model predictions.
- Return type
np.ndarray
-
decision_function_single
(X, model_index=-1, apply_learning_rate=True)[source]¶ Compute the link for a specific ensemble model by index.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
model_index (int (default=-1)) – The model iteration for which to compute the decision function. By default, it is -1. This corresponds to the model from the most recent boosting iteration.
apply_learning_rate (bool (default=True)) – If True, then the predictions from the selected model on X will be multiplied by the corresponding learning rate. Otherwise if False, the predictions of the selected model will be returned as if the learning rate was equal to 1.0.
- Returns
The computed link values for the selected model index.
- Return type
np.ndarray
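The relationship between the per-model links, their learning rates, and the model_index argument of decision_function can be sketched for a single observation. ensemble_link and the numbers below are hypothetical; the sketch only assumes that the full link is the initial link plus the learning-rate-scaled links of the models in [:model_index].

```python
def ensemble_link(init_link, model_links, rates, model_index=None):
    """Initial link plus rate-scaled model links up to model_index."""
    total = init_link
    # Slicing with None keeps every model, mirroring [:model_index].
    for link_value, rate in zip(model_links[:model_index], rates[:model_index]):
        total += rate * link_value
    return total


init_link = 0.5
model_links = [1.0, 2.0, 4.0]   # raw links, learning rate not applied
rates = [0.5, 0.25, 0.25]       # learning rate chosen at each iteration

full = ensemble_link(init_link, model_links, rates)
first_two = ensemble_link(init_link, model_links, rates, model_index=2)
init_only = ensemble_link(init_link, model_links, rates, model_index=0)
```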
-
fit
(X, yt, iterations=100, weights=None, min_iterations=None)[source]¶ Fit the boosted model.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
yt (np.ndarray) – Observed target values.
iterations (int (default=100)) – The maximum number of boosting iterations to perform.
weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).
min_iterations (int, optional (default=None)) – The minimum number of boosting iterations to perform. If None (the default), then there is no minimum.
- Returns
- Return type
self
-
get_loss_history
()[source]¶ Get the loss history for the fitted model (training and validation loss).
- Returns
A two-column array with training and holdout loss in each column, respectively.
- Return type
np.ndarray
-
get_model_links
(X)[source]¶ Compute a matrix of model links (without applying learning rates).
Returns a matrix of model links, where each column index corresponds to the same index in the model ensemble.
- Parameters
X (np.ndarray [n_samples, n_features]) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
- Returns
link_matrix
- Return type
np.ndarray [n_samples, n_boosting_iterations]
-
initialize_model
(X, yt, weights=None)[source]¶ Initialize the boosted model.
This is called internally within the fit function. If manual boosting is being performed using the ‘boost’ method, then this method should be called before beginning the manual boosting procedure.
- Parameters
X (np.ndarray [n_samples, n_features]) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
yt (np.ndarray) – Observed target values.
weights (np.ndarray, optional (default=None)) – Sample weights (by observation) to use for fitting. Should be positive. Observations with higher weights will affect the model fit more. If ‘None’, then all weights will be equal (1.0).
- Returns
yp, eta_p – The initial target predictions and link of target prediction arrays.
- Return type
tuple(np.ndarray, np.ndarray)
-
predict
(X, model_index=None)[source]¶ Compute model predictions in the original target space.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
model_index (int, optional (default=None)) – If None, then return the full model prediction as the sum of all models in the ensemble plus the initial model prediction. If an int, then return only the predictions from models up to model_index (i.e., [:model_index]).
- Returns
predictions
- Return type
np.ndarray
-
prediction_history
(X, links=False)[source]¶ Compute a prediction history.
This will compute a matrix of predictions with each column corresponding to the predictions up to the underlying ensemble at that column index.
- Parameters
X (np.ndarray) – The model matrix. If init_type was set as “residuals”, then the model scores for which to calculate residuals should form the last column of the input matrix.
links (bool (default=False)) – If true, then return the links of the prediction history. Otherwise, return the non-transformed predictions.
-
class
genestboost.
ModelDataSets
(X, yt, weights=None, validation_fraction=0.0, validation_stratify=False, random_state=None)[source]¶ Bases:
object
ModelDataSets class for abstracting data set implementation from BoostedModel.
-
__init__
(X, yt, weights=None, validation_fraction=0.0, validation_stratify=False, random_state=None)[source]¶ Class initializer.
- Parameters
X (numpy.ndarray, shape (n_samples, n_features)) – Feature matrix of type float.
yt (numpy.ndarray, shape (n_samples,)) – Target vector.
weights (numpy.ndarray (optional, default=None), shape (n_samples,)) – Sample weights to be used in the fitting process.
validation_fraction (float (optional, default=0.0)) – Fraction of dataset to use as validation set for early stopping.
validation_stratify (bool (default=False)) – If True, stratify the validation sample and the training sample using the model target. This only makes sense for classification problems.
random_state (int (optional, default=None)) – Set the random state of the instance so that the data set split can be reproduced.
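The role of validation_fraction and random_state can be shown with a minimal holdout split over row indices. This is a sketch using only the standard library, not the ModelDataSets implementation: split_indices is a hypothetical helper, the rounding of the holdout size is an assumption, and stratification is left out.

```python
import random


def split_indices(n_samples, validation_fraction=0.0, random_state=None):
    """Return (train_indices, validation_indices) for a seeded random split."""
    indices = list(range(n_samples))
    if validation_fraction == 0.0:
        return indices, []  # no holdout: train on everything
    rng = random.Random(random_state)
    rng.shuffle(indices)
    n_val = int(round(n_samples * validation_fraction))
    return sorted(indices[n_val:]), sorted(indices[:n_val])


train_idx, val_idx = split_indices(10, validation_fraction=0.3, random_state=0)
repeat = split_indices(10, validation_fraction=0.3, random_state=0)
```

Calling the function again with the same random_state reproduces the same split.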
-