pitci.xgboost.XGBoosterSplitLeafNodeScaledConformalPredictor

class pitci.xgboost.XGBoosterSplitLeafNodeScaledConformalPredictor(model, n_bins=3)[source]

Bases: pitci.base.SplitConformalPredictorMixin, pitci.xgboost.XGBoosterLeafNodeScaledConformalPredictor

Conformal interval predictor for an underlying xgb.Booster model using absolute error scaled by leaf node counts as the nonconformity measure.

Class implements inductive conformal intervals where a calibration dataset is used to learn the information that is used when generating intervals for new instances.

The predictor outputs intervals whose width varies for every new instance. This is done by multiplying the baseline_interval by a scaling factor that depends on the input data. The scaling function uses the reciprocal of the number of times that the leaf nodes used in making each prediction were visited on the calibration dataset, or when the underlying model was trained; see the train_data argument of the calibrate() method for more information.

The intuition behind this is that the model is more ‘familiar’ with rows that have higher leaf node counts from the calibration set, hence the interval for these rows should be narrower. The inverse is true for rows that have lower leaf node counts from the calibration set.
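The scaling described above can be sketched in plain Python. This is an illustration only, not the pitci implementation: the function name scaling_factors is hypothetical, and summing the counts across trees before taking the reciprocal is an assumption, as is the handling of rows whose leaves were never visited.

```python
def scaling_factors(leaf_indices, leaf_node_counts):
    """Return one scaling factor per row.

    leaf_indices: one row per record, one column per tree, giving the
        leaf index each record lands in.
    leaf_node_counts: list with one dict per tree, mapping leaf index
        to the number of visits seen during calibration.
    """
    factors = []
    for row in leaf_indices:
        # Assumption: counts are summed over all trees for the row.
        total = sum(
            leaf_node_counts[tree].get(leaf, 0)
            for tree, leaf in enumerate(row)
        )
        # Hypothetical guard: a row that only hits unseen leaves gets
        # an infinite factor rather than a division error.
        factors.append(1 / total if total > 0 else float("inf"))
    return factors


# Two trees; leaf 3 and leaf 5 were visited often, leaf 4 and 6 rarely.
counts = [{3: 3, 4: 1}, {5: 3, 6: 1}]
factors = scaling_factors([[3, 5], [4, 6]], counts)
print(factors)
```

The first row lands in frequently visited leaves and gets the smaller factor, so its interval would be narrower than that of the second row.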

The currently supported xgboost objective functions, given the nonconformity measure that is based on absolute error, are defined in the SUPPORTED_OBJECTIVES attribute.

Intervals are split into bins, using the scaling factors, where each bin is calibrated at the required confidence level. This addresses the situation where the leaf node scaled conformal predictors are not well calibrated on subsets of the data, despite being calibrated at the required alpha confidence level overall.
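A minimal sketch of per-bin calibration follows, assuming the nonconformity score is the absolute error divided by the scaling factor and that bin edges are taken at evenly spaced quantiles of the scaling factors. The helper names (empirical_quantile, calibrate_bins) are hypothetical and the exact quantile rule pitci uses may differ.

```python
import bisect
import math


def empirical_quantile(values, q):
    """Smallest value v with at least a fraction q of values <= v."""
    ordered = sorted(values)
    k = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[k]


def calibrate_bins(abs_errors, factors, n_bins=3, alpha=0.95):
    """Return (cut_points, baseline) with one baseline value per bin.

    Assumes every bin ends up with at least one calibration row.
    """
    ordered = sorted(factors)
    n = len(ordered)
    # Interior bin edges at the 1/n_bins, 2/n_bins, ... quantiles.
    cut_points = [
        ordered[max((i * n) // n_bins - 1, 0)] for i in range(1, n_bins)
    ]
    scores_by_bin = [[] for _ in range(n_bins)]
    for err, factor in zip(abs_errors, factors):
        # Rows with factor <= first edge fall in bin 0, and so on.
        bin_index = bisect.bisect_left(cut_points, factor)
        scores_by_bin[bin_index].append(err / factor)
    # Each bin is calibrated separately at the same confidence level.
    baseline = [empirical_quantile(s, alpha) for s in scores_by_bin]
    return cut_points, baseline


factors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
abs_errors = [0.5, 0.4, 0.9, 1.0, 2.0, 1.2, 0.7, 2.4, 1.8]
cut_points, baseline = calibrate_bins(abs_errors, factors)
print(cut_points)
print(baseline)
```

Because each bin gets its own baseline, a subset of rows with unusually large scaled errors no longer has to share a single interval width with the rest of the data.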

Parameters
  • model (xgb.Booster) – Underlying xgb.Booster model to generate prediction intervals with.

  • n_bins (int) – Number of bins to split data into based on the scaling factors.

__version__

The version of the pitci package that generated the object.

Type

str

model

The underlying xgb.Booster model passed when initialising the object.

Type

xgb.Booster

leaf_node_counts

The number of times each leaf node in each tree was visited when making predictions on the calibration dataset. Each item in the list is a dict giving a mapping between leaf node index and counts for a given tree. The length of the list corresponds to the number of trees in model.

Type

list
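To illustrate the structure of leaf_node_counts, the sketch below builds a list of per-tree dicts from a table of leaf indices with one row per record and one column per tree, which is the layout xgb.Booster.predict returns with pred_leaf=True for a multi-tree model. The function name count_leaf_node_visits is hypothetical; this is not the pitci code itself.

```python
from collections import Counter


def count_leaf_node_visits(leaf_indices):
    """Count, for each tree, how often each leaf index was visited."""
    n_trees = len(leaf_indices[0])
    return [
        dict(Counter(row[tree] for row in leaf_indices))
        for tree in range(n_trees)
    ]


# Four calibration rows scored by a model with two trees.
leaf_indices = [
    [3, 5],
    [3, 6],
    [4, 5],
    [3, 5],
]
leaf_node_counts = count_leaf_node_visits(leaf_indices)
print(leaf_node_counts)  # [{3: 3, 4: 1}, {5: 3, 6: 1}]
```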

alpha

The confidence level of the conformal intervals that will be produced. Attribute is set when the calibrate() method is run.

Type

int or float

SUPPORTED_OBJECTIVES

Objectives supported for the underlying xgb.Booster. If an xgb.Booster with an unsupported objective is passed when initialising the class object, an error will be raised.

Type

list

n_bins

Number of bins to split data into based on the scaling factors.

Type

int

bin_quantiles

Quantiles of the scaling factor values that will be used to define the limits of the bins. Attribute is set when the calibrate() method is run.

Type

float

baseline_interval

Baseline intervals calibrated for each of the n_bins subsets of the data. Set by the _calibrate_interval method.

Type

list

scaling_factor_cut_points

The edges of the scaling factor bins that define the data subsets that each of the values in baseline_interval are calibrated on. Set by the _calibrate_interval method.

Type

np.ndarray

__init__(model, n_bins=3)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(model[, n_bins])

Initialize self.

calibrate(data[, response, alpha, train_data])

Calibrate conformal intervals to a given sample of data at a given confidence level, alpha, between 0 and 1.

predict_with_interval(data)

Generate predictions with conformal intervals using the underlying model.

calibrate(data, response=None, alpha=0.95, train_data=None)[source]

Calibrate conformal intervals to a given sample of data at a given confidence level, alpha, between 0 and 1.

This method must be run before predict_with_interval() can be used to generate predictions.

There are 2 items to be calibrated: the leaf node counts stored in the leaf_node_counts attribute and the half interval widths stored in the baseline_interval attribute.

The user has the option to specify the training sample that was used to build the model in the train_data argument. This allows the leaf_node_counts to be calibrated on the same data the underlying model was built on, rather than on the separate calibration set passed in the data argument. The default interval width for a given alpha has to be set on a sample separate from the one used to build the model. Otherwise the errors will be smaller than they would be on a sample the underlying model has not seen before, and the intervals will be too narrow. For the leaf_node_counts, however, counts from the training sample are ideal: nothing new is being ‘learned’ here, the statistics from when the model was originally built are simply being recreated.

If response is not passed then the method will attempt to extract the response values from data using the get_label method.

The baseline_interval values are each calibrated to the required alpha level on the subset of the data where the scaling factor values fall into the range for that particular bin.

Parameters
  • data (xgb.DMatrix) – Dataset to use to set baselines.

  • response (np.ndarray, pd.Series or None, default = None) – The response values for the records in data.

  • alpha (int or float, default = 0.95) – Confidence level for the intervals.

  • train_data (xgb.DMatrix or None, default = None) – Optional dataset that can be passed to set baseline leaf_node_counts from, separate to the data argument used to set baseline_interval width.

predict_with_interval(data)[source]

Generate predictions with conformal intervals using the underlying model.

Parameters
  • data (xgb.DMatrix) – Dataset to generate predictions with intervals on.

Returns

predictions_with_interval – Array of predictions with intervals for each row in data. Output array will have 3 columns, where the first is the lower interval bound, the second is the prediction and the third is the upper interval bound.

Return type

np.ndarray
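The three-column output can be illustrated with a small sketch that combines predictions, scaling factors and the calibrated per-bin values. It is written under assumptions: the function name intervals_sketch is hypothetical, plain lists stand in for np.ndarray, and the rule used to assign a row to a bin (bisect on scaling_factor_cut_points) is a guess at the behaviour rather than the pitci implementation.

```python
import bisect


def intervals_sketch(predictions, factors, cut_points, baseline_interval):
    """Return rows of [lower, prediction, upper]."""
    rows = []
    for pred, factor in zip(predictions, factors):
        # Pick the bin whose scaling factor range contains this row.
        bin_index = bisect.bisect_left(cut_points, factor)
        # Half width = calibrated baseline for the bin, scaled per row.
        half_width = baseline_interval[bin_index] * factor
        rows.append([pred - half_width, pred, pred + half_width])
    return rows


result = intervals_sketch(
    predictions=[10.0, 20.0],
    factors=[0.25, 0.5],
    cut_points=[0.3],
    baseline_interval=[4.0, 8.0],
)
print(result)  # [[9.0, 10.0, 11.0], [16.0, 20.0, 24.0]]
```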