pitci.xgboost.XGBoosterLeafNodeSplitConformalPredictor
-
class pitci.xgboost.XGBoosterLeafNodeSplitConformalPredictor(model, n_bins=3)
Bases: pitci.base.SplitConformalPredictor, pitci.xgboost.XGBoosterLeafNodeScaledConformalPredictor
Conformal interval predictor for an underlying xgb.Booster model using absolute error scaled by leaf node counts as the nonconformity measure. Intervals are also split into bins based on the scaling factors and calibrated separately for each bin.
The class implements inductive conformal intervals, where a calibration dataset is used to learn the information needed to generate intervals for new instances.
The predictor outputs intervals of varying width for each new instance. For each row, the scaling function uses the number of times that the leaf nodes visited in each tree when making the prediction for that row were also visited in the calibration dataset.
Intuitively, the model is more 'familiar' with rows that have higher leaf node counts from the calibration set, so the intervals for these rows are shrunk. The inverse is true for rows that have lower leaf node counts from the calibration set.
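The leaf node counting idea can be sketched in plain Python. This is an illustration only, not the pitci implementation: the helper names count_leaf_nodes and scaling_factors are hypothetical, and the reciprocal-of-total-count scaling is an assumption chosen so that higher counts give narrower intervals, as described above. The input is a table of leaf indices with one row per record and one column per tree, which is the shape xgb.Booster.predict returns with pred_leaf=True.

```python
from collections import Counter


def count_leaf_nodes(leaf_indices):
    """Count, for each tree, how often each leaf was reached.

    leaf_indices holds one list per row, giving the leaf index that
    row landed in for every tree.
    """
    n_trees = len(leaf_indices[0])
    return [Counter(row[t] for row in leaf_indices) for t in range(n_trees)]


def scaling_factors(leaf_indices, leaf_node_counts):
    """Assumed scaling: reciprocal of the summed counts of the visited leaves."""
    factors = []
    for row in leaf_indices:
        total = sum(
            counts.get(leaf, 0) for leaf, counts in zip(row, leaf_node_counts)
        )
        # Guard against rows that only hit leaves never seen in calibration.
        factors.append(1.0 / max(total, 1))
    return factors


# Toy calibration set: 4 rows, 2 trees.
calibration_leaves = [[0, 2], [0, 2], [0, 3], [1, 3]]
counts = count_leaf_nodes(calibration_leaves)

# Two new rows: the first lands in well-populated leaves, the second does not.
new_rows = [[0, 2], [1, 3]]
factors = scaling_factors(new_rows, counts)
```

The first new row visits leaves seen 5 times in total and the second visits leaves seen 3 times, so the first row receives the smaller scaling factor and therefore the narrower interval.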
Intervals are split into bins, using the scaling factors, and each bin is calibrated at the required confidence level. This addresses the situation where the leaf node scaled conformal predictors are not well calibrated on subsets of the data, despite being calibrated at the required alpha confidence level overall.
The currently supported xgboost objective functions, given that the nonconformity measure is based on absolute error, are defined in the SUPPORTED_OBJECTIVES attribute.
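A minimal sketch of the per-bin calibration described above, using only the standard library. The names quantile and calibrate_bins are hypothetical, the nonconformity score is taken to be absolute error divided by the scaling factor, and the plain empirical quantile used here is an assumption: the exact quantile convention pitci uses is not stated in this page.

```python
import math
from bisect import bisect_left


def quantile(values, q):
    """Smallest value with at least a fraction q of the data at or below it."""
    ordered = sorted(values)
    k = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[k]


def calibrate_bins(abs_errors, factors, n_bins=3, alpha=0.95):
    """Return (bin edges, baseline half interval width per bin)."""
    # Interior bin edges are quantiles of the scaling factors.
    edges = [quantile(factors, i / n_bins) for i in range(1, n_bins)]
    scores_by_bin = [[] for _ in range(n_bins)]
    for err, factor in zip(abs_errors, factors):
        # Scaled absolute error is the nonconformity score for this row.
        scores_by_bin[bisect_left(edges, factor)].append(err / factor)
    # Each bin gets its own alpha-level quantile of the scores.
    baseline = [
        quantile(scores, alpha) if scores else float("nan")
        for scores in scores_by_bin
    ]
    return edges, baseline


# Toy calibration data: scaling factors and absolute errors for 10 rows.
factors = [0.25] * 4 + [0.5] * 3 + [1.0] * 3
abs_errors = [1, 2, 3, 4, 2, 4, 6, 3, 6, 9]
edges, baseline = calibrate_bins(abs_errors, factors, n_bins=3, alpha=0.75)
```

Because each bin has its own baseline, a subset of rows with unusually large scaled errors no longer has to share a single overall quantile with the rest of the data.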
- Parameters
model (xgb.Booster) – Underlying xgb.Booster model to generate prediction intervals with.
n_bins (int) – Number of bins to split data into based on the scaling factors.
-
__version__
The version of the pitci package that generated the object.
- Type
str
-
model
The underlying xgb.Booster model passed in when initialising the object.
- Type
xgb.Booster
-
leaf_node_counts
The number of times each leaf node in each tree was visited when making predictions on the calibration dataset. Each item in the list is a dict mapping leaf node index to counts for a given tree. The length of the list corresponds to the number of trees in model.
- Type
list
-
baseline_intervals
The default or baseline conformal half interval widths, which depend on the scaling factor values. When making prediction intervals, the correct baseline interval is looked up based on the scaling factor value and then multiplied by the scaling factor.
- Type
list
-
alpha
The confidence level of the conformal intervals that will be produced. Attribute is set when the calibrate method is run.
- Type
int or float
-
n_bins
Number of bins to split data into based on the scaling factors.
- Type
int
-
bin_quantiles
Quantiles of the scaling factor values that are used to define the limits of the bins. Attribute is set when the calibrate method is run.
- Type
float
-
SUPPORTED_OBJECTIVES
Booster supported objectives. If a model with an unsupported objective is passed when initialising the class object, an error will be raised.
- Type
list
-
__init__(model, n_bins=3)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(model[, n_bins]) – Initialize self.
calibrate(data[, response, alpha, train_data]) – Calibrate conformal intervals to a given sample of data at a given confidence level, alpha, between 0 and 1.
predict_with_interval(data) – Generate predictions with conformal intervals for the passed data.
-
calibrate(data, response=None, alpha=0.95, train_data=None)
Calibrate conformal intervals to a given sample of data at a given confidence level, alpha, between 0 and 1.
This method must be run before predict_with_interval() can be used to generate predictions.
There are 2 items to be calibrated: the leaf node counts stored in the leaf_node_counts attribute and the half interval widths stored in the baseline_intervals attribute.
The user has the option to specify the training sample that was used to build the model in the train_data argument. This allows the leaf_node_counts to be calibrated on the same data that the underlying model was built on, rather than on the separate calibration set passed in the data argument. The default interval width for a given alpha has to be set on a sample separate from the one used to build the model. Otherwise the errors will be smaller than they would be on a sample the underlying model has not seen before. For the leaf_node_counts, however, we ideally want counts from the train sample: we're not 'learning' anything new here, just recreating stats from when the model was built originally.
If response is not passed then the method will attempt to extract the response values from data using the get_label method.
The baseline_intervals are each calibrated to the required alpha level on the subset of the data where the scaling factor values fall into the range for that particular bin.
- Parameters
data (xgb.DMatrix) – Dataset to use to set baselines.
response (np.ndarray, pd.Series or None, default = None) – The response values for the records in data.
alpha (int or float, default = 0.95) – Confidence level for the intervals.
train_data (xgb.DMatrix or None, default = None) – Optional dataset that can be passed to set baseline leaf_node_counts from, separate to the data arg used to set the baseline_intervals widths.
-
predict_with_interval(data)
Generate predictions with conformal intervals for the passed data.
Each prediction is produced with an associated conformal interval. The default intervals are of a fixed width (baseline_intervals attribute) and this is scaled differently for each row. The scaling factors are calculated by counting the number of times each leaf node visited to make the prediction was visited in the calibration dataset, looking up values from the leaf_node_counts list. For the SplitConformalPredictor class the baseline intervals also depend on the scaling factors, rather than there being one interval as in the LeafNodeScaledConformalPredictor class.
The method is very similar to the predict_with_interval() method of the LeafNodeScaledConformalPredictor class, the only difference being that the baseline interval is looked up from the possible values using the scaling factor for each row.
- Parameters
data (xgb.DMatrix) – Data to generate predictions with conformal intervals on.
- Returns
predictions_with_interval – Array of predictions with intervals for each row in data. Output array will have 3 columns where the first is the lower interval, the second is the predictions and the third is the upper interval.
- Return type
np.ndarray
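The lookup-then-scale step described for predict_with_interval can be sketched as follows. This is a rough illustration under stated assumptions, not the library code: the function name predict_with_interval_sketch is hypothetical, the bin edges and baseline values are made-up numbers, and plain lists stand in for the np.ndarray that the real method returns.

```python
from bisect import bisect_left


def predict_with_interval_sketch(predictions, factors, bin_edges, baseline_intervals):
    """Build [lower, prediction, upper] rows from per-bin baseline widths."""
    rows = []
    for pred, factor in zip(predictions, factors):
        # Pick the baseline half width for the bin this scaling factor falls in,
        # then scale it by the row's own scaling factor.
        half_width = baseline_intervals[bisect_left(bin_edges, factor)] * factor
        rows.append([pred - half_width, pred, pred + half_width])
    return rows


# Hypothetical calibrated values: 2 interior edges define 3 bins.
rows = predict_with_interval_sketch(
    predictions=[10.0, 20.0],
    factors=[0.25, 1.0],
    bin_edges=[0.25, 0.5],
    baseline_intervals=[12.0, 12.0, 9.0],
)
```

The column order matches the documented return value: lower interval, prediction, upper interval.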