An aspect that is important but often overlooked in applied machine learning is intervals for predictions, be it confidence or prediction intervals. For classification tasks, beginning practitioners quite often conflate probability with confidence: probability of 0.5 is taken to mean that we are uncertain about the prediction, while a prediction of 1.0 means we are absolutely certain in the outcome. But there are two concepts being mixed up here. A prediction of 0.5 could mean that we have learned very little about a given instance, due to observing no or only a few data points about it. Or it could be that we have a lot of data, and the response is fundamentally uncertain, like flipping a coin.
For regression, a prediction returning a single value (typically meant to minimize the squared error) likewise does not relay any information about the underlying distribution of the data or the range of response values we might later see in the test data.
Looking at the following plots, both the left and right plot represent similar, learned models for predicting Y from X. But while the model predictions would be similar, confidence in them would be quite different for obvious reasons: we have much less and more spread out data in the second case.
A useful concept for quantifying the latter issue is prediction intervals. A prediction interval is an estimate of an interval into which the future observations will fall with a given probability. In other words, it can quantify our confidence or certainty in the prediction. Unlike confidence intervals from classical statistics, which are about a parameter of population (such as the mean), prediction intervals are about individual predictions.
For linear regression, calculating the predictions intervals is straightforward (under certain assumptions like the normal distribution of the residuals) and included in most libraries, such as R’s predict method for linear models.
Quantile regression forests
A general method for finding confidence intervals for decision tree based methods is Quantile Regression Forests.
The idea behind quantile regression forests is simple: instead of recording the mean value of response variables in each tree leaf in the forest, record all observed responses in the leaf. The prediction can then return not just the mean of the response variables, but the full conditional distribution (P(Y leq y mid X = x)) of response values for every (x). Using the distribution, it is trivial to create prediction intervals for new instances simply by using the appropriate percentiles of the distribution. For example, the 95% prediction intervals would be the range between 2.5 and 97.5 percentiles of the distribution of the response variables in the leaves. And of course one could calculate other estimates on the distribution, such as median, standard deviation etc. Unfortunately, quantile regression forests do not enjoy too wild of a popularity. While it is available in R’s quantreg packages, most machine learning packages do not seem to include the method.
Random forests as quantile regression forests
But here’s a nice thing: one can use a random forest as quantile regression forest simply by expanding the tree fully so that each leaf has exactly one value. (And expanding the trees fully is in fact what Breiman suggested in his original random forest paper.) Then a prediction trivially returns individual response variables from which the distribution can be built if the forest is large enough. One caveat is that expanding the tree fully can overfit: if that does happen, the intervals will be useless, just as the predictions. The nice thing is that just like accuracy and precision, the intervals can be cross-validated.
Let’s look at the well-known Boston housing dataset and try to create prediction intervals using vanilla random forest from scikit-learn:
from sklearn.ensemble import RandomForestRegressor import numpy as np from sklearn.datasets import load_boston boston = load_boston() X = boston["data"] Y = boston["target"] size = len(boston["data"])
We’ll use 400 samples for training, leaving 106 samples for test. The size of the forest should be relatively large, so let’s use 1000 trees.
trainsize = 400 idx = range(size) #shuffle the data np.random.shuffle(idx) rf = RandomForestRegressor(n_estimators=1000, min_samples_leaf=1) rf.fit(X[idx[:trainsize]], Y[idx[:trainsize]])
We can now define a function to calculate prediction intervals for every prediction:
def pred_ints(model, X, percentile=95): err_down =  err_up =  for x in range(len(X)): preds =  for pred in model.estimators_: preds.append(pred.predict(X[x])) err_down.append(np.percentile(preds, (100 - percentile) / 2. )) err_up.append(np.percentile(preds, 100 - (100 - percentile) / 2.)) return err_down, err_up
Let’s compute 90% prediction intervals and test how many observations in the test set fall into the interval.
err_down, err_up = pred_ints(rf, X[idx[trainsize:]], percentile=90) truth = Y[idx[trainsize:]] correct = 0. for i, val in enumerate(truth): if err_down[i] <= val <= err_up[i]: correct += 1 print correct/len(truth)
This is pretty close to what we expected: 90.6% of observations fell into the prediction intervals. Plotting the true values and predictions together with error bars visualizes this nicely.
If we set prediction interval to be 50% for the same model and test data, we see 51% of predictions fall into the interval, again very close to the expected value . Plotting the error bars again, we see that they are significantly smaller:
What can also be observed on the plot that on average, predictions that are more accurate have smaller prediction intervals since these are usually “easier” predictions to make. The correlation between absolute prediction error and prediction interval size is ~0.6 for this dataset.
And again, just as one can and should use cross-validation for estimating the accuracy of the model, one should also cross-validate the intervals to make sure that they give unbiased results on the particular dataset at hand. And just like one can do probability calibration, interval calbiration can also be done.
There are situations when the tree is not expanded fully, such that there is more than one data point per leaf. This can happen because either
a) a node is already pure, so splitting further makes no sense, or
b) the node is not pure, but the feature vector is exactly the same for all responses, so there isn’t anything to do a further split on.
In case of a), we know the response and node size, so we still know the distribution perfectly and can use it for calculating the intervals. If case b) happens, we are in trouble, since we don’t know the distribution of responses in the non-expanded leaf. Luckily, the latter very rarely happens with real world datasets and is easy to check for.
Utilizing prediction intervals can be very beneficial in many machine learning and data science tasks, since they can tell a lot about the underlying data that we are learning about and provide a simple way to sanity check our results. While they seem to enjoy relatively widespread use for linear models due to the ease of access with these methods, they tend to be underutilized for tree-based methods such as random forests. But actually they are relatively straightforward to use (keeping the caveats in mind) by utilizing the fact that a random forest can return a conditional distribution instead of just the conditional mean. In fact, estimating the intervals this way can be more robust than prediction intervals for linear methods, since it does not rely on the assumption of normally distributed residuals.