Kyurae Kim | Uncertainty Quantification is the Red Herring of Bayesian Machine Learning

Will Conformal Predictions Replace Bayesian Inference?

With the rise of conformal predictions, I hear doubts about the Bayesian approach to machine learning. This is especially true for Bayesian deep learning, where the Bayesian approach has barely made progress toward providing a computationally feasible baseline for predictive uncertainty quantification.

Uncertainty Quantification is a Red Herring

The problem I have with these “doubts” about the future of Bayesian machine learning is that they are founded on a false premise. For me, Bayesian machine learning was never about predictive uncertainty quantification. Okay, maybe the “never” is a bit of a stretch. But I do feel that there has been so much focus on the predictive uncertainty quantification aspect of Bayesian machine learning that it has completely overtaken the Bayesian cause.

For me, the Bayesian framework provides the following:

Uncertainty estimates of the parameters.
Uncertainty estimates of the predictions.
Data-driven regularization through marginalization.
Principled model comparison through Bayes factors.
Principled (with principles founded on probability theory) model design.
Decision-theoretic performance guarantees.

Uncertainty quantification is just one of these. Explaining what each bullet exactly means would be too long to qualify as a blog post. Nevertheless, let me discuss the third point, “Data-driven regularization through marginalization,” as I believe it is especially important for machine learning.

Going Bayesian Improves Accuracy

In the Bayesian framework, one makes predictions \(p(y \mid \mathcal{D})\) by marginalizing over the posterior \(p(\theta \mid \mathcal{D})\) as \(\begin{equation} p(y \mid \mathcal{D}) = \int p\left(y \mid \theta\right) \, p\left( \theta \mid \mathcal{D} \right) \, \mathrm{d}\theta. \end{equation}\) Here, averaging over \(p(\theta \mid \mathcal{D})\) automatically takes the parameter uncertainty into account, essentially regularizing the prediction. Thus, assuming the model is sound, fully Bayesian predictions should improve the predictive accuracy compared to naive point estimates. Personally, whenever a non-Bayesian model receives the Bayesian treatment, I expect the predictive accuracy to improve. In general, I don’t care about the predictive uncertainty; I just expect those numbers to go up!
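To make the regularization effect concrete, here is a minimal sketch on a toy model of my own choosing (a coin with a Beta prior, not something taken from the papers cited below). The function names and the uniform Beta(1, 1) prior are assumptions for illustration. For this model the predictive integral has a closed form, namely the posterior mean of the heads probability.

```python
import math

def mle_predictive(heads, flips):
    """Plug-in probability of heads using the maximum-likelihood estimate."""
    return heads / flips

def bayes_predictive(heads, flips, a=1.0, b=1.0):
    """Posterior predictive probability of heads under a Beta(a, b) prior.

    The posterior is Beta(a + heads, b + tails); integrating the heads
    probability against it gives the posterior mean.
    """
    return (a + heads) / (a + b + flips)

def log_loss(p_heads, outcome):
    """Negative log-probability of the observed outcome (1 = heads, 0 = tails)."""
    p = p_heads if outcome == 1 else 1.0 - p_heads
    return math.inf if p == 0.0 else -math.log(p)

# Three flips, all heads: a tiny dataset where the point estimate overfits.
heads, flips = 3, 3
p_mle = mle_predictive(heads, flips)      # puts all mass on heads
p_bayes = bayes_predictive(heads, flips)  # shrinks toward the prior mean

# Loss incurred if the next flip turns out to be tails.
loss_mle = log_loss(p_mle, 0)
loss_bayes = log_loss(p_bayes, 0)
```

With three heads out of three flips, the point estimate assigns zero probability to tails and pays an infinite log loss the first time a tail shows up, while the marginalized prediction stays finite. No predictive interval was ever computed here; the gain comes purely from averaging over the parameter.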

My favorite examples of this are the classic matrix factorization algorithms. For example, Bayesian principal component analysis (Bishop, 1998) and Bayesian non-negative matrix factorization (Schmidt et al., 2009) have been shown to be straight upgrades from their original maximum-likelihood variants. This has also been shown for neural networks by none other than Radford Neal himself (Neal, 2006).

For modern deep neural networks, it took some time to figure out whether such improvement could be obtained. However, with the computational power of Google, Andrew G. Wilson’s group has shown that convolutional neural networks achieve better predictive performance when given the Bayesian treatment (Izmailov et al., 2021).
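For a neural network, the predictive integral has no closed form, so it is typically approximated by averaging over posterior samples. As a sketch, assuming \(S\) samples \(\theta_1, \ldots, \theta_S\) drawn (approximately) from \(p(\theta \mid \mathcal{D})\), for example with Markov chain Monte Carlo (the symbols \(S\) and \(\theta_s\) are my notation), the estimate is \(\begin{equation} p(y \mid \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} p\left(y \mid \theta_s\right), \qquad \theta_s \sim p\left(\theta \mid \mathcal{D}\right). \end{equation}\) Each sample requires a pass over the data, which is roughly why this kind of experiment needs so much compute.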

Conclusions

Nonetheless, conformal predictions seem to be a promising approach for obtaining predictive uncertainty estimates. And this is fine; Bayesian machine learning has its unique agenda. So keep drinking the Bayesian Kool-Aid!

References

Christopher Bishop. Bayesian PCA. In Advances in Neural Information Processing Systems, 1998.

Mikkel N. Schmidt, Ole Winther, and Lars Kai Hansen. Bayesian Non-Negative Matrix Factorization. In Independent Component Analysis and Signal Separation, 2009.

Radford M. Neal. Classification with Bayesian Neural Networks. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, 2006.

Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. What Are Bayesian Neural Network Posteriors Really Like? In Proceedings of the 38th International Conference on Machine Learning, Jul 2021.
