With the rise of conformal prediction, I hear doubts about the Bayesian approach to machine learning. This is especially true for Bayesian deep learning, where the Bayesian approach has barely managed to provide a computationally feasible baseline for predictive uncertainty quantification.
The problem I have with these “doubts” about the future of Bayesian machine learning is that they are founded on a false premise. For me, Bayesian machine learning was never about predictive uncertainty quantification. Okay, maybe “never” is a bit of a stretch. But I do feel that there has been so much focus on the predictive uncertainty quantification aspect of Bayesian machine learning that it has completely overtaken the Bayesian cause.
For me, the Bayesian framework provides the following:
Uncertainty quantification is just one of these. Explaining exactly what each bullet means would be too long for a blog post. Nevertheless, let me discuss the third point, “Data-driven regularization through marginalization,” as I believe it is especially important for machine learning.
In the Bayesian framework, one makes predictions \(p(y \mid \mathcal{D})\) by marginalizing over the posterior \(p(\theta \mid \mathcal{D})\) as \(\begin{equation} p(y \mid \mathcal{D}) = \int p\left(y \mid \theta\right) \, p\left( \theta \mid \mathcal{D} \right) \, \mathrm{d}\theta. \end{equation}\) Here, \(p(\theta \mid \mathcal{D})\) automatically takes the parameter uncertainty into account, essentially regularizing the prediction. Thus, assuming the model is sound, fully Bayesian predictions should improve the predictive accuracy compared to naive point estimates. Personally, whenever a non-Bayesian model receives the Bayesian treatment, I expect the predictive accuracy to improve. In general, I don’t care about the predictive uncertainty; I just expect those numbers to go up!
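To make the marginalization concrete, here is a minimal sketch in Python/NumPy of approximating the posterior predictive by Monte Carlo and comparing it against a naive plug-in point estimate. The toy conjugate Gaussian model and all numbers are my own illustrative choices, not anything from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: y ~ Normal(theta, 1), prior theta ~ Normal(0, 1).
# After observing data D, the posterior over theta is Gaussian in closed form.
data = rng.normal(1.5, 1.0, size=20)
n = len(data)
post_var = 1.0 / (1.0 + n)          # posterior variance of theta
post_mean = post_var * data.sum()   # posterior mean of theta

def likelihood(y, theta):
    # p(y | theta) for the Normal(theta, 1) likelihood
    return np.exp(-0.5 * (y - theta)**2) / np.sqrt(2 * np.pi)

# Monte Carlo marginalization: p(y | D) ~= (1/S) sum_s p(y | theta_s)
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
y = 2.0
p_marginal = likelihood(y, theta_samples).mean()

# Naive plug-in prediction using only a point estimate of theta
p_plugin = likelihood(y, post_mean)

# The marginalized predictive spreads mass according to parameter uncertainty
print(p_marginal, p_plugin)
```

The marginalized predictive here is exactly a Gaussian whose variance is inflated by the posterior variance of \(\theta\), which is the regularization effect the equation above describes.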
My favorite examples of this are the classic matrix factorization algorithms. For example, Bayesian principal component analysis (Bishop, 1998) and Bayesian non-negative matrix factorization (Schmidt et al., 2009) have been shown to be straight upgrades over their original maximum-likelihood variants. This has also been shown for neural networks by none other than Radford Neal himself (Neal, 2006).
For modern deep neural networks, it took some time to figure out whether such an improvement could be obtained. However, with the computational power of Google, Andrew G. Wilson’s group has shown that convolutional neural networks achieve better predictive performance under a fully Bayesian treatment (Izmailov et al., 2021).
Nonetheless, conformal prediction seems to be a promising approach for obtaining predictive uncertainty estimates. And this is fine; Bayesian machine learning has its own unique agenda. So keep drinking the Bayesian Kool-Aid!
This “intuitive” explanation has, however, been scientifically disputed multiple times by now, and I do not know where the scientific consensus currently stands on this subject. Because of this, I never got too deep into the topic, and I only knew the attribution to Keskar et al. However, it turns out that J. Schmidhuber and S. Hochreiter came up with this idea in… 1994! Hell, they even have a paper named “Flat minima” (Hochreiter & Schmidhuber, 1997). Even better, one of their papers on the topic was presented at NIPS’94 (Hochreiter & Schmidhuber, 1994). To me personally, this sets a whole new standard for being Schmidhubered… (Fortunately though, the two papers by Schmidhuber and co. did and still do get properly cited.)
Although I do not consider myself to be a statistician, I regularly read statistics papers in order to renew my Bayesian membership card. (If anybody wants to register too, please send me an email.) And Charles Stein made various important contributions to the Bayesian cause.
For example, the James-Stein estimator (Stein, 1956), although not directly a Bayesian idea, has a nice empirical Bayes (Efron & Morris, 1973) interpretation. (This would be part of a running joke that Bayesians tend to claim anything that works is actually a Bayesian method in disguise.) In more detail, the James-Stein estimator showed that certain biased estimators can have dramatically lower error than the maximum likelihood estimator under certain loss functions. This has directly motivated all kinds of regularized/shrinkage estimators.
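To see the shrinkage effect numerically, here is a small simulation sketch comparing the positive-part James-Stein estimator against the MLE for the mean of a multivariate Gaussian. The dimension, number of trials, and seed are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 10, 2000
theta = rng.normal(0.0, 1.0, size=d)    # true mean, fixed across trials

mse_mle, mse_js = 0.0, 0.0
for _ in range(trials):
    x = theta + rng.normal(size=d)      # single observation x ~ N(theta, I)
    # MLE is just x itself; James-Stein shrinks x toward the origin
    shrink = max(0.0, 1.0 - (d - 2) / np.dot(x, x))   # positive-part JS
    js = shrink * x
    mse_mle += np.sum((x - theta)**2)
    mse_js += np.sum((js - theta)**2)

# Average squared error: the MLE risk is d; JS is strictly smaller for d >= 3
print(mse_js / trials, mse_mle / trials)
```

The striking part, which caused the surprise mentioned below, is that this dominance holds for every true mean vector once the dimension is at least three.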
The James-Stein estimator was apparently a big surprise to the statisticians back in the day. Even Michael I. Jordan described it as “mysterious and beautiful”.
The contribution of Stein that I’m perhaps most used to is Stein’s method (Stein, 1972). Stein’s method is a way to measure the difference between two distributions with respect to an arbitrary class of functions, in the form of
\[\begin{equation} d\left(p, q\right) = \sup_{f \in \mathcal{H}} \left| \int f\,dp - \int f\,dq \right|. \end{equation}\] This generalizes many other classic distance measures, such as total variation. The key element is the freedom to choose \(f \in \mathcal{H}\).
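As a rough illustration (not Stein’s method proper, just the integral-probability-metric form above, with a small hand-picked function class standing in for \(\mathcal{H}\)), one can estimate this discrepancy from samples:

```python
import numpy as np

rng = np.random.default_rng(2)

# Samples from two distributions p and q
xp = rng.normal(0.0, 1.0, size=50_000)   # p = N(0, 1)
xq = rng.normal(0.5, 1.0, size=50_000)   # q = N(0.5, 1)

# A tiny, hand-picked function class H (a crude stand-in for, e.g.,
# all 1-Lipschitz functions or an RKHS ball)
H = [np.tanh, np.sin, lambda x: np.clip(x, -1.0, 1.0)]

# d(p, q) = sup_{f in H} | E_p[f] - E_q[f] |, estimated by sample means
d_pq = max(abs(f(xp).mean() - f(xq).mean()) for f in H)

# Sanity check: the same statistic between two halves of the p sample
d_pp = max(abs(f(xp[:25_000]).mean() - f(xp[25_000:]).mean()) for f in H)

print(d_pq, d_pp)   # discrepancy is larger between genuinely different distributions
```

The richer the class \(\mathcal{H}\), the more differences the discrepancy can detect; choosing \(\mathcal{H}\) as an RKHS ball is what makes the kernelized discrepancies below computable in closed form.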
While Stein’s method has been a textbook thing (or, more exactly, a graduate-level mathematical statistics textbook thing) for some time, it has recently sparked new interest: mainly, Stein variational methods (Liu & Wang, 2016) (or, more generally, variational particle methods) and kernelized discrepancies (Gorham & Mackey, 2015). Both of these have created their own fields and have now become active, promising lines of research that are at the heart of the Bayesian front. (An interesting bit of trivia is that Gorham and Mackey originally implemented their method in an early version of Julia.)
During self-isolation, I stumbled upon a paper called A Conversation with Charles Stein (DeGroot, 1986), which is more or less a transcript of an interview of Charles Stein by Morris DeGroot (also a very famous statistician), published in Statistical Science. The interview was published in 1986, which seems to be around the time Charles Stein retired, but amazingly enough, he lived until 2016!
DeGroot: Since you brought up the subject, what is your view of the Bayesian philosophical framework?
Stein: Well, it’s rather negative. Of course a lot of good work has been done by people from that point of view. But basically the Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine, even without really knowing what that doctrine is. For example, would you be able to specify to me what you would consider an authoritative presentation of the Bayesian view that we are so often approached to accept?
It is not clear to me which doctrine Charles is talking about. But I would argue that the frequentist framework also involves lots of doctrines (also known as assumptions). It’s true, though, that the frequentist doctrines are, in general, much less controversial if you come from certain backgrounds. For example, it is easier to assume that a “true” parameter exists than to admit that there is no such thing.
DeGroot: Well, I’m not being interviewed. [Laughs] I could put in a plug for my book on Bayesian decision theory that gives an axiomatic system for probability and utility theory which together imply the entire decision-making process. I mean, normatively anyway.
Stein: Yes, but of course that is the thing. One is asked to accept something normatively before one knows what that thing really is, rather than the attitude that we have toward other systems where we set out axioms or definitions and use them for the purpose of developing a system, and then if the system turns out to be interesting we pursue this. But we never ask whether those axioms are true or not; rather, we ask if we can find instances in which this axiomatic development is useful. If so, we accept it. In particular, we try to judge the consequences. Whereas, as you know, there are grave difficulties in trying to apply the Bayesian notions to interesting problems because of the difficulty of choosing a prior distribution. There is one point of view specified by Jeffreys who seems to be saying that there is a prescription, which he did not invent but which he seems to endorse, for choosing a (usually improper) prior distribution, and that simply does not work in general. The alternative is that the choice of a prior distribution must be made subjectively, and that I find completely inappropriate. Well, what can one say? Of course, statistical decision theory gives us, within a certain class of problems, an indication of how prior distributions do enter statistics from another point of view. And so in some ways the difference between Wald’s decision-theoretic approach and the Bayesian approach is small.
From this, we can clearly see that Charles Stein has the classic critical frequentist view on the Bayesian approach. However, we have to consider that the fundamentals of Bayesian theory only started to mature in the 90s and this interview took place way before that. Plus, the Bayesian framework has now established lots of connections with the frequentist framework (whether that is the appropriate attitude is, interestingly, quite a controversial subject). Furthermore, using subjective priors has been shown to be sensible as long as the model is evaluated objectively and extensively (Gelman et al., 2020).
I find the point about prior selection, however, to still be a valid criticism. Although some people treat prior selection as a “solved problem,” in practice it is still a very difficult subject that needs lots of work on a problem-by-problem basis. Fortunately, many are working on principled procedures for eliciting subjective informative priors (Mikkola et al., 2021) and on designing priors with good frequentist properties (Consonni et al., 2018). But we need to keep in mind that frequentist methods equally involve “manual work” to establish frequentist guarantees on a method-by-method basis. Therefore, on a practical note, neither is less difficult than the other.
An interesting bit is the comment on what we now call Jeffreys priors. They are known to occasionally mess up model comparison with Bayes factors, which I believe is what Charles Stein is discussing here. (Please let me know if you think he is talking about a different aspect of Jeffreys priors.)
DeGroot: Because Wald used priors as a technical device for determining optimal procedures.
Stein: Yes, and therefore we are considering the same procedures. Roughly speaking, the basic theorems of decision theory say that in some sense good procedures and Bayes procedures are very much the same thing. This is, of course, a gross oversimplification, but it does enable us to understand how prior distributions come in.
Interestingly, it seems that Charles Stein is sympathetic to (possibly subjective) Bayesian approaches when they come from a decision-theoretic perspective.
DeGroot: Let’s talk about probability for a moment. You say that the notion of subjective probability is unacceptable to you. What definition of probability do you use?
Stein: Essentially Kolmogorov’s. That it is a mathematical system.
DeGroot: Simply any set of numbers that satisfies the axioms of the calculus of probabilities.
Stein: Yes.
DeGroot: But what do these numbers represent in the real world?
Stein: Well, there is no unique interpretation. And of course I’m talking about Kolmogorov’s old interpretation of probability and not the complexity interpretation. In his book he mentions briefly two aspects of the interpretation. The first is the traditional relative frequency of occurrence in the long run. And the second is that when one puts forward a probabilistic model that is to be taken completely seriously for a real world phenomenon, then one is asserting in principle that any single specified event having very small probability will not occur. This, of course, combined with the law of large numbers, weak or strong, really is a broader interpretation than the frequency notion. So, in fact, the frequency interpretation in that sense is redundant. This doesn’t answer the question, “When I say the probability is 1/6 that this die will come up 6 on the next toss, what does that statement mean?” But then in no serious work in any science do we answer the question, “What does this statement mean?” It is an erroneous philosophical point of view that leads to this sort of question.
I find this comment from Charles Stein quite surprising. After all, science branched out of philosophy, so there is no reason to shy away from philosophical questions and philosophical points of view. In fact, philosophical discussions are scattered throughout the history of science. I always thought that statistics was more science than mathematics (hence some statistics departments prefer to call themselves departments of statistical science). From this perspective, perhaps Charles Stein considered himself to be more of a mathematician than a statistician.
DeGroot: But surely that means that there is a subjective element entering into the development of the models and the numerical probabilities.
Stein: But, you see, that’s something very different from saying that one is absolutely never permitted to consider a probabilistic model in which anything is unknown, and that is the strict interpretation of the Bayesian point of view. Some statisticians seem to try to accept this point of view as sort of a philosophical position but to deny it in practice, which is not reasonable.
This comment is interesting because, in practice, it is very rare to encounter a problem where absolutely no prior information is available. Oftentimes (or rather, all the time), we at least have an idea about the support or the extreme values of the data. And even this much information is pretty useful as far as prior information goes.
DeGroot: Do you think your political views influence the kind of problems you work on and your scientific philosophy at all, or are they separate?
Stein: I’d say they are largely separate. Of course, I don’t do military work, not even nominally. That is, I haven’t been on a military contract for 18 years. Actually, even before that it was distasteful but I allowed myself to be talked into it. But this is a hard question to answer. I would admit that my work is largely independent of my political attitudes. I don’t agree with Lindley that a subjective approach to probability is a consequence of being a Marxist or socialist.
Lindley would be surprised to know how much ground the “Marxists” have gained to this day. But seriously, it is interesting to know that Lindley thought this way.
Stein: Yes. I may want to modify some of my answers when I see the transcript of this conversation. Somehow I don’t seem to think along the same lines as other people, which is useful. It’s good that different people think differently.
DeGroot: Thank you, Charles.
From this nice interview by DeGroot, I could see how Charles Stein, and probably many well-respected statisticians of his day, thought of Bayesian approaches. I would be interested to know whether Charles Stein remained active after this interview. Unfortunately, after a quick search, I couldn’t find a complete bibliography of Charles Stein’s works.
Currently, the Gaussian process ecosystem in Julia is somewhat fragmented. We have GaussianProcesses.jl, a standalone package that does just GPs; AbstractGPs.jl, which tries to combine multiple GP-related libraries under a standardized API; and AugmentedGaussianProcesses.jl, which provides some advanced GP algorithms on top of AbstractGPs.jl. Unfortunately, none of these libraries currently work on GPUs. This is way behind the norm in Python, where GPyTorch supports GPUs quite well.
Here is a summary of the current trends for implementing GPs in Julia.
KernelFunctions.jl is the key here. The main issue is that most GP libraries (including KernelFunctions.jl) rely on Distances.jl, which is a package for efficiently computing Gram matrices (or distance matrices). Although Distances.jl is heavily optimized, it is optimized too much: it is very difficult to make it compatible with CUDA.jl (an amazing package that is a very good reason to convert to Julia). This bottleneck has been the showstopper, since pretty much everybody relies on KernelFunctions.jl. There has been some discussion about ditching Distances.jl in favor of Tullio.jl, but this also has its downsides. To summarize, unless KernelFunctions.jl moves on to Tullio.jl, there is not much to expect at this point.
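As an aside, the Gram-matrix computation that these packages delegate to Distances.jl can be written in a GPU-friendly way by hand. Here is a language-agnostic sketch in Python/NumPy (the function names are mine) using the standard expansion \(\lVert x - y \rVert^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2 x^{\top} y\), which reduces the whole computation to one matrix multiply plus two row-norm computations:

```python
import numpy as np

def sq_dists(X, Y):
    # Pairwise squared distances between rows of X (n, d) and Y (m, d)
    xx = np.sum(X**2, axis=1)[:, None]   # (n, 1)
    yy = np.sum(Y**2, axis=1)[None, :]   # (1, m)
    D = xx + yy - 2.0 * (X @ Y.T)        # the GPU-friendly BLAS formulation
    return np.maximum(D, 0.0)            # clamp tiny negatives from roundoff

def se_ard_gram(X, Y, lengthscales, sigma2):
    # Squared-exponential ARD kernel: scale each dimension, then exponentiate
    D = sq_dists(X / lengthscales, Y / lengthscales)
    return sigma2 * np.exp(-0.5 * D)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
K = se_ard_gram(X, X, np.ones(5), 1.0)
```

Because the heavy lifting is a single matrix multiply, the same formulation runs unchanged on GPU array types, which is exactly the property Distances.jl’s pairwise loops lack.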
Regardless of the current state of GPU support, I urgently needed GPs to work on GPUs right now. The things we normally expect from GPU support for GPs are the following:
Items 2 and 3 work (pretty much) out of the box in Julia; items 1 and 4 are the tricky part. So, I ended up spending a few days writing a few CUDA kernels using KernelAbstractions.jl.
The implementation can be found here.
It supports the following two covariance kernels:
\(\begin{align}
k\left(\mathbf{x}, \mathbf{y}\right) &= \sigma^2 k_{\text{SE ARD}}\left(\mathbf{x}, \mathbf{y} \right) + \epsilon^2
\newline
k\left(\mathbf{x}, \mathbf{y}\right) &= \sigma^2 k_{\text{Matern 5/2 ARD}}\left(\mathbf{x}, \mathbf{y} \right) + \epsilon^2
\end{align}\)
where SE ARD and Matern 5/2 stand for the squared-exponential and Matern 5/2 kernels with automatic relevance determination (ARD), which are, arguably, the most widely used covariance kernels.
We have \(D + 2\) hyperparameters here: the \(D\) ARD length scales, the noise variance \(\epsilon^2\), and the function variance \(\sigma^2\).
The log marginal likelihood of a Gaussian process is \begin{equation} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) = -\frac{1}{2}\mathbf{y}^{\top} \mathbf{K}^{-1} \mathbf{y} - \frac{1}{2} \log \mathrm{det}\, \mathbf{K} - \frac{N}{2} \log 2 \pi. \end{equation} This is implemented as
You can use the squared-exponential kernel by swapping matern52_gpu for se_gpu and gram_matern52_derivative_gpu for gram_se_derivative_gpu.
The other routines are self-contained in gpu_cuda_utils.jl.
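For reference, the log marginal likelihood above is typically computed through a Cholesky factorization, which gives both \(\mathbf{K}^{-1}\mathbf{y}\) and \(\log \mathrm{det}\, \mathbf{K}\) cheaply and stably. Here is a minimal, CPU-only sketch in Python/NumPy (not the CUDA implementation from the repository):

```python
import numpy as np

def gp_logpdf(K, y):
    # Log marginal likelihood of a zero-mean GP with covariance K.
    # K = L L^T, so K^{-1} y costs two triangular solves and
    # log det K = 2 * sum(log(diag(L))).
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = K^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(4)
n = 50
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)     # a well-conditioned covariance matrix
y = rng.normal(size=n)
print(gp_logpdf(K, y))
```

The vector alpha is worth caching: as the gradient formulas below show, it is reused by every hyperparameter gradient.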
For the gradients, the GPML book (Rasmussen & Williams, 2006) shows how to differentiate the log likelihood. For the record, the gradients with respect to the observations and the kernel hyperparameters are \(\begin{align} \nabla_{\mathbf{y}} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= -\mathbf{K}^{-1} \mathbf{y} \\ \nabla_{\epsilon^2} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= \frac{1}{2} \, \mathbf{y}^{\top} \mathbf{K}^{-1} \mathbf{K}^{-1} \, \mathbf{y} - \frac{1}{2} \, \mathrm{tr}\left( \mathbf{K}^{-1} \right) \\ \nabla_{\sigma^2} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= \frac{1}{2} \, \mathbf{y}^{\top} \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \sigma^2} \mathbf{K}^{-1} \, \mathbf{y} - \frac{1}{2} \, \mathrm{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \sigma^2} \right) \\ \nabla_{\ell^2} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= \frac{1}{2} \, \mathbf{y}^{\top} \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \ell^2} \mathbf{K}^{-1} \, \mathbf{y} - \frac{1}{2} \, \mathrm{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \ell^2} \right), \end{align}\) where, clearly, there are lots of opportunities for reuse (the vector \(\mathbf{K}^{-1}\mathbf{y}\) appears in every line). Therefore, writing our own gradients should be far more efficient for GPs, both in terms of time and memory.
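As a sanity check on the formulas above, here is a small NumPy sketch (again, not the CUDA code) that computes the gradient with respect to the noise variance, where \(\partial \mathbf{K} / \partial \epsilon^2 = \mathbf{I}\), and verifies it against finite differences:

```python
import numpy as np

def loglik(K, y):
    # Zero-mean GP log marginal likelihood via Cholesky
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

def grad_eps2(K, y):
    # d/d(eps2) log p = 1/2 y^T K^{-1} K^{-1} y - 1/2 tr(K^{-1}),
    # i.e., the general formula with dK/d(eps2) = I; alpha = K^{-1} y is reused
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    return 0.5 * alpha @ alpha - 0.5 * np.trace(Kinv)

rng = np.random.default_rng(5)
n = 30
A = rng.normal(size=(n, n))
K0 = A @ A.T                    # kernel part of the covariance
y = rng.normal(size=n)
eps2 = 2.0
g = grad_eps2(K0 + eps2 * np.eye(n), y)

# Finite-difference check of the analytic gradient
h = 1e-5
fd = (loglik(K0 + (eps2 + h) * np.eye(n), y)
      - loglik(K0 + (eps2 - h) * np.eye(n), y)) / (2 * h)
print(g, fd)
```

The same pattern, with \(\partial \mathbf{K} / \partial \theta\) swapped in, covers the \(\sigma^2\) and length-scale gradients.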
You can compute the gradients using Zygote as
Note that the gradients with respect to X_dev are not implemented, but they shouldn’t be too hard to add.
I will now compare the GPU implementation against AbstractGPs.
I will use 32-bit floating-point numbers, since most GPUs perform very poorly with 64-bit ones.
Since I will use my poor little GTX 1050 GPU, the numbers should be much better on a proper workstation with a beefier GPU.
To get proper performance measurements, I turned off frequency scaling and paused Youtube.
(Imagine how bored I was during the experiments.)
In terms of numerical accuracy, the GPU version is close to the result of AbstractGPs at the 1e-4 tolerance level:
In terms of performance, here is an execution time comparison:
The error bars are the 80% empirical quantiles and \(N\) is the number of datapoints. We can see that the GPU quickly becomes more efficient for \(N>100\). In general, it is about 10 times faster, which is pretty good for a simple implementation without any GPU-specific optimization (not even using shared memory!). Since the GTX 1050 is supposed to achieve 1 TFLOPS and most modern CPUs achieve around 200 GFLOPS, this is close to the most we can hope for.
The main.jl file in the repository contains a realistic example with predictions. I performed MAP-II hyperparameter optimization using Optim.jl on the Boston housing dataset. Here are the results:
┌ Info: MAP-II Hyperparameter Optimization Result
│ likelihood_before = -544.3303199616416
│ likelihood_after = -116.86849745187607
│ rmse_before = 0.60338885f0
│ rmse_after = 0.3102568f0
│ lpd_before = -0.8926057396811591
└ lpd_after = -0.16185267732364805
Here, before denotes the initial hyperparameters used without optimization and after denotes the result of MAP-II.
We can see that everything is in working order.
When the Cholesky factorization fails, the current implementation does not throw. Instead, it returns -Inf for the likelihood and CUDA.zeros arrays for the gradients.
Cholesky with CUDA

Update: this has been fixed by sethaxen. See also the issues at ChainRules.jl and Zygote.jl.
While doing this, I ran into a bug that prevents Cholesky from being differentiated by Zygote, which I reported. A quick fix is to use the following snippet:
Update: this has been fixed by myself.
The weird part of my solution here is the two calls to triu, which create a normal Matrix that is upper triangular, in contrast to the UpperTriangular adaptor. This is necessary because, currently, multiplying two UpperTriangular matrices on the GPU is extremely slow.
Running the profiler seems to show that there is a weird device memory copy somewhere that takes forever, but I didn’t pursue the matter further.