# Uncertainty Quantification is the Red Herring of Bayesian Machine Learning

*2022-12-11*

### Will Conformal Predictions Replace Bayesian Inference?

With the rise of conformal prediction, I hear doubts about the Bayesian approach to machine learning. This is especially true for Bayesian deep learning, where the Bayesian approach has barely made progress toward providing a computationally feasible baseline for predictive uncertainty quantification.

### Uncertainty Quantification is a Red Herring

The problem I have with these “doubts” about the future of Bayesian machine learning is that they are founded on a false premise. For me, Bayesian machine learning was never about predictive uncertainty quantification. Okay, maybe “never” is a bit of a stretch. But there has been so much focus on the predictive uncertainty quantification aspect of Bayesian machine learning that it has completely overtaken the Bayesian cause.

For me, the Bayesian framework provides the following:

• Uncertainty estimates of the parameters.
• Uncertainty estimates of the predictions.
• Data-driven regularization through marginalization.
• Principled model comparison through Bayes factors.
• Principled (with principles founded on probability theory) model design.
• Decision-theoretic performance guarantees.

Uncertainty quantification is just one of these. Explaining exactly what each bullet means would be too long for a blog post. Nevertheless, let me discuss the third point, “data-driven regularization through marginalization,” as I believe it is especially important for machine learning.

### Going Bayesian Improves Accuracy

In the Bayesian framework, one makes predictions $$p(y \mid \mathcal{D})$$ by marginalizing over the posterior $$p(\theta \mid \mathcal{D})$$ as $$\begin{equation} p(y \mid \mathcal{D}) = \int p\left(y \mid \theta\right) \, p\left( \theta \mid \mathcal{D} \right) \, \mathrm{d}\theta. \end{equation}$$ Here, $$p(\theta \mid \mathcal{D})$$ automatically takes the parameter uncertainty into account, essentially regularizing the prediction. Thus, assuming the model is sound, fully Bayesian predictions should improve the predictive accuracy over naive point estimates. Personally, whenever a non-Bayesian model receives the Bayesian treatment, I expect the predictive accuracy to improve. In general, I don’t care about the predictive uncertainty; I just expect those numbers to go up!
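To make the regularization effect of marginalization concrete, here is a deliberately tiny sketch in Python (a conjugate Beta–Bernoulli coin model; the variable names and numbers are purely illustrative, not from any of the papers below). The maximum-likelihood estimate of a coin that came up heads three times out of three is a wildly overconfident 1.0, while marginalizing over the posterior pulls the prediction back toward the prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three coin flips, all heads.
flips = np.array([1, 1, 1])

# Maximum-likelihood point estimate: certain that tails never happens.
theta_mle = flips.mean()  # = 1.0

# Bayesian treatment under a uniform Beta(1, 1) prior: the posterior is
# Beta(1 + #heads, 1 + #tails), and the predictive p(y = 1 | D) marginalizes
# theta out; here it reduces to the posterior mean, estimated by Monte Carlo.
a = 1 + int(flips.sum())
b = 1 + len(flips) - int(flips.sum())
theta_samples = rng.beta(a, b, size=100_000)
p_heads_bayes = theta_samples.mean()  # close to a / (a + b) = 0.8
```

The Monte Carlo average over posterior samples is exactly the integral above in miniature; the same mechanism is what regularizes the much bigger models discussed next.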

My favorite examples of this are the classic matrix factorization algorithms. For example, Bayesian principal component analysis (Bishop, 1998) and Bayesian non-negative matrix factorization (Schmidt et al., 2009) have been shown to be straight upgrades over their original maximum-likelihood variants. The same has been shown for neural networks by none other than Radford Neal himself (Neal, 2006).

For modern deep neural networks, it took some time to figure out whether such an improvement could be obtained. However, with the computational power of Google, Andrew G. Wilson’s group has shown that fully Bayesian convolutional neural networks do achieve better predictive performance (Izmailov et al., 2021).

### Conclusions

Nonetheless, conformal prediction seems to be a promising approach for obtaining predictive uncertainty estimates. And this is fine; Bayesian machine learning has its own unique agenda. So keep drinking the Bayesian Kool-Aid!

## References

1. Bayesian PCA
In Advances in Neural Information Processing Systems, 1998
2. Bayesian Non-Negative Matrix Factorization
In Independent Component Analysis and Signal Separation, 2009
3. Classification with Bayesian Neural Networks
In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, 2006
4. What Are Bayesian Neural Network Posteriors Really Like?
In Proceedings of the 38th International Conference on Machine Learning, Jul 2021
# Being Schmidhubered on Deep Learning and Flat Minima

*2022-10-04*

Until very recently, the incredible generalization capabilities of deep neural networks have been attributed to flat minima. That is, “flat minima” are said to generalize better because of the intuitive explanation that flatter minima are more robust to perturbations (such as the resampling of the training dataset). For some time, this discovery has been attributed to (Keskar et al., 2017), often accompanied by their sketch of the intuition.

This “intuitive” explanation has been scientifically disputed multiple times by now, and I do not know where the scientific consensus currently stands on the subject. Because of this, I never got too deep into the topic, and I only knew of the attribution to Keskar et al. However, it turns out that J. Schmidhuber and S. Hochreiter came up with this idea in… 1994! Hell, they even have a paper named “Flat minima” (Hochreiter & Schmidhuber, 1997). Even better, one of their papers on the topic was presented at NIPS’94 (Hochreiter & Schmidhuber, 1994). To me personally, this sets a whole new standard for being Schmidhubered… (Fortunately, though, the two papers by Schmidhuber and co. did and still do get properly cited.)

## References

1. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
In Proceedings of the International Conference on Learning Representations, Feb 2017
2. Flat Minima
Neural Computation, Jan 1997
3. Simplifying Neural Nets by Discovering Flat Minima
In Advances in Neural Information Processing Systems, Jan 1994
# Was Charles Stein a Bayesian?

*2022-07-11*

I recently contracted COVID (probably while watching the Rolling Stones perform at Hyde Park…) and therefore had to self-isolate. During this time, I became curious whether the legendary Charles M. Stein (1920–2016) was a Bayesian, given his huge contributions to Bayesian statistics and machine learning.

## Charles Stein and The Bayesians

Although I do not consider myself a statistician, I regularly read statistics papers in order to renew my Bayesian membership card. (If anybody wants to register too, please send me an email.) And Charles Stein made various important contributions to the Bayesian cause.

### James-Stein Estimator

For example, the James–Stein estimator (Stein, 1956), although not directly a Bayesian idea, has a nice empirical Bayes (Efron & Morris, 1973) interpretation. (This would be part of a running joke that Bayesians tend to claim anything that works is actually a Bayesian method in disguise.) In more detail, the James–Stein estimator showed that some types of biased estimators can have dramatically less error than the maximum likelihood estimator under certain loss functions. This has directly motivated all kinds of regularized/shrinkage estimators.
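The phenomenon is easy to simulate. Here is a quick Python sketch (dimensions, seeds, and the choice of true mean are all arbitrary): for dimension $$p \geq 3$$, shrinking the maximum-likelihood estimate toward zero by a data-dependent factor lowers the total squared error, even though every coordinate estimate becomes biased:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_trials = 10, 20_000

# True mean of a p-dimensional Gaussian with identity covariance.
theta = np.full(p, 0.5)

# One observation per trial; the MLE of theta is the observation itself.
x = rng.normal(theta, 1.0, size=(n_trials, p))

# James-Stein: shrink each observation toward zero by a data-dependent factor.
shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
x_js = shrink * x

# Average squared-error loss over the trials.
mse_mle = np.mean(np.sum((x - theta) ** 2, axis=1))    # close to p = 10
mse_js = np.mean(np.sum((x_js - theta) ** 2, axis=1))  # strictly smaller for p >= 3
```

The MLE’s risk is exactly $$p$$, while the shrinkage estimator’s is noticeably lower here; this uniform domination is what made the result so shocking.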

The James–Stein estimator was apparently a big surprise to the statisticians back in the day. Even Michael I. Jordan has called it “mysterious and beautiful.”

### Stein’s Method

The contribution of Stein that I am perhaps most familiar with is Stein’s method (Stein, 1972). Stein’s method is a way to measure the difference between two distributions with respect to an arbitrary class of functions, in the form of

$$\begin{equation} d\left(p, q\right) = \sup_{f \in \mathcal{H}} \left| \int f\,\mathrm{d}p - \int f\,\mathrm{d}q \right|. \end{equation}$$

This generalizes many other classic distance measures such as total variation. The key element is the freedom to choose $$f \in \mathcal{H}$$.

While Stein’s method has been a textbook thing (or, more exactly, a graduate-level mathematical statistics textbook thing) for some time, it has recently sparked new interest: mainly, Stein variational methods (Liu & Wang, 2016) (or, more generally, variational particle methods) and kernelized discrepancies (Gorham & Mackey, 2015). Both of these have created their own fields and have now become active, promising lines of research at the heart of the Bayesian front. (An interesting bit of trivia is that Gorham and Mackey originally implemented their method in an early version of Julia.)
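One concrete instance of the discrepancy above, which also underlies the kernelized discrepancies just mentioned: taking $$\mathcal{H}$$ to be the unit ball of a reproducing kernel Hilbert space turns the supremum into a closed-form expression, the maximum mean discrepancy (MMD). A minimal Python sketch (kernel bandwidth, sample sizes, and function names are my own choices):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 bandwidth^2))."""
    d2 = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.clip(d2, 0.0, None) / (2.0 * bandwidth**2))

def mmd2(x, y):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    return (gaussian_kernel(x, x).mean()
            + gaussian_kernel(y, y).mean()
            - 2.0 * gaussian_kernel(x, y).mean())

rng = np.random.default_rng(2)
# Same distribution: the discrepancy should be near zero.
same = mmd2(rng.normal(0.0, 1.0, (500, 2)), rng.normal(0.0, 1.0, (500, 2)))
# Shifted mean: the discrepancy should be clearly positive.
diff = mmd2(rng.normal(0.0, 1.0, (500, 2)), rng.normal(2.0, 1.0, (500, 2)))
```

The point of the kernelized variants is precisely that the supremum over $$f$$ never has to be computed explicitly; it collapses into kernel evaluations.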

## Was Charles Stein a Bayesian?

During self-isolation, I stumbled upon a paper called A Conversation with Charles Stein (DeGroot, 1986), which is more or less a transcript of an interview with Charles Stein by Morris DeGroot (also a very famous statistician), published in Statistical Science. The interview was published in 1986, which seems to be around the time Charles Stein retired, but amazingly enough, he lived until 2016!


DeGroot: Since you brought up the subject, what is your view of the Bayesian philosophical framework?

Stein: Well, it’s rather negative. Of course a lot of good work has been done by people from that point of view. But basically the Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine, even without really knowing what that doctrine is. For example, would you be able to specify to me what you would consider an authoritative presentation of the Bayesian view that we are so often approached to accept?

It is not clear to me which doctrine Charles is talking about, but I would argue that the frequentist framework also involves lots of doctrines (also known as assumptions). It is true, though, that the frequentist doctrines are, in general, much less controversial if you come from certain backgrounds. For example, it is easier to assume that the “true” parameter exists than to admit that there is no such thing.

DeGroot: Well, I’m not being interviewed. [Laughs] I could put in a plug for my book on Bayesian decision theory that gives an axiomatic system for probability and utility theory which together imply the entire decision-making process. I mean, normatively anyway.

Stein: Yes, but of course that is the thing. One is asked to accept something normatively before one knows what that thing really is, rather than the attitude that we have toward other systems where we set out axioms or definitions and use them for the purpose of developing a system, and then if the system turns out to be interesting we pursue this. But we never ask whether those axioms are true or not; rather, we ask if we can find instances in which this axiomatic development is useful. If so, we accept it. In particular, we try to judge the consequences. Whereas, as you know, there are grave difficulties in trying to apply the Bayesian notions to interesting problems because of the difficulty of choosing a prior distribution. There is one point of view specified by Jeffreys who seems to be saying that there is a prescription, which he did not invent but which he seems to endorse, for choosing a (usually improper) prior distribution, and that simply does not work in general. The alternative is that the choice of a prior distribution must be made subjectively, and that I find completely inappropriate. Well, what can one say? Of course, statistical decision theory gives us, within a certain class of problems, an indication of how prior distributions do enter statistics from another point of view. And so in some ways the difference between Wald’s decision-theoretic approach and the Bayesian approach is small.

From this, we can clearly see that Charles Stein held the classic critical frequentist view of the Bayesian approach. However, we have to consider that the fundamentals of Bayesian theory only started to mature in the 90s, and this interview took place well before that. Plus, the Bayesian framework has by now established lots of connections with the frequentist framework (whether that is the appropriate attitude is, interestingly, quite a controversial subject). Furthermore, using subjective priors has been shown to be sensible as long as the model is evaluated objectively and extensively (Gelman et al., 2020).

I find the point about prior selection, however, to still be a valid criticism. Although some people treat prior selection as a “solved problem,” in practice it is still a very difficult subject that needs lots of work on a problem-by-problem basis. Fortunately, many are working on principled procedures for eliciting subjective informative priors (Mikkola et al., 2021) and on designing priors with good frequentist properties (Consonni et al., 2018). But we should mind that frequentist methods equally involve “manual work” to establish frequentist guarantees on a method-by-method basis. Therefore, on a practical note, neither is less difficult than the other.

An interesting bit is the comment on what we now call Jeffreys priors. They are known to occasionally mess up model comparison with Bayes factors, which I believe is what Charles Stein is discussing here. (Please let me know if you think he is talking about a different aspect of Jeffreys priors.)

DeGroot: Because Wald used priors as a technical device for determining optimal procedures.

Stein: Yes, and therefore we are considering the same procedures. Roughly speaking, the basic theorems of decision theory say that in some sense good procedures and Bayes procedures are very much the same thing. This is, of course, a gross oversimplification, but it does enable us to understand how prior distributions come in.

Interestingly, it seems that Charles Stein is sympathetic to (possibly subjective) Bayesian approaches when they come from a decision-theoretic perspective.

DeGroot: Let’s talk about probability for a moment. You say that the notion of subjective probability is unacceptable to you. What definition of probability do you use?

Stein: Essentially Kolmogorov’s. That it is a mathematical system.

DeGroot: Simply any set of numbers that satisfies the axioms of the calculus of probabilities.

Stein: Yes.

DeGroot: But what do these numbers represent in the real world?

Stein: Well, there is no unique interpretation. And of course I’m talking about Kolmogorov’s old interpretation of probability and not the complexity interpretation. In his book he mentions briefly two aspects of the interpretation. The first is the traditional relative frequency of occurrence in the long run. And the second is that when one puts forward a probabilistic model that is to be taken completely seriously for a real world phenomenon, then one is asserting in principle that any single specified event having very small probability will not occur. This, of course, combined with the law of large numbers, weak or strong, really is a broader interpretation than the frequency notion. So, in fact, the frequency interpretation in that sense is redundant. This doesn’t answer the question, “When I say the probability is 1/6 that this die will come up 6 on the next toss, what does that statement mean?” But then in no serious work in any science do we answer the question, “What does this statement mean?” It is an erroneous philosophical point of view that leads to this sort of question.

I find this comment from Charles Stein quite surprising. After all, science branched out of philosophy, and there is therefore no reason to shy away from philosophical questions and points of view. In fact, philosophical discussions are scattered throughout the history of science. I always thought that statistics was science more than mathematics (which is why some statistics departments prefer to call themselves departments of statistical science). From this perspective, perhaps Charles Stein considered himself to be more of a mathematician than a statistician.

DeGroot: But surely that means that there is a subjective element entering into the development of the models and the numerical probabilities.

Stein: But, you see, that’s something very different from saying that one is absolutely never permitted to consider a probabilistic model in which anything is unknown, and that is the strict interpretation of the Bayesian point of view. Some statisticians seem to try to accept this point of view as sort of a philosophical position but to deny it in practice, which is not reasonable.

This comment is interesting because, in practice, it is very rare to encounter a problem where absolutely no prior information is available. Oftentimes (or rather, all the time), we at least have an idea about the support or the extreme values of the data. And even this much information is pretty useful as far as prior information goes.

DeGroot: Do you think your political views influence the kind of problems you work on and your scientific philosophy at all, or are they separate?

Stein: I’d say they are largely separate. Of course, I don’t do military work, not even nominally. That is, I haven’t been on a military contract for 18 years. Actually, even before that it was distasteful but I allowed myself to be talked into it. But this is a hard question to answer. I would admit that my work is largely independent of my political attitudes. I don’t agree with Lindley that a subjective approach to probability is a consequence of being a Marxist or socialist.

Lindley would be surprised to know how much ground the “Marxists” have gained to this day. But seriously, it is interesting to know that Lindley thought this way.

Stein: Yes. I may want to modify some of my answers when I see the transcript of this conversation. Somehow I don’t seem to think along the same lines as other people, which is useful. It’s good that different people think differently.

DeGroot: Thank you, Charles.

## Conclusions

From this nice interview by DeGroot, I could see how Charles Stein, and probably many well-respected statisticians of that day, thought of Bayesian approaches. I would be interested to know whether Charles Stein remained active after this interview. Unfortunately, after a quick search, I couldn’t find a complete bibliography of Charles Stein’s works.

## References

1. Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Jan 1956
2. Stein’s Estimation Rule and Its Competitors–an Empirical Bayes Approach
Journal of the American Statistical Association, Mar 1973
3. A Bound for the Error in the Normal Approximation to the Distribution of a Sum of Dependent Random Variables
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, Jan 1972
4. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
In Proceedings of the 30th International Conference on Neural Information Processing Systems, Jan 2016
5. Measuring Sample Quality with Stein’s Method
In Advances in Neural Information Processing Systems, Jan 2015
6. A Conversation with Charles Stein
Statistical Science, Jan 1986
7. Bayesian Workflow
arXiv preprint, Nov 2020
8. Prior Knowledge Elicitation: The Past, Present, and Future
arXiv preprint, Dec 2021
9. Prior Distributions for Objective Bayesian Analysis
Bayesian Analysis, Jun 2018
# An Attempt to Make Gaussian Processes on Julia GPU-Compatible and Differentiable

*2022-04-26*

Currently, there isn’t a way to implement Gaussian processes in Julia such that they support both GPU acceleration and differentiation. To fill the void, I implemented a very minimal package (or a snippet, rather). The implementation can be found here.

## The State of Gaussian Processes in Julia

Currently, the Gaussian process ecosystem in Julia is somewhat fragmented. We have GaussianProcesses.jl, a standalone package that does just GPs; AbstractGPs.jl, which tries to combine multiple GP-related libraries under a standardized API; and AugmentedGaussianProcesses.jl, which provides some advanced GP algorithms on top of AbstractGPs.jl. Unfortunately, none of these libraries currently work on GPUs. This is way behind the norm in Python, where GPyTorch supports GPUs quite well.

Here is a summary of the current trends for implementing GPs in Julia.

• Use KernelFunctions.jl for crafting your covariance kernels and computing the Gram matrix.
• Use PDMats.jl for computing the Cholesky factorization, solving linear systems, computing quadratic forms, etc.
• Use AbstractGPs.jl for abstracting all of the GP manipulations. Frankly speaking, KernelFunctions.jl is the key here.

The main issue is that most GP libraries (including KernelFunctions.jl) rely on Distances.jl, a package for efficiently computing Gram matrices (or distance matrices). Although Distances.jl is heavily optimized, it is optimized a little too much: it is very difficult to make it compatible with CUDA.jl (an amazing package that is a very good reason to convert to Julia). This bottleneck has been the showstopper, since everybody pretty much relies on KernelFunctions.jl. There is some discussion about ditching Distances.jl in favor of Tullio.jl, but this also has the following downsides:

• It doesn’t support differentiation through complex multiline expressions; it only does symbolic differentiation.
• It’s not very efficient on GPUs, especially for gradients. So even if KernelFunctions.jl moves on to Tullio.jl, there is not much to expect at this point.

To summarize,

• GPU support for Gaussian processes on Julia is non-existent.
• Efficient GPU support is not to be expected in the short term.

## A Minimal CUDA-Compatible GP Implementation

### Overview

Regardless of the current state of GPU support, I urgently needed GPs to work on GPUs right now. The things we normally expect from GPU support for GPs are the following:

1. faster Gram matrix computation,
2. faster Cholesky decomposition,
3. faster backward/forward substitution, and
4. support differentiation with respect to the hyperparameters and the latent function values

Items 2 and 3 work (pretty much) out of the box in Julia. Items 1 and 4 are the tricky parts. So, I ended up spending a few days writing a few CUDA kernels using KernelAbstractions.jl.

The implementation can be found here. It supports the following two covariance kernels: $$\begin{align} k\left(\mathbf{x}, \mathbf{y}\right) &= \sigma^2 k_{\text{SE ARD}}\left(\mathbf{x}, \mathbf{y} \right) + \epsilon^2 \newline k\left(\mathbf{x}, \mathbf{y}\right) &= \sigma^2 k_{\text{Matern 5/2 ARD}}\left(\mathbf{x}, \mathbf{y} \right) + \epsilon^2 \end{align}$$ where SE ARD and Matern 5/2 ARD stand for the squared-exponential and Matern 5/2 kernels with automatic relevance determination (ARD), which are, arguably, the most widely used covariance kernels. We have $$D + 2$$ hyperparameters here: the $$D$$ ARD length scales, the noise variance $$\epsilon^2$$, and the function variance $$\sigma^2$$.
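For reference, here is a CPU-side NumPy sketch of the Gram matrix for the first (SE ARD) kernel; the function name is my own, and I put $$\epsilon^2$$ only on the diagonal (the usual Gaussian-noise interpretation), which is an assumption, not something taken from the package:

```python
import numpy as np

def gram_se_ard(X, lengthscales, sigma2, eps2):
    """Gram matrix of k(x, y) = sigma^2 * k_SE-ARD(x, y), with the noise
    variance eps^2 added on the diagonal.

    X: (N, D) inputs; lengthscales: (D,) ARD length scales.
    """
    Xs = X / lengthscales                       # rescale each input dimension
    sq = np.sum(Xs**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Xs @ Xs.T  # pairwise squared distances
    K = sigma2 * np.exp(-0.5 * np.clip(d2, 0.0, None))
    return K + eps2 * np.eye(len(X))

X = np.random.default_rng(3).normal(size=(50, 4))
K = gram_se_ard(X, np.ones(4), sigma2=1.0, eps2=0.1)
```

On the GPU, the same pairwise computation is what the custom CUDA kernels implement; the structure (rescale, pairwise distance, exponentiate) carries over directly.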

### Likelihood

The log marginal likelihood of a Gaussian process is $$\begin{equation} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) = -\frac{1}{2}\mathbf{y}^{\top} \mathbf{K}^{-1} \mathbf{y} - \frac{1}{2} \log \mathrm{det}\, \mathbf{K} - \frac{N}{2} \log 2 \pi. \end{equation}$$ This is implemented as
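The actual Julia implementation lives in the repository; as a language-agnostic sketch of the same computation (function name mine), the standard trick is to go through a Cholesky factorization, which yields both the quadratic form and the log-determinant:

```python
import numpy as np

def gp_log_marginal_likelihood(K, y):
    """log N(y | 0, K) via the Cholesky factor K = L L^T:
    -0.5 y^T K^{-1} y - sum(log diag(L)) - (N / 2) log(2 pi)."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # = 0.5 * log det K
            - 0.5 * len(y) * np.log(2.0 * np.pi))

# Sanity check against a direct (less stable) evaluation of the same density.
rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
K = A @ A.T + 5.0 * np.eye(5)
y = rng.normal(size=5)
direct = (-0.5 * y @ np.linalg.solve(K, y)
          - 0.5 * np.linalg.slogdet(K)[1]
          - 2.5 * np.log(2.0 * np.pi))
```

This is also why a fast Cholesky (item 2 above) matters: once $$\mathbf{L}$$ is available, both terms of the likelihood are cheap.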

You can use the squared-exponential kernel by swapping matern52_gpu for se_gpu and gram_matern52_derivative_gpu for gram_se_derivative_gpu. The other routines are self-contained in gpu_cuda_utils.jl.

For the gradients, the GPML book (Rasmussen & Williams, 2006) shows how to differentiate the log likelihood. For the record, using the identity $$\nabla_{\theta} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) = \frac{1}{2} \mathbf{y}^{\top} \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta} \mathbf{K}^{-1} \mathbf{y} - \frac{1}{2} \mathrm{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta} \right),$$ the gradients for the latent values and kernel hyperparameters are \begin{align} \nabla_{\mathbf{y}} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= -\mathbf{K}^{-1} \mathbf{y} \\ \nabla_{\epsilon^2} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= \frac{1}{2} \mathbf{y}^{\top} \, \mathbf{K}^{-1} \mathbf{K}^{-1} \, \mathbf{y} - \frac{1}{2} \mathrm{tr}\left( \mathbf{K}^{-1} \right) \\ \nabla_{\sigma^2} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= \frac{1}{2} \mathbf{y}^{\top} \, \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \sigma^2} \mathbf{K}^{-1} \, \mathbf{y} - \frac{1}{2} \mathrm{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \sigma^2} \right) \\ \nabla_{\ell^2} \log p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{\theta} \right) &= \frac{1}{2} \mathbf{y}^{\top} \, \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \ell^2} \mathbf{K}^{-1} \, \mathbf{y} - \frac{1}{2} \mathrm{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \ell^2} \right), \end{align} where, clearly, there are lots of opportunities for reuse. Therefore, writing our own gradients should be far more efficient for GPs, both in terms of time and memory.
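When hand-deriving gradients like these, a finite-difference check is cheap insurance. Here is a Python sketch for the noise-variance case, where $$\partial \mathbf{K} / \partial \epsilon^2 = \mathbf{I}$$ makes the identity from GPML particularly simple (matrices and step size are illustrative):

```python
import numpy as np

def log_ml(K, y):
    """log N(y | 0, K), evaluated directly."""
    return (-0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * np.linalg.slogdet(K)[1]
            - 0.5 * len(y) * np.log(2.0 * np.pi))

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 6))
K0 = A @ A.T + 3.0 * np.eye(6)  # noise-free part of the kernel matrix
y = rng.normal(size=6)
eps2 = 0.5

# Analytic gradient w.r.t. the noise variance: dK/d(eps^2) = I, hence
# d log p / d eps^2 = 0.5 * y^T K^{-2} y - 0.5 * tr(K^{-1}).
K = K0 + eps2 * np.eye(6)
Kinv_y = np.linalg.solve(K, y)
grad = 0.5 * Kinv_y @ Kinv_y - 0.5 * np.trace(np.linalg.inv(K))

# Central finite difference for comparison.
h = 1e-5
fd = (log_ml(K0 + (eps2 + h) * np.eye(6), y)
      - log_ml(K0 + (eps2 - h) * np.eye(6), y)) / (2.0 * h)
```

Note how the solve against $$\mathbf{y}$$ is computed once and reused in the quadratic term, exactly the kind of reuse that makes handwritten gradients cheaper than generic autodiff here.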

You can compute the gradients using Zygote as

Note that the gradients with respect to X_dev are not implemented, but shouldn’t be too hard to do.

## Demonstration

I will now compare the GPU implementation against AbstractGPs.jl. I will use 32-bit floating-point numbers, since most GPUs perform very poorly with 64 bits. And since I am using my poor little GTX 1050 GPU, the numbers should be much better on a proper workstation with a beefier GPU. To get proper performance measurements, I turned off frequency scaling and paused YouTube. (Imagine how bored I was during the experiments.)

### Numerical Accuracy

In terms of numerical accuracy, the GPU version is close to the result of AbstractGPs.jl at the 1e-4 tolerance level:

### Computational Performance

In terms of performance, here is an execution-time comparison:

The error bars are the 80% empirical quantiles, and $$N$$ is the number of datapoints. We can see that the GPU quickly becomes more efficient for $$N>100$$. In general, it is about 10 times faster, which is pretty good for a simple implementation without any GPU-specific optimization (not even shared memory!). Since the GTX 1050 is supposed to achieve 1 TFLOPS and most modern CPUs achieve around 200 GFLOPS, this is close to the most we can get.

### Realistic Example

The main.jl file in the repository contains a realistic example with predictions. I performed MAP-II hyperparameter optimization using Optim.jl on the Boston housing dataset. Here are the results:

┌ Info: MAP-II Hyperparameter Optimization Result
│   likelihood_before = -544.3303199616416
│   likelihood_after = -116.86849745187607
│   rmse_before = 0.60338885f0
│   rmse_after = 0.3102568f0
│   lpd_before = -0.8926057396811591
└   lpd_after = -0.16185267732364805


Here, before refers to the initial hyperparameters used without optimization, and after to the result of MAP-II. We can see that everything is working in order.

### Cholesky Fail

When the Cholesky factorization fails, the current implementation does not throw. Instead, it spits out -Inf for the likelihood and CUDA.zeros arrays for the gradients.
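A CPU-side Python sketch of the same convention (function name mine): by returning -inf instead of raising, a line-search optimizer simply rejects the offending step rather than crashing the whole optimization.

```python
import numpy as np

def safe_gp_log_ml(K, y):
    """GP log marginal likelihood, or -inf when the Cholesky factorization
    fails (e.g., K is not positive definite), instead of raising."""
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return -np.inf  # the caller sees a rejected step, not an exception
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))

ok = safe_gp_log_ml(np.eye(3), np.zeros(3))    # valid covariance: finite value
bad = safe_gp_log_ml(-np.eye(3), np.zeros(3))  # not positive definite: -inf
```

Pairing the -inf with zero gradients, as the implementation does, keeps downstream gradient-based optimizers from propagating NaNs.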

### Fixing Zygote for Differentiating Cholesky with CUDA

Update: this has been fixed by sethaxen. See also the issues at ChainRules.jl, Zygote.jl

While doing this, I ran into a bug that prevents the Cholesky factorization from being differentiated by Zygote, which I reported. A quick fix is to use the following snippet:

Update: this has been fixed by myself
The weird part of my solution here is the two calls to triu, which create a normal Matrix that happens to be upper triangular, in contrast to the UpperTriangular wrapper. This is necessary because, currently, multiplying two UpperTriangular matrices on the GPU is extremely slow. Running the profiler seems to show that there is a weird device-memory copy somewhere that takes forever, but I didn’t pursue the matter further.

## References

1. Gaussian Processes for Machine Learning
The MIT Press, 2006