A collection of questions I revisit frequently to prepare for MLE interviews.

Probability and statistics

  1. Suppose you have two random variables A and B and compute two OLS linear regressions: one from A to B, and one from B to A.  What can you say about the product of the two slopes?
  2. Why do people describe X^TX as a covariance matrix?
  3. When do sample covariance matrices have negative eigenvalues?
  4. For a stochastic matrix A, why do the column vectors of A^n approach the steady state probabilities for large n?  (e.g. why does PageRank work?)
  5. Suppose the head-probability of a coin is uniformly distributed in [0,1].  Suppose further that you flip it n times and get k heads.  What’s your best guess for its head-probability?  How would you construct a 95% Bayesian credible interval for it?
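For the last question, a quick numerical sanity check: with a uniform Beta(1, 1) prior, the posterior after k heads in n flips is Beta(k+1, n−k+1), so the posterior mean is (k+1)/(n+2) (the MAP estimate is k/n). A minimal grid-approximation sketch in pure Python (function name and grid size are my own choices):

```python
def coin_posterior(n, k, grid=10_000):
    """Posterior over head-probability p with a uniform prior: Beta(k+1, n-k+1)."""
    # unnormalized posterior p^k (1-p)^(n-k), evaluated on a midpoint grid
    ps = [(i + 0.5) / grid for i in range(grid)]
    w = [p ** k * (1 - p) ** (n - k) for p in ps]
    total = sum(w)
    w = [x / total for x in w]
    mean = sum(p * wi for p, wi in zip(ps, w))  # posterior mean = (k+1)/(n+2)
    # central 95% credible interval from the posterior CDF
    cdf, lo, hi = 0.0, None, None
    for p, wi in zip(ps, w):
        cdf += wi
        if lo is None and cdf >= 0.025:
            lo = p
        if hi is None and cdf >= 0.975:
            hi = p
    return mean, (lo, hi)

mean, (lo, hi) = coin_posterior(n=10, k=7)
```

scipy.stats.beta would give the same interval directly via ppf(0.025) and ppf(0.975); the grid version just avoids the dependency and makes the Bayesian mechanics explicit.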

Computational Bayesian

  1. Justify why Metropolis-Hastings converges to the desired density.
    1. Monte Carlo means generating random draws from a target distribution: plotting the draws in order gives a trace plot, and the histogram of the draws approximates the posterior density.

    2. A Markov chain is a sequence of draws where each draw depends on its predecessor; its trace plot wanders around in a pattern we call a “random walk”.

    3. Metropolis-Hastings decides which proposed value of theta to accept.

      1. Evaluate the (unnormalized) posterior at the proposed theta and at the current theta, and take the ratio r = p(theta_new | data) / p(theta_current | data).

      2. If r > 1 (the proposed theta has higher posterior probability), always accept it. If r < 1, treat r as an acceptance probability: draw u from Uniform[0, 1] and accept the new theta if u < r; otherwise reject it and keep the current theta.

    4. Metropolis-Hastings has two major problems:

      1. Dependence on starting values: mitigate by discarding the burn-in period (the draws made before the chain stabilizes).
      2. Successive values of theta are autocorrelated because they come from a Markov process; thinning (keeping every k-th draw) reduces this.
    5. Metropolis-Hastings can sample from complex distributions by using a proposal distribution and an acceptance-rejection mechanism.

  2. When would you use Gibbs sampling over Metropolis-Hastings and vice versa?
  3. How can you tell when a univariate chain of MCMC has converged?
  4. How can you tell when a high-dim multivariate chain of MCMC has converged?
  5. Given a pandas dataframe of samples of p(x, y, z), how can you get the marginal distribution of p(x, y)?
  6. Variational inference is kind of wild – why would you try to approximate P(Z|X) with some Gaussian Q(Z)?  What if P(Z|X) doesn’t even look Gaussian?  Why would anyone do this?
  7. Derive the ELBO bound.
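The Metropolis-Hastings steps under question 1 can be sketched in a few lines: a toy random-walk sampler targeting an unnormalized standard normal (the target, step size, and burn-in length are illustrative choices, not canonical ones):

```python
import math
import random

def metropolis_hastings(log_target, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposal, accept w.p. min(1, r)."""
    rng = random.Random(seed)
    theta, samples = theta0, []
    for _ in range(n_samples):
        prop = theta + rng.gauss(0.0, step)           # propose a nearby theta
        log_r = log_target(prop) - log_target(theta)  # log of the posterior ratio r
        if log_r >= 0 or rng.random() < math.exp(log_r):  # the u < r rule, in logs
            theta = prop                              # accept; else keep current theta
        samples.append(theta)
    return samples

# toy unnormalized target: log p(theta) = -theta^2 / 2 (a standard normal)
draws = metropolis_hastings(lambda t: -0.5 * t * t, theta0=5.0, n_samples=20_000)
burned = draws[2_000:]  # discard burn-in to reduce dependence on theta0
```

Working in log-densities avoids numerical underflow; the test rng.random() < exp(log_r) is exactly the “draw u, accept if u < r” rule described above.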
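For question 5, marginalizing z out of a table of samples just means ignoring the z column; a sketch with a toy dataframe of discrete samples (the data is made up for illustration):

```python
import pandas as pd

# hypothetical samples from p(x, y, z); marginalizing over z = ignoring that column
df = pd.DataFrame({
    "x": [0, 0, 1, 1, 1, 0],
    "y": [0, 1, 0, 0, 1, 0],
    "z": [1, 0, 1, 0, 1, 1],
})

# empirical marginal p(x, y): count joint occurrences of (x, y) and normalize
p_xy = df.groupby(["x", "y"]).size() / len(df)
```

For continuous variables you would histogram or KDE df[["x", "y"]] instead of counting exact values, but the principle is the same: samples from the joint are already samples from the marginal once you drop the extra columns.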

ML / DL

  1. What happens to the angle between two randomly sampled vectors from the standard multivariate Gaussian in high dimensions?
  2. What is an inductive bias? Describe the inductive biases of two distinctive models.
  3. Sufficiently large MLPs are universal approximators. What does this mean, and why do we even bother using anything else if they can literally approximate anything?
  4. Why do people often use ReLU variants over tanh or softmax?
  5. What is the relationship between Adam, the Fisher Information Matrix, and the Hessian of the loss?
  6. What is the salient property of ResNets in parameter space?
  7. Why are transformer-based architectures so potent?
  8. Pros/cons of each popular DL framework (JAX, PyTorch, Keras)
  9. Eigenvalues of Fisher-information matrix
  10. When are they negative?
  11. For large neural networks, the FIM starts to approach rank one. What does this mean for optimization and generalization?
  12. Name a few narratives for why wildly overparameterized neural networks don’t immediately overfit to hell and back as would be predicted by statistical learning theory.
  13. Explain the lazy paradigm, how NTKs exploit this, and why we don’t just use NTK-kernelized SVMs instead of neural networks everywhere.
  14. Why does boosting use shitty classifiers like trees instead of something that works much better like a neural net?
  15. What is the point of a VAE compared to a regular AE? What does the Gaussian jiggling and the reparameterization trick give you?
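A quick numeric check for question 1: the cosine of the angle between two random Gaussian vectors concentrates around 0 at rate roughly 1/√d, so in high dimensions the vectors are nearly orthogonal. A pure-Python sketch (the dimension and sample count are arbitrary):

```python
import math
import random

def angle_deg(d, rng):
    """Angle in degrees between two i.i.d. standard Gaussian vectors in R^d."""
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (norm_u * norm_v)))

rng = random.Random(0)
angles = [angle_deg(1_000, rng) for _ in range(100)]
# cos(angle) is approximately N(0, 1/d), so angles concentrate near 90 degrees
```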

Energy Based Machines

  1. Why do EBM practitioners use the log-likelihood instead of just likelihoods?
  2. Derive the breakdown of the gradient of log-likelihood into the positive phase and the negative phase.
  3. Look at “contrastive-divergence” but watch out for overloading – it’s a term that applies both to this breakdown and the special case of this breakdown for Restricted Boltzmann Machines (RBMs).
  4. What are the failure modes of contrastive divergence? In theory and in practice?
  5. What are the motivations of score matching?
  6. How do practitioners sample from EBMs?
  7. If we want to generate samples x ~ p(x), what are the advantages and disadvantages of EBMs, normalizing flows, diffusion models, GANs, and VAEs? When would one want to use an EBM?
  8. What is the importance of persistence in EBM training? What assumptions/theoretical results does this break and why do we do it?
  9. Besides score matching and CD, what other methods are there for training EBMs?
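One common answer to question 6 is (stochastic-gradient) Langevin dynamics: repeatedly step downhill on the energy and inject Gaussian noise. A toy sketch with E(x) = x²/2, whose Boltzmann distribution exp(−E) is a standard normal; the step size and chain length here are illustrative, not tuned:

```python
import math
import random

def langevin_sample(grad_energy, x0, n_steps, eta=0.05, seed=0):
    """Unadjusted Langevin dynamics: x <- x - (eta/2) * dE/dx + sqrt(eta) * noise."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        x = x - 0.5 * eta * grad_energy(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x

# toy energy E(x) = x^2 / 2, so p(x) ∝ exp(-E(x)) is a standard normal
samples = [
    langevin_sample(lambda x: x, x0=5.0, n_steps=2_000, eta=0.05, seed=s)
    for s in range(200)
]
```

In practice the gradient comes from autograd through the energy network, and practitioners often keep a persistent buffer of chains (connecting to question 8) rather than restarting each chain from scratch.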

Other Common ML Questions

  1. What is the random in random forest?
  2. What are the differences between gradient boosting and XGBoost?
    1. XGBoost adds L1 and/or L2 regularization on the leaf weights to the objective function.
    2. parallel processing… wait, iirc tree boosting is sequential, so how do we parallelize that?
      1. https://zhanpengfang.github.io/418home.html explained the approach:
      2. During training, the gradients (residuals) of the current model determine the leaf values of the new tree; the predictions after an iteration are the previous model’s predictions plus the new tree’s contributions.
      3. So the parallelization goes into the tree-building process… why there?
        1. While boosting iterations are sequential, constructing an individual tree can be parallelized: split decisions for different nodes at the same depth are independent of one another.
        2. Evaluating candidate split points can also be done in parallel across features.
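The per-feature independence described above can be sketched directly: each feature’s candidate thresholds are scanned separately, so the scans can run concurrently. A toy version using an XGBoost-style structure score with no regularization (the data, gradients, and thread pool are illustrative; real implementations parallelize histogram construction with OpenMP):

```python
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(j, X, g):
    """Best threshold for feature j by an XGBoost-style structure score (no reg.)."""
    thresholds = sorted({row[j] for row in X})[:-1]  # splitting at the max is useless
    G, n = sum(g), len(g)
    best_gain, best_t = 0.0, None
    for t in thresholds:
        left = [gi for row, gi in zip(X, g) if row[j] <= t]
        GL, nL = sum(left), len(left)
        GR, nR = G - GL, n - nL
        gain = GL * GL / nL + GR * GR / nR - G * G / n  # improvement over no split
        if gain > best_gain:
            best_gain, best_t = gain, t
    return j, best_gain, best_t

# hypothetical data: rows of two features, g = per-row gradients (residuals)
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
g = [-1.0, -0.5, 0.8, 1.5]

# the per-feature scans are independent, so they can run concurrently
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda j: best_split_for_feature(j, X, g), range(2)))
feature, gain, thresh = max(results, key=lambda r: r[1])
```

Note that with CPython’s GIL this thread pool won’t actually speed up pure-Python loops; it only illustrates that the scans are independent units of work.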