A collection of questions I revisit frequently to prepare for MLE interviews.

Probability and statistics

  1. Suppose you have two random variables A and B and compute two OLS linear regressions: one from A to B, and one from B to A.  What can you say about the product of the two slopes?
  2. Why do people describe X^TX as a covariance matrix?
  3. When do sample covariance matrices have negative eigenvalues?
  4. For a stochastic matrix A, why do the column vectors of A^n approach the steady state probabilities for large n?  (e.g. why does PageRank work?)
  5. Suppose the head-probability of a coin is uniformly distributed in [0,1].  Suppose further that you flip it n times and get k heads.  What’s your best guess for its head-probability?  How would you construct a 95% Bayesian credible interval for it?
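For the last question, a quick numerical sanity check: with a uniform Beta(1, 1) prior, the posterior after k heads in n flips is Beta(k+1, n−k+1), so the posterior mean is (k+1)/(n+2) (the MAP estimate is k/n). A minimal grid-approximation sketch in pure Python (function name and grid size are my own choices):

```python
def coin_posterior(n, k, grid=10_000):
    """Posterior over head-probability p with a uniform prior: Beta(k+1, n-k+1)."""
    # unnormalized posterior p^k (1-p)^(n-k), evaluated on a midpoint grid
    ps = [(i + 0.5) / grid for i in range(grid)]
    w = [p ** k * (1 - p) ** (n - k) for p in ps]
    total = sum(w)
    w = [x / total for x in w]
    mean = sum(p * wi for p, wi in zip(ps, w))  # posterior mean = (k+1)/(n+2)
    # central 95% credible interval from the posterior CDF
    cdf, lo, hi = 0.0, None, None
    for p, wi in zip(ps, w):
        cdf += wi
        if lo is None and cdf >= 0.025:
            lo = p
        if hi is None and cdf >= 0.975:
            hi = p
    return mean, (lo, hi)

mean, (lo, hi) = coin_posterior(n=10, k=7)
```

scipy.stats.beta would give the same interval directly via ppf(0.025) and ppf(0.975); the grid version just avoids the dependency and makes the Bayesian mechanics explicit.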

Computational Bayesian

  1. Justify why Metropolis-Hastings converges to the desired density.
    1. Monte Carlo means generating random draws from a target distribution: plotting the draws in order gives a trace plot, and the histogram of the draws approximates the posterior density.

    2. A Markov chain is a sequence of draws where each draw depends on its predecessor; its trace plot wanders around in a pattern we call a “random walk”.

    3. Metropolis-Hastings decides which proposed value of theta to accept.

      1. Evaluate the (unnormalized) posterior at the proposed theta and at the current theta, and take the ratio r = p(theta_new | data) / p(theta_current | data).

      2. If r > 1 (the proposed theta has higher posterior probability), always accept it. If r < 1, treat r as an acceptance probability: draw u from Uniform[0, 1] and accept the new theta if u < r; otherwise reject it and keep the current theta.

    4. Metropolis-Hastings has two major problems:

      1. Dependence on starting values: mitigate by discarding the burn-in period (the draws made before the chain stabilizes).
      2. Successive values of theta are autocorrelated because they come from a Markov process; thinning (keeping every k-th draw) reduces this.
    5. Metropolis-Hastings can sample from complex distributions by using a proposal distribution and an acceptance-rejection mechanism.

  2. When would you use Gibbs sampling over Metropolis-Hastings and vice versa?
  3. How can you tell when a univariate chain of MCMC has converged?
  4. How can you tell when a high-dim multivariate chain of MCMC has converged?
  5. Given a pandas dataframe of samples of p(x, y, z), how can you get the marginal distribution of p(x, y)?
  6. Variational inference is kind of wild – why would you try to approximate P(Z|X) with some Gaussian Q(Z)?  What if P(Z|X) doesn’t even look Gaussian?  Why would anyone do this?
  7. Derive the ELBO bound.
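The Metropolis-Hastings steps under question 1 can be sketched in a few lines: a toy random-walk sampler targeting an unnormalized standard normal (the target, step size, and burn-in length are illustrative choices, not canonical ones):

```python
import math
import random

def metropolis_hastings(log_target, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposal, accept w.p. min(1, r)."""
    rng = random.Random(seed)
    theta, samples = theta0, []
    for _ in range(n_samples):
        prop = theta + rng.gauss(0.0, step)           # propose a nearby theta
        log_r = log_target(prop) - log_target(theta)  # log of the posterior ratio r
        if log_r >= 0 or rng.random() < math.exp(log_r):  # the u < r rule, in logs
            theta = prop                              # accept; else keep current theta
        samples.append(theta)
    return samples

# toy unnormalized target: log p(theta) = -theta^2 / 2 (a standard normal)
draws = metropolis_hastings(lambda t: -0.5 * t * t, theta0=5.0, n_samples=20_000)
burned = draws[2_000:]  # discard burn-in to reduce dependence on theta0
```

Working in log-densities avoids numerical underflow; the test rng.random() < exp(log_r) is exactly the “draw u, accept if u < r” rule described above.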
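For question 5, marginalizing z out of a table of samples just means ignoring the z column; a sketch with a toy dataframe of discrete samples (the data is made up for illustration):

```python
import pandas as pd

# hypothetical samples from p(x, y, z); marginalizing over z = ignoring that column
df = pd.DataFrame({
    "x": [0, 0, 1, 1, 1, 0],
    "y": [0, 1, 0, 0, 1, 0],
    "z": [1, 0, 1, 0, 1, 1],
})

# empirical marginal p(x, y): count joint occurrences of (x, y) and normalize
p_xy = df.groupby(["x", "y"]).size() / len(df)
```

For continuous variables you would histogram or KDE df[["x", "y"]] instead of counting exact values, but the principle is the same: samples from the joint are already samples from the marginal once you drop the extra columns.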

ML / DL

  1. What happens to the angle between two randomly sampled vectors from the standard multivariate Gaussian in high dimensions?
  2. What is an inductive bias? Describe the inductive biases of two distinctive models.
  3. Sufficiently large MLPs are universal approximators. What does this mean, and why do we even bother using anything else if they can literally approximate anything?
  4. Why do people often use ReLU variants over tanh or softmax?
  5. What is the relationship between Adam, the Fisher Information Matrix, and the Hessian of the loss?
  6. What is the salient property of ResNets in parameter space?
  7. Why are transformer-based architectures so potent?
  8. Pros/cons of each popular DL framework (JAX, PyTorch, Keras)
  9. Eigenvalues of Fisher-information matrix
  10. When are they negative?
  11. For large neural networks, the FIM starts to approach rank one. What does this mean for optimization and generalization?
  12. Name a few narratives for why wildly overparameterized neural networks don’t immediately overfit to hell and back as would be predicted by statistical learning theory.
  13. Explain the lazy paradigm, how NTKs exploit this, and why we don’t just use NTK-kernelized SVMs instead of neural networks everywhere.
  14. Why does boosting use shitty classifiers like trees instead of something that works much better like a neural net?
  15. What is the point of a VAE compared to a regular AE? What does the Gaussian jiggling and the reparameterization trick give you?
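A quick numeric check for question 1: the cosine of the angle between two random Gaussian vectors concentrates around 0 at rate roughly 1/√d, so in high dimensions the vectors are nearly orthogonal. A pure-Python sketch (the dimension and sample count are arbitrary):

```python
import math
import random

def angle_deg(d, rng):
    """Angle in degrees between two i.i.d. standard Gaussian vectors in R^d."""
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (norm_u * norm_v)))

rng = random.Random(0)
angles = [angle_deg(1_000, rng) for _ in range(100)]
# cos(angle) is approximately N(0, 1/d), so angles concentrate near 90 degrees
```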

Energy Based Machines

  1. Why do EBM practitioners use the log-likelihood instead of just likelihoods?
  2. Derive the breakdown of the gradient of log-likelihood into the positive phase and the negative phase.
  3. Look at “contrastive-divergence” but watch out for overloading – it’s a term that applies both to this breakdown and the special case of this breakdown for Restricted Boltzmann Machines (RBMs).
  4. What are the failure modes of contrastive divergence? In theory and in practice?
  5. What are the motivations of score matching?
  6. How do practitioners sample from EBMs?
  7. If we want to generate samples x ~ p(x), what are the advantages and disadvantages of EBMs, normalizing flows, diffusion models, GANs, and VAEs? When would one want to use an EBM?
  8. What is the importance of persistence in EBM training? What assumptions/theoretical results does this break and why do we do it?
  9. Besides score matching and CD, what other methods are there for training EBMs?
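One common answer to question 6 is (stochastic-gradient) Langevin dynamics: repeatedly step downhill on the energy and inject Gaussian noise. A toy sketch with E(x) = x²/2, whose Boltzmann distribution exp(−E) is a standard normal; the step size and chain length here are illustrative, not tuned:

```python
import math
import random

def langevin_sample(grad_energy, x0, n_steps, eta=0.05, seed=0):
    """Unadjusted Langevin dynamics: x <- x - (eta/2) * dE/dx + sqrt(eta) * noise."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        x = x - 0.5 * eta * grad_energy(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x

# toy energy E(x) = x^2 / 2, so p(x) ∝ exp(-E(x)) is a standard normal
samples = [
    langevin_sample(lambda x: x, x0=5.0, n_steps=2_000, eta=0.05, seed=s)
    for s in range(200)
]
```

In practice the gradient comes from autograd through the energy network, and practitioners often keep a persistent buffer of chains (connecting to question 8) rather than restarting each chain from scratch.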

Other Common ML Questions

  1. What is the random in random forest?
  2. What are the differences between gradient boosting and XGBoost?
    1. XGBoost adds L1 and/or L2 regularization on the leaf weights to the objective function.
    2. parallel processing… wait, iirc tree boosting is sequential, so how do we parallelize that?
      1. https://zhanpengfang.github.io/418home.html explained the approach:
      2. During training, the gradients (residuals) of the current model determine the leaf values of the new tree; the predictions after an iteration are the previous model’s predictions plus the new tree’s contributions.
      3. So the parallelization goes into the tree-building process… why there?
        1. While boosting iterations are sequential, constructing an individual tree can be parallelized: split decisions for different nodes at the same depth are independent of one another.
        2. Evaluating candidate split points can also be done in parallel across features.
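The per-feature independence described above can be sketched directly: each feature’s candidate thresholds are scanned separately, so the scans can run concurrently. A toy version using an XGBoost-style structure score with no regularization (the data, gradients, and thread pool are illustrative; real implementations parallelize histogram construction with OpenMP):

```python
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(j, X, g):
    """Best threshold for feature j by an XGBoost-style structure score (no reg.)."""
    thresholds = sorted({row[j] for row in X})[:-1]  # splitting at the max is useless
    G, n = sum(g), len(g)
    best_gain, best_t = 0.0, None
    for t in thresholds:
        left = [gi for row, gi in zip(X, g) if row[j] <= t]
        GL, nL = sum(left), len(left)
        GR, nR = G - GL, n - nL
        gain = GL * GL / nL + GR * GR / nR - G * G / n  # improvement over no split
        if gain > best_gain:
            best_gain, best_t = gain, t
    return j, best_gain, best_t

# hypothetical data: rows of two features, g = per-row gradients (residuals)
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
g = [-1.0, -0.5, 0.8, 1.5]

# the per-feature scans are independent, so they can run concurrently
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda j: best_split_for_feature(j, X, g), range(2)))
feature, gain, thresh = max(results, key=lambda r: r[1])
```

Note that with CPython’s GIL this thread pool won’t actually speed up pure-Python loops; it only illustrates that the scans are independent units of work.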