Hi Libin,

My name is Wayne Zhang; I just graduated from the MS Data Science program at UCSD, and we have chatted before. My UCSD email was yuz226 at ucsd, but I have now graduated from my MS… unfortunately.

A statement came to mind while I was discussing logistic regression with a friend, but I don’t know how to reason about it:

“there are infinitely many linearly separable manifolds in the embedding space we make for natural language and therefore we gotta go beyond cross entropy.”

What prompted this was the observation that if we give logistic regression linearly separable data, training never converges: the loss keeps decreasing toward zero, but the weight norm grows without bound (there is a small sketch after the questions below). It seems a causal language model trained with cross entropy could run into something analogous, which leads me to two questions:

  • Can we extrapolate this reasoning to higher-dimensional settings, the way I did with natural-language embedding data?
  • Which part(s) of the model optimization process should we think about? (i.e., is it a property of, or a comparison between, loss functions, or is it a matter of training objectives?)
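
To make the “never converges” point concrete, here is a tiny numpy sketch I put together. It is just a toy of my own (two Gaussian blobs standing in for embeddings, which is an assumption on my part), but it shows what I mean: the logistic loss keeps falling toward zero while the weight norm keeps growing, so the parameters themselves never settle.

import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs -> linearly separable labels in {-1, +1}.
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),
               rng.normal(-2.0, 0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2)
lr = 0.1

for step in range(1, 20001):
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins)))             # logistic / cross-entropy loss
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad
    if step % 5000 == 0:
        print(f"step {step:6d}   loss {loss:.6f}   ||w|| = {np.linalg.norm(w):.2f}")

The printed loss keeps shrinking while ||w|| keeps climbing, which is what I meant by “training never converges”: the direction stabilizes, but the magnitude does not.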

I apologize in advance if any of this was covered in your Math of Deep Learning course last spring; I have rather few notes left from then.

If you happen to have any thoughts and are willing to discuss, I look forward to hearing from you!

Hi Wayne,

Hope everything is going well with you! I’m glad that you reached out!

I didn’t quite follow your questions. Could you elaborate a bit more? What do you mean by “linearly separable manifolds” and by going “beyond” cross entropy?

I guess I can say something about the choice of loss function. In general, different loss functions lead to different implicit biases. For example, for a linear model with squared loss, GD converges to the minimum-L2-norm solution, but with cross-entropy loss it converges (in direction) to the maximum-margin solution: https://jmlr.org/papers/volume19/18-188/18-188.pdf
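
If a concrete picture helps, here is a small numpy/scikit-learn sketch of both biases. It is only my own toy example (the data, step sizes, and iteration counts are arbitrary choices of mine, not anything from the paper): in Part A, GD on squared loss started from zero ends up at the minimum-norm interpolant; in Part B, the direction of GD on the logistic loss slowly lines up with the hard-margin SVM direction while the weight norm keeps growing.

import numpy as np
from sklearn.svm import SVC   # used only as a reference max-margin solver

rng = np.random.default_rng(0)

# Part A: underdetermined least squares, GD from a zero initialization.
n, d = 20, 50                          # fewer samples than features
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w = np.zeros(d)
for _ in range(200_000):
    w -= 1e-3 * A.T @ (A @ w - b)      # plain GD on 0.5 * ||A w - b||^2

w_min_norm = np.linalg.pinv(A) @ b     # the minimum-L2-norm interpolant
print("Part A: ||w_GD - w_min_norm|| =", np.linalg.norm(w - w_min_norm))

# Part B: logistic (cross-entropy) loss on separable data.
# The blobs are symmetric about the origin, so omitting the bias term is harmless.
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),
               rng.normal(-2.0, 0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

v = np.zeros(2)
for _ in range(200_000):
    m = y * (X @ v)
    v += 0.1 * (X * (y / (1.0 + np.exp(m)))[:, None]).mean(axis=0)

svm_dir = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
cos = v @ svm_dir / (np.linalg.norm(v) * np.linalg.norm(svm_dir))
print("Part B: ||v|| =", np.linalg.norm(v), "  cosine(GD, max-margin) =", cos)

Part A should print something close to zero, and Part B a cosine close to 1. The alignment in Part B is slow (only logarithmic in the iteration count, as analyzed in the paper), which is also why the weight norm has to diverge for the loss to keep decreasing.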

Also, different training algorithms lead to different solutions: https://www.jmlr.org/papers/volume24/23-0836/23-0836.pdf
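
Here too, a quick hand-rolled comparison may help (again just my own toy, not the setting of the paper): the same logistic loss and the same separable data, optimized once with plain GD and once with Adam. The two normalized solutions generally do not coincide, which is the sense in which the algorithm itself, and not only the loss, shapes the answer.

import numpy as np

# Deliberately anisotropic, separable toy data (labels in {-1, +1}).
X = np.array([[4.0, 0.5], [3.5, 1.0], [-4.0, -0.5], [-3.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def grad(w):
    m = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(m)))[:, None]).mean(axis=0)

# Plain gradient descent on the logistic loss.
w_gd = np.zeros(2)
for _ in range(50_000):
    w_gd -= 0.1 * grad(w_gd)

# Adam on the same loss, hand-rolled to keep the sketch self-contained.
w_ad, m1, m2 = np.zeros(2), np.zeros(2), np.zeros(2)
b1, b2, eps = 0.9, 0.999, 1e-8
for t in range(1, 50_001):
    g = grad(w_ad)
    m1 = b1 * m1 + (1 - b1) * g
    m2 = b2 * m2 + (1 - b2) * g ** 2
    w_ad -= 1e-3 * (m1 / (1 - b1 ** t)) / (np.sqrt(m2 / (1 - b2 ** t)) + eps)

print("GD   direction:", w_gd / np.linalg.norm(w_gd))
print("Adam direction:", w_ad / np.linalg.norm(w_ad))
# The exact numbers depend on step sizes and iteration counts; the point is
# only that the two normalized directions need not coincide.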

Overall, how you train the model will determine the properties of the solutions.

Does this answer your questions? Please let me know if I misunderstood.