From Nils Reimers' “Training SOTA Text Embedding & Neural Search Models”

Can we just average BERT? No.

Base BERT out of the box performs poorly as a sentence encoder. We can, however, fine-tune BERT, either with unsupervised methods or with NLI data. The dataset-size scaling law applies here: performance improves as the amount of training data grows.
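
As a minimal sketch of what “averaging BERT” means in practice, the snippet below mean-pools the token embeddings of an off-the-shelf model (the `bert-base-uncased` checkpoint and the Hugging Face `transformers` API are assumptions, not from the talk). Without fine-tuning, the resulting sentence vectors perform poorly on similarity tasks.

```python
# Sketch: mean-pooling BERT token embeddings into a sentence vector.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**batch).last_hidden_state      # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()          # ignore padding tokens
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

emb = mean_pool(["A man is playing guitar.", "Someone plays an instrument."])
print(emb.shape)  # torch.Size([2, 768])
```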

Global And Local Structure of Vector Space

  • Global structure: relation of two random sentences.
  • Local structure: relation of two similar sentences.
  • The loss function must optimize both the local and the global structure.
  • Contrastive or triplet loss might only optimize the local structure (the triplet-loss formula below makes this concrete).
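
To see why, recall the standard triplet loss (a textbook formula, not taken from the talk): for an anchor $a$, positive $p$, negative $n$, encoder $f$, and margin $\varepsilon$,

$$\mathcal{L}_{\text{triplet}} = \max\bigl(0,\ \|f(a)-f(p)\| - \|f(a)-f(n)\| + \varepsilon\bigr).$$

The loss is already zero once the positive is closer than the negative by the margin, so it constrains the neighborhood of each anchor (local structure) but says nothing about how two unrelated sentences are placed relative to each other (global structure).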

Multiple Negatives Ranking Loss

A loss function that works well in practice is the Multiple Negatives Ranking Loss (MNRL).

  • Given positive pairs $(a_i, p_i)$, the two texts should be close in vector space.
  • Any other pair $(a_i, p_j)$ with $j \neq i$ acts as a negative pair and should be distant in vector space. Intuitively, it is rather unlikely that two randomly selected questions are that similar.
  • We then compute the ranking loss with cross-entropy:
    • Given $a_i$, what is the right answer out of $p_1, \dots, p_n$?
    • Compute the similarity scores to obtain $\mathrm{sim}(a_i, p_1), \dots, \mathrm{sim}(a_i, p_n)$,
    • Compute the cross-entropy loss with gold label $i$.
  • Mathematically, MNRL can be written as
    $$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp(\mathrm{sim}(a_i, p_i))}{\sum_{j=1}^{n} \exp(\mathrm{sim}(a_i, p_j))},$$
    where $\mathrm{sim}(\cdot, \cdot)$ is the similarity function for two vectors.
  • MNRL uses either cosine similarity or dot-product similarity for $\mathrm{sim}$; a minimal implementation sketch follows this list.
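
Below is a minimal PyTorch sketch of MNRL. The function name and batch construction are illustrative assumptions; the embeddings are assumed to be L2-normalized so that the dot product equals cosine similarity, and `scale` is a temperature factor (20 is the default in sentence-transformers' `MultipleNegativesRankingLoss`).

```python
# Sketch of the Multiple Negatives Ranking Loss with in-batch negatives.
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    # (n, n) similarity matrix: entry [i, j] = sim(a_i, p_j)
    scores = anchors @ positives.T * scale
    # Gold label for row i is column i: p_i is the true positive for a_i,
    # every other p_j in the batch acts as an in-batch negative.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Usage with random, L2-normalized embeddings:
a = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(multiple_negatives_ranking_loss(a, p))
```

Note that every positive $p_j$ from the other pairs in the batch serves as a negative for $a_i$, so larger batches provide more negatives at no extra labeling cost.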

Hard Negatives