Part of my efforts at a startup doing retrieval and reranking work.

Contrary to a bi-encoder, which processes two sentences separately and leaves the similarity score to be computed afterwards from the two output embeddings (e.g. a dot product or cosine), a cross-encoder is a single-model architecture that processes one input: the concatenation of (typically) query and document. This means a cross-encoder has a single output, a scalar relevance score.
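A minimal sketch of the two scoring paths using sentence-transformers; the model names are just example checkpoints, and any bi-encoder / cross-encoder pair would do:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "how do cross-encoders score documents?"
docs = [
    "A cross-encoder feeds query and document through one model jointly.",
    "Bananas are rich in potassium.",
]

# Bi-encoder: encode query and documents separately,
# then compute similarity outside the model.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb = bi_encoder.encode(query, normalize_embeddings=True)
d_emb = bi_encoder.encode(docs, normalize_embeddings=True)
bi_scores = d_emb @ q_emb  # cosine similarity per document

# Cross-encoder: the model sees each concatenated (query, doc) pair
# and emits one scalar score per pair.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross_encoder.predict([(query, d) for d in docs])

print(bi_scores, cross_scores)
```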

Pair-wise cross-encoders are typically trained starting from embedding models; see BGE-m3 into bge-reranker-v2-m3, and jina-
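A rough sketch of what that fine-tuning setup can look like: take the embedding backbone, bolt on a freshly initialized scalar head, and train it on (query, document) pairs. The actual bge-reranker recipe differs in data, loss, and hyperparameters, so treat this as illustrative only:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

backbone = "BAAI/bge-m3"  # embedding model used as the starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(backbone)
# num_labels=1 gives a new scalar head on top of the embedding backbone
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=1)

query = "what is a cross-encoder?"
positive = "A cross-encoder scores a concatenated query-document pair."
negative = "The Eiffel Tower is in Paris."

# Tokenize (query, document) pairs jointly: this is the cross-encoder input.
batch = tokenizer([query, query], [positive, negative],
                  padding=True, truncation=True, return_tensors="pt")
scores = model(**batch).logits.squeeze(-1)  # one scalar per pair

# Contrastive-style training signal: the positive should outrank the negative.
loss = torch.nn.functional.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
loss.backward()
```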

Making rerankers multimodal

TBD!! Some rudimentary thoughts:

  • repurpose SigLIP (rough sketch after this list),
  • retrain embedding models,
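
A quick sketch of repurposing SigLIP as a text-image scorer with the transformers API; the checkpoint and image URL are just examples, and this is still bi-encoder-style scoring rather than a trained multimodal reranker:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_id = "google/siglip-base-patch16-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["two cats lying on a couch", "a diagram of a transformer"]

inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP applies a sigmoid over pairwise logits; these can serve as relevance scores.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)  # one score per (image, text) pair
```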

Also need to consider

This actually excites me, but first I need to understand image-text models at a deeper level than "oh, we train a joint embedding space by pushing a similarity matrix somewhere in the network toward (almost) diagonal".
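
For reference, that "almost diagonal matrix" remark is the CLIP-style contrastive objective: in a batch of matched (image, text) pairs, the i-th image should score highest against the i-th text, so the batch similarity matrix is pushed to peak on its diagonal. A toy sketch with made-up shapes and temperature:

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in image embeddings
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in text embeddings
temperature = 0.07

logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarity matrix
targets = torch.arange(batch)                  # correct pair sits on the diagonal

# Symmetric cross-entropy: rows classify the right text for each image,
# columns classify the right image for each text.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```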