Part of my efforts at a startup doing retrieval and reranking work.
Unlike a bi-encoder, which encodes two sentences separately and leaves it to a separate step to compute a similarity score between the output embeddings, a cross-encoder is a single-model architecture that processes one input: the concatenation of (typically) a query and a document. This means a cross-encoder has a single output, and that output is a scalar relevance score.
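A minimal sketch of the difference, assuming sentence-transformers and the BGE checkpoints mentioned below; the exact model names and similarity function are just illustrative choices, not a prescribed setup:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "what is a reranker?"
doc = "A reranker scores query-document pairs for relevance."

# Bi-encoder: encode query and document independently,
# then compute similarity between the two embeddings outside the model.
bi_encoder = SentenceTransformer("BAAI/bge-m3")
q_emb = bi_encoder.encode(query)
d_emb = bi_encoder.encode(doc)
bi_score = q_emb @ d_emb  # dot-product similarity, computed after the fact

# Cross-encoder: one forward pass over the concatenated (query, doc) pair,
# one scalar score out.
cross_encoder = CrossEncoder("BAAI/bge-reranker-v2-m3")
cross_score = cross_encoder.predict([(query, doc)])[0]

print(bi_score, cross_score)
```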
Pair-wise cross-encoders are typically trained starting from embedding models; see how BGE-m3 became BGE-reranker-v2-m3, and jina-
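A rough sketch of what "trained from an embedding model" looks like at initialization time, assuming Hugging Face transformers. The actual BGE/jina recipes add a lot on top of this (hard-negative mining, distillation, etc.); this only shows the starting point of reusing the embedding model's backbone under a fresh scoring head:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the embedding model's backbone with a new 1-logit classification head.
# The head is randomly initialized, so the scores are meaningless until fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-m3", num_labels=1)

# The tokenizer concatenates the pair into a single input sequence.
inputs = tokenizer("what is a reranker?",
                   "A reranker scores query-document pairs for relevance.",
                   return_tensors="pt", truncation=True)
score = model(**inputs).logits.squeeze()  # one scalar per (query, doc) pair
```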
Making rerankers multimodal
TBD!! Some rudimentary thoughts:
- repurpose SigLIP,
- retrain embedding models,
Also need to consider
This actually excites me, but first I need to understand image-text models at a deeper level than "oh, we train a joint embedding space by pushing a similarity matrix somewhere in the network toward the diagonal."
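For my own reference, a toy sketch of that "diagonal matrix" objective, SigLIP-style (a pairwise sigmoid loss rather than CLIP's softmax contrastive loss). The random embeddings, batch size, and the omitted learnable temperature/bias are all placeholder assumptions:

```python
import torch
import torch.nn.functional as F

# Stand-ins for image/text encoder outputs for a batch of matched pairs.
batch, dim = 4, 8
img_emb = F.normalize(torch.randn(batch, dim), dim=-1)
txt_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# (batch, batch) similarity matrix: row i vs. every text in the batch.
logits = img_emb @ txt_emb.T

# +1 on the diagonal (matched image-text pairs), -1 everywhere else.
labels = torch.eye(batch) * 2 - 1

# SigLIP treats each cell as an independent binary decision:
# push diagonal similarities up, off-diagonal similarities down.
loss = -F.logsigmoid(labels * logits).mean()
print(loss)
```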