It is reasonable to view contrastive learning as a stepping stone toward the “general” in general intelligence.


Intuition: Instead of training a fixed softmax classifier over a predefined set of labels, CLIP uses a contrastive learning objective to align image and text representations in a shared embedding space. Classification then reduces to comparing the similarity between an image embedding and the embedding of a textual description of each candidate class, and choosing the most similar description.
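
To make this concrete, here is a minimal sketch of similarity-based classification. It assumes the image and text embeddings have already been computed. The three-dimensional vectors and prompt strings below are made-up toy values for illustration, not real CLIP outputs, and the helper names `cosine_similarity` and `zero_shot_classify` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the best-matching description and the score for every description."""
    scores = {
        description: cosine_similarity(image_embedding, embedding)
        for description, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed values): in practice these would come from
# a text encoder and an image encoder trained with a contrastive objective.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

predicted_label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(predicted_label)
for description, score in scores.items():
    print(f"{description}: {score:.3f}")
```

Because the classes are just text descriptions, adding a class means adding one more entry to `text_embeddings`; no classifier head has to be retrained.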