🪐 Universal Text Embeddings

Dispatches from the 4th era of text embeddings

From colossal pre-training datasets to instruction-based fine-tuning, “universal text embeddings” represent a paradigm shift for NLP applications across the board, departing from the preceding waves of count-based (BoW/TF-IDF/n-gram), static dense (Word2Vec/fastText/GloVe), and contextualised (GPT/BERT/T5) text embeddings.

In this series, we begin with a recap of the so-called “4th era” that began in December 2022 with Microsoft's E5, and continued into summer 2023 with Alibaba's GTE and BAAI's BGE. We will highlight the innovations—both conceptual and practical—that are reshaping how models capture linguistic meaning, promising to generalise across tasks, domains, and languages.
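
As a taste of what these models look like in practice, here is a minimal sketch of retrieval-style scoring with an E5-family checkpoint. It assumes the sentence-transformers library and the publicly released intfloat/e5-base-v2 model, chosen here purely for illustration; E5-style models expect "query:" and "passage:" prefixes on their inputs.

```python
# Minimal sketch, assuming sentence-transformers and the intfloat/e5-base-v2 checkpoint.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")

# E5-style models are trained with role prefixes, so we prepend them to each input.
queries = ["query: what are universal text embeddings?"]
passages = [
    "passage: Universal text embeddings aim to generalise across tasks, domains, and languages.",
    "passage: Word2Vec assigns one static vector to each word regardless of context.",
]

# normalize_embeddings=True returns unit vectors, so cosine similarity reduces to a dot product.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = q_emb @ p_emb.T  # shape (1, 2); higher means more relevant to the query
print(scores)
```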

A Deep Research series