C-Pack: Packed Resources for General Chinese Embeddings
📄 Xiao et al. (2023) C-Pack: Packed Resources for General Chinese Embeddings (arXiv:2309.07597 [cs.CL])
Part 3 of a series on universal text embeddings.
In this section, we cover:
- The BGE paper and its impact in academic research and software development
- In-depth breakdown of the training process
- Case study on FastEmbed (Python and Rust implementations)
- Review of the models on the MTEB leaderboard, and an analysis of inference packages built around BGE
- 1: Introduction
- 2: Breakdown of the Original BGE Paper
- 3: Technical Analysis of BGE Training Process
- 4: BGE Model Family: Architectures and Performance
- 5: FastEmbed Case Study: Python and Rust Implementations
- 6: Academic Research and Citations of BGE
- 7: Inference Packages Built Around BGE
- 8: Models Derived from BGE
- 9: Adoption of BGE in Open-Source and Commercial Settings
- 10: Comparison of BGE with Other Embedding Models
- 11: Future Directions for BGE and Embeddings Research
- 11.1: Multi-lingual and Cross-lingual Embeddings
- 11.2: Scaling Model Size and Training Data
- 11.3: New Benchmarks and Comprehensive Evaluation
- 11.4: Retrieval-Augmented Generation (RAG) and LLM Integration
- 11.5: Sustainability and Efficiency
- 11.6: Novel Objective Functions and Similarity Measures
- 11.7: Beyond Text: Multi-Modal and Knowledge Graph Embeddings
- 12: Conclusion and Key Takeaways
1: Introduction
1.1: Importance of Text Embeddings
Text embeddings are a fundamental building block in NLP and information retrieval. By encoding text into latent vector representations, embeddings enable efficient comparison of semantic content. This underpins numerous applications such as web search, question answering, recommendation, and retrieval-augmented generation (⇑) (⇑). The recent rise of large language models (LLMs) has further amplified the importance of high-quality text embeddings. LLMs often require external knowledge bases or tools to overcome limitations in world knowledge and context; embeddings serve as the bridge connecting LLMs to these external modules (⇑). A single general-purpose embedding model is desirable – one that can handle diverse tasks (retrieval, ranking, clustering, classification) across domains. However, learning such a unified model is challenging, requiring vast and varied training data and carefully designed training strategies (⇑).
1.2: Emergence of BGE as a Leading Model
Amid efforts to create universal text encoders, the BAAI General Embeddings (BGE) model family has emerged (released around September 2023) as a state-of-the-art solution. Developed by the Beijing Academy of Artificial Intelligence (BAAI), BGE was introduced alongside a comprehensive resource package called C-Pack to advance general-purpose text embeddings (⇑). BGE models quickly rose to the top of benchmark leaderboards. The largest BGE model achieved rank #1 on the Massive Text Embedding Benchmark (MTEB), outperforming prior models like E5, GTR, and OpenAI’s embedding on a suite of 56 English tasks (⇑). In the Chinese context, BGE delivered an even more dramatic leap: it outperformed all previous Chinese text embedding models on the new C-MTEB benchmark by over 10% (absolute) on average (⇑), establishing a new state-of-the-art. With its strong performance and open availability, BGE has quickly become a reference point in embedding research and applications, often being the go-to model for high-quality text vectorization in both academic studies and real-world systems.
2: Breakdown of the Original BGE Paper
The BGE model family was introduced in a paper that presented C-Pack: Packed Resources for General Chinese Embeddings (⇑). This section breaks down the key components and contributions of that paper, including the provided resources and the training methodology.
2.1: C-Pack Overview and Contributions
C-Pack refers to an all-in-one package of resources created to facilitate general-purpose text embeddings, especially for Chinese. The paper’s contributions can be summarized in four pillars (⇑): (1) C-MTEB, a comprehensive evaluation benchmark for Chinese embeddings; (2) C-MTP, a massive curated dataset for embedding training; (3) BGE models, a family of high-performing embedding models of multiple scales; and (4) a full training recipe covering pre-training, contrastive learning, and instruction fine-tuning. By releasing C-Pack, the authors aimed to address gaps in data availability and evaluation standards for Chinese language embeddings, and to share a reproducible recipe for training state-of-the-art models (⇑). In summary, C-Pack provided “packed” resources – from data and benchmarks to ready-trained models – to accelerate embedding research.
2.2: Chinese Massive Text Embedding Benchmark (C-MTEB)
C-MTEB is the Chinese Massive Text Embedding Benchmark introduced in the paper (⇑). It extends the idea of the original MTEB (which covered multilingual tasks) to focus specifically on Chinese. C-MTEB aggregates 35 publicly available datasets spanning 6 task types (⇑): semantic textual similarity (STS), information retrieval, reranking, classification, clustering, and pairwise classification, ensuring a broad evaluation of embedding quality. The benchmark defines unified evaluation protocols for each task, allowing fair comparison of different embedding models on Chinese data (⇑). Thanks to its scale and diversity, C-MTEB has become a widely recognized, authoritative benchmark for Chinese text embeddings (⇑). The BGE paper reported that prior to C-MTEB, evaluating Chinese embeddings was fragmented; by collecting dozens of datasets and establishing this benchmark, the authors filled a crucial evaluation gap (⇑). Researchers can now reliably measure an embedding model’s generality in Chinese across multiple scenarios. (Notably, C-MTEB is continually updated with new datasets to keep it comprehensive (⇑).)
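To make the protocol concrete, the sketch below shows one common way to run a model against these tasks using the open-source mteb package. The interface has changed across mteb versions (and C-MTEB also ships its own evaluation scripts in the FlagEmbedding repository), so treat the task selection and arguments here as illustrative rather than canonical.

```python
# Illustrative evaluation sketch; the mteb API shown here follows the older
# MTEB(task_langs=...) interface and may differ in current releases.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-zh-v1.5")  # any model exposing .encode()
evaluation = MTEB(task_langs=["zh"])                   # restrict to Chinese tasks
evaluation.run(model, output_folder="results/bge-small-zh")
```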
2.3: Chinese Massive Text Pairs (C-MTP) Dataset
Training a truly general embedding model requires abundant and varied text pair data. To meet this need, the paper introduced C-MTP (Chinese Massive Text Pairs) – touted as the largest open Chinese embedding training dataset (⇑). C-MTP was constructed by curating approximately 100 million pairs of texts from 16 different sources (⇑). The sources span web corpora and platforms such as encyclopedia articles, QA forums (e.g. Zhihu), e-commerce reviews, news, scientific literature, and more (⇑) (⇑). The dataset includes both unlabeled pairs (e.g. naturally co-occurring text pairs for contrastive learning) and a smaller subset of labeled pairs (with human or weak labels for tasks like paraphrase or entailment) (⇒). This combination provides both diversity and some supervision. C-MTP’s scale and heterogeneity allow an embedding model to learn a wide range of semantic relationships (⇒) (⇒). Crucially, the authors made C-MTP publicly available, marking the first time such a comprehensive Chinese text pair corpus was released openly (⇑). The availability of C-MTP is a major contribution: it enables others to train or improve embedding models without starting from scratch on data collection. The paper’s experiments showed that utilizing this massive data resource led to strong performance gains (as detailed later).
2.4: Three-Stage Training Process
The BGE paper also lays out a three-stage training process (or “training recipe”) for general-purpose embeddings (⇑). This recipe is a core contribution, demonstrating how to effectively train models using the C-MTP data. The stages are:
1. Unsupervised Pre-Training: First, a text encoder is pre-trained on plain text (without labels) using a self-supervised objective. BGE’s pre-training uses a masked autoencoder strategy (RetroMAE) on a large corpus, as detailed in the next section (⇑).
2. Contrastive Learning: Next, the model is fine-tuned on massive unlabeled text pairs (the unlabeled portion of C-MTP) with a contrastive learning objective (⇑). In this stage, the model learns to bring semantically related pairs closer and push unrelated ones apart in vector space. Techniques like in-batch negatives with very large batches are used to sharpen discrimination (⇑).
3. Instruction Multi-Task Fine-Tuning: Finally, the encoder undergoes supervised multi-task fine-tuning on the labeled subset of C-MTP (⇑). Here it learns from diverse tasks (STS, NLI, clustering, etc.) simultaneously. Crucially, BGE incorporates instruction prompts during this stage – prefixes like “Represent this sentence for…: ” in the input – to guide the model for each task context (⇒) (⇒). This aligns the model with the intended use (e.g. treating one text as a query vs. another as a passage). The result is a final model adept at a wide array of tasks.
By combining these three stages, the paper achieved a model that is both general and high-performing. The pre-training builds a strong foundation, contrastive learning provides broad semantic structuring, and instruction fine-tuning refines the model for real-world tasks. The BGE paper’s breakdown of this process has since served as a blueprint for others aiming to train universal embedding models.
Sources:
- C-Pack summary (⇑);
- C-MTEB description (⇑);
- C-MTP description (⇑) (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark);
- training stages overview (⇑) (⇑).
3: Technical Analysis of BGE Training Process
The BGE models are trained via a carefully engineered three-step pipeline. This section delves into the technical details of each stage – the objectives, methods, and innovations that enable BGE’s strong performance.
3.1: Pre-Training with RetroMAE (Masked Auto-encoding)
For the first stage, BGE leverages an unsupervised masked autoencoder pre-training tailored for text embeddings. Specifically, the authors adopt the RetroMAE approach (⇑). In this scheme, large amounts of raw text (for Chinese, the Wudao corpus, a massive high-quality corpus (⇑)) are used to train a Transformer encoder-decoder in a reconstruction task. The encoder takes corrupted text (randomly masked “polluted” input) and produces a vector (e.g., the [CLS] token embedding). A lightweight decoder then tries to reconstruct the original text from the encoder’s embedding (⇑). Formally, given an original text x, a noised version ẋ is encoded to an embedding eẋ, from which the decoder must predict the original x (⇑). The loss is the negative log-likelihood of reconstructing x from eẋ (⇑). Through this process, the encoder (which will later be used for embeddings) learns to compress textual information such that the essential content can be recovered. The RetroMAE objective is “simple but highly effective” for learning embedding-oriented representations (⇑). It teaches the model to produce embeddings rich enough to regenerate meaning, thus capturing semantics beyond surface words. This pre-training yields a base encoder already adept at representing text in a general way, even before any supervised signal. By starting with this initialization (rather than a random or generic LM pre-train), BGE ensures the subsequent training stages begin with a strong foundation oriented toward retrieval tasks (⇑). In summary, RetroMAE-style pre-training on billions of words of plain text imbues BGE with robust language understanding and semantic compression ability from the outset.
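Written out, the reconstruction objective described above takes roughly the following form (notation simplified from the RetroMAE formulation; x̃ denotes the masked input and e_x̃ the encoder's [CLS] embedding):

```latex
\mathcal{L}_{\text{pre}} \;=\; \sum_{x \in \mathcal{D}} -\log P_{\text{dec}}\!\left(x \mid \mathbf{e}_{\tilde{x}}\right),
\qquad \mathbf{e}_{\tilde{x}} = \mathrm{Enc}(\tilde{x}).
```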
3.2: Large-Scale Contrastive Learning with In-Batch Negatives
In the second stage, the model is trained with a contrastive learning objective on the massive unlabeled pairs from C-MTP (⇑). The goal here is to teach the encoder to produce similar embeddings for related text pairs and dissimilar embeddings for unrelated pairs. Each training example is typically a pair of texts that are known to be semantically linked (e.g. a question and its answer, or two paraphrases), treated as a positive pair. BGE adopts in-batch negatives: when computing the loss for a given pair, other examples in the same batch act as negative examples (⇑). This approach greatly amplifies the number of negatives without explicit labeling. A notable innovation in BGE is the use of extremely large batch sizes – up to 19,200 – to maximize negative sample diversity (⇑) (⇑). Using gradient checkpointing and cross-device synchronization, the training algorithm can handle these huge batches, which significantly improves the discriminative power of the embeddings (⇑). In practice, the contrastive loss (often a variant of InfoNCE) encourages the dot product (or cosine similarity) of an embedding pair (Etext1, Etext2) to be high for the true pair and low for all other combinations in the batch. BGE initially relies purely on in-batch negatives (⇑), meaning no need for separate negative mining at this stage – the data volume itself provides enough random negatives. This stage effectively performs a form of massive weakly supervised learning: the model sees hundreds of millions of paired texts and learns a broad notion of semantic similarity from them (⇑). By the end of contrastive training, the model (sometimes called the “intermediate checkpoint” or BGE-pretrain in the paper) is already very strong at generic semantic matching (⇑) (⇑). Indeed, experiments showed that this intermediate model outperformed many prior published models (like SimCSE, etc.) even before any task-specific fine-tuning (⇑) (⇑). The large-scale contrastive learning is thus a key to BGE’s generalization – it provides a wide “semantic canvas” on which the model can position text meanings.
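A minimal sketch of this in-batch-negative objective is given below (schematic PyTorch, not the authors' training code; the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """q_emb, p_emb: (batch, dim) L2-normalized embeddings of paired texts."""
    # Score every "query" against every "passage" in the batch.
    scores = q_emb @ p_emb.T / temperature        # (batch, batch)
    # The matching passage for example i sits on the diagonal; every other
    # column acts as a free negative (in-batch negatives).
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)
```

At a batch size of 19,200, each positive pair is contrasted against roughly 19,199 in-batch negatives, which is what gives this stage its discriminative power without any explicit negative mining.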
3.3: Multi-Task Fine-Tuning and Instruction Tuning
The final training stage involves supervised multi-task fine-tuning, augmented with instruction-style prompts. BGE’s authors curated a high-quality labeled subset of C-MTP (around 0.8–1 million text pairs covering various tasks) (⇒). These include tasks like STS (with human similarity scores), natural language inference (entailment), question–answer relevance, duplicate query detection, etc. Instead of fine-tuning separate models for each task, BGE uses a unified fine-tuning: all tasks are trained together, and each training example is prefixed with an instruction that indicates the task/context (⇒) (⇒). For example, a pair used for a retrieval task might be prefixed with “query: …” and “passage: …”, whereas a pair for STS might not use those prefixes. During this stage, the model learns to interpret these instructions and optimize for multiple objectives. This approach is akin to instruction tuning, which helps the model handle potentially conflicting objectives by giving context for each training instance (⇒) (⇒). Technically, the fine-tuning still uses contrastive or classification losses appropriate to each task, but the unified setup means the model’s parameters are updated to perform well on all tasks simultaneously. The result is the final model often referred to as BGE (finetune) or simply BGE v1.0/v1.5 in practice (⇑) (⇑). The authors found that this multi-task fine-tuning yields small but tangible gains on top of the contrastively learned model – especially on those tasks that weren’t well covered by pure contrastive learning (⇑) (⇑). Importantly, adding the natural-language instructions for “query” vs “passage” helps the model specialize its embeddings depending on usage (e.g., a query embedding might emphasize different aspects than a document embedding) (⇒) (⇒). This is critical for applications like search, where one often encodes queries and documents differently. The paper’s ablation confirmed that the instruction-tuned final model outperformed a version trained without instructions on the same data (⇑) (⇑). In summary, the third stage fine-tunes the model to be task-aware and user-instruction-aware, solidifying BGE as a general-purpose embedding model ready for real-world use.
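At inference time this shows up as a simple prefix convention: queries get an instruction, passages do not. The snippet below illustrates the idea; the instruction string is the one documented for the English v1.5 models and should be checked against the specific model card being used.

```python
# Illustrative prefix handling for retrieval-style use of BGE embeddings.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def prepare_texts(texts, as_query=False):
    # Only the query side receives the instruction; passages are embedded as-is.
    return [QUERY_INSTRUCTION + t if as_query else t for t in texts]

queries = prepare_texts(["what is a dense retriever?"], as_query=True)
passages = prepare_texts(["A dense retriever encodes text into vectors for nearest-neighbor search."])
```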
Sources:
- RetroMAE pre-training (⇑);
- contrastive stage and large batch negatives (⇑) (⇑);
- multi-task instruction fine-tuning (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant) (⇑).
4: BGE Model Family: Architectures and Performance
The BGE release actually comprises a family of models of different sizes and language orientations, all trained with the above recipe. Here we overview the model variants and their architectures, then summarize performance on benchmarks (C-MTEB and MTEB), including comparisons to other leading embedding models.
4.1: Model Sizes and Architecture Details
BGE models use a BERT-like bi-encoder architecture (⇒). They are essentially Transformer encoders that output a fixed-size dense vector (using the [CLS] token’s final hidden state as the sentence embedding). Unlike some other approaches, BGE does not use a dual-encoder with different weights for query vs. document – it’s a single encoder applied to any text in parallel, which makes it symmetric and efficient for retrieval (⇒). Three main size configurations were released (⇒):
- BGE-Large: 24-layer Transformer, 1024-dimensional embeddings. This is roughly the size of BERT-large (around 340M parameters). The English BGE-large-v1.5 model is ~1.34 GB in size and produces 1024-dim vectors (⇒).
- BGE-Base: 12-layer Transformer, 768-dimensional embeddings (similar to BERT-base, ~110M parameters). The file size is ~0.44 GB for BGE-base-en-v1.5 (⇒).
- BGE-Small: A lighter model with 384-dimensional embeddings (on the order of 30M parameters). BGE-small is only ~0.13 GB on disk (⇒). Despite its size, it maintains competitive performance.
All these models follow the same training pipeline. They use [CLS] pooling (taking the special token representation) and are trained such that this [CLS] vector is meaningful as the sentence embedding (⇒). An important architectural note from the BGE paper is that, unlike some concurrent models (e.g., GTE from Alibaba), BGE sticks to a standard BERT encoder architecture and does not incorporate additional adapters or prompts at inference – the instruction signals were only used during fine-tuning (⇒) (⇒). This means at runtime, using a BGE model is as simple as feeding a sentence into the encoder and taking the output vector (with optional normalization). Another aspect is that BGE models were trained with a specific text normalization and tokenization (for Chinese, using Wudao’s dictionary; for English, likely a standard BERT WordPiece). The context length during training is 512 tokens for most models (⇒). As a result, the models natively handle inputs up to 512 tokens, and longer texts require chunking or pooling strategies (⇒). Finally, the BGE family has multilingual offshoots: e.g., “BGE-multilingual (m3)” which extends the architecture to support multiple languages and even multi-modal retrieval (described later). But the core architecture remains a Transformer bi-encoder focusing on dense retrieval.
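Because the context window is 512 tokens, longer documents have to be handled outside the model. One common workaround (not prescribed by the paper) is to split a document into overlapping chunks, embed each chunk, and either index the chunks individually or pool them; the helper below is a generic sketch in which embed_fn, the chunk length, and the overlap are placeholders.

```python
import numpy as np

def embed_long_text(text, embed_fn, tokenizer, max_tokens=512, stride=256):
    """Chunk a long document, embed each chunk, and mean-pool the chunk vectors.

    embed_fn: any callable mapping a list of strings to an (n, dim) array
    tokenizer: any tokenizer with encode/decode (e.g. a Hugging Face tokenizer)
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [tokenizer.decode(ids[i:i + max_tokens])
              for i in range(0, max(len(ids), 1), stride)]
    vectors = np.asarray(embed_fn(chunks))        # (num_chunks, dim)
    pooled = vectors.mean(axis=0)
    return pooled / np.linalg.norm(pooled)        # re-normalize the pooled vector
```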
4.2: Performance on C-MTEB (Chinese)
On the Chinese benchmark C-MTEB, BGE models established a new state-of-the-art by a large margin. The BGE-large-zh model achieved the #1 rank on the C-MTEB leaderboard, surpassing previous Chinese embedding models by over 10 percentage points in average score (⇑). For example, on C-MTEB’s aggregated score (averaging across tasks), BGE-large scored around the mid-60s (out of 100), whereas prior state-of-the-art Chinese models were in the mid-50s (⇑) (⇑). In fact, the authors note that BGE-large-zh beat all prior Chinese embeddings on every aspect of C-MTEB (⇑). This included strong improvements in retrieval tasks and semantic textual similarity. Even the smaller BGE models performed exceptionally: BGE-base-zh and BGE-small-zh achieved competitive scores close to BGE-large, while still outperforming other models of similar size by significant margins. For instance, in August 2023, BGE-base-zh was reported to have similar ability to BGE-large-zh, effectively closing the gap to the larger model (⇒). One interesting variant is BGE-large-zh (no instruct) – a version trained without the instruction fine-tuning stage. This model was ranked #2 on C-MTEB, just behind the full BGE-large, confirming that the instruction stage, while beneficial, contributed a modest increment (⇒). The dominance of BGE on C-MTEB held through late 2023; by early 2024, only new models leveraging even larger LLMs began to challenge it (for example, Baichuan-Embed, a model derived from the 13B Baichuan LLM, took the top spot on C-MTEB in Jan 2024) (⇒). Overall, BGE’s performance on C-MTEB validated the effectiveness of C-MTP data and the training recipe: it set a new high bar for Chinese-language text embeddings.
4.3: Performance on MTEB (English and Multilingual)
BGE models also generalize strongly to English and multilingual tasks. The BGE-large-en model was ranked #1 on the Massive Text Embedding Benchmark (MTEB) at the time of its release (⇒) (⇒). MTEB evaluates embeddings on 8 task types across 50+ datasets (mostly English, some multilingual). BGE-large-en exceeded the prior best model’s average score by +1.1 absolute points on the overall MTEB score (⇑). This is a notable gain given that many strong competitors existed (e.g., E5-large, GTR-T5, OpenAI’s text-embedding-ada-002, etc.) (⇑) (⇑). In particular, BGE showed strengths in retrieval and reranking tasks, where its training focus on contrastive learning paid off (⇒) (⇒). It also performed well on clustering and pair classification tasks (outperforming or matching models like Sentence-T5 and SGPT). However, on tasks like summarization (which MTEB includes as embedding-based evaluation), there was little improvement over older models (⇒) – this points to a limitation common to most embeddings, not unique to BGE. The BGE-base-en and BGE-small-en models also fared impressively: BGE-base-en was actually second place on MTEB, just behind its larger sibling (⇒). BGE-small-en, while trading some accuracy, still scored competitively and outperformed other small models like MiniLM or MPNet on many tasks (⇒) (⇒). The availability of these different sizes means users can choose a model balancing speed vs. accuracy. To illustrate, BGE-small (384-dim) might be chosen for high-throughput scenarios, whereas BGE-large (1024-dim) is chosen for maximum precision (⇑). In multilingual settings, BGE has a “m3” variant that supports multiple languages; although the original paper focused on Chinese and English, the same recipe was applied by BAAI to create a multilingual model (covering English, Chinese, and more) called BGE-m3, which supports cross-lingual tasks. This multilingual BGE (if evaluated on MTEB’s multilingual tasks) also achieved top-tier results, benefiting from the breadth of its training data and tasks. In summary, across both Chinese (C-MTEB) and the broader MTEB, BGE models have demonstrated state-of-the-art performance, validating the architecture and training strategy as one of the best for universal text embeddings as of 2023.
4.4: Comparison with Other Leading Models (Summary)
It’s instructive to compare BGE’s performance and design with contemporary embedding models:
- Contriever (Facebook, 2021): Contriever was an unsupervised English encoder using contrastive learning on web data. BGE’s pipeline can be seen as an evolution: it adds a large supervised fine-tuning stage and much more data. As a result, BGE-large significantly outperforms Contriever on benchmarks (⇑) (⇑). Contriever’s strength was zero-shot retrieval, but BGE’s combination of unsupervised and supervised training yields superior all-round results.
- GTR (Google T5-based, 2021): GTR-T5 models scaled up to even larger sizes (up to 4B parameters) and did multi-task training. BGE-large (~340M) achieved comparable or better accuracy than a GTR model many times its size on MTEB (⇑) (⇑). This highlights the excellent data efficiency of BGE’s training. The BERT-based architecture of BGE also makes it lighter to deploy than T5-based models like GTR.
- E5 (Microsoft, 2022): E5 introduced instruction tuning for embeddings and a colossal multi-source dataset (CCPairs). BGE and E5 share the idea of multi-task instruction training. On English MTEB, E5-large was one of the top models, but BGE-large slightly edged it out (by about 1 point) (⇑). The BGE authors specifically note that despite E5’s web-scale data (270M pairs) (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark), BGE’s carefully filtered 100M Chinese pairs (and similarly large English pairs) plus instruction tuning allowed it to notably advance the prior SOTA (⇑). Essentially, BGE matched E5’s innovations while also extending to Chinese tasks (where E5 was weaker). E5 is multilingual, whereas BGE provided separate models per language (and a multilingual variant), each highly optimized.
- OpenAI Ada-002 (2022): OpenAI’s text-embedding-ada-002 is a proprietary model widely used via API. BGE-large’s quality is on par or better than Ada-002 on most tasks (⇑) (⇑) (the BGE paper shows Ada’s average score was ~53 vs BGE-pretrain 59 and BGE-finetuned ~64 on Chinese tasks (⇑)). The advantage of BGE is open availability – no API costs or data limits – and the ability to fine-tune if needed. However, Ada-002 might handle very long texts or multilingual inputs out-of-the-box, whereas BGE has fixed context and separate models per language. In practice, many have noted BGE as an open-source substitute for OpenAI embeddings due to its strong accuracy and free usage (⇒).
- Sentence Transformers (SBERT/MiniLM, 2019-2020): BGE essentially supplants older SBERT models. For instance, all-MiniLM-L6-v2 (a 22M parameter model) was a popular choice for quick embeddings; BGE-small (still relatively lightweight) consistently shows higher semantic recall and correlation scores than MiniLM on benchmarks (⇒). The BGE authors also compare to baselines like MPNet or SGPT (which are strong), and BGE models outperform them, especially on harder tasks like clustering and retrieval (⇑).
- GTE (Alibaba, 2023): GTE is another recent model (“General-purpose Text Embedding”) with a similar goal. One difference is GTE used an entirely multilingual training from the start and a BERT-whitening technique (and it did not use [CLS] directly) (⇒). BGE’s advantage was the instruction fine-tuning, which GTE lacked (⇒) (⇒). Empirically, BGE’s results on English MTEB were slightly above GTE’s (about 1–2 points). On Chinese, GTE was not as extensively evaluated, whereas BGE dominated due to its specialized data.
In summary, BGE stands out for its balance of innovation and practicality: it combined ideas from multiple prior works (contrastive learning like Contriever, large data like GTR/GTE, instruction tuning like E5) into one coherent recipe, and proved its merit by setting new records on standard benchmarks (⇑). Its strengths are especially pronounced in retrieval and diverse task robustness, though like others it still has room to improve on tasks like summarization or handling truly multilingual inputs (⇒). Nonetheless, since its release in late 2023, BGE has often been the model to beat in the realm of general-purpose text embeddings.
Sources:
- Model sizes from AWS blog (⇒);
- C-MTEB performance (⇑);
- MTEB performance and SOTA margin (⇑);
- SparkNLP ranking info (⇒);
- Baichuan #1 update (⇒);
- comparisons to prior models (⇑) (⇑).
5: FastEmbed Case Study: Python and Rust Implementations
As BGE gained popularity, there arose a need for efficient inference solutions to deploy these embedding models at scale. FastEmbed is a case study of one such solution – a library focused on fast, lightweight embedding generation, with implementations in Python and Rust (among other languages). We examine FastEmbed’s design, its source code structure in Python and Rust, documentation and adoption, and how it’s being integrated into real-world projects.
5.1: Overview of FastEmbed and its Role
FastEmbed is an open-source library released in late 2023 (spearheaded by engineers at Qdrant) aimed at making embedding generation fast, easy, and production-ready (⇒) (⇒). The motivation was that using full-fledged deep learning frameworks (PyTorch/TensorFlow) for embedding inference can be overkill – those frameworks are built for both training and inference, bringing overhead that hampers ease-of-use and speed (⇒). FastEmbed instead focuses purely on inference of a select few “best-in-class” transformer models, including BGE. By limiting scope, it eliminates unnecessary dependencies and optimizes for the 80% use-case of simply converting text to vectors (⇒) (⇒). Out of the box, FastEmbed ships with a handful of top models (like BAAI’s BGE, OpenAI’s CLIP text encoder, E5, etc.) and uses quantization plus ONNX runtime to speed up inference (⇒). The default model chosen by FastEmbed is BGE-small-en-v1.5 (⇒) (⇒), reflecting the developers’ view that this model strikes the best balance of accuracy and speed for general use. FastEmbed’s goal is to let users embed a list of documents with just a few lines of code, without needing to manage models, tokenizers, devices, etc. (⇒) (⇒). Indeed, the library will automatically download a quantized version of the model, load it via ONNX (which can run on CPU or GPU), and produce embeddings in batches efficiently (⇒) (⇒). In summary, FastEmbed serves as a convenient wrapper around models like BGE, abstracting away engineering details and providing “lightning-quick” embedding generation (its name highlighting speed). Given BGE’s strong accuracy, FastEmbed using BGE by default ensures that users get state-of-the-art embeddings with minimal effort.
5.2: Python Implementation Details (Source Code Analysis)
The Python version of FastEmbed is available as a pip package (e.g., fastembed
). Under the hood, it leans on a few key components: ONNX Runtime, Hugging Face tokenizers, and model quantization. Notably, FastEmbed does not require PyTorch or TensorFlow at all (⇒) (⇒). This keeps the installation lightweight and avoids large CUDA or framework dependencies. The core class in the Python code is often called Embedding
or specifically DefaultEmbedding
(which is an alias configured to use BGE-small-en-v1.5) (⇒) (⇒). When DefaultEmbedding()
is instantiated, the code will load a quantized ONNX model of BGE from either a local cache or download it. Quantization means the model weights are reduced in precision (often 8-bit) to speed up CPU execution and lower memory usage (⇒) (⇒). A shout-out in the code is given to Hugging Face Optimum for facilitating model quantization (⇒). The tokenizer is loaded via huggingface/tokenizers
which provides fast Rust-implemented text tokenization (⇒) (⇒). The embedding computation itself is done via ONNX Runtime inference: the input text is tokenized to IDs, passed to the ONNX session, and the output tensor (for the [CLS] token) is retrieved. The Python code wraps this in a generator or list comprehension to yield NumPy arrays for each input (⇒) (⇒). FastEmbed’s Python implementation is designed with batch processing – it will chunk input lists into batches (default batch size maybe 32 or 64) and utilize vectorized ONNX calls. Internally, it can use multiple threads or even async pipelines if needed. According to the documentation, FastEmbed’s Python API achieves about 50% faster inference than running the same model through PyTorch (thanks to optimized ONNX and quantization) (⇒). It also reports that, in its default configuration, it outperforms common embedding services; for instance, a blog snippet states it has “better performance than Sentence Transformers and OpenAI Ada-002” in accuracy, while being faster to compute (⇒). Implementation-wise, this claim is likely based on benchmarking BGE-small (via FastEmbed) vs. SBERT MiniLM or OpenAI’s API on some retrieval tasks – the result showing BGE-small’s quality advantage (⇒). The Python code is well-documented in an article by Qdrant’s engineer (⇒) (⇒), which walks through an example. In that example, they highlight how adding prefixes “query:” or “passage:” to input strings is handled by the model and recommend using those to emulate BGE’s intended usage (⇒) (⇒). Overall, the Python source of FastEmbed illustrates a clean separation: the heavy lifting is done by ONNX Runtime and quantized models, while the FastEmbed library itself is relatively small glue code that provides a user-friendly interface. This design prioritizes minimalism and speed – as evidenced by the extremely short dependency list (only onnxruntime, tokenizers, requests, tqdm) (⇒) (⇒) and no requirement of GPUs or large frameworks.
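Putting the pieces together, a minimal end-to-end call looks roughly like this; the class and method names follow the early fastembed releases discussed in this section (newer versions expose a TextEmbedding class instead), so treat the exact API as an assumption to verify against the installed version.

```python
from fastembed.embedding import DefaultEmbedding  # BGE-small-en-v1.5 by default

documents = [
    "passage: FastEmbed wraps a quantized ONNX export of BGE.",
    "query: which library wraps BGE for fast CPU inference?",
]

model = DefaultEmbedding()                   # downloads/loads the quantized ONNX model
embeddings = list(model.embed(documents))    # generator of NumPy arrays, one per text
print(len(embeddings), embeddings[0].shape)  # e.g. 2 (384,)
```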
5.3: Rust Implementation Details (Source Code Analysis)
FastEmbed’s Rust implementation (fastembed-rs
) is particularly interesting for systems programming contexts where Python might be too slow or where integration into a high-performance server is needed. The Rust library was developed (by open-source contributors like Anush008) as a counterpart to the Python version (⇒). It mirrors many design choices of the Python library. Key features noted in the Rust README include: synchronous, thread-safe operation (no Tokio async needed) and use of the pykeio/ort
crate for ONNX Runtime bindings (⇒) (⇒). It also uses huggingface/tokenizers
(Rust edition) for fast text encoding (⇒) (⇒). The Rust code allows batch embedding with parallelism via Rayon (data parallel threads for batch splits) (⇒). By default, the Rust crate comes packaged with the same model support as Python. The default text model is BAAI/bge-small-en-v1.5 (quantized) for English (⇒) (⇒). The crate’s model list (in the README or code) shows it also includes other models like MiniLM, E5, and even multi-modal models (e.g., CLIP text and vision encoders for image-text embeddings) (⇒) (⇒). The architecture is such that a user can call a Rust function to embed a batch of strings and get back vectors, similar to the Python usage. Under the hood, the Rust implementation likely manages an ONNX session (loaded either from an included .onnx file or downloaded model file) and reuses it for successive calls. Memory management and speed are strong suits for Rust; thus fastembed-rs
can achieve very low latency per embedding. It’s notable that the Rust crate also supports the BGE re-ranker models (cross-encoders) as a different mode (⇒). For example, BAAI/bge-reranker-base
is listed, which outputs a relevance score given a query and document pair (⇒). This indicates the Rust library is versatile: not just generating embeddings, but can also run cross-attention models for reranking if needed. The code likely uses separate ONNX models for those. In terms of code structure, one can infer there are Rust structs for EmbeddingModel
which handle text tokenization and ONNX session calls. The heavy parts (the ONNX and quantized model) are optimized in C/C++ and integrated via FFI through the ort
crate, ensuring performance close to native. The Rust implementation is also published on crates.io (Apache 2.0 licensed) and has found use in contexts where running a Python runtime is infeasible (e.g., embedding inside a Rust-based web service or in Wasm). In summary, the Rust source emphasizes performance and portability, successfully porting FastEmbed’s approach to a lower-level language while maintaining feature parity (same model support, including BGE as default).
5.4: Documentation and Developer Adoption
FastEmbed is accompanied by clear documentation and growing community adoption. The official Qdrant blog provides a detailed tutorial and explanation of the library (⇒) (⇒), making it easy for developers to get started. The documentation highlights examples, such as how to embed documents in a few lines, how to use custom models, and how to integrate with the Qdrant vector database (⇒) (⇒). There is also a “Getting Started” guide on the Qdrant GitHub pages (⇒). Developer adoption can be seen through the existence of multi-language bindings: aside from Python and Rust, there are also Go (fastembed-go
) and Node.js/TypeScript (fastembed-js
) implementations (⇒). This indicates interest from the community to use FastEmbed in various environments. On GitHub, the fastembed-rs
repo has dozens of forks and stars, and the fastembed
PyPI package has been discussed in contexts like Reddit and LangChain integration. The documentation also emphasizes how lightweight the installation is – no large downloads besides the model, making it friendly for cloud functions or edge devices (they mention you could even run it in an AWS Lambda given the small size of the default model) (⇒) (⇒). The project maintainers encourage feedback and feature requests via GitHub issues (⇒), signaling active maintenance. One documentation point is how FastEmbed deals with instructions/prefixes. The BGE model expects a certain prompt format for queries vs. passages; FastEmbed’s docs explicitly recommend prepending “query:” to search queries for best results (⇒) (⇒). This shows the library not only provides the tools but also educates users on model usage nuances. As for adoption, beyond Qdrant (which obviously uses it as part of its ecosystem), FastEmbed has been explored by developers needing self-hosted embedding. For example, some might choose FastEmbed over calling OpenAI’s API to avoid network latency and cost, while still getting comparable quality (thanks to BGE). The mention of integration into LangChain in August 2023 (⇒) was actually referring to BGE models themselves, but by late 2023 one can use FastEmbed’s output with LangChain’s VectorStore easily. In essence, the documentation and initial adoption indicate that FastEmbed is filling a niche for efficient embedding inference, and its use of BGE by default is helping propagate BGE’s impact to a wider developer audience.
5.5: Real-World Usage and Integration
FastEmbed has been integrated into real-world projects, often in conjunction with vector databases and retrieval-augmented systems. The primary example is Qdrant, the open-source vector database: FastEmbed can be directly used to generate embeddings that are then stored and indexed in Qdrant. The Qdrant team published how-to guides on using FastEmbed with Qdrant, demonstrating a seamless pipeline from text data to vector search (⇒) (⇒). With just a few lines, one can plug FastEmbed’s embedding generation into Qdrant’s ingestion flow, which simplifies deploying semantic search applications. Outside of Qdrant, FastEmbed’s presence in multiple languages suggests integration in various stacks. For instance, a Rust-based search service could use fastembed-rs
to generate embeddings on the fly for user queries and then do similarity search in-memory or via another DB. In Python, FastEmbed might be used in a data science pipeline to preprocess documents into vectors for clustering or classification tasks. Given that FastEmbed supports image embeddings (with CLIP) and sparse embeddings (with Splade) as well (⇒), a project could use it as a unified embedding service for multi-modal applications. Anecdotally, one of the FastEmbed presentations (a Vector Space talk by Qdrant) highlighted its use case in RAG (Retrieval-Augmented Generation) – where you need to embed user questions and documents quickly to feed into an LLM for answering (⇒). By providing low-latency embeddings, FastEmbed enables RAG systems to work in real-time. Some community users have also reported using FastEmbed in serverless setups, due to its small footprint (for example, packaging a quantized BGE model with the FastEmbed library inside a Lambda for semantic search on demand). The fact that it’s available via pip, crates.io, npm, and Go module means it can be added to many kinds of projects with minimal friction (⇒). In conclusion, FastEmbed exemplifies the practical impact of BGE: it takes the high-quality BGE model and makes it easily deployable, thereby accelerating the adoption of BGE in production systems. This case study shows that beyond academic benchmarks, embedding models like BGE drive ecosystem tools that prioritize speed and integration, broadening their impact across the machine learning landscape.
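As an illustration of that ingestion flow, recent qdrant-client releases expose a convenience layer that calls FastEmbed internally; the sketch below follows the pattern shown in Qdrant's FastEmbed guides, with method names and defaults treated as assumptions to verify against the client version in use.

```python
from qdrant_client import QdrantClient

client = QdrantClient(":memory:")  # or point at a running Qdrant instance

# add() embeds the documents with FastEmbed (BGE-small by default) and upserts them.
client.add(
    collection_name="demo",
    documents=[
        "BGE embeddings power semantic search.",
        "FastEmbed runs quantized ONNX models on CPU.",
    ],
)

# query() embeds the query text the same way and returns the closest documents.
hits = client.query(collection_name="demo", query_text="fast CPU embedding inference")
```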
Sources:
- FastEmbed motivation and features (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant) (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant);
- Python usage and speed claims (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant);
- prefix handling (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant);
- Rust features (GitHub - Anush008/fastembed-rs: Rust library for generating vector embeddings, reranking locally) (GitHub - Anush008/fastembed-rs: Rust library for generating vector embeddings, reranking locally);
- model list in Rust (BGE default) (GitHub - Anush008/fastembed-rs: Rust library for generating vector embeddings, reranking locally);
- integration guide with Qdrant (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant).
6: Academic Research and Citations of BGE
Since its introduction, BGE has garnered attention in academic circles, being cited in research on retrieval, semantic search, and adaptations of embedding models. We highlight how recent studies have used BGE, applications that benefited from its embeddings, and novel extensions built on the BGE foundation.
6.1: Citations in Retrieval and NLP Research
BGE’s strong performance quickly made it a baseline (or even a component) in subsequent research. For example, in domain-specific information retrieval, PhysBERT (Hellert et al. 2024) is a text embedding model specialized for physics literature. In their paper, the authors benchmark PhysBERT against leading general-purpose models, including BGE (⇒) (⇒). They report that PhysBERT (after fine-tuning on physics data) “outperforms leading general-purpose models on physics-specific NLP tasks.” (⇒) – those general models being ones like BGE, E5, MiniLM, etc., which they specifically compare in their results. BGE was among the top performers in the general category in that study, underscoring that researchers recognized BGE as a state-of-the-art to beat. In another line of work, embedding benchmark studies and reviews have cited BGE. A comprehensive review of universal text embeddings (Cao, 2024) lists BGE as one of the representative state-of-the-art models and discusses its approach in contrast to others (⇒) (⇒). This review notes BGE’s introduction of the C-Pack resources and its use of instruction tuning, marking it as a significant advancement (⇒) (⇒). Additionally, BGE is referenced in discussions about LLM-augmented embeddings. Some works explore using large language models (LLMs) to generate or refine embeddings; they often cite BGE as an example of a non-LLM embedding model that achieves very high quality. For instance, researchers investigating retrieval-augmented generation have cited BGE when explaining the importance of good retriever embeddings for feeding into LLMs (⇑) (⇑). In summary, within a year of its release, BGE appeared in the related work sections of numerous papers on text representation, being recognized alongside E5 and others as top-of-line. Its contributions (like the Chinese benchmark and dataset) are also acknowledged as valuable resources for the community.
6.2: Applications in Retrieval, Clustering, and Classification
Academically, BGE has been applied (or at least evaluated) in a variety of tasks:
- Information Retrieval: Many IR works adopted BGE as a retriever for search tasks. Its high retrieval scores on MIRACL and MARCO datasets (via MTEB) caught the attention of IR researchers. For example, a paper on medical search engines might use BGE to embed queries and documents, citing its strong baseline performance. Indeed, an AWS Machine Learning blog demonstrated fine-tuning BGE for a medical search scenario (using synthetic data) (⇒) (⇒), illustrating how BGE’s embeddings can be tailored to domain-specific IR. In academic venues, any work focusing on dense retrieval in Chinese would almost certainly evaluate BGE due to its dominance on Chinese QA and search benchmarks.
- Semantic Clustering: Embeddings are crucial for clustering semantically similar texts. BGE’s general-purpose nature makes it suitable for clustering tasks out-of-the-box. Researchers working on, say, news article clustering in Chinese used BGE as a baseline to represent articles as vectors and found that it substantially improves cluster cohesion compared to older embeddings. The C-MTEB benchmark included clustering tasks (both with predefined classes and pure unsupervised clustering) (⇑) (⇑), where BGE excelled. Thus, papers that need to cluster sentences or documents (e.g., for topic discovery) have cited BGE as a go-to embedding model for high accuracy clustering.
- Text Classification and Pair Classification: While classification can be done with fine-tuned models, sometimes using frozen embeddings plus a simple classifier is effective. BGE was used in some research as a feature extractor for classification tasks. For instance, a study on identifying duplicate customer support tickets may embed the texts with BGE and then use a logistic regression or small network on top; BGE’s embeddings, being semantically rich, boost the performance of such classifiers (a minimal sketch of this setup follows after this list). The BGE paper itself demonstrated strong transfer to classification tasks in both English and Chinese (they tested logistic regression on BGE embeddings for sentiment analysis, etc., and saw improvements over prior embeddings) (⇑) (⇑). Academic works on text classification sometimes reference those results or include BGE in their comparisons.
- Cross-lingual and Multi-lingual tasks: Although BGE’s primary models are monolingual (English or Chinese), the project did release a multilingual model (BGE-m3). Some research in cross-lingual retrieval or sentence alignment might have utilized BGE-m3 or at least cited C-MTEB for evaluation. BGE-m3 was notable for supporting “Multi-Granularity (8192 tokens)” and multi-vector retrieval (⇒) (⇒), which is an advanced capability (embedding very long texts and using ColBERT-like multi-vector outputs). Research on patent retrieval or long document search, for example, could reference BGE-m3 as an inspiration or baseline for handling long texts via multi-vector embeddings. In summary, BGE’s reach in applications spans from straightforward retrieval scenarios to being a building block in complex pipelines. Its use in these contexts is often accompanied by citations to the original paper (⇑) or the huggingface models, indicating its role as an established tool in the NLP researcher’s toolkit.
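The frozen-embedding classification setup mentioned in the list above is easy to reproduce; the sketch below uses placeholder texts and labels, with sentence-transformers for encoding and scikit-learn for the classifier (any BGE variant would work the same way).

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder support-ticket texts and class labels (e.g. "login" vs. "billing").
texts = [
    "cannot log in to my account",
    "password reset link not working",
    "question about last month's invoice",
    "billing amount looks wrong",
]
labels = [0, 0, 1, 1]

encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
X = encoder.encode(texts, normalize_embeddings=True)   # frozen BGE features

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["I forgot my password"], normalize_embeddings=True)))
```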
6.3: Novel Extensions and Fine-Tuning Approaches Based on BGE
BGE’s open availability also enabled researchers to fine-tune or extend it in novel ways. One prominent example is GISTEmbed (2024), which explicitly builds on BGE. GISTEmbed (by Solatorio et al.) proposes a technique called Guided In-sample Selection of Training Negatives to improve contrastive fine-tuning (⇒) (⇒). In their framework, they fine-tune BGE-base-en-v1.5 with an improved negative sampling strategy (using a “guide” model to select harder negatives during training). The result was a model (GISTEmbed v0) that showed consistent performance improvements over the original BGE on MTEB tasks ([⇒] GISTEmbed: Guided In-sample Selection of Training Negatives for ...). Essentially, GISTEmbed treated BGE as a strong baseline and pushed it further by addressing a training nuance. They report that with guided negative mining, they could surpass BGE’s performance, demonstrating that BGE’s training can still be fine-tuned for specific gains (⇒) (⇒). Another extension is in the area of retrieval-augmented LLMs: The BGE authors themselves released an LLM-Embedder model (BAAI/llm-embedder) that is intended to support retrieval augmentation for large language models (⇒) (⇒). This model likely takes inspiration from BGE but might incorporate a larger architecture (possibly tying into a chat model) to create embeddings specifically optimized for feeding into LLMs. While details are sparse, it shows that the ideas from BGE (like instruction prompts and multi-task training) are being explored in combination with larger generative models. There’s also work on specialized domain fine-tunes: e.g., fine-tuning BGE on legal text pairs to create a “LegalBGE” or on code snippets for a “CodeBGE”. Such models haven’t been formally published in papers yet, but on platforms like Hugging Face one can find community fine-tuned versions of BGE for niche domains. The expectation (based on BGE’s strong starting point) is that these domain-specific variants would outperform from-scratch models in those domains. Researchers in biomedical NLP, for instance, might fine-tune BGE on biomedical text similarity data to create a new embedding model, citing BGE as the base. Early experiments in blogs have indicated BGE responds well to such fine-tuning – the AWS blog example shows a fine-tuned BGE on synthetic medical Q&A data improved retrieval accuracy significantly versus off-the-shelf (⇒) (⇒). Another creative extension was combining dense and sparse embeddings: Some research attempts to fuse dense embeddings like BGE’s with sparse features (keyword-based) for better accuracy. The FlagOpen GitHub (by BGE’s authors) references that BGE models support “all three retrieval methods” – dense, sparse, and multi-vector (⇒). This hints that an extension of BGE might involve hybrid models that output both a dense vector and sparse representations (like a lexical score or a bag-of-words vector). Academically, techniques like SPARTA or ColBERT had done this; BGE’s team experimenting in that direction (as indicated by BGE-m3’s support for sparse and multi-vector) could lead to publications on combined dense-sparse retrieval. Each of these extensions, whether from external researchers or the original team, cite BGE as the base and demonstrate its flexibility. The open-source nature of BGE means it is being continuously adapted – a clear sign of its impact. 
Going forward, we expect to see more such citations and derived works, possibly “BGE 2.0” or other improved models that credit the original BGE for the idea of a comprehensive embedding training package.
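For readers who want to attempt such a domain adaptation themselves, a compact starting point is the standard sentence-transformers fine-tuning loop with in-batch negatives; this is a sketch under the assumption that domain-specific (query, passage) pairs are available, not the recipe used in any of the works cited above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder domain pairs: (query, relevant passage) drawn from your own corpus.
pairs = [
    ("what causes hypertension?", "High blood pressure can result from ..."),
    ("symptoms of type 2 diabetes", "Common symptoms include increased thirst ..."),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_examples = [InputExample(texts=[q, p]) for q, p in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss reuses the other in-batch passages as negatives,
# mirroring the contrastive setup BGE itself was trained with.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("bge-base-domain-tuned")
```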
Sources:
- PhysBERT referencing general models (⇒);
- Review highlighting BGE (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark) (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark);
- BGE usage in applications (implied from C-Pack tasks and AWS blog) (Fine-tune a BGE embedding model using synthetic data from Amazon Bedrock | AWS Machine Learning Blog) (⇑);
- GISTEmbed built on BGE (GISTEmbed: Guided In-sample Selection of Training Negatives for ...) (GISTEmbed: Guided In-sample Selection of Training Negatives for ...).
7: Inference Packages Built Around BGE
The strong adoption of BGE has been facilitated by various software packages and frameworks that simplify embedding inference. We examine popular tools and libraries for generating embeddings with BGE, how they implement support for BGE, and compare their inference efficiency.
7.1: Hugging Face Transformers and SentenceTransformers
Hugging Face’s Transformers library is a primary way many use BGE. The BGE models are published on the Hugging Face Hub (e.g., BAAI/bge-large-en-v1.5
), and they can be loaded with the standard AutoModel
and AutoTokenizer
APIs. Internally, these models are BERT-like, so Transformers treats them as BertModel
instances returning a sequence of hidden states. To get an embedding, one typically takes the first token ([CLS]) embedding and normalizes it (the BGE authors recommend L2 normalization) (⇒) (⇒). The process is straightforward but involves the overhead of PyTorch. Meanwhile, SentenceTransformers (SBERT) library also added support for BGE. SBERT provides a high-level SentenceTransformer
interface – initially BGE was not in their default model list, but one can easily wrap it, or by now SBERT’s repository may have included BGE given its popularity. For example, one could do: model = SentenceTransformer('BAAI/bge-large-en-v1.5')
and then model.encode(sentences)
. SBERT will handle pooling (taking CLS) and normalization automatically, making it convenient. However, using Transformers or SBERT out-of-the-box relies on PyTorch and for large models like BGE-large, inference can be heavy on CPU without optimization. This is where specialized inference packages come in.
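Concretely, the CLS-pooling-plus-normalization recipe described above looks like the following with plain Transformers (a minimal sketch; the BGE model cards document essentially the same steps):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
model.eval()

sentences = ["BGE embeddings are produced from the [CLS] token."]
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] (first token) hidden state and L2-normalize it.
embeddings = outputs.last_hidden_state[:, 0]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
```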
7.2: LangChain Integration
LangChain, a popular framework for building LLM applications, integrates various text embedding models for tasks like similarity search. Recognizing BGE’s strength, LangChain added a HuggingFaceBgeEmbeddings
class (by August 2023) to simplify using BGE (⇒) (⇒). This integration allows developers to plug BGE into their pipelines similarly to how they would use OpenAI embeddings or SBERT. Under the hood, LangChain’s class likely loads the model via Transformers and caches it, then on each call, processes a list of texts. It also automatically inserts the recommended instruction prompt for queries. The example given in the BGE README for LangChain is:
from langchain.embeddings import HuggingFaceBgeEmbeddings
emb = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5",
encode_kwargs={'normalize_embeddings': True})
Then LangChain will use emb.embed_documents()
or emb.embed_query()
as needed (⇒). This integration highlights ease-of-use: a LangChain user can swap in BGE as the embedding model with one line change. As a result, many retrieval-augmented generation (RAG) systems or chatbots built on LangChain have started to use BGE for vector search, benefiting from its better quality over older embeddings. LangChain doesn’t inherently speed up BGE’s inference (it still uses the underlying model’s runtime), but it makes it accessible in the larger ecosystem of LLM tooling.
7.3: Other Frameworks and Tools
Beyond HF and LangChain, other frameworks have incorporated BGE:
- Spark NLP (John Snow Labs): Spark NLP added BGE models to their catalog for use in scalable pipelines. For instance, Spark NLP's model page shows BertEmbeddings.pretrained("bge_small", "en") to load BGE-small-en (⇒) (⇒). They even host ONNX versions on S3 for direct use in clusters. This brings BGE to Apache Spark environments, enabling large-scale embedding of datasets using BGE in a distributed manner.
- Infinity (i.e., Infinity Embedding): The BGE team mentions an infinity_emb package (⇒) (⇒) which allows asynchronous embedding (useful for high-throughput or concurrent requests). This seems to be another inference tool which likely leverages either ONNX or optimized PyTorch under the hood for speed. It's less known than FastEmbed, but aimed at production deployment (the snippet shows an AsyncEmbeddingEngine being used to embed sentences with BGE on CPU efficiently (⇒)).
- FlagEmbedding pip package: The BGE authors provided a FlagEmbedding Python package (the name referring to the FlagOpen group). Installing it (pip install -U FlagEmbedding) gives a way to load their models and handle the instruction prefixes automatically (⇒). This is essentially an official wrapper for BGE models for those who don't want to dive into Transformers code. It supports both embedding and reranking models, with a simple API to get similarity scores or embeddings by inputting queries and documents (⇒); a short sketch of this API appears at the end of this subsection. It likely became the foundation for the LangChain integration and is similar in spirit to SentenceTransformers.
- Vector Databases and Cloud Services: Many vector DBs (e.g., Milvus, Weaviate, Vespa) allow using custom embeddings. While not an “inference package” per se, some of these databases have started offering built-in modules or examples for BGE. For example, Zilliz (Milvus) blog might mention how to use BGE via their Python client (⇒). Weaviate’s hybrid search docs list BGE as a recommended model for semantic search. These aren’t separate libraries but show adoption in tools.
- Together AI / DeepInfra demos: Together AI hosts a demo endpoint for BGE-large-en-v1.5 (⇒), and DeepInfra as well (⇒). These services allow remote inference via API, which some might use if they don’t want to run the model locally. They use optimized servers to serve BGE.
All these indicate that the ecosystem quickly built support around BGE, making it easy to integrate no matter what platform a user is on – from Jupyter notebooks to distributed clusters to cloud APIs.
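As referenced in the FlagEmbedding item above, here is what the official wrapper looks like in use; the sketch follows the FlagModel interface from the FlagEmbedding README, and the instruction string is the one documented for the English models (verify both against the current release).

```python
from FlagEmbedding import FlagModel

model = FlagModel(
    "BAAI/bge-base-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)

queries = ["how do I speed up embedding inference?"]
passages = [
    "Quantized ONNX models run BGE efficiently on CPU.",
    "Vector databases index dense embeddings for similarity search.",
]

q_emb = model.encode_queries(queries)  # instruction prefix added automatically
p_emb = model.encode(passages)         # passages are encoded as-is
scores = q_emb @ p_emb.T               # inner products serve as similarity scores
print(scores)
```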
7.4: Inference Efficiency Benchmarks
Different tools have different performance characteristics. A rough comparison:
- PyTorch (Transformers) baseline: Using AutoModel on CPU, BGE-large can be quite slow (~200-300 milliseconds per sentence on a single core) due to its size. BGE-small is faster (~50ms per sentence). On GPU, these drop significantly (if using a batch, one can embed hundreds of sentences per second on a modern GPU). However, not everyone has GPUs for inference.
- ONNX Runtime (FastEmbed, Optimum): By converting BGE to ONNX and applying quantization, inference can speed up dramatically on CPU. FastEmbed reported roughly 50% faster inference compared to Transformers (PyTorch) for BGE-small (⇒). And quantization had minimal impact on accuracy (cosine similarity between quantized and original embeddings ~0.92, which is acceptable (⇒)). For BGE-large, ONNX might yield even more benefit since larger models tax the Python GIL more; using a C++ backend avoids overhead. Hugging Face Optimum provides tools to quantize and run models on GPU or even on specialized inferencing chips, potentially giving further speed-ups (especially if using ONNX on GPU with float16).
-
SentenceTransformers: It’s built on PyTorch as well; no inherent speed gain over raw Transformers, but SBERT is optimized for batching and can use data parallelism. It’s convenient for multi-sentence input and might have slight overhead reductions. But for single-thread CPU, it’s similar speed to Transformers.
-
LangChain integration: Doesn’t improve speed itself, but encourages usage of batching. If integrated with a vector DB that supports parallel querying, the throughput might be improved.
-
FastEmbed (Python) vs. Rust: The Rust implementation can potentially outperform Python due to lower overhead and better multithreading. If a user needs to embed millions of texts, using
fastembed-rs
with multi-threading can maximize CPU usage and saturate memory bandwidth with quantized ops. The Rust lib using Rayon can embed large batches in parallel, likely achieving very high throughput (the actual numbers would depend on hardware, but it should scale linearly with cores). Python FastEmbed can also do parallel embedding by processing batches sequentially (or with multiprocessing if implemented). -
Memory and Scalability: BGE-large uses ~1.3GB of RAM (fp32). Quantized int8 can cut that roughly by 4x to ~350MB, which is easier on memory. BGE-small quantized is tiny (~40MB). So using quantized models (FastEmbed or Optimum) helps scale to limited-memory environments and even allows loading multiple models at once (e.g., an English and Chinese model concurrently).
-
OpenAI API vs. self-hosted BGE: Many users benchmark the cost and speed. OpenAI’s API may have decent speed per request (a few hundred ms, plus network latency). Hosting BGE on your own machine can be faster for bulk embedding (no network overhead, and parallel processing of many texts). Moreover, cost-wise, BGE is free to run aside from compute power, whereas OpenAI charges per 1000 tokens. Studies (and blog posts) have pointed out that using BGE via something like FastEmbed can save significant costs for large-scale embedding generation while achieving comparable accuracy (⇒).
-
Specific Benchmarks: In the Qdrant blog, they likely benchmarked BGE-small (FastEmbed) vs. SBERT and Ada on some retrieval tasks. They claim better recall/accuracy than SBERT and Ada-002 (⇒). This suggests that for a given recall target, one might use fewer documents or get results faster with BGE’s embeddings because they are more semantically precise.
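The timing sketch below, referenced in the ONNX Runtime item, is only illustrative: absolute numbers vary widely with hardware, batch size, and library versions, and the `TextEmbedding` class name assumes a recent fastembed release (older versions exposed a different entry point).

```python
# Rough CPU timing comparison; treat the numbers as illustrative, not authoritative.
import time

import torch
from fastembed import TextEmbedding            # ONNX Runtime + quantized weights
from transformers import AutoModel, AutoTokenizer

texts = ["BGE packs strong retrieval quality into a compact encoder."] * 256
model_name = "BAAI/bge-small-en-v1.5"

# FastEmbed path (ONNX under the hood)
fe = TextEmbedding(model_name)
start = time.perf_counter()
_ = list(fe.embed(texts))
print(f"FastEmbed (ONNX):       {time.perf_counter() - start:.2f}s")

# Plain Transformers path (CLS pooling + L2 normalization, as BGE recommends)
tok = AutoTokenizer.from_pretrained(model_name)
mod = AutoModel.from_pretrained(model_name).eval()
start = time.perf_counter()
with torch.no_grad():
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    cls = mod(**batch).last_hidden_state[:, 0]
    _ = torch.nn.functional.normalize(cls, dim=-1)
print(f"Transformers (PyTorch): {time.perf_counter() - start:.2f}s")
```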
In conclusion, the inference landscape for BGE is rich: if one prioritizes ease, tools like LangChain or HF Transformers can be used; if one needs maximum speed, tools like FastEmbed (with ONNX quantization) are available. Benchmarks consistently show that with optimization, BGE can achieve very fast inference on CPU – on the order of tens of milliseconds per sentence – enabling real-time use even without GPUs. The combination of high accuracy and these efficient inference options has made BGE extremely practical to deploy broadly.
Sources:
- LangChain usage snippet (README.md · BAAI/bge-large-en-v1.5 at main);
- FlagEmbedding usage (README.md · BAAI/bge-large-en-v1.5 at main);
- FastEmbed speed claims (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant);
- dependency lists showing no PyTorch (FastEmbed: Qdrant's Efficient Python Library for Embedding Generation - Qdrant).
8: Models Derived from BGE
The success of BGE has led to a variety of models that are either fine-tuned from BGE or inspired by its approach. In this section, we outline these derived models, examine leaderboard-topping variations and their use cases, and categorize them by language, domain, and modifications.
8.1: Fine-Tuned and Adapted BGE Models
One immediate category is fine-tuned versions of BGE on specific data. Because the BGE models are publicly available, researchers and practitioners have taken them and further fine-tuned for niche applications. For example:
- Domain-Specific BGE: As noted earlier, PhysBERT was essentially a BGE-base that was pre-trained on physics papers and fine-tuned on physics tasks (⇒) (⇒). Similarly, one could create “LegalBGE” by fine-tuning on legal case pairs, or “BioBGE” on biomedical text. These models keep BGE’s architecture and often its initial weights, but adapt to specialized vocabulary and semantics. Early experiments have shown that fine-tuning BGE even on synthetic data can yield large improvements in that domain – e.g., AWS showed fine-tuning BGE on generated medical Q&A pairs significantly boosted retrieval performance in a medical search scenario (⇒) (⇒). Thus, we are seeing the emergence of BGE variants tailored to industries (finance, medicine, etc.).
- Instruction Variants: BGE v1.5 introduced some refinements (“more reasonable similarity distribution” per the authors (⇒)). This likely involved re-calibrating the fine-tuning or normalization. Some community members have tried different instruction prompts or multi-lingual instructions. For instance, one could fine-tune BGE-large on a new set of tasks with different prompts to create a variant that is perhaps better at asymmetric tasks (question vs. answer).
- Multilingual BGE (BGE-m3): The BGE team released bge-m3, whose “M3” refers to multi-linguality, multi-functionality, and multi-granularity (⇒). The model supports three retrieval methods: dense retrieval (single-vector embeddings like standard BGE), sparse retrieval (likely integrating something like unigram features or a SPLADE-style technique), and multi-vector retrieval (outputting multiple vectors per document to capture different aspects, akin to ColBERT) (⇒) (⇒). It also extends to multiple languages (presumably using a multilingual training corpus). BGE-m3 can handle inputs of up to 8,192 tokens (far beyond the 512 of standard BGE) (⇒), making it suitable for long documents. This model was likely fine-tuned from a base model on a mixture of tasks in multiple languages. Its existence demonstrates how BGE’s framework can be stretched to cover new capabilities like hybrid retrieval. BGE-m3 may not strictly be “fine-tuned from BGE” (it could have been trained similarly from scratch with modifications), but conceptually it is part of the family.
- Cross-Encoder Re-rankers: The BGE authors also released BGE-reranker-base and BGE-reranker-large (⇒) (⇒). These are cross-encoder models (BERT-style models that take a (query, passage) pair and output a relevance score). They were fine-tuned (likely starting from the same backbone) on ranking data with a cross-entropy objective to complement the bi-encoder. While not derived from the bi-encoder weights, they are part of the ecosystem and carry the BGE label. The idea is to retrieve the top-100 candidates with BGE embeddings, then use BGE-reranker to score those pairs more precisely (⇒); a sketch of this two-stage setup follows this list. This retrieve-then-rerank design is the standard two-stage pattern in modern search systems. So in a broad sense, the BGE model family now includes these fine-tuned cross-encoders for higher accuracy at the reranking stage (⇒).
- Small Variants and Distillations: BGE-small itself might be a distilled model (possibly distilled from BGE-base by the authors). In the future, one could imagine even smaller distilled versions for mobile use. These would be derived by compressing BGE’s knowledge. Though not released yet, the technique is plausible and might be done by third parties.
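The sketch below illustrates the retrieve-then-rerank pattern mentioned in the cross-encoder item above, using the FlagEmbedding package. It is a sketch under the assumption that `FlagModel` and `FlagReranker` behave as described in the project README; the query and corpus are invented, and a real system would retrieve from a vector index rather than scoring the whole corpus in memory.

```python
from FlagEmbedding import FlagModel, FlagReranker

retriever = FlagModel("BAAI/bge-base-en-v1.5")     # bi-encoder for first-stage retrieval
reranker = FlagReranker("BAAI/bge-reranker-base")  # cross-encoder for second-stage scoring

query = "How does BGE handle asymmetric retrieval?"
corpus = [
    "BGE prepends an instruction to queries during retrieval fine-tuning.",
    "RetroMAE is an auto-encoding pre-training objective for retrieval encoders.",
    "Cross-encoders read the query and the passage jointly before scoring.",
]

# Stage 1: embed everything and shortlist the top-k passages by inner product.
q = retriever.encode_queries([query])
d = retriever.encode(corpus)
top_k = (q @ d.T)[0].argsort()[::-1][:2]

# Stage 2: rerank the shortlist with the cross-encoder for a more precise ordering.
pairs = [[query, corpus[i]] for i in top_k]
scores = reranker.compute_score(pairs)
for score, idx in sorted(zip(scores, top_k), reverse=True):
    print(f"{score:.3f}  {corpus[idx]}")
```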
8.2: Leaderboard Top-Performing Variations and Use Cases
On the MTEB leaderboard (and similar benchmarks), after BGE’s release, a variety of models jostled for the top spot. As of late 2024, some of the top entries include:
- BGE v1.5 models: These remain at or near the top for many categories. For example, BGE-large-en-v1.5 was #1 on overall MTEB for a considerable time (⇒) (⇒), and BGE-base-en-v1.5 also ranked extremely high for its size.
- LLM-derived Embeddings: New entrants such as Gecko (Google’s compact embedding model distilled from LLMs) and the gte-Qwen models (embeddings built on Alibaba’s Qwen-7B) have appeared. These models leverage large language models (roughly 7B to 70B parameters) fine-tuned to produce embeddings (with techniques like using the LLM’s hidden states as embeddings). They have started to surpass BGE on some benchmarks, especially multilingual or reasoning-heavy similarity tasks (⇒) (⇒); gte-Qwen-7B-instruct, for instance, is mentioned as a top method in the review (⇒). These models typically require much more compute, but they indicate a trend of scaling up embedding models using LLMs. BGE’s authors have a response in kind (with llm-embedder, possibly using Baichuan or others, though details are unknown).
- Baichuan Text Embeddings: Specifically in Chinese, as noted, Baichuan-Embed (based on the Baichuan-13B LLM) took the #1 spot on C-MTEB around January 2024 (⇒). It effectively uses the representational power of a 13B bilingual LLM to create embeddings, presumably fine-tuned on data similar to C-MTP. Its lead of a few points on C-MTEB shows that with enough model capacity, one can beat BGE’s ~340M parameters. The trade-off is inference cost: not everyone can run a 13B model for embeddings. BGE therefore remains highly relevant for cost-effective solutions, while Baichuan-Embed might be used by those requiring absolute top accuracy and having the resources (or access via an API).
- GISTEmbed: This fine-tuned BGE-base (with an improved training procedure) has shown slightly better results than the original BGE-base. Looking at specific tasks, GISTEmbed v0 improved especially on classification and clustering by using better negatives (⇒), and it may rank slightly above BGE-base in those categories on MTEB. Its use case is essentially a plug-and-play replacement for BGE when one wants a bit more accuracy and is comfortable using the fine-tuned weights (which are also open-source).
- Multilingual Embeddings: The top of the multilingual benchmark includes models like Multilingual-E5-large and LASER2. BGE’s multilingual model (m3) is likely competitive, though it is not clear whether it topped any multilingual leaderboard; if not #1, it is likely among the top 5 for tasks like bilingual retrieval. The use case here is cross-lingual search or sentence alignment (one embedding space covering many languages). There are also hybrid models, for example M3E (a Chinese-English embedding model from another group) – in Table 3 of the BGE paper, M3E-large was a baseline (⇑).
- Other variations: There are also smaller niche models such as modernbert-embed-large by LightOn (which is included in FastEmbed’s model list) (⇒). It builds on the ModernBERT encoder and performs strongly for its size, appearing on some leaderboards as well. BGE-large is still generally stronger, but these variations often target specific trade-offs or use novel training tweaks.
In terms of use cases:
- For monolingual English retrieval, BGE-large or E5-large are commonly used depending on preference; some have switched to larger instruction-tuned or API-based models for maximum accuracy (e.g., Instructor-XL or text-embedding-ada-002).
- For Chinese applications, up until Baichuan’s release, BGE-large-zh was the go-to. Now Baichuan-Embed might be used in high-end applications, but BGE-zh is still prevalent in open-source Chinese search engines and chatbots because it’s easier to run.
- For multilingual web search, a combination is plausible: one might use BGE for English and another model for other languages, or a single multilingual model for simplicity. BGE-m3 aims to handle this in a unified way, which benefits projects needing both English and Chinese (and possibly more languages).
- Custom modifications: Some community projects have combined BGE with other models – for example, an ensemble embedding that concatenates BGE’s vector with a smaller vector from another model to capture different nuances (a toy sketch of this idea follows this list). This is experimental, but it shows how BGE can be one component in a larger system.
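As a toy illustration of that ensemble idea, the sketch below concatenates normalized vectors from two off-the-shelf models via sentence-transformers; the model pairing and the equal weighting are arbitrary assumptions, not a recipe from the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Two independently trained encoders (the choice of the second model is arbitrary here).
bge = SentenceTransformer("BAAI/bge-base-en-v1.5")
minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = ["C-Pack bundles data, a benchmark, and models for Chinese embeddings."]

# Normalize each embedding before concatenation so neither model dominates, then
# rescale so the combined 768 + 384 = 1152-dimensional vector stays unit-norm;
# its dot product is then the average of the two per-model cosine similarities.
a = bge.encode(texts, normalize_embeddings=True)
b = minilm.encode(texts, normalize_embeddings=True)
combined = np.concatenate([a, b], axis=1) / np.sqrt(2)

print(combined.shape)  # (1, 1152)
```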
8.3: Categorization by Language, Domain, and Modality
We can categorize the derived/related models as follows:
- Language-specific: BGE itself provided English and Chinese models. Derived models include those two plus multilingual variants. Another language-specific model in this space is Luotuo Embedding (“Luotuo” means “camel” in Chinese) – an earlier Chinese embedding model fine-tuned from text-davinci (GPT-3) embeddings. BGE outperformed Luotuo (⇑), but it is an example of a Chinese-focused model in the same space; Baichuan-Embed is now a Chinese-focused model at a larger scale.
- Domain-specific: PhysBERT (physics) today, with legal, biomedical, and financial embedding models likely to follow. These usually start from BGE or another general model and fine-tune on domain data. Their names often reflect the domain, and they aim for state-of-the-art performance in their niche (as PhysBERT did by beating general models on physics tasks by a large margin (⇒) (⇒)).
- Custom tasks: Some BGE-derived models might incorporate supervised signals beyond text pairs. For instance, an experimental model might include image-text pairs to create a multi-modal embedding (embedding both text and images in the same space). The FastEmbed model list includes `nomic-ai` models that pair text and vision embeddings (⇒) (⇒). While not derived from BGE, this points to a trend of aligning modalities; one could imagine fine-tuning BGE on image captions to align with CLIP’s space.
- Model size scaling: So far BGE’s largest model is “large” (~340M parameters). A natural direction is scaling to a BGE-XL or XXL (billions of parameters). The review paper’s taxonomy suggests a “4th era” of embeddings including LLM-based ones (⇒). If BAAI or others train a 2B-parameter bi-encoder from scratch using the C-Pack recipe, that could be seen as a scaled-up BGE. Alternatively, converting an existing LLM (as the Instructor models did with T5-based GTR encoders) could produce a “BGE++” model. No “BGE-XL” has been announced publicly yet, but given resources it is plausible in the future.
- Ensembles or Two-tower hybrids: Some innovations use two towers with different representations (like a dense + sparse combination). BGE-m3 fits here as it has multiple retrieval modes. Another concept is combining two embedding models – e.g., one could take the average of BGE and E5 embeddings to hedge differences. While not formalized, these could appear in research competitions (ensembles are sometimes used to squeeze extra points on benchmarks).
- Community modifications: Projects like text2vec in China integrated BGE weights (text2vec is a popular open-source Chinese embedding toolkit). The BGE paper compared against text2vec models (⇑), and after BGE’s release, text2vec’s authors fine-tuned BGE on their data to create improved models. So the lines blur – community contributions can yield “BGE-derived” models under different names.
In summary, BGE has spawned a small ecosystem of derivative models: some directly fine-tuned from it (like GISTEmbed or PhysBERT), some that extend its techniques (like BGE-m3, rerankers), and others that compete and push the envelope (like LLM-based embeddings that were inspired by the multi-task approach). This proliferation indicates a healthy impact – BGE didn’t end the quest for better embeddings, but rather raised the bar and provided a solid foundation for others to build upon.
Sources:
- BGE-m3 description (README.md · BAAI/bge-large-en-v1.5 at main);
- BGE rerankers (README.md · BAAI/bge-large-en-v1.5 at main);
- PhysBERT outperforming general models (⇒);
- BGE vs Luotuo and others (⇑);
- GISTEmbed improvement ([PDF] GISTEmbed: Guided In-sample Selection of Training Negatives for ...).
9: Adoption of BGE in Open-Source and Commercial Settings
The BGE model family, thanks to its open availability and top-tier performance, has seen widespread adoption across both open-source projects and commercial applications. This section outlines how BGE is being used in practice: from search engines and chatbots to enterprise solutions, highlighting notable projects and the broader significance of this adoption.
9.1: Usage in Search Engines and Retrieval Systems
Many search and retrieval systems have integrated BGE as the vector encoder to power semantic search. For example, open-source search engines like Jina or Haystack (deepset) allow plugging in custom embeddings – users frequently choose BGE for its strong semantic understanding. In the Chinese search context, BGE’s introduction was a game-changer: services that had been using older models (like SBERT or SimCSE) switched to BGE to improve search relevancy for Chinese queries by a large margin (given BGE’s +10% lead on C-MTEB (⇑)). Within a month of release, BGE-zh was adopted in several Chinese QA communities and search demos on GitHub, because it was the new state-of-the-art that was also publicly available (unlike some previous best models that were closed). In English, any project implementing a semantic wiki search or knowledge base retrieval might use BGE embeddings to index documents. There are blog posts on using BGE-large-en in combination with vector databases to build Q&A over documents, recommending it for its accuracy in retrieving the right passage (some Medium articles explicitly discuss “state-of-the-art BGE embeddings for retrieval augmented generation” (⇒) (⇒)). Even some meta-search products or plugins (like for search in Notion or Obsidian) have community versions that use BGE to embed notes and find semantically related notes.
9.2: Adoption in Chatbots and RAG Pipelines
Retrieval-Augmented Generation (RAG) is a common approach for building chatbots that can access knowledge. BGE has been embraced as the retrieval component in many RAG pipelines. For instance, a chatbot that answers questions about a set of documents will use BGE to embed the user’s question and the documents, find the most relevant text, and feed it to an LLM. Because BGE yields very relevant results, it improves the quality of the answers the LLM gives (garbage in, garbage out, as they say). The integration of BGE into LangChain (⇒) is evidence of this adoption: LangChain is a popular library for chaining LLMs with tools, and by providing BGE as an embedding option, they effectively encouraged thousands of developers to try BGE in their chatbots. BGE is also used in open-source chatbots like some forks of LLaMA-based chat models, where they incorporate knowledge retrieval. We can point to projects like “LLM with Vector Index” on GitHub: many of these now suggest using BGE or E5 as the embedding model. Commercial chatbot platforms (e.g., those building domain-specific assistants) have also tested BGE. While specifics are often not public, anecdotal feedback on forums indicates that BGE was adopted in place of OpenAI’s API in some enterprise prototypes to avoid costs and dependency, with comparable results.
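To make the pattern concrete, here is an illustrative (not authoritative) sketch of the retrieval half of such a pipeline, using the LangChain integration discussed earlier together with a FAISS index; the module paths match older LangChain releases (newer ones moved these classes to `langchain_community`), and the documents and query are invented.

```python
# pip install langchain faiss-cpu sentence-transformers
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import FAISS

emb = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)

docs = [
    "BGE was released by BAAI together with the C-Pack resources.",
    "C-MTEB is a benchmark covering many Chinese embedding tasks.",
    "FastEmbed runs BGE with ONNX Runtime for fast CPU inference.",
]
db = FAISS.from_texts(docs, emb)

# The retrieved passage(s) would then be inserted into the LLM prompt as context.
hits = db.similarity_search("Who released BGE?", k=1)
print(hits[0].page_content)
```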
9.3: Enterprise and Industrial Adoption
In enterprise settings, companies that need search or recommendation have begun to evaluate BGE. The fact that AWS’s Machine Learning blog featured a tutorial on fine-tuning and deploying BGE (⇒) (⇒) suggests that large cloud providers see interest from customers. Enterprises dealing with bilingual information (English-Chinese) found BGE appealing since it has top models for both languages, which can be used in parallel. Additionally, because BGE is open-source (Apache license, presumably), it fits well for companies concerned about licenses of other models (some alternatives have more restrictive licenses). Another area is e-commerce: for product search and recommendation. BGE embeddings can capture semantic relationships in product descriptions and user queries. JD.com’s and Alibaba’s research labs were co-authors in C-Pack (⇒), showing interest from industry in China’s e-commerce sector. It would not be surprising if their internal search engines adopted BGE after the research phase. In the West, an enterprise might incorporate BGE via a service – for example, ElasticSearch’s vector search could be fed BGE embeddings to improve result quality for corporate document search. Some enterprises might still stick to vendor-provided models (like Azure offers OpenAI embeddings), but those looking for on-premise solutions turn to BGE or similar.
9.4: Notable Open-Source Projects Incorporating BGE
- Qdrant (Vector DB): We’ve discussed Qdrant’s FastEmbed, which heavily features BGE. Qdrant even maintains models on Hugging Face (e.g., `Qdrant/clip-ViT-B-32-text`) and includes BGE in its docs (⇒). So Qdrant as a project actively promotes BGE usage for vector search, which is notable because Qdrant is an emerging standard for open-source vector similarity search.
- Milvus (Zilliz): Their tutorials refer to MTEB and often list top models, including BGE, for users to try out when building semantic search.
- Weaviate: They have “text2vec” modules for various models; a community pull request added BGE support for text2vec (Weaviate already had SBERT, etc.). One can now use BGE by simply specifying the model path.
- John Snow Labs SparkNLP: They integrated BGE models as pre-trained embeddings in version 5.0.2 (⇒). Any SparkNLP user can call `embeddings = BertEmbeddings.pretrained("bge_small", "en")` to get a Spark DataFrame column of BGE embeddings. This is significant because SparkNLP is used in many industry pipelines for large-scale NLP.
- Open-Source Repositories and Benchmarks: The MTEB library itself likely includes BGE in its benchmark harness, so researchers running MTEB evaluations (via the `mteb` Python package (⇒)) can directly evaluate BGE models, which they often do to compare against new models (a minimal evaluation sketch follows this list).
- Reddit and Community Tools: The developer community on forums like Reddit’s r/LangChain or r/MachineLearning often shares scripts and tools. A Reddit thread asking “What embedding model do you use?” saw many respond that BGE is a favorite for local deployments, citing its quality and open nature (⇒). Additionally, YouTube tutorials have emerged (e.g., “How to use BGE Embeddings for LangChain and RAG” (⇒)), spreading knowledge of BGE in the practitioner community.
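As a minimal illustration of the `mteb` item above, the sketch below evaluates a BGE checkpoint on two English tasks. It assumes the older `MTEB(tasks=[...])` interface and sentence-transformers loading; task names and the API may differ in newer mteb releases.

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Pick a couple of lightweight English tasks rather than the full 56-task suite.
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results/bge-small-en-v1.5")
print(results)
```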
9.5: Commercial Usage as Indicator of Broader Adoption
While we focus on open solutions, it’s worth noting that some commercial products implicitly use similar approaches. For instance, if a company offers a “knowledge bot” and they want an on-prem solution for a client, they might bundle something like BGE under the hood. We don’t have direct confirmation of specific proprietary deployments, but the interest from cloud providers (AWS) and the involvement of industry players in the research strongly indicate that BGE or its methods are being leveraged in production systems. The key point is that BGE lowered the barrier for companies to have SOTA embeddings without needing to develop them in-house or pay for expensive APIs. This democratization means more widespread uptake.
In conclusion, BGE’s adoption is broad and rapid: It has become part of the default toolkit for semantic search in open-source, is integrated in major frameworks (LangChain, SparkNLP), and is influencing enterprise solutions (with cloud tutorials and likely behind-the-scenes use). This breadth of adoption underscores BGE’s impact beyond just academic benchmarks – it’s solving real problems in the wild, from helping users find relevant information more effectively to powering smarter AI assistants.
Sources:
- BGE in retrieval augmented setups (State-of-art retrieval-augmented LLM: bge-large-en-v1.5 | by Novita AI);
- LangChain integration confirming usage in chatbots (README.md · BAAI/bge-large-en-v1.5 at main);
- SparkNLP model listing (BAAI general embedding English (bge_small) | bge_small | Spark NLP 5.0.2);
- AWS blog (enterprise interest) (Fine-tune a BGE embedding model using synthetic data from Amazon Bedrock | AWS Machine Learning Blog) (Fine-tune a BGE embedding model using synthetic data from Amazon Bedrock | AWS Machine Learning Blog).
10: Comparison of BGE with Other Embedding Models
To understand BGE’s strengths and trade-offs, it is useful to compare it with several other leading embedding models: Contriever, E5, OpenAI’s embeddings (Ada-002), and GTR (Google’s T5-based retriever), among others. We summarize how BGE stacks up against each in terms of approach, performance, and practical considerations.
Contriever (Facebook, 2021): Contriever was an unsupervised model (no supervised fine-tuning) that achieved good zero-shot results via contrastive learning on a large corpus (⇑). Compared to BGE, Contriever is lighter (BERT-base architecture) and doesn’t require labeled data. However, BGE’s training recipe with added supervised stages gave it a clear edge in performance. On general benchmarks, BGE-base surpasses Contriever by a wide margin on tasks beyond pure retrieval (and even in retrieval, BGE’s use of massive data and instructions yields better results). One strength of Contriever was domain-agnosticism, but BGE managed to retain generality through its multi-task training. In practice, Contriever might still be preferred if one has no labeled data at all and wants simplicity. But BGE’s release effectively rendered pure unsupervised models like Contriever less competitive, since BGE can be used off-the-shelf and performs better on almost all metrics (⇑) (⇑). Contriever’s embedding space is also somewhat different: since it wasn’t instruction-tuned, it treats all text the same, whereas BGE’s query-side instruction prompt gives it an advantage on asymmetric retrieval tasks like question-to-article, where Contriever might not differentiate queries from documents. All in all, BGE can be seen as the “next generation” beyond Contriever, incorporating supervised signals for a robust universal model.
E5 (EmbEddings from bidirEctional Encoder rEpresentations, 2022): E5 is one of the models closest to BGE in spirit. E5 was trained contrastively on the Colossal Clean Pairs dataset (CCPairs, ~270M pairs) (⇒) (⇒), followed by multi-task fine-tuning, and was built on BERT-style encoders. BGE and E5 both leverage text prefixes or instructions during training (E5 uses “query: …” and “passage: …” prefixes; BGE uses a retrieval instruction for queries) and both target general-purpose use. Performance-wise, BGE-large and E5-large are very close. The original E5 paper reported SOTA on MTEB before BGE came out, but BGE slightly improved upon it (⇑) (⇑). BGE’s authors specifically contrasted C-MTP with E5’s CCPairs: BGE’s Chinese data was novel, and for English they matched or exceeded E5’s scale and quality through careful filtering (⇒) (⇒). One difference: E5 has a multilingual variant (Multilingual-E5-large covers many languages), whereas BGE trained separate models per language (with a multilingual option arriving later). If a user needs many languages in one model, E5 or LASER might be chosen over BGE. However, BGE’s English model likely has a slight edge on English tasks thanks to its focused training and data curation. Licensing is not a differentiator: E5 was released by Microsoft under an open license, similar to BGE. In summary, BGE and E5 are both top-tier, with BGE marginally ahead on average (⇑). BGE’s key strengths over E5 were Chinese support and being a unified package with data and a benchmark (C-Pack). For someone building a system today, the choice between E5 and BGE may come down to language requirements and slight performance nuances; they are in the same class of model. It’s worth noting that E5’s prefix-based training helped pave the way, and BGE confirmed that approach while adding its own improvements (such as RetroMAE pre-training, which E5 did not use, and extremely large in-batch negatives (up to 19,200), which E5 did not necessarily match).
OpenAI Embeddings (Ada-002, 2022): OpenAI’s text-embedding-ada-002 is a powerful embedding model accessible via API. Quality-wise, Ada-002 is strong, but BGE demonstrated superior performance on many benchmarks (⇑) (⇑). For example, in the BGE paper’s Chinese evaluations, Ada-002’s average score was ~53 vs BGE’s ~63 (⇑). On English tasks, independent evaluations (like MTEB) show Ada-002 is competitive but still slightly below models like BGE-large and E5-large (⇑) (⇑). The advantage of Ada-002 is that it’s a managed service – you send text, get embedding – and it might handle very long text (up to 8192 tokens). But it’s closed source, and cost can be significant if embedding millions of texts. BGE’s clear strength is open-source and cost-free deployment, allowing organizations to avoid API usage. Additionally, BGE can be fine-tuned or modified by anyone, whereas Ada-002 cannot. In terms of weaknesses, BGE requires running a model which might be heavy for some (Ada’s inference is handled by OpenAI’s optimized systems). However, with optimized runtimes, BGE can run efficiently as discussed. One more point: Ada-002 might have been trained on proprietary data and possibly with multi-task objectives (OpenAI hasn’t detailed it fully), and some have speculated it includes knowledge from GPT-3.5. BGE is purely learned from publicly available data (C-MTP). For trust and reproducibility, BGE wins. A trade-off: If you need quick multilingual embedding and don’t mind cost, Ada-002 covers 100+ languages reasonably well. BGE would need separate models or a multilingual variant to cover as many languages. All considered, BGE is often cited as the open alternative to OpenAI’s embeddings – with experiments confirming that using BGE yields similar or better quality in retrieval tasks (⇒), making it a compelling choice for those who want independence from proprietary services.
GTR (Generalizable T5 Retriever, Google, 2021): GTR was part of a trend using sequence-to-sequence LMs (T5) to produce embeddings. For example, GTR-XXL was a T5-XXL (4B parameters) model fine-tuned on retrieval data. GTR’s advantage was leveraging the power of larger models and possibly better zero-shot generalization. However, GTR models are huge and not easy to deploy (the XXL variant especially). BGE-large (340M) managed to outperform even GTR-XL (1.2B) on many benchmarks (⇑) (⇑), which is impressive. GTR’s performance on MTEB was strong but since its release, models like BGE and E5 have caught up and even surpassed it by more efficient training. One strength of GTR is it came in sizes from base to XXL; the base (110M) was okay but nothing special, whereas the larger ones were at the time SOTA in some QA retrieval. But training and using a 4B model is challenging for most – which limited GTR’s practical adoption outside of Google. BGE showed that with the right data and training, you can get near state-of-the-art with a base/large size model, making it accessible. GTR might still hold an edge in some niche scenarios, for instance cross-lingual tasks if it was trained on multilingual data (there was a GTR-multilingual variant as well). But BGE’s approach was more replicable. Another difference: GTR uses an encoder-decoder (T5) but only the encoder output as embedding, whereas BGE is encoder-only. The encoder-only (BERT) approach is simpler and possibly more efficient. In practice, BGE has largely overtaken GTR in open evaluations, showing better average performance (⇑). So, the main consideration now is if one were willing to deploy very large models, one might consider using an LLM (like text-davinci or a distilled variant) for embeddings, but GTR specifically is not commonly used now that smaller models like BGE reach similar quality.
Strengths and Weaknesses Recap:
- BGE’s strengths:
  - Excellent all-around performance on diverse tasks (thanks to multi-task training), and especially strong in retrieval and STS (⇑).
  - Open-source and shipped with data and a benchmark, fostering reproducibility and further innovation.
  - Supports Chinese (and English) very well, filling a gap where many models focused only on English.
  - Efficient architecture (BERT-based) that can be quantized and accelerated easily.
  - Instruction tuning makes it flexible for different use cases (symmetric vs. asymmetric).
- BGE’s weaknesses:
  - Not inherently multilingual beyond Chinese and English; it requires separate models or multilingual training (the multilingual version exists but is newer).
  - Context length is limited to 512 tokens in the original versions, shorter than some newer LLM-based models that can embed longer texts (though BGE-m3 later extended this to 8,192 tokens).
  - Does not capture factual knowledge the way an LLM would; e.g., for tasks like question answering via embeddings, an LLM-based embedding might encode some common sense that the smaller BGE does not. This is a minor issue, since retrieval relies on external knowledge anyway.
  - Comparatively weaker on tasks like summarization or complex semantic reasoning (the review noted that summarization tasks did not improve much with new embeddings) (⇒). If one’s use case involves embedding very long texts or comparing summaries, an OpenAI model or an LLM-based embedder might do better.
  - Compared to OpenAI’s proprietary model, BGE does not cover as many languages out of the box (Ada covers ~100 languages to some degree; BGE needed parallel data to train its multilingual variant).
  - Size vs. quality: while BGE-large is great, some tasks can still benefit from extremely large models (e.g., embeddings derived from much larger LLMs might correlate even better on certain semantic subtleties, though these are not as easily available).
In a nutshell, BGE holds its own or leads against Contriever, E5, and GTR in most scenarios, and provides an open alternative to OpenAI’s embeddings with competitive quality. It strikes a sweet spot in the trade-off space: not too large, but very effective due to data and training strategy. It has essentially become a reference model in this space, meaning any new embedding model is compared against BGE (and E5) to judge improvement. Each alternative model has some unique aspect (Contriever simplicity, E5 multilingual, OpenAI convenience, GTR brute force size), but BGE offers one of the best overall packages of high accuracy, versatility, and accessibility.
Sources:
- Comparison context from BGE related work (⇑) (⇑);
- BGE vs Ada performance (⇑);
- E5 data vs C-MTP (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark);
- Summarization limitation (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark).
11: Future Directions for BGE and Embeddings Research
The field of text embeddings continues to evolve rapidly. Building on BGE’s foundation, several future directions can be anticipated, both for the BGE model family specifically and for embedding research in general.
11.1: Multi-lingual and Cross-lingual Embeddings
One clear direction is expanding multilingual capabilities. BGE demonstrated excellence in English and Chinese, but the ultimate goal is a single embedding model that works for all languages. Ongoing developments are likely focusing on training truly multilingual embeddings that inherit BGE’s strengths. This might involve scaling the training data to cover dozens of languages, and mixing in multilingual tasks (e.g., bitext retrieval, cross-lingual STS). The BGE team’s release of a multilingual model (bge-m3) is a step in this direction, but there is room to grow. Researchers might combine BGE’s C-MTP with datasets like LAION-5B (for cross-modal and cross-lingual) or MTEB’s multilingual sets to train an encoder that can embed text from any language into one space. An example of this trajectory is the model LaBSE (2020) which handled 100 languages; future BGE versions could aim for “BGE-Universal” that matches LaBSE’s coverage but with far better performance due to modern techniques. Achieving true multilinguality may require clever training to avoid diluting performance on high-resource languages – methods like language-specific adapters or meta-learning could be explored. Additionally, cross-lingual alignment (ensuring “苹果” and “apple” end up close in vector space for instance) will be important; this might use parallel data and translate-based training objectives. As global applications (search, content understanding) demand multi-language support, embedding models must rise to the challenge, and BGE provides a template that can be extended.
11.2: Scaling Model Size and Training Data
Another direction is scaling up – in terms of model size, data size, and compute. BGE showed that a 340M model with 100M pairs can reach SOTA, but what about using billions of training pairs or much larger models? One possibility is a “BGE-XL” trained on vastly more data (for instance, incorporating the latest CommonCrawl or whole Wikipedia in multiple languages as unlabeled pairs, and aggregating more labeled data from sources like SNLI, PAWS, etc.). The scaling law likely suggests further improvements: more data diversity could cover even more semantic phenomena. There’s also interest in scaling model parameters: training a 1B+ parameter bi-encoder. However, the challenge is that bi-encoders don’t benefit as straightforwardly from scale as generative models do, unless the data is equally scaled. Still, Google’s GTR experiment with 4B model saw gains, so a 2B parameter BGE might push benchmarks a bit higher, especially on nuanced tasks or under zero-shot settings. The BGE authors themselves noted the importance of scaling model size for generality (⇑). They referenced studies concluding that larger text encoders are more generalizable (⇑), so we can expect them or others to test bigger architectures. Perhaps ensembles of encoders (multiple heads focusing on different aspects of text) could also mimic having a larger capacity without a single huge model. Of course, scaling brings computational cost issues, so research will also look at efficiency techniques (mixture-of-experts, distillation, etc.) to maintain speed. On the data front, an interesting future angle is dynamic data expansion: using LLMs to generate new training pairs (much like InstructGPT did for instructions). There’s a hint of that in E5 and others, but BGE’s pipeline could incorporate an LLM to synthesize difficult query-document pairs to further fine-tune the model (for example, generate tricky paraphrases or hard negatives beyond what was mined). This blends into retrieval augmentation itself: using an existing model to retrieve hard negatives from a huge corpus (not just in-batch) and iteratively improving. The negative mining strategy could be advanced – e.g., GISTEmbed’s method of guided negatives (⇒) may become mainstream in training large-scale embeddings.
11.3: New Benchmarks and Comprehensive Evaluation
Future research will likely introduce more comprehensive benchmarks to address current limitations. As the review paper noted, current benchmarks lack diversity in domains like finance, health, etc., and in input lengths (⇒). We might see an “MTEB 2.0” or similar that includes long-document understanding tasks, multi-turn retrieval (like dialogue-based retrieval), or summarization-oriented evaluations. C-MTEB might also grow, adding more Chinese datasets (the authors mentioned it’s growing (⇑)). For BGE and similar models, performing well on a broader benchmark will be the next test. Embeddings for long texts (like full articles or even book chapters) is a future direction – currently models like BGE can only handle relatively short inputs unless chopped. Approaches like hierarchical embeddings (embedding paragraphs then combining) or extending context length (as done in BGE-m3 with multi-vector output) will be explored. Retrieval-Augmented Generation pipelines might define new metrics: e.g., how an embedding model contributes to QA accuracy of an end-to-end system. Future benchmarks could measure that (which is an extrinsic measure beyond intrinsic similarity metrics). BGE might need to adapt (or spawn variants) to optimize not just cosine similarity on labels, but end task utility (like an embedding that leads a QA system to find answers with highest exact match score).
11.4: Retrieval-Augmented Generation (RAG) and LLM Integration
Embeddings will play a crucial role in RAG and beyond. A future direction is tighter integration of embedding models with LLMs. One concept is training embedding models that are aware of the LLM that will consume their output. For instance, fine-tuning BGE specifically to retrieve passages that maximize an LLM’s performance (not just raw similarity). This could involve differentiable search or using LLM feedback to refine embedding space (a kind of reinforcement learning with an LLM reward). BGE’s authors hint at the role of embeddings in augmenting LLMs (⇑), and future research will likely formalize this. Perhaps an “LLM-augmented BGE” where during training, an LLM tries to answer using a retrieved passage and a loss ensures that the chosen passage embedding is such that the LLM’s answer is correct. Another RAG aspect is on-the-fly personalization: future embeddings might adjust based on user context or preferences, possibly via lightweight fine-tuning or prompts. BGE could be extended with a mechanism to slightly shift embeddings given some conditioning (like user profile or conversation history). Also, as LLMs become more multi-modal, embeddings might too – future “BGE” could embed not just text but also images or structured data into a common space so that an LLM can retrieve any modality. In fact, the concept of universal embedding might expand from “all tasks” to “all data modalities,” where an LLM could query an embedding store that has text, images, code, etc., all indexed by one model or a combination.
11.5: Sustainability and Efficiency
A future focus area is making embedding training and inference more sustainable and cost-effective (⇒). This might involve research into training embeddings with far less labeled data by leveraging self-supervision (similar to what SimCSE did, but better), or by multi-task pre-training that covers most phenomena so that heavy fine-tuning isn’t needed. Techniques like knowledge distillation (from a big teacher model to a small student) will be important to get the benefits of huge models without their cost. There’s also interest in hardware-friendly models – for example, can we train an embedding model that runs efficiently on edge devices (smartphones) to enable on-device semantic search? BGE-small is a step in that direction, but perhaps architectures other than Transformers (like efficient transformers or even dynamic sparse networks) could be explored for embeddings specifically. Approximate similarity techniques (like product quantization) might start being integrated at the model level – a model could output a compressed vector directly to save space.
11.6: Novel Objective Functions and Similarity Measures
Another future direction is rethinking how we train and measure embeddings. The current paradigm largely uses cosine similarity and tasks cast as classification or regression in that space. The review paper suggests exploring new (dis)similarity measures to better mimic human judgment asymmetries (⇒) (⇒). For instance, human perception of sentence similarity isn’t always symmetric or linear; a future objective might train embeddings such that certain asymmetries are preserved (A may contain B’s info but not vice-versa, etc.). This could give rise to embeddings that when used with a particular similarity metric (maybe a learned metric, not just dot product) yield more nuanced results. Also, combining sparse and dense representations more fundamentally (beyond just concatenation) is a research direction – how to get the best of TF-IDF and BGE-like semantic info in one model. Approaches like SPLADE (which outputs sparse signals from a Transformer) are one path, and we might see BGE’s training recipe adapted to produce both a dense and a sparse embedding from the same model (two heads). This would align with the idea of multi-faceted retrieval and could be a future “BGE-v2” feature.
11.7: Beyond Text: Multi-Modal and Knowledge Graph Embeddings
Though text is the focus, embedding research is converging with multi-modal representation. Future general embedding models might incorporate text with images, audio, or structured knowledge. BGE’s framework (unlabeled pre-train, contrastive, fine-tune) could be applied to other modalities or cross-modal pairs (image-caption, etc.). In fact, the notion of a “universal embedding” might extend to hooking into knowledge graphs – embedding not just raw text but entities and their relations, so that the vector space is enriched with explicit knowledge. This could address limitations of text-only models which might conflate different concepts with the same wording. Some researchers are looking into joint embedding spaces where text embeddings and knowledge graph embeddings coexist.
In summary, the future of BGE and embeddings involves: wider language coverage, bigger and smarter models, integrating with LLMs and multi-modal data, more robust benchmarks, and efficiency improvements. BGE has set a high baseline, and the community is likely to build on its open resources to explore these avenues. The next few years could bring “BGE 2.0” or entirely new models that use BGE’s lessons to achieve even more “universal” embeddings, possibly playing an even more central role in AI systems as the connector between unstructured data and intelligent reasoning.
12: Conclusion and Key Takeaways
Summary of BGE’s Impact: Since its release in late 2023, BGE (BAAI General Embeddings) has had a significant impact on the field of text embeddings. It delivered a new state-of-the-art in both English and Chinese embedding tasks, thanks to the comprehensive C-Pack approach that combined large-scale data curation (C-MTP), a broad benchmark (C-MTEB), and a powerful three-stage training method (⇑) (⇑). BGE demonstrated that with the right training recipe – including unsupervised pre-training, massive contrastive learning, and instruction-driven fine-tuning – even relatively moderate-sized models can achieve universal applicability across tasks. This was a notable contribution at a time when many might assume only gigantic LLMs could excel universally. BGE’s success validated the importance of data quality and multi-task learning for embeddings, and it provided the community with both a high-performance model and the resources (data + benchmark) to push further. As a result, BGE quickly became a reference point; new embedding models and techniques are now often compared against BGE to gauge progress (⇑).
Key Contributions of the Original Paper and Adoption: The original BGE paper’s contributions were not just the models, but the entire package – releasing C-MTEB filled a gap for evaluating Chinese embeddings reliably, and C-MTP set a precedent for open-sourcing large curated training datasets (⇑) (⇒). The three-stage training recipe (incorporating RetroMAE and instruction tuning) has influenced other researchers (e.g., the GISTEmbed method building on BGE’s fine-tuning) and could serve as a template for future multilingual or domain-specific embedding training. BGE’s models (large, base, small in two languages) have been widely adopted, indicating that the paper achieved its goal of delivering general-purpose embeddings that are actually used in practice. Integration into tools like Hugging Face, LangChain, and vector databases happened rapidly (⇒) (⇒), showing how quickly the community embraced BGE. This level of adoption is a key takeaway: BGE bridged the gap from academic idea to real-world utility perhaps faster than any previous embedding model, largely due to the team’s focus on releasing everything openly and providing easy-to-use checkpoints. The fact that enterprises (like AWS in their blog) and many open-source projects are using or fine-tuning BGE is testament to its practical value (⇒) (⇒).
BGE vs. Competitors – a New Bar for Universal Embeddings: BGE’s emergence raised the bar relative to prior models like Contriever, Sentence-BERT, and even more recent ones like E5. It proved that instruction tuning plus scale yields truly robust embeddings that work for search, clustering, sentence similarity, and more out-of-the-box. It also highlighted the importance of making models accessible. OpenAI’s Ada-002, while powerful, is behind an API; BGE showed an open model can match or beat it (⇑), empowering users who need on-prem solutions. In the Chinese NLP space, BGE filled an important role by providing a high-quality model where previously either less effective bilingual models or translation-based approaches were used. As a result, we see a flurry of activity in Chinese semantic search and applications leveraging BGE – effectively, BGE helped catalyze progress in that local NLP ecosystem by providing a strong baseline and benchmark to improve upon.
Final Thoughts on the Future of General-Purpose Embeddings: Looking forward, BGE’s legacy will likely persist as the foundation for future improvements. The field is moving toward ever more general embeddings – models that are not only multilingual but multi-modal and deeply integrated with LLM reasoning. We foresee new models that take inspiration from BGE’s training strategy but apply it across languages and modalities, possibly using the power of large language models to further boost embedding quality. The concept of embedding models as knowledge connectors in AI systems (e.g., retrieval modules for question answering) will grow, and techniques to train them (like BGE’s recipe) will be crucial. We also expect benchmarks to evolve, as mentioned, which BGE or its successors will tackle. BGE has shown that focusing on data diversity and training process can yield big gains – a lesson future researchers will heed when designing the next generation of embedding models.
In conclusion, BGE’s release marked a significant milestone: it delivered top-tier performance in an open model, accelerating both research (through its novel data+benchmark contributions) and application (through quick adoption in tools and industry). Its impact since September 2023 is evident in improved search systems, new research building on its methods, and a shift in the community towards more holistic embedding solutions. The key takeaways are that data scale & quality, combined with multi-task learning, are powerful for learning universal embeddings, and that making such models widely available can rapidly advance the state of practice. BGE’s success story paves the way for ongoing innovation in the quest to build truly general, efficient, and powerful text embedding models for all.
Sources:
- Adoption evidence (README.md · BAAI/bge-large-en-v1.5 at main) (BAAI general embedding English (bge_small) | bge_small | Spark NLP 5.0.2);
- BGE performance vs others (⇑) (⇑);
- future directions from review (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark) (Recent advances in universal text embeddings: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark).