Retrieve Anything To Augment Large Language Models

Authors

  • Peitian Zhang
  • Shitao Xiao
  • Zheng Liu
  • Zhicheng Dou
  • Jian-Yun Nie

Abstract

Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between LLMs and such external assistance. However, conventional methods encounter two pressing issues. On one hand, general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse needs of LLMs' retrieval augmentation with one unified embedding model. Training such a unified model is non-trivial, as various retrieval tasks aim to capture distinct semantic relationships, often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and the use of homogeneous in-batch negative sampling. These optimization strategies contribute to the outstanding empirical performance of the LLM-Embedder. Notably, it yields remarkable enhancements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. This project is made publicly available at https://github.com/FlagOpen/FlagEmbedding.

1. Introduction

Figure 1: To confront the threefold inherent boundaries of LLMs on top of retrieval augmentation.

Large language models represent a significant milestone in the development of general artificial intelligence [10, 12, 53]. While these models have demonstrated unprecedented performance across various general tasks, they still face a series of challenges, including issues such as hallucination [7, 21], instruction following [5, 39], and handling long contexts [1, 6]. Many of these challenges can be traced back to the inherent limitations of LLMs, with three critical boundaries deserving attention.

• Knowledge boundary. LLMs are constrained by their knowledge capacity. Due to finite model parameters, they cannot fully internalize the vast body of world knowledge. Moreover, the internal knowledge of LLMs is static and difficult to update in line with the dynamically evolving world. Furthermore, LLMs are predominantly trained on publicly available, high-frequency data, which may result in inaccuracies when dealing with domain-specific or long-tail knowledge.

• Memory boundary. LLMs also grapple with severe limitations in memory, primarily due to restrictions on context length. While advances have been continually made in expanding the maximum context length, it still falls short of achieving the goal of lifelong engagement with human users. Additionally, both the training and deployment of LLMs with extended context can be prohibitively expensive in computation and storage, making it impractical to significantly expand their memory.

• Capability boundary. LLMs' capabilities are constrained in terms of action and autonomy. Firstly, they are limited to the 'language space' and cannot meaningfully interact with the physical world. Secondly, these models heavily rely on human guidance, requiring clear user instructions and appropriate demonstration examples to perform specific tasks effectively.

The above inherent boundaries cannot be effectively addressed by LLMs alone. To overcome these limitations, external assistance is sought through the process known as retrieval-augmented generation [9, 15, 20, 27]. Retrievers play a crucial role in connecting LLMs with the necessary external components, enabling LLMs to accomplish various downstream tasks (see Figure 1). In this context, several common types of retrievers have been designed, each tailored to fulfill a distinct role in enhancing LLMs:

• Knowledge Retriever: This type of retriever provides external knowledge to support LLMs in tackling knowledge-intensive tasks [24, 27, 40] .

• Memory Retriever: Memory retrievers collect information that extends beyond the immediate context, assisting in the generation of lengthy sequences [8, 49, 59] .

• Tool Retriever: Tool retrievers are responsible for selecting appropriate tools, allowing LLMs to interact effectively with the physical world [42, 43, 50] .

• Example Retriever: This retriever locates pre-cached demonstration examples, from which LLM prompts can be automatically generated to facilitate in-context learning [32, 58] .

Given the critical importance of connecting LLMs with the external world, it is imperative to optimize the retrievers' performance across the various tasks. Presently, the effectiveness of retrieval systems heavily relies on the quality of embeddings [18, 24, 47, 65]. Consequently, the optimization challenge centers around the learning of the embedding model. Historically, two common approaches have been pursued. The first approach focuses on developing task-specific models, where the embeddings are tailored for each specific application scenario, such as question answering [68] or in-context learning [58]. While this approach leads to competitive performance within each scenario, it lacks generality and versatility across different contexts. In contrast, the second approach resorts to general-purpose embedding models [25, 40, 41]. The general embeddings aim to be universally applicable to various tasks [18, 57, 62]. However, the current methods are not properly optimized for the specific requirements of retrieval augmentation in LLMs. This limitation significantly hampers their empirical performance in the corresponding tasks.

In this work, we propose LLM-Embedder, a unified embedding model designed to address the primary retrieval augmentation needs of LLMs. Developing such a unified model presents significant challenges. Firstly, the embedding model must optimize its final retrieval augmentation impact for LLMs rather than focusing solely on intermediate retrieval results. Secondly, the diverse retrieval tasks aim to capture distinct semantic relationships, whose impacts can be subject to mutual interference. To address both challenges, we optimize our training methodology in the following ways.

• Reward from LLM. To train the LLM-Embedder, we utilize a combination of labels from various sources. In addition to the native hard labels from the original datasets, we leverage rewards obtained from the LLM's output. A retrieval candidate is assigned a higher reward if it substantially improves the LLM's final performance. These rewards are considered soft labels and are learned via knowledge distillation by the embedding model.

• Stabilized distillation. Given the diversity of training data, the LLM's output can exhibit significant fluctuations. In some cases, the output scores may be distributed too closely or be polarized, making it challenging to assess the fine-grained quality of candidates. To mitigate this issue, we introduce stabilized distillation, which jointly incorporates soft reward-based labels and hard ranking-based labels and thereby significantly improves the distillation effect.

• Instruction based fine-tuning. We curate a diverse training dataset comprising a wide variety of tasks closely related to the retrieval augmentation for LLMs. To harmonize the training impact across different data sources, we take advantage of instruction based fine-tuning, where task-specific prompts are used to differentiate each individual task [4, 51] .

• Homogeneous in-batch negative sampling. In-batch negative sampling is a common practice to introduce a large number of negative samples [24, 44] . However, one potential drawback in our context is that negative samples shared across different tasks (i.e. heterogeneous negatives) may be less effective in discriminating semantic relationships for a specific context. To mitigate this issue, we construct each mini-batch using training data from the same tasks, ensuring that the in-batch negatives are homogeneous and contribute effectively to the discriminative power of the embedding.

To summarize, our work makes significant contributions in the following ways.

• LLM-Embedder: We introduce LLM-Embedder, a novel embedding model designed to bridge LLMs with the external world. To the best of our knowledge, LLM-Embedder is the first of its kind, offering comprehensive support for all key facets of LLMs' retrieval augmentation.

• Systematic Optimization: We systematically optimize LLM-Embedder across multiple dimensions, including reward formulation, knowledge distillation, instruction based fine-tuning, and negative sampling, which ensures the effectiveness of the proposed model.

• Empirical Validation: We verify the effectiveness of LLM-Embedder with comprehensive experiments. Our model outperforms the existing embedding models, significantly amplifying the impact of retrieval augmentation on various critical aspects of LLMs, such as knowledge enhancement, long-context modeling, and instruction following.

2. LLM-Embedder

The introduction of LLM-Embedder is partitioned into the following three parts: 1) the curation of training data, 2) the training methodology, 3) the retrieval augmentation of LLMs.

2.1 Training Data

LLM-Embedder is intended to serve as a unified model for the retrieval augmentation of LLMs. To fulfill this objective, we assemble a diverse training dataset from the following tasks. 1) Question Answering. We utilize MSMARCO [38] and Natural Questions [26] to establish the model's knowledge retrieval capability. 2) Conversational Search. The QReCC dataset [2] is employed to further improve the model's information-seeking capability in the conversational context. 3) Tool Learning. The ToolLLM dataset [43] is used to learn the selection of appropriate tools in the tool-using context. 4) Instruction Tuning. To retrieve useful demonstration examples for in-context learning, we re-purpose FLAN [60] and UPRISE [11], which are originally designed for instruction tuning. 5) Generation. The model is trained to extract valuable historical information (i.e. memory) based on a long conversation dataset, Multi-Session Chat [66], as well as long-range language modeling datasets, including Books3 [14], ArXiv [14], and CodeParrot [54]. These datasets can be grouped into two types based on the availability of labels.

• Labeled data. The datasets for the first three types of tasks are composed of pairwise texts with hard-coded labels. For the question answering datasets (MSMARCO, NQ), each data instance consists of a query and the source passage of the answer, denoted as (query, passage). For the conversational search dataset (QReCC), each data instance is made up of a conversational query and the source passage of the answer, denoted as (conversation, passage). For the tool learning dataset (ToolLLM), each data instance includes an instruction and the description of the needed tool, denoted as (instruction, tool description).

• Non-labeled data. In contrast, the last two types of datasets do not have explicit labels. For the instruction tuning datasets (FLAN, UPRISE), each instance consists of a human instruction and the expected output, denoted as (instruction, output). For the generation datasets, each instance is a long text sequence partitioned into chunks: [chunk_0, ..., chunk_L]. Books3, ArXiv, and CodeParrot are made up of plain texts, which are chunked into spans of equal length (128 tokens per chunk). Multi-Session Chat is composed of conversations, where each chunk corresponds to a pair of consecutive utterances.
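For concreteness, the sketch below illustrates how the two data types might be represented and how the plain-text corpora are chunked. All field names and the helper function are illustrative assumptions, not the project's actual data format.

```python
# Illustrative data schemas (field names are hypothetical, not the project's actual format).
labeled_instance = {            # e.g., MSMARCO / NQ / QReCC / ToolLLM
    "query": "who wrote the origin of species",
    "pos": ["On the Origin of Species was written by Charles Darwin ..."],  # hard-labeled positive
    "neg": ["The Voyage of the Beagle is a travel memoir ..."],             # hard negatives
}

unlabeled_instance = {          # e.g., FLAN / UPRISE instruction-tuning pairs
    "instruction": "Classify the sentiment of the following review ...",
    "output": "positive",       # expected output, later used to compute the LLM reward
}

def chunk_text(token_ids, chunk_size=128):
    """Split a long token sequence into fixed-length chunks (128 tokens per chunk),
    mirroring the preprocessing described for Books3 / ArXiv / CodeParrot."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
```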

2.2 Training Methodology

2.2.1 Formulation of Training Reward. In our work, we explore two types of supervision signals for training the LLM-Embedder. Firstly, we can directly utilize the hard labels provided by the labeled datasets. Secondly, we aim to optimize the LLM's final performance with retrieval augmentation. To achieve this goal, we leverage the reward produced by the LLM for both labeled and unlabeled datasets. Particularly, given the expected output of the LLM, denoted as O, and a retrieval candidate, denoted as C, the reward for the candidate, represented as r_{C|O}, is derived by the following equation:

$$ r_{C|O} = \mathrm{LLM}(O \mid C) = \prod_{o_i \in O} \mathrm{LLM}(o_i \mid C,\, o_{<i}) \quad (1) $$

Here, o_i represents the i-th token of the expected output O, o_{<i} denotes its preceding tokens, and LLM(o_i | C, o_{<i}) stands for the LLM's generation likelihood of producing o_i given the candidate C and the preceding tokens. In other words, a higher reward is assigned to a retrieval candidate if it results in a higher generation likelihood for the expected output.
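To make Eq. 1 concrete, here is a minimal sketch of scoring a candidate by the LLM's likelihood of the expected output. The model name is a stand-in, and the per-token averaging is one possible normalization of the product in Eq. 1, not necessarily the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small stand-in model for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def llm_reward(candidate: str, expected_output: str) -> float:
    """Average log-likelihood of the expected output tokens conditioned on the candidate,
    i.e. the terms LLM(o_i | C, o_<i) of Eq. 1, aggregated in log space."""
    ctx_ids = tokenizer(candidate, return_tensors="pt").input_ids
    out_ids = tokenizer(expected_output, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, out_ids], dim=1)
    logits = model(input_ids).logits
    # Keep only the positions that predict the expected-output tokens (shifted by one).
    out_logits = logits[:, ctx_ids.size(1) - 1:-1, :]
    log_probs = torch.log_softmax(out_logits.float(), dim=-1)
    token_ll = log_probs.gather(-1, out_ids.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()
```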

The LLM-based reward is applied in the following ways for each of the tasks in consideration. 1) For question answering, the reward is computed as the generation likelihood of the answer given one single candidate passage. 2) For instruction tuning, the reward is computed as the generation likelihood of the instructed output given one candidate example. 3) For generation, the reward is computed as the generation likelihood of new content given one candidate historical chunk. Note that the LLM reward is not applied to the conversational search and tool learning datasets, as there is no clear expectation of the LLM's output in these cases.

Given the two sources of supervision signals for LLM-Embedder, i.e. the native hard labels and the soft rewards derived from the LLM, the training is conducted with a composite recipe. Contrastive learning is applied to capture the semantic relationship reflected by the hard labels; meanwhile, knowledge distillation is used to learn from the soft rewards derived from the LLM.

2.2.2 Contrastive Learning. For each pair of hard-labeled texts q and p (e.g., query and passage), the loss function of contrastive learning is formulated in the following way:

$$ \mathcal{L}_{con} = -\log \frac{\exp(\langle e_q, e_p \rangle / \tau)}{\sum_{p' \in \mathbb{P}} \exp(\langle e_q, e_{p'} \rangle / \tau)} \quad (2) $$

where e_* stands for the embedding, ⟨·,·⟩ indicates the inner product operator, P is the union of positive and negative samples, and τ refers to the temperature. To improve the discriminative power of embeddings across diverse application scenarios, we employ a couple of key designs in our contrastive learning framework.
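A minimal PyTorch sketch of this loss is shown below; the tensor layout and temperature value are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, cand_emb, pos_idx, temperature=0.02):
    """InfoNCE-style loss in the spirit of Eq. 2 (a sketch; the temperature is illustrative).
    q_emb:    [B, d] query embeddings (encoded with the task instruction prepended)
    cand_emb: [N, d] embeddings of all positives and negatives shared in the batch
    pos_idx:  [B]    index of each query's positive within cand_emb
    """
    scores = q_emb @ cand_emb.T / temperature   # inner-product similarity <e_q, e_p> / tau
    return F.cross_entropy(scores, pos_idx)     # -log softmax over the candidate set P
```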

The first featured design is Instruction-based Fine-Tuning. In this approach, each task is assigned a unique task instruction, denoted as I_t. While generating the query-side embedding, the task instruction and query content are concatenated and jointly encoded, resulting in the updated query embedding: e_q ← encode([I_t, q]). The use of task-specific instructions plays a pivotal role in initializing the embedding model with distinct activations, thereby facilitating the discrimination between different tasks.
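As an illustration, the snippet below shows how such instruction prefixes could be prepended on the query side. The instruction strings and the `encode_query` helper are hypothetical examples, not the released model's actual prompts.

```python
# Hypothetical task instructions; the released model defines its own prompt strings.
TASK_INSTRUCTIONS = {
    "qa": "Represent this query for retrieving relevant documents: ",
    "icl": "Convert this example into a vector to look for useful examples: ",
    "tool": "Transform this user request for fetching helpful tool descriptions: ",
    "memory": "Embed this text chunk for finding useful historical chunks: ",
}

def encode_query(encoder, task: str, query: str):
    """Prepend the task-specific instruction before encoding the query side,
    i.e. e_q <- encode([I_t, q]); candidate texts are encoded without instructions."""
    return encoder.encode(TASK_INSTRUCTIONS[task] + query)
```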

The second notable feature is Homogeneous In-Batch Negative Sampling. Contrastive learning calls for a considerable number of negative samples to guarantee the embedding's discriminativeness [18, 44, 57]. In our work, this is realized by the joint usage of in-batch negatives and hard negatives. We also apply cross-device sharing [44, 64], which further expands the scale of negative samples. Consequently, our method results in B × D × N − 1 negative samples in total, where B is the batch size, D is the number of GPU devices, and N is the number of positive and hard negative samples per instance. However, the vanilla practice of in-batch negative sampling presents one drawback in our multi-task setting. Particularly, the embeddings shared between different datasets (namely heterogeneous negative samples) are mostly irrelevant, which makes them less effective for discriminating the semantic relationships within a specific task scenario. To address this limitation, we introduce a regularization strategy for the organization of training data, where the data instances from the same task are grouped into consecutive mini-batches. This strategy ensures that the majority of in-batch negative samples originate from the same dataset (i.e. homogeneous negative samples), thus enhancing the discriminative power of embeddings for each specific task.
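The batching strategy can be sketched as follows, assuming each training instance carries a task identifier; this is an illustration of the idea rather than the project's actual data loader.

```python
import random
from collections import defaultdict

def homogeneous_batches(dataset, batch_size, seed=42):
    """Group training instances by task so each mini-batch is drawn from a single dataset,
    making in-batch negatives homogeneous. Each instance is assumed to carry a 'task' field."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for example in dataset:
        by_task[example["task"]].append(example)

    batches = []
    for task_examples in by_task.values():
        rng.shuffle(task_examples)
        for i in range(0, len(task_examples), batch_size):
            batch = task_examples[i:i + batch_size]
            if len(batch) == batch_size:    # drop ragged tails for simplicity
                batches.append(batch)
    rng.shuffle(batches)                    # shuffle the order of batches across tasks
    return batches
```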

2.2.3 Knowledge Distillation.

In our training framework, knowledge distillation plays a crucial role in learning from the LLM's reward. We employ the KL-divergence to minimize the gap between the distribution over candidates computed from the LLM's rewards and the one predicted by the embedding model. In particular, for each query q and its candidate list P: [p_1, ..., p_N], we derive the LLM's rewards for the candidates, denoted as R: [r_1, ..., r_N], using Eq. 1. To make the LLM's rewards suitable for distillation, we transform each reward into a normalized weight: w_i ← softmax_i(R / α), where α represents the temperature. On top of these elements, the KL divergence is computed by the following equation:

$$ \mathcal{L}_{kd} = -\sum_{i=1}^{N} w_i \times \log \frac{\exp(\langle e_q, e_{p_i} \rangle / \tau)}{\sum_{p' \in \mathbb{P}} \exp(\langle e_q, e_{p'} \rangle / \tau)} \quad (3) $$
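A minimal sketch of this distillation objective in PyTorch is given below, restricted to each query's own candidate list for brevity; tensor shapes and the temperature values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distill_loss(q_emb, cand_emb, rewards, tau=0.02, alpha=0.1):
    """Soft-label distillation in the spirit of Eq. 3 (a sketch, not the official code).
    q_emb:    [B, d]    query embeddings
    cand_emb: [B, N, d] embeddings of each query's candidate list
    rewards:  [B, N]    LLM rewards from Eq. 1
    """
    weights = F.softmax(rewards / alpha, dim=-1)                  # w_i <- softmax(R / alpha)
    scores = torch.einsum("bd,bnd->bn", q_emb, cand_emb) / tau    # <e_q, e_p_i> / tau
    log_probs = F.log_softmax(scores, dim=-1)
    return -(weights * log_probs).sum(dim=-1).mean()              # cross-entropy against soft weights
```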

While the above formulation has been successfully employed in mono-task settings [17, 31, 56], applying it directly to our multi-task scenario poses unique challenges. Notably, the magnitude of the LLM's rewards can exhibit high fluctuations due to the diverse training samples from various tasks. In many cases, the LLM's rewards are distributed too closely, making it challenging to distinguish the quality of candidates. In other cases, the rewards become polarized, with candidates receiving either a positive reward or nearly zero reward. Both of these scenarios contribute little to the distillation process and can severely impair the training effect.

• Stabilized Distillation. To address the challenge of fluctuating rewards in our multi-task scenario, we introduce a modified formulation of the loss function, which effectively alleviates the negative impact resulting from the rewards' fluctuations. Particularly, instead of using the LLM rewards solely as "soft weights", we also leverage them as hard ranking labels. With the candidates sorted in descending order of their rewards, the loss becomes:

$$ \mathcal{L}_{sd} = -\sum_{i=1}^{N} w_i \times \log \frac{\exp(\langle e_q, e_{p_i} \rangle / \tau)}{\sum_{p' \in \{p_i\} \cup \mathbb{P}_i} \exp(\langle e_q, e_{p'} \rangle / \tau)} $$

Here, P_i comprises two sources: the lower-ranked candidates of p_i, i.e. [p_{i+1}, ..., p_N], and the in-batch negative samples. Our adapted formulation serves to stabilize fluctuating rewards in two fundamental ways. On one hand, the model is consistently trained to promote p_i over its lower-ranked counterparts [p_{i+1}, ..., p_N]. This means that the model is always able to learn from the LLM's preferences, regardless of the absolute value of the rewards. This mechanism is particularly effective in handling cases where the LLM's rewards are too closely distributed. On the other hand, when the top-ranked candidate receives a significantly higher reward than the other candidates, the weights become one-hot. In this scenario, the distillation process reduces to contrastive learning, with the top-ranked candidate treated as the positive sample. This mechanism helps to address situations where polarized rewards are generated by the LLM.
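The following sketch illustrates one way to implement the described adaptation: candidates are sorted by reward and each candidate competes only against its lower-ranked counterparts, with in-batch negatives omitted for brevity. It is an interpretation of the description above, not the official implementation.

```python
import torch
import torch.nn.functional as F

def stabilized_distill_loss(q_emb, cand_emb, rewards, tau=0.02, alpha=0.1):
    """Stabilized distillation (a sketch). q_emb: [B, d], cand_emb: [B, N, d], rewards: [B, N].
    Each candidate p_i is contrasted only against {p_i, p_{i+1}, ..., p_N}, so the model learns
    the LLM's preference order even when absolute rewards are nearly flat or polarized."""
    order = rewards.argsort(dim=-1, descending=True)                  # rank candidates by reward
    rewards = rewards.gather(-1, order)
    cand_emb = cand_emb.gather(1, order.unsqueeze(-1).expand_as(cand_emb))

    weights = F.softmax(rewards / alpha, dim=-1)                      # soft weights over ranked candidates
    scores = torch.einsum("bd,bnd->bn", q_emb, cand_emb) / tau
    # Mask out higher-ranked candidates: row i keeps only columns j >= i (in-batch
    # negatives, used by the full method, are omitted here for brevity).
    N = scores.size(-1)
    mask = torch.ones(N, N, device=scores.device).triu().bool()
    masked = scores.unsqueeze(1).masked_fill(~mask, float("-inf"))    # [B, N, N]
    log_probs = torch.diagonal(F.log_softmax(masked, dim=-1), dim1=1, dim2=2)
    return -(weights * log_probs).sum(dim=-1).mean()
```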

2.3 Retrieval Augmentation of LLMs

Figure 2: Retrieval augmentation with LLM-Embedder.

The multi-tasking capacity of LLM-Embedder makes it a versatile solution. By connecting to the vector DB where the needed external elements are stored, it can support a wide variety of retrieval augmentation tasks. Here, we discuss the typical scenarios empowered by LLM-Embedder (Figure 2), focusing on three key issues: 1) what to store in the vector DB, 2) what is used to query the vector DB, and 3) how to leverage the retrieved data.

• Knowledge Enhancement. When handling knowledge-intensive tasks [24, 40], the docs from the knowledge corpus can be encoded and stored in the vector DB. In many cases, questions are explicitly presented, which can be used to query the vector DB. In other cases, the working context during the generation process can be used as the query [15, 22]. The retrieved docs can be directly applied or refined into more informative segments [29]. Finally, the query and retrieved docs are concatenated to generate the knowledge-grounded answer, e.g., [knowledge, query] → answer.

• Long-Context Modeling. When dealing with a long context, the entire history can be chunked, encoded, and off-loaded to the vector database. The working context during the generation process can be used to query the vector DB for relevant chunks. In many cases, both the relevant chunk, e.g., chunk_i, and its subsequent chunk_{i+1} are used for memory augmentation [9], because the subsequent chunk can be more critical to the future generation. The retrieved chunks are used to back-fill the current context, where new content can be generated with remote but important memory, e.g., [retrieved chunks, current context] → new generation.

• In-context Learning. The demonstration examples, organized in the form of "(task instruction, expected output)", can be encoded and pre-stocked in vector DB. When a new task is given, the task's instruction is used to query the vector DB [11, 58] . The retrieved examples are concatenated with the task's instruction, based on which the in-context learning can be conducted, e.g., [retrieved examples, instruction] → response.

• Tool Learning. The tool's functionality can be verbalized as a description, and paired with its API: "(description, API)". In this way, a massive toolkit can be managed by vector DB based on the encoded description [43] . Given a user request that involves the use of tools, the user request can be encoded and used to query the vector DB. The retrieved tool is executed via its API, where the execution result is returned for LLM to complete the remaining generation: [user request, tool's execution result] → generation.
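The sketch below shows the bare-bones plumbing shared by these scenarios, assuming embeddings have already been produced by LLM-Embedder and using FAISS as a stand-in vector DB; the document set and prompt assembly at the end are hypothetical examples.

```python
import numpy as np
import faiss  # FAISS is used here as a simple stand-in vector DB; any ANN index would do

def build_index(item_embeddings: np.ndarray) -> faiss.Index:
    """Store the encoded external elements (docs, memory chunks, examples,
    or tool descriptions) in an inner-product index acting as the vector DB."""
    index = faiss.IndexFlatIP(item_embeddings.shape[1])
    index.add(item_embeddings.astype(np.float32))
    return index

def retrieve(index: faiss.Index, query_embedding: np.ndarray, k: int = 3):
    """Query the vector DB with the embedded question / working context / user request."""
    scores, ids = index.search(query_embedding.astype(np.float32).reshape(1, -1), k)
    return ids[0].tolist(), scores[0].tolist()

# Hypothetical knowledge-enhancement flow (docs and the prompt template are illustrative):
docs = ["Doc about topic A ...", "Doc about topic B ...", "Doc about topic C ..."]
doc_emb = np.random.rand(len(docs), 768)   # placeholder for LLM-Embedder encodings
index = build_index(doc_emb)
query_emb = np.random.rand(768)            # placeholder for the encoded question
top_ids, _ = retrieve(index, query_emb, k=2)
prompt = "\n".join(docs[i] for i in top_ids) + "\nQuestion: ..."   # [knowledge, query] -> answer
```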

3. Experiment

The experimental study is designed to clarify three basic research questions.

3.1 Settings

The baselines, datasets, evaluation method, and implementation of the experiment are introduced as follows. Given the limited space, more detailed specifications are presented in the Appendix.

3.1.1 Baselines. Firstly, we measure the performance of large language models (LLMs) without retrieval augmentation, denoted as None, to gauge the empirical benefits introduced by retrieval augmentation. Secondly, we compare with a series of baseline retrievers, which are categorized into two types. 1) General embedding models. These models are trained to support a wide range of text retrieval and representation tasks, such as question answering, entity retrieval, duplication detection, and document ranking. Our experiment includes the following widely-recognized baselines: Contriever [18], Instructor [51], RetroMAE-BEIR [31], and BGE [62]. These methods are empirically competitive according to the BEIR [52] and MTEB [36] benchmarks, among which BGE maintains the leading performance at the time of this work. 2) Task-specific embedding models. These models are tailored to optimize performance on specific tasks. We include the following task-specific baselines, which excel in their respective domains: AAR [68] for knowledge enhancement of LLMs, LLM-R [58] for in-context learning, API-Retriever [43] for tool learning, and Conv-ANCE [34] for conversational search. Additionally, we consider BM25 [48], a widely used retriever based on lexical similarity.

3.1.2 Evaluation and Datasets. We present the tasks used to assess the retriever's performance, including knowledge enhancement, in-context learning, long-context modeling, tool learning, and conversational information seeking. For each task, we introduce the relevant evaluation dataset and methodology.

• Knowledge Enhancement. We adopt the established setup used by AAR [68]. The experiment is performed on two popular benchmarks. 1) MMLU [16], which comprises multiple-choice questions evaluated by accuracy. 2) PopQA [33], which involves question answering tasks evaluated by exact match (EM). Following AAR, the knowledge is retrieved from MS MARCO [38] and the Wikipedia Corpus [40], respectively.

• In-Context Learning. We adopt the data and framework from LLM-R [58] . There are 30 public datasets from 9 distinct categories, including Close QA (CQA), Commonsense (Comm), Coreference (Coref), Paraphrase (Para), NLI, Reading Comprehension (RC), Sentiment Analysis (Sent), Data2Text (D2T), Summarization (Summ). To better assess the generalization ability, we withhold four datasets (QNLI, PIQA, WSC273, Yelp) from the training stage. We collect demonstration examples from the combination of FLAN [60] and UPRISE [11] , creating a retrieval pool of 6.3 million examples. For each presented task, we retrieve the top-8 examples to complete the task.

• Long-Context Modeling. We focus on two scenarios: long conversation and long-range language modeling. The first scenario leverages Multi-Session Chat [66]. We retrieve historical dialogue turns using the current utterance as the query and prepend them to the current utterance, based on which the next response is generated. Following existing literature on augmenting memory for LLMs [49, 59, 61], we leverage Books3 [14], ArXiv [14], CodeParrot [54], and PG19 [45] for the second scenario. We hold out PG19 entirely from training to assess the generalization ability. These datasets divide each historical sequence into chunks of 128 tokens. Historical chunks are retrieved based on the latest chunk and prepended to the current context, based on which the future chunk is generated. Performance in both scenarios is measured by perplexity (PPL).

• Tool Learning. We follow the established data and framework from ToolLLM [43] , whose primary objective is to find the needed tool based on the instructions and the tool's descriptions. The dataset already provides ground-truth information about the needed tool, which allows us to directly measure the retriever's performance using its ranking performance, specifically NDCG.

• Conversational Search. We use the setup of QReCC [2] for evaluation, where the required knowledge is retrieved based on the concatenation of the conversation's context and the last query. Like ToolLLM, this dataset also provides ground-truth labels, allowing the retriever's performance to be directly measured by its ranking performance (NDCG).

3.2 Analysis

The experiment results are analyzed from three perspectives: the overall analysis, the analysis for each individual scenario, and the ablation studies for influential factors.

3.2.1 Overall Analysis.

Table 1: Impact on knowledge enhancement. MMLU and PopQA are measured by precision and exact match (EM), respectively. "∗" and "†" indicate the SOTA general embedding model and the task-specific method for the corresponding scenario.
MMLU PopQA
Method STEM Social Human Other All Avg. PopQA
None 0.3468 0.5328 0.5094 0.4967 0.4599 0.2061
BM25 0.3760 0.5378 0.5051 0.5088 0.4721 0.3491
Instructor [51] 0.3702 0.5406 0.5111 0.5082 0.4721 0.3533
Contriever [18] 0.3677 0.5383 0.5080 0.5013 0.4684 0.3276
RetroMAE-BEIR [31] 0.3857 0.5456 0.5221 0.5276 0.4853 0.4364
BGE* [62] 0.3852 0.5564 0.5194 0.5389 0.4896 0.4491
AAR [68] 0.3802 0.5501 0.5125 0.5288 0.4826 0.4792
API-Retriever [43] 0.3535 0.5335 0.4999 0.5068 0.4625 0.2488
LLM-R [58] 0.3629 0.5277 0.5018 0.4984 0.4625 0.2506
LLM-Embedder 0.3848 0.5568 0.5255 0.5360 0.4903 0.5052

The experiment results on the different retrieval augmentation scenarios are presented in Tables 1-3, respectively. Given the observations across all the presented results, we can come to the following conclusions.

Firstly, compared with the results from the plain LLM, i.e. None, LLM-Embedder helps to deliver more precise answers with the retrieved knowledge (Table 1), a better instruction-following effect with the retrieved examples (Table 2), and improved quality of long-sequence generation with the retrieved memory (Table 3). Besides, the LLM's performance can also be improved by other baseline retrievers in many situations. However, the relative improvements are not always as significant as those with LLM-Embedder. Such observations indicate that LLMs can benefit from properly retrieved assistance; and with a stronger retriever, the augmentation's impact can be substantially magnified.

Table 2: Impact on in-context learning. The performances are measured by Misc. metrics (see Appendix).
In-Context Learning
Method CQA Comm Coref Para NLI RC Sent D2T Summ Avg
None 0.2923 0.7212 0.6578 0.5242 0.4478 0.4892 0.7077 0.1982 0.1447 0.4645
BM25 0.3603 0.7019 0.6029 0.5059 0.4583 0.5396 0.7284 0.3019 0.1555 0.4840
Instructor 0.5003 0.7772 0.5735 0.6312 0.5360 0.6219 0.9148 0.4595 0.4572 0.6036
Contriever 0.4912 0.7723 0.5624 0.6358 0.5466 0.6297 0.9141 0.4380 0.4444 0.6009
RetroMAE-BEIR 0.4594 0.7742 0.5840 0.5755 0.5408 0.6029 0.9286 0.4661 0.4465 0.5939
BGE* 0.4718 0.7773 0.5550 0.6171 0.5413 0.5988 0.9281 0.4719 0.4521 0.5974
AAR 0.4809 0.7796 0.5848 0.5890 0.5354 0.6039 0.9210 0.4445 0.4410 0.5938
API-Retriever 0.4765 0.7620 0.5465 0.6266 0.5204 0.6096 0.9245 0.4866 0.4424 0.5945
LLM-R* 0.5165 0.7802 0.5830 0.6567 0.6145 0.6223 0.9059 0.4777 0.4878 0.6262
LLM-Embedder 0.5163 0.7842 0.5927 0.6556 0.6041 0.6318 0.9224 0.4731 0.4742 0.6268
Table 3: Impact on long conversation and language modeling (PPL), tool learning (NDCG), conv search (NDCG).
Conversation Language Modeling Tool C-Search
Method MSC Books3 Arxiv CodeParrot PG19 (o.d.) ToolLLM QReCC
None 19.3501 8.8193 3.7647 2.7663 10.2510 - -
Recency 13.9569 8.7391 3.4158 2.5989 10.2216 - -
BM25 14.6512 8.6576 3.3106 2.4591 10.1960 0.5115 0.4341
Instructor 14.8799 8.6619 3.3546 2.4756 10.2011 0.3882 0.2863
Contriever 14.2129 8.6460 3.2709 2.4437 10.1616 0.4904 0.3563
RetroMAE-BEIR 14.3990 8.6376 3.2903 2.4592 10.1731 0.5205 0.4037
BGE* 14.2943 8.6311 3.2912 2.4578 10.1541 0.5761 0.3856
AAR 14.6999 8.6381 3.3260 2.4666 10.1808 0.4200 0.2877
API-Retriever 14.7834 8.6722 3.3858 2.4919 10.1833 0.8017 0.1137
Conv-ANCE* - - - - - - 0.4560
LLM-R 14.4746 8.6619 3.3635 2.4724 10.2024 0.1321 0.0234
LLM-Embedder 13.4832 8.6080 3.2322 2.4303 10.1185 0.8645 0.5053

Secondly, LLM-Embedder brings forth a competitive retrieval augmentation effect across the diverse scenarios. On one hand, it notably outperforms a series of general retrievers, including the state-of-the-art method BGE. On the other hand, it also goes beyond the task-specific methods, i.e. AAR for knowledge enhancement, LLM-R for in-context learning, API-Retriever for tool learning, and Conv-ANCE for conversational search. Such an observation indicates that LLM-Embedder is able to provide a strong and unified foundation to support the different retrieval augmentation needs of LLMs.

Figure 3: Retrieval augmentation’s impact from different retrievers. The warmer color indicates a better performance.

Finally, we can also observe that a task-specific retriever optimized for one scenario could result in limited performance in other scenarios, indicating that the training impacts between different retrieval tasks are not always transferable. To better illustrate this point, we visualize the retrieval augmentation's impact (improvements over None) from five representative methods in Figure 3: BGE, AAR, LLM-R, API-Retriever (API-R), and LLM-Embedder (ours). The first method is a general embedding model, while the second to fourth are task-specific methods. We can observe that although task-specific training can deliver a competitive performance for its corresponding scenario, e.g., AAR for knowledge enhancement and LLM-R for in-context learning, their impacts are severely weakened when applied to other usages. In contrast, LLM-Embedder demonstrates a steady and competitive performance across different scenarios. This indicates that the seemingly irrelevant or even adverse retrieval patterns can be unified by one embedding model on top of the properly optimized training recipe.

3.2.2 Individual Analysis.

• Knowledge Enhancement. The experiment results on knowledge enhancement are shown in Table 1, where we can make the following observations. 1) Benefit of external knowledge. LLMs benefit from external knowledge when answering questions in MMLU and PopQA, as clear empirical advantages are achieved by the retrieval augmentation methods compared with the plain LLM, i.e. None. 2) Importance of retrieval accuracy. The impact of knowledge enhancement becomes more pronounced when knowledge retrieval is more accurate. We observe consistent improvements as we transition from using the BM25 retriever to more advanced embedding models. 3) Distinction between datasets. The impact of retrieval augmentation is more noticeable on the PopQA dataset than on MMLU. This difference is likely due to the nature of the datasets. PopQA tends to be more knowledge-intensive, with a focus on questions about long-tail entities. In contrast, many questions in MMLU rely more on common sense and logical reasoning rather than extensive world knowledge.

• In-Context Learning. The experiment results on in-context learning are shown in Table 2, where we can draw the following observations. 1) Benefits of retrieved examples. When comparing the plain LLM (None) with the retrieval-augmented methods, we consistently observe improved performance in most cases. This finding underscores the enhancement of the LLM's ability to follow instructions when retrieved examples are presented. 2) Limitation of BM25. It's noteworthy that BM25's performance is comparatively weaker than its performance in other scenarios. This discrepancy can be attributed to the specific nature of in-context learning, where example retrieval needs to emphasize semantic similarity rather than lexical similarity. 3) Limited transferability. While the task-specific method, LLM-R, exhibits a competitive performance for in-context learning, its utility becomes severely limited when applied to other scenarios, such as knowledge retrieval and tool using. This suggests that example retrieval calls for a unique pattern tailored to this very task, making it challenging to transfer to other scenarios.

• Long-Context Modeling. The experiment results on long-context modeling are shown in Table 3. While retrieval augmentation consistently demonstrates improvements compared to having no augmentation (None), the comparison may not be entirely convincing due to the utilization of more context. To address this issue, we introduce a simple yet strong baseline called Recency. Rather than using retrieved context, Recency directly leverages the most recent context immediately preceding the current window. For example, in conversation, it considers the last pair of utterances before the current session; and in language modeling, it uses the content within the range of 2049-4096 tokens preceding the latest 2048 tokens.

With the introduction of this new baseline, the impact of retrieval augmentation becomes more nuanced. On one hand, the LLM-Embedder continues to exhibit superior performance across various situations. On the other hand, other retrievers no longer guarantee a consistent enhancement: although alternative retrieval-augmented methods yield improved generation quality for language modeling, a majority of them fall short of Recency's performance while dealing with conversation. This observation underscores the challenges regarding effective memory retrieval in practice.

• Tool Learning and Conversational Search. The experiment results on tool learning and conversational search are shown in Table 3. In line with our prior observations, the task-specific approaches, i.e. the API-Retriever (tool learning) and Conv-ANCE (conversational search), consistently deliver higher performances than most of the baselines. Besides, unlike other cases, BM25 overtakes most of the embedding models in these two scenarios. However, it's worth noting that LLM-Embedder continues to maintain the leading position, which again highlights its capability in unifying diverse retrieval tasks.

3.2.3 Ablation Studies.

Table 4: Ablation study for the four influential factors in LLM-Embedder's training: using soft reward from LLM, stabilized distillation, instruction based fine-tuning, and in-batch negative sampling from the same scenario.
Knowledge ICL Long Tool Conv Search
Method MMLU PopQA Misc. MSC ArXiv ToolLLM QReCC
W.O. LLM Reward 0.4872 0.4794 0.6217 13.9176 3.2495 0.8927 0.4945
W.O. Instruction FT 0.4776 0.5025 0.6211 13.9125 3.2383 0.8192 0.5029
W.O. homo NS 0.4791 0.4520 0.6200 14.0441 3.2558 0.8364 0.4563
W.O. Stabilized Distill 0.4815 0.5027 0.6105 13.6090 3.2441 0.7905 0.4865
AAR 0.4826 0.4792 0.5938 14.6999 3.3260 0.4200 0.2877
LLM-R 0.4625 0.2506 0.6262 14.4746 3.3635 0.1321 0.0234
API-Retriever 0.4625 0.2488 0.5942 14.7834 3.3858 0.8017 0.1137
LLM-Embedder 0.4903 0.5052 0.6268 13.4832 3.2322 0.8645 0.5053

The ablation studies are presented to analyze the influential factors in LLM-Embedder's training process (see Table 4): reward from LLM, instruction based fine-tuning, homogeneous in-batch negative sampling, and stabilized distillation. For "w.o. LLM reward", we replace the soft reward from the LLM by using the highest-rated candidates as positive samples (i.e. hard labels). By doing so, the knowledge distillation is reduced to contrastive learning. The empirical performance in most of the scenarios decreases due to this change. However, the performances in tool learning and conversational search are little affected; this is comprehensible given that LLM-Embedder is trained purely with hard labels in both scenarios.

For "w.o. instruction FT", we remove the task-specific instructions while fine-tuning LLM-Embedder. Without such a component, it will become harder for the embedding model to discriminate the retrieval task in different scenarios. This speculation is consistent with the observed result, as LLM-Embedder's performance is decreased from such a change.

For "w.o. homo NS", the homogeneous in-batch negative sampling is disabled. Such a change could reduce the discrimination of the embeddings, because a great portion of the negative samples will come from different tasks, which are irrelevant with each other. As we can observe, LLM-Embedder's performance is decreased due to such a change, especially for PopQA and Conv Search, where a massive candidate pool is presented (Wikipedia corpus).

For "w.o. stabilized distill", we replace our stabilized distillation with the conventional KL-divergence based method. As introduced, this operation handles the fluctuated reward from LLM such that distillation can become more stabilized. We can observe that LLM-Embedder's performance is reduced once this step is removed, especially for ICL where LLM's reward is the major training signal.

4. Related Works

The related works are reviewed from two perspectives: retrieval-augmented large language models, and dense retrieval.

• Retrieval-augmented LLMs. Large language models represent a significant milestone in the development of general artificial intelligence [10, 12, 53]. Despite such superiority, LLMs still face a series of severe challenges, such as hallucination, human alignment, and long-term memory. Many of the existing problems are caused by the inherent boundaries, which cannot be addressed by LLMs alone, but rely on support from the external world. Retrieval-augmented LLMs are regarded as a go-to option to bridge LLMs with external assistance [3, 35]. For the past few years, they have been widely applied to several critical scenarios. One common case is knowledge enhancement. The internal knowledge of LLMs can be incomplete, static, and limited by popularity bias. When dealing with knowledge-intensive tasks, retrieval-augmented LLMs look for necessary information in an external database, such that the generated content can be grounded on proper knowledge [9, 19, 20, 27]. Besides, retrieval-augmented LLMs are also used to retrieve historical context to establish long-term memory [49, 59], retrieve examples to improve the instruction-following capability [11, 58], and retrieve tools to engage with the physical world [43].

The retrieval-augmented LLMs consist of two basic parts: the generator and the retriever. According to previous studies [20, 27, 58, 68], the retrieval augmentation effect is highly influenced by the retrieved content. In practice, there are two common types of retrievers. One is to leverage general-purpose retrievers, such as sparse models like BM25 [48], and dense models like DPR [24], Contriever [18], E5 [56], BGE [62], and the OpenAI text embedding [37]. The other option is to develop task-specific retrievers, e.g., AAR for knowledge enhancement [68] and LLM-R [58] for in-context learning. The general-purpose methods are praised for their generality and simplicity of usage, but may suffer from an inferior retrieval quality. In contrast, the task-specific ones can better fit one scenario, but fall short in transferability. Compared with the existing works, LLM-Embedder unifies generality and speciality: it comprehensively supports all major retrieval augmentation needs of LLMs, while achieving the leading performance in every application scenario.

• Dense retrieval. Dense retrieval leverages the latent representation of texts, i.e. embeddings, to search for relevant information from a vector DB. In recent years, it has grown into a major paradigm of information retrieval. The success of dense retrieval can be attributed to several factors. The first and foremost driving force is the development of pre-trained language models [13, 30, 46], with which textual data can be represented in a highly expressive manner. The general pre-trained models are further improved by retrieval-oriented ones [31, 56], which better establish the sentence-level representation capability during the pre-training stage. The second factor is the advancement of contrastive learning. On one hand, there has been a major upgrade of negative sampling, where massive [18, 24] and sufficiently hard [65] negative samples are utilized to improve the embedding's discriminativeness. On the other hand, the training objective is improved as well. Instead of simply learning from hard labels, the embedding models are made to distill knowledge from a more precise ranking model [17, 44, 63]. This notably facilitates the embedding model to encode fine-grained semantic relationships. Thirdly, generality has become increasingly emphasized, as embeddings need to handle a wide variety of application scenarios. For this purpose, many different strategies have been proposed, e.g., data augmentation [28, 55], domain adaptation [23, 67], and instruction-based fine-tuning [4, 51], which help the model to better handle diverse tasks. These factors are incorporated and optimized in developing our training recipe, which results in the empirical competitiveness of LLM-Embedder.

5. Conclusion

In this study, we introduce LLM-Embedder, a novel model designed to enhance the retrieval augmentation of LLMs in a variety of scenarios. Our model integrates four key retrieval capabilities: knowledge, memory, example, and tool, which boost LLMs' performance in dealing with knowledge-intensive tasks, long-context modeling, in-context learning, and tool learning. To optimize LLM-Embedder's performance in such diverse scenarios, we've refined our training workflow from multiple aspects, including reward from LLM, homogeneous negative sampling, instruction based fine-tuning, and stabilized distillation. Our experiments show LLM-Embedder's empirical advantages over both general and task-specific embedding models, which highlights its effectiveness as a foundational building-block to support the retrieval augmentation of LLMs.
