Search-R1/retriever.md at ba152349fddeca423d72799c5321099cc3cdd801

PeterGriffinJin ba152349fd add local sparse retriever, ann dense retriever and online search engine

2025-04-07 18:20:43 +00:00

Search Engine

This document shows how to launch different retrievers: a local sparse retriever (e.g., BM25), a local dense retriever (e.g., e5), and an online search engine. For local retrievers, we use the wiki-18 corpus as an example; the corpus indexes can be found at bm25, e5-flat, e5-HNSW64.

How to choose the retriever?

- If you have a private or domain-specific corpus, choose a local retriever.
  - If there are no high-quality embedding-based retrievers (dense retrievers) in your domain, choose a sparse local retriever (e.g., BM25).
  - Otherwise, choose a dense local retriever.
    - If you do not have sufficient GPUs for exact dense embedding matching, choose ANN indexing on CPUs.
    - If you have sufficient GPUs, choose flat indexing on GPUs.
- If you want to train a general LLM search agent and have enough funding, choose an online search engine (e.g., SerpAPI).
- If you have a domain-specific online search engine (e.g., PubMed search), you can refer to link to integrate it into Search-R1 yourself.
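
The decision rules above can be written down as a small function. This is only an illustrative sketch: the function name, argument names, and returned labels are hypothetical and are not part of Search-R1.

```python
def choose_retriever(has_private_corpus, has_good_dense_model=False,
                     has_enough_gpus=False, has_search_budget=False):
    """Map the guide's decision rules to a retriever label (hypothetical helper)."""
    if has_private_corpus:
        # Private or domain-specific corpus: use a local retriever.
        if not has_good_dense_model:
            return "local sparse (BM25)"
        if has_enough_gpus:
            return "local dense, flat index on GPU"
        return "local dense, ANN index on CPU"
    if has_search_budget:
        # General search agent with funding for API calls.
        return "online search engine"
    # The guide does not cover this combination.
    return None
```

For example, `choose_retriever(True, has_good_dense_model=True)` returns the ANN-on-CPU label, because a good dense model is available but GPUs are not.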

Local Sparse Retriever

A sparse retriever (e.g., BM25) is a traditional lexical-matching method. Retrieval is very efficient and requires no GPUs. However, it may not be as accurate as dense retrievers in some specific domains.
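
To give a feel for why sparse retrieval needs no GPU, here is a toy Okapi BM25 scorer in pure Python. It is a minimal sketch for illustration only, not the implementation used by the retrieval server; the whitespace tokenizer and the defaults `k1=1.5`, `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (toy version)."""
    if not docs:
        return []
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score exactly zero, which is also why BM25 can miss relevant passages that use different wording.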

(1) Download the index.

save_path=/your/path/to/save
huggingface-cli download PeterJinGo/wiki-18-bm25-index --repo-type dataset --local-dir $save_path

(2) Launch a local BM25 retriever server.

conda activate retriever

index_file=$save_path/bm25
corpus_file=$save_path/wiki-18.jsonl
retriever_name=bm25

python search_r1/search/retrieval_server.py --index_path $index_file --corpus_path $corpus_file --topk 3 --retriever_name $retriever_name
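
Once the server is running, a client can send it queries over HTTP. The sketch below is an assumption-heavy example: this document does not specify the address, route, or request schema, so the URL `http://127.0.0.1:8000/retrieve` and the JSON fields `queries`, `topk`, and `return_scores` are assumptions that should be checked against `search_r1/search/retrieval_server.py`. The helper names are hypothetical.

```python
import json
import urllib.request


def build_payload(queries, topk=3, return_scores=True):
    """Build the JSON body for a retrieval request (assumed schema)."""
    return {"queries": list(queries), "topk": topk, "return_scores": return_scores}


def retrieve(queries, url="http://127.0.0.1:8000/retrieve", topk=3, timeout=30):
    """POST the queries to the retrieval server and return the parsed JSON reply.

    The default URL is an assumption; adjust it to match your server.
    """
    body = json.dumps(build_payload(queries, topk)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, calling `retrieve(["What is Python?"])` would return whatever JSON the server produces for that query.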

Local Dense Retriever

You can also adopt an off-the-shelf dense retriever, e.g., e5. These models are much stronger than sparse retrievers in some specific domains. If you have sufficient GPUs, we recommend the flat indexing variant below; otherwise, you can adopt the ANN variant.

Flat indexing

Flat indexing performs exact embedding matching, which is slow but very accurate. To make it efficient enough to support online RL, we recommend enabling GPU usage with --faiss_gpu.
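
Conceptually, exact matching means scoring the query embedding against every passage embedding and keeping the best ones. The pure-Python sketch below illustrates that idea with inner-product similarity; it is not how the server is implemented (the `--faiss_gpu` flag indicates FAISS is used there), and the function name and the choice of inner product are assumptions made for the example.

```python
import heapq


def flat_search(query_vec, doc_vecs, topk=3):
    """Exhaustive search: score every document vector, return the top-k.

    Returns a list of (score, doc_index) pairs, best first.
    """
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    )
    # Cost grows linearly with the corpus size, which is why a GPU helps.
    return heapq.nlargest(topk, scored)
```

Because every vector is compared, the result is exact, but each query costs one pass over the whole index.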

(1) Download the index and corpus.

save_path=/the/path/to/save
python scripts/download.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz

(2) Launch a local flat e5 retriever server.

conda activate retriever

index_file=$save_path/e5_Flat.index
corpus_file=$save_path/wiki-18.jsonl
retriever_name=e5
retriever_path=intfloat/e5-base-v2

python search_r1/search/retrieval_server.py --index_path $index_file --corpus_path $corpus_file --topk 3 --retriever_name $retriever_name --retriever_model $retriever_path --faiss_gpu

ANN indexing (HNSW64)

To improve search efficiency with only CPUs, you can adopt approximate nearest neighbor (ANN) indexing, e.g., with HNSW64. It is very efficient, but may not be as accurate as flat indexing, especially when the number of retrieved passages is small.
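
One way to quantify the accuracy gap is recall@k: the fraction of the exact (flat) top-k passages that the ANN index also returns. The helper below is a minimal sketch for running that check on your own queries; the function name and the ID-list inputs are hypothetical and not part of Search-R1.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k results that also appear in the ANN top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)
```

For instance, if flat search returns IDs `[1, 2, 3, 4]` and the ANN index returns `[1, 3, 5, 2]`, recall is 2/3 at k=3 and 3/4 at k=4.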