Update README.md
First install the PyLate library:

```bash
pip install -U pylate
```

> [!WARNING]
> **Prompt alignment is critical for ColBERT-Zero models.** You **must** use `prompt_name="query"` when encoding queries and `prompt_name="document"` when encoding documents. ColBERT-Zero was pre-trained with asymmetric prompts (`search_query:` / `search_document:`), and stripping them causes significant performance degradation.
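In effect, these prompts prepend a role-specific prefix to each text before tokenization. A minimal sketch of the idea (illustrative only: PyLate applies the prompt internally when you pass `prompt_name` to `model.encode`, and the exact whitespace after the colon is an assumption here):

```python
def apply_prompt(texts: list[str], prompt_name: str) -> list[str]:
    """Prepend the asymmetric prefix ColBERT-Zero was pre-trained with.

    Illustrative sketch only: PyLate does this for you via prompt_name,
    and the exact formatting after the colon is assumed here.
    """
    prefixes = {"query": "search_query: ", "document": "search_document: "}
    return [prefixes[prompt_name] + text for text in texts]

print(apply_prompt(["what is late interaction?"], "query"))
# → ['search_query: what is late interaction?']
```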
### Retrieval

Use this model with PyLate to index and retrieve documents. The index uses [FastPLAID](https://github.com/lightonai/fast-plaid) for efficient similarity search.
```python
documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    prompt_name="document",  # ⚠️ Required for ColBERT-Zero! Do not omit.
    show_progress_bar=True,
)
```
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the top matching ids and relevance scores:

> [!WARNING]
> Always pass `prompt_name="query"` for queries and `prompt_name="document"` for documents. Omitting these prompts will silently degrade retrieval quality.

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    prompt_name="query",  # ⚠️ Required for ColBERT-Zero! Do not omit.
    show_progress_bar=True,
)

# Step 3: Retrieve the top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,
)
```
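Under the hood, ColBERT scores with late interaction (MaxSim): each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. A minimal NumPy sketch of that scoring rule, using toy embeddings rather than the model's actual vectors:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Rows are assumed L2-normalized so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

# Toy example: 2 query tokens, 3 document tokens, 2-D embeddings
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(maxsim(q, d))  # → 2.0 (each query token finds an exact match)
```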
### Reranking

> [!WARNING]
> Always pass `prompt_name="query"` for queries and `prompt_name="document"` for documents. Omitting these prompts will silently degrade retrieval quality.

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank.rerank` function and pass the queries and documents to rerank:

```python
from pylate import rank, models

# ... (initialize the model and define queries, documents, and documents_ids as above)

queries_embeddings = model.encode(
    queries,
    is_query=True,
    prompt_name="query",  # ⚠️ Required for ColBERT-Zero! Do not omit.
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
    prompt_name="document",  # ⚠️ Required for ColBERT-Zero! Do not omit.
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
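Conceptually, reranking just scores each candidate against its query (with the same MaxSim rule used at retrieval time) and sorts the candidates by that score. A toy sketch of the sorting step, with hypothetical ids and scores rather than PyLate's actual output format:

```python
def rerank_by_score(documents_ids: list, scores: list[float]) -> list:
    """Sort one query's candidate document ids by descending relevance score.

    Toy illustration of the final step of reranking; ids and scores
    here are hypothetical, not PyLate output.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [documents_ids[i] for i in order]

print(rerank_by_score([1, 3, 2], [0.12, 0.87, 0.45]))  # → [3, 2, 1]
```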