In standard vector search (Bi-Encoders), the user query and the document are embedded separately and compared using simple geometry (cosine similarity). While incredibly fast, this misses deep linguistic nuance because the words in the query never interact with the words in the document during processing. A Cross-Encoder instead passes the query and the document *together* into a Transformer model (e.g., `[CLS] Query [SEP] Document [SEP]`). The self-attention mechanism compares every word in the query against every word in the document simultaneously, producing a single, highly accurate relevance score. Because this is computationally expensive, it is used strictly as a 'Reranker': it re-evaluates only a small candidate set (e.g., the top 100 results) retrieved by the faster Bi-Encoder.
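The interface difference is the key point: a Bi-Encoder embeds each text independently and compares vectors afterward, while a Cross-Encoder consumes the (query, document) pair jointly. The sketch below illustrates only those two interfaces with toy scoring functions (bag-of-words cosine and word overlap); real systems would use Transformer models such as the `SentenceTransformer` and `CrossEncoder` classes from the sentence-transformers library, not these stand-ins.

```python
from math import sqrt

def bi_encoder_embed(text: str) -> dict:
    # Bi-Encoder interface: each text is embedded INDEPENDENTLY.
    # Toy stand-in: a bag-of-words count vector instead of a dense embedding.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    # Simple geometry on the two separately produced vectors.
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_encoder_score(query: str, doc: str) -> float:
    # Cross-Encoder interface: query and document are scored TOGETHER,
    # so every query token can interact with every document token.
    # A real model would consume "[CLS] Query [SEP] Document [SEP]".
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

query = "how do rerankers work"
doc = "rerankers re-score retrieved documents"
print(cosine(bi_encoder_embed(query), bi_encoder_embed(doc)))
print(cross_encoder_score(query, doc))
```

Note that `cosine` can be computed against millions of pre-embedded documents, whereas `cross_encoder_score` requires a fresh forward pass per pair, which is why the Cross-Encoder is reserved for a small candidate set.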

How It Works

  • Stage 1 (Retrieval): A fast Bi-Encoder retrieves the top 100 potential documents from a vector database of millions.
  • Stage 2 (Concatenation): The system pairs the exact user query with each of the 100 documents.
  • Stage 3 (Cross-Encoding): A Cross-Encoder model processes each pair, allowing deep contextual interactions between the query and document tokens.
  • Stage 4 (Reranking): The model outputs a relevance score for each pair (often squashed to the 0–1 range with a sigmoid). The list is sorted by this score, and only the top 5 results are sent to the LLM.

Common Use Cases

  • Fixing 'Top-K' retrieval failures in enterprise RAG pipelines.
  • Improving search relevance for highly technical or domain-specific language where exact phrasing matters.

Related Terms