Rankify: Toolkit for Retrieval, Re-Ranking and RAG

Retrieval, re-ranking, and retrieval-augmented generation (RAG) are fundamental to modern information retrieval and knowledge-based generation. Researchers and practitioners often face a fragmented tooling landscape that complicates experimentation. Our Rankify framework addresses this by providing a unified, modular, and extensible open-source platform for seamless retrieval, re-ranking, and RAG workflows.

Have you ever struggled to find the best document retrieval model for your project? Or had to combine multiple frameworks just to get a basic information retrieval (IR) pipeline running? Rankify simplifies these tasks by providing an all-in-one toolkit for retrieval, re-ranking, and retrieval-augmented generation.

Information retrieval (IR) methods are essential for search engines, question answering, and knowledge-based applications. These systems typically operate in a two-stage pipeline: first, a retriever identifies a set of potentially relevant documents, and then a re-ranker refines the ranking to maximize relevance. While this approach has proven effective, existing tools for retrieval and re-ranking tend to be scattered across separate, often incompatible libraries, forcing researchers to stitch together multiple frameworks to experiment with different techniques.

Why use two stages?

Why are two-stage pipelines so widely adopted, rather than relying on a single, all-powerful retrieval method? At its core, it comes down to balancing performance with efficiency. Typically, you start with a lightweight retrieval approach—either keyword-based (like BM25) or one that uses neural embeddings for vector-based search. In the latter case, you embed the query using the same model that produced the document embeddings, and then measure their similarity (for example, with cosine similarity) to find matches.

Both keyword-based and vector-based retrieval keep computational costs low: you only need to encode one (usually short) query at inference time and run quick similarity checks. However, this first step is effectively “cold.” The representations of your documents were computed earlier and remain fixed, so they are not adapted to the new query. The system expects the model to encode documents and queries in a comparable way without knowing in advance exactly what the query will be.
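
To make this first stage concrete, here is a minimal sketch of vector-based retrieval over precomputed document embeddings; the doc_matrix and the encode call are placeholders for whatever embedding model you use, not part of any specific library.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 10):
    """Return the indices of the k documents most similar to the query vector."""
    # Normalise so that a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                      # one similarity score per document
    return np.argsort(-scores)[:k]

# doc_matrix was computed offline ("cold") and stays fixed; only the query
# is encoded at inference time. encode() is a placeholder, not a real API.
# top_ids = cosine_top_k(encode("who invented the telescope"), doc_matrix)
```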

Figure 1: A simplified view of the single-stage retrieval pipeline: documents are encoded and indexed offline, the query is encoded separately at inference time, and a similarity search matches them to retrieve the top 10 most relevant documents.

This is where re-ranking plays a pivotal role. During re-ranking, the model considers both the query and each candidate document at the same time (i.e., at inference). It can then determine how well they match, capturing subtle connections that might be missed in the initial retrieval step. The downside is that this process is expensive: you must run inference on every candidate document, and you can’t precompute representations in the same way as before. That makes it impractical for large-scale systems, where processing every document on the fly would be prohibitively costly.

So the solution is straightforward: combine both methods. First, use the faster approach to retrieve a smaller pool of documents—say 10, 50, or 100 that seem the most relevant. Then, apply a more computationally intensive re-ranking model to just that short list. This hybrid strategy gives you the efficiency of quick initial retrieval and the accuracy of a query-aware re-ranker, creating an effective two-stage pipeline.
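
As an illustration, the sketch below wires the two stages together with off-the-shelf libraries (rank_bm25 for the keyword stage and a sentence-transformers cross-encoder for re-ranking); the tiny corpus and the model name are just examples, and the first stage could equally be a dense retriever.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Galileo Galilei improved the telescope and observed Jupiter's moons in 1610.",
    "Hans Lippershey applied for a telescope patent in 1608.",
    "The Hubble Space Telescope was launched into low Earth orbit in 1990.",
]
query = "who invented the telescope"

# Stage 1: cheap keyword retrieval narrows the collection to a short candidate list.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
candidates = bm25.get_top_n(query.lower().split(), corpus, n=100)

# Stage 2: an expensive cross-encoder re-scores only those candidates,
# reading query and document together at inference time.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```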

Figure 2: A simplified view of the retrieve-then-rerank two-stage pipeline: a similarity search first retrieves the top-100 candidate documents, which a re-ranking model then refines into a final top-10 list.

Why do we need re-ranking?

Beyond the two-stage retrieval pipeline, there’s another key dimension to consider: the variety of re-ranking models themselves. Historically, re-ranking has been dominated by cross-encoder models—binary classifiers built on BERT-like architectures. These models take both the query and a candidate document as input and produce a relevance score (often interpreted as a probability). This approach, commonly referred to as pointwise scoring, assigns each document an independent score. While these scores are later used to determine the ranking, the model itself never compares documents against one another.

Over time, re-ranking methods have evolved beyond traditional models. One example is MonoT5, a model trained to classify documents as either “relevant” or “irrelevant”. The model predicts one of these labels, and the probability assigned to "relevant" serves as a relevance score for ranking.
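
A rough sketch of this pointwise scoring idea, assuming the publicly available castorini/monot5-base-msmarco checkpoint and the "Query: … Document: … Relevant:" prompt format described in the MonoT5 paper; exact token handling may differ between implementations.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Model name assumed from the public MonoT5 checkpoints; the score compares
# the logits of the "true" and "false" output tokens at the first decoding step.
model_name = "castorini/monot5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

def monot5_score(query: str, document: str) -> float:
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    decoder_input = torch.full((1, 1), model.config.decoder_start_token_id,
                               dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    true_id = tokenizer.encode("true")[0]
    false_id = tokenizer.encode("false")[0]
    # Probability mass on "true" vs. "false" serves as the relevance score.
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()
```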

More advanced approaches now adapt large language models (LLMs) for re-ranking tasks. For instance, BGE-Gemma2 fine-tunes a 9-billion-parameter LLM to compute relevance scores by estimating the likelihood of the "relevant" label.

Beyond these models, some methods take a different approach to re-ranking. For example:

  • Late Interaction Models (e.g., ColBERT) – Instead of jointly encoding the query and the document, these models encode them separately and compare their token-level representations only at scoring time, refining the ranking results (see the MaxSim sketch after this list).
  • Listwise Re-Ranking – Rather than scoring each document separately, these models analyze an entire set of documents at once and reorder them collectively. Traditionally, T5-based models have been used for this, but newer approaches explore LLMs in zero-shot mode (RankGPT) or fine-tune smaller models on outputs from more advanced models (RankZephyr).
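
To make the late-interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring over token-level embeddings; the randomly generated tensors merely stand in for the output of a real token-level encoder.

```python
import torch

def maxsim_score(query_tok_embs: torch.Tensor, doc_tok_embs: torch.Tensor) -> float:
    """Late-interaction (ColBERT-style MaxSim) scoring.

    query_tok_embs: [num_query_tokens, dim], doc_tok_embs: [num_doc_tokens, dim],
    both assumed L2-normalised per token by the encoder.
    """
    # Similarity of every query token against every document token.
    sim = query_tok_embs @ doc_tok_embs.T          # [q_tokens, d_tokens]
    # Each query token keeps only its best-matching document token;
    # the per-token maxima are summed into the relevance score.
    return sim.max(dim=1).values.sum().item()

# Token-level embeddings would come from a ColBERT-style encoder; faked here.
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
d = torch.nn.functional.normalize(torch.randn(40, 128), dim=1)
print(maxsim_score(q, d))
```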

In short, there are many approaches to re-ranking, each with its own strengths and weaknesses—and no single method works best in every situation. Figuring out which model is ideal for your use case (and sometimes fine-tuning your own model) can be challenging. What makes this even tougher is that different techniques often require specific input formats and produce their outputs differently, making it hard to switch between models.

Most existing retrieval and re-ranking toolkits focus on specific aspects of the pipeline, but none provide a unified system that seamlessly integrates retrieval, re-ranking, and retrieval-augmented generation (RAG). Rankify fills this gap by offering a comprehensive, flexible, and modular framework, enabling researchers and developers to experiment with different retrieval and ranking strategies effortlessly.

Additionally, datasets for evaluating retrieval and re-ranking methods are scattered across multiple sources, making benchmarking inconsistent and time-consuming.

To address these challenges, we developed Rankify, an open-source framework that integrates retrieval, re-ranking, and retrieval-augmented generation (RAG) into a single, modular system. Rankify is designed to streamline experimentation, making it easier for researchers to compare retrieval and ranking methodologies while ensuring consistency in benchmarking.

With built-in support for sparse and dense retrieval methods, state-of-the-art re-ranking models, and seamless RAG integration, Rankify enables users to:

  • Experiment with multiple retrieval and re-ranking techniques in a unified framework.
  • Utilize pre-retrieved datasets to benchmark and compare retrieval performance effectively.
  • Leverage retrieval-augmented generation (RAG) models to improve knowledge-based text generation.

By providing a cohesive environment for retrieval, re-ranking, and generative modeling, Rankify simplifies the process of building and evaluating information retrieval systems, helping researchers and practitioners advance their methodologies more efficiently.

Rankify

Rankify is a unified framework for retrieval, re-ranking, and retrieval-augmented generation (RAG), designed to streamline experimentation in search and information retrieval. Unlike existing solutions, Rankify offers a modular and flexible approach that integrates the following components (a short usage sketch follows the list):

  1. Retrieval: Supports both sparse (BM25) and dense (DPR, ColBERT, BGE, Contriever) methods, improving search accuracy through hybrid approaches.
  2. Re-Ranking: Enhances initial search results using cutting-edge ranking models, including RankT5, MonoT5, and transformer-based ranking methods.
  3. Retrieval-Augmented Generation (RAG): Enables context-aware text generation by feeding retrieved documents into LLMs like GPT, T5, and RankGPT.
  4. Pre-Retrieved Datasets & Benchmarking: Provides 40+ datasets with precomputed rankings for fair and reproducible evaluation.
  5. Built-in Evaluation Tools: Includes Top-K accuracy, Recall, Precision, and NDCG to measure performance seamlessly.
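
To give a feel for how these components fit together, here is a condensed usage sketch; the module paths, class names, and arguments are indicative, so please consult the official documentation for the exact, up-to-date API.

```python
# Illustrative end-to-end sketch; check the Rankify documentation for the
# exact module paths and arguments of the release you are using.
from rankify.dataset.dataset import Dataset
from rankify.models.reranking import Reranking
from rankify.generator.generator import Generator

# 1. Load a pre-retrieved dataset: BM25 passages for Natural Questions.
documents = Dataset(retriever="bm25", dataset_name="nq-dev", n_docs=100).download()

# 2. Re-rank the retrieved passages with a MonoT5 re-ranker.
Reranking(method="monot5", model_name="monot5-base-msmarco-10k").rank(documents)

# 3. Generate answers with a RAG method over the re-ranked passages.
generator = Generator(method="in-context-ralm", model_name="meta-llama/Llama-3.1-8B")
answers = generator.generate(documents)
```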

Figure 3: An architectural overview of Rankify, depicting its retrievers (BM25, DPR, ColBERT), re-rankers (RankT5, MonoBERT, FlashRank), RAG methods (zero-shot, Fusion-in-Decoder, in-context learning), and benchmark corpora (Wikipedia, MS MARCO, NQ). The framework integrates various retrieval techniques, re-ranking models, and pre-retrieved datasets to support benchmarking and experimentation in information retrieval.

Join the future of information retrieval with Rankify!

📌 Get Started Today!

DOI: https://www.doi.org/10.48763/000013

This work is licensed under a Creative Commons Attribution 4.0 International License.

Written by Abdelrahman Abdallah in April/May 2025

PhD candidate at the Department of Computer Science (Data Science Group)

University of Innsbruck

About the author

I am a PhD candidate in the Data Science Group, supervised by Prof. Adam Jatowt. My research focuses on natural language processing and information retrieval, with an emphasis on Dense Retrieval, and open-domain question answering systems. I have published articles in top-tier IR/NLP conferences and leading journals.

