RAG · Evaluation · ML Engineering · MLOps · Information Retrieval · BM25 · FAISS · FastAPI · Python

RAG Benchmark Service

Benchmark-driven retrieval system for scientific question answering. Built with evaluation-first discipline, dual retrieval baselines, and production MLOps infrastructure.

TL;DR

  • Evaluation-first RAG system using SciFact benchmark (5,183 docs, 300 test queries)
  • Dual retrieval baselines (BM25 + dense/FAISS) with offline metrics and comparison reports
  • Production FastAPI service with grounded answer generation and citation enforcement
  • Full ML lifecycle: drift detection, scheduled evaluation, gated promotion
  • 239 tests covering validation, retrieval, API contracts, and end-to-end RAG flows

What I Built

The system follows a clear data → retrieval → evaluation → serving → monitoring pipeline.

Data & Benchmarking

I integrated the SciFact dataset from the BEIR benchmark suite with full validation and statistics reporting. The pipeline downloads corpus documents, test queries, and relevance judgments (qrels), then validates schema consistency and referential integrity.
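
A minimal sketch of the kind of referential-integrity check this implies, assuming the standard BEIR file layout (corpus.jsonl, queries.jsonl, and a tab-separated qrels file); function and field names are illustrative, not the project's exact code:

```python
# Validate that every qrels judgment points at a known query and document.
import json
from pathlib import Path

def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

def validate_dataset(corpus_path: Path, queries_path: Path, qrels_path: Path) -> dict:
    corpus_ids = {doc["_id"] for doc in load_jsonl(corpus_path)}
    query_ids = {q["_id"] for q in load_jsonl(queries_path)}

    missing_docs, missing_queries, n_judgments = set(), set(), 0
    with qrels_path.open() as f:
        next(f)  # skip header: query-id <tab> corpus-id <tab> score
        for line in f:
            qid, did, _score = line.rstrip("\n").split("\t")
            n_judgments += 1
            if did not in corpus_ids:
                missing_docs.add(did)
            if qid not in query_ids:
                missing_queries.add(qid)

    assert not missing_docs and not missing_queries, "qrels reference unknown ids"
    return {"docs": len(corpus_ids), "queries": len(query_ids), "judgments": n_judgments}
```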

Retrieval Baselines

Built two complementary retrieval approaches to understand trade-offs between lexical and semantic search:

  • BM25 baseline with deterministic indexing and content-addressed artifacts
  • Dense retrieval using sentence-transformers embeddings with FAISS flat index

Both retrievers expose a unified interface for plug-and-play comparison.
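
A sketch of what that shared contract and factory can look like. Class names, the embedding model, and the whitespace tokenization are assumptions for illustration, not the project's exact implementation:

```python
from typing import Protocol

import faiss                                  # flat index for dense search
import numpy as np
from rank_bm25 import BM25Okapi               # lexical baseline
from sentence_transformers import SentenceTransformer

class Retriever(Protocol):
    def search(self, query: str, k: int = 10) -> list[tuple[str, float]]:
        """Return (doc_id, score) pairs, highest score first."""
        ...

class BM25Retriever:
    def __init__(self, doc_ids: list[str], texts: list[str]):
        self.doc_ids = doc_ids
        self.index = BM25Okapi([t.lower().split() for t in texts])

    def search(self, query: str, k: int = 10) -> list[tuple[str, float]]:
        scores = self.index.get_scores(query.lower().split())
        top = np.argsort(scores)[::-1][:k]
        return [(self.doc_ids[i], float(scores[i])) for i in top]

class DenseRetriever:
    def __init__(self, doc_ids: list[str], texts: list[str],
                 model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):  # assumed model
        self.doc_ids = doc_ids
        self.model = SentenceTransformer(model_name)
        emb = self.model.encode(texts, normalize_embeddings=True)
        self.index = faiss.IndexFlatIP(emb.shape[1])   # inner product on unit vectors = cosine
        self.index.add(np.asarray(emb, dtype="float32"))

    def search(self, query: str, k: int = 10) -> list[tuple[str, float]]:
        q = self.model.encode([query], normalize_embeddings=True)
        scores, idx = self.index.search(np.asarray(q, dtype="float32"), k)
        return [(self.doc_ids[i], float(s)) for i, s in zip(idx[0], scores[0])]

def build_retriever(kind: str, doc_ids: list[str], texts: list[str]) -> Retriever:
    """Factory used by both the eval harness and the API."""
    if kind == "bm25":
        return BM25Retriever(doc_ids, texts)
    if kind == "dense":
        return DenseRetriever(doc_ids, texts)
    raise ValueError(f"unknown retriever kind: {kind!r}")
```

Because both retrievers return the same (doc_id, score) shape, the evaluation harness and the API never need to know which backend is active.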

Evaluation Harness

Offline metrics drive all decisions. I implemented Recall@K, nDCG@K, and MRR@K with side-by-side retriever comparison and delta reporting. Results are timestamped and archived for reproducibility.
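
The metrics themselves are small enough to sketch directly. The version below assumes binary relevance judgments (a set of relevant doc ids per query), which is how SciFact qrels are typically used:

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(r + 1) for r, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

def evaluate(run: dict[str, list[str]], qrels: dict[str, set[str]], k: int = 10) -> dict[str, float]:
    """Average each metric over all labeled queries; run maps query id -> ranked doc ids."""
    rows = [(recall_at_k(run[q], rel, k), ndcg_at_k(run[q], rel, k), mrr_at_k(run[q], rel, k))
            for q, rel in qrels.items()]
    n = len(rows)
    return {f"Recall@{k}": sum(r[0] for r in rows) / n,
            f"nDCG@{k}": sum(r[1] for r in rows) / n,
            f"MRR@{k}": sum(r[2] for r in rows) / n}
```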

RAG Generation

The system integrates with OpenRouter for LLM inference, enforcing strict citation discipline in prompts. When retrieval quality is weak, the system abstains with "Insufficient evidence" rather than hallucinating.
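
A hedged sketch of that flow: the abstention threshold, prompt wording, environment variable, and model id below are placeholders rather than the project's actual values; only the OpenRouter chat-completions call and the "Insufficient evidence" behaviour come from the description above:

```python
import os
import requests

ABSTAIN_THRESHOLD = 0.3   # assumed minimum top-1 retrieval score
SYSTEM_PROMPT = (
    "Answer ONLY from the numbered context passages. Cite every claim as [n]. "
    "If the passages do not contain the answer, reply exactly: Insufficient evidence."
)

def answer(query: str, hits: list[tuple[str, float, str]]) -> str:
    """hits: (doc_id, score, text) from the retriever, best first."""
    if not hits or hits[0][1] < ABSTAIN_THRESHOLD:
        return "Insufficient evidence"          # abstain instead of hallucinating

    context = "\n\n".join(f"[{i}] ({d}) {t}" for i, (d, _, t) in enumerate(hits, start=1))
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "openai/gpt-4o-mini",      # placeholder model id
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```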

Production Service

Deployed as a FastAPI service with four endpoints: /health, /v1/retrieve, /v1/query, and /metrics. All requests include timing instrumentation and are logged to JSONL for offline analysis.
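
The timing-plus-logging piece is typically a single FastAPI middleware. A minimal sketch, with the log path and record fields chosen for illustration:

```python
import json
import time
from pathlib import Path

from fastapi import FastAPI, Request

app = FastAPI()
LOG_PATH = Path("logs/requests.jsonl")
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    record = {
        "ts": time.time(),
        "path": request.url.path,
        "status": response.status_code,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }
    with LOG_PATH.open("a") as f:        # one JSON object per line, appended
        f.write(json.dumps(record) + "\n")
    return response

@app.get("/health")
def health():
    return {"status": "ok"}
```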

ML Lifecycle & Monitoring

Continuous improvement is automated through several mechanisms:

  • Drift detection using PSI to track query length and similarity score distributions
  • Scheduled evaluation with configurable query sampling every Sunday at 03:00 UTC
  • Gated promotion requiring 2% relative nDCG@10 improvement with no Recall@10 regression
  • GitHub Actions orchestrating eval, drift checks, and promotion decisions

Candidate retrievers must pass metric thresholds before updating the production pointer file.
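
A sketch of that gate, assuming candidate and production metrics arrive as plain dicts and the production pointer is a small JSON file (both assumptions):

```python
import json
from pathlib import Path

MIN_NDCG_RELATIVE_GAIN = 0.02   # 2% relative improvement on nDCG@10

def should_promote(candidate: dict, production: dict) -> bool:
    ndcg_gain = (candidate["nDCG@10"] - production["nDCG@10"]) / production["nDCG@10"]
    no_recall_regression = candidate["Recall@10"] >= production["Recall@10"]
    return ndcg_gain >= MIN_NDCG_RELATIVE_GAIN and no_recall_regression

def promote(candidate_metrics: dict, prod_metrics: dict, index_hash: str,
            pointer_path: Path = Path("artifacts/production.json")) -> bool:
    if not should_promote(candidate_metrics, prod_metrics):
        return False
    # Atomic update: write a temp file, then rename it over the pointer.
    tmp = pointer_path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"index_sha256": index_hash, "metrics": candidate_metrics}))
    tmp.replace(pointer_path)
    return True
```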


Why It Matters

ML/Research Discipline

I prioritized reproducibility and rigorous evaluation over quick wins. The system uses a standard BEIR dataset with held-out evaluation sets and qrels. Metrics capture multiple dimensions: Recall@K for coverage, nDCG for ranking quality, MRR for first-hit performance. Grounded prompts enforce citations and abstention when evidence is insufficient.

ML Engineering & MLOps

Production ML systems require versioning, monitoring, and gated deployments. Indexes are identified by SHA256 content hash. PSI-based drift detection tracks distribution shifts. The production pointer pattern enables atomic updates with full audit trails. GitHub Actions runs weekly evaluation with artifact uploads.
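
For reference, a minimal PSI sketch over one logged feature such as query length or top-1 similarity score; the bin count and the usual 0.1/0.2 thresholds are common conventions assumed here, not values taken from the project:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a current sample."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))   # quantile bins from baseline
    p = np.histogram(baseline, bins=edges)[0] / len(baseline)
    q = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)          # avoid log(0)
    return float(np.sum((q - p) * np.log(q / p)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 drift alert.
```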

Software Engineering

The codebase follows clean architecture principles with modular separation of concerns and a factory pattern for retriever interchangeability. 239 tests cover data validation, retrieval correctness, API contracts, and end-to-end RAG smoke tests. Developer experience is prioritized with Makefile targets, Docker for local serving, and environment-based configuration.


Architecture

The system follows a largely linear pipeline, with monitoring results feeding back into retriever evaluation and promotion.

SciFact Dataset
        ↓
[Data Pipeline: Download → Validate → Stats]
        ↓
[Indexing: BM25 + Dense/FAISS]
        ↓
[Retriever Factory] → [FastAPI Service]
        ↓                      ↓
  [Offline Eval]          [/v1/query] → [Request Logs (JSONL)]
        ↓                      ↓
 [Scheduled Eval]       [Drift Detection (PSI)]
        ↓                      ↓
 [Promotion Gate] ←────────────┘
        ↓
[Production Pointer File] → [API Reloads Retriever]

Results

I evaluated both retrievers on SciFact (300 queries with labeled relevance judgments):

Retriever    Recall@10    nDCG@10    MRR@10
BM25         0.776        0.652      0.619
Dense        0.783        0.645      0.605

Dense retrieval achieves slightly higher recall (+0.9% relative; better coverage), while BM25 shows slightly higher nDCG (+1.1% relative; better ranking quality). Both are viable baselines; the choice depends on domain requirements and optimization targets.

All results are reproducible via make eval_bm25 and make eval_dense.


Reliability & Safety

The system enforces strict citation discipline through prompt engineering. All claims must reference retrieved documents, and the system abstains with "Insufficient evidence" when retrieval quality is low. End-to-end tests verify citation correctness.

This is a demonstration system. Users must verify critical information independently.


Future Directions

The roadmap focuses on evaluation rigor and production readiness:

Answer Quality Evaluation — Automated groundedness testing, citation coverage metrics, and regression detection for generation quality.

Reranker Integration — Cross-encoder second-stage ranking to improve relevance beyond first-stage retrieval.

Multi-Domain Benchmarks — Expand evaluation to additional BEIR datasets (NFCorpus, FiQA) to test generalization.

CI/CD Enhancements — PR-level smoke evaluation with quality gates, Slack/email alerting on drift or regression.

Production Hardening — API key management, rate limiting, and authentication layer for deployment.