Evaluation¶
RAGnarok-AI provides comprehensive metrics for evaluating RAG pipelines.
Retrieval Metrics¶
Measure how well your retrieval system finds relevant documents.
Precision@K¶
Fraction of the top-K retrieved documents that are relevant.
from ragnarok_ai.evaluators import precision_at_k
retrieved = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
relevant = ["doc_1", "doc_3", "doc_7"]
p = precision_at_k(retrieved, relevant, k=5)
# p = 2/5 = 0.4 (doc_1 and doc_3 are relevant)
Recall@K¶
Fraction of all relevant documents that appear in the top-K retrieved results.
from ragnarok_ai.evaluators import recall_at_k
r = recall_at_k(retrieved, relevant, k=5)
# r = 2/3 = 0.67 (found 2 of 3 relevant docs)
MRR (Mean Reciprocal Rank)¶
Reciprocal rank of the first relevant document, averaged across queries.
from ragnarok_ai.evaluators import mrr
m = mrr(retrieved, relevant)
# m = 1.0 (first relevant doc is at position 1)
NDCG@K (Normalized Discounted Cumulative Gain)¶
Measures ranking quality with position-based discounting.
from ragnarok_ai.evaluators import ndcg_at_k
n = ndcg_at_k(retrieved, relevant, k=5)
# Higher score = relevant docs ranked higher
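For intuition, here is the standard binary-relevance NDCG computed by hand for the example lists above. This is a sketch of the textbook formula and may differ in small details from the library's implementation:
import math

retrieved = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
relevant = {"doc_1", "doc_3", "doc_7"}

# DCG@5: each relevant hit at rank r contributes 1 / log2(r + 1)
dcg = sum(
    1 / math.log2(rank + 1)
    for rank, doc in enumerate(retrieved, start=1)
    if doc in relevant
)

# Ideal DCG@5: all relevant docs ranked at the top
idcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), 5) + 1))

print(dcg / idcg)  # ~0.70: doc_1 at rank 1 counts fully, doc_3 at rank 3 is discounted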
Full Retrieval Evaluation¶
Use evaluate_retrieval for all metrics at once:
from ragnarok_ai.core.types import Query, Document, RetrievalResult
from ragnarok_ai.evaluators.retrieval import evaluate_retrieval
query = Query(
    text="What is Python?",
    ground_truth_docs=["doc_1", "doc_3"],
)

retrieved_docs = [
    Document(id="doc_1", content="Python is a programming language..."),
    Document(id="doc_2", content="Java is another language..."),
    Document(id="doc_3", content="Python was created by Guido..."),
]

result = RetrievalResult(
    query=query,
    retrieved_docs=retrieved_docs,
    scores=[0.95, 0.85, 0.80],
)
metrics = evaluate_retrieval(result, k=10)
print(f"Precision@10: {metrics.precision:.2f}")
print(f"Recall@10: {metrics.recall:.2f}")
print(f"MRR: {metrics.mrr:.2f}")
print(f"NDCG@10: {metrics.ndcg:.2f}")
Generation Metrics¶
Evaluate answer quality with LLM-as-Judge; see LLM-as-Judge for setup and scoring details. A minimal sketch of the scores you work with appears after the table below.
| Metric | Description |
|---|---|
| Faithfulness | Is the answer grounded in context? |
| Relevance | Does the answer address the question? |
| Hallucination | Does the answer contain fabricated info? |
| Completeness | Are all aspects of the question covered? |
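The judge interface itself is covered on the LLM-as-Judge page. Purely to show the shape of the scores you work with afterwards, here is a minimal sketch; the GenerationScores container, its field names, and the 0-to-1 scale are assumptions for illustration, not the library's actual types:
from dataclasses import dataclass

@dataclass
class GenerationScores:
    """Hypothetical holder for the four judge metrics, each assumed to be in [0, 1]."""
    faithfulness: float
    relevance: float
    hallucination: float
    completeness: float

# Illustrative values for a single judged answer
scores = GenerationScores(
    faithfulness=0.92,   # grounded in the retrieved context
    relevance=0.88,      # addresses the question
    hallucination=0.05,  # little fabricated information (scale direction is an assumption)
    completeness=0.75,   # most aspects of the question covered
)
print(scores)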
Batch Evaluation¶
Evaluate multiple queries efficiently:
from ragnarok_ai.core.types import TestSet, Query, RetrievalResult
from ragnarok_ai.evaluators.retrieval import evaluate_retrieval

testset = TestSet(
    queries=[
        Query(text="Question 1", ground_truth_docs=["doc_1"]),
        Query(text="Question 2", ground_truth_docs=["doc_2", "doc_3"]),
        # ...
    ]
)
results = []
for query in testset.queries:
    # Get your RAG results
    response = await my_rag.query(query.text)

    # Evaluate
    result = RetrievalResult(
        query=query,
        retrieved_docs=response.retrieved_docs,
        scores=[0.9, 0.8, 0.7],  # replace with the scores returned by your retriever
    )
    metrics = evaluate_retrieval(result, k=10)
    results.append(metrics)
# Aggregate
avg_precision = sum(m.precision for m in results) / len(results)
avg_recall = sum(m.recall for m in results) / len(results)
Checkpointing¶
Resume an evaluation after a crash instead of starting over:
from ragnarok_ai.checkpoint import CheckpointManager
checkpoint = CheckpointManager("./checkpoints/eval.ckpt")
for i, query in enumerate(testset.queries):
    # Skip already processed queries
    if checkpoint.is_completed(i):
        continue

    # Evaluate (evaluate_query is your own per-query evaluation coroutine)
    result = await evaluate_query(query)

    # Save progress
    checkpoint.save(i, result)
Cost Tracking¶
Track evaluation costs:
results = await evaluate(rag, testset, track_cost=True)
print(results.cost)
# +--------------------+------------+----------+
# | Provider | Tokens | Cost |
# +--------------------+------------+----------+
# | ollama (local) | 45,230 | $0.00 |
# | openai | 12,500 | $0.38 |
# +--------------------+------------+----------+
Interpreting Results¶
Retrieval Metrics¶
| Score | Interpretation |
|---|---|
| > 0.8 | Excellent retrieval |
| 0.6 - 0.8 | Good, room for improvement |
| < 0.6 | Needs attention |
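For example, the bands above can be applied directly to the averaged metrics from the batch-evaluation section. The thresholds come from this table's guidance; they are not values enforced by the library:
def interpret(score: float) -> str:
    """Map an averaged retrieval metric onto the guidance bands above."""
    if score > 0.8:
        return "Excellent retrieval"
    if score >= 0.6:
        return "Good, room for improvement"
    return "Needs attention"

print(f"Precision: {avg_precision:.2f} -> {interpret(avg_precision)}")
print(f"Recall: {avg_recall:.2f} -> {interpret(avg_recall)}")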
LLM-as-Judge¶
| Score | Verdict |
|---|---|
| >= 0.7 | PASS |
| >= 0.4 and < 0.7 | PARTIAL |
| < 0.4 | FAIL |
Advisory Metrics
LLM-as-Judge scores are advisory. Use them for insights, not hard CI gates.
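In the same spirit, per-query judge scores can be summarised into an advisory pass rate for a report or dashboard (illustrative numbers, plain Python, not a library API):
judge_scores = [0.82, 0.55, 0.91, 0.33]  # per-query judge scores from your evaluation run
pass_rate = sum(s >= 0.7 for s in judge_scores) / len(judge_scores)
partial_rate = sum(0.4 <= s < 0.7 for s in judge_scores) / len(judge_scores)
print(f"PASS: {pass_rate:.0%}, PARTIAL: {partial_rate:.0%}")  # report as insight, not a hard CI gate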
Next Steps¶
- LLM-as-Judge — Prometheus 2 evaluation
- Benchmarking — Track performance over time
- CLI Reference — Command-line interface