Benchmarking¶
Track performance over time and detect regressions.
Overview¶
RAGnarok-AI provides benchmark tracking:
- History — Store evaluation results over time
- Baselines — Set reference points for comparison
- Regression Detection — Alert on quality drops
- Comparison — Side-by-side analysis
CLI Usage¶
Run Demo¶
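The demo simulates three benchmark runs with canned metrics, so you can exercise history tracking and regression detection end to end (the same --demo flag used for CI gating below):

ragnarok benchmark --demo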
Output:
RAGnarok-AI Benchmark Demo
========================================
Simulating 3 benchmark runs over time...
Run 1 (Baseline)
Precision: 0.72 Recall: 0.68
MRR: 0.75 NDCG: 0.70
Average: 0.71 -> Set as baseline
Run 2 (Improved)
Precision: 0.78 Recall: 0.74
MRR: 0.80 NDCG: 0.76
Average: 0.77
Run 3 (Regression)
Precision: 0.65 Recall: 0.60
MRR: 0.68 NDCG: 0.62
Average: 0.64
----------------------------------------
Regression Detection (Run 3 vs Baseline)
----------------------------------------
REGRESSION DETECTED:
  - precision: 0.72 -> 0.65 (-9.7%)
  - recall: 0.68 -> 0.60 (-11.8%)
List Configurations¶
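A sketch of the invocation, assuming the benchmark command exposes a --list flag (the flag name is an assumption; consult the CLI help for the actual interface):

ragnarok benchmark --list   # hypothetical flag: list tracked configuration names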
View History¶
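Viewing the recorded runs for one configuration might look like this (the --history flag is likewise an assumption; check the CLI help):

ragnarok benchmark --history my-rag-v1   # hypothetical flag: show runs for one config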
Python API¶
Record Benchmark¶
from ragnarok_ai.benchmarks import BenchmarkHistory
from ragnarok_ai.benchmarks.storage import JSONFileStore

store = JSONFileStore("./benchmarks.json")
history = BenchmarkHistory(store=store)

# Record evaluation result
record = await history.record(
    eval_result=result,
    config_name="my-rag-v1",
    testset=testset,
)

# Set as baseline
await history.set_baseline(record.id)
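The stored baseline then serves as the reference point for the regression detection shown next.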
Detect Regression¶
from ragnarok_ai.regression import RegressionDetector, RegressionThresholds

detector = RegressionDetector(
    baseline=baseline_result,
    thresholds=RegressionThresholds(
        precision_drop=0.05,  # Alert if precision drops > 5%
        recall_drop=0.05,
    ),
)

regression = detector.detect(current_result)
if regression.has_regressions:
    for alert in regression.alerts:
        print(f"{alert.metric}: {alert.baseline_value:.2f} -> {alert.current_value:.2f}")
Storage¶
Benchmark history is stored in JSON format:
{
  "records": [
    {
      "id": "abc123",
      "timestamp": "2026-02-13T10:00:00Z",
      "config_name": "my-rag-v1",
      "is_baseline": true,
      "metrics": {
        "precision": 0.72,
        "recall": 0.68,
        "mrr": 0.75,
        "ndcg": 0.70
      }
    }
  ]
}
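Because the history is plain JSON, you can inspect it without the library. A minimal sketch using only the standard library, with the path taken from the earlier example:

import json

# Load the history written by JSONFileStore
with open("./benchmarks.json") as f:
    data = json.load(f)

# Print each record's timestamp, config name, average metric score, and baseline flag
for rec in data["records"]:
    avg = sum(rec["metrics"].values()) / len(rec["metrics"])
    marker = " (baseline)" if rec["is_baseline"] else ""
    print(f"{rec['timestamp']} {rec['config_name']}: avg={avg:.2f}{marker}")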
Default location: .ragnarok/benchmarks.json
Custom location:
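Any writable path works; pass it to the store as in the Python example above (the filename here is illustrative):

from ragnarok_ai.benchmarks.storage import JSONFileStore

store = JSONFileStore("./my-benchmarks.json")  # custom path instead of the default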
Thresholds¶
Configure regression thresholds:
from ragnarok_ai.regression import RegressionThresholds

thresholds = RegressionThresholds(
    precision_drop=0.05,  # 5% drop
    recall_drop=0.05,
    mrr_drop=0.10,  # 10% drop
    ndcg_drop=0.05,
)
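Applied to the demo's Run 3, these values flag recall (0.68 -> 0.60, about -11.8%, beyond its 5% limit) but not MRR (0.75 -> 0.68, about -9.3%, within its 10% allowance).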
CI/CD Integration¶
Use --fail-under for quality gates:
ragnarok benchmark --demo --fail-under 0.7
# Exit code 0 if average >= 0.7
# Exit code 1 if average < 0.7
For GitHub Actions, use the RAGnarok Action with regression detection:
- uses: 2501Pr0ject/ragnarok-evaluate-action@v1
  with:
    threshold: 0.8
    fail-on-threshold: false  # Advisory mode
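With fail-on-threshold: false the action reports results without failing the job; set it to true to enforce the threshold as a hard gate.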
Best Practices¶
- Set a baseline — Mark your initial evaluation as the baseline
- Track over time — Run benchmarks on every significant change
- Use thresholds — Define acceptable regression limits
- Review trends — Monitor history for gradual degradation
Next Steps¶
- GitHub Action — CI/CD integration
- CLI Reference — Full command reference