
LLM-as-Judge

Evaluate answer quality using Prometheus 2, a specialized evaluation model.


Overview

LLM-as-Judge provides multi-criteria evaluation:

Criterion       Description
-------------   -----------------------------------------------
Faithfulness    Is the answer grounded in the provided context?
Relevance       Does the answer address the question?
Hallucination   Does the answer contain fabricated information?
Completeness    Are all aspects of the question covered?

Setup

Install Prometheus 2 via Ollama:

ollama pull hf.co/RichardErkhov/prometheus-eval_-_prometheus-7b-v2.0-gguf:Q5_K_M

Requirements:

  • ~5GB disk space
  • 16GB RAM recommended
  • Ollama running (ollama serve)
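To confirm the pull succeeded and that Ollama is reachable, you can query its model-listing endpoint. A quick sketch using httpx (any HTTP client would do; /api/tags is the standard Ollama endpoint for listing locally available models):

import httpx

# Ask the local Ollama server which models it has available.
tags = httpx.get("http://localhost:11434/api/tags").json()
names = [m["name"] for m in tags.get("models", [])]

# The Prometheus 2 model should appear once the pull has finished.
print(any("prometheus" in n.lower() for n in names))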

Basic Usage

Python API

import asyncio

from ragnarok_ai.evaluators.judge import LLMJudge


async def main() -> None:
    judge = LLMJudge()

    # Evaluate faithfulness of the answer against the supplied context
    result = await judge.evaluate_faithfulness(
        context="Paris is the capital of France. It has a population of 2.1 million.",
        question="What is the capital of France?",
        answer="Paris is the capital of France.",
    )

    print(f"Verdict: {result.verdict}")      # PASS
    print(f"Score: {result.score:.2f}")      # 0.85
    print(f"Explanation: {result.explanation}")


asyncio.run(main())

CLI

ragnarok judge \
  --context "Paris is the capital of France." \
  --question "What is the capital of France?" \
  --answer "Paris is the capital of France."

Evaluation Criteria

Faithfulness

Checks if the answer is grounded in the provided context.

result = await judge.evaluate_faithfulness(
    context="Python was created by Guido van Rossum in 1991.",
    question="Who created Python?",
    answer="Python was created by Guido van Rossum.",
)
# PASS - answer is supported by context

Relevance

Checks if the answer addresses the question.

result = await judge.evaluate_relevance(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
)
# PASS - directly answers the question

Hallucination Detection

Checks for fabricated information not in the context.

result = await judge.detect_hallucination(
    context="Python was created by Guido van Rossum.",
    answer="Python was created by Guido van Rossum in the Netherlands in 1991.",
)
# PARTIAL - "1991" and "Netherlands" not in context

Completeness

Checks if the answer covers all aspects of the question.

result = await judge.evaluate_completeness(
    question="What is Python and who created it?",
    answer="Python is a programming language.",
    context="Python is a programming language created by Guido van Rossum.",
)
# PARTIAL - missing creator information

Batch Evaluation

Evaluate multiple items from a file:

items.json:

[
  {"context": "...", "question": "...", "answer": "..."},
  {"context": "...", "question": "...", "answer": "..."}
]

ragnarok judge --file items.json --criteria faithfulness,relevance
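If you build items.json programmatically, plain json.dump produces this format. A short sketch (the example rows are illustrative):

import json

# Each item needs the same three fields the CLI expects.
items = [
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
    },
    {
        "context": "Python was created by Guido van Rossum.",
        "question": "Who created Python?",
        "answer": "Guido van Rossum created Python.",
    },
]

with open("items.json", "w") as f:
    json.dump(items, f, indent=2)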

Select Criteria

Evaluate specific criteria only:

# Single criterion
ragnarok judge --file items.json --criteria faithfulness

# Multiple criteria
ragnarok judge --file items.json --criteria faithfulness,relevance

# All criteria (default)
ragnarok judge --file items.json --criteria all

Medical Mode

Medical mode reduces false positives in healthcare RAG evaluation, where a clinical abbreviation and its expansion should count as equivalent:

judge = LLMJudge(medical_mode=True)

result = await judge.evaluate_faithfulness(
    context="Patient diagnosed with CHF.",
    question="What condition does the patient have?",
    answer="Patient has congestive heart failure.",
)
# PASS - "CHF" and "congestive heart failure" are equivalent

Features:

  • 350+ medical abbreviations (CHF, MI, COPD, DVT...)
  • Context-aware disambiguation
  • Multiple formats: dotted (q.d.), slash (s/p), mixed-case (SpO2)

Scoring

Prometheus 2 uses a 1-5 rubric, normalized to 0-1:

Raw Score   Normalized   Verdict
---------   ----------   -------
5           1.0          PASS
4           0.75         PASS
3           0.5          PARTIAL
2           0.25         FAIL
1           0.0          FAIL

Verdict thresholds:

  • PASS: score >= 0.7
  • PARTIAL: 0.4 <= score < 0.7
  • FAIL: score < 0.4
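The mapping from the 1-5 rubric onto 0-1 is linear. A minimal sketch of the arithmetic and thresholds (the helper names are illustrative, not part of the library API):

def normalize(raw: int) -> float:
    """Map a raw 1-5 rubric score onto the 0-1 range."""
    return (raw - 1) / 4

def verdict(score: float) -> str:
    """Apply the documented verdict thresholds."""
    if score >= 0.7:
        return "PASS"
    if score >= 0.4:
        return "PARTIAL"
    return "FAIL"

assert verdict(normalize(4)) == "PASS"     # 0.75
assert verdict(normalize(3)) == "PARTIAL"  # 0.50
assert verdict(normalize(2)) == "FAIL"     # 0.25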

Performance

Measured on an Apple M2 with 16GB of RAM:

Criterion       Avg Time
-------------   --------
Faithfulness    ~25s
Relevance       ~22s
Hallucination   ~28s
Completeness    ~24s

Keep Alive

RAGnarok-AI uses keep_alive by default to prevent Ollama from unloading the model between requests.
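At the Ollama level, keep_alive is a request field controlling how long the model stays loaded after a call. An illustrative raw request, to show what the setting does (this is not how RAGnarok-AI invokes it internally):

import httpx

# keep_alive tells Ollama to keep the model in memory after this request,
# avoiding a multi-second model reload on the next evaluation.
httpx.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/RichardErkhov/prometheus-eval_-_prometheus-7b-v2.0-gguf:Q5_K_M",
        "prompt": "Hello",
        "keep_alive": "10m",  # duration string; -1 keeps the model loaded indefinitely
    },
    timeout=120,
)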


Configuration

Custom Model

judge = LLMJudge(
    model="llama3",  # Use different model
    base_url="http://localhost:11434",
)

CLI Options

ragnarok judge \
  --file items.json \
  --model llama3 \
  --ollama-url http://localhost:11434 \
  --fail-under 0.7 \
  --output results.json

Output Formats

Console Output

  RAGnarok-AI LLM-as-Judge
  ========================================

  Items to evaluate: 1
  Criteria: faithfulness, relevance

  [1/1] Evaluating: What is the capital of France?

  ----------------------------------------
  Results
  ----------------------------------------

  Item 1:
    Question: What is the capital of France?
    [+] faithfulness: 0.85 (PASS)
    [+] relevance: 0.90 (PASS)
    Average: 0.88

  ----------------------------------------
  Overall Average: 0.8750

JSON Output

ragnarok judge --file items.json --json
{
  "command": "judge",
  "status": "pass",
  "version": "1.4.0",
  "data": {
    "items_evaluated": 1,
    "criteria": ["faithfulness", "relevance"],
    "results": [
      {
        "question": "What is the capital of France?",
        "criteria": {
          "faithfulness": {"verdict": "PASS", "score": 0.85, "explanation": "..."},
          "relevance": {"verdict": "PASS", "score": 0.90, "explanation": "..."}
        },
        "average_score": 0.875
      }
    ],
    "overall_average": 0.875
  }
}

CI/CD Integration

Use --fail-under for quality gates:

ragnarok judge --file items.json --fail-under 0.7

# Exit code 0 if average >= 0.7
# Exit code 1 if average < 0.7
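For finer-grained gating, you can parse the JSON report yourself. A sketch against the schema shown above, assuming --output results.json writes the same structure as --json:

import json
import sys

with open("results.json") as f:
    report = json.load(f)

# Gate on the worst item rather than the overall average,
# so one bad answer cannot hide behind several good ones.
worst = min(item["average_score"] for item in report["data"]["results"])
sys.exit(0 if worst >= 0.7 else 1)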

Advisory Scores

LLM-as-Judge scores are advisory. For CI/CD, consider using fail-on-threshold: false in the GitHub Action.

