DeepEval helps to test and verify the correctness of LLM outputs. It is an evaluation framework that provides a suite of metrics and synthetic data generation, with integrations across the leading AI/ML libraries.
DeepEval can be used to set up one LLM to judge the output of another LLM. This JudgeLLM setup can be used both at training time and at live inference time in MLOps scenarios.
Getting started with DeepEval is simple with Ollama.
(I) Installation
pip install deepeval
Ollama installation was covered previously with a llama3.2 base model.
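If the llama3.2 model has not been pulled yet, it can be fetched with the Ollama CLI (a quick sketch, assuming Ollama is already installed and serving on its default port, http://localhost:11434):
# Pull the 1B-parameter llama3.2 model used in this walkthrough
ollama pull llama3.2:1b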
(II) Set the Ollama model in DeepEval
# Unset the openai model - default for DeepEval
deepeval unset-openai
# Set ollama model for DeepEval
deepeval set-ollama "llama3.2:1b" --base-url="http://localhost:11434"
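To switch back to DeepEval's default OpenAI judge later, the Ollama setting can be reverted:
# Revert DeepEval to its default model provider
deepeval unset-ollama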
(III) Create the JudgeLLM.py script
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Set up the Ollama model that will act as the judge
model = OllamaModel(
    model="llama3.2:1b",
    base_url="http://localhost:11434",
    temperature=0.0,  # example: setting a custom temperature for deterministic judging
)
# Set up the evaluation metric (GEval, judged by the Ollama model)
correctness_metric = GEval(
    name="Correctness",
    # NOTE: you can only provide either criteria or evaluation_steps, not both
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # evaluation_steps=["Check whether the facts are true"],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model=model,  # the Ollama judge model
    rubric=[
        Rubric(score_range=(0, 2), expected_outcome="Factually incorrect."),
        Rubric(score_range=(3, 6), expected_outcome="Mostly correct."),
        Rubric(score_range=(7, 9), expected_outcome="Correct but missing minor details."),
        Rubric(score_range=(10, 10), expected_outcome="100% correct."),
    ],
    # threshold=0.1
)
# Define the test case
test_case_maths = LLMTestCase(
    input="what is 80 in words? using only 1 word.",
    actual_output="eighty",
    expected_output="eighty",
)
# Run the evaluation with the correctness metric
evaluate(test_cases=[test_case_maths], metrics=[correctness_metric])
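For quick debugging, a metric can also be scored on its own. The sketch below uses the metric's measure() method; after the call, its score and reason attributes hold the judge's verdict:
# Optional: score a single test case directly and inspect the judge's reasoning
correctness_metric.measure(test_case_maths)
print(correctness_metric.score)   # numeric score from the judge LLM
print(correctness_metric.reason)  # the judge's explanation for the score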
(IV) Execute JudgeLLM.py
deepeval test run JudgeLLM.py
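deepeval test run is built on pytest, so the idiomatic pattern is to wrap test cases in pytest-style functions with assert_test. The sketch below shows this; the file name test_judge.py and the trimmed-down metric definition are illustrative assumptions. Since JudgeLLM.py calls evaluate() directly, that script can also simply be run with python JudgeLLM.py.
# test_judge.py -- hypothetical pytest-style wrapper, run with: deepeval test run test_judge.py
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Same Ollama judge model as in JudgeLLM.py
model = OllamaModel(model="llama3.2:1b", base_url="http://localhost:11434")

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model=model,
)

def test_correctness():
    test_case = LLMTestCase(
        input="what is 80 in words? using only 1 word.",
        actual_output="eighty",
        expected_output="eighty",
    )
    # assert_test fails the test if the metric score falls below its threshold
    assert_test(test_case, [correctness_metric])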