DeepEval helps to test and verify the correctness of LLM outputs. It is an evaluation framework that provides a suite of metrics and synthetic data generation, with integrations across the leading AI/ML libraries.
DeepEval can be used to set up one LLM to judge the output of another LLM. This JudgeLLM setup can be used both at training time and at live inference time in MLOps scenarios.
Getting started with DeepEval is simple with Ollama.
(I) Installation
pip install deepeval
Ollama installation was covered previously with a llama3.2 base model.
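If the llama3.2 model has not been pulled yet, it can be fetched with the Ollama CLI (a quick sketch, assuming Ollama is already installed and serving on its default port, http://localhost:11434):
# Pull the 1B-parameter llama3.2 model used in this walkthrough
ollama pull llama3.2:1b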
(II) Set the Ollama model in DeepEval
# Unset the openai model - default for DeepEval
deepeval unset-openai
# Set ollama model for DeepEval
deepeval set-ollama "llama3.2:1b" --base-url="http://localhost:11434"
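To switch back to DeepEval's default OpenAI judge later, the Ollama setting can be reverted:
# Revert DeepEval to its default model provider
deepeval unset-ollama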
(III) Create the JudgeLLM.py script
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Set up the Ollama model that will act as the judge
model = OllamaModel(
    model="llama3.2:1b",
    base_url="http://localhost:11434",
    temperature=0.0,  # example: setting a custom temperature for deterministic judging
)
# Set up the evaluation metric (GEval, judged by the Ollama model)
correctness_metric = GEval(
    name="Correctness",
    # NOTE: you can only provide either criteria or evaluation_steps, not both
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # evaluation_steps=["Check whether the facts are true"],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model=model,  # the Ollama judge model
    rubric=[
        Rubric(score_range=(0, 2), expected_outcome="Factually incorrect."),
        Rubric(score_range=(3, 6), expected_outcome="Mostly correct."),
        Rubric(score_range=(7, 9), expected_outcome="Correct but missing minor details."),
        Rubric(score_range=(10, 10), expected_outcome="100% correct."),
    ],
    # threshold=0.1
)
# Define the test case
test_case_maths = LLMTestCase(
    input="what is 80 in words? using only 1 word.",
    actual_output="eighty",
    expected_output="eighty",
)
# Run the evaluation with the correctness metric
evaluate(test_cases=[test_case_maths], metrics=[correctness_metric])
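For quick debugging, a metric can also be scored on its own. The sketch below uses the metric's measure() method; after the call, its score and reason attributes hold the judge's verdict:
# Optional: score a single test case directly and inspect the judge's reasoning
correctness_metric.measure(test_case_maths)
print(correctness_metric.score)   # numeric score from the judge LLM
print(correctness_metric.reason)  # the judge's explanation for the score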
(IV) Execute JudgeLLM.py
deepeval test run JudgeLLM.py
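deepeval test run is built on pytest, so the idiomatic pattern is to wrap test cases in pytest-style functions with assert_test. The sketch below shows this; the file name test_judge.py and the trimmed-down metric definition are illustrative assumptions. Since JudgeLLM.py calls evaluate() directly, that script can also simply be run with python JudgeLLM.py.
# test_judge.py -- hypothetical pytest-style wrapper, run with: deepeval test run test_judge.py
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Same Ollama judge model as in JudgeLLM.py
model = OllamaModel(model="llama3.2:1b", base_url="http://localhost:11434")

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model=model,
)

def test_correctness():
    test_case = LLMTestCase(
        input="what is 80 in words? using only 1 word.",
        actual_output="eighty",
        expected_output="eighty",
    )
    # assert_test fails the test if the metric score falls below its threshold
    assert_test(test_case, [correctness_metric])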