AgentJudge¶
A specialized agent for evaluating and judging outputs from other agents or systems. It acts as a quality-control mechanism, providing objective assessments and feedback.
Based on the research paper: "Agent-as-a-Judge: Evaluate Agents with Agents" - arXiv:2410.10934
Overview¶
The AgentJudge is designed to evaluate and critique outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports both single-shot evaluations and iterative refinement through multiple evaluation loops with context building.
Key capabilities:
Capability | Description |
---|---|
Quality Assessment | Evaluates correctness, clarity, and completeness of agent outputs |
Structured Feedback | Provides detailed critiques with strengths, weaknesses, and suggestions |
Multimodal Support | Can evaluate text outputs alongside images |
Context Building | Maintains evaluation context across multiple iterations |
Custom Evaluation Criteria | Supports weighted evaluation criteria for domain-specific assessments |
Batch Processing | Efficiently processes multiple evaluations |
Architecture¶
graph TD
A[Input Task] --> B[AgentJudge]
B --> C{Evaluation Mode}
C -->|step()| D[Single Eval]
C -->|run()| E[Iterative Eval]
C -->|run_batched()| F[Batch Eval]
D --> G[Agent Core]
E --> G
F --> G
G --> H[LLM Model]
H --> I[Quality Analysis]
I --> J[Feedback & Output]
subgraph "Feedback Details"
N[Strengths]
O[Weaknesses]
P[Improvements]
Q[Accuracy Check]
end
J --> N
J --> O
J --> P
J --> Q
Class Reference¶
Constructor¶
AgentJudge(
id: str = str(uuid.uuid4()),
agent_name: str = "Agent Judge",
description: str = "You're an expert AI agent judge...",
system_prompt: str = None,
model_name: str = "openai/o1",
max_loops: int = 1,
verbose: bool = False,
evaluation_criteria: Optional[Dict[str, float]] = None,
return_score: bool = False,
*args,
**kwargs
)
Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
id | str | str(uuid.uuid4()) | Unique identifier for the judge instance |
agent_name | str | "Agent Judge" | Name of the agent judge |
description | str | "You're an expert AI agent judge..." | Description of the agent's role |
system_prompt | str | None | Custom system instructions (uses default if None) |
model_name | str | "openai/o1" | LLM model for evaluation |
max_loops | int | 1 | Maximum evaluation iterations |
verbose | bool | False | Enable verbose logging |
evaluation_criteria | Optional[Dict[str, float]] | None | Dictionary of evaluation criteria and weights |
return_score | bool | False | Whether to return a numerical score instead of full conversation |
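For parameters that the examples below do not cover (notably system_prompt and verbose), here is a minimal sketch of a customized judge; the prompt text, agent name, and criteria weights are illustrative assumptions, not values required by the library.

from swarms.agents.agent_judge import AgentJudge
# Hypothetical domain-specific instructions; replace with your own
custom_prompt = (
    "You are a strict reviewer of customer-support replies. "
    "Judge each output for factual accuracy, tone, and policy compliance."
)
judge = AgentJudge(
    agent_name="support-judge",
    system_prompt=custom_prompt,  # overrides the default system instructions
    model_name="gpt-4",
    max_loops=1,
    verbose=True,  # enable verbose logging
    evaluation_criteria={"accuracy": 0.6, "tone": 0.4},
)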
Methods¶
step()¶
Processes a single task and returns the agent's evaluation.
Parameter | Type | Default | Description |
---|---|---|---|
task | str | None | Single task/output to evaluate |
img | Optional[str] | None | Path to image for multimodal evaluation |

Returns: str - Detailed evaluation response

Raises: ValueError - If no task is provided
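As a quick sketch of the behavior documented above (the task text is arbitrary), step() returns the evaluation as a string and raises ValueError when no task is supplied:

from swarms.agents.agent_judge import AgentJudge
judge = AgentJudge()
# Normal usage: returns a detailed evaluation string
feedback = judge.step(task="Water boils at 100 degrees Celsius at sea level.")
print(feedback)
# Calling step() without a task raises ValueError
try:
    judge.step()
except ValueError as error:
    print(f"Evaluation skipped: {error}")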
run()¶
Executes the evaluation over multiple iterations (up to max_loops), building context across iterations.
Parameter | Type | Default | Description |
---|---|---|---|
task | str | None | Single task/output to evaluate |
img | Optional[str] | None | Path to image for multimodal evaluation |

Returns:
- str - Full conversation context if return_score=False (default)
- int - Numerical reward score if return_score=True
run_batched()¶
Executes batch evaluation of multiple tasks.
Parameter | Type | Default | Description |
---|---|---|---|
tasks | Optional[List[str]] | None | List of tasks/outputs to evaluate |

Returns: List[Union[str, int]] - Evaluation responses for each task
Examples¶
Basic Evaluation¶
from swarms.agents.agent_judge import AgentJudge
# Initialize the agent judge
judge = AgentJudge(
agent_name="quality-judge",
model_name="gpt-4",
max_loops=2
)
# Example agent output to evaluate
agent_output = "The capital of France is Paris. The city is known for its famous Eiffel Tower and delicious croissants. The population is approximately 2.1 million people."
# Run evaluation with context building
evaluations = judge.run(task=agent_output)
Technical Evaluation with Custom Criteria¶
from swarms.agents.agent_judge import AgentJudge
# Initialize the agent judge with custom evaluation criteria
judge = AgentJudge(
agent_name="technical-judge",
model_name="gpt-4",
max_loops=1,
evaluation_criteria={
"accuracy": 0.4,
"completeness": 0.3,
"clarity": 0.2,
"logic": 0.1,
},
)
# Example technical agent output to evaluate
technical_output = "To solve the quadratic equation x² + 5x + 6 = 0, we can use the quadratic formula: x = (-b ± √(b² - 4ac)) / 2a. Here, a=1, b=5, c=6. Substituting: x = (-5 ± √(25 - 24)) / 2 = (-5 ± √1) / 2 = (-5 ± 1) / 2. So x = -2 or x = -3."
# Run evaluation with context building
evaluations = judge.run(task=technical_output)
Creative Content Evaluation¶
from swarms.agents.agent_judge import AgentJudge
# Initialize the agent judge for creative content evaluation
judge = AgentJudge(
agent_name="creative-judge",
model_name="gpt-4",
max_loops=2,
evaluation_criteria={
"creativity": 0.4,
"originality": 0.3,
"engagement": 0.2,
"coherence": 0.1,
},
)
# Example creative agent output to evaluate
creative_output = "The moon hung like a silver coin in the velvet sky, casting shadows that danced with the wind. Ancient trees whispered secrets to the stars, while time itself seemed to pause in reverence of this magical moment. The world held its breath, waiting for the next chapter of the eternal story."
# Run evaluation with context building
evaluations = judge.run(task=creative_output)
Single Task Evaluation¶
from swarms.agents.agent_judge import AgentJudge
# Initialize with default settings
judge = AgentJudge()
# Single task evaluation
result = judge.step(task="The answer is 42.")
Multimodal Evaluation¶
from swarms.agents.agent_judge import AgentJudge
judge = AgentJudge()
# Evaluate with image
evaluation = judge.step(
task="Describe what you see in this image",
img="path/to/image.jpg"
)
Batch Processing¶
from swarms.agents.agent_judge import AgentJudge
judge = AgentJudge()
# Batch evaluation
tasks = [
"The capital of France is Paris.",
"2 + 2 = 4",
"The Earth is flat."
]
# Each task evaluated independently
evaluations = judge.run_batched(tasks=tasks)
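Since run_batched() returns one response per task, the two lists can be zipped together for reporting; a small follow-up sketch:

# Pair each task with its evaluation
for task, evaluation in zip(tasks, evaluations):
    print(f"Task: {task}")
    print(f"Evaluation: {evaluation}")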
Scoring Mode¶
from swarms.agents.agent_judge import AgentJudge
# Initialize with scoring enabled
judge = AgentJudge(
agent_name="scoring-judge",
model_name="gpt-4",
max_loops=2,
return_score=True
)
# Get numerical score instead of full conversation
score = judge.run(task="This is a correct and well-explained answer.")
# Returns a numerical score: 1 if the evaluation contains positive keywords, otherwise 0
Reference¶
@misc{zhuge2024agentasajudgeevaluateagentsagents,
title={Agent-as-a-Judge: Evaluate Agents with Agents},
author={Mingchen Zhuge and Changsheng Zhao and Dylan Ashley and Wenyi Wang and Dmitrii Khizbullin and Yunyang Xiong and Zechun Liu and Ernie Chang and Raghuraman Krishnamoorthi and Yuandong Tian and Yangyang Shi and Vikas Chandra and Jürgen Schmidhuber},
year={2024},
eprint={2410.10934},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.10934}
}