AgentJudge

A specialized agent for evaluating and judging outputs from other agents or systems, acting as a quality-control mechanism that provides objective assessments and structured feedback.

Based on the research paper: "Agent-as-a-Judge: Evaluate Agents with Agents" - arXiv:2410.10934

Overview

The AgentJudge is designed to evaluate and critique outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports both single-shot evaluations and iterative refinement through multiple evaluation loops with context building.

Key capabilities:

| Capability | Description |
|------------|-------------|
| Quality Assessment | Evaluates correctness, clarity, and completeness of agent outputs |
| Structured Feedback | Provides detailed critiques with strengths, weaknesses, and suggestions |
| Multimodal Support | Can evaluate text outputs alongside images |
| Context Building | Maintains evaluation context across multiple iterations |
| Custom Evaluation Criteria | Supports weighted evaluation criteria for domain-specific assessments |
| Batch Processing | Efficiently processes multiple evaluations |
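
The three entry points map directly onto the evaluation modes shown in the diagram below. A minimal sketch (the import path matches the examples later on this page; the model name and task string are illustrative):

from swarms.agents.agent_judge import AgentJudge

# Hypothetical output from another agent that we want judged
output_to_check = "Water boils at 100°C at sea level."

judge = AgentJudge(model_name="gpt-4")

single = judge.step(task=output_to_check)           # one-shot evaluation
iterative = judge.run(task=output_to_check)         # up to max_loops iterations with context
batch = judge.run_batched(tasks=[output_to_check])  # list in, list out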

Architecture

graph TD
    A[Input Task] --> B[AgentJudge]
    B --> C{Evaluation Mode}

    C -->|step()| D[Single Eval]
    C -->|run()| E[Iterative Eval]
    C -->|run_batched()| F[Batch Eval]

    D --> G[Agent Core]
    E --> G
    F --> G

    G --> H[LLM Model]
    H --> I[Quality Analysis]
    I --> J[Feedback & Output]

    subgraph "Feedback Details"
        N[Strengths]
        O[Weaknesses]
        P[Improvements]
        Q[Accuracy Check]
    end

    J --> N
    J --> O
    J --> P
    J --> Q

Class Reference

Constructor

AgentJudge(
    id: str = str(uuid.uuid4()),
    agent_name: str = "Agent Judge",
    description: str = "You're an expert AI agent judge...",
    system_prompt: str = None,
    model_name: str = "openai/o1",
    max_loops: int = 1,
    verbose: bool = False,
    evaluation_criteria: Optional[Dict[str, float]] = None,
    return_score: bool = False,
    *args,
    **kwargs
)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| id | str | str(uuid.uuid4()) | Unique identifier for the judge instance |
| agent_name | str | "Agent Judge" | Name of the agent judge |
| description | str | "You're an expert AI agent judge..." | Description of the agent's role |
| system_prompt | str | None | Custom system instructions (uses the default judge prompt if None) |
| model_name | str | "openai/o1" | LLM model used for evaluation |
| max_loops | int | 1 | Maximum number of evaluation iterations |
| verbose | bool | False | Enable verbose logging |
| evaluation_criteria | Optional[Dict[str, float]] | None | Mapping of evaluation criteria to their weights |
| return_score | bool | False | Return a numerical score instead of the full conversation |
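
A custom rubric can be supplied via system_prompt, which replaces the default judge instructions. A minimal sketch (the prompt text and agent name are illustrative):

from swarms.agents.agent_judge import AgentJudge

# Hypothetical domain-specific instructions for the judge
rubric = (
    "You are a medical-content reviewer. Judge factual accuracy against "
    "established clinical guidelines and flag any unsupported claims."
)

judge = AgentJudge(
    agent_name="medical-judge",
    system_prompt=rubric,
    model_name="gpt-4",
    verbose=True,  # enable detailed logging of each evaluation step
)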

Methods

step()

step(
    task: str = None,
    img: Optional[str] = None
) -> str

Processes a single task and returns the agent's evaluation.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| task | str | None | Single task/output to evaluate |
| img | Optional[str] | None | Path to an image for multimodal evaluation |

Returns: str - Detailed evaluation response

Raises: ValueError - If no task is provided
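
Since step() raises ValueError when no task is supplied, callers may want to guard against empty input. A small sketch:

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge()

try:
    judge.step()  # no task provided, raises ValueError
except ValueError as err:
    print(f"Evaluation skipped: {err}")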

run()

run(
    task: str = None,
    img: Optional[str] = None
) -> Union[str, int]

Runs the evaluation for up to max_loops iterations, carrying the evaluation context forward between iterations.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| task | str | None | Single task/output to evaluate |
| img | Optional[str] | None | Path to an image for multimodal evaluation |

Returns:

- str - Full conversation context if return_score=False (default)
- int - Numerical reward score if return_score=True (see the Scoring Mode example below)

run_batched()

run_batched(
    tasks: Optional[List[str]] = None
) -> List[Union[str, int]]

Executes batch evaluation of multiple tasks.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| tasks | Optional[List[str]] | None | List of tasks/outputs to evaluate |

Returns: List[Union[str, int]] - Evaluation responses for each task

Examples

Basic Evaluation

from swarms.agents.agent_judge import AgentJudge

# Initialize the agent judge
judge = AgentJudge(
    agent_name="quality-judge",
    model_name="gpt-4",
    max_loops=2
)

# Example agent output to evaluate
agent_output = "The capital of France is Paris. The city is known for its famous Eiffel Tower and delicious croissants. The population is approximately 2.1 million people."

# Run evaluation with context building
evaluations = judge.run(task=agent_output)

Technical Evaluation with Custom Criteria

from swarms.agents.agent_judge import AgentJudge

# Initialize the agent judge with custom evaluation criteria
judge = AgentJudge(
    agent_name="technical-judge",
    model_name="gpt-4",
    max_loops=1,
    evaluation_criteria={
        "accuracy": 0.4,
        "completeness": 0.3,
        "clarity": 0.2,
        "logic": 0.1,
    },
)

# Example technical agent output to evaluate
technical_output = "To solve the quadratic equation x² + 5x + 6 = 0, we can use the quadratic formula: x = (-b ± √(b² - 4ac)) / 2a. Here, a=1, b=5, c=6. Substituting: x = (-5 ± √(25 - 24)) / 2 = (-5 ± √1) / 2 = (-5 ± 1) / 2. So x = -2 or x = -3."

# Run evaluation with context building
evaluations = judge.run(task=technical_output)

Creative Content Evaluation

from swarms.agents.agent_judge import AgentJudge

# Initialize the agent judge for creative content evaluation
judge = AgentJudge(
    agent_name="creative-judge",
    model_name="gpt-4",
    max_loops=2,
    evaluation_criteria={
        "creativity": 0.4,
        "originality": 0.3,
        "engagement": 0.2,
        "coherence": 0.1,
    },
)

# Example creative agent output to evaluate
creative_output = "The moon hung like a silver coin in the velvet sky, casting shadows that danced with the wind. Ancient trees whispered secrets to the stars, while time itself seemed to pause in reverence of this magical moment. The world held its breath, waiting for the next chapter of the eternal story."

# Run evaluation with context building
evaluations = judge.run(task=creative_output)

Single Task Evaluation

from swarms.agents.agent_judge import AgentJudge

# Initialize with default settings
judge = AgentJudge()

# Single task evaluation
result = judge.step(task="The answer is 42.")

Multimodal Evaluation

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge()

# Evaluate with image
evaluation = judge.step(
    task="Describe what you see in this image",
    img="path/to/image.jpg"
)

Batch Processing

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge()

# Batch evaluation
tasks = [
    "The capital of France is Paris.",
    "2 + 2 = 4",
    "The Earth is flat."
]

# Each task evaluated independently
evaluations = judge.run_batched(tasks=tasks)
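
Assuming results come back in the same order as the input tasks (which the sequential batch API implies), they can be paired back up directly:

for task, evaluation in zip(tasks, evaluations):
    print(f"Task: {task}")
    print(f"Evaluation: {evaluation}\n")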

Scoring Mode

from swarms.agents.agent_judge import AgentJudge

# Initialize with scoring enabled
judge = AgentJudge(
    agent_name="scoring-judge",
    model_name="gpt-4",
    max_loops=2,
    return_score=True
)

# Get numerical score instead of full conversation
score = judge.run(task="This is a correct and well-explained answer.")
# Returns: 1 (if positive keywords found) or 0
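
Since run_batched() returns List[Union[str, int]], combining it with return_score=True should yield one 0/1 score per task, which makes simple aggregate metrics straightforward. A sketch under that assumption (the answer strings are illustrative):

answers = [
    "Paris is the capital of France.",
    "The Earth is flat.",
]

scores = judge.run_batched(tasks=answers)  # e.g. [1, 0]
pass_rate = sum(scores) / len(scores)
print(f"Pass rate: {pass_rate:.0%}")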

Reference

@misc{zhuge2024agentasajudgeevaluateagentsagents,
    title={Agent-as-a-Judge: Evaluate Agents with Agents}, 
    author={Mingchen Zhuge and Changsheng Zhao and Dylan Ashley and Wenyi Wang and Dmitrii Khizbullin and Yunyang Xiong and Zechun Liu and Ernie Chang and Raghuraman Krishnamoorthi and Yuandong Tian and Yangyang Shi and Vikas Chandra and Jürgen Schmidhuber},
    year={2024},
    eprint={2410.10934},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2410.10934}
}