Evaluations Overview

Evaluations are automated scripts that continuously score your agent’s interactions. They are the primary mechanism for detecting quality regressions, monitoring specific failure patterns, and providing quantitative feedback on every trace.

What Is an Evaluation

An evaluation is a JavaScript-like sandboxed script that receives a trace’s conversation and metadata, processes them (optionally using LLM calls), and returns a verdict. Each evaluation consists of:

A name: A descriptive identifier (e.g., “Jailbreak Detection”, “Answer Completeness”)
A description: A longer explanation of what the evaluation checks for
A script: The logic that analyzes a trace and produces a verdict
A trigger configuration: Which traces the evaluation should run against, how often, and at what sample rate

How Evaluations Work

A trace completes in your project (after a debounce window with no new spans)
Latitude checks the trace against each active evaluation’s trigger configuration
For each matching evaluation, the script runs with the trace’s data as input
The script returns a verdict using Passed() or Failed() helpers
Latitude creates a score from the result, attached to the trace
If the score fails, it feeds into issue discovery

Evaluation Scripts

Evaluation scripts run inside a host-controlled sandbox with access to:

Passed(score?, feedback): Return a passing verdict. Feedback is always required. Score defaults to 1 if omitted.
Failed(score?, feedback): Return a failing verdict. Feedback is always required. Score defaults to 0 if omitted.
llm(prompt, options?): Make an LLM call through Latitude’s managed infrastructure. Accepts a string prompt and optional configuration (temperature, maxTokens, schema).
parse(value, schema): Validate an unknown value against a Zod schema.
zod: The Zod schema library for structured validation.

A simple evaluation script might look like:

const lastMessage = messages[messages.length - 1]
if (lastMessage.role === 'assistant') {
  const length = lastMessage.content.length
  if (length <= 500) {
    return Passed(`Response is concise at ${length} characters`)
  }
  return Failed(`Response is ${length} characters. Consider being more concise`)
}

More sophisticated evaluations use LLM calls to assess quality:

const result = await llm(
  `You are a quality evaluator. Analyze this conversation and determine if the assistant stayed on topic.\n\nConversation:\n${JSON.stringify(messages)}`,
  {
    schema: z.object({
      on_topic: z.boolean(),
      feedback: z.string()
    })
  }
)

const parsed = await parse(result, z.object({
  on_topic: z.boolean(),
  feedback: z.string()
}))

if (parsed.valid && result.on_topic) {
  return Passed(result.feedback)
}
return Failed(result.feedback)

The script never talks directly to the outside world. All external capabilities are exposed as host-controlled functions. The same scripts run both in backend monitoring and in simulations.

Creating Evaluations

From Issues

The most common path. When Latitude discovers an issue from failing scores, you can click “Generate Evaluation” on the issue detail page. Latitude uses the issue’s description and example failures to generate a monitoring script automatically through an optimization pipeline that maximizes alignment with human judgment.

User-Authored

You can write evaluation scripts directly. This is useful for domain-specific checks that aren’t covered by issue-generated evaluations. The exact user-authored evaluation editor UX is still under development.

Evaluation Lifecycle

Evaluations have a clear lifecycle:

Active: Running on matching traces in real time. An active evaluation has sampling > 0.
Paused: Temporarily disabled by setting sampling to 0. Configuration is preserved; resume by setting sampling back to a positive value.
Archived: Read-only. Archived evaluations are visible in the UI but never trigger. When an issue is manually ignored, its linked evaluations are archived immediately.
Deleted: Soft-deleted from the management UI but still represented in historical analytics.

Next Steps

Triggers: Configure which traces an evaluation monitors
Alignment: Measure how well evaluations agree with human judgment
Issues: How evaluation failures become trackable issues

Getting Started

Telemetry

Observability

Evaluations

Annotations

Scores

Issues

Simulations

Evaluations Overview

Evaluations Overview

What Is an Evaluation

How Evaluations Work

Evaluation Scripts

Creating Evaluations

From Issues

User-Authored

Evaluation Lifecycle

Next Steps

Getting Started

Telemetry

Observability

Evaluations

Annotations

Scores

Issues

Simulations

​Evaluations Overview

​What Is an Evaluation

​How Evaluations Work

​Evaluation Scripts

​Creating Evaluations

​From Issues

​User-Authored

​Evaluation Lifecycle

​Next Steps

Evaluations Overview

What Is an Evaluation

How Evaluations Work

Evaluation Scripts

Creating Evaluations

From Issues

User-Authored

Evaluation Lifecycle

Next Steps