Evaluations Overview
Evaluations are automated scripts that continuously score your agent’s interactions. They are the primary mechanism for detecting quality regressions, monitoring specific failure patterns, and providing quantitative feedback on every trace.
What Is an Evaluation
An evaluation is a JavaScript-like sandboxed script that receives a trace’s conversation and metadata, processes them (optionally using LLM calls), and returns a verdict. Each evaluation consists of:
- A name: A descriptive identifier (e.g., “Jailbreak Detection”, “Answer Completeness”)
- A description: A longer explanation of what the evaluation checks for
- A script: The logic that analyzes a trace and produces a verdict
- A trigger configuration: Which traces the evaluation should run against, how often, and at what sample rate
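The four parts above can be pictured as a plain object. This is an illustrative sketch only; the field names and trigger shape here are assumptions, not Latitude’s actual API.

```javascript
// Sketch of an evaluation's four parts as a plain object.
// Field names (scope, sampling, etc.) are illustrative assumptions.
const evaluation = {
  name: "Answer Completeness",
  description:
    "Checks that the agent's final answer addresses every part of the user's question.",
  // The script analyzes a trace and produces a verdict (see Evaluation Scripts).
  script: "if (trace.conversation.length === 0) return Failed(0, 'Empty trace');",
  trigger: {
    scope: "all-traces", // which traces the evaluation runs against
    sampling: 0.25,      // run on 25% of matching traces
  },
};
```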
How Evaluations Work
- A trace completes in your project (after a debounce window with no new spans)
- Latitude checks the trace against each active evaluation’s trigger configuration
- For each matching evaluation, the script runs with the trace’s data as input
- The script returns a verdict using the Passed() or Failed() helpers
- Latitude creates a score from the result, attached to the trace
- If the score fails, it feeds into issue discovery
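Steps 2–3 above amount to a trigger check against each active evaluation. A minimal sketch of that check, assuming a hypothetical shouldRun function and an "all-traces" scope (neither is Latitude’s actual API), might look like:

```javascript
// Sketch of matching a completed trace against an evaluation's trigger
// configuration. The function name, scope value, and fields are assumptions.
function shouldRun(evaluation, trace, random = Math.random) {
  if (evaluation.sampling <= 0) return false; // paused evaluations never run
  if (evaluation.scope === "all-traces") {
    return random() < evaluation.sampling;    // apply the sample rate
  }
  return false; // narrower scopes omitted from this sketch
}

shouldRun({ sampling: 0, scope: "all-traces" }, {}); // false: paused
```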
Evaluation Scripts
Evaluation scripts run inside a host-controlled sandbox with access to:
- Passed(score?, feedback): Return a passing verdict. Feedback is always required. Score defaults to 1 if omitted.
- Failed(score?, feedback): Return a failing verdict. Feedback is always required. Score defaults to 0 if omitted.
- llm(prompt, options?): Make an LLM call through Latitude’s managed infrastructure. Accepts a string prompt and optional configuration (temperature, maxTokens, schema).
- parse(value, schema): Validate an unknown value against a Zod schema.
- zod: The Zod schema library for structured validation.
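A minimal script using the Passed and Failed helpers might look like the sketch below. Inside the sandbox these helpers are provided for you; here they are stubbed so the sketch runs standalone, and the trace shape (a conversation array of role/content messages) is an assumption.

```javascript
// Stubs standing in for the sandbox-provided helpers; the real sandbox
// supplies Passed and Failed, so a real script would not define these.
const Passed = (score, feedback) => ({ passed: true, score: score ?? 1, feedback });
const Failed = (score, feedback) => ({ passed: false, score: score ?? 0, feedback });

// Checks that a trace ends with a non-empty assistant reply.
function evaluate(trace) {
  const lastMessage = trace.conversation.at(-1);
  if (!lastMessage || lastMessage.role !== "assistant") {
    return Failed(0, "Trace ended without an assistant reply");
  }
  if (lastMessage.content.trim().length === 0) {
    return Failed(0.2, "Assistant reply was empty");
  }
  return Passed(1, "Assistant produced a non-empty final reply");
}
```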
Creating Evaluations
From Issues
The most common path. When Latitude discovers an issue from failing scores, you can click “Generate Evaluation” on the issue detail page. Latitude uses the issue’s description and example failures to generate a monitoring script automatically, through an optimization pipeline that maximizes alignment with human judgment.
User-Authored
You can write evaluation scripts directly. This is useful for domain-specific checks that aren’t covered by issue-generated evaluations. The exact user-authored evaluation editor UX is still under development.
Evaluation Lifecycle
Evaluations have a clear lifecycle:
- Active: Running on matching traces in real time. An active evaluation has sampling > 0.
- Paused: Temporarily disabled by setting sampling to 0. Configuration is preserved; resume by setting sampling back to a positive value.
- Archived: Read-only. Archived evaluations are visible in the UI but never trigger. When an issue is manually ignored, its linked evaluations are archived immediately.
- Deleted: Soft-deleted from the management UI but still represented in historical analytics.
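The lifecycle rules above can be summarized as a small precedence check. The status names match the document, but this function is an illustrative sketch, not Latitude’s API.

```javascript
// Derives an evaluation's lifecycle state. Deletion and archiving take
// precedence; otherwise the sampling rate distinguishes active from paused.
function status(evaluation) {
  if (evaluation.deleted) return "deleted";
  if (evaluation.archived) return "archived"; // e.g. its linked issue was ignored
  return evaluation.sampling > 0 ? "active" : "paused";
}

// Pausing preserves configuration: set sampling to 0, then back to resume.
```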