Evaluation Alignment
Alignment measures how closely your automated evaluations match human judgment. It answers a critical question: can you trust your evaluations?

Why Alignment Matters
Automated evaluations are only useful if they agree with what a human reviewer would say. Without alignment tracking:
- You don't know whether an evaluation is too strict or too lenient
- You can’t detect when an evaluation starts disagreeing with reality
- You have no signal for when to recalibrate your evaluation scripts
How Alignment Works
Alignment is computed when both an evaluation and a human annotation have scored the same trace. Latitude compares their pass/fail verdicts and computes metrics from a stored confusion matrix. The confusion matrix is the only persisted alignment primitive; all derived metrics (MCC, accuracy, F1) are computed from those stored counts on read.

Matthews Correlation Coefficient (MCC)
MCC is the primary alignment metric. It's a balanced measure that works well even when pass/fail rates are imbalanced. MCC ranges from -1 to +1:

| MCC Range | Interpretation |
|---|---|
| 0.7 to 1.0 | Strong alignment: evaluation reliably matches human judgment |
| 0.4 to 0.7 | Moderate alignment: evaluation is useful but has blind spots |
| 0.0 to 0.4 | Weak alignment: evaluation needs recalibration |
| Below 0.0 | Negative correlation: evaluation is systematically wrong |
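Because only the confusion matrix is persisted, each derived metric is a small pure function of the four counts. A minimal sketch (function names are illustrative, not Latitude's API):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined when a row or column sums to zero; 0 by convention
    return (tp * tn - fp * fn) / denom

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of traces where evaluation and human agree."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp: int, tn: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall on the 'pass' class."""
    return 2 * tp / (2 * tp + fp + fn)
```

For example, `mcc(45, 40, 5, 10)` is roughly 0.70, which lands at the edge of the "strong alignment" band in the table above.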
Confusion Matrix
The confusion matrix breaks down agreement into four categories:
- True Positive: Both evaluation and human say "pass"
- True Negative: Both say “fail”
- False Positive: Evaluation says “pass” but human says “fail” (evaluation is too lenient)
- False Negative: Evaluation says “fail” but human says “pass” (evaluation is too strict)
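Tallying paired verdicts into these four buckets is straightforward. A sketch, assuming each pair is `(evaluation_passed, human_passed)` booleans (the field names are illustrative):

```python
from collections import Counter

def confusion(pairs):
    """Tally (evaluation_passed, human_passed) pairs into TP/TN/FP/FN."""
    counts = Counter()
    for eval_pass, human_pass in pairs:
        if eval_pass and human_pass:
            counts["tp"] += 1
        elif not eval_pass and not human_pass:
            counts["tn"] += 1
        elif eval_pass and not human_pass:
            counts["fp"] += 1  # evaluation too lenient
        else:
            counts["fn"] += 1  # evaluation too strict
    return counts
```

Storing only these counts is what lets new annotated traces be folded in incrementally, since counters add up across batches.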
Viewing Alignment
Each evaluation's detail page shows alignment metrics when annotation data exists. You'll see:
- Current MCC and trend over time
- Confusion matrix for the selected time period
- Last aligned timestamp
- A manual realignment button
Alignment and Evaluation Generation
When you generate an evaluation from an issue, alignment is core to the generation process:
- Latitude collects annotation-derived ground truth: at least one finalized, failed annotation linked to the issue (positive examples), plus available negative examples
- The optimizer generates candidate scripts and evaluates them against this ground truth
- The best script is selected based on ordered objectives: maximize MCC, then minimize cost, then minimize duration
- The confusion matrix is stored on the evaluation
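The ordered objectives above amount to lexicographic selection: compare candidates by MCC first, and fall back to cost and duration only to break ties. A sketch with illustrative candidate fields:

```python
def best_candidate(candidates):
    """Pick the candidate script with the highest MCC, breaking ties by
    lower cost, then lower duration (lexicographic ordering)."""
    return min(candidates, key=lambda c: (-c["mcc"], c["cost"], c["duration"]))
```

A tuple key keeps the ordering explicit: negating MCC turns "maximize" into "minimize", so one `min` call applies all three objectives in order.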
Automatic Realignment
Once an evaluation exists, Latitude keeps it calibrated:
- Incremental refresh: When the script hash hasn't changed, new examples are evaluated and added to the existing confusion matrix
- Full re-optimization: When alignment (MCC) degrades beyond a tolerance threshold, the optimizer runs a full pass
- Debounced scheduling: Metric recomputation at most once per hour; full re-optimization at most once every eight hours
- Manual realignment: Available from the evaluation dashboard, rate-limited
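The scheduling rules above can be sketched as a single decision function. The debounce windows come from the text; the tolerance value and all names are illustrative assumptions, not documented values:

```python
METRIC_REFRESH_S = 3600      # metric recomputation: at most once per hour
FULL_REOPT_S = 8 * 3600      # full re-optimization: at most once every 8 hours
MCC_TOLERANCE = 0.1          # assumed degradation threshold (illustrative)

def decide_action(now, last_refresh, last_reopt, baseline_mcc, current_mcc):
    """Choose the realignment step to run, honoring the debounce windows."""
    degraded = (baseline_mcc - current_mcc) > MCC_TOLERANCE
    if degraded and now - last_reopt >= FULL_REOPT_S:
        return "full_reoptimization"
    if now - last_refresh >= METRIC_REFRESH_S:
        return "incremental_refresh"
    return "skip"
```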
Improving Alignment
When alignment is low:
- Review the confusion matrix: Is the evaluation too strict or too lenient?
- Examine false positives and false negatives: Look at specific traces where the evaluation and human disagree. What did the evaluation miss?
- Add more annotations: More human-reviewed traces give the optimizer better signal for realignment
- Trigger manual realignment: After adding annotations, use the realignment button to refresh
Next Steps
- Annotations: How human review produces the ground truth for alignment
- Annotation Queues: Building focused review backlogs
- Issues: How failed evaluations become trackable issues