41 changes: 31 additions & 10 deletions docs/docs/learn/evaluation/metrics.md
@@ -6,17 +6,15 @@ sidebar_position: 5

DSPy is a machine learning framework, so you must think about your **automatic metrics** for evaluation (to track your progress) and optimization (so DSPy can make your programs more effective).

## What is a metric and how do I define a metric for my task?

A metric is just a function that will take examples from your data and the output of your system and return a score that quantifies how good the output is. What makes outputs from your system good or bad?

For simple tasks, this could be just "accuracy", "exact match", or "F1 score", as in simple classification or short-form QA tasks.

However, for most applications, your system will produce long-form outputs. There, your metric should probably be a smaller DSPy program that checks multiple properties of the output (quite possibly using AI feedback from LMs).

Getting this right on the first try is unlikely, but you should start with something simple and iterate.

## Simple metrics

@@ -54,6 +52,34 @@ def validate_context_and_answer(example, pred, trace=None):

Defining a good metric is an iterative process, so doing some initial evaluations and looking at your data and outputs is key.

## Multi-objective metrics with subscores

Many real systems must balance more than one objective: quality vs. leakage, answer accuracy vs. latency, etc. DSPy metrics now expose a simple helper called [`dspy.metrics.subscore`](../../api/index.md) that lets you declare named subscores inside an ordinary Python metric. Each `subscore` behaves like a float so you can keep writing intuitive math, while DSPy records the subscore values, metadata, and the expression you returned.

```python
from dspy.evaluate import answer_exact_match  # built-in exact-match metric
from dspy.metrics import subscore

# `bleu_like` below stands in for any text-similarity function you define.
def metric(example, pred, ctx=None):
acc = subscore("accuracy", answer_exact_match(example, pred), bounds=(0, 1))
bleu = subscore("bleu", bleu_like(example.answer, pred.answer), bounds=(0, 1))
latency = subscore(
"latency_s",
(ctx.latency_ms or 0) / 1000 if ctx else 0,
maximize=False,
units="s",
)
return acc**2 + 0.3 * bleu - 0.02 * latency
```

When this metric runs during evaluation or optimization, DSPy evaluates the returned expression to obtain the aggregate scalar (preserving backwards compatibility), but also keeps a `Score` object that exposes:

- `scalar`: the numeric value of the expression (`acc**2 + …`).
- `subscores`: the resolved subscores, e.g. `{"accuracy": 1.0, "bleu": 0.73, "latency_s": 0.42}`.
- `info`: metadata such as the canonical expression string and any per-subscore metadata you provided (bounds, maximize, units, cost, …).

Optimizers can use those subscores directly for Pareto frontiers or constrained search, and evaluation tables will include additional columns for each subscore.
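
For instance, you can read those subscores back after an evaluation run. The sketch below is illustrative only: it assumes the `metric` defined above plus a `program` and `devset` you already have, and it relies on per-example scores being surfaced as `Score` objects inside the `results` field.

```python
import dspy

evaluator = dspy.Evaluate(devset=devset, metric=metric, display_table=True)
result = evaluator(program)

print(result.score)  # the overall aggregate for the devset
for example, prediction, score in result.results:
    # each per-example `score` is a Score object rather than a bare float
    print(score.scalar, score.subscores)  # e.g. {"accuracy": 1.0, "bleu": 0.73, "latency_s": 0.42}
    print(score.info.get("expr"))         # the canonical expression string, if recorded
```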

Metrics that return subscores typically accept a third argument `ctx`, which contains runtime information (latency, token usage, optional seed). If you omit `subscore`, nothing changes—legacy metrics that return a plain float continue to work as before.
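
Concretely, both styles below are valid. The second is a minimal sketch that only relies on the context fields listed above; the exact-match logic and the latency weight are illustrative, not prescribed by DSPy.

```python
from dspy.metrics import subscore

# Legacy style: a plain-float metric with no `ctx` argument keeps working unchanged.
def exact_match(example, pred):
    return float(example.answer == pred.answer)

# ctx-aware style: the optional third argument carries runtime information.
def penalized_match(example, pred, ctx=None):
    acc = subscore("accuracy", float(example.answer == pred.answer), bounds=(0, 1))
    latency_s = subscore(
        "latency_s",
        (ctx.latency_ms or 0) / 1000 if ctx is not None else 0.0,
        maximize=False,
        units="s",
    )
    return acc - 0.01 * latency_s
```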

## Evaluation

@@ -79,7 +105,6 @@ evaluator = Evaluate(devset=YOUR_DEVSET, num_threads=1, display_progress=True, d
evaluator(YOUR_PROGRAM, metric=YOUR_METRIC)
```

## Intermediate: Using AI feedback for your metric

For most applications, your system will produce long-form outputs, so your metric should check multiple dimensions of the output using AI feedback from LMs.
@@ -104,7 +129,7 @@ def metric(gold, pred, trace=None):

engaging = "Does the assessed text make for a self-contained, engaging tweet?"
correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

@@ -117,20 +142,16 @@ def metric(gold, pred, trace=None):

When compiling, `trace is not None`, and we want to be strict about judging things, so we will only return `True` if `score >= 2`. Otherwise, we return a score out of 1.0 (i.e., `score / 2.0`).
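
The tail of that metric might look roughly like the sketch below (assuming the `Assess` signature exposes a boolean `assessment_answer` output field, as the calls above suggest):

```python
    # ...continuing inside `metric(gold, pred, trace=None)` after the two Assess calls:
    score = correct.assessment_answer + engaging.assessment_answer  # 0, 1, or 2

    if trace is not None:    # compiling/optimizing: judge strictly
        return score >= 2
    return score / 2.0       # evaluating: a score out of 1.0
```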

## Advanced: Using a DSPy program as your metric

If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) the metric itself. That's usually easy because the metric's output is a simple value (e.g., a score out of 5), so the metric's own metric is easy to define and optimize by collecting a few examples.
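
One way to set this up, sketched under assumed names (`labeled_judgments`, the judge signature, and the 0.5-point tolerance are illustrative, not part of DSPy), is:

```python
import dspy

# A hypothetical LM judge that rates a response on a 1-5 scale.
judge = dspy.ChainOfThought("question, response -> score: float")

# A few hand-labeled (question, response, gold score) triples.
judge_trainset = [
    dspy.Example(question=q, response=r, score=s).with_inputs("question", "response")
    for q, r, s in labeled_judgments  # assumed list of human ratings
]

# The metric's metric: does the judge land within half a point of the human score?
def judge_metric(example, pred, trace=None):
    return abs(float(pred.score) - float(example.score)) <= 0.5

optimizer = dspy.BootstrapFewShot(metric=judge_metric)
compiled_judge = optimizer.compile(judge, trainset=judge_trainset)
```

The compiled judge can then be dropped back into your main metric in place of the raw judge.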

### Advanced: Accessing the `trace`

When your metric is used during evaluation runs, DSPy will not try to track the steps of your program.

But during compiling (optimization), DSPy will trace your LM calls. The trace will contain inputs/outputs to each DSPy predictor and you can leverage that to validate intermediate steps for optimization.

```python
def validate_hops(example, pred, trace=None):
hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]
130 changes: 122 additions & 8 deletions dspy/evaluate/evaluate.py
@@ -1,7 +1,10 @@
import csv
import dataclasses
import importlib
import inspect
import json
import logging
import time
import types
from typing import TYPE_CHECKING, Any, Callable

@@ -11,6 +14,8 @@
import tqdm

import dspy
from dspy.metrics import Score
from dspy.metrics._subscores import _begin_collect, _end_collect, finalize_scores
from dspy.primitives.prediction import Prediction
from dspy.utils.callback import with_callbacks
from dspy.utils.parallelizer import ParallelExecutor
@@ -45,6 +50,23 @@ def HTML(x: str) -> str:  # noqa: N802
logger = logging.getLogger(__name__)


@dataclasses.dataclass
class EvaluationMetricContext:
usage: dict | None = None
latency_ms: float | None = None
seed: int | None = None

@property
def cache_key(self) -> tuple[Any, ...]:
usage_key = None
if self.usage is not None:
try:
usage_key = json.dumps(self.usage, sort_keys=True, default=repr)
except TypeError:
usage_key = repr(self.usage)
return (self.latency_ms, usage_key, self.seed)


class EvaluationResult(Prediction):
"""
A class that represents the result of an evaluation.
@@ -109,6 +131,7 @@ def __init__(
self.failure_score = failure_score
self.save_as_csv = save_as_csv
self.save_as_json = save_as_json
self._metric_accepts_ctx_cache: dict[int, bool] = {}

if "return_outputs" in kwargs:
raise ValueError("`return_outputs` is no longer supported. Results are always returned inside the `results` field of the `EvaluationResult` object.")
@@ -168,16 +191,31 @@ def __call__(
)

def process_item(example):
start_time = time.perf_counter()
prediction = program(**example.inputs())
latency_ms = (time.perf_counter() - start_time) * 1000.0

prediction_obj = _extract_prediction_object(prediction)
usage = prediction_obj.get_lm_usage() if isinstance(prediction_obj, Prediction) else None
ctx = EvaluationMetricContext(usage=usage, latency_ms=latency_ms)

if isinstance(prediction_obj, Prediction):
prediction_obj.bind_example(example)

scores = self._execute_metric(metric, example, prediction, ctx)

if isinstance(prediction_obj, Prediction) and isinstance(scores, Score):
prediction_obj._store_scores(scores, ctx.cache_key)

return prediction, scores

results = executor.execute(process_item, devset)
assert len(devset) == len(results)

results = [((dspy.Prediction(), Score(self.failure_score)) if r is None else r) for r in results]
results = [(example, prediction, score) for example, (prediction, score) in zip(devset, results, strict=False)]
aggregates = [score.scalar for *_, score in results]
ncorrect, ntotal = sum(aggregates), len(devset)

logger.info(f"Average Metric: {ncorrect} / {ntotal} ({round(100 * ncorrect / ntotal, 1)}%)")

@@ -227,19 +265,19 @@ def process_item(example):

@staticmethod
def _prepare_results_output(
results: list[tuple["dspy.Example", "dspy.Example", Score]], metric_name: str
):
return [
(
merge_dicts(example, prediction) | _scores_to_row(score, metric_name)
if prediction_is_dictlike(prediction)
else dict(example) | {"prediction": prediction} | _scores_to_row(score, metric_name)
)
for example, prediction, score in results
]

def _construct_result_table(
self, results: list[tuple["dspy.Example", "dspy.Example", Score]], metric_name: str
) -> "pd.DataFrame":
"""
Construct a pandas DataFrame from the specified result list.
@@ -262,6 +300,49 @@

return result_df.rename(columns={"correct": metric_name})

def _execute_metric(
self,
metric: Callable | None,
example: "dspy.Example",
prediction: Any,
ctx: EvaluationMetricContext,
) -> Score:
if metric is None:
if isinstance(prediction, Prediction):
scores = prediction.resolve_score(ctx)
if scores is None:
raise ValueError("Prediction does not provide a score and no metric was supplied.")
return scores
raise ValueError("No metric provided for evaluation.")

token = _begin_collect()
try:
if self._metric_accepts_context(metric):
result = metric(example, prediction, ctx)
else:
result = metric(example, prediction)
finally:
collector = _end_collect(token)

ctx_info: dict[str, Any] = {}
if ctx.usage is not None:
ctx_info["usage"] = ctx.usage
if ctx.latency_ms is not None:
ctx_info["latency_ms"] = ctx.latency_ms
if ctx.seed is not None:
ctx_info["seed"] = ctx.seed

return finalize_scores(result, collector, ctx_info=ctx_info)

def _metric_accepts_context(self, metric: Callable) -> bool:
cache_key = id(metric)
cached = self._metric_accepts_ctx_cache.get(cache_key)
if cached is not None:
return cached
accepts = _callable_accepts_context(metric)
self._metric_accepts_ctx_cache[cache_key] = accepts
return accepts

def _display_result_table(self, result_df: "pd.DataFrame", display_table: bool | int, metric_name: str):
"""
Display the specified result DataFrame in a table format.
Expand Down Expand Up @@ -321,6 +402,39 @@ def merge_dicts(d1, d2) -> dict:
return merged


def _scores_to_row(scores: Score, metric_name: str) -> dict[str, Any]:
row = {metric_name: scores.scalar}
for subscore_name, value in scores.subscores.items():
row[f"{metric_name}.{subscore_name}"] = value
expr = scores.info.get("expr") if isinstance(scores.info, dict) else None
if expr is not None:
row[f"{metric_name}.expr"] = expr
return row


def _extract_prediction_object(prediction: Any) -> Any:
if isinstance(prediction, Prediction):
return prediction
if isinstance(prediction, tuple) and prediction:
first = prediction[0]
if isinstance(first, Prediction):
return first
return prediction


def _callable_accepts_context(metric: Callable) -> bool:
try:
sig = inspect.signature(metric)
except (TypeError, ValueError):
return True

params = list(sig.parameters.values())
for param in params:
if param.kind in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD):
return True
return len(params) >= 3


def truncate_cell(content) -> str:
"""Truncate content of a cell to 25 words."""
words = str(content).split()
23 changes: 23 additions & 0 deletions dspy/metrics/__init__.py
@@ -0,0 +1,23 @@
"""Public helpers for metric subscores and score aggregation."""

from ._resolver import resolve_metric_score
from ._subscores import (
Score,
coerce_metric_value,
subscore,
subscore_abs,
subscore_clip,
subscore_max,
subscore_min,
)

__all__ = [
"Score",
"subscore",
"subscore_abs",
"subscore_min",
"subscore_max",
"subscore_clip",
"coerce_metric_value",
"resolve_metric_score",
]