41 changes: 31 additions & 10 deletions docs/docs/learn/evaluation/metrics.md
@@ -6,17 +6,15 @@ sidebar_position: 5

DSPy is a machine learning framework, so you must think about your **automatic metrics** for evaluation (to track your progress) and optimization (so DSPy can make your programs more effective).

## What is a metric and how do I define a metric for my task?

A metric is just a function that will take examples from your data and the output of your system and return a score that quantifies how good the output is. What makes outputs from your system good or bad?

For simple tasks, this could be just "accuracy", "exact match", or "F1 score", as in simple classification or short-form QA tasks.

However, for most applications, your system will produce long-form outputs. There, your metric should probably be a smaller DSPy program that checks multiple properties of the output (quite possibly using AI feedback from LMs).

Getting this right on the first try is unlikely, but you should start with something simple and iterate.

## Simple metrics

@@ -54,6 +52,34 @@ def validate_context_and_answer(example, pred, trace=None):

Defining a good metric is an iterative process, so doing some initial evaluations and looking at your data and outputs is key.

## Multi-objective metrics with subscores

Many real systems must balance more than one objective: quality vs. leakage, answer accuracy vs. latency, etc. DSPy metrics now expose a simple helper called [`dspy.metrics.subscore`](../../api/index.md) that lets you declare named subscores inside an ordinary Python metric. Each `subscore` behaves like a float so you can keep writing intuitive math, while DSPy records the subscore values, metadata, and the expression you returned.

```python
from dspy.evaluate import answer_exact_match  # built-in exact-match metric
from dspy.metrics import subscore

# `bleu_like` below stands in for any text-similarity function you define.
def metric(example, pred, ctx=None):
acc = subscore("accuracy", answer_exact_match(example, pred), bounds=(0, 1))
bleu = subscore("bleu", bleu_like(example.answer, pred.answer), bounds=(0, 1))
latency = subscore(
"latency_s",
(ctx.latency_ms or 0) / 1000 if ctx else 0,
maximize=False,
units="s",
)
return acc**2 + 0.3 * bleu - 0.02 * latency
```

When this metric runs during evaluation or optimization, DSPy evaluates the returned expression to obtain the aggregate scalar (preserving backwards compatibility), but also keeps a `Score` object that exposes:

- `scalar`: the numeric value of the expression (`acc**2 + …`).
- `subscores`: the resolved subscores, e.g. `{"accuracy": 1.0, "bleu": 0.73, "latency_s": 0.42}`.
- `info`: metadata such as the canonical expression string and any per-subscore metadata you provided (bounds, maximize, units, cost, …).

Optimizers can use those subscores directly for Pareto frontiers or constrained search, and evaluation tables will include additional columns for each subscore.
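
For instance, you can read those subscores back after an evaluation run. The sketch below is illustrative only: it assumes the `metric` defined above plus a `program` and `devset` you already have, and it relies on per-example scores being surfaced as `Score` objects inside the `results` field.

```python
import dspy

evaluator = dspy.Evaluate(devset=devset, metric=metric, display_table=True)
result = evaluator(program)

print(result.score)  # the overall aggregate for the devset
for example, prediction, score in result.results:
    # each per-example `score` is a Score object rather than a bare float
    print(score.scalar, score.subscores)  # e.g. {"accuracy": 1.0, "bleu": 0.73, "latency_s": 0.42}
    print(score.info.get("expr"))         # the canonical expression string, if recorded
```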

Metrics that return subscores typically accept a third argument `ctx`, which contains runtime information (latency, token usage, optional seed). If you omit `subscore`, nothing changes—legacy metrics that return a plain float continue to work as before.
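
Concretely, both styles below are valid. The second is a minimal sketch that only relies on the context fields listed above; the exact-match logic and the latency weight are illustrative, not prescribed by DSPy.

```python
from dspy.metrics import subscore

# Legacy style: a plain-float metric with no `ctx` argument keeps working unchanged.
def exact_match(example, pred):
    return float(example.answer == pred.answer)

# ctx-aware style: the optional third argument carries runtime information.
def penalized_match(example, pred, ctx=None):
    acc = subscore("accuracy", float(example.answer == pred.answer), bounds=(0, 1))
    latency_s = subscore(
        "latency_s",
        (ctx.latency_ms or 0) / 1000 if ctx is not None else 0.0,
        maximize=False,
        units="s",
    )
    return acc - 0.01 * latency_s
```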

## Evaluation

@@ -79,7 +105,6 @@ evaluator = Evaluate(devset=YOUR_DEVSET, num_threads=1, display_progress=True, d
evaluator(YOUR_PROGRAM, metric=YOUR_METRIC)
```

## Intermediate: Using AI feedback for your metric

For most applications, your system will produce long-form outputs, so your metric should check multiple dimensions of the output using AI feedback from LMs.
@@ -104,7 +129,7 @@ def metric(gold, pred, trace=None):

engaging = "Does the assessed text make for a self-contained, engaging tweet?"
correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

@@ -117,20 +142,16 @@ def metric(gold, pred, trace=None):

When compiling, `trace is not None`, and we want to be strict about judging things, so we will only return `True` if `score >= 2`. Otherwise, we return a score out of 1.0 (i.e., `score / 2.0`).
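
The tail of that metric might look roughly like the sketch below (assuming the `Assess` signature exposes a boolean `assessment_answer` output field, as the calls above suggest):

```python
    # ...continuing inside `metric(gold, pred, trace=None)` after the two Assess calls:
    score = correct.assessment_answer + engaging.assessment_answer  # 0, 1, or 2

    if trace is not None:    # compiling/optimizing: judge strictly
        return score >= 2
    return score / 2.0       # evaluating: a score out of 1.0
```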

## Advanced: Using a DSPy program as your metric

If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) the metric itself. That's usually easy because the metric's output is a simple value (e.g., a score out of 5), so the metric's own metric is easy to define and optimize by collecting a few examples.
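
One way to set this up, sketched under assumed names (`labeled_judgments`, the judge signature, and the 0.5-point tolerance are illustrative, not part of DSPy), is:

```python
import dspy

# A hypothetical LM judge that rates a response on a 1-5 scale.
judge = dspy.ChainOfThought("question, response -> score: float")

# A few hand-labeled (question, response, gold score) triples.
judge_trainset = [
    dspy.Example(question=q, response=r, score=s).with_inputs("question", "response")
    for q, r, s in labeled_judgments  # assumed list of human ratings
]

# The metric's metric: does the judge land within half a point of the human score?
def judge_metric(example, pred, trace=None):
    return abs(float(pred.score) - float(example.score)) <= 0.5

optimizer = dspy.BootstrapFewShot(metric=judge_metric)
compiled_judge = optimizer.compile(judge, trainset=judge_trainset)
```

The compiled judge can then be dropped back into your main metric in place of the raw judge.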

### Advanced: Accessing the `trace`

When your metric is used during evaluation runs, DSPy will not try to track the steps of your program.

But during compiling (optimization), DSPy will trace your LM calls. The trace will contain inputs/outputs to each DSPy predictor and you can leverage that to validate intermediate steps for optimization.

```python
def validate_hops(example, pred, trace=None):
hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]
130 changes: 122 additions & 8 deletions dspy/evaluate/evaluate.py
@@ -1,7 +1,10 @@
import csv
import dataclasses
import importlib
import inspect
import json
import logging
import time
import types
from typing import TYPE_CHECKING, Any, Callable

@@ -11,6 +14,8 @@
import tqdm

import dspy
from dspy.metrics import Score
from dspy.metrics._subscores import _begin_collect, _end_collect, finalize_scores
from dspy.primitives.prediction import Prediction
from dspy.utils.callback import with_callbacks
from dspy.utils.parallelizer import ParallelExecutor
@@ -45,6 +50,23 @@ def HTML(x: str) -> str:  # noqa: N802
logger = logging.getLogger(__name__)


@dataclasses.dataclass
class EvaluationMetricContext:
usage: dict | None = None
latency_ms: float | None = None
seed: int | None = None

@property
def cache_key(self) -> tuple[Any, ...]:
usage_key = None
if self.usage is not None:
try:
usage_key = json.dumps(self.usage, sort_keys=True, default=repr)
except TypeError:
usage_key = repr(self.usage)
return (self.latency_ms, usage_key, self.seed)


class EvaluationResult(Prediction):
"""
A class that represents the result of an evaluation.
@@ -109,6 +131,7 @@ def __init__(
self.failure_score = failure_score
self.save_as_csv = save_as_csv
self.save_as_json = save_as_json
self._metric_accepts_ctx_cache: dict[int, bool] = {}

if "return_outputs" in kwargs:
raise ValueError("`return_outputs` is no longer supported. Results are always returned inside the `results` field of the `EvaluationResult` object.")
@@ -168,16 +191,31 @@ def __call__(
)

def process_item(example):
start_time = time.perf_counter()
prediction = program(**example.inputs())
latency_ms = (time.perf_counter() - start_time) * 1000.0

prediction_obj = _extract_prediction_object(prediction)
usage = prediction_obj.get_lm_usage() if isinstance(prediction_obj, Prediction) else None
ctx = EvaluationMetricContext(usage=usage, latency_ms=latency_ms)

if isinstance(prediction_obj, Prediction):
prediction_obj.bind_example(example)

scores = self._execute_metric(metric, example, prediction, ctx)

if isinstance(prediction_obj, Prediction) and isinstance(scores, Score):
prediction_obj._store_scores(scores, ctx.cache_key)

return prediction, scores

results = executor.execute(process_item, devset)
assert len(devset) == len(results)

results = [((dspy.Prediction(), Score(self.failure_score)) if r is None else r) for r in results]
results = [(example, prediction, score) for example, (prediction, score) in zip(devset, results, strict=False)]
aggregates = [score.scalar for *_, score in results]
ncorrect, ntotal = sum(aggregates), len(devset)

logger.info(f"Average Metric: {ncorrect} / {ntotal} ({round(100 * ncorrect / ntotal, 1)}%)")

@@ -227,19 +265,19 @@ def process_item(example):

@staticmethod
def _prepare_results_output(
results: list[tuple["dspy.Example", "dspy.Example", Score]], metric_name: str
):
return [
(
merge_dicts(example, prediction) | _scores_to_row(score, metric_name)
if prediction_is_dictlike(prediction)
else dict(example) | {"prediction": prediction} | _scores_to_row(score, metric_name)
)
for example, prediction, score in results
]

def _construct_result_table(
self, results: list[tuple["dspy.Example", "dspy.Example", Score]], metric_name: str
) -> "pd.DataFrame":
"""
Construct a pandas DataFrame from the specified result list.
@@ -262,6 +300,49 @@

return result_df.rename(columns={"correct": metric_name})

def _execute_metric(
self,
metric: Callable | None,
example: "dspy.Example",
prediction: Any,
ctx: EvaluationMetricContext,
) -> Score:
if metric is None:
if isinstance(prediction, Prediction):
scores = prediction.resolve_score(ctx)
if scores is None:
raise ValueError("Prediction does not provide a score and no metric was supplied.")
return scores
raise ValueError("No metric provided for evaluation.")

token = _begin_collect()
try:
if self._metric_accepts_context(metric):
result = metric(example, prediction, ctx)
else:
result = metric(example, prediction)
finally:
collector = _end_collect(token)

ctx_info: dict[str, Any] = {}
if ctx.usage is not None:
ctx_info["usage"] = ctx.usage
if ctx.latency_ms is not None:
ctx_info["latency_ms"] = ctx.latency_ms
if ctx.seed is not None:
ctx_info["seed"] = ctx.seed

return finalize_scores(result, collector, ctx_info=ctx_info)

def _metric_accepts_context(self, metric: Callable) -> bool:
cache_key = id(metric)
cached = self._metric_accepts_ctx_cache.get(cache_key)
if cached is not None:
return cached
accepts = _callable_accepts_context(metric)
self._metric_accepts_ctx_cache[cache_key] = accepts
return accepts

def _display_result_table(self, result_df: "pd.DataFrame", display_table: bool | int, metric_name: str):
"""
Display the specified result DataFrame in a table format.
Expand Down Expand Up @@ -321,6 +402,39 @@ def merge_dicts(d1, d2) -> dict:
return merged


def _scores_to_row(scores: Score, metric_name: str) -> dict[str, Any]:
row = {metric_name: scores.scalar}
for subscore_name, value in scores.subscores.items():
row[f"{metric_name}.{subscore_name}"] = value
expr = scores.info.get("expr") if isinstance(scores.info, dict) else None
if expr is not None:
row[f"{metric_name}.expr"] = expr
return row


def _extract_prediction_object(prediction: Any) -> Any:
if isinstance(prediction, Prediction):
return prediction
if isinstance(prediction, tuple) and prediction:
first = prediction[0]
if isinstance(first, Prediction):
return first
return prediction


def _callable_accepts_context(metric: Callable) -> bool:
try:
sig = inspect.signature(metric)
except (TypeError, ValueError):
return True

params = list(sig.parameters.values())
for param in params:
if param.kind in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD):
return True
return len(params) >= 3


def truncate_cell(content) -> str:
"""Truncate content of a cell to 25 words."""
words = str(content).split()
23 changes: 23 additions & 0 deletions dspy/metrics/__init__.py
@@ -0,0 +1,23 @@
"""Public helpers for metric subscores and score aggregation."""

from ._resolver import resolve_metric_score
from ._subscores import (
Score,
coerce_metric_value,
subscore,
subscore_abs,
subscore_clip,
subscore_max,
subscore_min,
)

__all__ = [
"Score",
"subscore",
"subscore_abs",
"subscore_min",
"subscore_max",
"subscore_clip",
"coerce_metric_value",
"resolve_metric_score",
]