
Example of MOE on CI

Paul Zabelin edited this page Mar 15, 2025 · 8 revisions

This is an example of a green 🟢 CAT run: the expected 97% success rate is still met with 5 failures out of 100 runs.

A 97% success rate corresponds to 3 expected failures out of 100. According to our statistical calculations table (see line 3 in this CSV), the expected number of failures at that rate is between 1 and 5 with 90% confidence.

The observed 5 failures are therefore within the expected range.
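The range of 1 to 5 failures can be reproduced approximately with a normal approximation to the binomial distribution. The sketch below is an illustration only: the function name `expected_failure_range` is hypothetical, and the assumption that the CSV was produced with this particular approximation is ours, not something the table states.

```python
import math
from statistics import NormalDist


def expected_failure_range(
    success_rate: float, sample_size: int, confidence: float = 0.90
) -> tuple[int, int]:
    """Whole-number failure counts consistent with success_rate.

    Assumption: a two-sided normal (Wald) approximation; the project's
    own table may be computed differently.
    """
    failure_rate = 1.0 - success_rate
    # Two-sided z-score, roughly 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = math.sqrt(failure_rate * success_rate / sample_size)
    margin = z * standard_error
    lower = max(0.0, failure_rate - margin) * sample_size
    upper = min(1.0, failure_rate + margin) * sample_size
    # Keep only whole failure counts that lie inside the interval.
    return math.ceil(lower), math.floor(upper)


print(expected_failure_range(0.97, 100))  # (1, 5)
```

For 100 runs at a 97% success rate the interval is about 0.19 to 5.81 failures, which rounds inward to the whole counts 1 through 5.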

Our test asserts that the expected success rate of 0.97 is within the margin of error of the rate observed with 5 failures out of 100; see the statistical assertion:

    failure_threshold = 0.97
    assert generations <= 1 or is_within_expected(
        failure_threshold, sum(not result for result in results), generations
    ), f"Expected {failure_threshold} to be within the confidence interval of the success rate"

CAT #154 - example test run

Artifacts are saved here: CAT-run-154-saved-here.zip


The run is green because the expected success rate of 0.97 lies within the margin of error of the observed success rate of 0.95 at 90% confidence.
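As a rough check of that statement, the margin of error around the observed rate can be computed by hand. This is a minimal sketch assuming a normal-approximation confidence interval; `within_margin_of_error` is a hypothetical helper written for this page, not the project's `is_within_expected`, whose implementation may differ.

```python
import math
from statistics import NormalDist


def within_margin_of_error(
    expected_success_rate: float,
    failure_count: int,
    sample_size: int,
    confidence: float = 0.90,
) -> bool:
    """True if the expected rate lies inside the interval around the observed rate.

    Assumption: two-sided normal (Wald) interval centred on the observed rate.
    """
    observed = (sample_size - failure_count) / sample_size
    # Two-sided z-score, roughly 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(observed * (1.0 - observed) / sample_size)
    return observed - margin <= expected_success_rate <= observed + margin


# 5 failures out of 100: observed 0.95, margin about 0.036, so 0.97 is inside.
print(within_margin_of_error(0.97, 5, 100))
# 10 failures out of 100: observed 0.90, margin about 0.049, so 0.97 is outside.
print(within_margin_of_error(0.97, 10, 100))
```

With 5 failures the interval is roughly 0.914 to 0.986, which contains 0.97. With 10 failures it is roughly 0.851 to 0.949, which does not.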

The 5 failures fall into two distinct groups:

  1. responses with an empty list of developers []
  2. responses with unexpected but valid developers

3 failures with an empty list of developers []

{
    "test_name": "test_metrics_100_generations",
    "folder_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21",
    "output_file": "fail-92.json",
    "metadata_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21/metadata.json",
    "validations": {
        "correct_developer_suggested": false,
        "no_developer_name_is_hallucinated": true,
        "not_empty_response": false,
        "valid_json_returned": true
    },
    "response": {
        "developers": []
    }
}

2 failures with unexpected but valid developer names

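The test defines the list of expected developers; see acceptable names of developers:

acceptable_people = ["Sam Thomas", "Drew Anderson", "Alex Wilson", "Alex Johnson"]

In the failure below, the suggested developers are not hallucinated, but none of them is in the acceptable list:
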
{
    "test_name": "test_metrics_100_generations",
    "folder_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21",
    "output_file": "fail-87.json",
    "metadata_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21/metadata.json",
    "validations": {
        "correct_developer_suggested": false,
        "no_developer_name_is_hallucinated": true,
        "not_empty_response": true,
        "valid_json_returned": true
    },
    "response": {
        "developers": [
            {
                "name": "Jamie Johnson",
                "availableStartDate": "2025-06-15T00:00:00Z",
                "relevantSkills": [
                    {
                        "skill": "Node",
                        "level": "5"
                    }
                ]
            },
            {
                "name": "Blake Johnson",
                "availableStartDate": "2025-06-12T00:00:00Z",
                "relevantSkills": [
                    {
                        "skill": "TypeScript",
                        "level": "5"
                    },
                    {
                        "skill": "React Native",
                        "level": "3"
                    }
                ]
            },
            {
                "name": "Blake Wilson",
                "availableStartDate": "2025-06-24T00:00:00Z",
                "relevantSkills": [
                    {
                        "skill": "Kotlin",
                        "level": "3"
                    }
                ]
            }
        ]
    }
}
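The two groups can also be derived from the `validations` block of each saved failure file. The sketch below is only an illustration: `classify_failure` and its group labels are hypothetical, the sample records are abbreviated from the two artifacts shown on this page, and the "invalid JSON" and "hallucinated developer names" branches are assumed for completeness since neither case occurred in this run.

```python
from collections import Counter


def classify_failure(record: dict) -> str:
    """Map a saved failure record to a group, based on its validations."""
    validations = record["validations"]
    if not validations["valid_json_returned"]:
        return "invalid JSON"  # assumed label, not seen in this run
    if not validations["not_empty_response"]:
        return "empty list of developers"
    if not validations["no_developer_name_is_hallucinated"]:
        return "hallucinated developer names"  # assumed label, not seen in this run
    if not validations["correct_developer_suggested"]:
        return "unexpected but valid developers"
    return "no failed validation"


# Abbreviated copies of fail-92.json and fail-87.json from this page.
fail_92 = {
    "output_file": "fail-92.json",
    "validations": {
        "correct_developer_suggested": False,
        "no_developer_name_is_hallucinated": True,
        "not_empty_response": False,
        "valid_json_returned": True,
    },
}
fail_87 = {
    "output_file": "fail-87.json",
    "validations": {
        "correct_developer_suggested": False,
        "no_developer_name_is_hallucinated": True,
        "not_empty_response": True,
        "valid_json_returned": True,
    },
}

groups = Counter(classify_failure(record) for record in (fail_92, fail_87))
for group, count in groups.items():
    print(f"{count} x {group}")
```

To apply this to a whole run, the same function could be called on every `fail-*.json` file loaded from the saved artifacts with `json.load`.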