Example of margin of error (MOE) on CI

Example of a green 🟢 CAT run: the expected 97% success rate is met even with 5 failures out of 100 runs.

A 97% success rate is the same as 3 expected failures out of 100. According to our statistical calculations table (see line 3 in this CSV), the expected range of failures at 90% confidence is between 1 and 5.

The observed 5 failures therefore fall within the expected range.
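The 1 to 5 range can be reproduced with a short calculation. The sketch below assumes the table was built with the normal approximation for a binomial proportion; the function name `expected_failure_range` and the rounding to whole failure counts are illustrative choices, not taken from the project.

```python
import math
from statistics import NormalDist


def expected_failure_range(
    success_rate: float, runs: int, confidence: float = 0.90
) -> tuple[int, int]:
    """Whole-number failure counts inside the margin of error (hypothetical helper)."""
    failure_rate = 1.0 - success_rate
    # Two-sided z-score, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of a proportion under the normal approximation (an assumption).
    standard_error = math.sqrt(failure_rate * success_rate / runs)
    margin = z * standard_error
    # Keep only whole failure counts that lie inside the interval.
    lowest = max(0, math.ceil((failure_rate - margin) * runs))
    highest = min(runs, math.floor((failure_rate + margin) * runs))
    return lowest, highest


print(expected_failure_range(0.97, 100))  # (1, 5)
```

For 97% expected success over 100 runs the margin is roughly 2.8 percentage points, so the interval spans about 0.2 to 5.8 failures, which contains the whole counts 1 through 5.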

Our test asserts that the expected success rate of 0.97 is within the margin of error of the observed result, 5 failures out of 100. See the statistical assertion:

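    # Despite its name, failure_threshold holds the expected success rate.
    failure_threshold = 0.97
    # Count the failed generations and compare them with the expected rate;
    # the statistical check is skipped when there is only one generation.
    assert generations <= 1 or is_within_expected(
        failure_threshold, sum(not result for result in results), generations
    ), f"Expected {failure_threshold} to be within the confidence interval of the success rate"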

CAT #154 - example test run

Artifacts are saved here: CAT-run-154-saved-here.zip

The run is green because the expected success rate of 0.97 lies within the margin of error around the observed success rate of 0.95 at 90% confidence.
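To make the comparison concrete, here is a minimal sketch of such a check. It is not the project's `is_within_expected`; the name `within_margin_of_error` and the use of the normal approximation around the observed rate are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def within_margin_of_error(
    expected_success_rate: float, failures: int, runs: int, confidence: float = 0.90
) -> bool:
    """Hypothetical check: is the expected rate inside the interval around the observed rate?"""
    observed = (runs - failures) / runs
    # Two-sided z-score, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Margin of error under the normal approximation (an assumption).
    margin = z * math.sqrt(observed * (1 - observed) / runs)
    return observed - margin <= expected_success_rate <= observed + margin


print(within_margin_of_error(0.97, 5, 100))   # True
print(within_margin_of_error(0.97, 10, 100))  # False
```

With 5 failures the observed rate is 0.95 and the margin is about 0.036, so 0.97 is inside the interval. With 10 failures the interval tops out near 0.949 and the check fails. The normal approximation degenerates when there are zero failures (the margin collapses to zero), so a real implementation needs to handle that case separately.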

The 5 failures fall into two distinct groups:

3 responses with an empty list of developers []
2 responses with unexpected but valid developers

3 with an empty list of developers []

{
    "test_name": "test_metrics_100_generations",
    "folder_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21",
    "output_file": "fail-92.json",
    "metadata_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21/metadata.json",
    "validations": {
        "correct_developer_suggested": false,
        "no_developer_name_is_hallucinated": true,
        "not_empty_response": false,
        "valid_json_returned": true
    },
    "response": {
        "developers": []
    }
}

2 with unexpected but valid developer names

The test compares the response against an expected list of developers. See the acceptable names of developers:

acceptable_people = ["Sam Thomas", "Drew Anderson", "Alex Wilson", "Alex Johnson"]

{
    "test_name": "test_metrics_100_generations",
    "folder_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21",
    "output_file": "fail-87.json",
    "metadata_path": "/home/runner/work/continuous-alignment-testing/continuous-alignment-testing/examples/team_recommender/test_runs/test_metrics_100_generations-0314-22_37_21/metadata.json",
    "validations": {
        "correct_developer_suggested": false,
        "no_developer_name_is_hallucinated": true,
        "not_empty_response": true,
        "valid_json_returned": true
    },
    "response": {
        "developers": [
            {
                "name": "Jamie Johnson",
                "availableStartDate": "2025-06-15T00:00:00Z",
                "relevantSkills": [
                    {
                        "skill": "Node",
                        "level": "5"
                    }
                ]
            },
            {
                "name": "Blake Johnson",
                "availableStartDate": "2025-06-12T00:00:00Z",
                "relevantSkills": [
                    {
                        "skill": "TypeScript",
                        "level": "5"
                    },
                    {
                        "skill": "React Native",
                        "level": "3"
                    }
                ]
            },
            {
                "name": "Blake Wilson",
                "availableStartDate": "2025-06-24T00:00:00Z",
                "relevantSkills": [
                    {
                        "skill": "Kotlin",
                        "level": "3"
                    }
                ]
            }
        ]
    }
}
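The grouping above can be derived from the `validations` flags in each saved failure file. The sketch below is one way to do that; the group labels and the helper names `failure_group` and `unexpected_names` are hypothetical, and the two records are trimmed copies of `fail-92.json` and `fail-87.json` shown above.

```python
from collections import Counter

acceptable_people = ["Sam Thomas", "Drew Anderson", "Alex Wilson", "Alex Johnson"]


def failure_group(record: dict) -> str:
    """Map a saved result to a failure group (labels are illustrative)."""
    validations = record["validations"]
    if not validations["valid_json_returned"]:
        return "invalid_json"
    if not validations["not_empty_response"]:
        return "empty_developers"
    if not validations["no_developer_name_is_hallucinated"]:
        return "hallucinated_name"
    if not validations["correct_developer_suggested"]:
        return "unexpected_valid_developers"
    return "passed"


def unexpected_names(record: dict) -> list[str]:
    """Names in the response that are not in acceptable_people."""
    developers = record["response"]["developers"]
    return [d["name"] for d in developers if d["name"] not in acceptable_people]


# Trimmed copies of the two failure records shown above.
fail_92 = {
    "validations": {
        "correct_developer_suggested": False,
        "no_developer_name_is_hallucinated": True,
        "not_empty_response": False,
        "valid_json_returned": True,
    },
    "response": {"developers": []},
}
fail_87 = {
    "validations": {
        "correct_developer_suggested": False,
        "no_developer_name_is_hallucinated": True,
        "not_empty_response": True,
        "valid_json_returned": True,
    },
    "response": {
        "developers": [
            {"name": "Jamie Johnson"},
            {"name": "Blake Johnson"},
            {"name": "Blake Wilson"},
        ]
    },
}

print(Counter(failure_group(r) for r in (fail_92, fail_87)))
print(unexpected_names(fail_87))  # ['Jamie Johnson', 'Blake Johnson', 'Blake Wilson']
```

Checking the flags in a fixed order gives each failure exactly one group, which keeps the per-group counts easy to compare against the total number of failures.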
