@nnshah1 nnshah1 commented Jun 9, 2025

Overview:

Adds a set of fault tolerance tests that inject failures and summarize their impact.
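For reviewers new to the idea, the inject-then-summarize pattern can be sketched in a few lines. This is an illustrative sketch only, not the code in this PR: the function name `inject_failure_and_summarize`, the sleeping child process used as a stand-in worker, and the "request succeeds while the worker is alive" rule are all assumptions made for the example.

```python
import subprocess
import sys


def inject_failure_and_summarize(total_requests: int = 10,
                                 kill_before_request: int = 4) -> dict:
    """Kill a stand-in worker partway through a run and count the impact."""
    # Hypothetical worker: a child Python process that only sleeps.
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    outcomes = []
    try:
        for i in range(total_requests):
            if i == kill_before_request:
                # Inject the failure: terminate the worker and reap it.
                worker.kill()
                worker.wait()
            # Assumption: a "request" succeeds only while the worker is alive.
            outcomes.append(worker.poll() is None)
    finally:
        # Never leave the child process running after the run.
        if worker.poll() is None:
            worker.kill()
            worker.wait()
    succeeded = sum(outcomes)
    return {
        "total": total_requests,
        "succeeded": succeeded,
        "failed": total_requests - succeeded,
    }


if __name__ == "__main__":
    print(inject_failure_and_summarize())
```

A real run would replace the sleeping child with an actual serving component and the liveness check with real client requests; the summary dictionary is the part that carries over.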

Details:

Uses pytest as the test runner; options and configurations select the scenario to run.
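One common way to drive several scenarios from a single pytest test is `pytest.mark.parametrize`. The sketch below assumes that approach and is not taken from `test_runner.py`: the `Scenario` fields, the scenario names, the `simulate` stand-in and its numbers are all hypothetical.

```python
from dataclasses import dataclass

import pytest


@dataclass(frozen=True)
class Scenario:
    name: str                   # used as the pytest test id
    failure: str                # hypothetical: which component to kill
    max_failed_fraction: float  # hypothetical pass/fail threshold


# Hypothetical scenario table; a real suite would load these from config files.
SCENARIOS = [
    Scenario("no_failure", "none", 0.0),
    Scenario("worker_killed", "worker", 0.5),
]


def simulate(scenario: Scenario, total: int = 10) -> float:
    """Stand-in for a real run: return the fraction of failed requests."""
    failed = 0 if scenario.failure == "none" else 3  # made-up placeholder
    return failed / total


@pytest.mark.parametrize("scenario", SCENARIOS, ids=lambda s: s.name)
def test_fault_tolerance(scenario: Scenario) -> None:
    # One test function, one generated test case per scenario.
    assert simulate(scenario) <= scenario.max_failed_fraction
```

With this layout a single scenario can be selected by id using pytest's `-k` option, for example `pytest -k worker_killed`.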

Where should the reviewer start?

test_runner.py

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Introduced a comprehensive fault tolerance test suite, including new test scenarios, client utilities, metrics collection, result parsing, and Circus watcher management tools.
    • Added multiple configuration files for diverse deployment and fault injection scenarios.
    • Enhanced process and logging management for test reliability and analysis.
    • Added support for configurable logging and dynamic respawn behavior via environment variables.
    • Added new pytest markers for GPU-based tests.
  • Bug Fixes

    • Made test servers use a consistent directory for temporary data.
  • Documentation

    • Added detailed README for the fault tolerance test suite, covering architecture, execution, and results analysis.
  • Refactor

    • Centralized and improved process termination logic for managed processes.
    • Updated internal references for component addresses to use instance-level attributes.
  • Tests

    • Added new fixtures, utilities, and parameterization for flexible and robust testing of distributed inference serving.
  • Style

    • Added missing newline to a configuration file for consistency.

copy-pr-bot bot commented Jun 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nnshah1 nnshah1 force-pushed the neelays/fault_tolerance_tests branch from 7dd0da7 to 08b7599 Compare June 16, 2025 12:02