## Overview

This repository contains Terraform configurations for managing Datadog monitoring and observability infrastructure for the PyTorch Foundation.

This infrastructure-as-code setup manages:

- Datadog Users: user accounts and role assignments
- Datadog Roles: custom role definitions and permissions
- Monitoring Resources: Datadog monitors, dashboards, and synthetics
## Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Structure](#structure)
- [Configuration](#configuration)
- [Usage](#usage)
- [Monitoring and Alerts](#monitoring-and-alerts)
- [Deployment](#deployment)
- [Accessing Datadog](#accessing-datadog)
- [Security Considerations](#security-considerations)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
## Prerequisites

- Terraform >= 1.0
- Datadog provider configured
- Appropriate Datadog API and APP keys (handled by CI/CD)
- Access to the PyTorch Datadog organization
- Valid Linux Foundation ID (LFID) for SSO access
## Structure

```text
.
├── datadog-users.tf            # User management configuration
├── datadog-roles.tf            # Custom role definitions
├── datadog-monitors.tf         # Monitor and alert definitions
├── datadog-synthetics_tests.tf # Synthetics API tests
├── variables.tf                # Variable definitions (if present)
├── terraform.tfvars            # Variable values (not committed)
├── scripts/                    # Synthetics JavaScript checks
└── README.md                   # This file
```
## Configuration

### Users

Each entry in `dd_users` supports the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `email` | string | Yes | User's email address |
| `roles` | list(string) | No | List of role IDs to assign (defaults to empty) |
| `disabled` | bool | No | Whether the account is disabled (defaults to false) |
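For context, here is a minimal, hedged sketch of how a map like this can be wired to the Datadog provider; the `dd_users` name comes from the usage examples below, while the variable shape and resource layout are assumptions rather than the repository's actual code:

```hcl
# variables.tf (sketch): shape mirrors the table above
variable "dd_users" {
  type = map(object({
    email    = string
    roles    = optional(list(string), []) # optional() defaults need Terraform >= 1.3
    disabled = optional(bool, false)
  }))
  default = {}
}

# datadog-users.tf (sketch): one datadog_user per map entry
resource "datadog_user" "users" {
  for_each = var.dd_users

  email    = each.value.email
  roles    = each.value.roles
  disabled = each.value.disabled
}
```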
### Roles

Each entry in `dd_roles` supports the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Display name for the role |
| `permissions` | list(string) | No | List of permission IDs (defaults to empty) |
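A hedged sketch of how such a map can be turned into `datadog_role` resources, resolving permission names to IDs via the provider's `datadog_permissions` data source (the actual code in `datadog-roles.tf` may be organized differently):

```hcl
variable "dd_roles" {
  type = map(object({
    name        = string
    permissions = optional(list(string), []) # optional() defaults need Terraform >= 1.3
  }))
  default = {}
}

# Map of permission names to their IDs in the Datadog org
data "datadog_permissions" "all" {}

resource "datadog_role" "roles" {
  for_each = var.dd_roles

  name = each.value.name

  # One permission block per named permission, e.g. "logs_read_data"
  dynamic "permission" {
    for_each = toset(each.value.permissions)
    content {
      id = data.datadog_permissions.all.permissions[permission.value]
    }
  }
}
```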
Common permissions you can use in custom roles:

Read permissions:

- `logs_read_data`: Read log data
- `logs_read_index_data`: Read indexed logs
- `synthetics_read`: View synthetic tests
- `cases_read`: View support cases
- `audit_logs_read`: View audit logs

Write permissions:

- `dashboards_write`: Create/edit dashboards
- `monitors_write`: Create/edit monitors
- `synthetics_write`: Create/edit synthetic tests
- `cases_write`: Create/edit support cases
- `notebooks_write`: Create/edit notebooks
- `incident_write`: Create/edit incidents
The repository defines a "Custom Read Write" role (referenced as "Limited Read Write" in users) that provides:
Read Permissions:
- Log data and archives access
- Synthetics monitoring view
- Cases and audit logs access
Write Permissions:
- Dashboard creation and editing
- Monitor management
- Synthetics test creation
- Case and notebook management
- Incident response capabilities
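As a hedged illustration (the exact permission set lives in the repository's role definitions), the role described above could be expressed in the same `dd_roles` shape used in the usage examples below; the `custom-read-write` key matches how users reference it:

```hcl
dd_roles = {
  "custom-read-write" = {
    name = "Custom Read Write"
    permissions = [
      # Read
      "logs_read_data",
      "logs_read_index_data",
      "synthetics_read",
      "cases_read",
      "audit_logs_read",
      # Write
      "dashboards_write",
      "monitors_write",
      "synthetics_write",
      "cases_write",
      "notebooks_write",
      "incident_write",
    ]
  }
}
```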
## Usage

To add yourself with the "Limited Read Write" role, create or update your `terraform.tfvars` file:

```hcl
# terraform.tfvars
dd_users = {
  "your-username" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = false
  }
}
```
Example for new team members:

```hcl
# terraform.tfvars
dd_users = {
  "jane-smith" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = false
  },
  "john-doe" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = false
  }
}
```
To assign existing Datadog roles instead of custom ones (the data sources referenced here are sketched after the example):

```hcl
# terraform.tfvars
dd_users = {
  "readonly-user" = {
    email    = "[email protected]"
    roles    = [data.datadog_role.ro_role.id] # Datadog Read Only Role
    disabled = false
  },
  "standard-user" = {
    email    = "[email protected]"
    roles    = [data.datadog_role.standard_role.id] # Datadog Standard Role
    disabled = false
  }
}
```
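The `data.datadog_role` references above assume matching data sources are declared somewhere in the Terraform configuration; a minimal sketch (the names `ro_role` and `standard_role` come from the example):

```hcl
# Look up Datadog's built-in roles by display name
data "datadog_role" "ro_role" {
  filter = "Datadog Read Only Role"
}

data "datadog_role" "standard_role" {
  filter = "Datadog Standard Role"
}
```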
You can add multiple users at once:

```hcl
# terraform.tfvars
dd_users = {
  "team-member-1" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = false
  },
  "team-member-2" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = false
  },
  "contractor" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = false
  }
}
```
To define additional custom roles:

```hcl
# terraform.tfvars
dd_roles = {
  "developer-role" = {
    name = "Developer Access"
    permissions = [
      "dashboards_read",
      "monitors_read",
      "logs_read_data",
      "synthetics_read"
    ]
  },
  "ops-team-role" = {
    name = "Operations Team"
    permissions = [
      "dashboards_write",
      "monitors_write",
      "incident_write",
      "logs_read_data"
    ]
  }
}
```
## Monitoring and Alerts

### Synthetics availability checks

These lightweight API checks verify availability and basic correctness for public PyTorch properties every 5 minutes:

- pytorch.org
  - GET https://pytorch.org → status 200 and body contains "Install PyTorch"
  - Alerts: @slack-pytorch-infra-alerts
- docs.pytorch.org
  - GET https://docs.pytorch.org/docs/stable/index.html → status 200 and body contains "PyTorch documentation"
  - Alerts: @slack-pytorch-infra-alerts
- pytorch.org/docs redirect
  - GET https://pytorch.org/docs → status 301; headers:
    - location is https://docs.pytorch.org/docs
    - server is nginx
  - Alerts: @slack-pytorch-infra-alerts
- download.pytorch.org (CDN index)
  - GET https://download.pytorch.org/whl → status 200 and body contains "pytorch"
  - Alerts: @slack-pytorch-infra-alerts
- hud.pytorch.org
  - GET https://hud.pytorch.org → status 200 and body contains "pytorch/pytorch"
  - Alerts: @slack-pytorch-infra-alerts
- landscape.pytorch.org
  - GET https://landscape.pytorch.org → status 200 and body contains "landscape"
  - Alerts: @slack-pytorch-infra-alerts
- discuss.pytorch.org
  - GET https://discuss.pytorch.org → status 200 and body contains "PyTorch Forums"
  - Alerts: @webhook-lf-incident-io (follow LF runbook)
- dev-discuss.pytorch.org
  - GET https://dev-discuss.pytorch.org → status 200 and body contains "PyTorch releases"
  - Alerts: @slack-pytorch-infra-alerts

Cadence: `tick_every = 300` (seconds); retries: 3 attempts at a 300,000 ms interval. A hedged sketch of one of these tests follows below.
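For illustration, a minimal sketch of how the pytorch.org check could be declared with the Datadog provider; the resource name, location, and message text are assumptions, while the URL, assertions, cadence, and retry values come from this README:

```hcl
resource "datadog_synthetics_test" "pytorch_org" {
  name    = "pytorch.org availability" # hypothetical name
  type    = "api"
  subtype = "http"
  status  = "live"

  locations = ["aws:us-east-1"] # assumed location
  message   = "pytorch.org check failed @slack-pytorch-infra-alerts"

  request_definition {
    method = "GET"
    url    = "https://pytorch.org"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  assertion {
    type     = "body"
    operator = "contains"
    target   = "Install PyTorch"
  }

  options_list {
    tick_every = 300 # run every 5 minutes
    retry {
      count    = 3
      interval = 300000 # ms between retries
    }
  }
}
```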
### CI SEV issue check

This check watches for open issues labeled "ci: sev" in pytorch/pytorch and fails if any are found:

- GET https://github.com/pytorch/pytorch/issues?q=state%3Aopen%20label%3A%22ci%3A%20sev%22
- Expect status 200 and body contains "No results"
- Alerts: @slack-pytorch-infra-alerts
- Cadence: `tick_every = 300` (seconds)
### Runner queue monitors

These API tests detect long GitHub Actions runner queues and alert Slack.

How it works:

- Each test calls the HUD endpoint https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D
- The script expects HTTP 200, parses the JSON response, and filters it by machine_type pattern
- If any item exceeds a per-vendor queue-time threshold, the test fails
- On failure, the script logs a human-readable message, which is included in the Datadog alert and sent to Slack (see the wiring sketch after this section)

Scripts and thresholds:

| Filter on machine_type | Threshold |
|---|---|
| starts with `lf.` | > 10,800 s (3 h) |
| includes `.dgx.` | > 10,800 s (3 h) |
| includes `.rocm.` | > 14,400 s (4 h) |
| includes `.s390x` | > 7,200 s (2 h) |
| includes `.idc.` | > 10,800 s (3 h) |
| equals `linux.aws.h100` | > 21,600 s (6 h) |
| excludes `.dgx.`, `.rocm.`, `.s390x`, `^lf\.`, `^linux.aws.h100` | > 10,800 s (3 h) |

Example failure message (from script stderr):

```text
High queue detected for machine types containing .s390x: linux.s390x (7300s)
```
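A hedged sketch of how one of these script-backed tests might be wired up, assuming the provider supports JavaScript assertions (`type = "javascript"` with a `code` attribute); the file name under `scripts/` is hypothetical, and the actual tests in `datadog-synthetics_tests.tf` may be structured differently:

```hcl
resource "datadog_synthetics_test" "queue_lf" {
  name    = "LF runner queue check" # hypothetical name
  type    = "api"
  subtype = "http"
  status  = "live"

  locations = ["aws:us-east-1"] # assumed location
  message   = "Runner queue too long @slack-pytorch-infra-alerts"

  request_definition {
    method = "GET"
    url    = "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
  }

  # JavaScript assertion: the script parses the response, filters by
  # machine_type, and fails when a queue exceeds its threshold
  assertion {
    type = "javascript"
    code = file("${path.module}/scripts/lf_queue_check.js") # hypothetical filename
  }

  options_list {
    tick_every = 300
    retry {
      count    = 3
      interval = 300000
    }
  }
}
```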
### Autoscaler and GitHub API monitors

Event- and metric-based monitors supporting autoscaler and GitHub API health:

- ALI AutoScaler Dead Letter Queue High Number Of Messages
  - Query: `sum(last_5m):max:aws.sqs.number_of_messages_sent{queuename:ghci-lf-queued-builds-dead-letter}.as_count() > 5000`
  - Thresholds: warning 1000; critical 5000
  - Action: check scale-up logs; alerts to @webhook-lf-incident-io, @slack-PyTorch-pytorch-infra-alerts, @slack-Linux_Foundation-pytorch-alerts
- ALI ValidationException Detected
  - Type: event-v2 alert on an SNS event with the title "ALI ValidationException Detected" in the last 5 minutes
  - Critical when count > 0
  - Action: review scale-up Lambda logs; possibly revert the test-infra release
  - Alerts: @slack-PyTorch-pytorch-infra-alerts, @slack-Linux_Foundation-pytorch-alerts, @webhook-lf-incident-io
- GitHub API usage unusually high
  - Type: event-v2 alert on an SNS event with the title "GitHub API usage unusually high" in the last 5 minutes
  - Critical when count > 0
  - Action: review ALI rate limit metrics and API call counts
  - Alerts: @slack-PyTorch-pytorch-infra-alerts, @slack-Linux_Foundation-pytorch-alerts, @webhook-lf-incident-io

A hedged sketch of the first monitor appears below.
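This sketch uses the documented query and thresholds; the resource name and message wording are assumptions:

```hcl
resource "datadog_monitor" "ali_dlq_high" {
  name = "ALI AutoScaler Dead Letter Queue High Number Of Messages"
  type = "query alert"

  # The critical threshold must match the comparison in the query
  query = "sum(last_5m):max:aws.sqs.number_of_messages_sent{queuename:ghci-lf-queued-builds-dead-letter}.as_count() > 5000"

  monitor_thresholds {
    warning  = 1000
    critical = 5000
  }

  message = <<-EOT
    Dead letter queue is filling up; check the scale-up logs.
    @webhook-lf-incident-io @slack-PyTorch-pytorch-infra-alerts @slack-Linux_Foundation-pytorch-alerts
  EOT
}
```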
## Deployment

All infrastructure changes are deployed automatically through GitHub Actions workflows. The deployment process includes:

- Code Quality Checks: all commits must pass MegaLinter validation
- Terraform Planning: changes are planned and validated before deployment
- Automated Apply: approved changes are automatically applied to the Datadog organization

The repository uses GitHub Actions with MegaLinter for continuous deployment:

- On Pull Request: runs the MegaLinter suite (includes `tflint`, `tofu fmt`, and security checks)
- On Merge to Main: automatically applies changes after all checks pass
- Manual Triggers: the infrastructure team can manually trigger deployments when needed
All commits pushed to any branch must pass the complete MegaLinter validation suite:

- ✅ Terraform Formatting: code formatting with `tofu fmt`
- ✅ Terraform Linting: best practices and error detection with `tflint`
- ✅ Security Scanning: infrastructure security checks
- ✅ Documentation: README and code documentation validation
- ✅ Configuration Validation: syntax and logic validation via `terraform plan`

Commits that fail MegaLinter checks will be rejected and cannot be merged.
Before any deployment, all code must pass MegaLinter validation, which includes:

- TFLint: Terraform linting and best practices
- OpenTofu Formatting: code formatting with `tofu fmt`
- Security Scanning: infrastructure security checks
- Documentation: README and code documentation validation
If you want to run individual checks for troubleshooting:

```bash
# Format all files
tofu fmt

# Check formatting
tofu fmt -check

# Run Terraform linting
tflint
```
If you need to test changes locally after MegaLinter validation:

1. Initialize Terraform:

   ```bash
   terraform init
   ```

2. Review the plan:

   ```bash
   terraform plan
   ```

3. Apply changes (caution: this affects production):

   ```bash
   terraform apply
   ```

4. Verify deployment: check the Datadog UI to confirm users and roles were created correctly.
## Accessing Datadog

Once your user account has been provisioned through this Terraform configuration, you can access the PyTorch Datadog organization at https://datadog.pytorch.org.

The PyTorch Datadog organization is integrated with Linux Foundation Identity (LFID) for authentication:

1. Navigate to https://datadog.pytorch.org
2. Click "Login with SSO" or "Single Sign-On"
3. Use your Linux Foundation ID (LFID) credentials
4. You will be automatically redirected to Datadog with the appropriate role permissions
When accessing Datadog for the first time:
- Ensure your user is provisioned: Your email must be added to this Terraform configuration and deployed
- Use your LFID: Login with the same email address that was provisioned in the Terraform config
- Verify permissions: Check that you can access the appropriate dashboards and features based on your assigned role
If you cannot access Datadog:

- Check user provisioning: ensure your user has been added to `terraform.tfvars` and deployed
- Verify email match: your LFID email must exactly match the email in the Terraform configuration
- Role assignment: confirm your user has been assigned the correct role (e.g., "Limited Read Write")
- SSO configuration: contact the LF PyTorch infrastructure team if SSO login fails
After logging in with SSO, your access will be determined by your assigned role:
- Limited Read Write Role: Can view all monitoring data and create/edit dashboards, monitors, and incidents
- Admin Role: Full administrative access (reserved for infrastructure team)
- Read Only Role: View-only access to monitoring data
- Standard Role: Basic Datadog access with limited write permissions
## Security Considerations

- Principle of Least Privilege: only assign necessary permissions
- Regular Review: periodically audit user access and roles
- Disabled Accounts: use `disabled = true` instead of deleting users when access is temporarily revoked (example below)
- External Users: consider using separate roles for contractors/external users
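For example, temporarily revoking access in `terraform.tfvars` (a sketch reusing the hypothetical contractor entry from the usage examples above):

```hcl
dd_users = {
  "contractor" = {
    email    = "[email protected]"
    roles    = [datadog_role.roles["custom-read-write"].id]
    disabled = true # access revoked, account retained
  }
}
```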
## Troubleshooting

- Permission not found errors:
  - Check that permission names match exactly
  - Verify the permissions exist in your Datadog org
  - Use `terraform plan` to see available permissions
- Role assignment failures:
  - Ensure roles are created before assigning them to users
  - Check that role IDs are correctly referenced
- User creation failures:
  - Verify email addresses are valid
  - Check that users don't already exist in Datadog

If problems persist:

- Check the Terraform logs: `TF_LOG=DEBUG terraform apply`
- Review the Datadog provider documentation
- Contact the PyTorch infrastructure team
## Contributing

1. Create a feature branch from `main`
2. Make your changes to the Terraform configuration
3. Run MegaLinter locally to validate all code quality requirements
4. Fix any issues identified by MegaLinter
5. Test locally: run `terraform plan` to validate your changes
6. Commit and push: push your branch to trigger GitHub Actions checks
7. Submit a pull request with a clear description of changes
8. Address feedback: fix any issues identified by reviewers or MegaLinter
9. Merge after approval: once approved and all checks pass, merge to main
Before committing code, ensure:

- Code passes MegaLinter validation (run locally)
- `terraform plan` runs successfully
- Changes are tested and documented
- All security requirements are met
All pull requests automatically run MegaLinter, which includes:

- Terraform Formatting (`tofu fmt`): ensures code follows formatting standards
- Terraform Linting (`tflint`): validates best practices and catches common errors
- Security Scanning: checks for security issues in the configuration
- Documentation Validation: ensures README and comments are up to date
- Plan Validation: confirms the configuration is valid and shows planned changes

Changes cannot be merged until all MegaLinter checks pass.
The repository uses MegaLinter's Terraform flavor, which includes:

- Multiple Terraform/OpenTofu validators
- Security scanners (Checkov, TFSec)
- Documentation linters
- General code quality tools

For detailed configuration, see `.mega-linter.yml` (if present) or the default Terraform flavor settings.