A modern web application for tracking outage troubleshooting notes for Site Reliability Engineers (SREs), inspired by Outalator from the Google SRE book.
- Outage Tracking: Create and manage outages with comprehensive details
- Multi-Service Alerts: Import alerts from multiple oncall notification services
  - PagerDuty
  - OpsGenie
  - Extensible architecture for additional services
- Note-Taking: Add plaintext or markdown notes to outages
- Tagging System: Organize outages with flexible key-value tags (e.g., Jira tickets, services, regions)
- Modular Storage: Interface-based storage layer with PostgreSQL implementation
- RESTful API: Clean HTTP API for all operations
- Slack Bot Integration: Interact with outages directly from Slack
  - Create outages and add notes via messages
  - Tag messages with emoji reactions to add them as notes
- MCP Server: Model Context Protocol interface for AI assistants
  - Claude Desktop integration
  - Natural language outage management
The application is built with Go and follows clean architecture principles:
- Domain Layer: Core business entities (Outage, Alert, Note, Tag)
- Storage Layer: Pluggable storage interface with PostgreSQL implementation
- Notification Layer: Extensible notification service integrations
- Service Layer: Business logic and orchestration
- API Layer: HTTP handlers and routing
- Go 1.21 or higher
- PostgreSQL 12 or higher
- (Optional) PagerDuty API key
- (Optional) OpsGenie API key
- (Optional) Slack workspace with bot permissions
- (Optional) Claude Desktop for MCP integration
- Clone the repository:
git clone https://github.com/conall/outalator.git
cd outalator
- Install Go dependencies:
go mod download
- Set up PostgreSQL:
# Create the user first, then a database owned by that user
createuser outalator
createdb -O outalator outalator
# Run migrations
psql -U outalator -d outalator -f migrations/001_initial_schema.sql
- Configure the application:
cp config.example.yaml config.yaml
# Edit config.yaml with your settings
- Build and run:
go build -o outalator cmd/outalator/main.go
./outalator -config config.yaml
Or run directly:
go run cmd/outalator/main.go -config config.yaml
Outalator includes a tool to bootstrap your database with historical incidents from PagerDuty or OpsGenie.
# Build the import tool
make build-import
# List available teams
./bin/import-history -service pagerduty -list-teams
# Import historical data (dry run first)
./bin/import-history -service pagerduty -since 2024-01-01T00:00:00Z -dry-run
# Import for real
./bin/import-history -service pagerduty -since 2024-01-01T00:00:00Z
# Import specific teams only
./bin/import-history -service pagerduty -since 2024-01-01T00:00:00Z -teams "TEAM_ID_1,TEAM_ID_2"
For complete documentation including examples, troubleshooting, and best practices, see docs/IMPORT_HISTORY.md.
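The flags used in the commands above can be read with Go's standard flag package. The following is only a sketch of how such arguments might be parsed, not the import tool's actual code: the importOptions type and parseImportArgs function are hypothetical, the sketch assumes -since is an RFC 3339 timestamp (as in the examples) and that -dry-run means nothing is written, and it leaves out -list-teams for brevity.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"strings"
	"time"
)

// importOptions is a hypothetical holder for the flags shown in the examples.
type importOptions struct {
	Service string
	Since   time.Time
	Teams   []string
	DryRun  bool
}

// parseImportArgs parses -service, -since, -teams and -dry-run.
func parseImportArgs(args []string) (importOptions, error) {
	fs := flag.NewFlagSet("import-history", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // return errors to the caller instead of printing them
	service := fs.String("service", "", "pagerduty or opsgenie")
	since := fs.String("since", "", "RFC 3339 timestamp, e.g. 2024-01-01T00:00:00Z")
	teams := fs.String("teams", "", "comma-separated team IDs")
	dryRun := fs.Bool("dry-run", false, "assumed meaning: report only, write nothing")
	if err := fs.Parse(args); err != nil {
		return importOptions{}, err
	}
	if *service != "pagerduty" && *service != "opsgenie" {
		return importOptions{}, fmt.Errorf("unsupported service %q", *service)
	}
	opts := importOptions{Service: *service, DryRun: *dryRun}
	if *since != "" {
		t, err := time.Parse(time.RFC3339, *since)
		if err != nil {
			return importOptions{}, fmt.Errorf("invalid -since value: %w", err)
		}
		opts.Since = t
	}
	// Split the team list and drop empty entries such as a trailing comma.
	for _, id := range strings.Split(*teams, ",") {
		if id = strings.TrimSpace(id); id != "" {
			opts.Teams = append(opts.Teams, id)
		}
	}
	return opts, nil
}

func main() {
	opts, err := parseImportArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("service=%s since=%s teams=%v dry-run=%t\n",
		opts.Service, opts.Since.Format(time.RFC3339), opts.Teams, opts.DryRun)
}
```

Validating the timestamp before any API call is made means a malformed -since value fails fast, which is useful when the first run is a dry run.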
Configuration can be provided via YAML file and/or environment variables.
server:
  host: 0.0.0.0
  port: 8080
database:
  host: localhost
  port: 5432
  user: outalator
  password: outalator
  dbname: outalator
  sslmode: disable
# Optional: PagerDuty integration
pagerduty:
  api_key: your-pagerduty-api-key
# Optional: OpsGenie integration
opsgenie:
  api_key: your-opsgenie-api-key
Environment variables override config file values:
- SERVER_HOST: Server host
- SERVER_PORT: Server port
- DB_HOST: Database host
- DB_PORT: Database port
- DB_USER: Database user
- DB_PASSWORD: Database password
- DB_NAME: Database name
- PAGERDUTY_API_KEY: PagerDuty API key
- OPSGENIE_API_KEY: OpsGenie API key
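The override order described above (file values first, environment variables on top) can be illustrated with a small Go sketch. This is not the project's configuration loader: the ServerConfig type and applyEnvOverrides function are hypothetical, and only the two server variables are handled to keep the example short.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ServerConfig mirrors the server section of the YAML example (hypothetical type).
type ServerConfig struct {
	Host string
	Port int
}

// applyEnvOverrides returns a copy of cfg with SERVER_HOST and SERVER_PORT
// applied when they are set and non-empty. The lookup function is injected
// so the logic can be exercised without touching the real environment.
func applyEnvOverrides(cfg ServerConfig, lookup func(string) (string, bool)) (ServerConfig, error) {
	if v, ok := lookup("SERVER_HOST"); ok && v != "" {
		cfg.Host = v
	}
	if v, ok := lookup("SERVER_PORT"); ok && v != "" {
		port, err := strconv.Atoi(v)
		if err != nil {
			return cfg, fmt.Errorf("SERVER_PORT %q is not a number: %w", v, err)
		}
		cfg.Port = port
	}
	return cfg, nil
}

func main() {
	// Values as they would come from the YAML file.
	fromFile := ServerConfig{Host: "0.0.0.0", Port: 8080}

	cfg, err := applyEnvOverrides(fromFile, os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("listening on %s:%d\n", cfg.Host, cfg.Port)
}
```

Treating an empty variable as unset is a design choice made for this sketch; the real application may behave differently.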
POST /api/v1/outages
Content-Type: application/json
{
  "title": "API Gateway Outage",
  "description": "Users unable to access API endpoints",
  "severity": "high",
  "alert_ids": ["PXYZ123"],
  "tags": [
    {"key": "jira", "value": "OPS-1234"},
    {"key": "service", "value": "api-gateway"}
  ]
}
GET /api/v1/outages?limit=50&offset=0
GET /api/v1/outages/{id}
PATCH /api/v1/outages/{id}
Content-Type: application/json
{
  "status": "resolved",
  "severity": "medium"
}
POST /api/v1/outages/{id}/notes
Content-Type: application/json
{
  "content": "Identified root cause: database connection pool exhaustion",
  "format": "markdown",
  "author": "[email protected]"
}
POST /api/v1/outages/{id}/tags
Content-Type: application/json
{
  "key": "jira",
  "value": "OPS-5678"
}
GET /api/v1/tags/search?key=jira&value=OPS-1234
POST /api/v1/alerts/import
Content-Type: application/json
{
  "source": "pagerduty",
  "external_id": "PXYZ123",
  "outage_id": "optional-uuid-to-associate"
}
GET /health
Outalator supports OIDC authentication with providers like Okta, Auth0, Google, etc. When authentication is enabled, all notes are automatically tagged with the authenticated user's email address.
Add to your config.yaml:
auth:
  enabled: true
  issuer: https://your-company.okta.com
  client_id: your-okta-client-id
  client_secret: your-okta-client-secret
  redirect_url: http://localhost:8080/auth/callback
  session_key: generate-a-random-32-byte-base64-key
Generate a session key:
openssl rand -base64 32
With auth.enabled set to false, the application accepts requests without authentication (useful for development).
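With authentication turned off, the endpoints listed earlier can be exercised from any HTTP client during development. As a minimal sketch, the Go code below builds the POST /api/v1/outages request from the example body. The Go type and function names are illustrative rather than taken from the project, the base URL assumes the default host and port from the configuration example, and the response format is not shown because the source does not document it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Tag and CreateOutageRequest mirror the JSON body shown for
// POST /api/v1/outages. The Go type names are illustrative.
type Tag struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

type CreateOutageRequest struct {
	Title       string   `json:"title"`
	Description string   `json:"description"`
	Severity    string   `json:"severity"`
	AlertIDs    []string `json:"alert_ids,omitempty"`
	Tags        []Tag    `json:"tags,omitempty"`
}

// newCreateOutageRequest builds (but does not send) the HTTP request.
func newCreateOutageRequest(baseURL string, body CreateOutageRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("encode outage: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/outages", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateOutageRequest("http://localhost:8080", CreateOutageRequest{
		Title:       "API Gateway Outage",
		Description: "Users unable to access API endpoints",
		Severity:    "high",
		AlertIDs:    []string{"PXYZ123"},
		Tags:        []Tag{{Key: "jira", Value: "OPS-1234"}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Sending is left out so the sketch runs without a server:
	//   resp, err := http.DefaultClient.Do(req)
	fmt.Println(req.Method, req.URL.String())
}
```

Separating request construction from sending keeps the payload easy to inspect, and the same pattern extends to the notes and tags endpoints.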
Outalator includes a Slack bot that allows teams to interact with outages directly from Slack.
- Create outages using simple text commands
- Add notes to outages via direct messages
- Tag existing Slack messages to add them as notes using emoji reactions
- Create a Slack app and get your bot token and signing secret
- Configure Outalator using your preferred method:
Config file:
slack:
  enabled: true
  bot_token: xoxb-your-bot-token
  signing_secret: your-signing-secret
  reaction_emoji: outage_note # Any emoji name without colons
CLI flags:
./outalator -slack-enabled -slack-bot-token=xoxb-... -slack-reaction-emoji=bookmark
Environment variables:
export SLACK_ENABLED=true SLACK_BOT_TOKEN=xoxb-... SLACK_REACTION_EMOJI=memo
- Set up event subscriptions in Slack to point to https://your-server.com/slack/events
Create an outage:
outage API Gateway is down | Users cannot authenticate | critical
Add a note:
note 123e4567-e89b-12d3-a456-426614174000 Restarted the API gateway service
Tag a message:
- Post a message mentioning the outage ID
- React with your configured emoji (e.g., :outage_note:, :bookmark:)
- The message is automatically added as a note
For complete setup instructions and troubleshooting, see docs/SLACK_INTEGRATION.md.
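The two text commands shown above have a simple shape: an outage command with three pipe-separated fields, and a note command with an outage ID followed by free text. The sketch below shows one way such messages could be parsed in Go. It is not the bot's actual parser: the type and function names are hypothetical, and it assumes all three outage fields are required.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// OutageCommand holds the fields of "outage <title> | <description> | <severity>".
type OutageCommand struct {
	Title, Description, Severity string
}

// NoteCommand holds the fields of "note <outage-id> <text>".
type NoteCommand struct {
	OutageID, Content string
}

// parseOutageCommand splits the message on "|" and trims each field.
func parseOutageCommand(text string) (OutageCommand, error) {
	rest, ok := strings.CutPrefix(strings.TrimSpace(text), "outage ")
	if !ok {
		return OutageCommand{}, errors.New("not an outage command")
	}
	parts := strings.Split(rest, "|")
	if len(parts) != 3 {
		return OutageCommand{}, fmt.Errorf("expected title | description | severity, got %d field(s)", len(parts))
	}
	cmd := OutageCommand{
		Title:       strings.TrimSpace(parts[0]),
		Description: strings.TrimSpace(parts[1]),
		Severity:    strings.TrimSpace(parts[2]),
	}
	if cmd.Title == "" {
		return OutageCommand{}, errors.New("title must not be empty")
	}
	return cmd, nil
}

// parseNoteCommand treats the second word as the outage ID and the
// remainder of the message as the note content.
func parseNoteCommand(text string) (NoteCommand, error) {
	fields := strings.SplitN(strings.TrimSpace(text), " ", 3)
	if len(fields) != 3 || fields[0] != "note" {
		return NoteCommand{}, errors.New("expected: note <outage-id> <text>")
	}
	return NoteCommand{OutageID: fields[1], Content: strings.TrimSpace(fields[2])}, nil
}

func main() {
	o, err := parseOutageCommand("outage API Gateway is down | Users cannot authenticate | critical")
	fmt.Printf("%+v %v\n", o, err)

	n, err := parseNoteCommand("note 123e4567-e89b-12d3-a456-426614174000 Restarted the API gateway service")
	fmt.Printf("%+v %v\n", n, err)
}
```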
The MCP (Model Context Protocol) server provides a standardized interface for AI assistants like Claude to interact with outages.
# Build the MCP server
go build -o mcp-server ./cmd/mcp-server
# Run it
./mcp-server -config config.yaml
Add to your Claude Desktop configuration:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "outalator": {
      "command": "/path/to/outalator/mcp-server",
      "args": ["-config", "/path/to/config.yaml"]
    }
  }
}
Available tools:
- list_outages: List all outages with pagination
- get_outage: Get details of a specific outage
- create_outage: Create a new outage entry
- add_note: Add a note to an existing outage
- update_outage: Update an outage's status or severity
Example prompts:
- "What outages have we had in the past week?"
- "Create an outage for the database connection issues"
- "Add a note that we restarted the Redis cluster"
- "Update outage 123... to resolved status"
For complete documentation, see docs/MCP_SERVER.md.
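Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The Go sketch below builds such a message for one of the tools listed above. The Go type and function names are illustrative, and the argument names in main (title, severity) are assumptions, since the tool schemas are not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCallParams is the params object of an MCP tools/call request.
type toolCallParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments,omitempty"`
}

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string         `json:"jsonrpc"`
	ID      int            `json:"id"`
	Method  string         `json:"method"`
	Params  toolCallParams `json:"params"`
}

// newToolCall encodes a tools/call request for the named tool.
func newToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  toolCallParams{Name: tool, Arguments: args},
	})
}

func main() {
	// Argument names are assumptions made for illustration only.
	msg, err := newToolCall(1, "create_outage", map[string]any{
		"title":    "Database connection issues",
		"severity": "high",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(msg))
}
```

In normal use Claude Desktop produces these messages itself; building one by hand is mainly useful when debugging the server.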
Outalator can be deployed to Kubernetes using either Helm or Kustomize.
# Create namespace
kubectl create namespace outalator
# Install with Helm
helm install outalator ./helm/outalator \
  --namespace outalator \
  --set ingress.hosts[0].host=outalator.example.com \
  --set config.auth.issuer=https://your-company.okta.com \
  --set secrets.auth.clientId=your-client-id \
  --set secrets.auth.clientSecret=your-client-secret
# Update configuration in k8s/base/configmap.yaml and k8s/base/secret.yaml
# Then apply
kubectl apply -k k8s/overlays/prod
For complete Kubernetes deployment documentation including:
- Detailed Helm configuration
- Kustomize overlays
- Okta/OIDC setup
- Security best practices
- Monitoring and troubleshooting
docker build -t outalator:latest .
docker tag outalator:latest your-registry/outalator:v1.0.0
docker push your-registry/outalator:v1.0.0
The application uses PostgreSQL with the following tables:
- outages: Main outage tracking table
- alerts: Imported alerts from notification services
- notes: Troubleshooting notes attached to outages (with author attribution)
- tags: Key-value metadata tags
See migrations/001_initial_schema.sql for the complete schema.
To add a new notification service:
- Create a new package under internal/notification/
- Implement the notification.Service interface:
  - Name() string
  - FetchAlert(ctx, alertID) (*Alert, error)
  - FetchRecentAlerts(ctx, since) ([]*Alert, error)
  - WebhookHandler() interface{}
- Register the service in cmd/outalator/main.go
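The notification steps above can be sketched as follows. The method names come from the list; everything else is an assumption made for illustration: the parameter types (context.Context, string, time.Time), the fields of Alert, and the staticService implementation, which serves fixed data instead of calling a real provider.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Alert is a stand-in for the project's alert type; its fields are hypothetical.
type Alert struct {
	ExternalID  string
	Title       string
	TriggeredAt time.Time
}

// Service follows the method list in the README; parameter types are assumed.
type Service interface {
	Name() string
	FetchAlert(ctx context.Context, alertID string) (*Alert, error)
	FetchRecentAlerts(ctx context.Context, since time.Time) ([]*Alert, error)
	WebhookHandler() interface{}
}

// staticService is a hypothetical implementation backed by fixed data.
type staticService struct{ alerts []*Alert }

func newDemoService() *staticService {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return &staticService{alerts: []*Alert{
		{ExternalID: "A1", Title: "disk full", TriggeredAt: base},
		{ExternalID: "A2", Title: "high latency", TriggeredAt: base.Add(time.Hour)},
	}}
}

func (s *staticService) Name() string { return "static" }

func (s *staticService) FetchAlert(_ context.Context, alertID string) (*Alert, error) {
	for _, a := range s.alerts {
		if a.ExternalID == alertID {
			return a, nil
		}
	}
	return nil, fmt.Errorf("alert %q not found", alertID)
}

// FetchRecentAlerts returns alerts triggered at or after since.
func (s *staticService) FetchRecentAlerts(_ context.Context, since time.Time) ([]*Alert, error) {
	var out []*Alert
	for _, a := range s.alerts {
		if !a.TriggeredAt.Before(since) {
			out = append(out, a)
		}
	}
	return out, nil
}

// This sketch exposes no webhook, so it returns nil.
func (s *staticService) WebhookHandler() interface{} { return nil }

// Compile-time check that staticService satisfies Service.
var _ Service = (*staticService)(nil)

// recentTitles shows a caller that depends only on the interface.
func recentTitles(svc Service, since time.Time) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	alerts, err := svc.FetchRecentAlerts(ctx, since)
	if err != nil {
		return nil, fmt.Errorf("%s: %w", svc.Name(), err)
	}
	titles := make([]string, 0, len(alerts))
	for _, a := range alerts {
		titles = append(titles, a.Title)
	}
	return titles, nil
}

func main() {
	svc := newDemoService()
	titles, err := recentTitles(svc, time.Date(2024, 1, 1, 0, 30, 0, 0, time.UTC))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(svc.Name(), titles)
}
```

The compile-time assertion is a common Go idiom for catching a missing method as soon as the interface changes.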
To add a new storage backend:
- Create a new package under internal/storage/
- Implement the storage.Storage interface
- Update cmd/outalator/main.go to use the new storage
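To illustrate the interface-based storage idea, here is a minimal in-memory backend. The real storage.Storage interface is not reproduced in this README, so the two-method Storage interface, the Outage fields, and every name below are hypothetical. Treat it as a sketch of the pattern rather than a drop-in implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Outage is a reduced, hypothetical version of the domain entity.
type Outage struct {
	ID, Title, Status string
}

// ErrNotFound is returned when no outage has the requested ID.
var ErrNotFound = errors.New("outage not found")

// Storage is a hypothetical two-method subset of a storage interface.
type Storage interface {
	CreateOutage(ctx context.Context, o Outage) error
	GetOutage(ctx context.Context, id string) (Outage, error)
}

// memoryStorage keeps outages in a map guarded by a mutex.
type memoryStorage struct {
	mu      sync.RWMutex
	outages map[string]Outage
}

func newMemoryStorage() *memoryStorage {
	return &memoryStorage{outages: make(map[string]Outage)}
}

func (m *memoryStorage) CreateOutage(_ context.Context, o Outage) error {
	if o.ID == "" {
		return errors.New("outage ID is required")
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, exists := m.outages[o.ID]; exists {
		return fmt.Errorf("outage %q already exists", o.ID)
	}
	m.outages[o.ID] = o
	return nil
}

func (m *memoryStorage) GetOutage(_ context.Context, id string) (Outage, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	o, ok := m.outages[id]
	if !ok {
		return Outage{}, ErrNotFound
	}
	return o, nil
}

// Compile-time check that memoryStorage satisfies Storage.
var _ Storage = (*memoryStorage)(nil)

// saveAndReload depends only on the interface, so any backend can be passed in.
func saveAndReload(store Storage, o Outage) (Outage, error) {
	ctx := context.Background()
	if err := store.CreateOutage(ctx, o); err != nil {
		return Outage{}, err
	}
	return store.GetOutage(ctx, o.ID)
}

func main() {
	store := newMemoryStorage()
	got, err := saveAndReload(store, Outage{ID: "o-1", Title: "API Gateway Outage", Status: "open"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", got)
}
```

An in-memory backend like this is mostly useful in tests, where it removes the need for a running PostgreSQL instance.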
go test ./...
outalator/
├── cmd/
│ ├── outalator/ # Main application entry point
│ ├── mcp-server/ # MCP server for AI assistants
│ └── import-history/ # Historical data import tool
├── internal/
│ ├── api/ # HTTP handlers and routes
│ ├── config/ # Configuration management
│ ├── domain/ # Domain models and DTOs
│ ├── mcp/ # MCP server implementation
│ ├── slack/ # Slack bot integration
│ ├── notification/ # Notification service integrations
│ │ ├── opsgenie/
│ │ └── pagerduty/
│ ├── service/ # Business logic
│ └── storage/ # Storage layer
│ └── postgres/ # PostgreSQL implementation
├── docs/ # Documentation
│ ├── IMPORT_HISTORY.md
│ ├── SLACK_INTEGRATION.md
│ └── MCP_SERVER.md
├── migrations/ # Database migration scripts
├── config.example.yaml # Example configuration file
├── go.mod # Go module definition
└── README.md # This file
Contributions are welcome! Please feel free to submit issues and pull requests.
MIT License - See LICENSE file for details