# BuildCheck

BuildCheck is a comprehensive tool for analyzing GitHub organizations to identify build tools, their versions, and Artifactory repository usage across Jenkins jobs and other CI/CD pipelines.
## Features

- 🔍 Repository Discovery: Automatically finds all repositories in a GitHub organization
- 🎯 Single Repository Analysis: Target specific repositories for focused analysis
- 🛠️ Build Tool Detection: Identifies Maven, Gradle, npm, Grunt, Packer, Docker, and Jenkins configurations
- 📦 Artifactory Analysis: Discovers Artifactory repositories used for dependencies and artifacts
- 📊 Comprehensive Reporting: Generates detailed reports with tool versions and repository usage
- 🚀 Jenkins Pipeline Analysis: Specialized analysis for Jenkinsfiles and pipeline configurations
- ⚡ High Performance: Only scans root directory for build files (much faster than full repository scan)
- 📋 Missing Build Detection: Reports repositories that don't have build configurations
- 🚫 Smart Filtering: Automatically excludes infrastructure and Terraform repositories
- ⚡ Jenkins-Only Mode: Ultra-fast analysis of only repositories with Jenkinsfiles
- 🚀 Parallel Processing: Multi-threaded analysis for faster processing
- 📊 Enhanced Progress: Real-time progress with file analysis tracking
- 🔍 Repository Discovery Progress: Visual progress tracking for repository discovery and filtering
- 📊 API Call Prediction: Predict API usage before starting analysis to avoid rate limits
- ⚡ Bulk File Fetching: Reduce API calls by up to 80% using optimized bulk operations
- 🎯 Organization Size Estimation: Quick assessment of repository count without full enumeration
## Supported Build Files

- Maven: `pom.xml`, `maven-wrapper.properties`
- Gradle: `build.gradle`, `gradle-wrapper.properties`
- npm: `package.json`, `package-lock.json`
- Docker: `Dockerfile`, `docker-compose.yml`
- Grunt: `Gruntfile.js`, `Gruntfile.coffee`, `package.json` (for grunt dependencies)
- Packer: `*.pkr.hcl`, `*.pkr.json`, `packer.json`, `packer.pkr.hcl`
- Jenkins: `Jenkinsfile`, pipeline configurations
## Installation

1. Clone this repository:
   ```bash
   git clone <repository-url>
   cd BuildCheck
   ```
2. Set up the virtual environment and install dependencies:
   ```bash
   ./setup.sh
   ```
3. Set up your GitHub token:
   ```bash
   export GITHUB_TOKEN=your_github_personal_access_token
   ```
## Configuration

BuildCheck supports a YAML configuration file that lets you set your organization name, parallelism settings, and repository exclusions without passing command line options every time.
```bash
# Create a configuration file with your organization settings
python setup_config.py --org your-organization-name

# Or with additional options
python setup_config.py --org your-organization-name --jenkins-only --max-workers 6 --verbose
```
The configuration file (`config.yaml`) supports the following settings:
```yaml
# GitHub Organization Configuration
organization: "your-org-name"  # Required: GitHub organization to analyze

# Performance Settings
parallelism:
  max_workers: 8          # Number of parallel workers (default: 8, recommended: 4-8)
  rate_limit_delay: 0.05  # Delay between API calls in seconds (default: 0.05)

# Repository Exclusions
exclusions:
  # Exact repository names to exclude
  repositories:
    - "infrastructure-environments"
    - "infrastructure-modules"
    - "documentation"
    - "wiki-content"
  # Pattern-based exclusions (supports wildcards and regex)
  patterns:
    - "terraform-*"  # Exclude all repositories starting with "terraform-"
    - "*-infra"      # Exclude repositories ending with "-infra"
    - "legacy-*"     # Exclude repositories starting with "legacy-"
    - "test-*"       # Exclude test repositories
    - "demo-*"       # Exclude demo repositories

# Analysis Mode
analysis:
  jenkins_only: false      # Only analyze repositories with Jenkinsfiles (much faster)
  single_repository: null  # Analyze a specific repository (e.g., "my-repo") or null for all

# Caching Configuration
caching:
  enabled: true        # Enable caching of repository lists
  directory: ".cache"  # Directory to store cache files
  duration: 3600       # Cache duration in seconds (1 hour)

# Output Configuration
output:
  json_report: null  # Output file for JSON report (e.g., "report.json") or null to skip
  csv_report: null   # Output file for CSV report (e.g., "report.csv") or null to skip
  html_report: null  # Output file for HTML report (e.g., "report.html") or null to skip
  verbose: false     # Enable verbose logging
```
Once you have a configuration file, you can run BuildCheck without command line arguments:
```bash
# Run with default config.yaml
python build_check.py

# Run with custom configuration file
python build_check.py --config my-config.yaml

# Command line options override configuration file settings
python build_check.py --verbose --max-workers 4
```
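As a rough sketch of how that override precedence can work, the snippet below loads the YAML file and then applies only the flags that were explicitly supplied; `load_settings` and the merge logic are illustrative, not BuildCheck's actual implementation:

```python
# Minimal sketch of config-file loading with CLI overrides.
# Keys mirror the config.yaml shown above; the merge helper is an
# assumption for illustration, not BuildCheck's real code path.
import argparse
import yaml

def load_settings(path="config.yaml"):
    with open(path) as fh:
        return yaml.safe_load(fh) or {}

parser = argparse.ArgumentParser()
parser.add_argument("--config", "-c", default="config.yaml")
parser.add_argument("--verbose", action="store_true", default=None)
parser.add_argument("--max-workers", type=int, default=None)
args = parser.parse_args()

settings = load_settings(args.config)
# CLI flags win only when explicitly provided (default=None otherwise).
if args.verbose is not None:
    settings.setdefault("output", {})["verbose"] = args.verbose
if args.max_workers is not None:
    settings.setdefault("parallelism", {})["max_workers"] = args.max_workers
```

Using `default=None` for each flag makes it possible to distinguish "flag not given" from "flag given with its default value", which is what lets the command line override the file only when the user actually typed something.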
## Output Formats

BuildCheck supports multiple output formats for different use cases:
**Console Output (default)**
- Rich, colored display with tables and progress bars
- Summary sections and detailed analysis
- Perfect for interactive use and a quick overview

**JSON Report (`--output`)**
- Structured data format for programmatic processing
- Complete dataset with metadata
- Ideal for integration with other tools and APIs

**CSV Report (`--csv`)**
- Spreadsheet-friendly format
- Single table with all findings
- Perfect for Excel, Google Sheets, or data analysis tools

**HTML Report (`--html`)**
- Web-friendly format with styling
- Interactive tables and summary statistics
- Great for sharing with stakeholders or embedding in dashboards
You can generate multiple formats simultaneously:
```bash
python build_check.py --org your-org --output report.json --csv report.csv --html report.html
```
```bash
# Show current configuration settings
python setup_config.py show-config

# Show configuration from specific file
python setup_config.py show-config --config my-config.yaml

# Create default configuration file
python build_check.py --create-config

# Create configuration file with custom path
python build_check.py --create-config --config my-config.yaml
```
## Usage

```bash
# Set up virtual environment and install dependencies
./setup.sh

# Set your GitHub token
export GITHUB_TOKEN=your_github_personal_access_token

# Activate virtual environment (if not already active)
source venv/bin/activate

# Run analysis on all repositories in the organization
python build_check.py --org your-organization-name

# Analyze a specific repository
python build_check.py --org your-organization-name --repo your-repo-name
```

```bash
# JSON report (structured data)
python build_check.py --org your-organization-name --output report.json

# CSV report (spreadsheet format)
python build_check.py --org your-organization-name --csv report.csv

# HTML report (web-friendly format)
python build_check.py --org your-organization-name --html report.html

# Multiple formats at once
python build_check.py --org your-organization-name --output report.json --csv report.csv --html report.html
```
```bash
# Full analysis (all repositories)
./run_analysis.sh your-organization-name

# Jenkins-only mode (much faster)
./run_analysis.sh your-organization-name jenkins-only

# Parallel processing with 8 workers
./run_analysis.sh your-organization-name 8

# Jenkins-only with 6 parallel workers
./run_analysis.sh your-organization-name jenkins-only 6
```
```bash
# Use custom delay between API calls (default: 0.05 seconds, optimized for performance)
python build_check.py --org your-organization-name --rate-limit-delay 0.1

# Analyze a specific repository with custom delay
python build_check.py --org your-organization-name --repo your-repo-name --rate-limit-delay 0.1

# Jenkins-only mode with custom delay
python build_check.py --org your-organization-name --jenkins-only --rate-limit-delay 0.05

# Parallel processing with 8 workers
python build_check.py --org your-organization-name --max-workers 8

# Combine options
python build_check.py --org your-organization-name --output report.json --rate-limit-delay 0.05 --max-workers 6

# Single repository with verbose logging
python build_check.py --org your-organization-name --repo your-repo-name --verbose --output repo-analysis.json

# Enable verbose logging for debugging
python build_check.py --org your-organization-name --verbose

# Verbose logging with Jenkins-only mode
python build_check.py --org your-organization-name --verbose --jenkins-only
```
The tool supports caching repository lists to reduce API calls during development and testing:
```bash
# Enable caching (reduces API calls significantly)
python build_check.py --org your-organization-name --use-cache

# Use custom cache directory
python build_check.py --org your-organization-name --use-cache --cache-dir /tmp/buildcheck-cache

# Clear cache before running
python build_check.py --org your-organization-name --clear-cache

# Jenkins-only mode with caching
python build_check.py --org your-organization-name --jenkins-only --use-cache
```
Use the cache manager utility to inspect and manage cache files:
```bash
# List all cache files
python cache_manager.py list

# Clear all cache files
python cache_manager.py clear

# Clear cache for specific organization
python cache_manager.py clear --org your-organization-name

# Inspect a specific cache file
python cache_manager.py inspect your-org_jenkins_repos.pkl
```
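Under the hood, a cache like this can be as simple as a pickled repository list with a timestamp check. The sketch below is illustrative; the file naming and function names are assumptions, not BuildCheck's internals:

```python
# Minimal sketch of a time-bounded pickle cache for repository lists,
# matching the caching settings above (directory, duration).
import os
import pickle
import time

def load_cached_repos(org, cache_dir=".cache", max_age=3600):
    path = os.path.join(cache_dir, f"{org}_repos.pkl")  # name is illustrative
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, "rb") as fh:
            return pickle.load(fh)  # cache hit: no API calls needed
    return None  # cache miss or expired

def save_cached_repos(org, repos, cache_dir=".cache"):
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, f"{org}_repos.pkl"), "wb") as fh:
        pickle.dump(repos, fh)
```

On a cache hit the repository list is served from disk, so repository discovery costs zero API calls on repeated runs.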
BuildCheck includes advanced API optimization features to handle large organizations efficiently:
Predict API usage before starting analysis:
```bash
# Predict API usage for full analysis
python build_check.py --org your-org --predict-api

# Predict API usage for Jenkins-only mode
python build_check.py --org your-org --jenkins-only --predict-api
```
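The prediction itself is essentially arithmetic over the per-repo cost of each mode. A minimal sketch, using cost constants taken from the comparison table below rather than from BuildCheck's source:

```python
# Back-of-the-envelope API usage estimate. The per-repo costs are
# assumptions drawn from this README's mode comparison table.
CALLS_PER_REPO = {"full": 10, "jenkins-only": 3, "bulk": 2, "cached": 1}

def predict_api_calls(repo_count, mode="full"):
    discovery = -(-repo_count // 100)  # ceiling division: 100 repos per page
    return discovery + repo_count * CALLS_PER_REPO[mode]

print(predict_api_calls(760, "jenkins-only"))  # 8 + 760 * 3 = 2288
```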
Use bulk file fetching to dramatically reduce API calls:
```bash
# Use bulk analysis for better efficiency
python build_check.py --org your-org --bulk-analysis

# Combine with other optimizations
python build_check.py --org your-org --jenkins-only --bulk-analysis --predict-api
```
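The core idea behind bulk fetching is to list the repository root once and match candidate build files locally, instead of requesting each file individually. A minimal sketch with PyGithub; the real `--bulk-analysis` code path may differ:

```python
# Sketch of the bulk idea: one root listing instead of one probe per file.
from github import Github

BUILD_FILES = {"pom.xml", "build.gradle", "package.json",
               "Dockerfile", "Gruntfile.js", "Jenkinsfile"}

gh = Github("your_github_token")            # placeholder token
repo = gh.get_repo("your-org/your-repo")    # placeholder repository

root_listing = repo.get_contents("")        # one API call for the whole root
present = {f.name for f in root_listing} & BUILD_FILES
print(f"Build files found: {sorted(present)}")
```

Probing a dozen candidate files one by one costs a dozen `get_contents` calls; a single root listing replaces them all, which is where reductions of up to 80% come from. The table below summarizes typical API usage per mode.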
| Mode | API Calls | Time | Use Case |
|------|-----------|------|----------|
| Full Analysis | ~10 per repo | Slow | Complete analysis |
| Jenkins-Only | ~3 per repo | Fast | CI/CD focus |
| Bulk Analysis | ~2 per repo | Very Fast | Large organizations |
| Cached | ~1 per repo | Instant | Repeated runs |
```bash
# For organizations with 500+ repositories
python build_check.py --org your-org --jenkins-only --bulk-analysis --predict-api --use-cache --max-workers 4

# For development and testing
python build_check.py --org your-org --use-cache --verbose

# For maximum efficiency
python build_check.py --org your-org --jenkins-only --bulk-analysis --rate-limit-delay 0.05
```
For detailed information about API optimization, see `API_OPTIMIZATION_GUIDE.md`.
## What It Analyzes
### Build Tools Detected
#### Maven
- **Files**: `pom.xml`, `maven-wrapper.properties`
- **Version Detection**: Maven version, wrapper version
- **Usage**: Maven build configurations and dependencies
#### Gradle
- **Files**: `build.gradle`, `gradle-wrapper.properties`, `gradle.properties`
- **Version Detection**: Gradle version, wrapper distribution
- **Usage**: Gradle build scripts and configurations
#### npm
- **Files**: `package.json`, `package-lock.json`
- **Version Detection**: Node.js and npm engine requirements
- **Usage**: JavaScript/Node.js project dependencies
#### Grunt (New!)
- **Files**: `Gruntfile.js`, `Gruntfile.coffee`, `package.json`
- **Version Detection**: Grunt and grunt-cli versions
- **Usage**: JavaScript task automation and build processes
#### Packer (New!)
- **Files**: `*.pkr.hcl`, `*.pkr.json`, `packer.json`, `packer.pkr.hcl`
- **Version Detection**: Packer version requirements
- **Usage**: Infrastructure as Code image building
#### Docker
- **Files**: `Dockerfile`, `docker-compose.yml`, `docker-compose.yaml`
- **Version Detection**: Base image versions
- **Usage**: Containerization and deployment
#### Jenkins
- **Files**: `Jenkinsfile`, `Jenkinsfile.groovy`, `.jenkins/pipeline.groovy`
- **Version Detection**: Agent labels and tool configurations
- **Usage**: CI/CD pipeline definitions
### Artifactory Integration
- Repository URLs and names
- Usage patterns (pull dependencies vs push artifacts)
- Credential configurations
- Repository references in build scripts
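As an illustration, repository references can be detected with a URL pattern applied to fetched build files. The regex below is a hedged sketch, not BuildCheck's exact rule set:

```python
# Illustrative pattern for spotting Artifactory repository references
# in build files; the capture group extracts the repository name.
import re

ARTIFACTORY_URL = re.compile(
    r"https?://[\w.-]+/artifactory/([\w.-]+)", re.IGNORECASE
)

pom_snippet = """
<repository>
  <url>https://artifactory.example.com/artifactory/libs-release-local</url>
</repository>
"""
for match in ARTIFACTORY_URL.finditer(pom_snippet):
    print("Artifactory repo:", match.group(1))  # -> libs-release-local
```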
## Output
The tool provides:
1. **Console Report**: Rich formatted output with tables and summaries
2. **JSON Report**: Structured data for further processing
3. **Detailed Analysis**: Per-repository breakdown of tools and configurations
## Example Output
```
Maven: 3.8.6, 3.9.0     Repositories: service-a, service-b, api-gateway
Gradle: 7.6, 8.0        Repositories: mobile-app, backend-service
Grunt: 1.4.3, 1.5.0     Repositories: frontend-app, ui-components
Packer: 1.8.0, 1.9.0    Repositories: infrastructure, ami-builder

libs-release-local: push     Used in: service-a, service-b
libs-snapshot-local: both    Used in: api-gateway, mobile-app

Excluded 8 infrastructure/Terraform repositories from analysis:
  • infrastructure-environments
  • infrastructure-modules
  • terraform-aws
  • terraform-azure
  • terraform-gcp
  • terraform-modules
  • terraform-templates
  • terraform-variables

Found 5 repositories without build configurations:
  • documentation
  • readme-updates
  • test-repo
  • wiki-content
  • legacy-project
```
### Environment Variables

- `GITHUB_TOKEN`: Your GitHub personal access token (required)

### Command Line Options

- `--org`: GitHub organization name (can also be set in the config file)
- `--repo`: Specific repository name to analyze (e.g., "my-repo"). If not specified, all repositories in the organization are analyzed.
- `--token`: GitHub personal access token (optional if set in the environment)
- `--output`: Output file for JSON report (optional)
- `--csv`: Output file for CSV report (optional)
- `--html`: Output file for HTML report (optional)
- `--rate-limit-delay`: Delay between API calls in seconds (default: 0.05, optimized for performance)
- `--jenkins-only`: Only analyze repositories with Jenkinsfiles (much faster)
- `--max-workers`: Maximum number of parallel workers (default: 8, recommended: 4-8)
- `--verbose`: Enable verbose logging for detailed API request information
- `--use-cache`: Enable caching of repository lists to reduce API calls during development
- `--cache-dir`: Directory to store cache files (default: `.cache`)
- `--clear-cache`: Clear all cache files before running analysis
- `--config`, `-c`: Path to configuration file (default: `config.yaml`)
- `--create-config`: Create a default configuration file and exit
## Requirements

- Python 3.7+
- virtualenv (installed automatically if missing)
- GitHub Personal Access Token with `repo` scope
- Internet connection for GitHub API access
## Performance

The tool has been optimized for significantly better performance:
- Cached Rate Limit Checking: Rate limit information is cached for 30 seconds to avoid excessive API calls
- Reduced API Calls: Eliminated double API calls that were causing performance issues
- Optimized Default Delay: Reduced default delay from 0.1s to 0.05s (20 calls/second vs 10 calls/second)
- Smart Rate Limit Checking: Only checks rate limits every 10 API calls instead of every call
- 50% Faster: Typical analysis is now 50% faster due to reduced API overhead
- Better Parallel Processing: Optimized rate limiting works better with parallel workers
- Reduced Verbose Overhead: Verbose logging no longer makes additional API calls
- Before: 2 API calls per actual API call (rate limit check + actual call)
- After: 1 API call per actual API call (with smart caching)
- Before: 0.1s delay between calls (10 calls/second)
- After: 0.05s delay between calls (20 calls/second)
- Development Speed: Cache repository lists for 1 hour to avoid re-discovery
- API Call Reduction: Subsequent runs use cached data instead of API calls
- Testing Efficiency: Perfect for iterative development and testing
- Cache Management: Built-in tools to inspect and manage cache files
## Large Organization Support

For organizations with 500+ repositories, use the optimized mode to minimize API calls:
- Pagination: Fetches 100 repositories per API call (maximum allowed)
- Bulk Metadata: Uses search API to get repository metadata in batches
- Reduced Workers: Uses fewer parallel workers to avoid overwhelming the API
- Batch Processing: Processes repositories in smaller batches with rate limit checks
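For the pagination piece, PyGithub lets you request GitHub's maximum page size up front. A minimal sketch (token and organization name are placeholders):

```python
# Sketch of maximum-page-size fetching with PyGithub. per_page=100 is
# GitHub's maximum, so a 760-repo org needs ~8 list calls instead of ~26.
from github import Github

gh = Github("your_github_token", per_page=100)  # 100 repos per page
org = gh.get_organization("your-org")

repos = [r for r in org.get_repos() if not r.archived]  # paginates lazily
print(f"Discovered {len(repos)} active repositories")
```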
```bash
# Use optimized mode for large organizations
python build_check.py --org your-org --optimized --use-cache

# Or use the dedicated large organization script
python optimize_large_orgs.py --org your-org

# With custom settings
python build_check.py --org your-org --optimized --rate-limit-delay 0.02 --max-workers 4
```
Add to your `config.yaml`:

```yaml
# Performance Settings for Large Organizations
parallelism:
  max_workers: 4          # Reduced for large orgs
  rate_limit_delay: 0.02  # Faster rate limiting
  optimized: true         # Enable optimized mode

# Caching Configuration
caching:
  enabled: true       # Always enable for large orgs
  directory: ".cache"
  duration: 3600      # 1 hour cache
```
For a 760-repository organization:
- Standard Mode: ~800 API calls (1 per repo + overhead)
- Optimized Mode: ~50 API calls (8 pages of 100 repos + bulk metadata)
- With Caching: ~10 API calls (subsequent runs)
The optimized mode includes:
- Pre-flight Checks: Warns if API calls are low
- Batch Processing: Processes repos in 50-repo batches
- Rate Limit Monitoring: Checks remaining calls after each batch
- Automatic Pausing: Pauses when approaching limits
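The batch-and-pause loop can be sketched as follows. The 50-repo batch size and 100-call threshold mirror the values described above, while the helper itself is illustrative rather than BuildCheck's actual code:

```python
# Sketch of batch processing with a rate-limit pause between batches.
import time

from github import Github

gh = Github("your_github_token")  # placeholder token

def pause_if_needed(threshold=100):
    remaining, _limit = gh.rate_limiting  # from the last API response
    if remaining < threshold:
        wait = max(gh.rate_limiting_resettime - time.time(), 0) + 5
        print(f"Rate limit low ({remaining} left); sleeping {wait:.0f}s")
        time.sleep(wait)

BATCH = 50
repos = list(gh.get_organization("your-org").get_repos())
for i in range(0, len(repos), BATCH):
    batch = repos[i:i + BATCH]
    # ... analyze each repository in the batch ...
    pause_if_needed()  # check remaining calls after each batch
```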
The tool includes comprehensive GitHub API rate limiting management:
- Automatic Rate Limit Checking: Monitors remaining API calls and resets
- Intelligent Delays: Adds delays between API calls to respect limits
- Graceful Handling: Waits for rate limit resets when exceeded
- Progress Tracking: Shows remaining API calls and reset times
- Configurable Delays: Adjust the delay between calls with `--rate-limit-delay`
- Warning: Shows warning when < 100 requests remaining
- Auto-delay: Adds extra delay when approaching limits
- Wait and retry: Automatically waits for reset when limit exceeded
- Statistics: Reports total API calls made and remaining
- DateTime Handling: Properly converts GitHub API datetime objects to timestamps
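The cached rate-limit check described above (a 30-second cache instead of a probe per call) can be sketched like this; class and method names are illustrative, not BuildCheck's internals:

```python
# Sketch of cached rate-limit checking: results are reused for 30
# seconds so a status probe doesn't accompany every real API call.
import time

class CachedRateLimit:
    def __init__(self, gh, ttl=30.0):
        self.gh = gh          # an authenticated github.Github instance
        self.ttl = ttl        # seconds to trust the cached value
        self._remaining = None
        self._checked_at = 0.0

    def remaining(self):
        if time.time() - self._checked_at > self.ttl:
            self._remaining = self.gh.get_rate_limit().core.remaining
            self._checked_at = time.time()
        return self._remaining
```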
- Root-Only Scanning: Only checks files in the repository root directory
- Fast Processing: Dramatically reduces API calls and processing time
- Build File Detection: Focuses on common build configuration file locations
The tool supports wildcard patterns for file detection, particularly useful for Packer files:
- `*.pkr.hcl`: Packer HCL configuration files
- `*.pkr.json`: Packer JSON configuration files
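Python's standard `fnmatch` module is a natural fit for these patterns; a minimal, self-contained example:

```python
# Wildcard matching for Packer files using fnmatch, one straightforward
# way to implement the patterns above.
from fnmatch import fnmatch

PACKER_PATTERNS = ["*.pkr.hcl", "*.pkr.json", "packer.json"]
root_files = ["README.md", "ami.pkr.hcl", "vars.pkr.json"]

matches = [f for f in root_files
           if any(fnmatch(f, p) for p in PACKER_PATTERNS)]
print(matches)  # ['ami.pkr.hcl', 'vars.pkr.json']
```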
Each tool has specific regex patterns for version detection:
- Grunt: Looks for `grunt` and `grunt-cli` versions in `package.json`
- Packer: Searches for `packer_version` and `required_version` in configuration files
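For illustration, version extraction of this kind boils down to a couple of regular expressions; the exact patterns in BuildCheck may differ:

```python
# Illustrative version-detection patterns for Grunt and Packer.
import re

grunt_re = re.compile(r'"grunt(?:-cli)?"\s*:\s*"[\^~]?([\d.]+)"')
packer_re = re.compile(
    r'(?:packer_version|required_version)\s*=\s*"[>=~\s]*([\d.]+)"'
)

package_json = '{"devDependencies": {"grunt": "^1.5.0", "grunt-cli": "~1.4.3"}}'
print(grunt_re.findall(package_json))  # ['1.5.0', '1.4.3']

pkr_hcl = 'packer { required_version = ">= 1.9.0" }'
print(packer_re.findall(pkr_hcl))      # ['1.9.0']
```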
The Jenkins analyzer specifically looks for:
- Grunt commands in pipeline stages
- Packer build and validation commands
- Artifact publishing patterns for both tools
- Identifies Repositories: Lists repositories without build configurations
- Helps with Auditing: Useful for finding projects that might need CI/CD setup
- Summary Statistics: Provides counts of repositories with/without builds
- Infrastructure Exclusion: Automatically excludes `infrastructure-environments` and `infrastructure-modules`
- Terraform Exclusion: Excludes all repositories starting with `terraform-`
- Focused Analysis: Concentrates on application repositories with build configurations
- Transparent Reporting: Shows which repositories were excluded and why
- GitHub Search API: Uses GitHub's search API to find repositories with Jenkinsfiles
- Ultra-Fast Analysis: Only analyzes repositories that actually have CI/CD pipelines
- Reduced API Calls: Dramatically fewer API calls compared to full analysis
- Perfect for CI/CD Audits: Ideal for teams focused on Jenkins pipeline analysis
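A sketch of that search-based discovery with PyGithub; note that GitHub's code-search API has its own, lower rate limit:

```python
# Sketch of Jenkinsfile discovery via the GitHub search API: one
# paginated search replaces per-repository file probes.
from github import Github

gh = Github("your_github_token")  # placeholder token
results = gh.search_code("org:your-org filename:Jenkinsfile")

jenkins_repos = {hit.repository.full_name for hit in results}
print(f"{len(jenkins_repos)} repositories have a Jenkinsfile")
```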
- Multi-threaded Analysis: Uses ThreadPoolExecutor for concurrent repository analysis
- Configurable Workers: Adjust the number of parallel workers (default: 8, recommended: 4-8)
- Rate Limit Aware: Respects GitHub API rate limits even with parallel processing
- Progress Tracking: Real-time progress with file analysis counts
- Error Handling: Graceful handling of individual repository failures
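The parallel loop follows the standard `ThreadPoolExecutor` pattern, collecting results as they complete and isolating per-repository failures; `analyze_repository` below is a stand-in for BuildCheck's real analysis function:

```python
# Sketch of the multi-threaded analysis loop.
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze_repository(repo):
    ...  # fetch build files, detect tools/versions, return findings

def analyze_all(repos, max_workers=8):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze_repository, r): r for r in repos}
        for future in as_completed(futures):
            repo = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # one failure shouldn't stop the run
                print(f"{repo}: analysis failed: {exc}")
    return results
```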
- Repository Discovery: Shows progress when fetching and filtering repositories
- Jenkins Search: Displays progress when searching for repositories with Jenkinsfiles
- Analysis Progress: Real-time progress bars for repository analysis
- Detailed Status: Shows counts of found, skipped, and processed repositories
- Visual Feedback: Rich progress bars with spinners and percentage completion
- Repository Counting: Shows "Counting repositories..." before processing
- Processing Status: Displays current repository being processed
- Filtering Information: Shows how many repositories are skipped (archived/empty)
- Found Repositories: Tracks how many valid repositories are discovered
- Search Results: For Jenkins-only mode, shows search result processing
- Detailed API Tracking: Logs every GitHub API call with timestamps and descriptions
- Rate Limit Monitoring: Shows real-time rate limit status and reset times
- Repository Analysis: Tracks which files are being checked in each repository
- Error Debugging: Provides detailed error information for troubleshooting
- Performance Insights: Shows API call counts and processing statistics
```bash
# Enable verbose logging to see detailed API request information
python build_check.py --org your-organization-name --verbose

# Combine with other options
python build_check.py --org your-organization-name --verbose --jenkins-only --output report.json
```
When `--verbose` is enabled, you'll see detailed information like:
```
2024-01-15 14:30:00 - INFO - Verbose logging enabled - detailed API request information will be shown
2024-01-15 14:30:01 - DEBUG - API Call #1: Get organization repositories
2024-01-15 14:30:01 - DEBUG -   - Rate Limit: 4850/5000 requests remaining
2024-01-15 14:30:01 - DEBUG -   - Rate Limit Reset: 2024-01-15 15:30:00
2024-01-15 14:30:01 - DEBUG -   - Delay Applied: 0.1s
2024-01-15 14:30:02 - INFO - Starting analysis of repository: my-service
2024-01-15 14:30:02 - DEBUG - Successfully retrieved pom.xml from my-service (2048 characters)
2024-01-15 14:30:02 - INFO - Found maven version 3.8.6 in my-service (pom.xml)
```
## Testing

The project includes a comprehensive test suite organized according to Python best practices.
Tests are located in the `tests/` directory and follow pytest conventions:
```
tests/
├── __init__.py           # Makes tests a Python package
├── conftest.py           # Shared fixtures and configuration
├── test_build_check.py   # Tests for main build_check module
├── test_caching.py       # Tests for caching functionality
├── test_performance.py   # Tests for performance optimizations
├── test_rate_limit.py    # Tests for rate limiting functionality
└── README.md             # Detailed test documentation
```
```bash
# Run all tests
./run_tests.sh

# Run tests with coverage
./run_tests.sh --coverage

# Run tests with verbose output
./run_tests.sh --verbose

# Run only unit tests
./run_tests.sh --unit

# Run only integration tests
./run_tests.sh --integration

# Run only slow tests
./run_tests.sh --slow
```

```bash
# Run all tests
python -m pytest tests/

# Run tests with coverage
python -m pytest tests/ --cov=build_check --cov=cache_manager --cov=jenkins_analyzer --cov-report=html

# Run specific test file
python -m pytest tests/test_caching.py

# Run specific test class
python -m pytest tests/test_caching.py::TestCaching

# Run specific test method
python -m pytest tests/test_caching.py::TestCaching::test_caching_creation
```
- Unit Tests: Test individual functions and methods in isolation (marked with `@pytest.mark.unit`)
- Integration Tests: Test multiple components working together (marked with `@pytest.mark.integration`)
- Slow Tests: Tests that take significant time to run (marked with `@pytest.mark.slow`)
Test dependencies are included in `requirements.txt`:

- `pytest`: Test framework
- `pytest-cov`: Coverage reporting
- `pytest-mock`: Mocking utilities
Tests require the following environment variables:

- `GITHUB_TOKEN`: GitHub API token for authentication

You can set these in a `.env` file in the project root:

```bash
GITHUB_TOKEN=your_github_token_here
```
For detailed testing information, see `tests/README.md`.
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License

MIT License - see LICENSE file for details
## Optimization Summary

The following optimizations allow BuildCheck to analyze organizations with hundreds of repositories (for example, 760 repos) without exhausting the GitHub API limit:

1. Single-pass repository fetching
   - Before: two API calls to get repositories (one to count, one to process)
   - After: a single paginated call
2. Maximum page size
   - Before: default of 30 repos per page
   - After: 100 repos per page (GitHub's maximum)
   - Impact: reduces API calls from ~26 to ~8 for 760 repos
3. Bulk metadata fetching
   - New `_get_repository_metadata_bulk()` method uses the search API
   - Fetches metadata for 100 repos in a single API call
   - Falls back to individual calls if the search fails
4. Optimized mode
   - New `--optimized` flag for large organizations
   - Uses the `get_repositories_optimized()` method
   - Reduced workers (4 instead of 8) to avoid overwhelming the API
5. Dedicated script
   - `optimize_large_orgs.py` for organizations with 500+ repos
   - Batch processing with rate limit monitoring
   - Automatic pausing when approaching limits
6. Caching
   - Caches repository lists for 1 hour
   - Subsequent runs use ~10 API calls instead of ~800

Expected API usage for a 760-repository organization:

- Before: ~800 API calls (exhausting the hourly limit)
- After: ~50 API calls (8 pages + bulk metadata)
- With caching: ~10 API calls on subsequent runs

See the Large Organization Support section above for the corresponding commands and `config.yaml` settings. With these optimizations in place, subsequent runs are much faster due to caching.