A comprehensive scraper API service with FastAPI, SeleniumBase, and Cloudscraper for bypassing Cloudflare's anti-bot protection.
## Features

- FastAPI-based REST API with async support
- Multiple scraper backends:
  - CloudScraper for Cloudflare bypass
  - SeleniumBase for JavaScript-heavy sites
- Job queue system with Redis and in-memory options
- Background job processing with status tracking
- Database integration with SQLAlchemy
- Health checks and monitoring
## Installation

This project uses `uv` for dependency management. To get started:

```bash
# Install dependencies
uv sync

# Install development dependencies
uv sync --extra dev
```

## Running the Server

```bash
# Development server
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production server
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000
```
## API Endpoints

### Health Check

```http
GET /health
```

### Submit a Scrape Job

```http
POST /api/v1/scrape
Content-Type: application/json

{
  "url": "https://example.com",
  "scraper_type": "cloudscraper",
  "method": "GET",
  "headers": {},
  "timeout": 30
}
```

### Job Status and Queue

```http
GET /api/v1/jobs/{task_id}
GET /api/v1/jobs?status=completed&limit=10
GET /api/v1/queue/status
```
## Demo

Run the demo script to see the API in action:

```bash
# Start the server first
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000

# In another terminal, run the demo
uv run python demo.py
```
## Configuration

Environment variables can be set in a `.env` file:

```env
DATABASE_URL=sqlite:///./cfscraper.db
REDIS_URL=redis://localhost:6379
MAX_CONCURRENT_JOBS=10
JOB_TIMEOUT=300
USE_IN_MEMORY_QUEUE=true
```
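As a rough illustration of how these variables map to typed settings, here is a minimal stand-alone sketch. The `Settings` class and its field names are hypothetical, not the project's real configuration code (which lives in `app/core/`).

```python
"""Hypothetical settings loader mirroring the .env variables above."""
import os
from dataclasses import dataclass


@dataclass
class Settings:
    # Defaults mirror the example .env values
    database_url: str = "sqlite:///./cfscraper.db"
    redis_url: str = "redis://localhost:6379"
    max_concurrent_jobs: int = 10
    job_timeout: int = 300
    use_in_memory_queue: bool = True

    @classmethod
    def from_env(cls) -> "Settings":
        def flag(name: str, default: bool) -> bool:
            raw = os.getenv(name)
            if raw is None:
                return default
            return raw.strip().lower() in ("1", "true", "yes")

        return cls(
            database_url=os.getenv("DATABASE_URL", cls.database_url),
            redis_url=os.getenv("REDIS_URL", cls.redis_url),
            max_concurrent_jobs=int(os.getenv("MAX_CONCURRENT_JOBS", cls.max_concurrent_jobs)),
            job_timeout=int(os.getenv("JOB_TIMEOUT", cls.job_timeout)),
            use_in_memory_queue=flag("USE_IN_MEMORY_QUEUE", cls.use_in_memory_queue),
        )
```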
## Architecture

- **FastAPI Application** (`app/main.py`)
  - REST API with async support
  - Health checks and monitoring
  - Automatic database initialization
- **Scraper Classes** (`app/scrapers/`)
  - Base scraper interface
  - CloudScraper implementation
  - SeleniumBase implementation
  - Factory pattern for scraper creation
- **Job Queue System** (`app/utils/queue.py`)
  - Abstract queue interface
  - In-memory queue for development
  - Redis queue for production
  - Job status tracking
- **Database Models** (`app/models/`)
  - SQLAlchemy models for jobs and results
  - Job status tracking
  - Result storage
- **Background Processing** (`app/utils/executor.py`)
  - Async job execution
  - Concurrent job handling
  - Error handling and retries
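To make the queue abstraction concrete, here is a rough sketch of what an in-memory queue with status tracking can look like. It is illustrative only; the real interface lives in `app/utils/queue.py`, and its class and method names may differ.

```python
"""Illustrative in-memory job queue with status tracking (names are assumptions)."""
import asyncio
import uuid
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class InMemoryJobQueue:
    """Development-only queue; the project also ships a Redis-backed variant."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._status: dict = {}

    async def enqueue(self, payload: dict) -> str:
        """Add a job and return its task_id."""
        task_id = uuid.uuid4().hex
        self._status[task_id] = JobStatus.QUEUED
        await self._queue.put((task_id, payload))
        return task_id

    async def dequeue(self):
        """Pop the next job and mark it as running."""
        task_id, payload = await self._queue.get()
        self._status[task_id] = JobStatus.RUNNING
        return task_id, payload

    def set_status(self, task_id: str, status: JobStatus) -> None:
        self._status[task_id] = status

    def get_status(self, task_id: str):
        return self._status.get(task_id)
```

A Redis-backed implementation would expose the same interface but persist jobs and statuses across processes, which is why the abstract interface matters.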
## Testing

Run the test suite:

```bash
uv run pytest tests/ -v
```
## Project Structure

```
cfscraper/
├── alembic/              # Database migration scripts
├── app/
│   ├── api/              # API routes
│   ├── core/             # Core configuration
│   ├── models/           # Database models
│   ├── scrapers/         # Scraper implementations
│   ├── utils/            # Utilities (queue, executor)
│   └── main.py           # FastAPI application
├── docs/                 # Documentation
├── examples/             # Demo scripts
├── tests/                # Test files
├── pyproject.toml        # Project configuration
└── README.md             # This file
```
## Adding a New Scraper

1. Create a new scraper class inheriting from `BaseScraper`
2. Implement the required methods
3. Register it in the `ScraperFactory`
4. Add it to the `ScraperType` enum
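Sketched end-to-end, those steps might look like the following. All names here (`BaseScraper`, `ScraperFactory`, `ScraperType`, and the toy `UrllibScraper`) mirror the steps above but are simplified stand-ins, not the real classes in `app/scrapers/`.

```python
"""Simplified stand-ins illustrating the scraper registration steps."""
from abc import ABC, abstractmethod
from enum import Enum


class BaseScraper(ABC):
    """Stand-in for the project's base scraper interface."""

    @abstractmethod
    def scrape(self, url: str, timeout: int = 30) -> str: ...


class ScraperType(str, Enum):
    CLOUDSCRAPER = "cloudscraper"
    SELENIUM = "selenium"
    URLLIB = "urllib"  # step 4: add the new type to the enum


class ScraperFactory:
    _registry: dict = {}

    @classmethod
    def register(cls, scraper_type: ScraperType, scraper_cls) -> None:
        cls._registry[scraper_type] = scraper_cls

    @classmethod
    def create(cls, scraper_type: ScraperType) -> BaseScraper:
        return cls._registry[scraper_type]()


class UrllibScraper(BaseScraper):  # steps 1-2: inherit and implement
    """Toy scraper using the stdlib; offers no Cloudflare bypass."""

    def scrape(self, url: str, timeout: int = 30) -> str:
        from urllib.request import urlopen
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")


ScraperFactory.register(ScraperType.URLLIB, UrllibScraper)  # step 3
```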
## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.