22 changes: 22 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,22 @@
name: ci
on:
  pull_request:
  push:
    branches: [ main, dev ]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.9.5 }
      - name: Terraform fmt
        run: terraform fmt -check -recursive
      - name: Terraform validate
        working-directory: terraform/envs/dev
        run: |
          terraform init -backend=false
          terraform validate
      - name: Python lint (ruff)
        uses: chartboost/ruff-action@v1
        with: { src: "lambda" }
24 changes: 24 additions & 0 deletions .gitignore
@@ -0,0 +1,24 @@
# Python
__pycache__/
*.pyc
.venv/
.env

# Build artifacts
build/
dist/
*.zip

# Terraform
.terraform/
.terraform.lock.hcl
terraform.tfstate
terraform.tfstate.*
crash.log

# OS/editor
.DS_Store
Thumbs.db
*.swp
.vscode/
.idea/
15 changes: 15 additions & 0 deletions .pre-commit-config.yml
@@ -0,0 +1,15 @@
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks: [{ id: black, args: ["--line-length=100"], additional_dependencies: ["click<8.1.8"] }]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.8
    hooks: [{ id: ruff, args: ["--fix"] }]
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.4
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks: [{ id: detect-secrets, args: ["--baseline", ".secrets.baseline"] }]
284 changes: 219 additions & 65 deletions README.md
@@ -1,91 +1,245 @@
# NanLabs Cloud Data Engineer Challenge — README

**Author:** Renzo Burga (cookiezGit)
**Region:** `us-east-1`
**Name prefix:** `renzob-nanlabs-dev-*`

This repository delivers a free‑tier–friendly AWS data pipeline using **Terraform + Python**:

- **S3** (`incoming/`) → **Lambda (ingest)** → **RDS PostgreSQL (PostGIS)**
- **API Lambda (FastAPI + Mangum)** → **API Gateway** (`GET /aggregated-data`)
- **CloudWatch Logs** for both Lambdas
- **Networking**: VPC with public/private subnets; Lambdas in **private subnets**
- **No NAT cost**: Uses **VPC Endpoints** (S3 Gateway, Secrets/Logs Interface) instead of a NAT gateway
- **Local dev**: Docker Compose for **PostGIS**, **MinIO**, and **API (uvicorn)**

> **NAT note (challenge requirement vs. implementation):** The challenge mentions a NAT gateway. To stay within free tier and still keep Lambdas private, this project uses **VPC Endpoints** for egress to AWS services. If a NAT is strictly required by reviewers, it can be enabled with a small module without changing any application code.

---

## Repository Layout

```
.
├── README.md                        # you are here
├── docs/
├── examples/
│   ├── airbnb_listings_sample.csv   # example input
│   └── s3_put_event.json            # sample S3 PUT event
├── docker-compose.yml               # PostGIS + MinIO + API (uvicorn)
├── docker/
│   ├── initdb/01_enable_postgis.sql # CREATE EXTENSION postgis;
│   └── lambda_layer/                # optional helpers (not required)
├── lambda/
│   ├── ingest/                      # S3-triggered CSV→aggregation→upsert
│   └── api/                         # FastAPI + Mangum (GET /aggregated-data)
└── terraform/
    ├── envs/dev/main.tf             # composes modules & S3→Lambda notify
    ├── modules/{vpc,endpoints,s3,rds,iam,lambdas,apigw,backup}
    ├── providers.tf
    ├── variables.tf
    └── outputs.tf
```

---

## Architecture (high level)

- **S3** receives CSVs under `incoming/`
- **Ingest Lambda** (private subnet) is triggered on `ObjectCreated:*` with prefix `incoming/` and suffix `.csv`
  - Parses the CSV (`city`, optional `price`), normalizes city names, and computes `listing_count` and `avg_price`
  - Upserts into `aggregated_city_stats` on **RDS PostgreSQL (PostGIS)**
  - Ensures the PostGIS extension, table, and indexes exist (idempotent)
- **API Lambda** serves **FastAPI** via **Mangum** → **API Gateway HTTP API** (`GET /aggregated-data?limit=&city=`)
- **VPC Endpoints**: S3 (Gateway), Secrets Manager (Interface), CloudWatch Logs (Interface)
- **CloudWatch** log groups with retention; optional basic alarms to SNS

**Indexes created:**
`idx_agg_count_city (listing_count DESC, city ASC)`, `idx_agg_city (city)`, and `idx_agg_geom (GIST)`
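
A minimal sketch of the idempotent bootstrap and upsert described above, using `psycopg2` (the column types and the `write_stats` helper are illustrative assumptions; the table and index names are the ones listed):

```python
# Sketch only: idempotent bootstrap + upsert performed by the ingest Lambda.
# Column types and the helper name are assumptions; table/index names match the list above.
DDL = """
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE IF NOT EXISTS aggregated_city_stats (
    city          TEXT PRIMARY KEY,
    listing_count INTEGER NOT NULL,
    avg_price     NUMERIC,
    geom          geometry(Point, 4326)
);
CREATE INDEX IF NOT EXISTS idx_agg_count_city ON aggregated_city_stats (listing_count DESC, city ASC);
CREATE INDEX IF NOT EXISTS idx_agg_city ON aggregated_city_stats (city);
CREATE INDEX IF NOT EXISTS idx_agg_geom ON aggregated_city_stats USING GIST (geom);
"""

UPSERT = """
INSERT INTO aggregated_city_stats (city, listing_count, avg_price)
VALUES (%s, %s, %s)
ON CONFLICT (city) DO UPDATE
SET listing_count = EXCLUDED.listing_count,
    avg_price     = EXCLUDED.avg_price;
"""

def write_stats(conn, stats: dict[str, tuple[int, float | None]]) -> None:
    """stats maps city -> (listing_count, avg_price); conn is an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.execute(DDL)  # safe to repeat: every statement is guarded by IF NOT EXISTS
        for city, (count, avg_price) in stats.items():
            cur.execute(UPSERT, (city, count, avg_price))
        cur.execute("ANALYZE aggregated_city_stats;")  # refresh planner stats after each ingest
    conn.commit()
```

Re-running the DDL on every invocation keeps the Lambda self-contained and is harmless because every statement is conditional.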

---

## Prerequisites

- **AWS CLI v2**, **Terraform ≥ 1.6**, **Docker Desktop**
- **Python 3.10** locally (only for vendoring)
- Windows users: `make` is optional (PowerShell alternatives below)

Authenticate:
```bash
aws configure # set region to us-east-1
aws sts get-caller-identity
```

---

## Build — Lambda zips with Linux-compatible wheels

We vendor dependencies **inside** the Lambda Python 3.10 image to match Amazon Linux.

### Windows (PowerShell)

```powershell
# from repo root
make clean
make build-linux-all
```

If you don’t use `make`, use these one-liners (from repo root):
```powershell
# API
docker run --rm -v "${PWD}:/var/task" -w /var/task public.ecr.aws/lambda/python:3.10 `
/bin/sh -lc "rm -rf build/api/package && mkdir -p build/api/package && pip install -r lambda/api/requirements.txt -t build/api/package && cp -r lambda/api/* build/api/package/ && cd build/api/package && zip -r ../../api.zip ."

# Ingest
docker run --rm -v "${PWD}:/var/task" -w /var/task public.ecr.aws/lambda/python:3.10 `
/bin/sh -lc "rm -rf build/ingest/package && mkdir -p build/ingest/package && pip install -r lambda/ingest/requirements.txt -t build/ingest/package && cp -r lambda/ingest/* build/ingest/package/ && cd build/ingest/package && zip -r ../../ingest.zip ."
```

---

## Deploy — Terraform (dev)

```bash
cd terraform/envs/dev
terraform init
terraform apply -var="prefix=renzob-nanlabs" -var="env=dev" -auto-approve
```

Grab outputs:
```bash
terraform output -raw s3_bucket
terraform output -raw api_base_url
terraform output -raw db_endpoint
```

> The module sets `source_code_hash` on Lambdas so code changes re-deploy cleanly on `terraform apply`.

---

## Test — End to End

### 1) Trigger via S3 PUT (recommended)
```bash
BUCKET=$(terraform output -raw s3_bucket)
aws s3 cp ../../examples/airbnb_listings_sample.csv "s3://${BUCKET}/incoming/airbnb_listings_sample.csv" --content-type text/csv
aws logs tail "/aws/lambda/renzob-nanlabs-dev-ingest" --follow
```
Expected logs: `s3_get_object_before`, `csv_parsed`, `db_connect_*`, `done`

### 2) Query the API
```bash
API=$(terraform output -raw api_base_url)
curl "$API/healthz"
curl "$API/aggregated-data?limit=100"
curl "$API/aggregated-data?city=Berlin&limit=50"
```

### 3) Manual Lambda Test (optional)
```bash
BUCKET=$(terraform output -raw s3_bucket)
cat > event.json <<EOF
{
  "Records": [{
    "eventSource": "aws:s3",
    "awsRegion": "us-east-1",
    "eventName": "ObjectCreated:Put",
    "s3": {
      "bucket": { "name": "${BUCKET}" },
      "object": { "key": "incoming/airbnb_listings_sample.csv" }
    }
  }]
}
EOF

aws lambda invoke --function-name renzob-nanlabs-dev-ingest --payload fileb://event.json out.json
cat out.json
```

---

## Local Development (Docker Compose)

```bash
docker compose up -d --build
# MinIO: http://localhost:9001 (user: minio / pass: minio123)
# Create bucket 'nanlabs' and upload examples/airbnb_listings_sample.csv to incoming/
# API: http://localhost:8000/aggregated-data
```

**Windows PowerShell:**
```powershell
docker compose up -d --build
# Then browse http://localhost:9001 (minio/minio123)
# Local API: Invoke-WebRequest http://localhost:8000/aggregated-data | Select-Object -ExpandProperty Content
```
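
If you'd rather script the upload than click through the MinIO console, a small `boto3` sketch along these lines should work against the local stack (it assumes the compose file publishes the S3 API on `localhost:9000`; only the console port `9001` is documented above):

```python
# Sketch only: upload the sample CSV to local MinIO instead of using the web console.
# Assumes the S3 API is mapped to localhost:9000; bucket/credentials are the ones listed above.
import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
    config=Config(s3={"addressing_style": "path"}),  # path-style plays nicer with local MinIO
)
s3.upload_file(
    "examples/airbnb_listings_sample.csv",
    "nanlabs",
    "incoming/airbnb_listings_sample.csv",
    ExtraArgs={"ContentType": "text/csv"},
)
resp = s3.list_objects_v2(Bucket="nanlabs", Prefix="incoming/")
print(f"{resp['KeyCount']} object(s) under incoming/")
```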

---

## Configuration Notes

- **Runtime:** Python `3.10` for both Lambdas
- **Handlers:** `main.handler` (zips root contain `main.py`)
- **Secrets:** Pulled at runtime from **AWS Secrets Manager** (host, port, db, user, password)
- **Indexes:** Created on first connect (idempotent); `ANALYZE` after each ingest
- **S3 Notification:** `ObjectCreated:*` with `prefix=incoming/`, `suffix=.csv`
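
As a rough sketch of how these settings fit together, `lambda/api/main.py` looks approximately like the following (the `DB_SECRET_ARN` environment variable name and the exact SQL are illustrative assumptions, not the literal code):

```python
# Sketch only: approximate shape of lambda/api/main.py (FastAPI behind Mangum).
# DB_SECRET_ARN and the query details are assumptions for illustration.
import json
import os

import boto3
import psycopg2
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()

def _db_conn():
    # Credentials are fetched at runtime from Secrets Manager (host, port, db, user, password).
    secret = json.loads(
        boto3.client("secretsmanager").get_secret_value(
            SecretId=os.environ["DB_SECRET_ARN"]
        )["SecretString"]
    )
    return psycopg2.connect(
        host=secret["host"], port=secret["port"], dbname=secret["db"],
        user=secret["user"], password=secret["password"],
    )

@app.get("/healthz")
def healthz():
    return {"status": "ok"}

@app.get("/aggregated-data")
def aggregated_data(limit: int = 100, city: str | None = None):
    sql = "SELECT city, listing_count, avg_price FROM aggregated_city_stats"
    params = []
    if city:
        sql += " WHERE city = %s"
        params.append(city)
    sql += " ORDER BY listing_count DESC, city ASC LIMIT %s"
    params.append(limit)
    with _db_conn() as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        rows = cur.fetchall()
    return [{"city": c, "listing_count": n, "avg_price": p} for c, n, p in rows]

handler = Mangum(app)  # matches the `main.handler` setting above
```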

---

## Troubleshooting

- **No logs on upload**
  - Verify the notification configuration:
    `aws s3api get-bucket-notification-configuration --bucket $BUCKET`
    It should include `ObjectCreated:*`, `prefix=incoming/`, `suffix=.csv`, and your Lambda ARN.
  - Upload with a **new key** (e.g., a timestamp suffix).

- **`psycopg2` missing on Lambda**
  - Rebuild the zips inside the Lambda Docker image: `make build-linux-all`
  - Then `terraform apply` to update the code (uses `source_code_hash`).

- **API base URL empty**
  - Run `terraform output -raw api_base_url` **in `terraform/envs/dev`**.
  - Alternatively, discover it via the AWS CLI:
    `aws apigatewayv2 get-apis --query "Items[?Name=='renzob-nanlabs-dev-api'].ApiEndpoint" --output text`

- **500 from API**
  - Tail `/aws/lambda/renzob-nanlabs-dev-api`; ensure RDS has finished creating and the secret is accessible.

- **S3 upload doesn’t trigger**
  - Confirm the region is `us-east-1`, the bucket exists, and you used the `incoming/` prefix and `.csv` suffix.

---

## Cleanup (avoid charges)

```bash
cd terraform/envs/dev
terraform destroy -var="prefix=renzob-nanlabs" -var="env=dev" -auto-approve
docker compose down -v
```

---

## Extensibility & “Nice to Have” Highlights

- **Data Quality:** City normalization, tolerant price parsing, and sanity ranges; invalid rows are skipped (no DLQ by choice). See the sketch after this list.
- **Indexing:** Composite ordering index, city lookup index, and GIST on `geom` to enable spatial queries.
- **Monitoring:** Optional CloudWatch alarms (`Errors > 0`) → SNS topic.
- **Backups:** Optional **AWS Backup** daily plan for the RDS instance.
- **CI / Pre-commit:** Optional GitHub Actions and pre-commit config for fmt/validate/lint/secrets checks.
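
A condensed sketch of those data-quality rules, as referenced in the first bullet (the numeric sanity bound is an assumed value, not necessarily the one used in the code):

```python
# Sketch only: the normalization / tolerant-parsing rules summarized above.
def normalize_city(raw: str | None) -> str | None:
    """Aggregate case-insensitively; store Title Case (see Assumptions below)."""
    city = (raw or "").strip()
    return city.title() if city else None

def parse_price(raw: str | None) -> float | None:
    """Tolerant price parsing: strip currency symbols/commas, reject out-of-range values."""
    if not raw:
        return None
    cleaned = raw.strip().lstrip("$€£").replace(",", "")
    try:
        price = float(cleaned)
    except ValueError:
        return None  # invalid rows are skipped rather than sent to a DLQ
    return price if 0 < price < 100_000 else None  # assumed sanity range
```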

---

## Assumptions

- CSV minimal schema contains `city` and optional `price` columns; others ignored.
- Cities aggregated case-insensitively (Title Case stored).
- `geom` is nullable; future job can geocode and fill points.
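
The geocoding backfill mentioned in the last bullet could look roughly like this (entirely hypothetical; `geocode` stands in for whatever geocoding source a future job would use):

```python
# Sketch only: a hypothetical backfill job for the nullable geom column.
def backfill_geom(conn, geocode) -> None:
    """geocode(city) -> (lon, lat) or None; conn is an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.execute("SELECT city FROM aggregated_city_stats WHERE geom IS NULL;")
        for (city,) in cur.fetchall():
            point = geocode(city)
            if point is None:
                continue
            lon, lat = point
            cur.execute(
                "UPDATE aggregated_city_stats "
                "SET geom = ST_SetSRID(ST_MakePoint(%s, %s), 4326) "
                "WHERE city = %s;",
                (lon, lat, city),
            )
    conn.commit()
```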

---

## License

MIT (or per challenge repository’s default).
