22 changes: 22 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,22 @@
name: ci
on:
  pull_request:
  push:
    branches: [ main, dev ]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.9.5 }
      - name: Terraform fmt
        run: terraform fmt -check -recursive
      - name: Terraform validate
        working-directory: terraform/envs/dev
        run: |
          terraform init -backend=false
          terraform validate
      - name: Python lint (ruff)
        uses: chartboost/ruff-action@v1
        with: { src: "lambda" }
24 changes: 24 additions & 0 deletions .gitignore
@@ -0,0 +1,24 @@
# Python
__pycache__/
*.pyc
.venv/
.env

# Build artifacts
build/
dist/
*.zip

# Terraform
.terraform/
.terraform.lock.hcl
terraform.tfstate
terraform.tfstate.*
crash.log

# OS/editor
.DS_Store
Thumbs.db
*.swp
.vscode/
.idea/
15 changes: 15 additions & 0 deletions .pre-commit-config.yml
@@ -0,0 +1,15 @@
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks: [{ id: black, args: ["--line-length=100"], additional_dependencies: ["click<8.1.8"] }]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.8
    hooks: [{ id: ruff, args: ["--fix"] }]
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.4
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks: [{ id: detect-secrets, args: ["--baseline", ".secrets.baseline"] }]
284 changes: 219 additions & 65 deletions README.md
@@ -1,91 +1,245 @@
# NanLabs Cloud Data Engineer Challenge — README

**Author:** Renzo Burga (cookiezGit)
**Region:** `us-east-1`
**Name prefix:** `renzob-nanlabs-dev-*`

This repository delivers a free‑tier–friendly AWS data pipeline using **Terraform + Python**:

- **S3** (`incoming/`) → **Lambda (ingest)** → **RDS PostgreSQL (PostGIS)**
- **API Lambda (FastAPI + Mangum)** → **API Gateway** (`GET /aggregated-data`)
- **CloudWatch Logs** for both Lambdas
- **Networking**: VPC with public/private subnets; Lambdas in **private subnets**
- **No NAT cost**: Uses **VPC Endpoints** (S3 Gateway, Secrets/Logs Interface) instead of a NAT gateway
- **Local dev**: Docker Compose for **PostGIS**, **MinIO**, and **API (uvicorn)**

> **NAT note (challenge requirement vs. implementation):** The challenge mentions a NAT gateway. To stay within free tier and still keep Lambdas private, this project uses **VPC Endpoints** for egress to AWS services. If a NAT is strictly required by reviewers, it can be enabled with a small module without changing any application code.

---

## Repository Layout

```
.
├── README.md                        # you are here
├── docs/
├── examples/
│   ├── airbnb_listings_sample.csv   # example input
│   └── s3_put_event.json            # sample S3 PUT event
├── docker-compose.yml               # PostGIS + MinIO + API (uvicorn)
├── docker/
│   ├── initdb/01_enable_postgis.sql # CREATE EXTENSION postgis;
│   └── lambda_layer/                # optional helpers (not required)
├── lambda/
│   ├── ingest/                      # S3-triggered CSV→aggregation→upsert
│   └── api/                         # FastAPI + Mangum (GET /aggregated-data)
└── terraform/
    ├── envs/dev/main.tf             # composes modules & S3→Lambda notify
    ├── modules/{vpc,endpoints,s3,rds,iam,lambdas,apigw,backup}
    ├── providers.tf
    ├── variables.tf
    └── outputs.tf
```

---

## Architecture (high level)

- **S3** receives CSVs under `incoming/`
- **Ingest Lambda** (private subnet) is triggered on `ObjectCreated:*` with prefix `incoming/` and suffix `.csv`
  - Parses the CSV (`city`, optional `price`), normalizes city names, and computes `listing_count` and `avg_price`
  - Upserts into `aggregated_city_stats` on **RDS PostgreSQL (PostGIS)**
  - Ensures the PostGIS extension, table, and indexes exist (idempotent)
- **API Lambda** serves **FastAPI** via **Mangum** → **API Gateway HTTP API** (`GET /aggregated-data?limit=&city=`)
- **VPC Endpoints**: S3 (Gateway), Secrets Manager (Interface), CloudWatch Logs (Interface)
- **CloudWatch** log groups with retention; optional basic alarms to SNS

**Indexes created:**
`idx_agg_count_city (listing_count DESC, city ASC)`, `idx_agg_city (city)`, and `idx_agg_geom (GIST)`
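
A minimal sketch of the idempotent bootstrap and upsert described above, using `psycopg2` (the column types and the `write_stats` helper are illustrative assumptions; the table and index names are the ones listed):

```python
# Sketch only: idempotent bootstrap + upsert performed by the ingest Lambda.
# Column types and the helper name are assumptions; table/index names match the list above.
DDL = """
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE IF NOT EXISTS aggregated_city_stats (
    city          TEXT PRIMARY KEY,
    listing_count INTEGER NOT NULL,
    avg_price     NUMERIC,
    geom          geometry(Point, 4326)
);
CREATE INDEX IF NOT EXISTS idx_agg_count_city ON aggregated_city_stats (listing_count DESC, city ASC);
CREATE INDEX IF NOT EXISTS idx_agg_city ON aggregated_city_stats (city);
CREATE INDEX IF NOT EXISTS idx_agg_geom ON aggregated_city_stats USING GIST (geom);
"""

UPSERT = """
INSERT INTO aggregated_city_stats (city, listing_count, avg_price)
VALUES (%s, %s, %s)
ON CONFLICT (city) DO UPDATE
SET listing_count = EXCLUDED.listing_count,
    avg_price     = EXCLUDED.avg_price;
"""

def write_stats(conn, stats: dict[str, tuple[int, float | None]]) -> None:
    """stats maps city -> (listing_count, avg_price); conn is an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.execute(DDL)  # safe to repeat: every statement is guarded by IF NOT EXISTS
        for city, (count, avg_price) in stats.items():
            cur.execute(UPSERT, (city, count, avg_price))
        cur.execute("ANALYZE aggregated_city_stats;")  # refresh planner stats after each ingest
    conn.commit()
```

Re-running the DDL on every invocation keeps the Lambda self-contained and is harmless because every statement is conditional.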

---

## Prerequisites

- **AWS CLI v2**, **Terraform ≥ 1.6**, **Docker Desktop**
- **Python 3.10** locally (only for vendoring)
- Windows users: `make` is optional (PowerShell alternatives below)

Authenticate:
```bash
aws configure # set region to us-east-1
aws sts get-caller-identity
```

---

## Build — Lambda zips with Linux-compatible wheels

We vendor dependencies **inside** the Lambda Python 3.10 image to match Amazon Linux.

### Windows (PowerShell)

```powershell
# from repo root
make clean
make build-linux-all
```

If you don’t use `make`, use these one-liners (from repo root):
```powershell
# API
docker run --rm -v "${PWD}:/var/task" -w /var/task public.ecr.aws/lambda/python:3.10 `
/bin/sh -lc "rm -rf build/api/package && mkdir -p build/api/package && pip install -r lambda/api/requirements.txt -t build/api/package && cp -r lambda/api/* build/api/package/ && cd build/api/package && zip -r ../../api.zip ."

# Ingest
docker run --rm -v "${PWD}:/var/task" -w /var/task public.ecr.aws/lambda/python:3.10 `
/bin/sh -lc "rm -rf build/ingest/package && mkdir -p build/ingest/package && pip install -r lambda/ingest/requirements.txt -t build/ingest/package && cp -r lambda/ingest/* build/ingest/package/ && cd build/ingest/package && zip -r ../../ingest.zip ."
```

---

## Deploy — Terraform (dev)

```bash
cd terraform/envs/dev
terraform init
terraform apply -var="prefix=renzob-nanlabs" -var="env=dev" -auto-approve
```

Grab outputs:
```bash
terraform output -raw s3_bucket
terraform output -raw api_base_url
terraform output -raw db_endpoint
```

> The module sets `source_code_hash` on Lambdas so code changes re-deploy cleanly on `terraform apply`.

---

## Test — End to End

### 1) Trigger via S3 PUT (recommended)
```bash
BUCKET=$(terraform output -raw s3_bucket)
aws s3 cp ../../examples/airbnb_listings_sample.csv "s3://${BUCKET}/incoming/airbnb_listings_sample.csv" --content-type text/csv
aws logs tail "/aws/lambda/renzob-nanlabs-dev-ingest" --follow
```
Expected logs: `s3_get_object_before`, `csv_parsed`, `db_connect_*`, `done`

### 2) Query the API
```bash
API=$(terraform output -raw api_base_url)
curl "$API/healthz"
curl "$API/aggregated-data?limit=100"
curl "$API/aggregated-data?city=Berlin&limit=50"
```

### 3) Manual Lambda Test (optional)
```bash
BUCKET=$(terraform output -raw s3_bucket)
cat > event.json <<EOF
{
  "Records": [{
    "eventSource": "aws:s3",
    "awsRegion": "us-east-1",
    "eventName": "ObjectCreated:Put",
    "s3": {
      "bucket": { "name": "${BUCKET}" },
      "object": { "key": "incoming/airbnb_listings_sample.csv" }
    }
  }]
}
EOF

aws lambda invoke --function-name renzob-nanlabs-dev-ingest --payload fileb://event.json out.json
cat out.json
```

---

## Local Development (Docker Compose)

```bash
docker compose up -d --build
# MinIO: http://localhost:9001 (user: minio / pass: minio123)
# Create bucket 'nanlabs' and upload examples/airbnb_listings_sample.csv to incoming/
# API: http://localhost:8000/aggregated-data
```

**Windows PowerShell:**
```powershell
docker compose up -d --build
# Then browse http://localhost:9001 (minio/minio123)
# Local API: Invoke-WebRequest http://localhost:8000/aggregated-data | Select-Object -ExpandProperty Content
```
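
If you'd rather script the upload than click through the MinIO console, a small `boto3` sketch along these lines should work against the local stack (it assumes the compose file publishes the S3 API on `localhost:9000`; only the console port `9001` is documented above):

```python
# Sketch only: upload the sample CSV to local MinIO instead of using the web console.
# Assumes the S3 API is mapped to localhost:9000; bucket/credentials are the ones listed above.
import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
    config=Config(s3={"addressing_style": "path"}),  # path-style plays nicer with local MinIO
)
s3.upload_file(
    "examples/airbnb_listings_sample.csv",
    "nanlabs",
    "incoming/airbnb_listings_sample.csv",
    ExtraArgs={"ContentType": "text/csv"},
)
resp = s3.list_objects_v2(Bucket="nanlabs", Prefix="incoming/")
print(f"{resp['KeyCount']} object(s) under incoming/")
```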

---

## Configuration Notes

- **Runtime:** Python `3.10` for both Lambdas
- **Handlers:** `main.handler` (zips root contain `main.py`)
- **Secrets:** Pulled at runtime from **AWS Secrets Manager** (host, port, db, user, password)
- **Indexes:** Created on first connect (idempotent); `ANALYZE` after each ingest
- **S3 Notification:** `ObjectCreated:*` with `prefix=incoming/`, `suffix=.csv`
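
As a rough sketch of how these settings fit together, `lambda/api/main.py` looks approximately like the following (the `DB_SECRET_ARN` environment variable name and the exact SQL are illustrative assumptions, not the literal code):

```python
# Sketch only: approximate shape of lambda/api/main.py (FastAPI behind Mangum).
# DB_SECRET_ARN and the query details are assumptions for illustration.
import json
import os

import boto3
import psycopg2
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()

def _db_conn():
    # Credentials are fetched at runtime from Secrets Manager (host, port, db, user, password).
    secret = json.loads(
        boto3.client("secretsmanager").get_secret_value(
            SecretId=os.environ["DB_SECRET_ARN"]
        )["SecretString"]
    )
    return psycopg2.connect(
        host=secret["host"], port=secret["port"], dbname=secret["db"],
        user=secret["user"], password=secret["password"],
    )

@app.get("/healthz")
def healthz():
    return {"status": "ok"}

@app.get("/aggregated-data")
def aggregated_data(limit: int = 100, city: str | None = None):
    sql = "SELECT city, listing_count, avg_price FROM aggregated_city_stats"
    params = []
    if city:
        sql += " WHERE city = %s"
        params.append(city)
    sql += " ORDER BY listing_count DESC, city ASC LIMIT %s"
    params.append(limit)
    with _db_conn() as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        rows = cur.fetchall()
    return [{"city": c, "listing_count": n, "avg_price": p} for c, n, p in rows]

handler = Mangum(app)  # matches the `main.handler` setting above
```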

---

## Troubleshooting

- **No logs on upload**
  - Verify the notification configuration:
    `aws s3api get-bucket-notification-configuration --bucket $BUCKET`
    It should include `ObjectCreated:*`, `prefix=incoming/`, `suffix=.csv`, and your Lambda ARN.
  - Upload with a **new key** (e.g., a timestamp suffix).

- **`psycopg2` missing on Lambda**
  - Rebuild the zips inside the Lambda Docker image: `make build-linux-all`
  - Then `terraform apply` to update the code (uses `source_code_hash`).

- **API base URL empty**
  - Run `terraform output -raw api_base_url` **in `terraform/envs/dev`**.
  - Alternatively, discover it via the AWS CLI:
    `aws apigatewayv2 get-apis --query "Items[?Name=='renzob-nanlabs-dev-api'].ApiEndpoint" --output text`

- **500 from API**
  - Tail `/aws/lambda/renzob-nanlabs-dev-api`; ensure RDS has finished creating and the secret is accessible.

- **S3 upload doesn’t trigger**
  - Confirm the region is `us-east-1`, the bucket exists, and you used the `incoming/` prefix and `.csv` suffix.

---

## Cleanup (avoid charges)

```bash
cd terraform/envs/dev
terraform destroy -var="prefix=renzob-nanlabs" -var="env=dev" -auto-approve
docker compose down -v
```

---

## Extensibility & “Nice to Have” Highlights

- **Data Quality:** City normalization, tolerant price parsing, and sanity ranges; invalid rows are skipped (no DLQ by choice). See the sketch after this list.
- **Indexing:** Composite ordering index, city lookup index, and GIST on `geom` to enable spatial queries.
- **Monitoring:** Optional CloudWatch alarms (`Errors > 0`) → SNS topic.
- **Backups:** Optional **AWS Backup** daily plan for the RDS instance.
- **CI / Pre-commit:** Optional GitHub Actions and pre-commit config for fmt/validate/lint/secrets checks.
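
A condensed sketch of those data-quality rules, as referenced in the first bullet (the numeric sanity bound is an assumed value, not necessarily the one used in the code):

```python
# Sketch only: the normalization / tolerant-parsing rules summarized above.
def normalize_city(raw: str | None) -> str | None:
    """Aggregate case-insensitively; store Title Case (see Assumptions below)."""
    city = (raw or "").strip()
    return city.title() if city else None

def parse_price(raw: str | None) -> float | None:
    """Tolerant price parsing: strip currency symbols/commas, reject out-of-range values."""
    if not raw:
        return None
    cleaned = raw.strip().lstrip("$€£").replace(",", "")
    try:
        price = float(cleaned)
    except ValueError:
        return None  # invalid rows are skipped rather than sent to a DLQ
    return price if 0 < price < 100_000 else None  # assumed sanity range
```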

---

## Assumptions

- CSV minimal schema contains `city` and optional `price` columns; others ignored.
- Cities aggregated case-insensitively (Title Case stored).
- `geom` is nullable; future job can geocode and fill points.
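
The geocoding backfill mentioned in the last bullet could look roughly like this (entirely hypothetical; `geocode` stands in for whatever geocoding source a future job would use):

```python
# Sketch only: a hypothetical backfill job for the nullable geom column.
def backfill_geom(conn, geocode) -> None:
    """geocode(city) -> (lon, lat) or None; conn is an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.execute("SELECT city FROM aggregated_city_stats WHERE geom IS NULL;")
        for (city,) in cur.fetchall():
            point = geocode(city)
            if point is None:
                continue
            lon, lat = point
            cur.execute(
                "UPDATE aggregated_city_stats "
                "SET geom = ST_SetSRID(ST_MakePoint(%s, %s), 4326) "
                "WHERE city = %s;",
                (lon, lat, city),
            )
    conn.commit()
```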

---

## License

MIT (or per challenge repository’s default).
