@cookiezGIT commented Oct 11, 2025

Cloud Data Engineer Challenge — MR

Summary

This MR delivers an end-to-end, free-tier–friendly AWS data pipeline using Terraform + Python:

  • S3 (incoming/) → Lambda (ingest) → RDS PostgreSQL (PostGIS)
  • API Lambda (FastAPI + Mangum) → API Gateway (GET /aggregated-data)
  • CloudWatch Logs for both Lambdas
  • Networking: VPC with public/private subnets; Lambdas run in private subnets
  • Zero NAT cost: We intentionally avoid a NAT Gateway (see “NAT note”) and use:
    • Gateway VPC endpoint for S3
    • Interface VPC endpoints for Secrets Manager and CloudWatch Logs
  • Local dev: Docker Compose for PostGIS, MinIO, and API (uvicorn)

NAT note (requirement vs implementation): The spec suggests “Lambda must use a NAT Gateway”. I use VPC endpoints instead of NAT to stay within a free-tier budget (a NAT Gateway costs roughly $30–$35/mo). The endpoints keep traffic on private networking and cover all the egress the Lambdas need (S3, Secrets Manager, CloudWatch Logs). I can switch to NAT quickly if reviewers prefer the literal requirement.


Repository Layout

.
├── README.md
├── docs/
├── examples/
│   ├── airbnb_listings_sample.csv
│   └── s3_put_event.json
├── docker-compose.yml
├── docker/
│   ├── initdb/01_enable_postgis.sql
│   └── lambda_layer/… (optional helper)
├── lambda/
│   ├── ingest/        # S3-triggered CSV→city aggregation→upsert
│   └── api/           # FastAPI + Mangum (GET /aggregated-data)
└── terraform/
    ├── envs/dev/main.tf
    ├── modules/{vpc,endpoints,s3,rds,iam,lambdas,apigw,backup}
    ├── providers.tf
    ├── variables.tf
    └── outputs.tf

What’s Implemented

  • IaC with Terraform modules per concern (VPC, endpoints, S3, RDS, IAM, Lambdas, API GW).
  • S3 trigger: ObjectCreated:* with prefix=incoming/ and suffix=.csv → ingest Lambda.
  • Aggregation: Normalize city (Title Case), parse price (comma/locale tolerant), compute listing_count + avg_price.
  • PostgreSQL + PostGIS: RDS in private subnets; the PostGIS extension (CREATE EXTENSION postgis) and the aggregation table are created on first connect.
  • API Gateway: Public HTTP API exposing GET /aggregated-data?limit=&city=.
  • CloudWatch Logs: Log groups with retention; optional error alarms.
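To make the aggregation step concrete, here is a minimal sketch of the per-city rollup. It is not the code in lambda/ingest: the function name is hypothetical, prices are parsed with plain float() for brevity, and it assumes that a row with a missing or unparseable price still counts toward listing_count but is left out of avg_price.

```python
import csv
import io
from collections import defaultdict


def aggregate_by_city(csv_text):
    """Return {city: (listing_count, avg_price)} for CSV text with a `city` column.

    Hypothetical helper: avg_price is None when a city has no usable prices.
    """
    counts = defaultdict(int)
    price_sums = defaultdict(float)
    price_counts = defaultdict(int)

    for row in csv.DictReader(io.StringIO(csv_text)):
        # Collapse whitespace and normalize to Title Case.
        city = " ".join((row.get("city") or "").split()).title()
        if not city:
            continue  # rows without a city are skipped
        counts[city] += 1

        raw_price = (row.get("price") or "").strip()
        try:
            price = float(raw_price)
        except ValueError:
            continue  # assumption: listing is counted, price is ignored
        price_sums[city] += price
        price_counts[city] += 1

    result = {}
    for city, listing_count in counts.items():
        priced = price_counts[city]
        avg_price = price_sums[city] / priced if priced else None
        result[city] = (listing_count, avg_price)
    return result
```

For example, the rows " berlin ,100" and "BERLIN,200" both land under "Berlin" with a listing_count of 2 and an avg_price of 150.0.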

Additional Features

  • Data Quality: Price coercion (handles 1,234.56 and 6,81), city normalization, sanity bounds; invalids skipped.
  • Indexing & Query Optimization:
    • idx_agg_count_city (listing_count DESC, city ASC) for API ordering
    • idx_agg_city (city) for lookups
    • idx_agg_geom (GIST) to enable spatial queries (nullable geom)
    • ANALYZE aggregated_city_stats after ingest
  • Monitoring & Alerts: CloudWatch Alarms on Lambda Errors > 0, notifying an SNS topic (opt-in).
  • Automated Backups (optional): AWS Backup daily plan targeting the RDS instance.
  • Local Docker: PostGIS + MinIO + local API (uvicorn) for pre-deploy validation.
  • CI/Pre-commit (optional wiring): terraform fmt/validate, Python lint (ruff), detect-secrets.
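The price coercion rule under Data Quality can be sketched as follows. This is an illustration only, not the ingest Lambda's actual code: the function name, the bounds (0 to 100,000), and the heuristic for a lone comma (decimal separator if followed by one or two digits, thousands separator otherwise) are all assumptions.

```python
import re


def parse_price(raw, min_price=0.0, max_price=100_000.0):
    """Return the price as a float, or None if unparseable or out of bounds.

    Hypothetical helper; bounds and separator heuristics are assumptions.
    """
    # Drop currency symbols, spaces, and anything that is not part of a number.
    text = re.sub(r"[^\d.,-]", "", raw or "")
    if not text:
        return None

    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the one that appears last is the decimal mark.
        if last_comma > last_dot:
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
    elif last_comma != -1:
        head, _, tail = text.rpartition(",")
        if text.count(",") == 1 and 1 <= len(tail) <= 2:
            text = head + "." + tail  # "6,81" is read as 6.81
        else:
            text = text.replace(",", "")  # "1,234" is read as 1234

    try:
        value = float(text)
    except ValueError:
        return None
    return value if min_price <= value <= max_price else None
```

With these rules, "1,234.56" parses to 1234.56 and "6,81" parses to 6.81, while a negative or non-numeric value returns None so the caller can skip it.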

Deploy Instructions

Prerequisites

  • AWS account + IAM credentials for Terraform (us-east-1)
  • Terraform, AWS CLI v2, Docker Desktop
  • Python 3.10 (local)
  • (Windows) make optional; PowerShell fallbacks are in README.md

Configure AWS

aws configure            # region: us-east-1
aws sts get-caller-identity

Build Lambda zips (Linux-compatible wheels)

We vendor dependencies inside the Lambda Python 3.10 Docker image to match Amazon Linux.

Windows (PowerShell):

make clean
make build-linux-all

(If not using make, README includes Docker one-liners to build both zips.)

Apply Terraform

cd terraform/envs/dev
terraform init
terraform apply -var="prefix=renzob-nanlabs" -var="env=dev" -auto-approve

Grab outputs

terraform output -raw s3_bucket
terraform output -raw api_base_url

Test Instructions

1) End-to-end with S3 Put

BUCKET=$(terraform output -raw s3_bucket)
aws s3 cp ../../../examples/airbnb_listings_sample.csv "s3://${BUCKET}/incoming/airbnb_listings_sample.csv" --content-type text/csv
aws logs tail "/aws/lambda/renzob-nanlabs-dev-ingest" --follow

Expected logs: s3_get_object_before, csv_parsed, db_connect_*, done.

2) Query the API

API=$(terraform output -raw api_base_url)
curl "$API/aggregated-data?limit=100"
curl "$API/aggregated-data?city=Berlin&limit=50"
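For reference, the two requests above could map onto a parameterized query along these lines. This is a sketch under assumptions, not the handler in lambda/api: the function name, the default limit of 100, and the cap of 1000 are hypothetical. The table and column names and the ordering come from the indexing notes in this MR, and the %s placeholders follow psycopg2's style.

```python
def build_aggregated_query(city=None, limit=100, max_limit=1000):
    """Return (sql, params) for GET /aggregated-data.

    Hypothetical helper: default and maximum limits are assumptions.
    """
    # Clamp the limit so a caller cannot request an unbounded result set.
    limit = max(1, min(int(limit), max_limit))

    sql = "SELECT city, listing_count, avg_price FROM aggregated_city_stats"
    params = []
    if city:
        # Apply the same Title Case normalization used at ingest time.
        sql += " WHERE city = %s"
        params.append(" ".join(city.split()).title())

    # Matches the (listing_count DESC, city ASC) index.
    sql += " ORDER BY listing_count DESC, city ASC LIMIT %s"
    params.append(limit)
    return sql, tuple(params)
```

Passing values as parameters rather than formatting them into the SQL string keeps the city filter safe from injection.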

3) Manual Lambda Invoke (optional)

BUCKET=$(terraform output -raw s3_bucket)
cat > event.json <<EOF
{
  "Records": [
    {
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "${BUCKET}" },
        "object": { "key": "incoming/airbnb_listings_sample.csv" }
      }
    }
  ]
}
EOF

aws lambda invoke --function-name renzob-nanlabs-dev-ingest --payload fileb://event.json out.json
cat out.json
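A handler receiving the event above has to pull the bucket and key out of each record. The sketch below shows one way to do that. The function name is hypothetical and this is not necessarily how lambda/ingest does it. It re-checks the prefix and suffix defensively and URL-decodes the key, since S3 event notifications deliver object keys URL-encoded.

```python
from urllib.parse import unquote_plus


def extract_csv_objects(event, prefix="incoming/", suffix=".csv"):
    """Return a list of (bucket, key) pairs for newly created CSV objects.

    Hypothetical helper; the prefix and suffix mirror the S3 notification filter.
    """
    objects = []
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Keys arrive URL-encoded, e.g. spaces become "+".
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.startswith(prefix) and key.endswith(suffix):
            objects.append((bucket, key))
    return objects
```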

Security & Networking

  • Lambdas in private subnets, no public IPs.
  • No NAT to stay free-tier; VPC endpoints for S3/Secrets/Logs.
  • RDS is private; SG allows only Lambda SG on 5432.
  • DB credentials in Secrets Manager (retrieved at runtime).

If NAT is strictly required, I can enable a small nat submodule (IGW + NAT + routes) without changing app code.
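Retrieving the DB credentials at runtime could look like the sketch below. It assumes the secret is a JSON document with username, password, host, port, and dbname keys (the layout AWS uses for RDS secrets); the actual secret shape in this repo is not shown here, and the function name is hypothetical. The client is passed in so the function can be exercised without AWS access.

```python
import json


def load_db_config(secrets_client, secret_id):
    """Fetch a JSON secret and return keyword arguments for a PostgreSQL connection.

    Hypothetical helper; the secret's key names are an assumption.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {
        "host": secret["host"],
        "port": int(secret.get("port", 5432)),
        "dbname": secret.get("dbname", "postgres"),
        "user": secret["username"],
        "password": secret["password"],
    }
```

Inside the Lambda, the client would be boto3.client("secretsmanager"), and the returned dict can be passed to psycopg2.connect(**config). Because the Lambdas have no NAT, this call relies on the Secrets Manager interface endpoint.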


Deliverables Checklist

  • ✅ Terraform modules for the full architecture
  • ✅ README with Windows/Linux steps and design justifications (incl. NAT rationale)
  • ✅ Working API Gateway returning aggregated data from PostgreSQL
  • ✅ CloudWatch Logs for Lambda execution/errors
  • ✅ examples/ with sample CSV and s3_put_event.json
  • ✅ Optional docs/, CI, and pre-commit config

Assumptions

  • CSV minimal schema: city and optional price; others ignored.
  • City normalized to Title Case; price parsed with comma/locale tolerance and sanity bounds.
  • geom is nullable; GIST index prepped for future spatial queries.

Troubleshooting

  • No logs on upload → Verify S3 notification (ObjectCreated:*, prefix=incoming/, suffix=.csv), ensure region us-east-1, try a new key name.
  • psycopg2 missing → Rebuild zips via Docker (make build-linux-all), then terraform apply.
  • API 5xx → Tail /aws/lambda/renzob-nanlabs-dev-api; confirm RDS is ready and Secrets accessible.
  • Code updated but no change → Ensure source_code_hash = filebase64sha256(...) is set for both Lambdas.
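When debugging the last item, it can help to compute the hash of a local zip by hand. Terraform's filebase64sha256 is the base64 encoding of the file's raw SHA-256 digest, which is the same format Lambda reports as CodeSha256. The helper below is a sketch with a hypothetical name; the zip path you pass in depends on your build output.

```python
import base64
import hashlib


def file_base64_sha256(path):
    """Return the base64-encoded SHA-256 digest of a file, as filebase64sha256 does."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large zips are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")
```

If the value for the local zip differs from the CodeSha256 shown by aws lambda get-function-configuration, the deployed code is stale.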
