A serverless application that scrapes the Wikipedia page about Large Language Models (LLMs) using Puppeteer in an AWS Lambda function, captures a screenshot, and uploads it to an S3 bucket.
- 🌐 Scrapes content from Wikipedia's Large Language Model page
- 📸 Takes full-page screenshots
- ☁️ Uploads screenshots to S3 with automatic cleanup (30-day retention)
- ⏰ Scheduled to run daily using AWS EventBridge
- 🔄 Returns structured JSON with scraped content and screenshot URL
- 🐳 Local development with Docker
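The structured JSON response listed among the features above might be assembled along the following lines. This is only a minimal sketch: the helper name `buildResponse` and the field names (`title`, `content`, `screenshotUrl`, `scrapedAt`) are assumptions made for illustration and are not taken from the project's source.

```javascript
// Hypothetical helper (not from the project's source): wraps scraped data
// in a Lambda-style response object with a JSON-encoded body.
function buildResponse({ title, content, screenshotUrl }, now = new Date()) {
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      title,                        // page title (assumed field)
      content,                      // extracted text (assumed field)
      screenshotUrl,                // location of the uploaded screenshot (assumed field)
      scrapedAt: now.toISOString(), // timestamp of the run (assumed field)
    }),
  };
}
```

Passing the timestamp in as a parameter keeps the helper deterministic, which makes it easier to test locally.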
- Node.js v18 or later
- pnpm v8 or later
- AWS CLI v2.x configured with appropriate permissions
- AWS SAM CLI
- Docker (for local testing)
# Clone the repository
git clone https://github.com/arunkumar201/aws-lambda-puppeteer.git
cd aws-lambda-puppeteer
# Install dependencies using pnpm
cd browser-function
pnpm install
# Install AWS SAM CLI (if not already installed)
pip install --user aws-sam-cli
- Build the function
sam build
- Run locally
# Using pnpm script
pnpm local
# Or directly with SAM
sam local invoke WikipediaScraperFunction -e event.json --debug
Create a .env file in the root directory:
AWS_REGION=your-aws-region
S3_BUCKET_NAME=your-s3-bucket-name
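One way the function could read these two settings at startup is sketched below. The helper name `loadConfig` and the shape of the returned object are assumptions for illustration; only the variable names `AWS_REGION` and `S3_BUCKET_NAME` come from the file above.

```javascript
// Hypothetical helper (not from the project's source): reads the two settings
// from an environment object and fails fast if either one is missing.
function loadConfig(env = process.env) {
  const required = ["AWS_REGION", "S3_BUCKET_NAME"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return { region: env.AWS_REGION, bucketName: env.S3_BUCKET_NAME };
}
```

Failing early with the list of missing names tends to be easier to debug than an S3 error surfacing later in the run.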
- Build the application
sam build
- Deploy to AWS
sam deploy --guided
Follow the interactive prompts to configure your deployment.
- Verify deployment
Check the CloudFormation stack in the AWS Console for deployment status and outputs.
AWS SAM (Serverless Application Model) is a framework used for building and deploying serverless applications. To deploy this Lambda function using SAM, follow these steps:
Before deploying your application, you need to create an S3 bucket to store the deployment artifacts. You can create a new S3 bucket using the AWS CLI:
aws s3 mb s3://your-s3-bucket-name
Replace your-s3-bucket-name with a unique name for your S3 bucket. Make sure to define this bucket name in the template.yaml file under the Deploy section.
Before building, you can validate your SAM template to ensure everything is correct:
sam validate
Build the Lambda function using SAM:
sam build
This command packages your application and prepares it for deployment.
Deploy the function to AWS:
sam deploy --guided
This command will guide you through the deployment process, where you can specify:
- Stack Name
- AWS Region
- S3 Bucket for deployment artifacts (use the bucket name you created earlier)
- Whether to allow SAM to create IAM roles for your functions
During this process, SAM will upload your deployment artifacts to the specified S3 bucket and create the necessary AWS resources defined in your template.yaml.
Once deployed, the Lambda function can be triggered according to the schedule defined in the template.yaml. By default, it runs every 30 minutes. The function will perform the following actions:
- Navigate to https://github.com
- Take a screenshot of the page
- (Optional) Generate a PDF of the page
- (Optional) Extract text from the page
- (Optional) Take a screenshot of a specific element
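The sequence of actions above could be expressed roughly as follows. This is a sketch under stated assumptions, not the project's actual handler: the function name `capturePage` and the option names (`pdf`, `text`, `elementSelector`) are hypothetical, and `page` is assumed to be a Puppeteer `Page` object created elsewhere.

```javascript
// Hypothetical sketch (not the project's handler). `page` is assumed to be a
// Puppeteer Page; the option names are illustrative only.
async function capturePage(page, url, options = {}) {
  const { pdf = false, text = false, elementSelector = null } = options;
  const result = {};

  // Always: navigate to the page and take a full-page screenshot.
  await page.goto(url, { waitUntil: "networkidle2" });
  result.screenshot = await page.screenshot({ fullPage: true });

  // Optional: render the page as a PDF.
  if (pdf) {
    result.pdf = await page.pdf({ format: "A4" });
  }

  // Optional: extract the visible text (the callback runs inside the browser).
  if (text) {
    result.text = await page.evaluate(() => document.body.innerText);
  }

  // Optional: screenshot a single element, if the selector matches anything.
  if (elementSelector) {
    const element = await page.$(elementSelector);
    if (element) {
      result.elementScreenshot = await element.screenshot();
    }
  }

  return result;
}
```

Taking `page` as a parameter keeps browser startup separate from the capture steps, so the steps can be exercised locally with a stub object.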
To view the logs for your Lambda function, use the following command:
sam logs -n BrowserFunction --stack-name <your-stack-name> --tail
Replace <your-stack-name> with the name of your stack.
To delete the deployed stack and all associated resources:
sam delete --stack-name <your-stack-name>
This command will remove the CloudFormation stack, including the Lambda function, IAM roles, and other resources.
This project is licensed under the MIT License. See the LICENSE file for details.