AWS GenAI Storage Workshop

Prerequisites 事前準備

AWS Account
GitHub Account

Table of Contents 目次

Overview 概要
Setup セットアップ
Test S3 Range
Prepare Dataset データセット準備
Prepare Vector Database
Agent Search

Overview

Business Use-Case

An engineering firm needs to audit and track public infrastructure for safety using aerial drone footage. They have captured tens of thousands of images (e.g. cracks in bridges) and stored them in archives organized by year and month.

As a Cloud Engineer, you have been tasked with building a proof-of-concept that uses GenAI to retrieve an image from the archive with a natural language query.

You need to report back possible technical paths and technical considerations for this project.

エンジニアリング会社は、ドローンの空撮映像を通じて公共インフラの安全性を監査・追跡する必要があります。彼らは数万枚の画像（例：橋のひび割れ）を撮影し、年月別のアーカイブに保存しています。クラウドエンジニアとして、あなたはGenAIを使用して自然言語でアーカイブから画像を検索できる概念実証を構築する任務を与えられました。このプロジェクトの技術的な道筋と技術的考慮事項を報告する必要があります。

Considerations and Requirements

All resources will be created in ap-northeast-1 Asia Pacific (Tokyo)
We'll be using GitHub Codespaces so we have a consistent developer environment
We are not using free-tier services but the cost should be under $1 USD for the duration of the workshop
We'll be using the following repo: https://github.com/ExamProCo/aws-storage-genai-workshop
We may need to rebuild the container for AWS CLI to be installed

Note: devcontainers do not always work reliably on Codespaces. They can require a lengthy rebuild and may still hang afterwards.

Technical Uncertainty

Can we extract specific bytes from an S3 file and read them?
Can we use Amazon Nova to generate mock images to vary our dataset?
Can we annotate the images with structured JSON output using Amazon Nova?
Can we extract a specific image file from a zip archive in S3 (without downloading the whole archive)?
Can we use Amazon Titan to create embeddings for our vector search database?
Can we deploy pgvector database via container on a t3.micro?
Can we get Amazon Nova to generate our query to our vector database and return the results?

S3ファイルから特定のバイトを抽出して読み取ることはできますか？
Amazon Novaを使用してデータセットを多様化するためのモック画像を生成することはできますか？
Amazon Novaを使用して構造化されたJSON出力で画像に注釈を付けることはできますか？
S3のzipアーカイブから特定の画像ファイルを抽出することはできますか（アーカイブをダウンロードする必要なく）？
Amazon Titanを使用してベクター検索データベース用の埋め込みを作成することはできますか？
t3.microでコンテナ経由でpgvectorデータベースをデプロイすることはできますか？
Amazon Novaにベクターデータベースへのクエリを生成させて結果を返すことはできますか？

Technical Diagram

Public Dataset

We are using the CUBIT Infrastructure Defect Detection Dataset

CUBIT インフラ欠陥検出データセットを使用しています

https://github.com/BenyunZhao/CUBIT

Setup

AWS Account Setup

Enable All Amazon Bedrock Models

Open the region selector
Change your region to 東京 ap-northeast-1 (Tokyo)

In the search bar type bedrock
Click on Amazon Bedrock to go to this service.

In the left-hand column click on モデルアクセス (Model access)

Click on すべてのモデルを有効にする (Enable all models)

Click on 次へ (Next)

Click on 送信 (Submit)

Confirm that the Nova Pro and Nova Canvas models are enabled

Setup AWS Infrastructure

We need the IDs of two subnets from the default VPC.
Run this command in CloudShell to get them as a comma-separated list:

aws ec2 describe-subnets \
  --region ap-northeast-1 \
  --filters "Name=vpc-id,Values=$(aws ec2 describe-vpcs --region ap-northeast-1 --filters "Name=is-default,Values=true" --query 'Vpcs[0].VpcId' --output text)" \
  --query 'Subnets[0:2].SubnetId' \
  --output text | tr '\t' ','

Open CloudShell
Paste the AWS CLI command from above
Copy the subnet IDs for the next step

Let's deploy the following AWS infrastructure:

AWS User with AWS Credentials
S3 Bucket
RDS Instance

Please click this button to deploy:

Enter the stack name (スタック名): GenAIStorageStack
Paste in the subnet IDs from the previous step
Set the database password to Testing123!
Enable extra permissions
Create the stack and wait about 5 minutes

Click on the Outputs tab
Note the output values; we will use them soon.

Prepare GitHub Codespaces Environment

Click on Code
Click on Codespaces
Click on Create codespace on main

Create a copy of .env.example and name it .env
Update AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_BUCKET_NAME (get the values from the CloudFormation stack outputs)
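
Before running the scripts, it can help to confirm that none of the three values was left blank. The following is a minimal Ruby sketch of such a check. The parse_env and missing_keys helpers are hypothetical and are not part of the workshop repo; only the three variable names come from the step above, and the sample values are made up.

```ruby
# Hypothetical helper: checks that a .env file defines the values the workshop needs.
REQUIRED_KEYS = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_BUCKET_NAME].freeze

# Parse simple KEY=VALUE lines, skipping blanks and comments.
def parse_env(text)
  text.each_line.with_object({}) do |line, vars|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    key, value = line.split("=", 2)
    next if value.nil?

    # Strip optional surrounding double quotes from the value.
    vars[key.strip] = value.strip.delete_prefix('"').delete_suffix('"')
  end
end

# Return the required keys that are absent or empty.
def missing_keys(vars)
  REQUIRED_KEYS.select { |key| vars[key].to_s.empty? }
end

# Made-up sample content; in practice this would be File.read(".env").
sample = <<~ENV
  # copied from .env.example
  AWS_ACCESS_KEY_ID=AKIAEXAMPLE
  AWS_SECRET_ACCESS_KEY=
  AWS_BUCKET_NAME="my-bucket"
ENV

vars = parse_env(sample)
puts "Missing: #{missing_keys(vars).join(', ')}"
```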

Install Ruby Libraries by running bundle install

cd /workspaces/aws-storage-genai-workshop 
bundle install

Installing nokogiri takes 1-2 minutes.

Install AWS CLI

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "/tmp/awscliv2.zip" && \
cd /tmp && unzip awscliv2.zip && sudo ./aws/install && \
rm -rf awscliv2.zip aws/ && cd -

🎉 Setup Complete セットアップ完了 🎉

Test S3 Range

Technical Uncertainty

We want to determine whether we can read part of a file without downloading the entire file. Amazon S3 supports the HTTP Range header, which lets us specify the byte range to download.

Upload File

We will upload a file called hello_world.txt to our bucket.

The contents of this file are こんにちは世界 ("Hello, world").

./bin/upload_file

Read Part Of File

We will specify the byte range so that we only read 世界 ("world").

./bin/read_range
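
Because the file is UTF-8 text, the range has to be given in bytes, not characters. The sketch below works out the Range header value for 世界 and uses byteslice as a local stand-in for the ranged download. It is an illustration only; how ./bin/read_range computes its range is not shown in this document.

```ruby
content = "こんにちは世界"
target  = "世界"

# Each of these characters is 3 bytes in UTF-8, so character offsets
# must be converted to byte offsets before building the header.
char_index = content.index(target)
first_byte = content[0, char_index].bytesize
last_byte  = first_byte + target.bytesize - 1

# HTTP byte ranges are zero-based and inclusive on both ends.
range_header = "bytes=#{first_byte}-#{last_byte}"

# Local stand-in for a ranged GET: take exactly those bytes.
partial = content.byteslice(first_byte..last_byte)

puts range_header
puts partial
```

This prints bytes=15-20 followed by 世界. With the aws-sdk-s3 gem, the same header value can be passed as the range: option of get_object; whether the workshop script does exactly this is an assumption.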

Prepare Dataset

Generate Mock Images

If our dataset is missing examples of certain images, we can generate our own to help test edge cases in our application later.

We are using Amazon Nova Canvas to generate images.

./bin/generate

This will output a file to 010__prepare_dataset/outputs/images/

Example of generated image using the following prompt: The image shows the eaves of a building with visible cracks, spalling, and missing components. The surface appears deteriorated, with signs of water damage and discoloration. The eaves are part of the building's exterior, and the defects are concentrated along the edge where the roof meets the wall.

Annotate Images

We need to generate annotation (metadata) information so we can search our images.

We are using Amazon Nova Pro to analyze each image.

The challenge is generating structured JSON output. This implementation of ./bin/annotate works, but over thousands of runs it may occasionally fail, so more work is needed to catch edge cases.

./bin/annotate

Here is an example of the annotation output: annotate.json.example

This will annotate our real images, not the mock ones. If we want to include the mock ones, we need to copy them into the input directory.
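
One way to guard against the occasional malformed model response is to extract and validate the JSON defensively before saving it. The sketch below is an assumption about how that could look, not the code in ./bin/annotate. The key names defect_type, severity and description are hypothetical, since the real schema lives in annotate.json.example.

```ruby
require "json"

# Hypothetical schema: the real keys are defined by the workshop's annotation format.
REQUIRED_KEYS = %w[defect_type severity description].freeze

# Returns the parsed annotation Hash, or nil when the model response is unusable.
def parse_annotation(raw_text)
  # Models sometimes wrap JSON in extra prose, so keep only the outermost braces.
  first = raw_text.index("{")
  last  = raw_text.rindex("}")
  return nil if first.nil? || last.nil? || last < first

  data = JSON.parse(raw_text[first..last])
  return nil unless data.is_a?(Hash)
  return nil unless REQUIRED_KEYS.all? { |key| data.key?(key) }

  data
rescue JSON::ParserError
  nil
end

response = 'Here is the annotation: {"defect_type": "crack", ' \
           '"severity": "moderate", "description": "Hairline crack"} Let me know.'
annotation = parse_annotation(response)
puts annotation ? annotation["severity"] : "retry needed"
```

A nil result is a natural signal to retry the request or to log the image for manual review.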

Create Archive, Inventory File and Upload to S3

Zip our images to an archive
Read the zip file and create an inventory file with the byte range of each file
Upload the zip archive to our S3 bucket

./bin/upload

Test Downloading Single Image from the Archive

This script reads the inventory file to get the byte range of the requested image, then uses that range to download only that image's data from inside the archive.

Because the data is stored compressed, we have to decompress the partial download to get the final file.

./bin/download hk0155.jpg
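
The idea behind the inventory and the ranged download can be sketched in plain Ruby. Each file in a zip starts with a 30-byte local file header followed by its name and its (usually deflate-compressed) data, so recording where the data starts and ends is enough to fetch and inflate a single member. The code below builds two local file entries by hand as test input (there is no central directory, so it is not a complete zip), and byteslice stands in for the S3 ranged download. The inventory layout, helper names and the hk0001.jpg entry are hypothetical, and the workshop's ./bin/upload and ./bin/download may work differently.

```ruby
require "zlib"

SIGNATURE = 0x04034b50      # zip local file header signature
HEADER_FORMAT = "Vv5V3v2"   # 30 bytes, little-endian

# Build one local file entry: header + name + raw-deflated data.
def local_entry(name, data)
  deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
  compressed = deflater.deflate(data, Zlib::FINISH)
  deflater.close
  header = [
    SIGNATURE,
    20, 0, 8, 0, 0,  # version, flags, method 8 (deflate), time, date
    Zlib.crc32(data), compressed.bytesize, data.bytesize,
    name.bytesize, 0 # name length, extra field length
  ].pack(HEADER_FORMAT)
  header + name.b + compressed
end

# Walk the local headers and record the byte range of each member's data.
def build_inventory(archive)
  inventory = {}
  offset = 0
  while offset + 30 <= archive.bytesize
    fields = archive.byteslice(offset, 30).unpack(HEADER_FORMAT)
    break unless fields[0] == SIGNATURE

    method, size, name_len, extra_len = fields[3], fields[7], fields[9], fields[10]
    name = archive.byteslice(offset + 30, name_len).force_encoding(Encoding::UTF_8)
    first = offset + 30 + name_len + extra_len
    inventory[name] = { first: first, last: first + size - 1, method: method }
    offset = first + size
  end
  inventory
end

# Fetch only the member's bytes, then inflate them (raw deflate, no zlib header).
def extract(archive, inventory, name)
  entry = inventory.fetch(name)
  compressed = archive.byteslice(entry[:first]..entry[:last]) # stand-in for a ranged GET
  return compressed if entry[:method].zero?                   # method 0 means stored

  Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(compressed)
end

archive = local_entry("hk0001.jpg", "first image bytes " * 20) +
          local_entry("hk0155.jpg", "second image bytes " * 20)
inventory = build_inventory(archive)
image = extract(archive, inventory, "hk0155.jpg")
puts "#{inventory['hk0155.jpg']} -> #{image.bytesize} bytes"
```

Archives written with data descriptors (general purpose flag bit 3) leave the sizes in the local header at zero, so a production inventory would normally read offsets and sizes from the zip central directory instead.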

Create Embedding Data

We will use an embedding model to convert our annotation data into vector embeddings. We'll generate a SQL file so we can bulk import the data into our database.

./bin/embedd
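
To illustrate what a generated SQL file could contain, the sketch below turns one annotation and its embedding into an INSERT statement. pgvector accepts vectors as text literals of the form '[x,y,z]'. The table name images, its columns and the tiny two-dimensional embedding are hypothetical; the real schema is defined in the workshop's sql/setup.sql, and real embeddings have hundreds of dimensions or more.

```ruby
# Format an embedding as a pgvector text literal, e.g. [0.25,-0.5].
def vector_literal(embedding)
  "[" + embedding.map { |value| Float(value).to_s }.join(",") + "]"
end

# Quote a string for SQL by doubling embedded single quotes.
def sql_quote(text)
  "'" + text.gsub("'", "''") + "'"
end

# Hypothetical table and column names, for illustration only.
def insert_statement(filename, description, embedding)
  "INSERT INTO images (filename, description, embedding) VALUES (" \
    "#{sql_quote(filename)}, #{sql_quote(description)}, " \
    "#{sql_quote(vector_literal(embedding))});"
end

statement = insert_statement("hk0155.jpg", "crack in the wall's edge", [0.25, -0.5])
puts statement
```

Writing one such statement per image into a timestamped file gives a script that psql can run in a single pass.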

Prepare Vector Database

Install PSQL

In order to interact with our Postgres database we will need to install the postgres client

sudo apt update
sudo apt install postgresql-client -y

Load Data into Database

We will enable vector extension
We will setup our tables

./bin/execute ./sql/setup.sql

We will insert our data

./bin/execute ./sql/insert.sql

⚠️ Check the actual name of the generated file. It is autogenerated with a timestamp, so use tab completion to find it, e.g. ./bin/execute ./sql/insert-1751397185.sql

We will create our indexes

./bin/execute ./sql/indexes.sql

The following notices are due to our small amount of data. In a production use case we would need these indexes.

psql:sql/indexes.sql:9: NOTICE:  ivfflat index created with little data
DETAIL:  This will cause low recall.
HINT:  Drop the index until the table has more data.
CREATE INDEX
psql:sql/indexes.sql:11: NOTICE:  ivfflat index created with little data
DETAIL:  This will cause low recall.
HINT:  Drop the index until the table has more data.
CREATE INDEX
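
The notice appears because an ivfflat index partitions the vectors into a fixed number of lists, and those lists are built from the rows present at index creation time. The pgvector documentation suggests starting with rows / 1000 lists for up to 1M rows and sqrt(rows) above that. The helper below is a small sketch of that rule of thumb; the function name and the floor of 1 are assumptions, and the value actually used in sql/indexes.sql is not shown in this document.

```ruby
# Starting point for the ivfflat "lists" parameter, following the
# rule of thumb in the pgvector documentation.
def suggested_lists(row_count)
  lists = if row_count <= 1_000_000
            row_count / 1000
          else
            Math.sqrt(row_count).round
          end
  [lists, 1].max # assumption: never suggest fewer than one list
end

[200, 50_000, 4_000_000].each do |rows|
  puts "#{rows} rows -> lists = #{suggested_lists(rows)}"
end
```

With only a few hundred rows the suggestion bottoms out at a single list, which is why an exact scan without the index is adequate for the workshop dataset.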

Agent Search

Agent

Using the Converse API and Amazon Nova Pro, we can search against our vector database.

Example queries:

./bin/agent "cracks in wall that are not a concern"
./bin/agent "severe structural cracks in concrete walls"
./bin/agent "building defects requiring immediate action"
./bin/agent "roof problems with water damage"
./bin/agent "moderate spalling on urban structures"
./bin/agent "all safety concerns in buildings"
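
Behind each query, the text is embedded and the database returns the rows whose vectors are closest to it. The sketch below shows that ranking step in plain Ruby using cosine distance, which is what pgvector's <=> operator computes. The file names and three-dimensional vectors are toy values, and whether the workshop's queries use cosine distance rather than another metric is an assumption.

```ruby
# Cosine distance: 0.0 for vectors pointing the same way, 1.0 for orthogonal ones.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Return the names of the `limit` rows closest to the query vector.
def nearest(query, rows, limit)
  rows.sort_by { |_name, vector| cosine_distance(query, vector) }
      .first(limit)
      .map(&:first)
end

# Toy data: hypothetical file names with made-up 3-dimensional embeddings.
rows = {
  "wall_crack.jpg" => [0.9, 0.1, 0.0],
  "roof_leak.jpg"  => [0.1, 0.9, 0.2],
  "spalling.jpg"   => [0.5, 0.5, 0.7]
}
query = [1.0, 0.2, 0.0]

puts nearest(query, rows, 2).inspect
```

The database does the same comparison in SQL, with an index to avoid scanning every row once the table is large.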

Cleanup

Empty S3 Bucket
Delete Stack
