- AWS Account
- GitHub Account
- Overview 概要
- Setup セットアップ
- Test S3 Range
- Prepare Dataset データセット準備
- Prepare Vector Database
- Agent Search
An engineering firm needs to audit and track public infrastructure for safety using aerial drone footage. They have captured tens of thousands of images (e.g. cracks in bridges) and stored them in archives organized by year and month.
As a Cloud Engineer, you have been tasked with building a proof-of-concept that uses GenAI to retrieve an image from the archive with a natural-language query.
You need to report back on the possible technical paths and technical considerations for this project.
- All resources will be created in ap-northeast-1 Asia Pacific (Tokyo)
- We'll be using GitHub Codespaces so we have a consistent developer environment
- We are not using free-tier services but the cost should be under $1 USD for the duration of the workshop
- We'll be using the following repo: https://github.com/ExamProCo/aws-storage-genai-workshop
- We may need to rebuild the container for the AWS CLI to be installed
The devcontainer does not always work on Codespaces: it can require a lengthy rebuild and may still hang afterwards.
- Can we extract specific bytes from an S3 file and read them?
- Can we use Amazon Nova to generate mock images to vary our dataset?
- Can we annotate the images with structured JSON output using Amazon Nova?
- Can we extract a specific image file from a zip archive in S3 (without downloading the whole archive)?
- Can we use Amazon Titan to create embeddings for our vector search database?
- Can we deploy pgvector database via container on a t3.micro?
- Can we get Amazon Nova to generate our query to our vector database and return the results?
We are using the CUBIT Infrastructure Defect Detection Dataset
https://github.com/BenyunZhao/CUBIT
- Open the region selector drop-down
- Change your region to 東京 ap-northeast-1
- In the search bar type bedrock
- Click on Amazon Bedrock to go to this service.
- In the left-hand column click on モデルアクセス (Model access)
- Click on すべてのモデルを有効にする (Enable all models)
- Click on 次へ (Next)
- Click on 送信 (Submit)
- See that the models Nova Pro, Nova Canvas are enabled
- We need the two subnets from the default VPC.
- We need to run this command in CloudShell:
aws ec2 describe-subnets \
  --region ap-northeast-1 \
  --filters "Name=vpc-id,Values=$(aws ec2 describe-vpcs --region ap-northeast-1 --filters "Name=is-default,Values=true" --query 'Vpcs[0].VpcId' --output text)" \
  --query 'Subnets[0:2].SubnetId' \
  --output text | tr '\t' ','
- Open CloudShell
- Paste the AWS CLI command from above
- Copy the subnet IDs for the next step
Let's deploy the following AWS infrastructure:
- AWS User with AWS Credentials
- S3 Bucket
- RDS Instance
Please click this button to deploy:

- Write the name for the stack (スタック名): GenAIStorageStack
- Paste in the SubnetIds from the previous step
- Set the database password: Testing123!
- Enable extra permissions
- Create stack (and wait 5 mins)
- Click on outputs
- See the outputs, we will use them soon.
- Click on Code
- Click on Codespaces
- Click on Create codespace on main
- Create a copy of .env.example and name it .env
- Update AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_BUCKET_NAME (get the values from the CloudFormation stack outputs)
- Install the Ruby libraries by running bundle install
cd /workspaces/aws-storage-genai-workshop
bundle install
Installing nokogiri will take 1-2 minutes.
- Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "/tmp/awscliv2.zip" && \
cd /tmp && unzip awscliv2.zip && sudo ./aws/install && \
rm -rf awscliv2.zip aws/ && cd -
🎉 Setup Complete セットアップ完了 🎉
We want to determine whether we can read part of a file without downloading the entire file. Amazon S3 supports the HTTP Range header, which specifies the byte range to download.
We will upload a file called hello_world.txt to our bucket.
The contents of this file are こんにちは世界.
./bin/upload_file
We will specify the byte range so that we only read 世界.
./bin/read_range
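To make the byte arithmetic concrete, here is a minimal Ruby sketch of how the range for 世界 could be worked out. It is not the code in ./bin/read_range; the helper names `byte_range_for` and `read_range` are hypothetical, and `read_range` only simulates the server side on a local string. Each of these characters takes 3 bytes in UTF-8, so the range has to be expressed in bytes, not characters.

```ruby
# Build an HTTP Range header value for a substring of a UTF-8 string.
# Range offsets are byte offsets and the end offset is inclusive.
def byte_range_for(text, substring)
  char_index = text.index(substring)
  raise ArgumentError, "substring not found" if char_index.nil?

  first = text[0, char_index].bytesize      # bytes before the substring
  last = first + substring.bytesize - 1     # inclusive last byte
  "bytes=#{first}-#{last}"
end

# Local stand-in for a ranged GET: return only the requested bytes.
def read_range(content, header)
  first, last = header.delete_prefix("bytes=").split("-").map(&:to_i)
  content.byteslice(first, last - first + 1)
end

CONTENT = "こんにちは世界"
RANGE_HEADER = byte_range_for(CONTENT, "世界")

puts RANGE_HEADER                       # bytes=15-20
puts read_range(CONTENT, RANGE_HEADER)  # 世界
```

With the aws-sdk-s3 gem, the same header value can be passed as the `range:` option of `get_object`, so only those six bytes are transferred.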
If our dataset is missing examples of certain images, we can generate our own to help test the edge cases of our application later.
We are using Amazon Nova Canvas to generate images.
./bin/generate
This will output a file to 010__prepare_dataset/outputs/images/
Example of generated image using the following prompt: The image shows the eaves of a building with visible cracks, spalling, and missing components. The surface appears deteriorated, with signs of water damage and discoloration. The eaves are part of the building's exterior, and the defects are concentrated along the edge where the roof meets the wall.
We need to generate annotation (metadata) information so we can search our images.
We are using Amazon Nova Pro to analyze each image.
The challenge is generating structured JSON output.
While this implementation of ./bin/annotate works, some calls may still fail over thousands of runs, so more work is needed to catch edge cases.
./bin/annotate
Here is an example of annotation output: annotate.json.example
This will annotate our real images, not the mock ones. If we want to include the mock ones, we need to copy them into the input directory.
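One way to harden the structured JSON step is to validate every model response and retry when it does not parse. The sketch below is an illustration only, not the logic inside ./bin/annotate: the key names in `REQUIRED_KEYS` are a hypothetical schema, and the block passed to `annotate_with_retry` stands in for the call to the model.

```ruby
require "json"

# Hypothetical schema: the real annotation keys may differ.
REQUIRED_KEYS = %w[description defects severity].freeze

# Return the parsed annotation Hash, or nil if the response is unusable.
def parse_annotation(raw)
  # Models sometimes wrap JSON in markdown fences or add prose around it,
  # so keep only the text between the first "{" and the last "}".
  start = raw.index("{")
  finish = raw.rindex("}")
  return nil if start.nil? || finish.nil? || finish < start

  data = JSON.parse(raw[start..finish])
  return nil unless data.is_a?(Hash)

  (REQUIRED_KEYS - data.keys).empty? ? data : nil
rescue JSON::ParserError
  nil
end

# Call the block (the model request) until it yields a valid annotation.
def annotate_with_retry(max_attempts: 3)
  max_attempts.times do
    result = parse_annotation(yield)
    return result if result
  end
  nil
end
```

A caller would wrap the model request in the block, for example `annotate_with_retry { call_model(image) }` where `call_model` is whatever function returns the raw response text, and treat a `nil` result as an image that needs manual review.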
- Zip our images to an archive
- Read the zip file and create an inventory file with byte ranges for exact files
- Upload the zip archive to our S3 bucket
./bin/upload
This script reads the inventory file to get the byte range, then uses that byte range to download only the image's bytes from inside the archive.
We have to decompress the partial data to get the final file.
./bin/download hk0155.jpg
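The inventory-plus-range idea can be sketched with Ruby's standard library alone. This is a deliberately simplified stand-in, not the format used by ./bin/upload and ./bin/download: instead of a real zip file it concatenates raw DEFLATE streams (the compression that zip entries commonly use), and the helpers `build_archive` and `extract` are hypothetical. In a real zip archive each entry also begins with a local file header, so the inventory would need to point past that header to the compressed data.

```ruby
require "zlib"

# Compress with raw DEFLATE (negative window bits = no zlib header).
def raw_deflate(data)
  deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
  compressed = deflater.deflate(data, Zlib::FINISH)
  deflater.close
  compressed
end

# Decompress a raw DEFLATE stream.
def raw_inflate(data)
  inflater = Zlib::Inflate.new(-Zlib::MAX_WBITS)
  original = inflater.inflate(data)
  inflater.close
  original
end

# Concatenate compressed members and record where each one lives.
def build_archive(files)
  archive = String.new(encoding: Encoding::BINARY)
  inventory = {}
  files.each do |name, content|
    compressed = raw_deflate(content)
    inventory[name] = { offset: archive.bytesize, length: compressed.bytesize }
    archive << compressed
  end
  [archive, inventory]
end

# Fetch one member: byteslice stands in for a ranged GET against S3.
def extract(archive, inventory, name)
  entry = inventory.fetch(name)
  slice = archive.byteslice(entry[:offset], entry[:length])
  raw_inflate(slice)
end
```

Only the bytes named by the inventory entry are read, which is why the partial data must be inflated separately before it is usable as an image file.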
We will use an embedding model to convert our annotation data into vector embeddings. We'll generate a SQL file to bulk import our data into our database.
./bin/embedd
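As a rough idea of what the generated SQL file could contain, the sketch below turns an embedding array into a pgvector text literal and wraps it in an INSERT statement. The table name `images` and its columns are assumptions for illustration; the actual schema is defined in ./sql/setup.sql and may differ.

```ruby
# Format an embedding as a pgvector text literal, e.g. [0.25,-0.5,1.0].
def vector_literal(embedding)
  "[" + embedding.map { |value| Float(value).to_s }.join(",") + "]"
end

# Quote a string for SQL by doubling any single quotes.
def sql_quote(text)
  "'" + text.gsub("'", "''") + "'"
end

# Build one INSERT row. Table and column names are hypothetical.
def insert_statement(filename, description, embedding)
  "INSERT INTO images (filename, description, embedding) VALUES " \
    "(#{sql_quote(filename)}, #{sql_quote(description)}, '#{vector_literal(embedding)}');"
end

puts insert_statement("hk0155.jpg", "Crack along the building's eaves", [0.25, -0.5, 1])
```

Writing one such line per annotated image into a single file lets the whole dataset be loaded with one psql run instead of one round trip per row.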
In order to interact with our Postgres database we will need to install the postgres client
sudo apt update
sudo apt install postgresql-client -y
- We will enable the vector extension
- We will set up our tables
./bin/execute ./sql/setup.sql
- We will insert our data
./bin/execute ./sql/insert.sql
⚠️ This file is autogenerated with a timestamp in its name, so check the actual name of the generated file and use it instead, e.g. ./bin/execute ./sql/insert-1751397185.sql
- We will create our indexes
./bin/execute ./sql/indexes.sql
These notices are due to our small amount of data. In a production use case we would need these indexes.
psql:sql/indexes.sql:9: NOTICE: ivfflat index created with little data
DETAIL: This will cause low recall.
HINT: Drop the index until the table has more data.
CREATE INDEX
psql:sql/indexes.sql:11: NOTICE: ivfflat index created with little data
DETAIL: This will cause low recall.
HINT: Drop the index until the table has more data.
CREATE INDEX
Using the Converse API and Amazon Nova Pro on Amazon Bedrock, we can search against our vector database.
Example queries:
./bin/agent "cracks in wall that are not a concern"
./bin/agent "severe structural cracks in concrete walls"
./bin/agent "building defects requiring immediate action"
./bin/agent "roof problems with water damage"
./bin/agent "moderate spalling on urban structures"
./bin/agent "all safety concerns in buildings"
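Behind each of these queries, the database ranks stored embeddings by their distance to the embedding of the query text. The sketch below reproduces that ranking in plain Ruby using cosine distance, which is what pgvector's `<=>` operator computes. Whether the workshop's SQL uses cosine distance or another metric is an assumption here, and the `nearest` helper and row layout are hypothetical.

```ruby
# Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal ones.
def cosine_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Return the rows closest to the query embedding, nearest first.
def nearest(query, rows, limit: 3)
  rows.sort_by { |row| cosine_distance(query, row[:embedding]) }.first(limit)
end
```

In SQL the equivalent ranking would be an `ORDER BY embedding <=> '[...]' LIMIT n` clause against the table that holds the embeddings.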
- Empty S3 Bucket
- Delete Stack