Add optional step to archive post-reexecution state to S3 #4172
…o work w/ github action benchmark
  S3_DST: '{{.S3_DST}}'
  cmds:
- - cmd: s5cmd cp {{.LOCAL_SRC}} {{.S3_DST}}
+ - cmd: bash -x ./scripts/copy_dir.sh {{.LOCAL_SRC}} {{.S3_DST}}
Switch to use copy dir script to benefit from check for non-empty s3 directory and avoid accidental overwrites
scripts/copy_dir.sh
Outdated
# Ensure destination directory exists (after validation)
mkdir -p "$dest"
I think the existing code already does this... but won't this create dirs on our local fs with the path s3://...?
Good catch, moving into the conditional block below where we are using cp
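To illustrate the concern above: `mkdir -p "s3://bucket/dir/"` treats the URI as a relative path and creates a literal `s3:` directory tree in the current working directory. A minimal sketch of guarding against that, assuming a hypothetical helper name `ensure_local_dest` (not taken from `copy_dir.sh`):

```bash
#!/usr/bin/env bash

# Hypothetical helper (name is an assumption, not from copy_dir.sh):
# create the destination directory only when it is a local path.
ensure_local_dest() {
  local dest="$1"
  if [[ "$dest" == s3://* ]]; then
    # An S3 URI is not a filesystem path. Running mkdir on it would create
    # "s3:/<bucket>/..." relative to the current directory, so skip it.
    return 0
  fi
  mkdir -p "$dest"
}
```

With this shape, `ensure_local_dest "s3://some-bucket/some-dir/"` is a no-op, while `ensure_local_dest /tmp/some/dir` creates the directory as before.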
# Check if source starts with s3://
if [[ "$source" == s3://* ]]; then
  echo "Copying from S3: $source -> $dest"
Unrelated log removal?
Will add back
if [[ "$OUTPUT" == *"no object found"* ]]; then
  echo "Verified S3 destination: '$dst' is empty"
else
  echo "Error: failed to check for contents of $dst"
If the bucket does not exist - should we be creating it and then copying? Currently I think that if we're trying to copy from a source and the destination bucket doesn't exist, we'll have to create it manually - which is inconsistent w/ the behavior of this script when copying between local dirs where we'll create the directory before proceeding.
This is true if the bucket does not exist, but it will create the directory for you. This seems like the right default to me as I would not intend to copy directories to a brand new S3 bucket, just copy a directory into an existing bucket.
If you feel strongly that the behavior should match the local file system, I'd prefer to remove mkdir -p <dst> from the local file system handling than to create the bucket if it doesn't exist yet here.
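For readers following this thread, here is a rough sketch of the "refuse to overwrite" check being discussed. It is pieced together from the snippets quoted in this review, not copied from the final script. It assumes, as those snippets do, that `s5cmd ls` exits non-zero and prints "no object found" when the prefix is empty. The function name `check_s3_dst_empty` and the `LS_CMD` override are hypothetical and exist only so the logic can be exercised without AWS access.

```bash
#!/usr/bin/env bash

# Sketch only: succeed when the S3 destination has no objects, fail otherwise.
# LS_CMD is a hypothetical override for testing; it defaults to s5cmd.
check_s3_dst_empty() {
  local dst="$1"
  local output
  if output=$("${LS_CMD:-s5cmd}" ls "$dst" 2>&1); then
    # Listing succeeded, so something already lives at this prefix.
    echo "Error: destination $dst already contains objects" >&2
    return 1
  fi
  if [[ "$output" == *"no object found"* ]]; then
    echo "Verified S3 destination: '$dst' is empty"
    return 0
  fi
  # Any other failure (missing bucket, denied access, network error) is
  # treated as an error rather than as "empty", so the copy never proceeds.
  echo "Error: failed to check for contents of $dst" >&2
  return 1
}
```

Under this sketch a missing bucket lands in the final error branch, which matches the behavior described above: the directory is created by the copy, but the bucket has to exist already.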
push-post-state:
  description: 'S3 location to push post-execution state directory. Skips this step if left unpopulated.'
We're a bit inconsistent w/ the naming of this (we're calling it post state/post execution state/re-executed in a few places). Since we're already referring to this as re-execute in the current code, could we use that naming in lieu of the new naming?
I think this is fine. re-execution refers to the fact we are re-executing blocks from an actual network. Post execution refers to the fact that we are pushing the result after completing execution. post-execution is an adjective here, not a name and post-reexecution doesn't sound right to me.
- name: Push Post-State to S3 (if not exists)
  if: ${{ inputs.push-post-state != '' }}
  shell: nix develop --command bash -x {0}
  run: ./scripts/run_task.sh export-dir-to-s3 LOCAL_SRC=${{ env.EXECUTION_DATA_DIR }}/current-state/ S3_DST=${{ inputs.push-post-state }}
I found this a bit confusing/I'm not sure what was intended here so just dumping my thoughts:
- It looks like we log `EXECUTION_DATA_DIR` as `"EXECUTION_DATA_DIR=${{ inputs.workspace }}/reexecution-data"` earlier (appended a dir for a log)
- We pass in the env var as `EXECUTION_DATA_DIR=${{ env.EXECUTION_DATA_DIR }} \` (unmodified)
- But the actual directory we're copying over is `LOCAL_SRC=${{ env.EXECUTION_DATA_DIR }}/current-state/` (different appended path, this is also a bit confusing w/ naming since we have an env var for `CURRENT_STATE_DIR`)
Would it make sense for us to pass in `${{ env.EXECUTION_DATA_DIR }}/current-state/` when calling run_task and update the log? Based on the README it also looks like it might be possible that we meant to use `CURRENT_STATE_DIR`, since that's documented as having the current-state dir format.
reexecute-cchain-range - takes in CURRENT_STATE_DIR and SOURCE_BLOCK_DIR as separate params and reads from the block dir and executes on and overwrites CURRENT_STATE_DIR.
reexecute-cchain-range-with-copied-data - this copies data in from S3 buckets or a local directory (supports local directory for testing on snoopy/linus instances with local copy of each)
EXECUTION_DATA_DIR is where both of these will be copied into and that's why there's a suffix added to both for the current state and blocks.
That can definitely be confusing. I'd propose using different var names for these two tasks, so that the copy job is clear that CURRENT_STATE_DIR and SOURCE_BLOCK_DIR are the source to copy from as opposed to what's actually used. I think the root of the confusion is re-using the same var names across both jobs.
If we used CURRENT_STATE_DIR here, that would be incorrect because that's the original S3 bucket where we copied the data from.
Since this is using the existing naming, I'd prefer to make the rename a separate PR as a follow up to this. Will make one on top of this PR for now and if you think it should be part of this PR, we can block on it to merge this in.
  {
  echo "EXECUTION_DATA_DIR=${{ inputs.workspace }}/reexecution-data"
- echo "BENCHMARK_OUTPUT_FILE=${{ inputs.workspace }}/reexecute-cchain-range-benchmark-res.txt"
+ echo "BENCHMARK_OUTPUT_FILE=output.txt"
Why is this re-named? Is this a related change?
There's some weirdness in how GitHub Action Benchmark handles paths, where it was not handling absolute paths correctly. I found while working on this PR that env vars from $GITHUB_ENV are not picked up correctly by the task run from within run-monitored-tmpnet-cmd, so I switched to passing each var in explicitly and that triggered the mishandling of absolute paths.
  prometheus-password: ${{ secrets.PROMETHEUS_PASSWORD || '' }}
  push-github-action-benchmark: ${{ github.event_name == 'schedule' }}
- aws-role: ${{ secrets.AWS_S3_READ_ONLY_ROLE }}
+ aws-role: ${{ github.event.inputs.push-post-state != '' && secrets.AWS_S3_RW_ROLE || secrets.AWS_S3_READ_ONLY_ROLE }}
nit: Is this over-engineering? Feels like we could just use r/w perms since this job now is able to write to s3 when configured to do so
Ehh there's definitely an argument there. I thought this was a reasonable extra protection / defense in depth against accidentally corrupting it.
Writes would only occur via manual trigger, so are run infrequently and if something breaks that would overwrite or push incorrect data by accident, this minimizes the access level of > 90% of runs.
It's also pretty straightforward here imo, so the real cost is maintaining two roles instead of one, but that is already set up and they are named explicitly.
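As a rough illustration of what the `cond && A || B` expression in the workflow evaluates to, here is the same selection written as a shell function. The function name `select_aws_role` and its arguments are hypothetical. One detail worth knowing about this expression form in GitHub Actions: if `A` is empty (for example, the read/write secret is not configured), the expression falls through to `B`, which in this case means the read-only role.

```bash
#!/usr/bin/env bash

# Hypothetical mirror of:
#   push-post-state != '' && AWS_S3_RW_ROLE || AWS_S3_READ_ONLY_ROLE
# Arguments: <push-post-state> <rw-role> <read-only-role>
select_aws_role() {
  local push_post_state="$1" rw_role="$2" ro_role="$3"
  if [[ -n "$push_post_state" && -n "$rw_role" ]]; then
    # Write access is only requested when a destination was supplied
    # and a read/write role is actually available.
    echo "$rw_role"
  else
    echo "$ro_role"
  fi
}
```

So a scheduled or PR-triggered run, where the input is empty, always resolves to the read-only role.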
Addressed all comments except #4172 (comment). As mentioned, will make a separate PR on top of this to improve the names, so that they are not shared between the tasks.
60d5188 to b6a2fee
Very strange: GitHub Action Benchmark is failing to checkout the […]
@aaronbuchwald I'm also facing the same issue on a separate PR: https://github.com/ava-labs/avalanchego/actions/runs/17069126455/job/48393320563?pr=4181
Looks like this is caused by the […]. The documentation for Caching in GitHub Actions explicitly uses […]. It's possible that using a path included in […]. Switching to use […]
  source-block-dir:
  description: 'The source block directory. Supports S3 directory/zip and local directories.'
- default: 's3://avalanchego-bootstrap-testing/cchain-mainnet-blocks-1m-ldb.zip'
+ default: 's3://avalanchego-bootstrap-testing/cchain-mainnet-blocks-1m-ldb/**'
can you explain what this ** means?
* matches any key at a given level of an S3 bucket and ** matches any key at any level of recursive nesting within the bucket.
  with:
- run: ./scripts/run_task.sh reexecute-cchain-range-with-copied-data
+ run: |
+ ./scripts/run_task.sh reexecute-cchain-range-with-copied-data \
these capital cased variables are only defined in this file, so what is consuming them via run_task?
Run task invokes the Taskfile (ref: https://taskfile.dev/), which uses this syntax.
These variables are defined here:
Line 184 in e5593be
reexecute-cchain-range:
Yeah I know what that is, it's just that my IDE searched for the strings and couldn't find them. Now it does, odd... must have been still indexing
# a change in behavior would cause the script to fail to copy rather
# than allow accidental overwrites.
echo "Checking if S3 path exists: $dst"
if ! OUTPUT=$(s5cmd ls "$dst" 2>&1); then
might be a stupid question but - how do we have the credentials to invoke this command? Do we have them set somewhere as env vars?
Yup we assume AWS credentials are set. In CI, we use aws-actions/configure-aws-credentials@v4:
uses: aws-actions/configure-aws-credentials@v4
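Since the script assumes credentials are already present, a local run could fail late with an opaque S3 error. One possible preflight, sketched here as an assumption rather than something in the PR, checks the standard AWS environment variables (`aws-actions/configure-aws-credentials` exports these in CI). The function name `require_aws_env` is hypothetical. Note that this is stricter than necessary: credentials can also come from a shared profile or an instance role, in which case these variables may be unset and the copy would still work.

```bash
#!/usr/bin/env bash

# Hypothetical preflight: report which of the standard AWS credential
# environment variables are missing. Returns non-zero if any is unset.
require_aws_env() {
  local var missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # ${!var:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $var, or empty if it is unset.
    if [[ -z "${!var:-}" ]]; then
      echo "Error: $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```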
if [[ "$dst" == s3://* ]]; then
  # Validate the S3 path format as s3://<bucket-name>/<directory-name>/
  echo "Checking S3 path format: $dst"
  if ! [[ "$dst" =~ ^s3://[^/]+/([^/]+/)$ ]]; then
seems like a rather permissive regexp [^/]+/([^/]+/). It basically lets everything that is separated by a / and ends with a /.
Is there really a concern here that we'll get an incorrect input?
Yes, this PR provides an optional feature to archive the state to S3 and takes in a destination path from the user (this is gated to developers on our team). This enforces that we don't nest two different directories together like putting a current state directory nested inside of another one from a different chain height. It is not perfect, but seemed a good safeguard to add.
Open to better ideas to reduce the likelihood that we corrupt the S3 bucket we're using here.
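To make the effect of the pattern concrete, here is a small standalone sketch that applies the same regex from the diff to a few made-up paths. The function name `is_valid_s3_dir` and the example bucket and directory names are hypothetical. The pattern is permissive about names, but it does pin the shape to exactly one directory level with a trailing slash, which is what prevents nesting one state directory inside another.

```bash
#!/usr/bin/env bash

# Same pattern as in the diff: s3://<bucket-name>/<directory-name>/
is_valid_s3_dir() {
  local dst="$1"
  local pattern='^s3://[^/]+/([^/]+/)$'
  [[ "$dst" =~ $pattern ]]
}

# Example inputs are made up for illustration.
for candidate in \
  "s3://example-bucket/state-dir/" \
  "s3://example-bucket/state-dir" \
  "s3://example-bucket/outer/inner/" \
  "s3://example-bucket/"; do
  if is_valid_s3_dir "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: $candidate"
  fi
done
```

Only the first candidate is accepted: the second lacks the trailing slash, the third is nested two levels deep, and the fourth has no directory component.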
commit 7d7e1fe Author: aaronbuchwald Date: Tue Aug 19 13:24:59 2025 -0400 Add optional step to archive post-reexecution state to S3 (#4172) Signed-off-by: aaronbuchwald Co-authored-by: Copilot


This PR adds an optional step to the re-execution custom action that archives the post-execution state to S3.
It also adds an optional input to the ARC action's manual workflow trigger to enable that step.
Tested via this PR by temporarily updating the triggers so that the step, which should only run on a manual workflow trigger, also runs on the PR.
ARC run exporting successfully to S3 bucket here: https://github.com/ava-labs/avalanchego/actions/runs/16976897266/job/48128019101?pr=4172
Second attempt on the same commit that fails due to an attempt to overwrite the S3 bucket pushed by the first run: https://github.com/ava-labs/avalanchego/actions/runs/16976897266?pr=4172
Commit undoing the PR trigger to test these changes: 326468e
Resolves #4167