-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add a MockS3Server plus an s3client and bulkdumpings3 simulation test #12279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
network (simulated network intercepts requests for 127.0.0.1:8080 and does appropriate forwarding to s3 handler). Add two workloads, one for s3client against 's3' and another that swaps in s3 as backend for the BulkDumping test. The Mock S3 Server is mostly generated (claude-sonnet-4) code. It supports: - Basic GET/PUT/DELETE/HEAD object operations - Multipart uploads (initiate, upload parts, complete, abort) - Object tagging (put/get tags) - In-memory storage with deterministic behavior - S3-compatible XML responses * fdbserver/workloads/BulkDumping.actor.cpp Change this workload so it reads the transport to use from configuration file so can be used to test file-based and s3 bulk loading. If the transport is blobstore/s3, start up the mocks3server in _setup. * fdbserver/workloads/S3ClientWorkload.actor.cpp Simple workload to exercise s3client. Starts up the mocks3server in _setup. * fdbclient/S3BlobStore.actor.cpp Check BUGGIFY before messing w/ status codes. * fdbrpc/HTTP.actor.cpp Add defines and log if failed parse of status line.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
fdbserver/fdbserver.actor.cpp
Outdated
@@ -2028,6 +2028,8 @@ int main(int argc, char* argv[]) { | |||
enableGeneralBuggify(); | |||
} else { | |||
disableGeneralBuggify(); | |||
// When buggify is disabled, also disable global fault injection | |||
opts.faultInjectionEnabled = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to disable the fault injection? I think the buggify is not exactly same as the fault injection. Please correct me if I'm wrong. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is here because though I have fault injection set to false in the two toml tests files, there was still fault injection happening in joshua. Let me undo this for now. I'll post new joshua run numbers... Get an example of what fault injection I was seeing.
@@ -1186,12 +1186,13 @@ ACTOR Future<Reference<HTTP::IncomingResponse>> doRequest_impl(Reference<S3BlobS | |||
} | |||
|
|||
Reference<HTTP::IncomingResponse> _r = wait(timeoutError(reqF, requestTimeout)); | |||
if (g_network->isSimulated() && deterministicRandom()->random01() < 0.1) { | |||
if (g_network->isSimulated() && BUGGIFY && deterministicRandom()->random01() < 0.1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we do Buggify here? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only do the random if BUGGIFY is enabled. I don't to turn off any variance for the moment in s3 client behaviors till we have stable determinism.
} | ||
|
||
// Public Interface Implementation | ||
ACTOR Future<Void> startMockS3Server(NetworkAddress listenAddress) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this startMockS3Server used? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not currently. I thought I would need it until tripped over the register http server thingy. Seems useful though if someone wanted to start up a mock s3 instance.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Here is 10k tests of BulkDumpingS3 only
and here are 100k of general tests
Looks like its fine w/o the global disabling of fault injection. |
tests/slow/S3Client.toml
Outdated
|
||
[[knobs]] | ||
# Disable failure injection for deterministic behavior | ||
bulkload_sim_failure_injection = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that this knob is irrelevant to this toml file test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Removed (was mirroring changes in BulkDumpingS3 to this file)
# When done, run: | ||
# shutdown_weed ~/weed | ||
|
||
# Disable all fault injection and network failures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to disable all fault injection? Will we enable those in the future? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Currently, fault injection produces failures. In follow-on, would introduce faults.
# DETERMINISM FIX: Combined knobs section to avoid duplicates | ||
[[knobs]] | ||
# S3/Bulkload determinism fixes | ||
bulkload_sim_failure_injection = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try with enabling this knob bulkload_sim_failure_injection? Since we have a mocked http server, we do not have a reason to disable the bulkload failure injection? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I disable this flag, I get failures:
20250729-152416-stack-b70ff5b181db9da0 compressed=True data_size=41406205 duration=898424 ended=10000 env=TH_ARCHIVE_LOGS_ON_FAILURE=true:TH_UNSEED_CHECK_RATIO=1.0 fail=2 fail_fast=10 max_runs=10000 pass=9998 priority=100 remaining=0 runtime=0:22:44 sanity=False started=10000 stopped=20250729-154700 submitted=20250729-152416 timeout=5400 username=stack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. We can set the knob here and get the PR merged. We will remove the bulkload_sim_failure_injection later. Maybe better to add a "TODO" here.
# Simple network configuration - single region, no satellites | ||
minimumRegions = 1 | ||
# Use ssd-2 storage engine instead of rocksdb to avoid RocksDB bulk loading crashes | ||
# The RocksDB assertion failure indicates incompatibility with current bulk loading implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which assertion failure? The current bulkload implementation should be compatible to RocksDB. Please let me know more detail about this. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is old at your suggestion IIRC. We can look at this in a follow-on?
// Register MockS3Server with IP address - simulation environment doesn't support hostname resolution | ||
// See in HTTPServer.actor.cpp how the MockS3RequestHandler is implemented. Client connects to | ||
// connect("127.0.0.1", "8080") and then simulation network routes it to MockS3Server. | ||
wait(g_simulator->registerSimHTTPServer("127.0.0.1", "8080", makeReference<MockS3RequestHandler>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice try!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Only a few nits. Thank you for the nice PR!
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Thanks for the review @kakaiu |
Register it at 127.0.0.1:8080 on the simulated network (simulated network intercepts requests for 127.0.0.1:8080 and does appropriate forwarding to s3 handler). Add two workloads, one for s3client against 's3' and another that swaps in s3 as backend for the BulkDumping test.
The Mock S3 Server is mostly generated (claude-sonnet-4) code. It supports:
fdbserver/workloads/BulkDumping.actor.cpp Change this workload so it reads the transport to use from configuration file so can be used to test file-based and s3 bulk loading. If the transport is blobstore/s3, start up the mocks3server in _setup.
fdbserver/workloads/S3ClientWorkload.actor.cpp Simple workload to exercise s3client. Starts up the mocks3server in _setup.
fdbclient/S3BlobStore.actor.cpp Check BUGGIFY before messing w/ status codes.
fdbrpc/HTTP.actor.cpp Add defines and log if failed parse of status line.
And I think I fixed the one fail ... its global buggify.