
Speed up test suite #4849


Open · wants to merge 5 commits into master

Conversation

@marta-lokhova (Contributor) commented on Jul 26, 2025:

Resolves https://github.com/stellar/stellar-core-internal/issues/371
Resolves #2100
Resolves #4445

Looks like our test suite has accumulated quite a lot of tech debt and is taking much longer than it should (this came up in a recent conversation with @drebelsky about simple tests like "txset nomination" taking ~35 seconds).

This PR makes several changes:

  • Add a helper script to audit the test suite and identify the biggest offenders (written by @claude)

The script can be run on all tests, any partition, or an individual test (a minimal sketch of the approach appears after this list). Sample output:

python3 run_test_timing_analysis.py -p overlay -n 10

Found 62 tests to run
Running each test individually to measure wall-clock time...
This may take a while depending on the number of tests...
--------------------------------------------------------------------------------
[1/62] Running: Flooding
[2/62] Running: ItemFetcher fetches
[3/62] Running: next peer strategy
...
[62/62] Running: Tracker works

================================================================================
TEST TIMING ANALYSIS REPORT (Wall-Clock Time)
================================================================================

Total tests: 62
Total time: 350.01s
Average time per test: 5.645s

Top 10 Slowest Tests:
--------------------------------------------------------------------------------
  1.  152.442s ( 43.6%) overlay flow control
  2.   50.374s ( 14.4%) TCPPeer read malformed messages
  3.   46.080s ( 13.2%) overlay pull mode loadgen
  4.   19.691s (  5.6%) survey request process order
  5.   14.766s (  4.2%) Flooding
  6.   11.077s (  3.2%) inbounds nodes can be promoted to ouboundvalid
  7.   10.106s (  2.9%) TCPPeer lifetime
  8.    9.474s (  2.7%) Time sliced static topology survey
  9.    6.130s (  1.8%) connecting to saturated nodes
 10.    5.087s (  1.5%) peer is purged from database after few failures

Top 10 tests account for 325.23s (92.9% of total time)

Warning: 1 tests failed
  • One of the biggest issues is that we close a ledger every time we create a new account (some tests create thousands of accounts, which is slow even in virtual time). I switched those tests to use GENESIS_TEST_ACCOUNT_COUNT and observed some drastic speedups, e.g. "SCP Driver"
// Before
stellar-core test -a "SCP Driver"  30.92s user 49.76s system 37% cpu 3:35.51 total
// With GENESIS_TEST_ACCOUNT_COUNT
stellar-core test -a "SCP Driver"  1.89s user 0.20s system 73% cpu 2.840 total
  • Move a couple of long-running tests to acceptance
  • Tweak how load is generated in tests running in TCP mode, since those run in real time. A remaining TODO is to switch them to accelerated time.
  • Remove old upgrade tests - since the network is way past protocol 10, 11, and 12, replaying upgrades via catchup should be sufficient.
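
For reference, here is a minimal sketch of the timing-audit approach in Python. This is not the actual run_test_timing_analysis.py from this PR; the binary path, the Catch2-style --list-test-names-only flag, and the "[partition]" tag selection are assumptions for illustration only.

# Minimal sketch of a per-test wall-clock timing audit (illustration only; not
# the actual run_test_timing_analysis.py). Assumes the test binary forwards
# Catch2-style arguments: "--list-test-names-only" to enumerate tests, a
# "[partition]" tag to filter, and a quoted test name to run a single test.
import subprocess
import time

CORE = "./stellar-core"  # assumed binary path

def list_tests(partition=None):
    args = [CORE, "test", "--list-test-names-only"]
    if partition:
        args.append(f"[{partition}]")
    out = subprocess.run(args, capture_output=True, text=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def time_test(name):
    # Wall-clock time for one test run in its own process.
    start = time.monotonic()
    result = subprocess.run([CORE, "test", name], capture_output=True)
    return time.monotonic() - start, result.returncode

def main(partition=None, top_n=10):
    tests = list_tests(partition)
    print(f"Found {len(tests)} tests to run")
    timings, failures = [], 0
    for i, name in enumerate(tests, 1):
        print(f"[{i}/{len(tests)}] Running: {name}")
        elapsed, rc = time_test(name)
        timings.append((elapsed, name))
        failures += rc != 0
    timings.sort(reverse=True)
    total = sum(t for t, _ in timings)
    print(f"Total tests: {len(tests)}")
    print(f"Total time: {total:.2f}s")
    for rank, (elapsed, name) in enumerate(timings[:top_n], 1):
        print(f"{rank:3d}. {elapsed:9.3f}s ({elapsed / total:6.1%}) {name}")
    if failures:
        print(f"Warning: {failures} tests failed")

if __name__ == "__main__":
    main(partition="overlay", top_n=10)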

Opening this as a draft, since I haven't finished auditing all the tests flagged by the script.

@marta-lokhova (Contributor, Author) commented:

Here are the top 50 slowest tests (a lot of them already live in acceptance, but some, like archival meta, run in every GH action). Note that these results are based on a regular core build; I imagine that with ASan, which is our default in CI, things are much worse.

Top 50 Slowest Tests:
--------------------------------------------------------------------------------
  1. 1392.434s ( 24.0%) History catchup
  2.  686.091s ( 11.8%) archival meta
  3.  482.937s (  8.3%) Catchup recent
  4.  417.904s (  7.2%) overlay parallel processing
  5.  287.317s (  5.0%) Publish catchup alternation with stall
  6.  264.985s (  4.6%) Catchup fatal failure
  7.  152.276s (  2.6%) overlay flow control
  8.  125.216s (  2.2%) LedgerTxnRoot prefetch classic entries
  9.  118.261s (  2.0%) ledger state update flow with parallel apply
 10.  117.431s (  2.0%) bucketmanager reattach to running merge
 11.  109.898s (  1.9%) write verified checkpoint hashes
 12.   80.054s (  1.4%) Retriggering catchups after trimming mSyncingLedgers
 13.   78.248s (  1.4%) History publish with restart
 14.   65.993s (  1.1%) Catchup failure recovery with buffered checkpoint
 15.   65.692s (  1.1%) catchup with a gap
 16.   58.802s (  1.0%) Publish throttles catchup
 17.   49.920s (  0.9%) TCPPeer read malformed messages
 18.   46.214s (  0.8%) overlay pull mode loadgen
 19.   41.728s (  0.7%) History bucket verification
 20.   39.476s (  0.7%) apply load
 21.   36.576s (  0.6%) generate load with unique accounts
 22.   34.024s (  0.6%) bl cache
 23.   32.347s (  0.6%) soroban cache population
 24.   31.407s (  0.5%) Catchup with protocol upgrade
 25.   29.497s (  0.5%) key-value lookup
 26.   29.390s (  0.5%) loadPoolShareTrustLinesByAccountAndAsset does not load outdated versions
 27.   29.347s (  0.5%) do not load outdated values
 28.   29.274s (  0.5%) bucket entry counters
 29.   29.259s (  0.5%) Flooding
 30.   29.251s (  0.5%) loadPoolShareTrustLinesByAccountAndAsset
 31.   27.614s (  0.5%) tx set hits overlay byte limit during construction
 32.   26.118s (  0.5%) surge pricing with DEX separation holds invariants
 33.   25.319s (  0.4%) BucketListIsConsistentWithDatabase bucket bounds
 34.   24.587s (  0.4%) create offer
 35.   23.430s (  0.4%) soroban txs accepted by the network
 36.   22.122s (  0.4%) Catchup manual
 37.   20.161s (  0.3%) parallel tx set building
 38.   20.112s (  0.3%) CKMSSample gamma distribution
 39.   19.430s (  0.3%) 3 nodes 2 running threshold 2
 40.   17.470s (  0.3%) pathpayment
 41.   16.893s (  0.3%) BucketListIsConsistentWithDatabase added entries
 42.   16.388s (  0.3%) survey request process order
 43.   15.033s (  0.3%) METADATA_DEBUG_LEDGERS works
 44.   15.033s (  0.3%) Soroban footprint validation
 45.   14.798s (  0.3%) getNodeWeight
 46.   14.457s (  0.2%) Soroban authorization
 47.   13.733s (  0.2%) History catchup with different modes
 48.   13.383s (  0.2%) txset nomination
 49.   11.820s (  0.2%) CKMSSample uniform distribution
 50.   11.771s (  0.2%) revoke from pool

Top 50 tests account for 5360.92s (92.5% of total time)

@dmkozh (Contributor) left a comment:

I can see that 'archival meta' has not been moved to acceptance, even though it's one of the slowest tests we have. I have mixed feelings about it; it is sensitive to fee changes and it often breaks. Maybe we should think about how to separate it into a basic test that covers most of the logic quickly, and then a slower acceptance test with more forgiving assertions? cc @SirTyson

@SirTyson (Contributor) commented:

> I can see that 'archival meta' has not been moved to acceptance, even though it's one of the slowest tests we have. I have mixed feelings about it; it is sensitive to fee changes and it often breaks. Maybe we should think about how to separate it into a basic test that covers most of the logic quickly, and then a slower acceptance test with more forgiving assertions? cc @SirTyson

I think that given the number of updates and breaks that particular test has seen in p23, we should keep it in non-acceptance. In general, our meta tests seem somewhat lacking and need improvement. For this particular test, I think it will be fairly easy to speed up significantly, which I'll work on as a follow-up.

@marta-lokhova (Contributor, Author) commented:

Sad news: I looked a bit more into our [history] tests, which seem to consume the most time, and it looks like our history testing infrastructure is buggy and doesn't handle failed catchup test scenarios correctly (resulting in basically sleeping in tests for minutes). The buggy behavior has been around for years (going back to 2018, it seems). The fixes seem fairly simple, so I'm pushing those as well, as they speed up the suite dramatically, e.g.

  1.    4.092s (100.0%) Catchup fatal failure (was 264.985s)
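
For illustration only, here is a hypothetical Python sketch of the failure mode described above (not stellar-core's actual test harness; node.catchup_succeeded and node.catchup_failed_fatally are made-up names): a wait loop that only checks for success sleeps through its full timeout even after catchup has already failed fatally, while also checking a terminal failure state returns immediately.

# Hypothetical sketch of the bug class (not stellar-core code): the "buggy"
# variant only polls for success, so a catchup that has already failed fatally
# still burns the full timeout; the "fixed" variant also checks a terminal
# failure state and returns right away.
import time

def wait_for_catchup_buggy(node, timeout_s=300.0, poll_s=1.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if node.catchup_succeeded():  # hypothetical accessor
            return True
        time.sleep(poll_s)  # keeps sleeping even though catchup is already dead
    return False

def wait_for_catchup_fixed(node, timeout_s=300.0, poll_s=1.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if node.catchup_succeeded():  # hypothetical accessor
            return True
        if node.catchup_failed_fatally():  # hypothetical accessor
            return False  # fail fast instead of waiting out the timeout
        time.sleep(poll_s)
    return False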

@SirTyson (Contributor) left a comment:

Thanks for looking into the test backlogs! Just a couple of small questions.

@marta-lokhova marked this pull request as ready for review on July 28, 2025 at 22:20
@marta-lokhova (Contributor, Author) commented:

There's a surprising perf issue in the "LedgerTxnRoot prefetch classic entries" test: this call takes almost a minute, which seems way too high given that we only generate 1000 valid classic entries.

@marta-lokhova (Contributor, Author) commented:

@dmkozh @SirTyson Your comments should be addressed now

dmkozh previously approved these changes Jul 29, 2025
SirTyson previously approved these changes Jul 29, 2025
Successfully merging this pull request may close these issues: "Use root account in loadgen" and "Audit test suite for misuse of SECTION".