
Speed up test suite #4849


Open · wants to merge 5 commits into master

Conversation

@marta-lokhova (Contributor) commented on Jul 26, 2025:

Resolves https://github.com/stellar/stellar-core-internal/issues/371
Resolves #2100
Resolves #4445

Looks like our test suite has accumulated quite a lot of tech debt and is taking much longer than it should (this came up in a recent conversation with @drebelsky about simple tests like "txset nomination" taking ~35 seconds).

This PR makes several changes:

  • Add a helper script to audit the test suite and identify the biggest offenders (written by @claude)

The script can be run on all tests, any partition, or an individual test (a minimal sketch of the approach appears after this list). Sample output:

python3 run_test_timing_analysis.py -p overlay -n 10

Found 62 tests to run
Running each test individually to measure wall-clock time...
This may take a while depending on the number of tests...
--------------------------------------------------------------------------------
[1/62] Running: Flooding
[2/62] Running: ItemFetcher fetches
[3/62] Running: next peer strategy
...
[62/62] Running: Tracker works

================================================================================
TEST TIMING ANALYSIS REPORT (Wall-Clock Time)
================================================================================

Total tests: 62
Total time: 350.01s
Average time per test: 5.645s

Top 10 Slowest Tests:
--------------------------------------------------------------------------------
  1.  152.442s ( 43.6%) overlay flow control
  2.   50.374s ( 14.4%) TCPPeer read malformed messages
  3.   46.080s ( 13.2%) overlay pull mode loadgen
  4.   19.691s (  5.6%) survey request process order
  5.   14.766s (  4.2%) Flooding
  6.   11.077s (  3.2%) inbounds nodes can be promoted to ouboundvalid
  7.   10.106s (  2.9%) TCPPeer lifetime
  8.    9.474s (  2.7%) Time sliced static topology survey
  9.    6.130s (  1.8%) connecting to saturated nodes
 10.    5.087s (  1.5%) peer is purged from database after few failures

Top 10 tests account for 325.23s (92.9% of total time)

Warning: 1 tests failed
  • One of the biggest issues is that we close a ledger every time we create a new account (some tests create thousands of accounts, which is slow even in virtual time). I switched those tests to use GENESIS_TEST_ACCOUNT_COUNT and observed some drastic speedups, e.g. "SCP Driver"
// Before
stellar-core test -a "SCP Driver"  30.92s user 49.76s system 37% cpu 3:35.51 total
// With GENESIS_TEST_ACCOUNT_COUNT
stellar-core test -a "SCP Driver"  1.89s user 0.20s system 73% cpu 2.840 total
  • Move a couple of long-running tests to acceptance
  • Tweak how load is generated in tests running in TCP mode, since those run in real time. A remaining TODO is to switch them to accelerated time.
  • Remove old upgrade tests - since the network is way past protocol 10, 11, and 12, replaying upgrades via catchup should be sufficient.
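
For reference, here is a minimal sketch of the timing-audit approach in Python. This is not the actual run_test_timing_analysis.py from this PR; the binary path, the Catch2-style --list-test-names-only flag, and the "[partition]" tag selection are assumptions for illustration only.

# Minimal sketch of a per-test wall-clock timing audit (illustration only; not
# the actual run_test_timing_analysis.py). Assumes the test binary forwards
# Catch2-style arguments: "--list-test-names-only" to enumerate tests, a
# "[partition]" tag to filter, and a quoted test name to run a single test.
import subprocess
import time

CORE = "./stellar-core"  # assumed binary path

def list_tests(partition=None):
    args = [CORE, "test", "--list-test-names-only"]
    if partition:
        args.append(f"[{partition}]")
    out = subprocess.run(args, capture_output=True, text=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def time_test(name):
    # Wall-clock time for one test run in its own process.
    start = time.monotonic()
    result = subprocess.run([CORE, "test", name], capture_output=True)
    return time.monotonic() - start, result.returncode

def main(partition=None, top_n=10):
    tests = list_tests(partition)
    print(f"Found {len(tests)} tests to run")
    timings, failures = [], 0
    for i, name in enumerate(tests, 1):
        print(f"[{i}/{len(tests)}] Running: {name}")
        elapsed, rc = time_test(name)
        timings.append((elapsed, name))
        failures += rc != 0
    timings.sort(reverse=True)
    total = sum(t for t, _ in timings)
    print(f"Total tests: {len(tests)}")
    print(f"Total time: {total:.2f}s")
    for rank, (elapsed, name) in enumerate(timings[:top_n], 1):
        print(f"{rank:3d}. {elapsed:9.3f}s ({elapsed / total:6.1%}) {name}")
    if failures:
        print(f"Warning: {failures} tests failed")

if __name__ == "__main__":
    main(partition="overlay", top_n=10)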

Opening this as a draft, since I haven't finished auditing all the tests flagged by the script.

@marta-lokhova (Contributor, Author) commented:

Here are the top 50 slowest tests (a lot of them already live in acceptance, but some, like archival meta, run in every GH action). Note that these results are based on a regular core build; I imagine that with ASan, which is our default in CI, things are much worse.

Top 50 Slowest Tests:
--------------------------------------------------------------------------------
  1. 1392.434s ( 24.0%) History catchup
  2.  686.091s ( 11.8%) archival meta
  3.  482.937s (  8.3%) Catchup recent
  4.  417.904s (  7.2%) overlay parallel processing
  5.  287.317s (  5.0%) Publish catchup alternation with stall
  6.  264.985s (  4.6%) Catchup fatal failure
  7.  152.276s (  2.6%) overlay flow control
  8.  125.216s (  2.2%) LedgerTxnRoot prefetch classic entries
  9.  118.261s (  2.0%) ledger state update flow with parallel apply
 10.  117.431s (  2.0%) bucketmanager reattach to running merge
 11.  109.898s (  1.9%) write verified checkpoint hashes
 12.   80.054s (  1.4%) Retriggering catchups after trimming mSyncingLedgers
 13.   78.248s (  1.4%) History publish with restart
 14.   65.993s (  1.1%) Catchup failure recovery with buffered checkpoint
 15.   65.692s (  1.1%) catchup with a gap
 16.   58.802s (  1.0%) Publish throttles catchup
 17.   49.920s (  0.9%) TCPPeer read malformed messages
 18.   46.214s (  0.8%) overlay pull mode loadgen
 19.   41.728s (  0.7%) History bucket verification
 20.   39.476s (  0.7%) apply load
 21.   36.576s (  0.6%) generate load with unique accounts
 22.   34.024s (  0.6%) bl cache
 23.   32.347s (  0.6%) soroban cache population
 24.   31.407s (  0.5%) Catchup with protocol upgrade
 25.   29.497s (  0.5%) key-value lookup
 26.   29.390s (  0.5%) loadPoolShareTrustLinesByAccountAndAsset does not load outdated versions
 27.   29.347s (  0.5%) do not load outdated values
 28.   29.274s (  0.5%) bucket entry counters
 29.   29.259s (  0.5%) Flooding
 30.   29.251s (  0.5%) loadPoolShareTrustLinesByAccountAndAsset
 31.   27.614s (  0.5%) tx set hits overlay byte limit during construction
 32.   26.118s (  0.5%) surge pricing with DEX separation holds invariants
 33.   25.319s (  0.4%) BucketListIsConsistentWithDatabase bucket bounds
 34.   24.587s (  0.4%) create offer
 35.   23.430s (  0.4%) soroban txs accepted by the network
 36.   22.122s (  0.4%) Catchup manual
 37.   20.161s (  0.3%) parallel tx set building
 38.   20.112s (  0.3%) CKMSSample gamma distribution
 39.   19.430s (  0.3%) 3 nodes 2 running threshold 2
 40.   17.470s (  0.3%) pathpayment
 41.   16.893s (  0.3%) BucketListIsConsistentWithDatabase added entries
 42.   16.388s (  0.3%) survey request process order
 43.   15.033s (  0.3%) METADATA_DEBUG_LEDGERS works
 44.   15.033s (  0.3%) Soroban footprint validation
 45.   14.798s (  0.3%) getNodeWeight
 46.   14.457s (  0.2%) Soroban authorization
 47.   13.733s (  0.2%) History catchup with different modes
 48.   13.383s (  0.2%) txset nomination
 49.   11.820s (  0.2%) CKMSSample uniform distribution
 50.   11.771s (  0.2%) revoke from pool

Top 50 tests account for 5360.92s (92.5% of total time)

@dmkozh (Contributor) left a comment:

I can see that 'archival meta' has not been moved to acceptance, even though it's one of the slowest tests we have. I have mixed feelings about it; it is sensitive to fee changes and it often breaks. Maybe we should think about how to separate it into a basic test that covers most of the logic quickly, and then a slower acceptance test with more forgiving assertions? cc @SirTyson

@SirTyson (Contributor) commented:

> I can see that 'archival meta' has not been moved to acceptance, even though it's one of the slowest tests we have. I have mixed feelings about it; it is sensitive to fee changes and it often breaks. Maybe we should think about how to separate it into a basic test that covers most of the logic quickly, and then a slower acceptance test with more forgiving assertions? cc @SirTyson

I think that given the number of updates and breaks that particular test has seen in p23, we should keep it in non-acceptance. In general, our meta tests seem somewhat lacking and need improvement. For this particular test, I think it will be fairly easy to speed up significantly, which I'll work on as a follow-up.

@marta-lokhova (Contributor, Author) commented:

Sad news: I looked a bit more into our [history] tests, which seem to consume the most time, and it looks like our history testing infrastructure is buggy and doesn't handle failed catchup test scenarios correctly (resulting in basically sleeping in tests for minutes). The buggy behavior has been around for years (going back to 2018, it seems). The fixes seem fairly simple, so I'm pushing those as well, as they speed up the suite dramatically, e.g.

  1.    4.092s (100.0%) Catchup fatal failure (was 264.985s)
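
For illustration only, here is a hypothetical Python sketch of the failure mode described above (not stellar-core's actual test harness; node.catchup_succeeded and node.catchup_failed_fatally are made-up names): a wait loop that only checks for success sleeps through its full timeout even after catchup has already failed fatally, while also checking a terminal failure state returns immediately.

# Hypothetical sketch of the bug class (not stellar-core code): the "buggy"
# variant only polls for success, so a catchup that has already failed fatally
# still burns the full timeout; the "fixed" variant also checks a terminal
# failure state and returns right away.
import time

def wait_for_catchup_buggy(node, timeout_s=300.0, poll_s=1.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if node.catchup_succeeded():  # hypothetical accessor
            return True
        time.sleep(poll_s)  # keeps sleeping even though catchup is already dead
    return False

def wait_for_catchup_fixed(node, timeout_s=300.0, poll_s=1.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if node.catchup_succeeded():  # hypothetical accessor
            return True
        if node.catchup_failed_fatally():  # hypothetical accessor
            return False  # fail fast instead of waiting out the timeout
        time.sleep(poll_s)
    return False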

@SirTyson (Contributor) left a comment:

Thanks for looking into the test backlogs! Just a couple of small questions.

@marta-lokhova marked this pull request as ready for review on July 28, 2025 at 22:20
@marta-lokhova (Contributor, Author) commented:

There's a surprising perf issue in the "LedgerTxnRoot prefetch classic entries" test: this call takes almost a minute, which seems way too high given that we only generate 1000 valid classic entries.

@marta-lokhova (Contributor, Author) commented:

@dmkozh @SirTyson Your comments should be addressed now

dmkozh previously approved these changes Jul 29, 2025
SirTyson previously approved these changes Jul 29, 2025
Successfully merging this pull request may close these issues: "Use root account in loadgen" and "Audit test suite for misuse of SECTION".