MDEV-21923 LSN allocation is a bottleneck #3925
Conversation
mariadb-DebarunBanerjee left a comment
Thanks @dr-m for the improvement. I have added my comments and findings for your consideration.
storage/innobase/log/log0recv.cc (outdated)
```cpp
  if (!recv_needed_recovery)
    return;
  recv_needed_recovery= false;
  log_sys.latch.wr_lock(SRW_LOCK_CALL);
  if (srv_operation <= SRV_OPERATION_EXPORT_RESTORED &&
      lsn - log_sys.next_checkpoint_lsn < log_sys.log_capacity)
    /* Write a FILE_CHECKPOINT marker as the first thing,
    before generating any other redo log. This ensures
    that subsequent crash recovery will be possible even
    if the server were killed soon after this. */
    fil_names_clear(log_sys.next_checkpoint_lsn);
  log_sys.latch.wr_unlock();
}
```
I see this code segment is moved up. In general, doing so is not immune to regression, and it is better to have a test case that can validate this part. Do we have any such test, or can we add some crash-point test?
I moved the code because we can’t write to the log before log_sys.base_lsn and log_sys.write_lsn_offset have been initialized. I now see that it might be fine at the original call site; log_sys.set_recovered_lsn() should initialize them. I’ll remove the fil_names_clear() call and see if any tests would fail. I see that the equivalent call was there already in 2e814d4, and I can’t immediately think why recovery would fail if this write weren’t there.
No mtr test failed when I removed the equivalent code in another branch. This code had been moved here because we can no longer write any log records before log_sys.clear_mmap() has been invoked, right before the code that is quoted above. I will ask for a new stress test run without this fil_names_clear() call. If it is OK, the code can be removed.
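For the crash-point test discussed above, one option would be a debug injection right after the marker is written. The following is only a sketch: DBUG_EXECUTE_IF and DBUG_SUICIDE() are existing MariaDB debug macros, but the injection-point name is invented here, and the corresponding mtr test would still need to be written:

```cpp
    /* Write the FILE_CHECKPOINT marker as the first thing. */
    fil_names_clear(log_sys.next_checkpoint_lsn);
    /* Hypothetical crash point: a test could enable this via
    SET GLOBAL debug_dbug='+d,crash_after_file_checkpoint' and then
    verify that the server still recovers after being killed here. */
    DBUG_EXECUTE_IF("crash_after_file_checkpoint", DBUG_SUICIDE(););
```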
```cpp
  /* This synchronizes with get_lsn_approx();
  we must store write_lsn_offset before base_lsn. */
  write_lsn_offset.store(0, std::memory_order_relaxed);
  base_lsn.store(new_base_lsn, std::memory_order_release);
  flushed_to_disk_lsn.store(lsn, std::memory_order_relaxed);
```
OK, the acquire/release memory order usage looks fine here for synchronizing with get_lsn_approx(). However, I see that get_lsn_approx() is used in the page cleaner recommendation, and it should be fine to use flushed_to_disk_lsn there instead and get rid of this approximate calculation. What do you think?
I initially was referring to log_sys.flushed_to_disk_lsn in the buf_flush_page_cleaner() thread. I don’t remember anymore if I observed actual hangs related to it lagging too much behind log_sys.get_lsn(). In any case, if the server is running with innodb_flush_log_at_trx_commit=0, it could take up to innodb_flush_log_at_timeout (maximum: 2700 seconds) before a fdatasync() or fsync() on the ib_logfile0 is executed and the durable LSN could advance. We would not want to block the page cleaning activity because of that.
The logic is that the page cleaner first determines how many pages to write out, based on the difference between the latest checkpoint and the latest LSN. Then, when executing buf_page_t::flush() on each chosen page, it will ensure that everything up to the FIL_PAGE_LSN has been durably written to the log. If no pages can be chosen for flushing, then nothing in the page cleaner will force durable log writes. I did not want to introduce an additional durable log write in the page cleaner. We actually had such a write there between 3a9a3be and 762bcb8.
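For illustration, the load side of this release/acquire pairing might look roughly like the sketch below. The member names follow the hunk above, but the mask constant and the exact bit layout of write_lsn_offset are assumptions made for this sketch:

```cpp
#include <atomic>
#include <cstdint>

using lsn_t = uint64_t;

// Hypothetical layout: the low bits of write_lsn_offset hold the byte
// offset relative to base_lsn; higher bits carry bookkeeping flags.
constexpr uint64_t WRITE_TO_BUF = uint64_t{1} << 48;

std::atomic<lsn_t> base_lsn;
std::atomic<uint64_t> write_lsn_offset;

// Lower bound of the current LSN. Because the writer stores
// write_lsn_offset (relaxed) before base_lsn (release), an acquire load
// that observes the new base_lsn is guaranteed to also observe the reset
// offset; observing a stale base_lsn can only underestimate the LSN.
lsn_t get_lsn_approx() noexcept
{
  const lsn_t base = base_lsn.load(std::memory_order_acquire);
  const uint64_t off =
      write_lsn_offset.load(std::memory_order_relaxed) & (WRITE_TO_BUF - 1);
  return base + off;
}
```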
```cpp
  while (UNIV_UNLIKELY((l= write_lsn_offset.fetch_add(size + WRITE_TO_BUF) &
                        (WRITE_TO_BUF - 1)) >= buf_size))
  {
    b= append_prepare_wait<spin>(b, ex, l);
    /* While flushing log, we had released the lsn lock and LSN could have
    progressed in the meantime. */
    l= lsn.load(std::memory_order_relaxed);
    end_lsn= l + size;
    /* The following is inlined here instead of being part of
    append_prepare_wait(), in order to increase the locality of reference
    and to set the WRITE_BACKOFF flag as soon as possible. */
    bool late(write_lsn_offset.fetch_or(WRITE_BACKOFF) & WRITE_BACKOFF);
    /* Subtract our LSN overshoot. */
    write_lsn_offset.fetch_sub(size);
    append_prepare_wait(late, ex);
```
This is a note, not necessarily an issue.
The design here is that when the redo log buffer is close to full, the threads compete to set the WRITE_BACKOFF flag; whichever thread succeeds first goes ahead, writes the log buffer to the file (log_write_up_to()) and frees up space, while the other threads back off and wait for the WRITE_BACKOFF flag to be reset.
There is one non-intuitive scenario here: a concurrent thread with a smaller mtr size can sneak in and write its redo log to the leftover space in the redo log buffer while other threads (with bigger redo logs to write) are synchronizing to write the log buffer to file. This is possible because we release the write latch before flushing the log.
I still don't see any issue with that, as log_write_up_to() would reacquire the write latch and should synchronize with such a thread, but I thought it worth mentioning.
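To make the handshake concrete, here is a self-contained toy model of the WRITE_BACKOFF protocol using plain std::atomic. The flag value and the function name are assumptions for illustration; the real code uses Atomic_relaxed and performs the buffer write via log_write_up_to():

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical flag bit; in InnoDB this lives inside write_lsn_offset.
constexpr uint64_t WRITE_BACKOFF = uint64_t{1} << 62;

std::atomic<uint64_t> write_lsn_offset{0};

// Invoked when a writer finds the log buffer too full to append.
// The first thread to set WRITE_BACKOFF "wins" and writes the buffer
// out; latecomers back off until the flag has been cleared.
void on_buffer_full(unsigned id)
{
  const bool late =
      write_lsn_offset.fetch_or(WRITE_BACKOFF) & WRITE_BACKOFF;
  if (!late)
  {
    std::printf("thread %u writes the log buffer to the file\n", id);
    // ...log_write_up_to() would run here; then clear offset and flag.
    write_lsn_offset.store(0);
  }
  else
    while (write_lsn_offset.load() & WRITE_BACKOFF)
      std::this_thread::yield(); // wait for the winner to clear the flag
}

int main()
{
  std::vector<std::thread> writers;
  for (unsigned i = 0; i < 4; i++)
    writers.emplace_back(on_buffer_full, i);
  for (auto &t : writers)
    t.join();
}
```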
Another point to note is that in a usual OLTP test scenario it might be hard to hit this code path, as the redo log would be written frequently during COMMIT and may never actually reach this WRITE_BACKOFF synchronization. To test this part comprehensively, I thought of the following steps:
- Avoid flushing the log at transaction commit: innodb_flush_log_at_trx_commit = 0
- Avoid flushing the log periodically: innodb_flush_log_at_timeout = 300
I would try one round with this in my sysbench run with a release binary.
Marko, if you think it makes sense to run our mtr tests with such a configuration in debug mode, can you please try?
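For reference, the steps above would translate into a server configuration along these lines (the values are the ones suggested in this comment):

```
[mysqld]
innodb_flush_log_at_trx_commit = 0
innodb_flush_log_at_timeout = 300
```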
recv_sys_t::report_progress(): Display the largest currently known LSN.

recv_scan_log(): Display an error with fewer function calls.

Reviewed by: Debarun Banerjee
The parameter innodb_log_spin_wait_delay will be deprecated and ignored, because there is no spin loop anymore.

Thanks to commit 685d958 and commit a635c40, multiple mtr_t::commit() can concurrently copy their slice of mtr_t::m_log to the shared log_sys.buf. Each writer would allocate their own log sequence number by invoking log_t::append_prepare() while holding a shared log_sys.latch. This function was too heavy, because it would invoke a minimum of 4 atomic read-modify-write operations as well as system calls in the supposedly fast code path.

It turns out that with a simpler data structure, instead of having several data fields that needed to be kept consistent with each other, we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which we can operate using fetch_add(), fetch_sub() as well as a single-bit fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later) can translate into loop-free code on AMD64.

Before anything can be written to the log, log_sys.clear_mmap() must be invoked.

log_t::base_lsn: The LSN of the last write_buf() or persist(). This is a rough approximation of log_sys.lsn, which will be removed.

log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers updates of write_to_buf and base_lsn.

log_t::buf_free, log_t::max_buf_free, log_t::lsn: Remove. Replaced by base_lsn and write_lsn_offset.

log_t::buf_size: Always reflects the usable size in append_prepare().

log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(), there will be a resize_wrap_mutex.

log_t::get_lsn_approx(): Return a lower bound of get_lsn(). This should be exact unless append_prepare_wait() is pending.

log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked while holding an exclusive log_sys.latch.

recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear(); it would seem to be unnecessary.

In many places, references to log_sys.get_lsn() are replaced with log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().

Reviewed by: Debarun Banerjee
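As an aside, the single-word bookkeeping that this commit message describes can be sketched roughly as follows. The bit layout is an assumption; the interpretation of the WRITE_TO_BUF unit as a write counter follows the commit message's mention of write_to_buf, and the helper names are invented for this sketch:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: bits below WRITE_TO_BUF hold the byte offset of
// the next free position in the log buffer; each writer also adds one
// WRITE_TO_BUF unit, so the high bits count writes to the buffer.
constexpr uint64_t WRITE_TO_BUF = uint64_t{1} << 48;

std::atomic<uint64_t> write_lsn_offset{0};

// Reserve `size` bytes in a single read-modify-write: the low bits
// advance by `size` (the LSN offset), and one WRITE_TO_BUF unit is
// added to the counter kept in the high bits. The return value is the
// byte offset at which this writer may copy its log records.
uint64_t reserve(uint64_t size) noexcept
{
  return write_lsn_offset.fetch_add(size + WRITE_TO_BUF,
                                    std::memory_order_relaxed) &
         (WRITE_TO_BUF - 1);
}

// If the reserved range does not fit in the buffer, the overshoot is
// backed out with a single fetch_sub, as in the review hunk earlier.
void undo_overshoot(uint64_t size) noexcept
{
  write_lsn_offset.fetch_sub(size, std::memory_order_relaxed);
}
```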
FYI: This crashed with an assertion on a rare (only?) buildbot builder, which is both Debug and does not run tests in shared memory, thus does not use "mmap".
@vaintroub, sorry, I had overlooked your note. The logs for that failure have been purged, but I believe that it was due to a race condition in an assertion that is only present in debug builds. That assertion was removed later in #4038.
Description
Thanks to 685d958 and a635c40, multiple `mtr_t::commit()` can concurrently copy their slice of `mtr_t::m_log` to the shared `log_sys.buf`. They would allocate their own log sequence number by invoking `log_t::append_prepare()` while holding a shared `log_sys.latch`. This function was too heavy, because it would invoke a minimum of 4 atomic read-modify-write operations as well as system calls in the supposedly fast code path.

It turns out that with a more clever data structure, instead of having several data fields that needed to be kept consistent with each other, we only need one `Atomic_relaxed<uint64_t> write_lsn_offset`, on which we can operate using `fetch_add()`, `fetch_sub()` as well as a single-bit `fetch_or()`, which reasonably modern compilers (GCC 7 counts as one) can translate into loop-free code on AMD64.

`log_t::base_lsn`: The LSN of the last `write_buf()` or `persist()`. This is a rough approximation of `log_sys.lsn`, which will be removed.

`log_t::write_lsn_offset`: An `Atomic_relaxed<uint64_t>` that buffers updates of `write_to_buf` and `base_lsn`.

`log_t::buf_free`, `log_t::max_buf_free`, `log_t::lsn`: Remove. Replaced by `base_lsn` and `write_lsn_offset`.

`log_t::buf_size`: Always reflects the usable size in `append_prepare()`.

`log_t::lsn_lock`: Remove. For the memory-mapped log in `resize_write()`, there will be a `resize_wrap_mutex`.

`log_t::get_lsn_approx()`: Return a lower bound of `get_lsn()`. This should be exact unless `append_prepare_wait()` is pending.

`log_get_lsn()`: A wrapper for `log_sys.get_lsn()`, which must be invoked while holding an exclusive `log_sys.latch`.

In many places, references to `log_sys.get_lsn()` are replaced with `log_sys.get_flushed_lsn()`, which remains a simple `std::atomic::load()`.

Release Notes
InnoDB should perform faster in write-heavy workloads.
The parameter `innodb_log_spin_wait_delay` will be deprecated and ignored, because there is no spin loop anymore.

How can this PR be tested?
This is very low-level functionality that is reasonably well covered by the regression test suite as well as our regular stress tests.
It should be noted that there are two log write code paths on some 64-bit Linux platforms, unless `cmake -DWITH_INNODB_PMEM=OFF` has been specified. By default, both code paths will be included in the build when available, and `/dev/shm` is treated as "fake PMEM". I have tested both code paths in a few GNU/Linux environments.

Basing the PR against the correct MariaDB version

`main` branch.

PR quality check