
Conversation

@dr-m dr-m commented Mar 25, 2025

  • The Jira issue number for this PR is: MDEV-21923

Description

Thanks to commits 685d958 and a635c40, multiple mtr_t::commit() calls can concurrently copy their slices of mtr_t::m_log to the shared log_sys.buf. Each would allocate its own log sequence number by invoking log_t::append_prepare() while holding a shared log_sys.latch. This function was too heavy, because it would invoke a minimum of 4 atomic read-modify-write operations as well as system calls in the supposedly fast code path.

It turns out that with a cleverer data structure, instead of several data fields that needed to be kept mutually consistent, we only need a single Atomic_relaxed&lt;uint64_t&gt; write_lsn_offset, on which we can operate using fetch_add(), fetch_sub(), and a single-bit fetch_or(), which reasonably modern compilers (GCC 7 counts as one) can translate into loop-free code on AMD64.
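To illustrate the idea, the following sketch packs a byte offset and a status bit into a single atomic word, in the manner the paragraph above describes. The constants, bit positions, and function names here are assumptions for illustration, not the PR's actual declarations:

```cpp
#include <atomic>
#include <cstdint>

// Assumed layout: low 32 bits hold the byte offset into the log buffer,
// bits above WRITE_TO_BUF count pending writers, and one high bit is a
// back-off flag. One fetch_add() then both allocates an LSN range and
// lets a full buffer be detected from the returned offset.
constexpr uint64_t WRITE_TO_BUF = uint64_t{1} << 32;
constexpr uint64_t WRITE_BACKOFF = uint64_t{1} << 62;

std::atomic<uint64_t> write_lsn_offset{0};

// Allocate `size` bytes in the log buffer; returns the previous offset.
uint64_t append_offset(uint64_t size)
{
  // A single loop-free read-modify-write bumps both the offset
  // (low bits) and the pending-writer count (high bits).
  return write_lsn_offset.fetch_add(size + WRITE_TO_BUF,
                                    std::memory_order_relaxed) &
         (WRITE_TO_BUF - 1);
}

// Set the single back-off bit; returns true if it was already set,
// that is, if another thread won the race.
bool set_backoff()
{
  return (write_lsn_offset.fetch_or(WRITE_BACKOFF,
                                    std::memory_order_relaxed) &
          WRITE_BACKOFF) != 0;
}
```

On AMD64, the single-bit fetch_or() can compile to a `lock bts` instruction rather than a compare-and-swap loop, which is what makes this fast path loop-free.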

log_t::base_lsn: The LSN of the last write_buf() or persist(). This is a rough approximation of log_sys.lsn, which will be removed.

log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers updates of write_to_buf and base_lsn.

log_t::buf_free, log_t::max_buf_free, log_t::lsn. Remove. Replaced by base_lsn and write_lsn_offset.

log_t::buf_size: Always reflects the usable size in append_prepare().

log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(), there will be a resize_wrap_mutex.

log_t::get_lsn_approx(): Return a lower bound of get_lsn(). This should be exact unless append_prepare_wait() is pending.
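A minimal sketch of how such a lower-bound computation can work, assuming a bit layout where the low bits of write_lsn_offset hold the bytes appended since base_lsn. The memory orders are modeled on the comments quoted later in this review, not copied from the patch:

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t WRITE_TO_BUF = uint64_t{1} << 32;

std::atomic<uint64_t> base_lsn{0};          // LSN of last write_buf()/persist()
std::atomic<uint64_t> write_lsn_offset{0};  // low bits: bytes since base_lsn

uint64_t get_lsn_approx()
{
  // The writer stores write_lsn_offset (resetting it to 0) before
  // storing base_lsn with memory_order_release. Loading base_lsn first
  // with memory_order_acquire therefore guarantees: if we observe the
  // new base, we also observe the reset offset, so the sum can only
  // under-approximate, never overshoot, the exact LSN.
  uint64_t base = base_lsn.load(std::memory_order_acquire);
  uint64_t off = write_lsn_offset.load(std::memory_order_relaxed) &
                 (WRITE_TO_BUF - 1);
  return base + off;
}
```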

log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked while holding an exclusive log_sys.latch.

In many places, references to log_sys.get_lsn() are replaced with log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().

Release Notes

InnoDB should perform faster in write-heavy workloads.

The parameter innodb_log_spin_wait_delay will be deprecated and ignored, because there is no spin loop anymore.

How can this PR be tested?

This is very low-level functionality that is reasonably well covered by the regression test suite as well as our regular stress tests.

It should be noted that there are two log-write code paths on some 64-bit Linux platforms, unless cmake -DWITH_INNODB_PMEM=OFF has been specified. By default, both code paths will be included in the build when available, and /dev/shm is treated as "fake PMEM". I have tested both code paths in a few GNU/Linux environments.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m self-assigned this Mar 25, 2025
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@dr-m dr-m force-pushed the 10.11-MDEV-21923 branch 5 times, most recently from 3514077 to 4b891c4 Compare April 2, 2025 08:53
@dr-m dr-m marked this pull request as ready for review April 2, 2025 09:03
@dr-m dr-m force-pushed the 10.11-MDEV-21923 branch 3 times, most recently from 2e40a2c to 689ae8f Compare April 3, 2025 05:40
@dr-m dr-m force-pushed the 10.11-MDEV-21923 branch 2 times, most recently from 5c0f18e to 595063a Compare April 8, 2025 05:41
Contributor

@mariadb-DebarunBanerjee mariadb-DebarunBanerjee left a comment

Thanks @dr-m for the improvement. I have added my comments and findings for your consideration.

  if (!recv_needed_recovery)
    return;
  recv_needed_recovery= false;
  log_sys.latch.wr_lock(SRW_LOCK_CALL);
  if (srv_operation <= SRV_OPERATION_EXPORT_RESTORED &&
      lsn - log_sys.next_checkpoint_lsn < log_sys.log_capacity)
    /* Write a FILE_CHECKPOINT marker as the first thing,
    before generating any other redo log. This ensures
    that subsequent crash recovery will be possible even
    if the server were killed soon after this. */
    fil_names_clear(log_sys.next_checkpoint_lsn);
  log_sys.latch.wr_unlock();
}
Contributor


I see this code segment is moved up. In general, doing so is not immune to regression, and it is better to have a test case that can validate this part. Do we have any such test, or can we add some crash-point test?

Contributor Author


I moved the code because we can’t write to the log before log_sys.base_lsn and log_sys.write_lsn_offset have been initialized. I now see that it might be fine at the original call site; log_sys.set_recovered_lsn() should initialize them. I’ll remove the fil_names_clear() call and see if any tests would fail. I see that the equivalent call was there already in 2e814d4, and I can’t immediately think why recovery would fail if this write weren’t there.

Contributor Author


No mtr test failed when I removed the equivalent code in another branch. This code had been moved here because we can no longer write any log records before log_sys.clear_mmap() has been invoked, right before the code that is quoted above. I will ask for a new stress test run without this fil_names_clear() call. If it is OK, the code can be removed.

Comment on lines +920 to +924
  /* This synchronizes with get_lsn_approx();
  we must store write_lsn_offset before base_lsn. */
  write_lsn_offset.store(0, std::memory_order_relaxed);
  base_lsn.store(new_base_lsn, std::memory_order_release);
  flushed_to_disk_lsn.store(lsn, std::memory_order_relaxed);
Contributor


OK, the acquire/release memory order usage looks fine here to synchronize with get_lsn_approx(). However, I see that get_lsn_approx() is used in the page cleaner recommendation, and it should be fine to use flushed_to_disk_lsn there instead and get rid of this approximate calculation. What do you think?

Contributor Author


I initially was referring to log_sys.flushed_to_disk_lsn in the buf_flush_page_cleaner() thread. I don’t remember anymore if I observed actual hangs related to it lagging too much behind log_sys.get_lsn(). In any case, if the server is running with innodb_flush_log_at_trx_commit=0, it could take up to innodb_flush_log_at_timeout (maximum: 2700 seconds) before a fdatasync() or fsync() on the ib_logfile0 is executed and the durable LSN could advance. We would not want to block the page cleaning activity because of that.

The logic is that the page cleaner first determines how many pages to write out, based on the difference between the latest checkpoint and the latest LSN. Then, when executing buf_page_t::flush() on each chosen page, it will ensure that everything up to the FIL_PAGE_LSN has been durably written to the log. If no pages can be chosen for flushing, then nothing in the page cleaner will force durable log writes. I did not want to introduce an additional durable log write in the page cleaner. We actually had such a write there between 3a9a3be and 762bcb8.
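The ordering described above (the log must be durable up to FIL_PAGE_LSN before the page write) is the classic write-ahead-logging invariant; it can be sketched as follows. All names here are illustrative stand-ins, not InnoDB's actual API:

```cpp
#include <algorithm>
#include <cstdint>

struct Log
{
  uint64_t flushed_to_disk_lsn = 0;
  void write_up_to(uint64_t lsn)   // stand-in for a durable log write (fsync)
  {
    flushed_to_disk_lsn = std::max(flushed_to_disk_lsn, lsn);
  }
};

struct Page
{
  uint64_t newest_modification;    // corresponds to FIL_PAGE_LSN
};

void flush_page(Log &log, const Page &page)
{
  // Write-ahead logging: only force a durable log write when this
  // page's LSN exceeds what is already durable; if no pages are chosen
  // for flushing, no log write is forced at all.
  if (log.flushed_to_disk_lsn < page.newest_modification)
    log.write_up_to(page.newest_modification);
  // ... now the page itself may be written to the data file ...
}
```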

Comment on lines +964 to +973
  while (UNIV_UNLIKELY((l= write_lsn_offset.fetch_add(size + WRITE_TO_BUF) &
                        (WRITE_TO_BUF - 1)) >= buf_size))
  {
    b= append_prepare_wait<spin>(b, ex, l);
    /* While flushing log, we had released the lsn lock and LSN could have
    progressed in the meantime. */
    l= lsn.load(std::memory_order_relaxed);
    end_lsn= l + size;
    /* The following is inlined here instead of being part of
    append_prepare_wait(), in order to increase the locality of reference
    and to set the WRITE_BACKOFF flag as soon as possible. */
    bool late(write_lsn_offset.fetch_or(WRITE_BACKOFF) & WRITE_BACKOFF);
    /* Subtract our LSN overshoot. */
    write_lsn_offset.fetch_sub(size);
    append_prepare_wait(late, ex);
Contributor


This is a note and not necessarily any issue.

The design here is that when the redo log buffer is close to full, the threads compete to set the WRITE_BACKOFF flag; whichever thread succeeds first goes ahead and writes the log buffer to the file (log_write_up_to()), freeing up space, while the other threads back off and wait for the WRITE_BACKOFF flag to be reset.

There is one non-intuitive scenario here, where a concurrent thread with a smaller mtr size can sneak in and write its redo log into the leftover space in the redo log buffer while the other threads (with bigger redo logs to write) are synchronizing to write the log buffer to the file. This is possible because we release the write latch before flushing the log.

I still don't see any issue with that, as log_write_up_to() would reacquire the write latch and should synchronize with such a thread, but I thought it worth mentioning.

Another point to note is that in a usual OLTP test scenario it might be hard to hit this code path, as the redo log would be written frequently during COMMIT and may never actually reach this WRITE_BACKOFF synchronization. To test this part comprehensively, I thought of the following steps.

  1. Avoid flushing the log at transaction commit: innodb_flush_log_at_trx_commit = 0
  2. Avoid flushing the log periodically: innodb_flush_log_at_timeout = 300

I will try one round with this in my sysbench run with a release binary.
Marko, if you think it makes sense to run our mtr tests with such a configuration in debug mode, can you please try?
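The two settings above would look like this in a server configuration file (a hypothetical my.cnf fragment):

```ini
[mysqld]
# 1. Avoid flushing the redo log at transaction commit
innodb_flush_log_at_trx_commit = 0
# 2. Avoid flushing the redo log periodically (flush only every 300 s)
innodb_flush_log_at_timeout = 300
```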

dr-m added 2 commits April 10, 2025 12:58
recv_sys_t::report_progress(): Display the largest currently known LSN.

recv_scan_log(): Display an error with fewer function calls.

Reviewed by: Debarun Banerjee
The parameter innodb_log_spin_wait_delay will be deprecated and
ignored, because there is no spin loop anymore.

Thanks to commit 685d958
and commit a635c40
multiple mtr_t::commit() can concurrently copy their slice of
mtr_t::m_log to the shared log_sys.buf.  Each writer would allocate
their own log sequence number by invoking log_t::append_prepare()
while holding a shared log_sys.latch.  This function was too heavy,
because it would invoke a minimum of 4 atomic read-modify-write
operations as well as system calls in the supposedly fast code path.

It turns out that with a simpler data structure, instead of having
several data fields that needed to be kept consistent with each other,
we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which
we can operate using fetch_add(), fetch_sub() as well as a single-bit
fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later)
can translate into loop-free code on AMD64.

Before anything can be written to the log, log_sys.clear_mmap()
must be invoked.

log_t::base_lsn: The LSN of the last write_buf() or persist().
This is a rough approximation of log_sys.lsn, which will be removed.

log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers
updates of write_to_buf and base_lsn.

log_t::buf_free, log_t::max_buf_free, log_t::lsn. Remove.
Replaced by base_lsn and write_lsn_offset.

log_t::buf_size: Always reflects the usable size in append_prepare().

log_t::lsn_lock: Remove.  For the memory-mapped log in resize_write(),
there will be a resize_wrap_mutex.

log_t::get_lsn_approx(): Return a lower bound of get_lsn().
This should be exact unless append_prepare_wait() is pending.

log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked
while holding an exclusive log_sys.latch.

recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear();
it would seem to be unnecessary.

In many places, references to log_sys.get_lsn() are replaced with
log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().

Reviewed by: Debarun Banerjee
@dr-m dr-m force-pushed the 10.11-MDEV-21923 branch from 595063a to acd071f Compare April 10, 2025 10:05
@dr-m dr-m merged commit acd071f into 10.11 Apr 10, 2025
12 of 13 checks passed
@dr-m dr-m deleted the 10.11-MDEV-21923 branch April 10, 2025 16:08
@vaintroub
Member

vaintroub commented Apr 11, 2025

FYI: this crashed with an assertion on a rare (the only?) buildbot builder that is both a debug build and does not run tests in shared memory, and thus does not use "mmap".

@dr-m
Contributor Author

dr-m commented May 23, 2025

@vaintroub, sorry, I had overlooked your note. The logs for that failure have been purged, but I believe that it was due to a race condition in an assertion that is only present in debug builds. That assertion was removed later in #4038.
