MDEV-21923 LSN allocation is a bottleneck #3925
Conversation
mariadb-DebarunBanerjee left a comment
Thanks @dr-m for the improvement. I have added my comments and findings for your consideration.
storage/innobase/log/log0recv.cc (outdated)
```cpp
  if (!recv_needed_recovery)
    return;
  recv_needed_recovery= false;
  log_sys.latch.wr_lock(SRW_LOCK_CALL);
  if (srv_operation <= SRV_OPERATION_EXPORT_RESTORED &&
      lsn - log_sys.next_checkpoint_lsn < log_sys.log_capacity)
    /* Write a FILE_CHECKPOINT marker as the first thing,
    before generating any other redo log. This ensures
    that subsequent crash recovery will be possible even
    if the server were killed soon after this. */
    fil_names_clear(log_sys.next_checkpoint_lsn);
  log_sys.latch.wr_unlock();
}
```
I see this code segment is moved up. In general, doing so is not immune to regression, and it is better to have a test case that can validate this part. Do we have any such test, or can we add some crash-point test?
I moved the code because we can’t write to the log before log_sys.base_lsn and log_sys.write_lsn_offset have been initialized. I now see that it might be fine at the original call site; log_sys.set_recovered_lsn() should initialize them. I’ll remove the fil_names_clear() call and see if any tests would fail. I see that the equivalent call was there already in 2e814d4, and I can’t immediately think why recovery would fail if this write weren’t there.
No mtr test failed when I removed the equivalent code in another branch. This code had been moved here because we can no longer write any log records before log_sys.clear_mmap() has been invoked, right before the code that is quoted above. I will ask for a new stress test run without this fil_names_clear() call. If it is OK, the code can be removed.
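For the crash-point test discussed above, one option would be a debug injection right after the marker is written. The following is only a sketch: DBUG_EXECUTE_IF and DBUG_SUICIDE() are existing MariaDB debug macros, but the injection-point name is invented here, and the corresponding mtr test would still need to be written:

```cpp
    /* Write the FILE_CHECKPOINT marker as the first thing. */
    fil_names_clear(log_sys.next_checkpoint_lsn);
    /* Hypothetical crash point: a test could enable this via
    SET GLOBAL debug_dbug='+d,crash_after_file_checkpoint' and then
    verify that the server still recovers after being killed here. */
    DBUG_EXECUTE_IF("crash_after_file_checkpoint", DBUG_SUICIDE(););
```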
```cpp
  /* This synchronizes with get_lsn_approx();
  we must store write_lsn_offset before base_lsn. */
  write_lsn_offset.store(0, std::memory_order_relaxed);
  base_lsn.store(new_base_lsn, std::memory_order_release);
  flushed_to_disk_lsn.store(lsn, std::memory_order_relaxed);
```
OK, the acquire/release memory order usage looks fine here for synchronizing with get_lsn_approx(). However, I see that get_lsn_approx() is used in the page cleaner recommendation, and it should be fine to use flushed_to_disk_lsn there instead and get rid of this approximate calculation. What do you think?
I initially was referring to log_sys.flushed_to_disk_lsn in the buf_flush_page_cleaner() thread. I don’t remember anymore if I observed actual hangs related to it lagging too much behind log_sys.get_lsn(). In any case, if the server is running with innodb_flush_log_at_trx_commit=0, it could take up to innodb_flush_log_at_timeout (maximum: 2700 seconds) before a fdatasync() or fsync() on the ib_logfile0 is executed and the durable LSN could advance. We would not want to block the page cleaning activity because of that.
The logic is that the page cleaner first determines how many pages to write out, based on the difference between the latest checkpoint and the latest LSN. Then, when executing buf_page_t::flush() on each chosen page, it will ensure that everything up to the FIL_PAGE_LSN has been durably written to the log. If no pages can be chosen for flushing, then nothing in the page cleaner will force durable log writes. I did not want to introduce an additional durable log write in the page cleaner. We actually had such a write there between 3a9a3be and 762bcb8.
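For illustration, the load side of this release/acquire pairing might look roughly like the sketch below. The member names follow the hunk above, but the mask constant and the exact bit layout of write_lsn_offset are assumptions made for this sketch:

```cpp
#include <atomic>
#include <cstdint>

using lsn_t = uint64_t;

// Hypothetical layout: the low bits of write_lsn_offset hold the byte
// offset relative to base_lsn; higher bits carry bookkeeping flags.
constexpr uint64_t WRITE_TO_BUF = uint64_t{1} << 48;

std::atomic<lsn_t> base_lsn;
std::atomic<uint64_t> write_lsn_offset;

// Lower bound of the current LSN. Because the writer stores
// write_lsn_offset (relaxed) before base_lsn (release), an acquire load
// that observes the new base_lsn is guaranteed to also observe the reset
// offset; observing a stale base_lsn can only underestimate the LSN.
lsn_t get_lsn_approx() noexcept
{
  const lsn_t base = base_lsn.load(std::memory_order_acquire);
  const uint64_t off =
      write_lsn_offset.load(std::memory_order_relaxed) & (WRITE_TO_BUF - 1);
  return base + off;
}
```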
```cpp
  while (UNIV_UNLIKELY((l= write_lsn_offset.fetch_add(size + WRITE_TO_BUF) &
                        (WRITE_TO_BUF - 1)) >= buf_size))
  {
    b= append_prepare_wait<spin>(b, ex, l);
    /* While flushing log, we had released the lsn lock and LSN could have
    progressed in the meantime. */
    l= lsn.load(std::memory_order_relaxed);
    end_lsn= l + size;
    /* The following is inlined here instead of being part of
    append_prepare_wait(), in order to increase the locality of reference
    and to set the WRITE_BACKOFF flag as soon as possible. */
    bool late(write_lsn_offset.fetch_or(WRITE_BACKOFF) & WRITE_BACKOFF);
    /* Subtract our LSN overshoot. */
    write_lsn_offset.fetch_sub(size);
    append_prepare_wait(late, ex);
```
This is a note, not necessarily an issue.
The design here is that when the redo log buffer is close to full, the threads compete to set the WRITE_BACKOFF flag; whichever thread succeeds first goes ahead, writes the log buffer to the file (log_write_up_to()) and frees up space, while the other threads back off and wait for the WRITE_BACKOFF flag to be reset.
There is one non-intuitive scenario here: a concurrent thread with a smaller mtr size can sneak in and write its redo log to the leftover space in the redo log buffer while other threads (with bigger redo logs to write) are synchronizing to write the log buffer to file. This is possible because we release the write latch before flushing the log.
I still don't see any issue with that, as log_write_up_to() would reacquire the write latch and should synchronize with such a thread, but I thought it worth mentioning.
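To make the handshake concrete, here is a self-contained toy model of the WRITE_BACKOFF protocol using plain std::atomic. The flag value and the function name are assumptions for illustration; the real code uses Atomic_relaxed and performs the buffer write via log_write_up_to():

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical flag bit; in InnoDB this lives inside write_lsn_offset.
constexpr uint64_t WRITE_BACKOFF = uint64_t{1} << 62;

std::atomic<uint64_t> write_lsn_offset{0};

// Invoked when a writer finds the log buffer too full to append.
// The first thread to set WRITE_BACKOFF "wins" and writes the buffer
// out; latecomers back off until the flag has been cleared.
void on_buffer_full(unsigned id)
{
  const bool late =
      write_lsn_offset.fetch_or(WRITE_BACKOFF) & WRITE_BACKOFF;
  if (!late)
  {
    std::printf("thread %u writes the log buffer to the file\n", id);
    // ...log_write_up_to() would run here; then clear offset and flag.
    write_lsn_offset.store(0);
  }
  else
    while (write_lsn_offset.load() & WRITE_BACKOFF)
      std::this_thread::yield(); // wait for the winner to clear the flag
}

int main()
{
  std::vector<std::thread> writers;
  for (unsigned i = 0; i < 4; i++)
    writers.emplace_back(on_buffer_full, i);
  for (auto &t : writers)
    t.join();
}
```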
Another point to note is that in a usual OLTP test scenario it might be hard to hit this code path, as the redo log would be written frequently during COMMIT and may never actually reach this WRITE_BACKOFF synchronization. To test this part comprehensively, I thought of the following steps:
- Avoid flushing the log at transaction commit: innodb_flush_log_at_trx_commit = 0
- Avoid flushing the log periodically: innodb_flush_log_at_timeout = 300
I would try one round with this in my sysbench run with a release binary.
Marko, if you think it makes sense to run our mtr tests with such a configuration in debug mode, can you please try?
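For reference, the steps above would translate into a server configuration along these lines (the values are the ones suggested in this comment):

```
[mysqld]
innodb_flush_log_at_trx_commit = 0
innodb_flush_log_at_timeout = 300
```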
recv_sys_t::report_progress(): Display the largest currently known LSN.

recv_scan_log(): Display an error with fewer function calls.

Reviewed by: Debarun Banerjee
The parameter innodb_log_spin_wait_delay will be deprecated and ignored, because there is no spin loop anymore.

Thanks to commit 685d958 and commit a635c40, multiple mtr_t::commit() can concurrently copy their slice of mtr_t::m_log to the shared log_sys.buf. Each writer would allocate their own log sequence number by invoking log_t::append_prepare() while holding a shared log_sys.latch. This function was too heavy, because it would invoke a minimum of 4 atomic read-modify-write operations as well as system calls in the supposedly fast code path.

It turns out that with a simpler data structure, instead of having several data fields that needed to be kept consistent with each other, we only need one Atomic_relaxed<uint64_t> write_lsn_offset, on which we can operate using fetch_add(), fetch_sub() as well as a single-bit fetch_or(), which reasonably modern compilers (GCC 7, Clang 15 or later) can translate into loop-free code on AMD64.

Before anything can be written to the log, log_sys.clear_mmap() must be invoked.

log_t::base_lsn: The LSN of the last write_buf() or persist(). This is a rough approximation of log_sys.lsn, which will be removed.

log_t::write_lsn_offset: An Atomic_relaxed<uint64_t> that buffers updates of write_to_buf and base_lsn.

log_t::buf_free, log_t::max_buf_free, log_t::lsn: Remove. Replaced by base_lsn and write_lsn_offset.

log_t::buf_size: Always reflects the usable size in append_prepare().

log_t::lsn_lock: Remove. For the memory-mapped log in resize_write(), there will be a resize_wrap_mutex.

log_t::get_lsn_approx(): Return a lower bound of get_lsn(). This should be exact unless append_prepare_wait() is pending.

log_get_lsn(): A wrapper for log_sys.get_lsn(), which must be invoked while holding an exclusive log_sys.latch.

recv_recovery_from_checkpoint_start(): Do not invoke fil_names_clear(); it would seem to be unnecessary.

In many places, references to log_sys.get_lsn() are replaced with log_sys.get_flushed_lsn(), which remains a simple std::atomic::load().

Reviewed by: Debarun Banerjee
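As an aside, the single-word bookkeeping that this commit message describes can be sketched roughly as follows. The bit layout is an assumption; the interpretation of the WRITE_TO_BUF unit as a write counter follows the commit message's mention of write_to_buf, and the helper names are invented for this sketch:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: bits below WRITE_TO_BUF hold the byte offset of
// the next free position in the log buffer; each writer also adds one
// WRITE_TO_BUF unit, so the high bits count writes to the buffer.
constexpr uint64_t WRITE_TO_BUF = uint64_t{1} << 48;

std::atomic<uint64_t> write_lsn_offset{0};

// Reserve `size` bytes in a single read-modify-write: the low bits
// advance by `size` (the LSN offset), and one WRITE_TO_BUF unit is
// added to the counter kept in the high bits. The return value is the
// byte offset at which this writer may copy its log records.
uint64_t reserve(uint64_t size) noexcept
{
  return write_lsn_offset.fetch_add(size + WRITE_TO_BUF,
                                    std::memory_order_relaxed) &
         (WRITE_TO_BUF - 1);
}

// If the reserved range does not fit in the buffer, the overshoot is
// backed out with a single fetch_sub, as in the review hunk earlier.
void undo_overshoot(uint64_t size) noexcept
{
  write_lsn_offset.fetch_sub(size, std::memory_order_relaxed);
}
```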
FYI: This crashed with an assertion on a rare (only?) buildbot builder, which is both Debug and does not run tests in shared memory, thus does not use "mmap".
@vaintroub, sorry, I had overlooked your note. The logs for that failure have been purged, but I believe that it was due to a race condition in an assertion that is only present in debug builds. That assertion was removed later in #4038.
Description
Thanks to 685d958 and a635c40, multiple `mtr_t::commit()` can concurrently copy their slice of `mtr_t::m_log` to the shared `log_sys.buf`. They would allocate their own log sequence number by invoking `log_t::append_prepare()` while holding a shared `log_sys.latch`. This function was too heavy, because it would invoke a minimum of 4 atomic read-modify-write operations as well as system calls in the supposedly fast code path.

It turns out that with a more clever data structure, instead of having several data fields that needed to be kept consistent with each other, we only need one `Atomic_relaxed<uint64_t> write_lsn_offset`, on which we can operate using `fetch_add()`, `fetch_sub()` as well as a single-bit `fetch_or()`, which reasonably modern compilers (GCC 7 counts as one) can translate into loop-free code on AMD64.

`log_t::base_lsn`: The LSN of the last `write_buf()` or `persist()`. This is a rough approximation of `log_sys.lsn`, which will be removed.

`log_t::write_lsn_offset`: An `Atomic_relaxed<uint64_t>` that buffers updates of `write_to_buf` and `base_lsn`.

`log_t::buf_free`, `log_t::max_buf_free`, `log_t::lsn`: Remove. Replaced by `base_lsn` and `write_lsn_offset`.

`log_t::buf_size`: Always reflects the usable size in `append_prepare()`.

`log_t::lsn_lock`: Remove. For the memory-mapped log in `resize_write()`, there will be a `resize_wrap_mutex`.

`log_t::get_lsn_approx()`: Return a lower bound of `get_lsn()`. This should be exact unless `append_prepare_wait()` is pending.

`log_get_lsn()`: A wrapper for `log_sys.get_lsn()`, which must be invoked while holding an exclusive `log_sys.latch`.

In many places, references to `log_sys.get_lsn()` are replaced with `log_sys.get_flushed_lsn()`, which remains a simple `std::atomic::load()`.

Release Notes
InnoDB should perform faster in write-heavy workloads.
The parameter `innodb_log_spin_wait_delay` will be deprecated and ignored, because there is no spin loop anymore.

How can this PR be tested?
This is very low-level functionality that is reasonably well covered by the regression test suite as well as our regular stress tests.
It should be noted that there are two log write code paths on some 64-bit Linux platforms, unless `cmake -DWITH_INNODB_PMEM=OFF` has been specified. By default, both code paths will be included in the build when available, and `/dev/shm` is treated as "fake PMEM". I have tested both code paths in a few GNU/Linux environments.

Basing the PR against the correct MariaDB version

`main` branch.

PR quality check