ftrace: ClearPerCpuTrace only for offline CPUs #2260


Draft
wants to merge 3 commits into base: main
Conversation

sashwinbalaji
Contributor

Clearing /sys/kernel/tracing/trace should clear all per-CPU trace data for online CPUs (src). For offline CPUs, trace data might still be pending in their trace_pipe_raw. To minimize the time taken, clear the per-CPU trace only for these offline CPUs instead of for every CPU.

Flow:

  • Read the list of offline CPUs from /sys/devices/system/cpu/offline.
  • Parse the comma-separated list of decimal numbers and ranges representing the CPU list (src: show_cpus_attr, bitmap_print_to_pagebuf).
  • Iterate & clear the buffer of only the offline CPUs.
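The parsing step above can be sketched as follows. This is a minimal, self-contained illustration using only the standard library (the actual patch uses base::StringSplitter); `ParseCpuList` is a hypothetical name, not the function from the patch:

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Parses a kernel CPU list such as "0-3,5,7-9" (the format of
// /sys/devices/system/cpu/offline) into individual CPU numbers.
// Returns std::nullopt on malformed input.
std::optional<std::vector<uint32_t>> ParseCpuList(const std::string& list) {
  std::vector<uint32_t> cpus;
  std::stringstream ss(list);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (token.empty())
      return std::nullopt;
    size_t dash = token.find('-');
    try {
      if (dash == std::string::npos) {
        // Single CPU, e.g. "5".
        cpus.push_back(static_cast<uint32_t>(std::stoul(token)));
      } else {
        // Inclusive range, e.g. "7-9".
        uint32_t start =
            static_cast<uint32_t>(std::stoul(token.substr(0, dash)));
        uint32_t end =
            static_cast<uint32_t>(std::stoul(token.substr(dash + 1)));
        if (end < start)
          return std::nullopt;
        for (uint32_t cpu = start; cpu <= end; cpu++)
          cpus.push_back(cpu);
      }
    } catch (const std::exception&) {
      return std::nullopt;
    }
  }
  return cpus;
}
```

An empty input string yields an empty vector, matching the "no offline CPUs" case discussed later in the thread.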

Fixes: #519
Test: Tested locally by manually turning off some CPUs on both Linux and Android.

@sashwinbalaji sashwinbalaji requested a review from a team as a code owner July 23, 2025 07:42
@LalitMaganti
Collaborator

LG but I'll let Ryan take the lead on this one.

@LalitMaganti LalitMaganti requested a review from rsavitski July 23, 2025 08:24
@rsavitski rsavitski self-assigned this Jul 23, 2025
// Each range is either a single CPU or a range of CPUs, e.g. "0-3,5,7-9".
// Source: https://docs.kernel.org/admin-guide/cputopology.html
std::vector<uint32_t> offline_cpus_tmp;
for (base::StringSplitter ss(offline_cpus_str, ','); ss.Next();) {
Member

I'm fine with this, but will note that Primiano raised some concerns that we're introducing too much complexity for an edge case, and that we should keep it simple.

One option would be to just check that there's at least one offline cpu core before doing the per-cpu fallback, instead of trying to parse the exact mask. That would move the edge case to "many-core machine + offlined cores + very sensitive to time spent in ClearTrace()", but also avoid whatever potential string parsing errors we're not seeing here :--).

Consider the much simpler to read & maintain callsite in that scenario:

  if (!HasOfflineCpus()) {
    return;
  }

  for (size_t cpu = 0, num_cpus = NumberOfCpus(); cpu < num_cpus; cpu++) {
    ClearPerCpuTrace(cpu);
  }

Contributor Author

My reasoning for current approach is:

  1. With no offline CPUs: the string should be empty (barring some wider issue in the kernel or in the base functions TrimWhitespace/StringSplitter), so effectively no parsing logic runs.

  2. With some offline CPUs (though rare): switching to just checking that there's at least one offline CPU core before doing the per-CPU fallback, instead of parsing the exact mask, would effectively mean no trace in scenarios with a high NumberOfCpus and a short trace_duration.

From local runs we observe:

[242.828]          tracefs.cc:466 Tracefs::ClearTrace started for NumberOfCpus=128
[251.148]          tracefs.cc:474 Tracefs::ClearTrace done
[416.948]          tracefs.cc:468 Tracefs::ClearTrace started for NumberOfCpus=256
[432.652]          tracefs.cc:475 Tracefs::ClearTrace done

If parsing succeeds, we improve on these numbers; if parsing fails, we effectively fall back to the same behavior as before.

Open to change if you aren't convinced.

Member

> so effectively no parsing logic should happen.
But it's still a bunch of code added to the class, so the ratio of code vs applicability is rather low. Plus the "rare" codepath will be less exercised in practice and therefore more likely to have issues/bitrot.

But to reiterate, I'm personally fine with this patch. In fact we can probably later move the string parsing to src/kernel_utils/, as it comes up in a bunch of procfs/sysfs places.

@@ -655,12 +715,16 @@ bool Tracefs::IsFileReadable(const std::string& path) {
return access(path.c_str(), R_OK) == 0;
}

bool Tracefs::ReadFile(const std::string& path, std::string* str) const {
Member

What's wrong with using ReadFileIntoString for the offlined cores file? I'd prefer not to add any extra shims for the mock-based tests. (Plus I think we should be moving towards a FakeTracefs for majority of tests anyway.)

Contributor Author

> What's wrong with using ReadFileIntoString for the offlined cores file?

We want to know if base::ReadFile fails so we can invoke the fallback of clearing all CPUs. Currently ReadFileIntoString returns an empty string on failure, which is also a possible return value when no CPU is offline.

do you mean MockTracefs by FakeTracefs?

Now the reason for adding the wrapper function Tracefs::ReadFile around base::ReadFile is purely for testing purposes, following the recommendation in go/gmockcook#mocking-free-functions.
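The distinction being argued for here is that the return type must separate "read failed" from "file was empty". A minimal sketch of such a reader, using only C stdio (`ReadFileOrFail` is an illustrative name, not the actual Tracefs method):

```cpp
#include <cstdio>
#include <string>

// Returns false on open/read failure so the caller can distinguish
// "file unreadable" (fall back to clearing all CPUs) from
// "file is empty" (no offline CPUs).
bool ReadFileOrFail(const std::string& path, std::string* out) {
  FILE* f = fopen(path.c_str(), "r");
  if (!f)
    return false;
  out->clear();
  char buf[4096];
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    out->append(buf, n);
  bool ok = !ferror(f);
  fclose(f);
  return ok;
}
```

With this shape, an empty string with a `true` return unambiguously means "no offline CPUs", while `false` triggers the conservative clear-everything fallback.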

Member

Ah, I see your point of needing to distinguish the error case, and wanting to inject that in the test.

It's weird to have a mocking testing shim for both ReadFileIntoString and ReadFile, if the former is a minimal wrapper around the latter. Though I can see how it's easier to mock the former.

> do you mean MockTracefs by FakeTracefs?
I was talking about a hypothetical fake that we should finally write. A significant chunk of ftrace tests do not assert interactions, and therefore would be easier to maintain with a fake. But let's put that aside for this patch, it's not directly relevant.

Clearing `/sys/kernel/tracing/trace` should clear all per-CPU trace data for online CPUs. For offline CPUs, trace data might still be pending in their `trace_pipe_raw`. To optimize the clearing process and minimize time taken, only clear the trace for these offline CPUs instead of iterating over all CPUs.
@sashwinbalaji sashwinbalaji force-pushed the dev/sashwinbalaji/traced_timeout branch from 8ab25a4 to c3c236a Compare July 24, 2025 05:48
@sashwinbalaji sashwinbalaji marked this pull request as draft July 24, 2025 11:49
// Parses the list of offline CPUs from "/sys/devices/system/cpu/offline" and
// returns them as a vector if successful, or std::nullopt if any failure in
// reading or parsing.
std::optional<std::vector<uint32_t>> GetOfflineCpus();
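The std::nullopt return value above is what drives the fallback behavior discussed earlier. A hedged sketch of how a caller might consume it — the stubs and the function name `ClearOfflineOrAll` are illustrative, not the actual Tracefs implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative stand-ins for the real Tracefs members, recording which
// per-CPU buffers were cleared so the logic can be exercised in isolation.
static std::vector<uint32_t> g_cleared;
void ClearPerCpuTrace(uint32_t cpu) { g_cleared.push_back(cpu); }
size_t NumberOfCpus() { return 8; }

// If the offline-CPU list was parsed successfully, clear only those
// buffers; otherwise fall back to clearing every per-CPU buffer.
void ClearOfflineOrAll(const std::optional<std::vector<uint32_t>>& offline) {
  if (offline.has_value()) {
    for (uint32_t cpu : *offline)
      ClearPerCpuTrace(cpu);
  } else {
    for (size_t cpu = 0, n = NumberOfCpus(); cpu < n; cpu++)
      ClearPerCpuTrace(static_cast<uint32_t>(cpu));
  }
}
```

This mirrors the trade-off from the review thread: the parsing path is the optimization, and any failure degrades gracefully to the old clear-everything behavior.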
Member

Can we make this protected since it's effectively a utility helper function that is only here because tracefs is the single caller atm, and the only reason you're exposing it in the interface is for tests?

@@ -143,5 +152,39 @@ TEST(TracefsTest, ReadBufferSizeInPages) {
EXPECT_THAT(ftrace.GetCpuBufferSizeInPages(), 1);
}

TEST(TracefsTest, GetOfflineCpus) {
Member

It'd be useful to have a test showing that the per-cpu buffer clearing is working as expected when calling into ClearTrace(), since that's the logic we actually care about.

Successfully merging this pull request may close these issues.

Can not capture traces on arm64 server which has 256 cpus.
3 participants