-
Notifications
You must be signed in to change notification settings - Fork 333
Fix process leak in PosixCommandBasedStatisticsGetter #1717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@xinyuiscool @shanthoosh @sborya Could you take a look at this? Thank you |
| } catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); | ||
| throw new IOException("Interrupted while waiting for command to complete", e); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to add:
} finally {
executable.destroy();
}
here just to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL
| // Wait for the process to complete to prevent resource leak | ||
| try { | ||
| executable.waitFor(); | ||
| } catch (InterruptedException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we re-interrupt, we should probably try to kill it off. e.g.
} catch (InterruptedException e) {
executable.destroy();
Thread.currentThread().interrupt();
throw new IOException("Interrupted while waiting for command to complete", e);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll do destroy in finally block
| psOutput.add(line); | ||
| } | ||
| } | ||
| processReader.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of explicitly calling close, this should probably use try-with-resources instead. Also, since we're already changing this, any thoughts on draining the stderr buffer here too? Although (very) unlikely, if the process spams stderr and fills the pipe, it'll deadlock on the wait until something drains it off. e.g.
try (BufferedReader processReader = new BufferedReader(new InputStreamReader(executable.getInputStream()));
BufferedReader errorReader = new BufferedReader(new InputStreamReader(executable.getErrorStream()))) {
String line;
while ((line = processReader.readLine()) != null) {
if (!line.isEmpty()) {
psOutput.add(line);
}
}
while ((line = errorReader.readLine()) != null) {
if (!line.isEmpty()) {
log.error("stderr while running {}: {}", cmdArray, line);
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the stderr buffer draining.
sborya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Updated |
bringhurst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Improve stderr drain performance by only record the first 100 lines
0b5a014 to
7f6d456
Compare
|
|
||
| // Wait for the process to complete to prevent resource leak | ||
| try { | ||
| boolean finished = executable.waitFor(COMMAND_TIMEOUT_SECONDS, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will be the behavior if the executable did not complete within the timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will throw an exception and finally destroy the executable. This is to avoid infinite wait. We could also remove this TTL to just call executable.waitFor() or with a longer default time. @bringhurst what's your thought on this?
|
|
||
| // Wait for the process to complete to prevent resource leak | ||
| try { | ||
| boolean finished = executable.waitFor(COMMAND_TIMEOUT_SECONDS, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make this configurable per job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think this would be necessary at this point as we don't want to expose this to users for tuning. Make it configurable would also require a wider refactoring.
|
This change was discussed offline. It was updated to a minute to sync up with the metrics reporting interval. There's no need to restrict it more than that. |
| while ((line = processReader.readLine()) != null) { | ||
| if (!line.isEmpty()) { | ||
| psOutput.add(line); | ||
| try (BufferedReader processReader = new BufferedReader(new InputStreamReader(executable.getInputStream())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please clarify how did we have verify and test the changes end-to-end such that it addresses the leack problem?
Summary
PosixCommandBasedStatisticsGetterwhere shell processes spawned to collect systemstatistics were not being properly reaped
process.waitFor()call after reading command output to ensure child processes complete and are removedfrom the process table
Problem
The
PosixCommandBasedStatisticsGetterclass executes shell commands (ps, grep) to collect memory statistics butwasn't waiting for these processes to complete. This caused zombie processes to accumulate over time in
long-running Samza containers, potentially leading to resource exhaustion.
Solution
process.waitFor()after closing the process reader to ensure proper process cleanupTest Plan