Start new per-host queue when an async over sync Dns call is cancelled #92862
@@ -641,9 +641,7 @@ private static bool LogFailure(object hostNameOrAddress, long? startingTimestamp
 /// Since this is doing synchronous work on a thread pool thread, we want to limit how many threads end up being
 /// blocked. We could employ a semaphore to limit overall usage, but a common case is that DNS requests are made
 /// for only a handful of endpoints, and a reasonable compromise is to ensure that requests for a given host are
-/// serialized. Once the data for that host is cached locally by the OS, the subsequent requests should all complete
-/// very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't
-/// block lots of threads all getting data for that one host. We also still want to issue the request to the OS, rather
+/// serialized. We also still want to issue the request to the OS, rather
 /// than having all concurrent requests for the same host share the exact same task, so that any shuffling of the results
 /// by the OS to enable round robin is still perceived.
 /// </remarks>
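For orientation, the per-host serialization pattern the remarks describe can be sketched as follows. This is a simplified, hypothetical model with illustrative names, not the actual Dns implementation; it only shows the idea of chaining lookups for the same host onto one another while still issuing each lookup to the OS.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class PerHostQueueSketch
{
    // Maps a host key to the most recently queued lookup task for that host.
    private static readonly Dictionary<string, Task> s_tasks = new();

    public static Task<T> RunSerializedAsync<T>(string key, Func<T> resolve)
    {
        lock (s_tasks)
        {
            s_tasks.TryGetValue(key, out Task? previous);

            // Chain the new lookup after the previous one for the same host (or start fresh),
            // so at most one lookup per host runs at a time, yet each still hits the OS resolver.
            Task<T> task = previous is null
                ? Task.Run(resolve)
                : previous.ContinueWith(_ => resolve(), TaskScheduler.Default);

            s_tasks[key] = task;

            // When the lookup finishes, remove it from the dictionary if it is still the tail.
            task.ContinueWith(static (t, state) =>
            {
                var (tasks, k) = ((Dictionary<string, Task>, string))state!;
                lock (tasks)
                {
                    if (tasks.TryGetValue(k, out Task? current) && current == t)
                    {
                        tasks.Remove(k);
                    }
                }
            }, (s_tasks, key), TaskScheduler.Default);

            return task;
        }
    }
}
```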
@@ -666,7 +664,18 @@ private static Task<TResult> RunAsync<TResult>(Func<object, long, TResult> func,
     Debug.Assert(!Monitor.IsEntered(s_tasks));
     try
     {
-        return func(key, startingTimestamp);
+        using (cancellationToken.UnsafeRegister(static s =>
+        {
+            // In case a request is cancelled, start a new queue for this key.
+            // This helps avoiding congestion when a getaddrinfo call runs for too long.
+            lock (s_tasks)
+            {
+                s_tasks.Remove(s!);
+            }
+        }, key))
+        {
+            return func(key, startingTimestamp);
+        }
     }
     finally
     {

Review discussion on this change:

What's the rationale for why a cancellation request means we no longer want to serialize?

If configured well, a cancellation would likely kick in for a long-running or stalled request. Closing the existing queue and starting a new one makes sure that requests following the cancellation are not blocked by the long-running request. This helped in my experiments, and validations are in progress in customer environments.

One of the original motivations for doing this queueing was a DNS request that could get stalled, and thus all requests for that DNS endpoint would end up blocking all threads in the pool. Doesn't this effectively put us back into the same place?

From what I see, #48566 doesn't talk about stalled (up to several seconds) requests but about slow (tens of milliseconds?) requests using up the ThreadPool threads. This does not remove the serialization of those. The queue is only being reset when a cancellation kicks in, which is typically triggered by a timeout. cc @davidfowl if you can recall how long the slow requests took in #34633. Without that I would assume it's a case similar to what we discussed in #92862 (comment), and "slow" means that the requests are in the 5-50 ms range.

Right, so if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?

That's one source. The Dns APIs that support cancellation are public.

Yes, but how likely is this scenario, and is it more probable than hitting #81023? To me it looks like having all requests to the same host stalled for a long enough period to starve the ThreadPool is a catastrophic event which would have serious impact on the application's functionality even without the pool starvation, and the customer has to go and fix the environmental setup anyway. Occasional stalled requests are more common IMO, and the problem is that they are currently killing a possibly long chain of follow-up requests.

IMO this doesn't really change things. Regularly occurring cases of cancellation are most likely timeouts, and cancelling a significant percentage of DNS lookups by setting an overly short timeout doesn't look like a sane scenario to me.

What is the cause of the stalled requests you're seeing that makes this the case?

In #92054 we see the OS resolver (successfully) retrying a query after the default timeout. I don't have such confident information from the other cases, because this is typically happening in production systems in a sporadic manner. Currently I'm working with @matt-augustine checking whether a private build of the changes in this PR helps with their symptoms. Would that bring enough confidence for you that the change is net positive?

We see the issue in a production service with a containerized application running in a Kubernetes cluster on AKS. It usually occurs in new pods shortly after they are created. The cluster is using the default Kubernetes DNS configuration, which as I understand it means that DNS lookups are not cached in the container/pod, so every DNS query is routed to the cluster DNS service. Our theory is that UDP packet loss might be responsible for failed/retried DNS queries when new pods are created, but we have not confirmed that. We are currently testing a private build to see if it helps, and I'm going to try to make a simpler way to test it in a more repeatable way.
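To make the discussion concrete, here is a hedged sketch of the caller-side pattern that typically produces these cancellations: a per-request timeout via CancellationTokenSource. The host name is a placeholder and this is not code from the PR; it only shows how a timed-out lookup surfaces to the caller.

```csharp
using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class DnsTimeoutExample
{
    static async Task Main()
    {
        // A per-request timeout is the typical source of the cancellations discussed above.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));

        try
        {
            IPAddress[] addresses = await Dns.GetHostAddressesAsync("contoso.example", cts.Token);
            foreach (IPAddress address in addresses)
            {
                Console.WriteLine(address);
            }
        }
        catch (OperationCanceledException)
        {
            // With the change in this PR, a cancellation like this also resets the per-host
            // queue, so later lookups for the same host are not stuck behind the stalled one.
            Console.WriteLine("DNS lookup timed out.");
        }
    }
}
```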
@@ -682,16 +691,17 @@ private static Task<TResult> RunAsync<TResult>(Func<object, long, TResult> func,
     }, key, cancellationToken, TaskContinuationOptions.DenyChildAttach, TaskScheduler.Default);

     // If it's possible the task may end up getting canceled, it won't have a chance to remove itself from
-    // the dictionary if it is canceled, so use a separate continuation to do so.
+    // the dictionary and report AfterResolution() if it is canceled, so use a separate continuation to do so.
     if (cancellationToken.CanBeCanceled)
     {
-        task.ContinueWith((task, key) =>
+        task.ContinueWith(delegate
         {
+            NameResolutionTelemetry.Log.AfterResolution(key, startingTimestamp, false);
             lock (s_tasks)
             {
                 ((ICollection<KeyValuePair<object, Task>>)s_tasks).Remove(new KeyValuePair<object, Task>(key!, task));
             }
-        }, key, CancellationToken.None, TaskContinuationOptions.OnlyOnCanceled | TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default);
+        }, CancellationToken.None, TaskContinuationOptions.OnlyOnCanceled | TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default);
     }

     // Finally, store the task into the dictionary as the current task for this key.

Review discussion on this change:

Previously this wasn't allocating a new delegate and a new closure on every call. Now it is, as it's closing over the key and the task and the startingTimestamp. It'll even result in that allocation even if !CanBeCanceled, because the state it's closing over is at the top-level method scope.

This is a fix for #92045. I don't see a way to capture the additional state otherwise.

At a minimum you can pass the required state in via the object state parameter. That'll require a tuple allocation/boxing, but the one allocation for that is still better than the two allocations for the closure and delegate. And it'll make it specific to the CanBeCanceled case, whereas with how it's currently written the closure allocation will happen first thing in the method body.
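As a rough illustration of the reviewer's suggestion (a hypothetical standalone sketch, not the code that was merged): the values the continuation needs can be passed through ContinueWith's object state parameter, so the per-call cost is a single boxed tuple, paid only on the path where the token can actually be canceled.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class StateContinuationSketch
{
    private static readonly Dictionary<object, Task> s_tasks = new();

    static void RegisterCancellationCleanup(Task task, object key, long startingTimestamp)
    {
        // A static lambda cannot capture locals, so no closure is allocated per call;
        // the data travels in the boxed tuple passed as the state argument.
        task.ContinueWith(static (completed, state) =>
        {
            var (k, timestamp) = ((object, long))state!;
            Console.WriteLine($"Lookup for {k} (started at {timestamp}) was canceled.");

            lock (s_tasks)
            {
                s_tasks.Remove(k);
            }
        },
        (key, startingTimestamp),
        CancellationToken.None,
        TaskContinuationOptions.OnlyOnCanceled | TaskContinuationOptions.ExecuteSynchronously,
        TaskScheduler.Default);
    }
}
```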
This turned out to be a false assumption. In Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50 ms. If the packet is lost, it waits 5 sec before retrying.
On what are you basing this? On Ubuntu, for: [snippet omitted] I get: [output omitted]
That averages to 24.84 ms in my local 22.04 environment and to 9.051 ms in the Azure D32 VM from #92054.
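The exact measurement snippet isn't captured in this excerpt; a minimal loop along these lines (hypothetical, with a placeholder host) would produce comparable per-lookup numbers:

```csharp
using System;
using System.Diagnostics;
using System.Net;

class ResolveTimingSketch
{
    static void Main()
    {
        const string host = "www.microsoft.com"; // placeholder; this host comes up later in the thread
        const int iterations = 100;

        var sw = new Stopwatch();
        double totalMs = 0;

        for (int i = 0; i < iterations; i++)
        {
            sw.Restart();
            _ = Dns.GetHostEntry(host); // synchronous, getaddrinfo-backed lookup
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
        }

        Console.WriteLine($"Average resolution time: {totalMs / iterations:F2} ms");
    }
}
```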
That name resolves through a CNAME of fb.lt.main.ml.mdn.trafficmanager.net, which advertises a CNAME for fileserver-backend-lb.lt-westus-01.ic3-middle-lane-main-lt.westus-test.cosmic-int.office.net with a 0 second TTL. So my guess is the OS is just honoring the TTLs in this case. The www.microsoft.com example above, on the other hand, uses non-0 TTLs. (That's in no way saying that this isn't a real scenario; DNS-based connection load balancing works this way.)
You can also check whether DNS caching in your system is enabled, e.g.: [command omitted]
My point is simply that stating that it's a false assumption that OSes cache DNS information is itself a false assumption. Every major OS I'm aware of is capable of doing DNS caching. Whether it's enabled is a different question, though to my knowledge it is by default in Ubuntu 22.04. And we shouldn't just be killing the associated comment. If you want to augment the comment to say that caching might be disabled, that's fine.
I was wrong to think that answers are never cached on Ubuntu by default. However, to me the comment suggests that we can generally rely on short completion times because of the caching, which is also not true; e.g., as Miha pointed out, answers with TTL=0 are not cached, which then increases the probability of hitting #81023.

I will add the comment back & augment it.