feat(http): add error handling for exporting #4709
Conversation
Force-pushed d3d4031 to afb76f2
try:
    resp = self._export(serialized_data, deadline_sec - time())
except Exception as error:
Would you mind opening an issue or pointing to an existing issue for the behavior you're encountering? We can discuss possible fixes there.
I think there is a bug in _export() on L157, where the code assumes the connection will succeed, so the retry loop here is never executed. However, the raised exception should ultimately be caught here:
opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/_shared_internal/__init__.py
Lines 167 to 170 in d327927
except Exception:  # pylint: disable=broad-exception-caught
    self._logger.exception(
        "Exception while exporting %s.", self._exporting
    )
If you just want the logs to go away, you can set a filter on the opentelemetry.sdk._shared_internal logger or disable it
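For reference, here is a minimal sketch of that approach using the standard library logging module; the logger name comes from the path above, and the filter condition is only an example based on the message in the snippet:
import logging

# Option 1: silence the SDK's internal export-error logger entirely.
logging.getLogger("opentelemetry.sdk._shared_internal").disabled = True

# Option 2: keep the logger but drop only the "Exception while exporting" records.
class DropExportExceptionRecords(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        return "Exception while exporting" not in record.getMessage()

logging.getLogger("opentelemetry.sdk._shared_internal").addFilter(
    DropExportExceptionRecords()
)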
Yeah, so the issue for me was that the logger wrote the entire error traceback to stdout, which totally makes sense given the code you shared.
My goal was to still know when something fails, not just to disable the logger, as I feel the OTLP exporters should handle this internally.
On top of that, I'd like to comment more generally that doing the retry via a try/except around _export() feels kind of odd to me.
Thanks for taking the time to look into this! :)
Do you still think I should open an issue for this in general?
@aabmass, from what I understand, this basically happens when the HTTP endpoint is not available and the urllib call is unable to complete the request, so the full urllib stack trace with the connection-refused error gets printed.
I think an appropriate fix could be to fix the retry logic in export() by putting the call under a try/except, like:
try:
    resp = self._export(serialized_data, remaining_time)
except Exception as exc:
    is_retryable = True
else:
    is_retryable = _is_retryable(resp)
and use that variable in the if at L183. Maybe it needs more adjustments. wdyt @aabmass @DylanRussell?
@pafi-code if you are open to doing that, it would be good; if not, please lmk so I can file a PR. I believe the correct approach is to fix the bug of the request never reaching the retry loop.
PR is updated. Two questions:
- Are you fine with me catching Exception in general, or should I only handle ConnectionError? I assume ConnectionError is preferable.
- Are you fine with the log message in case of a timeout or max retries?
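As a small aside on the first question: in requests, ConnectionError is a subclass of RequestException, so catching RequestException also covers refused connections, and it is not the same class as Python's builtin ConnectionError. A quick check:
# Sanity check of the requests exception hierarchy mentioned above.
import requests

assert issubclass(
    requests.exceptions.ConnectionError, requests.exceptions.RequestException
)
# requests' ConnectionError is a different class from Python's builtin ConnectionError.
assert requests.exceptions.ConnectionError is not ConnectionError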
 
- @pafi-code let's catch requests.exceptions.RequestException, wdyt?
- My first thought on this fix was:
 
        try:
            resp = self._export(serialized_data, remaining_time)
            if resp.ok:
                return SpanExportResult.SUCCESS
        except requests.exceptions.RequestException as exc:
            reason = str(exc)
            retryable = True
            status_code = None
        else:
            retryable = _is_retryable(resp)
            status_code = resp.status_code
            reason = resp.reason
        # Compute jittered exponential backoff
        backoff = (2 ** retry_num) * random.uniform(0.8, 1.2)
        remaining_time = deadline - time()
        if (
            not retryable
            or retry_num + 1 == _MAX_RETRYS
            or backoff > remaining_time
            or self._shutdown
        ):
            if status_code is None:
                _logger.error("Failed to export span batch: %s", reason)
            else:
                _logger.error(
                    "Failed to export span batch code: %s, reason: %s",
                    status_code,
                    reason,
                )
            return SpanExportResult.FAILURE
        _logger.warning(
            "Transient error %s encountered while exporting span batch, retrying in %.2fs.",
            reason,
            backoff,
        )
        if is_shutdown_in_progress.wait(backoff):
            _logger.warning("Shutdown in progress, aborting retry.")
            break
    return SpanExportResult.FAILURE
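As a quick illustration of the jittered backoff in this sketch: the nominal delay doubles with each attempt and each value gets ±20% jitter, until the deadline or _MAX_RETRYS cuts the loop short.
# Worked example of the backoff formula (2 ** retry_num) * random.uniform(0.8, 1.2).
for retry_num in range(4):
    nominal = 2 ** retry_num
    print(
        f"attempt {retry_num}: ~{nominal}s, "
        f"jittered between {nominal * 0.8:.1f}s and {nominal * 1.2:.1f}s"
    )
# attempt 0: ~1s, jittered between 0.8s and 1.2s
# attempt 1: ~2s, jittered between 1.6s and 2.4s
# attempt 2: ~4s, jittered between 3.2s and 4.8s
# attempt 3: ~8s, jittered between 6.4s and 9.6s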
- Ok, will do!
- Well, my issue with this is that these log messages are not really meaningful. If we can't export because max retries were reached, for example, then the log message should reflect this. In your implementation, every retry produces a warning, and on the last attempt you just log the error with the same reasoning, even though the actual reason the export failed is that max retries were reached.

Does that argumentation make sense to you?
Okay but I've just noticed that my implementation is also not ideal for the log messages... I will try to find a solution here 😄
Okay, so I know that the pipeline is failing and I assume this PR should be rebased, but for me there is still the question of whether my proposed fix is actually how this should be handled. I would appreciate some more feedback here. Also, I've created an issue related to this PR: #4712
    
Force-pushed afb76f2 to f581f22
@pafi-code can we have tests for this?
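For example, a test could simulate the endpoint being unreachable. This is only a rough sketch; it assumes the HTTP span exporter keeps its requests session in _session and posts through it, and the exact setup may need adjusting to the final implementation:
from unittest import mock

import requests

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SpanExportResult


def test_export_connection_error_returns_failure():
    exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    # Simulate the endpoint being unreachable instead of spinning up a server.
    with mock.patch.object(
        exporter._session,
        "post",
        side_effect=requests.exceptions.ConnectionError("connection refused"),
    ):
        result = exporter.export([])
    # With the new error handling, the exception should be caught and reported
    # as a failed export rather than bubbling up as a full traceback.
    assert result == SpanExportResult.FAILURE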
Description
When using the OTel HTTP exporters, stdout gets polluted by unhandled error traces. I don't think this should be the desired result. It happens when the endpoint is not available at export time, which can be caused by different events, but in general I think it should be handled better.
Therefore I added error handling with more specific and shorter error messages.
Type of change
I'm not sure whether the change type is a fix or a feature, as the service was still running.
How Has This Been Tested?
Does This PR Require a Contrib Repo Change?
Checklist: