Skip to content

Conversation

@yunfengzhou-hub
Copy link
Contributor

What is the purpose of the change

This issue aims to support configuring Model Function's behavior when the remote requests failed, so as to better adapt to different use cases like debugging or production.

Brief change log

Adds support for the following error handling strategies

  • retry: Retry sending the request. The retrying behavior is limited by retry-num, retry-fallback-strategy, retry-backoff-strategy and retry-backoff-base-interval.
  • failover: Throw exceptions and fail the Flink job.
  • ignore: Ignore the input that caused the error and continue. The error itself would be recorded in log.

Verifying this change

Added unit tests in ModelFunctionErrorHandlingStrategyTest to cover the newly introduced functions.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs

@flinkbot
Copy link
Collaborator

flinkbot commented Oct 28, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@yunfengzhou-hub yunfengzhou-hub force-pushed the error-handling-strategy branch from 03835af to d377599 Compare October 28, 2025 11:52
@yunfengzhou-hub yunfengzhou-hub force-pushed the error-handling-strategy branch from d377599 to b4e212e Compare October 29, 2025 01:53
@yunfengzhou-hub yunfengzhou-hub marked this pull request as ready for review October 29, 2025 06:00
<ul>
<li><code>retry</code>: Retry sending the request. The retrying behavior is limited by retry-num, retry-fallback-strategy, retry-backoff-strategy and retry-backoff-base-interval.</li>
<li><code>failover</code>: Throw exceptions and fail the Flink job.</li>
<li><code>ignore</code>: Ignore the input that caused the error and continue. The error itself would be recorded in log.</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the HTTP connector . We have introduced metadata columns to surface the error details and a flag to continueOnError. Can we have a strategy here that would surface the error and carry on. This allows the stream to "handle" the error with dead letter queues etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I tried to add error-string, http-status-code and http-headers-map as optional output metadata columns in the latest commit. Please take a look.

I did not add http-completion-state yet, as OpenAI's Java SDK did not provide direct access to this information. Maybe we can extend this function in future.

.defaultValue(AbstractOpenAIModelFunction.ErrorHandlingStrategy.RETRY)
.withDescription(
Description.builder()
.text(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we retry the requests when the error is retryable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the retrying functionality has been added in reference to OpenAI's Java SDK's RetryingHttpClient.

add(408); // Retry on request timeouts
add(409); // Retry on lock timeouts
add(429); // Retry on rate limits
add(500); // Retry internal errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not think internal errors should be retryable. The http connector allows this list to be supplied by the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retryable code list is referenced from the OpenAI SDK RetryingHttpClient.kt#L173. Since the OpenAI REST API has a relatively stable behavior and this ModelFunction does not mean to support arbitrary HTTP server with LLM yet, I don't think we need to add this flexibility for users now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of retrying ourselves, can we just use the retry ability from the RetryingHttpClient?

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Oct 29, 2025
@yunfengzhou-hub yunfengzhou-hub force-pushed the error-handling-strategy branch from 662e73f to 3927931 Compare October 30, 2025 10:55
@yunfengzhou-hub
Copy link
Contributor Author

@flinkbot run azure

@yunfengzhou-hub yunfengzhou-hub force-pushed the error-handling-strategy branch from 3927931 to de40c8b Compare October 31, 2025 01:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants