[FLINK-38581][model] Model Function supports error handling strategy #27163

yunfengzhou-hub · 2025-10-28T11:33:03Z

What is the purpose of the change

This change lets users configure how a Model Function behaves when a remote request fails, so that it can better fit different use cases such as debugging or production.

Brief change log

Adds support for the following error handling strategies:

retry: Retry sending the request. The retrying behavior is limited by retry-num, retry-fallback-strategy, retry-backoff-strategy and retry-backoff-base-interval.
failover: Throw an exception and fail the Flink job.
ignore: Ignore the input that caused the error and continue. The error itself is recorded in the log.
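
To make the three strategies concrete, here is a minimal sketch of how such a dispatch could look. It is not the code added by this PR: the class `ErrorHandlingSketch`, the method `invoke`, and the interpretation of retry-num as "number of extra attempts" and of retry-fallback-strategy as "what to do once retries are exhausted" are all assumptions made for illustration. Backoff between attempts is left out.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical illustration only; these names do not come from the PR. */
final class ErrorHandlingSketch {

    /** The three strategies named in the change log. */
    enum Strategy { RETRY, FAILOVER, IGNORE }

    /**
     * Sends one request and applies the strategy if it fails.
     * An empty result means the input was skipped. The request is
     * assumed to return a non-null value on success.
     */
    static <I, O> Optional<O> invoke(
            Function<I, O> request,
            I input,
            Strategy strategy,
            int retryNum,          // assumed meaning: extra attempts after the first
            Strategy fallback) {   // assumed meaning: applied once retries run out
        int attempts = strategy == Strategy.RETRY ? Math.max(0, retryNum) + 1 : 1;
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return Optional.of(request.apply(input));
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        Strategy terminal = strategy == Strategy.RETRY ? fallback : strategy;
        if (terminal == Strategy.IGNORE) {
            // ignore: record the error and drop this input
            System.err.println("Ignoring input after error: " + last);
            return Optional.empty();
        }
        // failover (or any other fallback): propagate and fail the job
        throw last;
    }
}
```

With `Strategy.RETRY`, a retry-num of 3 and a fallback of `Strategy.IGNORE`, a request that keeps failing would be attempted four times in this sketch and then skipped.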

Verifying this change

Added unit tests in ModelFunctionErrorHandlingStrategyTest to cover the newly introduced functions.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): yes
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): yes
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? docs

flinkbot · 2025-10-28T11:36:29Z

CI report:

de40c8b Azure: FAILURE
0154b6b Azure: PENDING

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

davidradl · 2025-10-29T11:30:06Z

docs/content/docs/connectors/models/openai.md

+                <ul>
+                    <li><code>retry</code>: Retry sending the request. The retrying behavior is limited by retry-num, retry-fallback-strategy, retry-backoff-strategy and retry-backoff-base-interval.</li>
+                    <li><code>failover</code>: Throw exceptions and fail the Flink job.</li>
+                    <li><code>ignore</code>: Ignore the input that caused the error and continue. The error itself would be recorded in log.</li>


For the HTTP connector, we introduced metadata columns to surface the error details and a continueOnError flag. Can we have a strategy here that surfaces the error and carries on? This would allow the stream to "handle" the error with dead letter queues etc.

Thanks for the suggestion. In the latest commit I added error-string, http-status-code and http-headers-map as optional output metadata columns. Please take a look.

I did not add http-completion-state yet, as OpenAI's Java SDK does not provide direct access to this information. Maybe we can extend this function in the future.
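
As a rough illustration of what surfacing the error enables, the sketch below models one output record that carries either a response or the error metadata, and picks out the failed records so a job could route them to a dead letter queue. The class names, field types and the `deadLetters` helper are hypothetical; only the column names error-string, http-status-code and http-headers-map come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the row layout used by the PR. */
final class ErrorMetadataSketch {

    /** One output record: a response on success, error metadata on failure. */
    static final class Result {
        final String response;                        // null on failure
        final String errorString;                     // "error-string"
        final Integer httpStatusCode;                 // "http-status-code"
        final Map<String, List<String>> httpHeaders;  // "http-headers-map" (type assumed)

        private Result(String response, String errorString,
                       Integer httpStatusCode, Map<String, List<String>> httpHeaders) {
            this.response = response;
            this.errorString = errorString;
            this.httpStatusCode = httpStatusCode;
            this.httpHeaders = httpHeaders;
        }

        static Result success(String response) {
            return new Result(response, null, null, null);
        }

        static Result failure(String errorString, int statusCode,
                              Map<String, List<String>> headers) {
            return new Result(null, errorString, statusCode, headers);
        }

        boolean isError() {
            return errorString != null;
        }
    }

    /** Collects the failed records, e.g. to send them to a dead letter queue. */
    static List<Result> deadLetters(List<Result> results) {
        List<Result> failed = new ArrayList<>();
        for (Result r : results) {
            if (r.isError()) {
                failed.add(r);
            }
        }
        return failed;
    }
}
```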

davidradl · 2025-10-29T11:31:38Z

flink-models/flink-model-openai/src/main/java/org/apache/flink/model/openai/OpenAIOptions.java

+                            .defaultValue(AbstractOpenAIModelFunction.ErrorHandlingStrategy.RETRY)
+                            .withDescription(
+                                    Description.builder()
+                                            .text(


Can we retry the requests when the error is retryable?

Yes, retrying has been added, modeled on the RetryingHttpClient in OpenAI's Java SDK.

davidradl · 2025-10-29T11:34:41Z

...nk-model-openai/src/main/java/org/apache/flink/model/openai/AbstractOpenAIModelFunction.java

+                    add(408); // Retry on request timeouts
+                    add(409); // Retry on lock timeouts
+                    add(429); // Retry on rate limits
+                    add(500); // Retry internal errors


I don't think internal errors should be retryable. The HTTP connector allows this list to be supplied by the user.

The retryable code list is taken from the OpenAI SDK (RetryingHttpClient.kt#L173). Since the OpenAI REST API behaves in a relatively stable way and this ModelFunction is not yet meant to support arbitrary HTTP servers hosting LLMs, I don't think we need to add this flexibility for users now.

Instead of retrying ourselves, can we just use the retry ability from the RetryingHttpClient?
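
For readers following this thread, the retry decision under discussion can be sketched roughly as below. The status codes are the four shown in the diff excerpt above; the SDK's actual list may be broader. The class `RetryPolicySketch` and the exponential doubling with a cap are assumptions for illustration, not the behavior of RetryingHttpClient or of the retry-backoff-strategy option.

```java
import java.time.Duration;
import java.util.Set;

/** Hypothetical illustration only; not the SDK's or the PR's implementation. */
final class RetryPolicySketch {

    // Codes copied from the diff excerpt in this thread.
    static final Set<Integer> RETRYABLE = Set.of(408, 409, 429, 500);

    /** Returns true if a response with this status code should be retried. */
    static boolean isRetryable(int statusCode) {
        return RETRYABLE.contains(statusCode);
    }

    /**
     * Assumed exponential backoff: base * 2^attempt, capped at max.
     * The attempt index is zero-based; the exponent is clamped to avoid overflow.
     */
    static Duration backoff(Duration base, int attempt, Duration max) {
        int exponent = Math.max(0, Math.min(attempt, 20));
        Duration delay = base.multipliedBy(1L << exponent);
        return delay.compareTo(max) > 0 ? max : delay;
    }
}
```

With a base interval of 100 ms and a 10 s cap, the fourth attempt (index 3) would wait 800 ms in this sketch, and later attempts would level off at the cap.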

yunfengzhou-hub · 2025-10-30T14:11:25Z

@flinkbot run azure

yunfengzhou-hub force-pushed the error-handling-strategy branch from 03835af to d377599 Compare October 28, 2025 11:52

[FLINK-38581][model] Model Function supports error handling strategy

b4e212e

yunfengzhou-hub force-pushed the error-handling-strategy branch from d377599 to b4e212e Compare October 29, 2025 01:53

yunfengzhou-hub marked this pull request as ready for review October 29, 2025 06:00

davidradl reviewed Oct 29, 2025

github-actions bot added the community-reviewed PR has been reviewed by the community. label Oct 29, 2025

yunfengzhou-hub force-pushed the error-handling-strategy branch from 662e73f to 3927931 Compare October 30, 2025 10:55

[FLINK-38581][model] Support surfacing error message

de40c8b

yunfengzhou-hub force-pushed the error-handling-strategy branch from 3927931 to de40c8b Compare October 31, 2025 01:41

[FLINK-38581][model] Reuse retrying logic from RetryingHttpClient

0154b6b

Sxnan Oct 31, 2025

4 participants
