
Conversation

@keelerm84 keelerm84 commented Nov 14, 2025

In the previous setup, we would only check the fallback or recovery
conditions when the synchronizer returned an update. If the synchronizer
was stuck, or nothing was changing in the environment, we would never
check the conditions.

This configuration also exposed an interesting behavior. If the
synchronizer cannot connect, it emits error updates. Each time we
receive an error, we check whether we have failed to initialize for the
last 10 seconds; if so, we re-create the primary synchronizer.

If the re-created synchronizer also fails to connect, its very first
error update triggers the condition check again. Because the 10-second
window has already elapsed, it errors out immediately, without ever
getting a real chance to connect. With this change, a synchronizer is
guaranteed at least 10 seconds to try before the condition is evaluated.
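A minimal sketch of the guarantee described above, with illustrative names rather than the SDK's actual API: the fallback condition is evaluated on a timer, but only fires once the current synchronizer has had a full grace window since it was (re-)created.

```python
import time

# Illustrative constant; the SDK's actual window is configured elsewhere.
FALLBACK_GRACE_SECONDS = 10


class ConditionChecker:
    """Evaluates the fallback condition only after the current
    synchronizer has been given a full grace window to connect."""

    def __init__(self):
        self._sync_started_at = time.time()

    def restart_synchronizer(self):
        # Re-creating the synchronizer resets the grace window, so the
        # new instance cannot be torn down by a stale failure.
        self._sync_started_at = time.time()

    def should_fall_back(self, initialized, now=None):
        now = time.time() if now is None else now
        elapsed = now - self._sync_started_at
        return (not initialized) and elapsed >= FALLBACK_GRACE_SECONDS
```

Because the window is anchored to the synchronizer's start time rather than to a global clock, a freshly re-created synchronizer can no longer be condemned by a failure streak that predates it.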


Note

Separates FDv2 fallback/recovery checks from synchronizer emissions via a timer+reader thread, fixes initial status timestamp, removes streaming context manager hooks, and skips flaky file-source tests in CI.

  • FDv2 core:
    • Periodically evaluates conditions independent of updates: _consume_synchronizer_results now uses a queue, background reader thread, and RepeatingTask timer to trigger regular status checks and act without waiting for sync updates.
    • Minor flow tweak when starting the secondary synchronizer.
  • Status handling:
    • Initialize DataSourceStatus.since with time.time() in DataSourceStatusProviderImpl.
  • Streaming:
    • Remove context-manager methods (__enter__, __exit__) from StreamingDataSource.
  • Tests:
    • Update test_two_phase_init to ensure version progression and add readiness/change events.
    • Mark file data source tests as flaky-skipped when LD_SKIP_FLAKY_TESTS is set.
  • CI:
    • Set LD_SKIP_FLAKY_TESTS: true for test steps on Linux and Windows jobs.
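The queue-plus-reader-thread arrangement from the first bullet can be sketched as follows. All names here are illustrative, and the periodic wake-up is approximated with a queue timeout rather than the SDK's RepeatingTask (which instead ticks a timer that prompts the same check), but the effect is the same: conditions are evaluated on every tick, whether or not the synchronizer emitted anything.

```python
import queue
import threading

CHECK_INTERVAL_SECONDS = 0.5  # illustrative; not the SDK's actual value


def consume_synchronizer_results(synchronizer, check_conditions, stop_event):
    """Funnel synchronizer updates through a queue so that condition
    checks can fire periodically even when no updates arrive."""
    updates = queue.Queue()

    def reader():
        # Background thread: block on the synchronizer and forward
        # whatever it emits into the queue.
        for update in synchronizer:
            updates.put(update)

    sync_reader = threading.Thread(target=reader, daemon=True)
    sync_reader.start()

    while not stop_event.is_set():
        try:
            # Wake up periodically even if nothing was emitted.
            update = updates.get(timeout=CHECK_INTERVAL_SECONDS)
        except queue.Empty:
            update = None  # timer tick with no update

        # Conditions are evaluated on every tick, not only when the
        # synchronizer produces something.
        check_conditions(update)
```

The key design point is that the main loop never blocks indefinitely on the synchronizer itself, so a stuck or silent synchronizer can no longer starve the fallback/recovery checks.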

Written by Cursor Bugbot for commit fa34244.

@keelerm84 keelerm84 requested a review from a team as a code owner November 14, 2025 18:01
@keelerm84 keelerm84 force-pushed the mk/sdk-1579/improve-recovery branch from cbcd3d6 to 7e0dbce Compare November 14, 2025 18:08
synchronizer.stop()
timer.stop()

sync_reader.join(0.5)
Contributor
What is the purpose of the half second timeout here? My understanding is that if it doesn't stop in .5 seconds, execution will move on and that thread will still continue to process.

Contributor

Not a blocker, just curious.

Member Author

We are trying to shut down in an orderly fashion. The timeout gives us time to close out correctly, but eventually we do have to move on if the thread isn't halting as expected. The idea is that if close has been called, the Python interpreter is going to exit shortly afterward anyway.
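The shutdown sequence being discussed can be sketched like this (hypothetical helper names, not the SDK's code): signal the reader to stop, then join with a bounded timeout; because the reader runs as a daemon thread, a straggler cannot keep the interpreter alive after close.

```python
import threading


def shutdown(stop_event, sync_reader, join_timeout=0.5):
    """Signal the reader to stop, then wait a bounded time for it.

    Returns True if the thread exited within the timeout. Because the
    reader is a daemon thread, a leftover thread will not keep the
    interpreter alive once close() has been called.
    """
    stop_event.set()
    sync_reader.join(join_timeout)
    return not sync_reader.is_alive()
```

Note that `Thread.join(timeout)` never raises on timeout; the only way to know whether the thread actually stopped is to check `is_alive()` afterward, as above.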

@keelerm84 keelerm84 force-pushed the mk/sdk-1574/outage-detector branch from 85f59ba to 8e81cdb Compare November 18, 2025 18:11
@keelerm84 keelerm84 force-pushed the mk/sdk-1579/improve-recovery branch from 7e0dbce to 38fb5f4 Compare November 18, 2025 18:55
Base automatically changed from mk/sdk-1574/outage-detector to feat/fdv2 November 18, 2025 18:55
synchronizer.stop()
timer.stop()

sync_reader.join(0.5)

Bug: Thread Termination Failure Leads to Instability

The sync_reader thread is given only a 0.5-second timeout to exit. If it hasn't finished within that time, execution continues while the thread is still running, potentially causing race conditions or resource leaks when a new synchronizer is subsequently created and another reader thread is started.
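One way to guard against the overlap this note describes (a hypothetical helper, not the SDK's code) is to check `is_alive()` after the bounded join and refuse to start a replacement reader until the old one has actually exited:

```python
import threading


def replace_reader(old_reader, make_reader, join_timeout=0.5):
    """Join the old reader with a bounded timeout; only start a fresh
    reader once the old one has really exited. Returns the new thread,
    or None if the old thread is still running."""
    if old_reader is not None:
        old_reader.join(join_timeout)
        if old_reader.is_alive():
            # The old thread is still running; starting a second reader
            # now could race on shared state, so surface the problem
            # instead of silently stacking threads.
            return None
    new_reader = make_reader()
    new_reader.start()
    return new_reader
```

The caller can then retry, log, or escalate rather than accumulating reader threads that all drain the same queue.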


@keelerm84 keelerm84 force-pushed the mk/sdk-1579/improve-recovery branch 2 times, most recently from 1799cee to 644cc13 Compare November 19, 2025 15:22
@keelerm84 keelerm84 force-pushed the mk/sdk-1579/improve-recovery branch 2 times, most recently from 842e8fa to e578a2e Compare November 19, 2025 17:55
@keelerm84 keelerm84 force-pushed the mk/sdk-1579/improve-recovery branch from e578a2e to a37fa56 Compare November 19, 2025 18:33

# Send change events
if affected_items:
print("@@@@@ affected items are (delta)", affected_items, "\n")

Bug: Debug print statements left in production code

A debug print() statement with the @@@@@ marker was left in production code at line 374. It should be removed before merging.


@keelerm84 keelerm84 force-pushed the mk/sdk-1579/improve-recovery branch from a37fa56 to 2ca4af0 Compare November 19, 2025 19:45
@keelerm84 keelerm84 merged commit 7834f90 into feat/fdv2 Nov 19, 2025
24 of 25 checks passed
@keelerm84 keelerm84 deleted the mk/sdk-1579/improve-recovery branch November 19, 2025 21:35
keelerm84 added a commit that referenced this pull request Nov 19, 2025
keelerm84 added a commit that referenced this pull request Nov 19, 2025
4 participants