Skip to content

Conversation

@matt-bernstein
Copy link
Contributor

batch tasks_existed count during storage sync

@netlify
Copy link

netlify bot commented Oct 10, 2025

Deploy Preview for label-studio-docs-new-theme canceled.

Name Link
🔨 Latest commit 7622386
🔍 Latest deploy log https://app.netlify.com/projects/label-studio-docs-new-theme/deploys/68f7d50e92481b00087a1bf3

@netlify
Copy link

netlify bot commented Oct 10, 2025

Deploy Preview for label-studio-storybook ready!

Name Link
🔨 Latest commit 7622386
🔍 Latest deploy log https://app.netlify.com/projects/label-studio-storybook/deploys/68f7d50ecc10850008d67210
😎 Deploy Preview https://deploy-preview-8630--label-studio-storybook.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@netlify
Copy link

netlify bot commented Oct 10, 2025

Deploy Preview for heartex-docs canceled.

Name Link
🔨 Latest commit 7622386
🔍 Latest deploy log https://app.netlify.com/projects/heartex-docs/deploys/68f7d50ea0428d00084e9f3d

@github-actions github-actions bot added the fix label Oct 10, 2025
@netlify
Copy link

netlify bot commented Oct 10, 2025

Deploy Preview for label-studio-playground ready!

Name Link
🔨 Latest commit 7622386
🔍 Latest deploy log https://app.netlify.com/projects/label-studio-playground/deploys/68f7d50eddd7510008a76e59
😎 Deploy Preview https://deploy-preview-8630--label-studio-playground.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 10, 2025

/git merge develop

Workflow run
Successfully merged: create mode 100644 label_studio/tasks/migrations/0058_task_precomputed_agreement.py

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 10, 2025

/fmt

Workflow run

@codecov
Copy link

codecov bot commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 86.36364% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.69%. Comparing base (e80029c) to head (7622386).
⚠️ Report is 1 commits behind head on develop.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
label_studio/io_storages/base_models.py 86.84% 5 Missing ⚠️
label_studio/io_storages/s3/models.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #8630       +/-   ##
============================================
+ Coverage    67.04%   79.69%   +12.64%     
============================================
  Files          792      238      -554     
  Lines        60444    21557    -38887     
  Branches     10291        0    -10291     
============================================
- Hits         40527    17179    -23348     
+ Misses       19914     4378    -15536     
+ Partials         3        0        -3     
Flag Coverage Δ
lsf-e2e ?
lsf-integration ?
lsf-unit ?
pytests 79.69% <86.36%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@matt-bernstein matt-bernstein requested a review from makseq October 10, 2025 21:12
keys_for_existed_count.append(key)
if len(keys_for_existed_count) >= settings.STORAGE_EXISTED_COUNT_BATCH_SIZE:
tasks_existed += link_class.objects.filter(
key__in=keys_for_existed_count, storaged=self.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe key is not in the index, so this will become expensive checking for each object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is still true, each query is as expensive as it was before. But there are now STORAGE_EXISTED_COUNT_BATCH_SIZE = 1000x fewer queries. For storages being synced often so most keys in the storage already exist, this is the dominant cost of a sync operation. And this is the situation we are in now, with people running sync on a cronjob every 30 minutes forever (which you can check in datadog).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a (storage, key) index would be a big improvement too. But adding an index to an existing large table is something Jo has warned against doing in the past, as it leads to difficult migrations. I'm not confident that doing it is a good idea without his input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along these lines, the next thing I will try is batching the exists() query as well, but that's a larger refactor, so I wanted to get this diff out first.

makseq
makseq previously requested changes Oct 10, 2025
@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 14, 2025

/git merge develop

Workflow run
Successfully merged: 8 files changed, 206 insertions(+), 43 deletions(-)

if n_tasks_linked := link_class.n_tasks_linked(key, self):
logger.debug(f'{self.__class__.__name__} already has {n_tasks_linked} tasks linked to {key=}')
tasks_existed += n_tasks_linked # update progress counter
if link_class.exists(key, self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have I understood it correctly: before we did .count request for every key and now we do .exists request for every key and also 1 request per batch for count?
So we will have even more requests than before, but we expect that exists is much lightweight?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But with the feature flag disabled we will have twice as many requests? (.exists() plus .count() for every key)
Maybe we can reimplement it in a safer way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, fixed it 👍

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 15, 2025

Results from silk on resyncing a storage with 2500 tasks:

ff ON:
548ms overall
37ms on queries
28 queries

ff OFF:
486ms overall
38ms on queries
28 queries

let me make sure the test was valid, and try other numbers of tasks

Comment on lines -883 to -884
def n_tasks_linked(cls, key, storage):
return cls.objects.filter(key=key, storage=storage.id).count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not keep around this class method as well, but change the signature to cls, key_set, storage? This could be used to make the new code a bit DRYer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's used in one place with key=, and another place with keys__in=, I didn't want to combine those, it seemed premature (even though keys__in=[key] is a valid use I felt it would increase rather than reduce confusion) - especially since we're going to be getting rid of one.


tasks_for_webhook = []
keys_for_existed_count = []
for key in self.iter_keys():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another optimization idea: could we implement a batch_iter_keys and then process multiple keys (at least for checking if already synced) in a single query?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to reduce implementation overhead, this might not be necessary for every storage - we could just do it on the storage we're struggling with and have .batch_iter_keys raise NotImplemented on other storages for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do that too, yep, but playing around with this idea in case there's still an easy win from better query planning, that'd be less risk

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 16, 2025

/fmt

Workflow run

@matt-bernstein
Copy link
Contributor Author

Silk results very similar to last time after adding batching

@matt-bernstein
Copy link
Contributor Author

As a note for the future, loading the storages page is quite slow due to the following GETs: /storages, /storages/export, /storages/types (most time is not spent in the DB though)

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 17, 2025

/git merge develop

Workflow run
Successfully merged: create mode 100644 label_studio/data_manager/migrations/0017_update_agreement_selected_to_nested_structure.py

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 20, 2025

/git merge develop

Workflow run
Successfully merged: create mode 100644 web/libs/ui/src/shad/components/ui/tabs.tsx

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 20, 2025

/git merge develop

Workflow run
Successfully merged: 3 files changed, 21 insertions(+), 18 deletions(-)

@matt-bernstein
Copy link
Contributor Author

matt-bernstein commented Oct 21, 2025

/git merge develop

Workflow run
Successfully merged: 6 files changed, 331 insertions(+), 8 deletions(-)

@matt-bernstein matt-bernstein dismissed makseq’s stale review October 21, 2025 22:37

see updated test results in LSE PR

@robot-ci-heartex robot-ci-heartex merged commit 9e0cfba into develop Oct 21, 2025
53 checks passed
@robot-ci-heartex robot-ci-heartex deleted the fb-ROOT-212 branch October 21, 2025 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants