fix: ROOT-212: Reduce ImportStorageLink COUNT calls during storage sync #8630
Conversation
/git merge develop
/fmt
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8630 +/- ##
============================================
+ Coverage 67.04% 79.69% +12.64%
============================================
Files 792 238 -554
Lines 60444 21557 -38887
Branches 10291 0 -10291
============================================
- Hits 40527 17179 -23348
+ Misses 19914 4378 -15536
+ Partials 3 0 -3
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
keys_for_existed_count.append(key)
if len(keys_for_existed_count) >= settings.STORAGE_EXISTED_COUNT_BATCH_SIZE:
    tasks_existed += link_class.objects.filter(
        key__in=keys_for_existed_count, storage=self.id
I believe key is not in the index, so this will become expensive checking for each object
Yes, that is still true: each individual query is as expensive as it was before. But there are now STORAGE_EXISTED_COUNT_BATCH_SIZE (= 1000) times fewer queries. For storages that are synced often, most keys already exist, so these queries are the dominant cost of a sync operation. And that is the situation we are in now, with people running sync on a cron job every 30 minutes indefinitely (which you can verify in Datadog).
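To illustrate the arithmetic, here is a minimal sketch of the batching pattern outside Django. `count_existing` is a hypothetical stand-in for one batched COUNT query, and the batch-size constant mirrors the setting named above; none of these names are Label Studio's actual code.

```python
STORAGE_EXISTED_COUNT_BATCH_SIZE = 1000  # mirrors the Django setting

query_count = 0  # instrument how many "queries" we issue

def count_existing(batch, existing_keys):
    """Stand-in for link_class.objects.filter(key__in=batch, ...).count()."""
    global query_count
    query_count += 1
    return sum(1 for k in batch if k in existing_keys)

def sync(all_keys, existing_keys):
    tasks_existed = 0
    batch = []
    for key in all_keys:
        batch.append(key)
        if len(batch) >= STORAGE_EXISTED_COUNT_BATCH_SIZE:
            tasks_existed += count_existing(batch, existing_keys)
            batch = []
    if batch:  # flush the final partial batch
        tasks_existed += count_existing(batch, existing_keys)
    return tasks_existed

keys = [f'key-{i}' for i in range(2500)]
existing = set(keys[:2000])
print(sync(keys, existing), query_count)  # 2000 existing tasks, 3 queries instead of 2500
```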
Adding a (storage, key) index would be a big improvement too. But adding an index to an existing large table is something Jo has warned against doing in the past, as it leads to difficult migrations. I'm not confident that doing it is a good idea without his input.
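For reference, a composite index does change the query plan from a full scan to an index search. This SQLite sketch uses invented table and column names, not Label Studio's actual schema:

```python
import sqlite3

# Illustrative only: "link", "storage_id", and "key" are made-up names.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE link (id INTEGER PRIMARY KEY, storage_id INTEGER, key TEXT)')

query = 'EXPLAIN QUERY PLAN SELECT COUNT(*) FROM link WHERE storage_id = ? AND key = ?'

# Without an index on (storage_id, key), the COUNT is a full table scan.
plan_no_index = conn.execute(query, (1, 'k')).fetchall()
print(plan_no_index[0][-1])  # e.g. a "SCAN" over the table

# With the composite index, the same query becomes an index search.
conn.execute('CREATE INDEX link_storage_key ON link (storage_id, key)')
plan_with_index = conn.execute(query, (1, 'k')).fetchall()
print(plan_with_index[0][-1])  # e.g. a "SEARCH ... USING COVERING INDEX"
```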
Along these lines, the next thing I will try is batching the exists() query as well, but that's a larger refactor, so I wanted to get this diff out first.
/git merge develop
if n_tasks_linked := link_class.n_tasks_linked(key, self):
    logger.debug(f'{self.__class__.__name__} already has {n_tasks_linked} tasks linked to {key=}')
    tasks_existed += n_tasks_linked  # update progress counter
if link_class.exists(key, self):
Have I understood it correctly: before, we did a .count() request for every key, and now we do an .exists() request for every key plus one count request per batch?
So we will have even more requests than before, but we expect exists() to be much more lightweight?
Yes, exactly
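A rough sketch of why exists() is cheaper than count(): exists() can stop at the first matching row, while count() must visit every match. The instrumented iterator below is a hypothetical stand-in for rows the database scans, not how Django actually executes these queries:

```python
rows_examined = 0

def matching_rows(table, key):
    """Yields matching rows, counting every row it has to examine."""
    global rows_examined
    for row in table:
        rows_examined += 1
        if row == key:
            yield row

table = ['a'] * 10_000  # 10k links sharing the same key

# count(): consumes the whole iterator, so every row is examined.
count = sum(1 for _ in matching_rows(table, 'a'))
count_cost = rows_examined

rows_examined = 0
# exists(): any() short-circuits on the first match.
exists = any(True for _ in matching_rows(table, 'a'))
exists_cost = rows_examined

print(count, count_cost, exists, exists_cost)  # 10000 10000 True 1
```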
But with the feature flag disabled, won't we have twice as many requests (.exists() plus .count() for every key)?
Maybe we can reimplement it in a safer way?
That's a good point, fixed it 👍
Results from silk on resyncing a storage with 2500 tasks were attached for ff ON and ff OFF; let me make sure the test was valid, and try other numbers of tasks.
def n_tasks_linked(cls, key, storage):
    return cls.objects.filter(key=key, storage=storage.id).count()
why not keep around this class method as well, but change the signature to cls, key_set, storage? This could be used to make the new code a bit DRYer
Since it's used in one place with key= and in another with key__in=, I didn't want to combine those; it seemed premature (even though key__in=[key] is a valid use, I felt it would increase rather than reduce confusion), especially since we're going to get rid of one of them.
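For comparison, the suggested key_set signature might look like this sketch, with a minimal in-memory stand-in for the ORM; the class and data here are hypothetical, not Label Studio's code:

```python
class FakeLink:
    # (key, storage_id) pairs standing in for ImportStorageLink rows
    rows = [('a', 1), ('a', 1), ('b', 1), ('a', 2)]

    @classmethod
    def n_tasks_linked(cls, key_set, storage_id):
        """Counts links whose key is in key_set, like a key__in= ORM filter."""
        return sum(1 for k, s in cls.rows if k in key_set and s == storage_id)

# Single-key call sites pass a one-element set; batch call sites pass the batch:
print(FakeLink.n_tasks_linked({'a'}, 1))       # 2
print(FakeLink.n_tasks_linked({'a', 'b'}, 1))  # 3
```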
tasks_for_webhook = []
keys_for_existed_count = []
for key in self.iter_keys():
another optimization idea: could we implement a batch_iter_keys and then process multiple keys (at least for checking if already synced) in a single query?
to reduce implementation overhead, this might not be necessary for every storage: we could just do it on the storage we're struggling with and have .batch_iter_keys raise NotImplementedError on other storages for now
Could do that too, yep, but I'm playing around with this idea in case there's still an easy win from better query planning; that would be less risky.
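One possible shape for the batch_iter_keys idea, sketched with hypothetical class names; the base class raises NotImplementedError as suggested above, and only an opted-in storage implements batching:

```python
from itertools import islice

class Storage:
    def iter_keys(self):
        raise NotImplementedError

    def batch_iter_keys(self, batch_size=1000):
        """Default: refuse, so only opted-in storages support batched iteration."""
        raise NotImplementedError(f'{type(self).__name__} does not support batched key iteration')

class S3Storage(Storage):
    def __init__(self, keys):
        self._keys = keys

    def iter_keys(self):
        yield from self._keys

    def batch_iter_keys(self, batch_size=1000):
        """Yield keys in lists of up to batch_size, enabling one query per batch."""
        it = iter(self.iter_keys())
        while batch := list(islice(it, batch_size)):
            yield batch

storage = S3Storage([f'key-{i}' for i in range(2500)])
batches = list(storage.batch_iter_keys(batch_size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```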
/fmt
Silk results very similar to last time after adding batching.
As a note for the future, loading the storages page is quite slow due to the following GETs (screenshot attached).
/git merge develop
/git merge develop
/git merge develop
/git merge develop
see updated test results in LSE PR
batch tasks_existed count during storage sync