fix: ROOT-212: Reduce ImportStorageLink COUNT calls during storage sync #8630
Conversation
/git merge develop
/fmt
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8630 +/- ##
============================================
+ Coverage 67.04% 79.69% +12.64%
============================================
Files 792 238 -554
Lines 60444 21557 -38887
Branches 10291 0 -10291
============================================
- Hits 40527 17179 -23348
+ Misses 19914 4378 -15536
+ Partials 3 0 -3
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
keys_for_existed_count.append(key)
if len(keys_for_existed_count) >= settings.STORAGE_EXISTED_COUNT_BATCH_SIZE:
    tasks_existed += link_class.objects.filter(
        key__in=keys_for_existed_count, storage=self.id
I believe key is not in the index, so this will become expensive checking for each object
Yes, that is still true: each individual query is as expensive as it was before. But there are now STORAGE_EXISTED_COUNT_BATCH_SIZE (= 1000) times fewer queries. For storages that are synced often, most keys already exist, so these queries are the dominant cost of a sync operation. And that is the situation we are in now, with people running sync on a cron job every 30 minutes indefinitely (which you can verify in Datadog).
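To illustrate the arithmetic, here is a minimal sketch of the batching pattern outside Django. `count_existing` is a hypothetical stand-in for one batched COUNT query, and the batch-size constant mirrors the setting named above; none of these names are Label Studio's actual code.

```python
STORAGE_EXISTED_COUNT_BATCH_SIZE = 1000  # mirrors the Django setting

query_count = 0  # instrument how many "queries" we issue

def count_existing(batch, existing_keys):
    """Stand-in for link_class.objects.filter(key__in=batch, ...).count()."""
    global query_count
    query_count += 1
    return sum(1 for k in batch if k in existing_keys)

def sync(all_keys, existing_keys):
    tasks_existed = 0
    batch = []
    for key in all_keys:
        batch.append(key)
        if len(batch) >= STORAGE_EXISTED_COUNT_BATCH_SIZE:
            tasks_existed += count_existing(batch, existing_keys)
            batch = []
    if batch:  # flush the final partial batch
        tasks_existed += count_existing(batch, existing_keys)
    return tasks_existed

keys = [f'key-{i}' for i in range(2500)]
existing = set(keys[:2000])
print(sync(keys, existing), query_count)  # 2000 existing tasks, 3 queries instead of 2500
```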
Adding a (storage, key) index would be a big improvement too. But adding an index to an existing large table is something Jo has warned against doing in the past, as it leads to difficult migrations. I'm not confident that doing it is a good idea without his input.
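For reference, a composite index does change the query plan from a full scan to an index search. This SQLite sketch uses invented table and column names, not Label Studio's actual schema:

```python
import sqlite3

# Illustrative only: "link", "storage_id", and "key" are made-up names.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE link (id INTEGER PRIMARY KEY, storage_id INTEGER, key TEXT)')

query = 'EXPLAIN QUERY PLAN SELECT COUNT(*) FROM link WHERE storage_id = ? AND key = ?'

# Without an index on (storage_id, key), the COUNT is a full table scan.
plan_no_index = conn.execute(query, (1, 'k')).fetchall()
print(plan_no_index[0][-1])  # e.g. a "SCAN" over the table

# With the composite index, the same query becomes an index search.
conn.execute('CREATE INDEX link_storage_key ON link (storage_id, key)')
plan_with_index = conn.execute(query, (1, 'k')).fetchall()
print(plan_with_index[0][-1])  # e.g. a "SEARCH ... USING COVERING INDEX"
```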
Along these lines, the next thing I will try is batching the exists() query as well, but that's a larger refactor, so I wanted to get this diff out first.
/git merge develop
if n_tasks_linked := link_class.n_tasks_linked(key, self):
    logger.debug(f'{self.__class__.__name__} already has {n_tasks_linked} tasks linked to {key=}')
    tasks_existed += n_tasks_linked  # update progress counter
if link_class.exists(key, self):
Have I understood it correctly: before, we did a .count() request for every key, and now we do an .exists() request for every key plus one count request per batch?
So we will have even more requests than before, but we expect exists() to be much more lightweight?
Yes, exactly
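A rough sketch of why exists() is cheaper than count(): exists() can stop at the first matching row, while count() must visit every match. The instrumented iterator below is a hypothetical stand-in for rows the database scans, not how Django actually executes these queries:

```python
rows_examined = 0

def matching_rows(table, key):
    """Yields matching rows, counting every row it has to examine."""
    global rows_examined
    for row in table:
        rows_examined += 1
        if row == key:
            yield row

table = ['a'] * 10_000  # 10k links sharing the same key

# count(): consumes the whole iterator, so every row is examined.
count = sum(1 for _ in matching_rows(table, 'a'))
count_cost = rows_examined

rows_examined = 0
# exists(): any() short-circuits on the first match.
exists = any(True for _ in matching_rows(table, 'a'))
exists_cost = rows_examined

print(count, count_cost, exists, exists_cost)  # 10000 10000 True 1
```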
But with the feature flag disabled, won't we have twice as many requests (.exists() plus .count() for every key)?
Maybe we can reimplement it in a safer way?
That's a good point, fixed it 👍
Results from silk on resyncing a storage with 2500 tasks were attached for ff ON and ff OFF; let me make sure the test was valid, and try other numbers of tasks.
def n_tasks_linked(cls, key, storage):
    return cls.objects.filter(key=key, storage=storage.id).count()
why not keep around this class method as well, but change the signature to cls, key_set, storage? This could be used to make the new code a bit DRYer
Since it's used in one place with key= and in another with key__in=, I didn't want to combine those; it seemed premature (even though key__in=[key] is a valid use, I felt it would increase rather than reduce confusion), especially since we're going to get rid of one of them.
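For comparison, the suggested key_set signature might look like this sketch, with a minimal in-memory stand-in for the ORM; the class and data here are hypothetical, not Label Studio's code:

```python
class FakeLink:
    # (key, storage_id) pairs standing in for ImportStorageLink rows
    rows = [('a', 1), ('a', 1), ('b', 1), ('a', 2)]

    @classmethod
    def n_tasks_linked(cls, key_set, storage_id):
        """Counts links whose key is in key_set, like a key__in= ORM filter."""
        return sum(1 for k, s in cls.rows if k in key_set and s == storage_id)

# Single-key call sites pass a one-element set; batch call sites pass the batch:
print(FakeLink.n_tasks_linked({'a'}, 1))       # 2
print(FakeLink.n_tasks_linked({'a', 'b'}, 1))  # 3
```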
tasks_for_webhook = []
keys_for_existed_count = []
for key in self.iter_keys():
another optimization idea: could we implement a batch_iter_keys and then process multiple keys (at least for checking if already synced) in a single query?
to reduce implementation overhead, this might not be necessary for every storage: we could just do it on the storage we're struggling with and have .batch_iter_keys raise NotImplementedError on other storages for now
Could do that too, yep, but I'm playing around with this idea in case there's still an easy win from better query planning; that would be less risky.
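One possible shape for the batch_iter_keys idea, sketched with hypothetical class names; the base class raises NotImplementedError as suggested above, and only an opted-in storage implements batching:

```python
from itertools import islice

class Storage:
    def iter_keys(self):
        raise NotImplementedError

    def batch_iter_keys(self, batch_size=1000):
        """Default: refuse, so only opted-in storages support batched iteration."""
        raise NotImplementedError(f'{type(self).__name__} does not support batched key iteration')

class S3Storage(Storage):
    def __init__(self, keys):
        self._keys = keys

    def iter_keys(self):
        yield from self._keys

    def batch_iter_keys(self, batch_size=1000):
        """Yield keys in lists of up to batch_size, enabling one query per batch."""
        it = iter(self.iter_keys())
        while batch := list(islice(it, batch_size)):
            yield batch

storage = S3Storage([f'key-{i}' for i in range(2500)])
batches = list(storage.batch_iter_keys(batch_size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```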
/fmt
Silk results very similar to last time after adding batching.
As a note for the future, loading the storages page is quite slow due to the following GETs (screenshot attached).
/git merge develop
/git merge develop
/git merge develop
/git merge develop
see updated test results in LSE PR
batch tasks_existed count during storage sync