Problem
In local development, every run reuses the same default storages, which are wiped clean at the start of the run. This behavior is intentional and should be preserved.
On the Apify platform, however, every Actor run automatically receives a new set of unnamed storages. Since these storages are unnamed, they are subject to the default 30-day retention policy.
The Apify Python client already exposes the ability to create new unnamed storages:
```python
await DatasetCollectionClientAsync.get_or_create()
```

Each call creates a new unnamed dataset with a unique ID.
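For comparison, a minimal sketch with the async Apify client (the token value is a placeholder):

```python
from apify_client import ApifyClientAsync


async def main() -> None:
    client = ApifyClientAsync(token='<APIFY_TOKEN>')  # placeholder token

    # Each call without a name creates a brand-new unnamed dataset.
    first = await client.datasets().get_or_create()
    second = await client.datasets().get_or_create()

    assert first['id'] != second['id']  # two distinct unnamed datasets
```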
Crawlee (and the underlying SDK), by contrast, does not currently support this. For example, repeated calls to:
```python
await Dataset.open()
```

always return the same default unnamed storage.
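A minimal sketch of the current behavior (assuming today's `crawlee.storages.Dataset` API):

```python
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    first = await Dataset.open()
    second = await Dataset.open()

    # Both calls resolve to the same default unnamed dataset.
    assert first.id == second.id


asyncio.run(main())
```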
Motivation
In more complex scenarios (e.g., WCC), an Actor may require multiple storage instances within a single run - for example, multiple request queues. To support this, we must allow creating additional unnamed storages beyond the single default instance: non-default unnamed storages (NDU). This provides flexibility for advanced workflows and better alignment with the platform's capabilities.
Goal state
- Bring Crawlee storages (all storage clients, including ApifyStorageClient) to feature parity with the Apify platform by supporting non-default unnamed storages.
- Preserve the current behavior of local storage clients: wipe storages at the start of each run and reuse the same default storage across runs.
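For reference, the local wipe is driven by the `purge_on_start` setting; a minimal sketch, assuming the current `crawlee.configuration.Configuration` field:

```python
from crawlee.configuration import Configuration

# purge_on_start=True (the default) wipes the default local storages
# at the start of each run; this behavior should stay intact.
config = Configuration(purge_on_start=True)
```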
Possible solution 1) - new argument
Introduce a new argument to the storage open constructor:
```python
async def open(
    cls,
    name: str | None = None,
    id: str | None = None,
    scope: Literal['run', 'global'] = 'global',
) -> Dataset | KeyValueStore | RequestQueue:
    ...
```

- `scope='run'` indicates a non-default unnamed storage.
- `scope='global'` refers to globally named storages.
- The `name` parameter cannot be entirely removed for run-scope storages, as it is needed for the implementation:
  - For the filesystem storage: to use as a directory name.
  - For Apify platform storage: to store the mapping of name -> ID in the default key-value store.
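A hypothetical usage sketch of the proposed signature (the `scope` argument does not exist in Crawlee today, and the queue names are illustrative):

```python
from crawlee.storages import RequestQueue


async def main() -> None:
    # A run-scoped, non-default unnamed queue; 'discovery' is only an internal alias.
    discovery_queue = await RequestQueue.open(name='discovery', scope='run')

    # A globally named queue, reused across runs (current behavior of name=...).
    shared_queue = await RequestQueue.open(name='shared-links', scope='global')
```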
Behavior matrix

- Open storage by ID and name
  - Raise an exception (choose one of these).
  - Scope argument is ignored.
- Open storage by ID
  - Opens an existing storage by ID.
  - Scope?
- Open storage by name
  - Scope run:
    - Opens or creates a run-scope (non-default unnamed) storage.
    - `name` is used internally to keep a reference to the storage, but is not the actual storage's "name".
  - Scope global:
    - Opens or creates a global named storage.
- Open storage without args
  - Opens the default unnamed storage.
  - Scope argument is ignored.
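A minimal, illustrative-only sketch of how the resolution logic behind `open()` could implement this matrix; the class and the branching details (e.g., raising on ID+name) are assumptions, not a committed design:

```python
from __future__ import annotations

from typing import Literal


class StorageSketch:
    """Illustrative only - not the real Crawlee storage class."""

    def __init__(self, label: str) -> None:
        self.label = label

    @classmethod
    async def open(
        cls,
        name: str | None = None,
        id: str | None = None,
        scope: Literal['run', 'global'] = 'global',
    ) -> StorageSketch:
        if id is not None and name is not None:
            # Matrix: raise an exception (scope is ignored).
            raise ValueError('Provide either `id` or `name`, not both.')

        if id is not None:
            # Matrix: open an existing storage by ID (scope handling still open).
            return cls(f'existing:{id}')

        if name is not None:
            if scope == 'run':
                # Non-default unnamed storage; `name` is only an internal alias.
                return cls(f'run-scoped:{name}')
            # Globally named storage.
            return cls(f'global:{name}')

        # No arguments: the default unnamed storage; scope is ignored.
        return cls('default')
```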
Possible solution 2) - new constructor
Introduce an alternative constructor for non-default unnamed storages (e.g., `open_by_alias` or similar):

```python
async def open_2(
    cls,
    alias: str,
) -> Dataset | KeyValueStore | RequestQueue:
    ...
```

- Each call to `open_new()` creates a new unnamed storage with a unique ID (mirroring the Apify API client).
- This also keeps the behavior of `open()` unchanged.
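A hypothetical usage sketch (the constructor name is not final; `open_by_alias` is used here purely for illustration):

```python
from crawlee.storages import RequestQueue


async def main() -> None:
    # Two distinct run-scoped queues within the same Actor run (hypothetical API).
    discovery = await RequestQueue.open_by_alias('discovery')
    processing = await RequestQueue.open_by_alias('processing')

    # The existing open() keeps returning the default unnamed queue.
    default_queue = await RequestQueue.open()
```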
Another solution?
Discussion
- Both options have trade-offs:
  - Adding parameters to `open()` results in confusing combinations or a complex signature.
  - Alternative constructors may overwhelm users with choices from the start, and would need to be implemented for every storage type as well as for all related convenience helpers in crawlers & contexts.