
Add support for non-default unnamed storages #1175

@vdusek

Description

Problem

In local development, each run reuses the same default storages, which are wiped clean at the start of the run. This behavior is intentional and should be preserved.

On the Apify platform, however, every Actor run automatically receives a new set of unnamed storages. Since these storages are unnamed, they are subject to the default 30-day retention policy.

The Apify Python client already exposes the ability to create new unnamed storages:

await DatasetCollectionClientAsync.get_or_create()

Each call creates a new unnamed dataset with a unique ID.

Crawlee (and the underlying SDK), by contrast, does not currently support this. For example, repeated calls to:

await Dataset.open()

always return the same default unnamed storage.
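To make the contrast concrete, here is a minimal, self-contained toy model (not the real Crawlee or Apify client API - class and method names are illustrative only) of the two behaviors described above: a cached default storage versus a client call that creates a fresh unnamed storage on every invocation.

```python
import asyncio
import itertools

_ids = itertools.count()


class ToyDataset:
    """Toy stand-in for a storage; NOT the actual Crawlee implementation."""

    _default: "ToyDataset | None" = None

    def __init__(self) -> None:
        self.id = f"dataset-{next(_ids)}"

    @classmethod
    async def open(cls) -> "ToyDataset":
        # Crawlee-style: repeated calls return the same default instance.
        if cls._default is None:
            cls._default = cls()
        return cls._default

    @classmethod
    async def get_or_create(cls) -> "ToyDataset":
        # Apify-client-style: each call yields a new unnamed storage.
        return cls()


async def main() -> None:
    a, b = await ToyDataset.open(), await ToyDataset.open()
    c, d = await ToyDataset.get_or_create(), await ToyDataset.get_or_create()
    print(a.id == b.id)  # True - same default storage both times
    print(c.id == d.id)  # False - a fresh unnamed storage each call


asyncio.run(main())
```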

Motivation

In more complex scenarios, an Actor may require multiple storage instances within a single run - for example, multiple request queues in WCC (Website Content Crawler). To support this, we must allow creating additional unnamed storages beyond the single default instance - non-default unnamed storages (NDU). This provides flexibility for advanced workflows and better alignment with the platform capabilities.

Goal state

Bring Crawlee storages (all storage clients, including ApifyStorageClient) to feature parity with the Apify platform by supporting non-default unnamed storages.

Preserve the current behavior of local storage clients: wipe storages at the start of each run and reuse the same default storage across runs.

Possible solution 1) - new argument

Introduce a new argument to the storage open constructor:

@classmethod
async def open(
    cls,
    name: str | None = None,
    id: str | None = None,
    scope: Literal['run', 'global'] = 'global',
) -> Dataset | KeyValueStore | RequestQueue:
    ...
  • scope='run' indicates a non-default unnamed storage.
  • scope='global' refers to globally named storages.
  • The name parameter cannot be removed entirely for run-scope storages, as it is needed for the implementation:
    • For the filesystem storage: used as a directory name.
    • For Apify platform storage: used to store the name -> ID mapping in the default key-value store.

Behavior matrix

Open storage by ID and name

  • Raise an exception (ID and name are mutually exclusive; the caller should provide only one of them).
  • Scope argument is ignored.

Open storage by ID

  • Opens an existing storage by ID.
  • How the scope argument should apply here is an open question.

Open storage by name

  • Scope run:
    • Opens or creates a run-scope (non-default unnamed) storage.
      • name is used internally for reference-storage purposes but is not the actual storage's "name".
  • Scope global:
    • Opens or creates a global named storage.

Open storage without args

  • Opens the default unnamed storage.
  • Scope argument is ignored.

Possible solution 2) - new constructor

Introduce an alternative constructor for non-default unnamed storages (e.g., open_by_alias or similar):

async def open_by_alias(
    cls,
    alias: str,
) -> Dataset | KeyValueStore | RequestQueue:
    ...
  • Each new alias passed to open_by_alias() creates a new unnamed storage with a unique ID (mirroring the Apify API client's get_or_create()).
  • This also keeps the behavior of open() unchanged.
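A minimal sketch of this alternative-constructor approach, under the assumption that the alias -> storage mapping is tracked on the side (on the platform it would live in the default key-value store; here it is just an in-memory dict, and all names are hypothetical):

```python
import asyncio
import uuid


class AliasedDataset:
    """Toy stand-in for a storage with an alias-based constructor."""

    _by_alias: dict[str, "AliasedDataset"] = {}

    def __init__(self) -> None:
        # Unique ID, as platform unnamed storages receive.
        self.id = uuid.uuid4().hex

    @classmethod
    async def open_by_alias(cls, alias: str) -> "AliasedDataset":
        # First call with a given alias creates a new unnamed storage;
        # later calls with the same alias reuse it within the run.
        if alias not in cls._by_alias:
            cls._by_alias[alias] = cls()
        return cls._by_alias[alias]


async def main() -> None:
    qa = await AliasedDataset.open_by_alias("queue-a")
    qb = await AliasedDataset.open_by_alias("queue-b")
    print(qa is qb)  # False - two distinct storages in one run


asyncio.run(main())
```

This keeps the default open() path untouched while still allowing several distinct unnamed storages within a single run.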

Another solution?

Discussion

  • Both options have trade-offs:
    • Adding parameters to open() leads to confusing argument combinations or an overly complex signature.
    • Alternative constructors may overwhelm users with choices from the start, and would need to be implemented for every storage type, plus all related convenience helpers in crawlers & contexts.

Metadata

Labels

enhancement - New feature or request.
t-tooling - Issues with this label are in the ownership of the tooling team.
