@bevzzz bevzzz commented Jun 27, 2025

This PR adds a Pagination API for the sync and async clients. Owing to Java's asynchronous programming model, the two APIs are not identical.

Sync

Synchronous pagination should feel instantly familiar. CursorSpliterator powers two patterns for iterating over objects:

  • the default Paginator object returned by collection.paginate() implements Iterable, so it can be used in a traditional for-loop
  • stream() exposes the internal Spliterator via the idiomatic Stream API
var things = client.collections.use("Things");
var allThings = things.paginate();

for (WeaviateObject object : allThings) {
    // Traditional for-loop
}

// Stream API
var listOfThings = allThings.stream().map(/* some transformations */).toList();
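
A minimal sketch of how one cursor-driven Spliterator could back both patterns, using only JDK classes. PagedSpliterator, SketchPaginator and FakeBackend are hypothetical names for illustration, not the classes added in this PR, and the in-memory backend stands in for a real cursor query.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.BiFunction;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical stand-in for a cursor-based spliterator: fetches pages lazily. */
class PagedSpliterator extends Spliterators.AbstractSpliterator<Integer> {
    private final BiFunction<Integer, Integer, List<Integer>> fetchPage; // (cursor, pageSize) -> page
    private final int pageSize;
    private Integer cursor; // last element handed out; null means "start from the beginning"
    private Iterator<Integer> currentPage = List.<Integer>of().iterator();
    private boolean exhausted;

    PagedSpliterator(BiFunction<Integer, Integer, List<Integer>> fetchPage, int pageSize) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        if (!currentPage.hasNext() && !exhausted) {
            List<Integer> page = fetchPage.apply(cursor, pageSize);
            exhausted = page.isEmpty(); // an empty page marks the end of the collection
            currentPage = page.iterator();
        }
        if (!currentPage.hasNext()) {
            return false;
        }
        cursor = currentPage.next();
        action.accept(cursor);
        return true;
    }
}

/** Hypothetical paginator: Iterable for for-loops, stream() for the Stream API. */
class SketchPaginator implements Iterable<Integer> {
    private final BiFunction<Integer, Integer, List<Integer>> fetchPage;
    private final int pageSize;

    SketchPaginator(BiFunction<Integer, Integer, List<Integer>> fetchPage, int pageSize) {
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public Spliterator<Integer> spliterator() {
        return new PagedSpliterator(fetchPage, pageSize);
    }

    @Override
    public Iterator<Integer> iterator() {
        return Spliterators.iterator(spliterator());
    }

    public Stream<Integer> stream() {
        return StreamSupport.stream(spliterator(), false);
    }
}

/** In-memory backend holding the integers 0..6; the cursor is the last value seen. */
class FakeBackend {
    static List<Integer> fetchAfter(Integer cursor, int pageSize) {
        int start = cursor == null ? 0 : cursor + 1;
        List<Integer> page = new ArrayList<>();
        for (int i = start; i < Math.min(start + pageSize, 7); i++) {
            page.add(i);
        }
        return page;
    }
}
```

Each call to iterator() or stream() creates a fresh spliterator, so in this sketch the same paginator can be traversed more than once; whether the real Paginator allows that is not stated in the PR.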

Async

We cannot do exactly the same thing in the asynchronous part of the client; the Stream and Iterator APIs are inherently synchronous and resist being bent into asynchronicity.

Our API ends up looking only slightly different:

var things = client.collections.use("Things");

things.paginate().forEach(thing -> System.out.println(thing));

If other components of the application also do async, batch-oriented work, complete pages can be processed at once:

things.paginate().forPage(manyThings -> otherApi.sendAsync(manyThings));

AsyncPaginator supports prefetching, so that work can begin before the main thread starts to process results:

// The first request is sent immediately
var allThings = things.paginate(p -> p.prefetch(true));

// We can do something else right away
animals.query.fetchObjects(...);

// Start processing Things
allThings.forEach(thing -> System.out.print(thing));

// Do something else:
things.data.insert(...);

// Block until pagination completes
allThings.get();
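
A rough sketch of what prefetching amounts to, under the assumption that it simply means starting the first request when the paginator is created instead of when processing begins. PrefetchSketch is a hypothetical class that handles a single page only; it is not the PR's AsyncPaginator.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical single-page paginator illustrating eager vs. lazy first requests. */
class PrefetchSketch {
    private final Supplier<CompletableFuture<List<String>>> firstPage;
    private final CompletableFuture<List<String>> inFlight; // non-null only when prefetching

    PrefetchSketch(Supplier<CompletableFuture<List<String>>> firstPage, boolean prefetch) {
        this.firstPage = firstPage;
        // With prefetch enabled the request is already running when the constructor returns.
        this.inFlight = prefetch ? firstPage.get() : null;
    }

    /** Reuses the prefetched request if there is one, otherwise sends it now. */
    CompletableFuture<Void> forEach(Consumer<String> action) {
        CompletableFuture<List<String>> page = inFlight != null ? inFlight : firstPage.get();
        return page.thenAccept(items -> items.forEach(action));
    }

    public static void main(String[] args) {
        Supplier<CompletableFuture<List<String>>> backend =
            () -> CompletableFuture.supplyAsync(() -> List.of("thing-1", "thing-2"));

        PrefetchSketch things = new PrefetchSketch(backend, true); // request starts here
        // ... other work can happen while the first page is in flight ...
        things.forEach(System.out::println).join(); // block until processing completes
    }
}
```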

The difference comes from the concurrency model: in Python it is possible to write an async for loop, where each await cedes control to the event loop. Java's concurrency is built on threads, so we end up working at a slightly lower level.

For instance, iterating asynchronously requires manually scheduling recursive callbacks. Some pseudo-code to illustrate:

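iter = db.iterator()
iter.nextPage().then(processAll)

fn processAll(page) {
  // An empty page means there is nothing left to fetch
  if page is empty { return }

  for obj in page { ... }
  // Schedule this same callback for the next page
  iter.nextPage().then(processAll)
}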
The forEach helper shown in the snippets above hides this boilerplate.
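
For comparison, the recursive scheduling from the pseudo-code can be written in Java with CompletableFuture.thenCompose. This is only a sketch of the general technique: AsyncPagingSketch, fetchAfter and this forEach are hypothetical, and the PR's AsyncPaginator may be implemented differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

class AsyncPagingSketch {
    /** Hypothetical async backend: pages of up to 3 integers drawn from 0..6. */
    static CompletableFuture<List<Integer>> fetchAfter(Integer cursor) {
        return CompletableFuture.supplyAsync(() -> {
            int start = cursor == null ? 0 : cursor + 1;
            List<Integer> page = new ArrayList<>();
            for (int i = start; i < Math.min(start + 3, 7); i++) {
                page.add(i);
            }
            return page;
        });
    }

    /** Processes one page, then schedules itself for the next until an empty page arrives. */
    static CompletableFuture<Void> forEach(
            Function<Integer, CompletableFuture<List<Integer>>> fetchPage,
            Integer cursor,
            Consumer<Integer> action) {
        return fetchPage.apply(cursor).thenCompose(page -> {
            if (page.isEmpty()) {
                return CompletableFuture.<Void>completedFuture(null);
            }
            page.forEach(action);
            Integer next = page.get(page.size() - 1); // the last element becomes the new cursor
            return forEach(fetchPage, next, action);
        });
    }

    public static void main(String[] args) {
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        // join() blocks the caller until every page has been processed.
        forEach(AsyncPagingSketch::fetchAfter, null, seen::add).join();
        System.out.println(seen); // prints [0, 1, 2, 3, 4, 5, 6]
    }
}
```

Pages are requested strictly one after another, so the callback sees objects in cursor order even though the work runs off the calling thread.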

Passing query options

// Create a paginator
var allThings = things.paginate(
  p -> p
    .pageSize(10)
    .resumeFrom("uuid-3")
    .returnProperties("name", "height")
    .returnMetadata(Metadata.VECTOR));

// Process data (sync: stream or for-loop, async: forEach / forPage)
allThings.stream().toList();

I purposefully did not add parameters to the .stream() method, because a stream() method conventionally takes none.
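
A small sketch of how lambda-configured options like the ones above might be collected, assuming a plain builder that the caller's function mutates. PaginationOptions, its Builder and the default page size of 100 are all hypothetical; the client's real option types are not shown in this PR description.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical options holder; method names mirror the calls in the snippet above. */
class PaginationOptions {
    final int pageSize;
    final String resumeFrom;
    final List<String> returnProperties;

    private PaginationOptions(Builder b) {
        this.pageSize = b.pageSize;
        this.resumeFrom = b.resumeFrom;
        this.returnProperties = b.returnProperties;
    }

    /** Applies the caller's lambda to a fresh builder, then freezes the result. */
    static PaginationOptions of(Function<Builder, Builder> fn) {
        return fn.apply(new Builder()).build();
    }

    static class Builder {
        private int pageSize = 100; // assumed default, not taken from the PR
        private String resumeFrom;  // null means "start from the first object"
        private List<String> returnProperties = List.of();

        Builder pageSize(int pageSize) {
            this.pageSize = pageSize;
            return this;
        }

        Builder resumeFrom(String uuid) {
            this.resumeFrom = uuid;
            return this;
        }

        Builder returnProperties(String... names) {
            this.returnProperties = List.of(names);
            return this;
        }

        PaginationOptions build() {
            return new PaginationOptions(this);
        }
    }

    public static void main(String[] args) {
        PaginationOptions opts = PaginationOptions.of(
            p -> p.pageSize(10).resumeFrom("uuid-3").returnProperties("name", "height"));
        System.out.println(opts.pageSize + " " + opts.resumeFrom + " " + opts.returnProperties);
    }
}
```

Collecting every option when the paginator is created is what lets the processing methods stay parameterless.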

@bevzzz bevzzz self-assigned this Jun 27, 2025
@orca-security-eu orca-security-eu bot left a comment

Orca Security Scan Summary

Status: Passed
Check: Secrets
Issues by priority: high 0, medium 0, low 0, info 0

bevzzz added 6 commits June 30, 2025 11:10
This lets us keep all configurations 'on the left' of the operator
and all operations on the right
ObjectBuilder.partial is a new util for composing ObjectBuilder-style functions.
E.g.: Boolean properties cannot be compared using gt(-e) and lt(-e)
operators. Also LIKE operator only makes sense to text properties
and should accept a single pattern.
Does not support Stream or Spliterator / Iterator APIs,
because those are inherently synchronous.
@bevzzz bevzzz changed the title from "v6: Sync Iterator" to "v6: Pagination" Jun 30, 2025
@bevzzz bevzzz marked this pull request as ready for review June 30, 2025 16:00
bevzzz added 2 commits July 4, 2025 14:24
- this.cursor should be a final field since AsyncResult will only
  ever work with a single page
- change currentPage access to package-private to avoid confusion

See: #399 (comment)
@bevzzz bevzzz merged commit 085929a into v6 Jul 7, 2025
2 checks passed
@bevzzz bevzzz deleted the v6-sync-iterator branch July 7, 2025 16:48