
Conversation

@rstefanic (Contributor) commented Sep 26, 2025

DuckDBResultReader already exists, but the issue we're running into is that it processes chunks into an internal buffer that's never cleared. For large result sets, this buffer grows and grows until we run out of memory.

Our use case only needs the result set row by row, without computing any of the column data. This new class does less than DuckDBResultReader by not storing the query's results in an underlying buffer. Instead, each chunk's results are passed back to the caller, who can handle them as they wish.

The interface for processing rows is similar to DuckDBResultReader's, in that the caller can still use converters to process their data.

From a DuckDBConnection, calling rowReaderStream returns a DuckDBRowReader that lets the caller handle each chunk as it's processed, without it being written to an internal buffer, for more memory-efficient operation.

const rowReader = await connection.rowReaderStream(sql);

for await (const rows of rowReader.getRowObjects()) {
  // `rows` is a Record<string, DuckDBValue>[] from the last chunk that was processed.
  // Here, we can log each row without keeping the entire result set in memory;
  // this chunk of `rows` is only available in this scope.
  for (const row of rows) {
    console.log(row);
  }
}

I didn't see any documentation on contributing, but if this is a feature you're interested in supporting, I'd be happy to change the formatting, add tests, change the API of DuckDBRowReader, etc., since this is an issue we're facing at my job.

@jraymakers (Contributor) left a comment

Thanks for the PR! I definitely support adding something like this to the API.

Another contributor submitted this related PR some time ago, but I haven't yet found the time to integrate it properly into the API by making it more general.

This PR comes closer, but I think we can do even better by combining the two approaches. I'm wondering if we can implement the async iterator protocol directly on DuckDBResult, as done in the above PR, but have it just yield chunks, instead of performing a particular conversion and yielding rows (as the above PR does).

Then, we can use this iterator to add other convenience methods to DuckDBResult that perform the various conversions and also return results row-by-row in addition to chunk-by-chunk. I think that would alleviate the need for a separate DuckDBRowReader class or a new method on DuckDBConnection.

Below are a few detailed comments on this PR, which may or may not be relevant if we adopt the alternate approach described above.
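
For concreteness, here is a minimal sketch of the chunk-yielding iterator described above, written as a standalone helper rather than a method on DuckDBResult. It assumes fetchChunk() signals the end of the stream with a null or zero-row chunk; the exact termination condition in the library may differ.

import { DuckDBDataChunk, DuckDBResult } from '@duckdb/node-api';

// Sketch: drive a streaming result as an async generator of raw chunks.
// Assumes fetchChunk() returns a null or zero-row chunk once the stream
// is exhausted.
async function* chunks(result: DuckDBResult): AsyncIterableIterator<DuckDBDataChunk> {
  while (true) {
    const chunk = await result.fetchChunk();
    if (!chunk || chunk.rowCount === 0) {
      return; // end of stream
    }
    yield chunk;
  }
}

Implementing this same loop as [Symbol.asyncIterator] on DuckDBResult itself is what makes for await (const chunk of result) work in the examples below.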

@jraymakers (Contributor) commented

Here are some examples of the API I'm envisioning.

Yield chunks:

const result = await connection.stream(sql);
for await (const chunk of result) {
  // ...
}

Yield rows:

const result = await connection.stream(sql);
for await (const row of result.yieldRows()) {
  // ...
}

Yield row objects:

const result = await connection.stream(sql);
for await (const rowObject of result.yieldRowObjects()) {
  // ...
}

With (general) conversion:

const result = await connection.stream(sql);
for await (const row of result.yieldRows(converter)) {
  // ...
}

With specific conversion:

const result = await connection.stream(sql);
for await (const row of result.yieldRowsJS()) {
  // ...
}

Plus other combinations for converting row objects, returning JSON, etc.
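
One way the convenience methods could be layered on the chunk iterator, shown here for row objects as a sketch: it reuses the chunks() helper sketched earlier, and assumes chunk.getRows() returns the chunk's rows as DuckDBValue[][] and result.columnNames() returns the column names.

import { DuckDBResult, DuckDBValue } from '@duckdb/node-api';

// Sketch: build row objects chunk by chunk, so each chunk can be
// garbage-collected as soon as the caller moves past it.
async function* rowObjects(
  result: DuckDBResult
): AsyncIterableIterator<Record<string, DuckDBValue>> {
  const names = result.columnNames();
  for await (const chunk of chunks(result)) {
    for (const row of chunk.getRows()) {
      const rowObject: Record<string, DuckDBValue> = {};
      names.forEach((name, i) => {
        rowObject[name] = row[i];
      });
      yield rowObject;
    }
  }
}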

@rstefanic (Contributor, Author) commented

I've removed the method added to DuckDBConnection and deleted DuckDBRowReader. An async iterator has been added to DuckDBResult, along with the helper methods/wrappers you wrote out. The only one I couldn't follow is yieldRows with two versions (your example above overloaded that method), so I named the one that uses a converter yieldConvertedRows.

I think I got all of them and the other combinations that you mentioned, but let me know if I missed one and I'll be happy to add it (or anything else you would like me to add)!
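
For reference, the two-version yieldRows from the earlier examples could also be expressed with TypeScript overload signatures rather than a separate method name. A sketch, where DuckDBValueConverter<T> is a hypothetical stand-in for the package's converter type:

import { DuckDBValue } from '@duckdb/node-api';

// Hypothetical stand-in for the package's converter type.
type DuckDBValueConverter<T> = (value: DuckDBValue) => T | null;

// Overload signatures only; the implementation would dispatch on whether
// a converter was passed.
interface StreamingResult {
  yieldRows(): AsyncIterableIterator<DuckDBValue[]>;
  yieldRows<T>(converter: DuckDBValueConverter<T>): AsyncIterableIterator<(T | null)[]>;
}

Naming the converter-taking variant separately, as this PR does, avoids the overload machinery entirely, so either style works.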

@jraymakers (Contributor) left a comment

I definitely like where this is headed; it's looking very clean. Just a few comments about the details and the set of methods.

@jraymakers (Contributor) commented

Also: Once we have all the methods in place, it would be good to add some unit tests for each to api.test.ts.

(I apologize for the single large unit test file. I've been meaning to split it up, but haven't gotten to it yet.)

Commit messages:

Adds an asyncIterator to the DuckDBResult class to return the raw DuckDBDataChunks. Helper methods to perform common operations, such as returning row objects or returning the rows as JSON, have been added here so that the caller can iterate over the results in a way that's convenient for them.

`yieldConvertedRows` has been refactored to use `convertRowsFromChunks`. `yieldConvertedRowObjects` now uses `convertRowObjectsFromChunks`, as the previous version of `yieldConvertedRows` did.

Converts the remaining declarations of `iterator` to just use `this` directly.

These were a bit backwards. `yieldRowsJs` and `yieldRowsJson` now call `yieldConvertedRows`. The object versions of these methods have been added here as `yieldRowObjectJs` and `yieldRowObjectJson`, which both call `yieldConvertedRowObjects`.

The first test added is to ensure that the caller can iterate over the DuckDBResult object to get the data chunks.

The last 4 tests added here test `yieldConvertedRows` and `yieldConvertedRowObjects`, along with the `yieldRows*` and `yieldRowObject*` methods, since the latter methods call the former.
@rstefanic (Contributor, Author) commented Sep 28, 2025

One thing to note that I forgot to add to my commit message: I had to remove the async keyword from the JS/Json converter functions. It was making them return a promise, which was unnecessary and confusing.
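
A minimal illustration of that confusion, with hypothetical names: an async method that returns an async generator hands the caller a promise of an iterable, not the iterable itself.

async function* generate(): AsyncIterableIterator<number> {
  yield 1;
  yield 2;
}

// With `async`, callers must first await the method before iterating:
async function wrapped(): Promise<AsyncIterable<number>> {
  return generate(); // for await (const n of await wrapped()) { ... }
}

// Without it, the generator can be passed straight to `for await`:
function unwrapped(): AsyncIterable<number> {
  return generate(); // for await (const n of unwrapped()) { ... }
}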

I have added tests here, and they all passed when I ran them on my local machine; however, there is now a conflict since my branch was a bit stale. On my machine, I have rebased my branch onto main and resolved the conflict, but the tests don't pass.

Even a fresh pull of the repo from main has many of the tests failing. If you just want me to push my version with the merge conflict resolved, I can, but I wasn't sure about pushing it since the tests were failing anyway.

EDIT: After writing that last message, I figured it was probably easier on you to just push the rebased version without the conflict. I do want to note, though, that the tests have been failing on my machine since the rebase, even though they passed while I was developing this (and its tests) on a week-old stale branch.

@jraymakers (Contributor) commented

It seems the tests pass in this PR. Perhaps something is stale in your local environment? Have you tried pnpm run clean in both bindings and api?

@jraymakers (Contributor) left a comment

Looks great! Just a couple comments about some additional test cases.

@jraymakers (Contributor) commented

BTW I was able to pull your fork and run both the build and tests just fine, both on main and your branch.

@rstefanic (Contributor, Author) commented

Your suggestion to run pnpm run clean and rebuild the bindings is what helped. I had built those over a week ago, which explains why my tests were failing. Thanks!

I added three new tests: one that generates more than one chunk, a test for `yieldRows`, and a test for `yieldRowObjects`.
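
A sketch of what the multi-chunk test can rely on: DuckDB's default vector size is 2048 rows, so streaming a large enough range forces the result to span several chunks. The setup follows the package's DuckDBInstance API; the assertion style here is illustrative rather than the test file's.

import { DuckDBInstance } from '@duckdb/node-api';

// Illustrative check: range(10000) should span multiple 2048-row chunks,
// so the async iterator must be driven more than once to drain the result.
async function checkMultipleChunks() {
  const instance = await DuckDBInstance.create(':memory:');
  const connection = await instance.connect();
  const result = await connection.stream('SELECT * FROM range(10000)');
  let chunkCount = 0;
  let rowCount = 0;
  for await (const chunk of result) {
    chunkCount++;
    rowCount += chunk.rowCount;
  }
  if (chunkCount <= 1 || rowCount !== 10000) {
    throw new Error(`expected >1 chunks and 10000 rows, got ${chunkCount} / ${rowCount}`);
  }
}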

@jraymakers merged commit be6f528 into duckdb:main on Sep 28, 2025
5 checks passed