DuckDBRowReader: Add class to stream DuckDB results efficiently #303
Conversation
jraymakers left a comment
Thanks for the PR! I definitely support adding something like this to the API.
Another contributor submitted this related PR some time ago, but I haven't yet found the time to integrate it properly into the API by making it more general.
This PR comes closer, but I think we can do even better by combining the two approaches. I'm wondering if we can implement the async iterator protocol directly on DuckDBResult, as done in the above PR, but have it just yield chunks, instead of performing a particular conversion and yielding rows (as the above PR does).
Then, we can use this iterator to add other convenience methods to DuckDBResult that perform the various conversions and also return results row-by-row in addition to chunk-by-chunk. I think that would alleviate the need for a separate DuckDBRowReader class or a new method on DuckDBConnection.
Below are a few detailed comments on this PR, which may or may not be relevant if we adopt the alternate approach described above.
Here are some examples of the API I'm envisioning.

Yield chunks:

```ts
const result = await connection.stream(sql);
for await (const chunk of result) {
  // ...
}
```

Yield rows:

```ts
const result = await connection.stream(sql);
for await (const row of result.yieldRows()) {
  // ...
}
```

Yield row objects:

```ts
const result = await connection.stream(sql);
for await (const rowObject of result.yieldRowObjects()) {
  // ...
}
```

With (general) conversion:

```ts
const result = await connection.stream(sql);
for await (const row of result.yieldRows(converter)) {
  // ...
}
```

With specific conversion:

```ts
const result = await connection.stream(sql);
for await (const row of result.yieldRowsJS()) {
  // ...
}
```

Plus other combinations for converting row objects, returning JSON, etc.
I've removed the method added to `DuckDBConnection`. I think I got all of them and the other combinations that you mentioned, but let me know if I missed one and I'll be happy to add it (or anything else you would like me to add)!
I definitely like where this is headed; it's looking very clean. Just a few comments about the details and the set of methods.
Also: Once we have all the methods in place, it would be good to add some unit tests for each to the existing test file. (I apologize for the single large unit test file. I've been meaning to split it up, but haven't gotten to it yet.)
There currently exists the `DuckDBResultReader`, but the issue that we're running into is that it processes chunks into an internal buffer that's never cleared. For large result sets, this buffer builds and builds until we run out of memory. The case that we have is that we just need the result set row by row without needing to compute any of the column data. This new class does less than `DuckDBResultReader` by not storing the results of the query in an underlying buffer. Instead, each chunk is passed back to the caller, who can handle it as they wish. The interface for processing rows is similar to `DuckDBResultReader`, where the caller can still use converters as they wish to process their data.
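The usage we're after looks roughly like this (a sketch against the existing `@duckdb/node-api` surface; the SQL and the end-of-stream check are illustrative assumptions, not part of this PR):

```ts
import { DuckDBInstance } from '@duckdb/node-api';

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();

// Stream so that each chunk can be handled and released before the next one
// is fetched, instead of accumulating every chunk in an internal buffer.
const result = await connection.stream('SELECT * FROM range(10000000)');
while (true) {
  const chunk = await result.fetchChunk();
  if (!chunk || chunk.rowCount === 0) {
    break; // assuming a null or empty chunk signals the end of the stream
  }
  for (const row of chunk.getRows()) {
    // handle `row` here; nothing is retained once the chunk is dropped
  }
}
```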
Adds an asyncIterator to the DuckDBResult class to return the raw DuckDBDataChunks. Helper methods to perform common operations such as returning row objects or returning the rows as JSON have been added here so that the caller can iterate over the results in a way that's convenient for them.
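Conceptually, the iterator is just a loop around `fetchChunk`. As a standalone sketch (in the PR it lives on `DuckDBResult` as `[Symbol.asyncIterator]`; the end-of-stream check is an assumption):

```ts
import type { DuckDBDataChunk, DuckDBResult } from '@duckdb/node-api';

// Repeatedly fetch chunks and yield each raw DuckDBDataChunk until the result
// is exhausted; nothing is buffered between iterations.
async function* chunks(result: DuckDBResult): AsyncGenerator<DuckDBDataChunk> {
  while (true) {
    const chunk = await result.fetchChunk();
    if (!chunk || chunk.rowCount === 0) {
      return; // assuming a null or empty chunk means no more data
    }
    yield chunk;
  }
}
```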
`yieldConvertedRows` has been refactored to use `convertRowsFromChunks`. `yieldConvertedRowObjects` now uses `convertRowObjectsFromChunks`, as the previous version of `yieldConvertedRows` did.
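Roughly, the refactored generator converts one chunk per iteration (a sketch, not the PR's exact code; the converter type and the `convertRowsFromChunks` signature shown here are stand-ins for whatever the library actually defines):

```ts
import type { DuckDBDataChunk } from '@duckdb/node-api';

// Stand-ins for the library's converter type and its convertRowsFromChunks
// helper; the real signatures in the codebase may differ.
type ValueConverter<T> = (value: unknown) => T | null;
declare function convertRowsFromChunks<T>(
  chunks: DuckDBDataChunk[],
  converter: ValueConverter<T>,
): (T | null)[][];

// Convert one chunk at a time and yield its rows, so only a single chunk's
// worth of converted values is alive at any point. In the PR this is a method
// on DuckDBResult that iterates `this` rather than a free function.
async function* yieldConvertedRows<T>(
  chunks: AsyncIterable<DuckDBDataChunk>,
  converter: ValueConverter<T>,
): AsyncGenerator<(T | null)[]> {
  for await (const chunk of chunks) {
    for (const row of convertRowsFromChunks([chunk], converter)) {
      yield row;
    }
  }
}
```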
Converts the remaining declarations of `iterator` to just use `this` directly.
These were a bit backwards. `yieldRowsJs` and `yieldRowsJson` now call `yieldConvertedRows`. The object versions of these methods have been added here as `yieldRowObjectJs` and `yieldRowObjectJson` which both call `yieldConvertedRowObjects`.
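So each specific variant becomes a one-line delegation; roughly (everything here apart from the delegation pattern is a hypothetical stand-in for the library's actual converters and generator):

```ts
import type { DuckDBDataChunk } from '@duckdb/node-api';

// Hypothetical stand-ins for the library's converter type, its JS and JSON
// converters, and the general-purpose generator sketched above.
type ValueConverter<T> = (value: unknown) => T | null;
declare const jsConverter: ValueConverter<unknown>;
declare const jsonConverter: ValueConverter<string>;
declare function yieldConvertedRows<T>(
  chunks: AsyncIterable<DuckDBDataChunk>,
  converter: ValueConverter<T>,
): AsyncGenerator<(T | null)[]>;

// Each specific variant just picks a converter and delegates.
function yieldRowsJs(chunks: AsyncIterable<DuckDBDataChunk>) {
  return yieldConvertedRows(chunks, jsConverter);
}

function yieldRowsJson(chunks: AsyncIterable<DuckDBDataChunk>) {
  return yieldConvertedRows(chunks, jsonConverter);
}
```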
The first test added is to ensure that the caller can iterate over the DuckDBResult object to get the data chunks. The last 4 tests added here test the `yieldConvertedRows` and `yieldConvertedRowObjects` along with the `yieldRows*` and `yieldRowObject*` methods since the latter methods call the former methods.
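The chunk-iteration test has roughly this shape (sketched with Node's built-in test runner for illustration; the repo's actual harness and assertions in its single large test file differ):

```ts
import assert from 'node:assert';
import { test } from 'node:test';
import { DuckDBInstance } from '@duckdb/node-api';

test('can iterate over the data chunks of a streamed result', async () => {
  const instance = await DuckDBInstance.create(':memory:');
  const connection = await instance.connect();
  // range(5000) spans more than one chunk, since DuckDB chunks hold up to 2048 rows.
  const result = await connection.stream('SELECT * FROM range(5000)');
  let chunkCount = 0;
  let rowCount = 0;
  for await (const chunk of result) {
    chunkCount++;
    rowCount += chunk.rowCount;
  }
  assert.ok(chunkCount > 1);
  assert.strictEqual(rowCount, 5000);
});
```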
One thing to note that I forgot to add to my commit message is that I had to remove the

I have added tests here, and they all passed on my branch when I ran them on my local machine; however, there is now a conflict since my branch was a bit stale. On my machine, I have rebased my branch onto main and resolved the conflict, but the tests don't pass. Even a fresh pull of the repo from main has many of the tests failing. If you just want me to push my version with the merge conflict resolved, I can, but I wasn't sure about pushing it since it was failing anyway.

EDIT: After writing that last message, I figured it was probably easier on you to just push the rebased version without the conflict; however, I did want to note that the tests now fail on my machine since the rebase. They passed while I was developing this on a week-old stale branch and creating the tests.
It seems the tests pass in this PR. Perhaps something is stale in your local environment? Have you tried
jraymakers left a comment
Looks great! Just a couple comments about some additional test cases.
BTW I was able to pull your fork and run both the build and tests just fine, both on
Your suggestion to run

I added three new tests: one that generates more than one chunk, a test for
From a `DuckDBConnection`, calling `rowReaderStream` will return a `DuckDBRowReader` where the caller can handle each individually processed chunk without having it written to an internal buffer, for more memory-efficient operations. I didn't see any documentation on contributing, but if this is a feature you're interested in supporting, I'd be happy to change the formatting, add tests, change the API of `DuckDBRowReader`, etc., since this is an issue that we're facing at my job.