-
Notifications
You must be signed in to change notification settings - Fork 935
Bulk imports with COPY FROM STDIN #181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is a bizarre and very cute trick. Let me see if I have the contour right: You use |
|
@fdr Yes, right. It was the only sane way that I found to get the I'm using it for OpenStreetMap imports into Postgres/PostGIS and it works pretty well (https://github.com/omniscale/imposm3). It does 30k/s inserts while the |
conn.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean with "asynchronous" here. Are you referring to the usage of an internal buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because of the buffering and also because it reads the responses in a separate goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit misleading and could use some rewording, but I'd like to hear other peoples' inputs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having thought about this for a bit more, I think the behavior actually fits all the criteria to call at least the error handling part of the feature "asynchronous". I think the documentation could be a bit more clear on what exactly that means in this case. My confusion arose from the use of "and"; my brain didn't connect the comment about Exec with the comment about asynchronicity. But that might just be me :-)
|
Thanks for working on this, I think it's looking pretty good in general. The only thing I'm worried about is the ugly interface database/sql is forcing us into. I see three options here:
I don't consider 1) to be unreasonable, but we'd have to support this interface for quite some time, which seems like an unwanted side effect. Thoughts? |
|
I think I addressed your comments. I hope the documentation is now more clear, I'm not a native speaker though. To the interface: The benefit from having it integrated in the database/sql API is that you can use the connection and transaction for regular SQL commands as well. The only difference between INSERT and COPY FROM is the prepared SQL statement. Creating a new transaction, creating or truncating a table and preparing the statement is identical. See here where I could change I think the only wart is the empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought about COPY TO (a.k.a. "copy out")? As-is this seems to foreclose doing something about that. Is it credible to use the protocol state returned from the server to determine what direction the COPY is in? If that is done, could statement-sniffing be avoided altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's a different protocol message for both COPY FROM and COPY TO, and using that to realize we're doing a COPY would probably be an improvement over the current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that vein, to prove it can be done, I'd like to see the other COPY (out) direction die in a lucid manner. Maybe with a test, if it's not herculean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It now returns "COPY TO is not supported": olt@0bdf7d3
Before that the error was "unknown response for copy query: 'H'"
I can check if I can integrate it into prepareTo and then use the server response to decide if it's a COPY FROM. Not sure if this will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a nice thing to confirm, even if COPY ... TO STDOUT is but a stub. What would the interface look like anyway? I could only imagine .Exec() (of no arguments) would yield an opaque []byte of the copy-out data record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In-app pooling like database/sql. Now pooling is obviously not a concern for us here, but I don't think we should assume that the client gets to work with our implementations of datbase/sql/driver interfaces directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not, and would it help if the type assertion was to an interface, not a concrete type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now pooling is obviously not a concern for us here, but I don't think we should assume that the client gets to work with our implementations of datbase/sql/driver interfaces directly.
Why not, and would it help if the type assertion was to an interface, not a concrete type?
Because the implementation in database/sql can, as far as I can tell, interpose its own types wherever interfaces are returned and delegate to the concrete types of a given database/sql/driver implementation. So no, interfaces won't help.
I'm not saying this won't work, but I don't believe it's guaranteed to always work in the future, since it's depending on specific implementation semantics.of the database/sql framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having access to our driver.Result interface would allow an API where lib/pq would abstract all the nasty stuff (casting of Result, etc.). But the database/sql API is clear that it returns sql.Result and not driver.Result, so there is the danger that it will change in future Go releases.
Given the restrictions here, could we try to bullet-proof that by doing something like
db.Prepare(pq.CopyIn("table_name", "col1", "col2"))
I like that. How should we handle quoting of table and column names? We could quote everything when building the COPY statement, but how do we handle Postgres schemas then? Schema names and table names need to be quoted separately ("schema"."table").
Just quote everything and permit hack like:
db.Prepare(pq.CopyIn("schema\".\"table_name", "col1", "col2"))
Or only quote if a string is not quoted already?
db.Prepare(pq.CopyIn("\"schema\".\"table_name\"", "col1", "col2"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. Hadn't thought of schemas. Given that they're somewhat less common, maybe just do a separate function for that? E.g., pq.CopyIn and pq.CopyInSchema. This is a little gross, but the other options seem kinda sketchy (i.e., we should not assume any special meanings for any characters in the identifiers).
|
Could you please rebase when you have a chance? Just got back from vacation and merged a few unrelated fixes. As this is biggish, it'd be great to discuss it as a clean patch. Thanks! For what it's worth, I actually kind of like @fdr's idea of using an io.Reader to provide data, but I think that can build on this as a later patch. I do think this only handles some of the use cases of COPY, but I believe there's value to this, and it's a clever approach. Thanks for the patch. |
|
I've added CopyIn and CopyInSchema and updated the docs. I've also created a ticket for the 'sql.Stmt.Close does not return errors from driver.Stmt.Close' issue to get an official statement: http://code.google.com/p/go/issues/detail?id=6734 Maybe we are lucky and it gets fixed in the future. |
|
So I'm reasonably happy with this at this point. It's a hack, but it exposes a handy feature of Postgres, and the |
|
I'm already using this myself, and its great, I did however find I needed to change one thing shown here... https://github.com/treetopllc/libpq/commit/ea2b0f61dd30b5d80a2a141eedaa7969c9f2d29b I'm not entirely sure what the response is from or caused by, but found that if I didn't catch that case bad stuff happened, if I match it and ignore it the copy succeeds just fine |
Thanks for testing!
Callers or recv1 don't need to ignore 'N' as of 7454a98, so rebasing against master should fix this issue. |
- db.Prepare("COPY foo (a, b) FROM STDIN") returns a COPY IN statement
- stmt.Exec(a, b) inserts new items
- stmt.Exec() syncs the input stream, since sql.Stmt.Close() does not
return errors from stmt.Close()
it's unlikely that we get errors from data we just flushed
|
I just did a rebase on the current master. |
copy.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recv1() already ignores these, so you shouldn't try to handle them here.
|
@johto I removed the check. |
|
Rebased, squashed, and merged. Thanks! |
|
@deafbybeheading: Looks like this patch broke the tests against 8.4. This seems to be the cause: Passing parameterStatus and calling encode() instead should fix this, I think. |
|
This is in ci.buffer = appendEncodedText(ci.conn.parameterStatus.serverVersion, ci.buffer, value)but I don't see a simpler fix. |
|
Actually, I guess we're passing the whole parameterStatus struct, so it's not quite as bad. |
Naively thinking I would expect both COPY and the "normal" interface to use encode(), but I have to admit I haven't looked at the code to see if that's actually possible. |
|
Yeah, or that--I think the idea is to append the text to an existing buffer and avoid copying where possible. It's nice for performance but not so nice for the code. |
|
Ah. Can't we change the interface for encode() to do that instead? I think with COPY performance is often very important. |
|
Yeah, I think we can fix this one way or another. Unfortunately, I ran out of time for looking at this right now. I got this: msakrejda/pq@lib:master...broken-fix-8.4-copy but it fails some tests so my approach may be wrong. I can revisit tonight. |
|
@deafbybeheading: Thanks, I might take a look a bit later. |
|
Fixed. The |
|
I think on 8.4 it needs to go through the entire appendEscapedText() treatment. That'll be a bit more costly performance-wise, though. Or one could try and add a boolean to escapeBytea() to properly escape the special characters for COPY. I'm hoping we can use binary mode in the future for the normal code path, though, so I'm not sure if spending much work on escapeBytea() right now is justified. Especially only to to make COPY faster on 8.4, both of which were unsupported until a week ago. |
|
Oh hmm, yeah. Hopefully this will do it. |
See #74