-
Notifications
You must be signed in to change notification settings - Fork 459
Trivial SAM header sanitising. #509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Us being clever with CR-LF canonicalisation introduces interoperability concerns. |
|
So is SAM able to work with both nl and cr-nl? The spec is, naturally, vague regarding this. However traditionally samtools has changed cr-nl to nl. (Try samtools view on a cr-nl SAM file and you'll see them all change.) Maybe it's wrong to normalised to the UNIX format, but there isn't exactly much guidance out there on what is right or wrong here. What has been the philosophy so far? |
|
For SAM files, the spec says they are text files and is naturally agnostic about what text files might look like on your platform. In the htslib implementation, accepting either CR or CR-LF as newline for SAM files is a QoI convenience feature and is taken care of by the functions underneath What text looks like when it is a blob embedded in a binary format (like the header text in BAM or CRAM) is a separate matter. The specs don't say anything about this and really ought to. We are very concrete about the other characters in our binary formats, so IMHO we should be concrete about the newline-like things in the textish parts of our binary formats and say that they are single 0x0A LF characters. I would think that that's what all the implementations are outputting (even on Windows), but I have not checked. (I thought we'd talked about text blobs within binary files in the past but I guess it was this thread about bgzipped text files, which is somewhat different — link is to Bob Handsaker's useful point about line terminators in compressed text files.) |
|
Agreed and infact I now realise my implementation has a flaw when it comes to SAM on Windows. It'll "correct" the header but not the records, leaving a half-way house which is worse than either of the two conventions. That maybe means the cr-nl to nl translation, if applicable at all, should be on output to BAM rather than input to memory. My initial rationale was because of the half-baked header parsing API which doesn't appear to have any cr removal at all (lots of '\n' and '\t' in sam_hdr_parse but no '\r'. However it does, bizarrely, appear to work. I haven't dug down deep into it, but don't consider it worthwhile given one day we'll implement the mythical new header parsing API and make all that old code defunct anyway. So I'll drop that bit from the PR. (Edit: to be specific, I'm not intending on fixing the BAM writing code here either - it can be as ill defined as before and we'll figure out how to handle it in a separate issue/PR.) |
|
That's because the in-memory Given that you lot may implement a mythical new header parsing API one day, any code added here will go away. So the code in this PR is non-trivial (hence a little risky), changes behaviour (e.g., when the input has multiple NUL padding), and won't be long-lived anyway. We already over-allocate the buffer by 1. If we're going to accept the missing final newline that motivated a change here, I don't think we should do anything more than something like this (and similarly for CRAM): @@ -123,6 +123,7 @@ bam_hdr_t *bam_hdr_read(BGZF *fp)
char buf[4];
int magic_len, has_EOF;
int32_t i, name_len, num_names = 0;
+ uint32_t l_nul;
size_t bufsize;
ssize_t bytes;
// check EOF
@@ -145,13 +146,17 @@ bam_hdr_t *bam_hdr_read(BGZF *fp)
if (bytes != 4) goto read_err;
if (fp->is_be) ed_swap_4p(&h->l_text);
- bufsize = ((size_t) h->l_text) + 1;
- if (bufsize < h->l_text) goto nomem; // so large that adding 1 overflowed
+ bufsize = ((size_t) h->l_text) + 2;
+ if (bufsize < h->l_text) goto nomem; // so large that adding 2 overflowed
h->text = (char*)malloc(bufsize);
if (!h->text) goto nomem;
- h->text[h->l_text] = 0; // make sure it is NULL terminated
+ h->text[h->l_text] = h->text[h->l_text+1] = '\0'; // make sure it is NUL-terminated
bytes = bgzf_read(fp, h->text, h->l_text);
if (bytes != h->l_text) goto read_err;
+ // make sure it has a final newline (if it's non-empty)
+ l_nul = h->l_text;
+ while (l_nul > 0 && h->text[l_nul-1] == '\0') l_nul--;
+ if (l_nul > 0 && h->text[l_nul-1] != '\n') { h->text[l_nul] = '\n'; h->l_text++; }
bytes = bgzf_read(fp, &h->n_targets, 4);
if (bytes != 4) goto read_err; |
Although the htslib code checks things like headers starting with @, nul termination, etc, not all BAM implementations are as friendly. This code sanitises on input, also fixing samtools/samtools#661 in the process.
Remove some variables that are no longer needed. Make it count lines. Ensure adding '\n' doesn't overflow h->l_text. Simplify logic for adding the '\n' a bit.
Added 'sanitise headers' samtools#509
* NEWS update. (Ideally merge on day on release, editing date if required.) * Update NEWS Added 'sanitise headers' #509
Although the htslib code checks things like headers starting with @,
nl vs cr-nl, nul termination, etc, not all BAM implementations are as
friendly. The specification doesn't state for example that we must
have nl and not cr-nl for the header in BAM.
This code sanitises on input, also fixing samtools/samtools#661 in the
process.