Skip to content

Conversation

@jkbonfield
Copy link
Contributor

Although the htslib code checks things like headers starting with @,
nl vs cr-nl, nul termination, etc, not all BAM implementations are as
friendly. The specification doesn't state for example that we must
have nl and not cr-nl for the header in BAM.

This code sanitises on input, also fixing samtools/samtools#661 in the
process.

@jmarshall
Copy link
Member

Us being clever with CR-LF canonicalisation introduces interoperability concerns.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Apr 11, 2017

So is SAM able to work with both nl and cr-nl? The spec is, naturally, vague regarding this. However traditionally samtools has changed cr-nl to nl. (Try samtools view on a cr-nl SAM file and you'll see them all change.)

Maybe it's wrong to normalised to the UNIX format, but there isn't exactly much guidance out there on what is right or wrong here. What has been the philosophy so far?

@jmarshall
Copy link
Member

For SAM files, the spec says they are text files and is naturally agnostic about what text files might look like on your platform. In the htslib implementation, accepting either CR or CR-LF as newline for SAM files is a QoI convenience feature and is taken care of by the functions underneath hts_getline(). Not preserving non-plain-LF newlines in output was touched on here (but kseq.h is no longer involved).

What text looks like when it is a blob embedded in a binary format (like the header text in BAM or CRAM) is a separate matter. The specs don't say anything about this and really ought to. We are very concrete about the other characters in our binary formats, so IMHO we should be concrete about the newline-like things in the textish parts of our binary formats and say that they are single 0x0A LF characters. I would think that that's what all the implementations are outputting (even on Windows), but I have not checked.

(I thought we'd talked about text blobs within binary files in the past but I guess it was this thread about bgzipped text files, which is somewhat different — link is to Bob Handsaker's useful point about line terminators in compressed text files.)

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Apr 11, 2017

Agreed and infact I now realise my implementation has a flaw when it comes to SAM on Windows. It'll "correct" the header but not the records, leaving a half-way house which is worse than either of the two conventions. That maybe means the cr-nl to nl translation, if applicable at all, should be on output to BAM rather than input to memory.

My initial rationale was because of the half-baked header parsing API which doesn't appear to have any cr removal at all (lots of '\n' and '\t' in sam_hdr_parse but no '\r'. However it does, bizarrely, appear to work. I haven't dug down deep into it, but don't consider it worthwhile given one day we'll implement the mythical new header parsing API and make all that old code defunct anyway.

So I'll drop that bit from the PR. (Edit: to be specific, I'm not intending on fixing the BAM writing code here either - it can be as ill defined as before and we'll figure out how to handle it in a separate issue/PR.)

@jmarshall
Copy link
Member

That's because the in-memory h->text is already canonically '\n' only: for SAM, due to hts_getline() as I said; for BAM/CRAM, by assuming it's '\n'-only on disk (testing other implementations and/or spec clarifications are desirable here). So your half-way house fears are unfounded. Perhaps you should dig down into these things before hacking on the code 😄.

Given that you lot may implement a mythical new header parsing API one day, any code added here will go away. So the code in this PR is non-trivial (hence a little risky), changes behaviour (e.g., when the input has multiple NUL padding), and won't be long-lived anyway.

We already over-allocate the buffer by 1. If we're going to accept the missing final newline that motivated a change here, I don't think we should do anything more than something like this (and similarly for CRAM):

@@ -123,6 +123,7 @@ bam_hdr_t *bam_hdr_read(BGZF *fp)
     char buf[4];
     int magic_len, has_EOF;
     int32_t i, name_len, num_names = 0;
+    uint32_t l_nul;
     size_t bufsize;
     ssize_t bytes;
     // check EOF
@@ -145,13 +146,17 @@ bam_hdr_t *bam_hdr_read(BGZF *fp)
     if (bytes != 4) goto read_err;
     if (fp->is_be) ed_swap_4p(&h->l_text);
 
-    bufsize = ((size_t) h->l_text) + 1;
-    if (bufsize < h->l_text) goto nomem; // so large that adding 1 overflowed
+    bufsize = ((size_t) h->l_text) + 2;
+    if (bufsize < h->l_text) goto nomem; // so large that adding 2 overflowed
     h->text = (char*)malloc(bufsize);
     if (!h->text) goto nomem;
-    h->text[h->l_text] = 0; // make sure it is NULL terminated
+    h->text[h->l_text] = h->text[h->l_text+1] = '\0'; // make sure it is NUL-terminated
     bytes = bgzf_read(fp, h->text, h->l_text);
     if (bytes != h->l_text) goto read_err;
+    // make sure it has a final newline (if it's non-empty)
+    l_nul = h->l_text;
+    while (l_nul > 0 && h->text[l_nul-1] == '\0') l_nul--;
+    if (l_nul > 0 && h->text[l_nul-1] != '\n') { h->text[l_nul] = '\n'; h->l_text++; }
 
     bytes = bgzf_read(fp, &h->n_targets, 4);
     if (bytes != 4) goto read_err;

jkbonfield and others added 2 commits May 7, 2017 23:25
Although the htslib code checks things like headers starting with @,
nul termination, etc, not all BAM implementations are as friendly.

This code sanitises on input, also fixing samtools/samtools#661 in the
process.
Remove some variables that are no longer needed.
Make it count lines.
Ensure adding '\n' doesn't overflow h->l_text.
Simplify logic for adding the '\n' a bit.
@daviesrob daviesrob merged commit 0639ea1 into samtools:develop May 8, 2017
jenniferliddle added a commit to jkbonfield/htslib that referenced this pull request May 8, 2017
Added 'sanitise headers' samtools#509
jenniferliddle pushed a commit that referenced this pull request May 8, 2017
* NEWS update.

(Ideally merge on day on release, editing date if required.)

* Update NEWS

Added 'sanitise headers' #509
@jkbonfield jkbonfield deleted the SAM_hdr_nl branch June 5, 2019 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants