Trivial SAM header sanitising. #509

jkbonfield · 2017-04-10T16:27:49Z

Although the htslib code checks things like headers starting with @,
nl vs cr-nl, nul termination, etc, not all BAM implementations are as
friendly. The specification doesn't state for example that we must
have nl and not cr-nl for the header in BAM.

This code sanitises on input, also fixing samtools/samtools#661 in the
process.

jmarshall · 2017-04-11T06:56:08Z

Us being clever with CR-LF canonicalisation introduces interoperability concerns.

jkbonfield · 2017-04-11T08:09:26Z

So is SAM able to work with both nl and cr-nl? The spec is, naturally, vague regarding this. However traditionally samtools has changed cr-nl to nl. (Try samtools view on a cr-nl SAM file and you'll see them all change.)

Maybe it's wrong to normalised to the UNIX format, but there isn't exactly much guidance out there on what is right or wrong here. What has been the philosophy so far?

jmarshall · 2017-04-11T10:17:41Z

For SAM files, the spec says they are text files and is naturally agnostic about what text files might look like on your platform. In the htslib implementation, accepting either CR or CR-LF as newline for SAM files is a QoI convenience feature and is taken care of by the functions underneath hts_getline(). Not preserving non-plain-LF newlines in output was touched on here (but kseq.h is no longer involved).

What text looks like when it is a blob embedded in a binary format (like the header text in BAM or CRAM) is a separate matter. The specs don't say anything about this and really ought to. We are very concrete about the other characters in our binary formats, so IMHO we should be concrete about the newline-like things in the textish parts of our binary formats and say that they are single 0x0A LF characters. I would think that that's what all the implementations are outputting (even on Windows), but I have not checked.

(I thought we'd talked about text blobs within binary files in the past but I guess it was this thread about bgzipped text files, which is somewhat different — link is to Bob Handsaker's useful point about line terminators in compressed text files.)

jkbonfield · 2017-04-11T11:24:09Z

Agreed and infact I now realise my implementation has a flaw when it comes to SAM on Windows. It'll "correct" the header but not the records, leaving a half-way house which is worse than either of the two conventions. That maybe means the cr-nl to nl translation, if applicable at all, should be on output to BAM rather than input to memory.

My initial rationale was because of the half-baked header parsing API which doesn't appear to have any cr removal at all (lots of '\n' and '\t' in sam_hdr_parse but no '\r'. However it does, bizarrely, appear to work. I haven't dug down deep into it, but don't consider it worthwhile given one day we'll implement the mythical new header parsing API and make all that old code defunct anyway.

So I'll drop that bit from the PR. (Edit: to be specific, I'm not intending on fixing the BAM writing code here either - it can be as ill defined as before and we'll figure out how to handle it in a separate issue/PR.)

jmarshall · 2017-04-11T12:15:05Z

That's because the in-memory h->text is already canonically '\n' only: for SAM, due to hts_getline() as I said; for BAM/CRAM, by assuming it's '\n'-only on disk (testing other implementations and/or spec clarifications are desirable here). So your half-way house fears are unfounded. Perhaps you should dig down into these things before hacking on the code 😄.

Given that you lot may implement a mythical new header parsing API one day, any code added here will go away. So the code in this PR is non-trivial (hence a little risky), changes behaviour (e.g., when the input has multiple NUL padding), and won't be long-lived anyway.

We already over-allocate the buffer by 1. If we're going to accept the missing final newline that motivated a change here, I don't think we should do anything more than something like this (and similarly for CRAM):

@@ -123,6 +123,7 @@ bam_hdr_t *bam_hdr_read(BGZF *fp)
     char buf[4];
     int magic_len, has_EOF;
     int32_t i, name_len, num_names = 0;
+    uint32_t l_nul;
     size_t bufsize;
     ssize_t bytes;
     // check EOF
@@ -145,13 +146,17 @@ bam_hdr_t *bam_hdr_read(BGZF *fp)
     if (bytes != 4) goto read_err;
     if (fp->is_be) ed_swap_4p(&h->l_text);
 
-    bufsize = ((size_t) h->l_text) + 1;
-    if (bufsize < h->l_text) goto nomem; // so large that adding 1 overflowed
+    bufsize = ((size_t) h->l_text) + 2;
+    if (bufsize < h->l_text) goto nomem; // so large that adding 2 overflowed
     h->text = (char*)malloc(bufsize);
     if (!h->text) goto nomem;
-    h->text[h->l_text] = 0; // make sure it is NULL terminated
+    h->text[h->l_text] = h->text[h->l_text+1] = '\0'; // make sure it is NUL-terminated
     bytes = bgzf_read(fp, h->text, h->l_text);
     if (bytes != h->l_text) goto read_err;
+    // make sure it has a final newline (if it's non-empty)
+    l_nul = h->l_text;
+    while (l_nul > 0 && h->text[l_nul-1] == '\0') l_nul--;
+    if (l_nul > 0 && h->text[l_nul-1] != '\n') { h->text[l_nul] = '\n'; h->l_text++; }
 
     bytes = bgzf_read(fp, &h->n_targets, 4);
     if (bytes != 4) goto read_err;

Although the htslib code checks things like headers starting with @, nul termination, etc, not all BAM implementations are as friendly. This code sanitises on input, also fixing samtools/samtools#661 in the process.

Remove some variables that are no longer needed. Make it count lines. Ensure adding '\n' doesn't overflow h->l_text. Simplify logic for adding the '\n' a bit.

Added 'sanitise headers' samtools#509

* NEWS update. (Ideally merge on day on release, editing date if required.) * Update NEWS Added 'sanitise headers' #509

jkbonfield force-pushed the SAM_hdr_nl branch from d22a187 to 9ae2cb5 Compare May 3, 2017 15:46

jkbonfield and others added 2 commits May 7, 2017 23:25

Trivial SAM header sanitising.

72a7c2a

Although the htslib code checks things like headers starting with @, nul termination, etc, not all BAM implementations are as friendly. This code sanitises on input, also fixing samtools/samtools#661 in the process.

Improve sam_hdr_sanitise()

0639ea1

Remove some variables that are no longer needed. Make it count lines. Ensure adding '\n' doesn't overflow h->l_text. Simplify logic for adding the '\n' a bit.

daviesrob force-pushed the SAM_hdr_nl branch from 9ae2cb5 to 0639ea1 Compare May 8, 2017 09:09

daviesrob merged commit 0639ea1 into samtools:develop May 8, 2017

jenniferliddle added a commit to jkbonfield/htslib that referenced this pull request May 8, 2017

Update NEWS

ae7e828

Added 'sanitise headers' samtools#509

jenniferliddle pushed a commit that referenced this pull request May 8, 2017

NEWS update. (#529)

89ea70a

* NEWS update. (Ideally merge on day on release, editing date if required.) * Update NEWS Added 'sanitise headers' #509

jkbonfield deleted the SAM_hdr_nl branch June 5, 2019 17:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Trivial SAM header sanitising. #509

Trivial SAM header sanitising. #509

Uh oh!

jkbonfield commented Apr 10, 2017

Uh oh!

jmarshall commented Apr 11, 2017

Uh oh!

jkbonfield commented Apr 11, 2017 •

edited

Loading

Uh oh!

jmarshall commented Apr 11, 2017

Uh oh!

jkbonfield commented Apr 11, 2017 •

edited

Loading

Uh oh!

jmarshall commented Apr 11, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Trivial SAM header sanitising. #509

Trivial SAM header sanitising. #509

Uh oh!

Conversation

jkbonfield commented Apr 10, 2017

Uh oh!

jmarshall commented Apr 11, 2017

Uh oh!

jkbonfield commented Apr 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmarshall commented Apr 11, 2017

Uh oh!

jkbonfield commented Apr 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmarshall commented Apr 11, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jkbonfield commented Apr 11, 2017 •

edited

Loading

jkbonfield commented Apr 11, 2017 •

edited

Loading