Skip to content

Conversation

@jmarshall
Copy link
Member

Another ABI breaker that was in danger of being forgotten about while the ABI change window is open (leastwise I for one had forgotten about it!): change l_qname to uint16_t so QNAME length can be returned to the spec's 254 from 079c7b8's 251 — see #520.

As the soversion is being bumped, we have the luxury of fixing this by simply enlarging the l_qname field. It needs to represent the values 0…256 so most of the new bits are wasted, but ¯\_(ツ)_/¯ and maybe one day we may have a format wanting longer QNAMEs…

Adds test cases with long read names. I wish I had though of such test cases when I first implemented the align-the-CIGAR-data thing in 5d114eb!

Fixes #520. Supersedes and closes #523.

So that CIGAR data is 32-bit aligned, there are extra padding NULs
in memory after qname (see 5d114eb).
For simplicity in the bam_get_cigar() et al macros, these NULs are
counted in both l_qname and l_extranul -- but this means that the
maximum value for l_qname is 256 (namely a QNAME of length >= 252,
plus one NUL terminator, plus more NULs to get to a multiple of 4).

Make l_qname a uint16_t and (to retain the simplicity) leave the
l_qname/l_extranul representation as it is. Rearrange the two fields
so that the larger field takes over the previous unused1's space, so
that sizeof(bam1_core_t) is unchanged.

Revert the checks added by 079c7b8,
which are no longer needed. In bam_construct_seq(), just trust the
length obtained from the CRAM record; in bam_read1(), the l_qname
value has come from a 1-byte raw field, so there is no longer any
overflow risk; in sam_parse1(), revert the check to check against
the SAM spec's exact 254 maximum length (+ 1 for the following TAB).

Add test cases with long read names.

Fixes samtools#520.
uint8_t l_extranul;
uint16_t flag;
uint16_t l_qname;
uint32_t n_cigar;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff is pretty unhelpfully displayed. Equivalently and more clearly:

     int32_t pos;
     uint16_t bin;
     uint8_t qual;
-    uint8_t l_qname;
+    uint8_t l_extranul;
     uint16_t flag;
-    uint8_t unused1;
-    uint8_t l_extranul;
+    uint16_t l_qname;
     uint32_t n_cigar;
     int32_t l_qseq;
     int32_t mtid;

@jmarshall
Copy link
Member Author

jmarshall commented Jul 24, 2019

I suppose, to be precisely accurate, it needs to represent 4…256 inclusive.

So we could put + 4 in the various bam_get_cigar macros:

#define bam_get_cigar(b) ((uint32_t*)((b)->data + (b)->core.l_qname + 4)
// etc

and l_qname would store 0…252 and thus still fit in uint8_t! However this would be at the cost of breaking bad third-party code that does this computation itself rather than using bam_get_cigar() etc, which was part of the reason for counting the extra NULs in l_qname in the first place…

Edited to add: Certainly just expanding the field to uint16_t will make for clearer code…

@daviesrob
Copy link
Member

I think uint16_t is the easiest solution. #709 really messes with the contents of bam1_core_t so more rearrangement is going to happen anyway if we want to keep the structure hole free.

Possibly we need some checks in bam_write1() as it's possible to programmatically make read names that don't fit in a BAM file. In theory we could also relax the limit in SAM (and CRAM), although I suspect it would be a bad idea.

It's possible to programmatically make a longer QNAME, which would
lead to outputting a corrupted record in BAM.
daviesrob added a commit to daviesrob/htslib that referenced this pull request Jul 30, 2019
Bumped TWO_TO_THREE_TRANSITION_COUNT due to bam1_core_t changes.
@daviesrob daviesrob merged commit 92fc5ab into samtools:develop Jul 30, 2019
@daviesrob
Copy link
Member

Thanks, long QNAMEs are now restored.

@jmarshall jmarshall deleted the qname-upto-254 branch July 30, 2019 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

l_qname overflows for QNAMEs of 252 or more characters

2 participants