Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
418a183
First stage of supporting large chromosomes.
jkbonfield May 31, 2018
9e2984a
ABI/API placeholder for 64-bit positions in CRAM.
jkbonfield Jun 1, 2018
00b72b6
Expose hts_parse_reg64() as part of external API
jkbonfield Jun 1, 2018
a19593a
More CRAM LARGE_POS fixes for testing 64-bit positions.
jkbonfield Jun 4, 2018
24436de
More cram codec improvements for 64-bit quantities.
jkbonfield Jun 4, 2018
eb2334e
Further 64-bit position API/ABI changes.
jkbonfield Jun 4, 2018
835e133
Change from int64_t to hts_pos_t typedef.
jkbonfield Jun 5, 2018
329d2b9
Make headers 64-bit compliant
daviesrob Jul 31, 2019
3108bee
Eliminate struct holes in bam1_core_t and bam1_t
daviesrob Jul 31, 2019
3d49c61
Rearrange bcf1_t struct to eliminate hole.
daviesrob Aug 23, 2019
c557f72
Make regions and indexing work with long references
daviesrob Sep 9, 2019
674714e
Add large position tests
daviesrob Sep 10, 2019
c804f81
Parse / format bcf1_t::pos as 64 bit
daviesrob Sep 10, 2019
2e9bfe3
Fix n_lvls calculation in vcf_idx_init.
daviesrob Sep 10, 2019
10709d6
Store > 32 bit reference lengths in bcf_idinfo_t contig data
daviesrob Sep 11, 2019
1606913
Allow storage of 64 bit INFO values for vcf files
daviesrob Sep 12, 2019
9ed0641
Attempt to make tabix work for VCF and SAM with long references
daviesrob Sep 12, 2019
0f0d091
Make synced_bcf_reader work with 64 bit positions.
daviesrob Sep 12, 2019
9c943da
Make regidx work with 64 bit positions
daviesrob Sep 13, 2019
4f1a3fc
Add VCF long reference tests.
daviesrob Sep 11, 2019
995069d
Use hts_pos_t in tweak_overlap_quality() and related functions
daviesrob Sep 19, 2019
f84bba1
Add bcf_get_info_int64; make bcf_update_info_int64 a static inline
daviesrob Sep 19, 2019
983244b
Add NEWS update and README.large_positions.md file
daviesrob Sep 26, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ lib*.so.*
/test/fieldarith
/test/hfile
/test/hts_endian
/test/longrefs/*.tmp.*
/test/pileup
/test/sam
/test/tabix/*.tmp.*
Expand Down
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -532,7 +532,7 @@ htslib-uninstalled.pc: htslib.pc.tmp


testclean:
-rm -f test/*.tmp test/*.tmp.* test/tabix/*.tmp.* test/tabix/FAIL*
-rm -f test/*.tmp test/*.tmp.* test/longrefs/*.tmp.* test/tabix/*.tmp.* test/tabix/FAIL*

mostlyclean: testclean
-rm -f *.o *.pico cram/*.o cram/*.pico test/*.o test/*.dSYM version.h
Expand Down
12 changes: 12 additions & 0 deletions NEWS
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,18 @@ Noteworthy changes in release a.b
* Incompatible changes: Several functions and data types have been changed
in this release, and the shared library soversion has been bumped.

- HTSlib now supports 64 bit reference positions. This means several
structures, function parameters, and return values have been made bigger
to allow larger values to be stored. While most code that uses
HTSlib interfaces should still build after this change, some alterations
may be needed - notably to printf() formats where the values of structure
members are being printed.

Due to file format limitations, large positions are only supported
when reading and writing SAM and VCF files.

See README.large_positions.md for more information.

- An extra field has been added to the kbitset_t struct so bitsets can
be made smaller (and later enlarged) without involving memory allocation.

Expand Down
231 changes: 231 additions & 0 deletions README.large_positions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,231 @@
# HTSlib 64 bit reference positions

HTSlib version 1.10 onwards internally use 64 bit reference positions. This
is to support analysis of species like axolotl, tulip and marbled lungfish
which have, or are expected to have, chromosomes longer than two gigabases.

# File format support

Currently 64 bit positions can only be stored in SAM and VCF format files.
Binary BAM, CRAM and BCF cannot be used due to limitations in the formats
themselves. As SAM and VCF are text formats, they have no limit on the
size of numeric values.

# Compatibility issues to check

Various data structure members, function parameters, and return values have
been expanded from 32 to 64 bits. As a result, some changes may be needed to
code that uses the library, even if it does not support long references.

## Variadic functions taking format strings

The type of various structure members (e.g. `bam1_core_t::pos`) and return
values from some functions (e.g. `bam_cigar2rlen()`) have been changed to
`hts_pos_t`, which is a 64-bit signed integer. Using these in 32-bit
code will generally work (as long as the stored positions are within range),
however care needs to be taken when these values are passed directly
to functions like `printf()` which take a variable-length argument list and
a format string.

Header file `htslib/hts.h` defines macro `PRIhts_pos` which can be
used in `printf()` format strings to get the correct format specifier for
an `hts_pos_t` value. Code that needs to print positions should be
changed from:

```c
printf("Position is %d\n", bam->core.pos);
```

to:

```c
printf("Position is %"PRIhts_pos"\n", bam->core.pos);
```

If for some reason compatibility with older versions of HTSlib (which do
not have `hts_pos_t` or `PRIhts_pos`) is needed, the value can be cast to
`int64_t` and printed as an explicitly 64-bit value:

```c
#include <inttypes.h> // For PRId64 and int64_t

printf("Position is %" PRId64 "\n", (int64_t) bam->core.pos);
```

Passing incorrect types to variadic functions like `printf()` can lead
to incorrect behaviour and security risks, so it important to track down
and fix all of the places where this may happen. Modern C compilers like
gcc (version 3.0 onwards) and clang can check `printf()` and `scanf()`
parameter types for compatibility against the format string. To
enable this, build code with `-Wall` or `-Wformat` and fix all the
reported warnings.

Where functions that take `printf`-style format strings are implemented,
they should use the appropriate gcc attributes to enable format string
checking. `htslib/hts_defs.h` includes macros `HTS_FORMAT` and
`HTS_PRINTF_FMT` which can be used to provide the attribute declaration
in a portable way. For example, `test/sam.c` uses them for a function
that prints error messages:

```
void HTS_FORMAT(HTS_PRINTF_FMT, 1, 2) fail(const char *fmt, ...) { /* ... */ }
```

## Implicit type conversions

Conversion of signed `int` or `int32_t` to `hts_pos_t` will always work.

Conversion of `hts_pos_t` to `int` or `int32_t` will work as long as the value
converted is within the range that can be stored in the destination.

Code that casts unsigned `uint32_t` values to signed with the expectation
that the result may be negative will no longer work as `hts_pos_t` can store
values over UINT32_MAX. Such code should be changed to use signed values.

Functions hts_parse_region() and hts_parse_reg64() return special value
`HTS_POS_MAX` for regions which extend to the end of the reference.
This value is slightly smaller than INT64_MAX, but should be larger than
any reference that is likely to be used. When cast to `int32_t` the
result should be `INT32_MAX`.

# Upgrading code to work with 64 bit positions

Variables used to store reference positions should be changed to
type `hts_pos_t`. Use `PRIhts_pos` in format strings when printing them.

When converting positions stored in strings, use `strtoll()` in place of
`atoi()` or `strtol()` (which produces a 32 bit value on 64-bit Windows and
all 32-bit platforms).

Programs which need to look up a reference sequence length from a `sam_hdr_t`
structure should use `sam_hdr_tid2len()` instead of the old
`sam_hdr_t::target_len` array (which is left as 32-bit for reasons of
compatibility). `sam_hdr_tid2len()` returns `hts_pos_t`, so works correctly
for large references.

Various functions which take pointer arguments have new versions which
support `hts_pos_t *` arguments. Code supporting 64-bit positions should
use the new versions. These are:

Original function | 64-bit version
------------------ | --------------------
fai_fetch() | fai_fetch64()
fai_fetchqual() | fai_fetchqual64()
faidx_fetch_seq() | faidx_fetch_seq64()
faidx_fetch_qual() | faidx_fetch_qual64()
hts_parse_reg() | hts_parse_reg64() or hts_parse_region()
bam_plp_auto() | bam_plp64_auto()
bam_plp_next() | bam_plp64_next()
bam_mplp_auto() | bam_mplp64_auto()

Limited support has been added for 64-bit INFO values in VCF files, for large
values in structural variant END tags. New functions `bcf_update_info_int64()`
and `bcf_get_info_int64()` can be used to set and fetch 64-bit INFO values.
They both take arrays of `int64_t`. `bcf_int64_missing` and
`bcf_int64_vector_end` can be used to set missing and vector end values in
these arrays. The INFO data is stored in the minimum size needed, so there
is no harm in using these functions to store smaller integer values.

# Structure members that have changed size

```
File htslib/hts.h:
hts_pair32_t::begin
hts_pair32_t::end

(typedef hts_pair_pos_t is provided as a better-named replacement for hts_pair32_t)

hts_reglist_t::min_beg
hts_reglist_t::max_end

hts_itr_t::beg
hts_itr_t::end
hts_itr_t::curr_beg
hts_itr_t::curr_end

File htslib/regidx.h:
reg_t::start
reg_t::end

File htslib/sam.h:
bam1_core_t::pos
bam1_core_t::mpos
bam1_core_t::isize

File htslib/synced_bcf_reader.h:
bcf_sr_regions_t::start
bcf_sr_regions_t::end
bcf_sr_regions_t::prev_start

File htslib/vcf.h:
bcf_idinfo_t::info

bcf_info_t::v1::i

bcf1_t::pos
bcf1_t::rlen
```

# Functions where parameters or the return value have changed size

Functions are annotated as follows:

* `[new]` The function has been added since version 1.9
* `[parameters]` Function parameters have changed size
* `[return]` Function return value has changed size

```
File htslib/faidx.h:

[new] fai_fetch64()
[new] fai_fetchqual64()
[new] faidx_fetch_seq64()
[new] faidx_fetch_qual64()
[new] fai_parse_region()

File htslib/hts.h:

[parameters] hts_idx_push()
[new] hts_parse_reg64()
[parameters] hts_itr_query()
[parameters] hts_reg2bin()

File htslib/kstring.h:

[new] kputll()

File htslib/regidx.h:

[parameters] regidx_overlap()

File htslib/sam.h:

[new] sam_hdr_tid2len()
[return] bam_cigar2rlen()
[return] bam_endpos()
[parameters] bam_itr_queryi()
[parameters] sam_itr_queryi()
[new] bam_plp64_next()
[new] bam_plp64_auto()
[new] bam_mplp64_auto()
[parameters] sam_cap_mapq()
[parameters] sam_prob_realn()

File htslib/synced_bcf_reader.h:

[parameters] bcf_sr_seek()
[parameters] bcf_sr_regions_overlap()

File htslib/tbx.h:

[parameters] tbx_readrec()

File htslib/vcf.h:

[parameters] bcf_readrec()
[new] bcf_update_info_int64()
[new] bcf_get_info_int64()
[return] bcf_dec_int1()
[return] bcf_dec_typed_int1()

```
6 changes: 3 additions & 3 deletions bcf_sr_sort.c
Original file line number Diff line number Diff line change
Expand Up @@ -288,7 +288,7 @@ void debug_vbuf(sr_sort_t *srt)
for (i=0; i<srt->sr->nreaders; i++)
{
vcf_buf_t *buf = &srt->vcf_buf[i];
fprintf(stderr,"\t%d", buf->rec[j] ? buf->rec[j]->pos+1 : 0);
fprintf(stderr,"\t%"PRIhts_pos, buf->rec[j] ? buf->rec[j]->pos+1 : 0);
}
fprintf(stderr,"\n");
}
Expand Down Expand Up @@ -330,7 +330,7 @@ int bcf_sr_sort_add_active(sr_sort_t *srt, int idx)
srt->active[srt->nactive - 1] = idx;
return 0; // FIXME: check for errs in this function
}
static int bcf_sr_sort_set(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, int min_pos)
static int bcf_sr_sort_set(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, hts_pos_t min_pos)
{
if ( !srt->grp_str2int )
{
Expand Down Expand Up @@ -556,7 +556,7 @@ static int bcf_sr_sort_set(bcf_srs_t *readers, sr_sort_t *srt, const char *chr,
return 0; // FIXME: check for errs in this function
}

int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, int min_pos)
int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, hts_pos_t min_pos)
{
int i,j;
assert( srt->nactive>0 );
Expand Down
5 changes: 3 additions & 2 deletions bcf_sr_sort.h
Original file line number Diff line number Diff line change
Expand Up @@ -90,15 +90,16 @@ typedef struct
int moff, noff, *off, mcharp;
char **charp;
const char *chr;
int pos, nsr, msr;
hts_pos_t pos;
int nsr, msr;
int pair;
int nactive, mactive, *active; // list of readers with lines at the current pos
}
sr_sort_t;

sr_sort_t *bcf_sr_sort_init(sr_sort_t *srt);
void bcf_sr_sort_reset(sr_sort_t *srt);
int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, int pos);
int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, hts_pos_t pos);
int bcf_sr_sort_set_active(sr_sort_t *srt, int i);
int bcf_sr_sort_add_active(sr_sort_t *srt, int i);
void bcf_sr_sort_destroy(sr_sort_t *srt);
Expand Down
5 changes: 3 additions & 2 deletions bgzf.c
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,8 @@ enum mtaux_cmd {
// When multi-threaded bgzf_tell won't work, so we delay the hts_idx_push
// until we've written the last block.
typedef struct {
int tid, beg, end, is_mapped; // args for hts_idx_push
hts_pos_t beg, end;
int tid, is_mapped; // args for hts_idx_push
uint64_t offset, block_number;
} hts_idx_cache_entry;

Expand Down Expand Up @@ -183,7 +184,7 @@ struct __bgzidx_t
* Returns 0 on success,
* -1 on failure
*/
int bgzf_idx_push(BGZF *fp, hts_idx_t *hidx, int tid, int beg, int end, uint64_t offset, int is_mapped) {
int bgzf_idx_push(BGZF *fp, hts_idx_t *hidx, int tid, hts_pos_t beg, hts_pos_t end, uint64_t offset, int is_mapped) {
hts_idx_cache_entry *e;
mtaux_t *mt = fp->mt;

Expand Down
Loading