The MetaGraph Sequence Index dataset offers full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). Currently, the index supports searches across more than 10 million individual samples, with this number steadily increasing as indexing efforts continue.
This resource comprises MetaGraph index files constructed from publicly available input data gathered from a variety of sources over the past years. Please see Data sources below for a more detailed description.
- Logan Contig Subset
- SRA High-level Subsets
- SRA MetaGut
- SRA Microbe
- RefSeq
- Unified Human Gastrointestinal Genome
- Tara Oceans
These indexes use pre-assembled contigs from the Logan project as input. For this pre-assembly, singleton k-mers were dropped and the assembly graph was mildly cleaned.
Following the principle of phylogenetic compression, we have hierarchically clustered all samples using information from their biological taxonomy (as far as available). As a result, we currently have a pool of approximately 5,000 individual index chunks, each containing the information of a subset of the samples. Every chunk is assigned to exactly one taxonomic category. Overall, there are approximately 200 taxonomic categories, each containing between a few and over 1,000 individual index chunks. The number of chunks within a category is mostly driven by the number of samples available from that taxonomic group. The chunk size is limited for practical reasons, to allow for parallel construction and querying.
Individual categories were formed by grouping phylogenetically similar samples together. This grouping started at the species level of the taxonomic tree; if too few samples were available to form a chunk, samples were instead aggregated at the taxonomic parent. The resulting list of categories is available here.
All data is available under the following root: `s3://metagraph/all_sra`

```
s3://metagraph/all_sra
+-- data
|   +-- category_A
|   |   +-- chunk_1
|   |   +-- ...
|   +-- ...
+-- metadata
    +-- category_A
    |   +-- chunk_1
    |   +-- ...
    +-- ...
```
Here, `category_A` would be one of the Available categories mentioned above. Likewise, `chunk_1` would be replaced with the running number of the chunk, zero-padded to a total length of four digits. As an example, to reach the data for the 10th chunk of the `metagenome` category, the resulting path would be `s3://metagraph/all_sra/data/metagenome/0010/`.
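For example, assuming the bucket allows anonymous reads (typical for AWS Open Data buckets, but not guaranteed), you could browse the categories and download a chunk with the AWS CLI:

```bash
# List the available categories; drop --no-sign-request if you use credentials
aws s3 ls s3://metagraph/all_sra/data/ --no-sign-request

# Download the 10th chunk of the metagenome category
aws s3 cp s3://metagraph/all_sra/data/metagenome/0010/ ./metagenome_0010/ \
    --recursive --no-sign-request
```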
Irrespective of whether you are in the `data` or the `metadata` branch, each chunk contains a standardized set of files.
In the `data` branch, a chunk contains:
```
annotation.clean.row_diff_brwt.annodbg
annotation.clean.row_diff_brwt.annodbg.md5
graph.primary.small.dbg
graph.primary.small.dbg.md5
```
Both files ending in `dbg` are needed for a full-text query; together they form the MetaGraph index. The files ending in `md5` are checksums that let you verify correct transfer of the data after download.
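As a sketch, you could compare the computed and expected checksums after a download; if the `.md5` files follow the standard `md5sum` format (an assumption here), `md5sum -c` also works directly:

```bash
# Compute the checksum of the downloaded graph file ...
md5sum graph.primary.small.dbg
# ... and compare it with the expected value shipped alongside
cat graph.primary.small.dbg.md5
```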
In the `metadata` branch, a chunk contains:

```
metadata.tsv.gz
```
This is a gzip-compressed, human-readable text file containing additional information about the samples contained within each index chunk.
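To take a quick look at the metadata of a chunk without keeping the file around, you could stream it through `gunzip` (again assuming anonymous access works):

```bash
# Show the first rows of the metadata table for chunk 0010 of metagenome
aws s3 cp s3://metagraph/all_sra/metadata/metagenome/0010/metadata.tsv.gz - \
    --no-sign-request | gunzip -c | head
```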
These datasets contain indexes formed by grouping SRA samples based on high-level taxonomic groups, prioritizing DNA whole-genome sequencing (WGS) samples. The samples were selected using metadata queried via NCBI SRA's BigQuery interface and underwent a cleaning pipeline to ensure data quality. Sequencing data from long-read technologies such as PacBio and Oxford Nanopore is excluded.
Specifically, the following groups were formed:
- fungi
- human
- metazoa (excluding human)
- plants
Each group has its data available at a dedicated S3 path (e.g., `s3://metagraph/fungi`). Each dataset includes the graph file and the annotation file required for MetaGraph querying. MD5 checksums are provided for data integrity validation.
Example directory structure for `fungi`:

```
s3://metagraph/fungi
+-- annotation.relaxed.relabeled.row_diff_brwt.annodbg
+-- annotation.relaxed.relabeled.row_diff_brwt.annodbg.md5
+-- graph_primary_small.dbg
+-- graph_primary_small.dbg.md5
```
Please note that the `metazoa` and `human` datasets have been split into chunks for simplified processing. The `metazoa` data is available as 17 separate chunks, each containing a specific graph and annotation file that need to be combined for a query. The `human` data consists of a joint human graph as well as 11 separate annotation parts that can be used sequentially to retrieve different label sets, as sketched below.
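As a minimal sketch of this sequential pattern for `human` (the file names below are hypothetical; list the bucket contents to obtain the real ones), one could loop over the annotation parts against the joint graph:

```bash
# Hypothetical file names: one joint graph plus annotation parts 1..11
for part in $(seq 1 11); do
    metagraph query -i human_graph.dbg \
        -a "human_annotation_part_${part}.annodbg" \
        query.fasta >> results.txt
done
```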
- Fungi (NCBI TaxID: 4751): 125,585 samples processed; 121,900 (97.1%) successfully cleaned.
- Plants (Viridiplantae; TaxID: 33090): 576,226 samples processed; 531,714 (92.3%) successfully cleaned.
- Human (Homo sapiens; TaxID: 9606): 454,252 samples processed; 436,494 (96.1%) successfully cleaned. Included assay types: WGS, AMPLICON, WXS, WGA, WCS, CLONE, POOLCLONE, FINISHING.
- Metazoa (excluding human; TaxID: 33208): 906,401 samples processed; 805,239 (88.8%) successfully cleaned.
This subset replicates the original sample collection used in the BIGSI project. Though the data is outdated relative to the current SRA, it provides a valuable benchmark due to its historical significance.
The data is available at `s3://metagraph/microbe` and has the following layout, containing both the graph and annotation files and the respective checksums for integrity checks:

```
s3://metagraph/microbe
+-- annotation.relaxed.relabeled.row_diff_brwt.annodbg
+-- annotation.relaxed.relabeled.row_diff_brwt.annodbg.md5
+-- graph_primary_small.dbg
+-- graph_primary_small.dbg.md5
```
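After downloading these four files, a local query against the microbe index could look as follows (a sketch reusing the CLI pattern shown in the query examples below; `query.fasta` stands in for your own input):

```bash
# Run a local MetaGraph query against the downloaded microbe index
metagraph query -i graph_primary_small.dbg \
    -a annotation.relaxed.relabeled.row_diff_brwt.annodbg \
    query.fasta
```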
- 446,506 microbial genome sequences.
This dataset comprises all samples classified under the human gut metagenome (TaxID: 408170), including both WGS and AMPLICON assay types, and excluding long-read sequencing platforms.
The data is available at `s3://metagraph/metagut` and has the following layout, containing both the graph and annotation files and the respective checksums for integrity checks:

```
s3://metagraph/metagut
+-- annotation_cluster_original.relaxed.row_diff_brwt.annodbg
+-- annotation_cluster_original.relaxed.row_diff_brwt.annodbg.md5
+-- graph_primary_small.dbg
+-- graph_primary_small.dbg.md5
```
- 241,384 total samples:
- 176,735 (73.2%) AMPLICON
- 64,849 (26.9%) WGS
The RefSeq index is built from the NCBI Reference Sequence (RefSeq) database, which provides a non-redundant, curated collection of genomic DNA, transcript, and protein sequences representing assembled reference genomes.
- Based on RefSeq release 97, covering:
  - 1.7 Tbp total sequence length
  - Compressed data size: 483 GB (`gzip -9`)
Three separate indexes were constructed from the RefSeq data, each offering a different level of annotation granularity:
- **Genus-level Taxonomy Annotation**
  - Annotated with NCBI Taxonomy IDs at the genus level
  - 85,375 binary annotation columns
- **Accession-level Annotation**
  - Annotated with sequence accessions
  - 32,881,348 binary annotation columns
- **K-mer Coordinate Annotation**
  - Annotated with k-mer coordinates, split by taxonomy buckets
  - 85,375 annotation columns with coordinate tuples
This dataset consists of high-quality assemblies of genomes from human gut microbiomes, catalogued here. We provide indexes for v1.0 of the dataset.
The data is available at `s3://metagraph/uhgg_catalogue` and `s3://metagraph/uhgg_all`. An example layout for `uhgg_all`, containing both the graph and annotation files and the respective checksums for integrity checks:

```
s3://metagraph/uhgg_all
+-- annotation.relaxed.relabeled.row_diff_brwt.annodbg
+-- annotation.relaxed.relabeled.row_diff_brwt.annodbg.md5
+-- graph_complete_k31.dbg
+-- graph_complete_k31.dbg.md5
+-- graph_complete_k31.small.dbg
+-- graph_complete_k31.small.dbg.md5
```
- UHGG (catalogue):
- 4,644 reference genomes
- UHGG (all sequences):
- 286,997 non-redundant genomes
This collection contains genomes reconstructed from metagenomic datasets from major oceanographic surveys of global ocean microbial communities across ocean basins, depth layers, and time. The dataset is augmented with reference genome sequences of marine bacteria and archaea from other existing databases. For more details on the data composition, refer to the original publication.
The data is available at `s3://metagraph/tara_oceans`. Indexes are provided for both the genomes (`genomes_*`) and the scaffolds (`scaffolds_*`). The genome index has annotations both for k-mer presence/absence (i.e., binary; `genomes_annotation.row_diff_brwt.annodbg`) and for coordinates (`genomes_annotation.row_diff_brwt_coord.annodbg`). The layout for `tara_oceans`, containing both the graph and annotation files and the respective checksums for integrity checks:
```
s3://metagraph/tara_oceans
+-- genomes_annotation.row_diff_brwt.annodbg
+-- genomes_annotation.row_diff_brwt.annodbg.md5
+-- genomes_annotation.row_diff_brwt_coord.annodbg
+-- genomes_annotation.row_diff_brwt_coord.annodbg.md5
+-- genomes_graph.dbg
+-- genomes_graph.dbg.md5
+-- genomes_graph_small.dbg
+-- genomes_graph_small.dbg.md5
+-- scaffolds_annotation.row_diff_flat.annodbg
+-- scaffolds_annotation.row_diff_flat.annodbg.md5
+-- scaffolds_graph.dbg
+-- scaffolds_graph.dbg.md5
+-- scaffolds_graph_small.dbg
+-- scaffolds_graph_small.dbg.md5
```
- 34,815 genomes
- 318,205,057 scaffolds
The following steps describe how to set up a search query across all or a subset of available index files.
Please refer to the AWS documentation for installation instructions and prerequisites.
For the third step, we recommend using Single Sign-On (SSO) authentication via IAM Identity Center:

```bash
aws configure sso
```

You can find the SSO Start URL in your AWS access portal. Please make sure to select `default` when prompted for the profile name.
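Once configured, you can renew an expired SSO session at any time:

```bash
# Re-authenticate when the SSO session token expires
aws sso login --profile default
```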
Alternatively, you can set up your credentials using the following environment variables:
```bash
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."
```
or through creation of the plain-text file `~/.aws/credentials` with the following content:

```
[default]
aws_access_key_id=...
aws_secret_access_key=...
aws_session_token=...
```
You can find specific tokens and keys in the "Access keys" section of your AWS access portal after signing in.
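Either way, you can check that the credentials are picked up correctly before proceeding:

```bash
# Confirm the active identity; this should print your account ID and ARN
aws sts get-caller-identity
```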
```bash
git clone https://github.com/ratschlab/metagraph-open-data.git
cd metagraph-open-data
```
We assume that you work in the `eu-central-2` region and that your `aws` authentication is configured in the `default` profile.
The deployment script will set up the following in your AWS account using the CloudFormation template:
- S3 bucket to store your queries and their results;
- AWS Batch environment to execute the queries;
- Step Function and Lambdas to schedule your queries as individual Batch tasks and merge their results;
- SNS topic to send notifications to when the query is fully processed.
If you want to receive Simple Notification Service (SNS) notifications after a query is processed, you have to provide your email to the script using the `--email [email protected]` argument. You will need to confirm the subscription via a link sent to your mailbox:
```bash
scripts/deploy-metagraph.sh --email [email protected]
```
If you want to use your own Amazon Machine Image (AMI) for AWS Batch jobs (e.g., for security reasons or to support newer MetaGraph features), use `--ami ami-...` to provide your AMI ID, or request that it be built using your AWS resources via `--ami build`. The latter uses EC2 and may take up to 30 minutes!
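For example, a deployment with both e-mail notifications and a freshly built AMI would look like this:

```bash
# Deploy with SNS notifications and build a custom AMI on your AWS resources
scripts/deploy-metagraph.sh --email [email protected] --ami build
```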
```bash
scripts/upload-query.sh examples/test_query.fasta
```
You can upload your own queries by providing `/path/to/query.fasta` instead of `examples/test_query.fasta`. You can also upload `examples/100_studies_short.fq` if you would like to test the setup on a larger query.
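For instance:

```bash
# Upload your own query file, or the larger example from the repository
scripts/upload-query.sh /path/to/query.fasta
scripts/upload-query.sh examples/100_studies_short.fq
```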
You need to describe your query in a JSON file. A minimal job definition (`examples/scheduler-payload.json`) looks as follows:

```json
{
    "index_prefix": "all_sra",
    "query_filename": "test_query.fasta",
    "index_filter": ".*000[1-5]$"
}
```
As of now, only dataset indexes stored in `s3://metagraph` are supported. Generally, the arguments that you can provide are as follows:
- `index_prefix`, e.g. `all_sra` or `all_sra/data/metagenome`. Only chunks in the subdirectories of `index_prefix` will be considered for querying.
- `query_filename`, the filename of the query that you previously uploaded via `scripts/upload-query.sh`.
- `index_filter` (`.*` by default), a `re`-compatible regular expression to filter the paths of the chunks on which the query is to be executed (see the example below).
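For instance, a payload like the following (with hypothetical values) would restrict the query to chunks 0001 through 0005 of the metagenome category:

```json
{
    "index_prefix": "all_sra/data/metagenome",
    "query_filename": "test_query.fasta",
    "index_filter": ".*000[1-5]$"
}
```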
Additionally, you can specify the following parameters to be passed to the MetaGraph CLI for all queried chunks:
- `query_mode` (`labels` by default),
- `num_top_labels` (`inf` by default),
- `min_kmers_fraction_label` (`0.7` by default),
- `min_kmers_fraction_graph` (`0.0` by default).
You can submit the query for execution with the following command:
```bash
scripts/start-metagraph-job.sh examples/scheduler-payload.json
```
It will create a dedicated AWS Batch job for each queried chunk, adjusting allocated memory (RAM) to the chunk size.
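You can follow the progress of these jobs in the AWS console or via the CLI; the job queue name is created by the CloudFormation stack, so the placeholder below must be replaced accordingly:

```bash
# List the currently running Batch jobs for your query
aws batch list-jobs --job-queue <your-job-queue> --job-status RUNNING
```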
You can use our example JSON payload for the large query in `examples/large-query.json`:
```json
{
    "index_prefix": "all_sra",
    "query_filename": "100_studies_short.fq",
    "index_filter": ".*001[0-9]$",
    "query_mode": "matches",
    "num_top_labels": "10",
    "min_kmers_fraction_label": "0"
}
```
This will execute the following command for all chunks from `0010` to `0019`:
```bash
metagraph query -i graph.primary.small.dbg \
    -a annotation.clean.row_diff_brwt.annodbg \
    --query-mode matches \
    --num-top-labels 10 \
    --min-kmers-fraction-label 0 \
    --min-kmers-fraction-graph 0 \
    100_studies_short.fq
```
Then, it will save the resulting file to S3. When all chunks are processed, a dedicated script will merge the results into a single file and send you a notification.
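Once the notification arrives, the merged file can be retrieved from the query bucket created during deployment (the bucket name is defined by the CloudFormation stack; the placeholders below must be replaced accordingly):

```bash
# Inspect the bucket and download the merged result file
aws s3 ls s3://<your-query-bucket>/ --recursive
aws s3 cp s3://<your-query-bucket>/<path-to-merged-results> .
```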