MAGmax is a dereplication tool designed to maximize the recovery of Metagenome-Assembled Genomes (MAGs) through bin Merging and reAssembly. It performs dereplication in three stages: (i) grouping bins based on average sequence identity, (ii) merging bins within each group, and (iii) reassembling the merged bins.
MAGmax requires three input directories:
- <binsdir>, directory containing bin files in FASTA format that need to be dereplicated (e.g., output files from any metagenome binning tool).
- <readdir>, directory containing read files in FASTQ format for each sample.
- <mapid_dir>, directory containing mapping files for each sample. Each file is a text file listing read IDs and the corresponding contig IDs they mapped to. These files are used to retrieve reads that map to each merged bin from the FASTQ files in <readdir> and to generate new bin-specific FASTQ files for reassembly.
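To illustrate how a mapid file links reads to a merged bin, the sketch below collects the IDs of reads that map to a given set of contigs. It is not MAGmax's implementation: the helper name reads_for_contigs, the sample ID S1, and the file names are hypothetical, and the two-column layout (read ID, then contig ID) is assumed from the mapid example in this document.

```shell
#!/bin/sh
# Sketch only: collect the IDs of reads that map to the contigs of one bin.
# Assumes a two-column mapid file: read ID, then contig ID.
# reads_for_contigs is a hypothetical helper, not part of MAGmax.
reads_for_contigs() {
    # First file: remember the wanted contig IDs.
    # Second file: print the read ID when its contig is wanted.
    awk 'NR == FNR { wanted[$1] = 1; next } ($2 in wanted) { print $1 }' \
        "$1" "$2" | sort -u
}

# Toy data for a hypothetical sample S1.
workdir=$(mktemp -d)
printf '%s\n' 'S1Ccontig2' 'S1Ccontig4' > "$workdir/bin_contigs.txt"
printf '%s\n' \
    'read1 S1Ccontig1' \
    'read2 S1Ccontig2' \
    'read2 S1Ccontig4' \
    'read3 S1Ccontig2' \
    'read4 S1Ccontig3' > "$workdir/S1_mapids"

bin_reads=$(reads_for_contigs "$workdir/bin_contigs.txt" "$workdir/S1_mapids")
echo "$bin_reads"
```

A list of read IDs like this could then be passed to seqtk (for example, seqtk subseq reads.fastq ids.txt) to extract the matching reads into a bin-specific FASTQ file.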
An output directory named mags_<x>comp_<y>purity will be created, where x and y correspond to the user-specified completeness and purity thresholds used to select final bins. By default, MAGmax uses 50% for completeness and 95% for purity.
The output directory contains dereplicated bins, and a text file listing the completeness and contamination scores for each bin as calculated by CheckM2.
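As a small illustration of the naming scheme, the sketch below builds the expected output directory name from the two thresholds. It assumes the threshold values are inserted into the name exactly as given; the variable names are hypothetical.

```shell
#!/bin/sh
# Sketch only: shows how the output directory name follows from the thresholds.
completeness=50   # value passed with -c (default)
purity=95         # value passed with -p (default)
outdir="mags_${completeness}comp_${purity}purity"
echo "$outdir"
```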
magmax -b <binsdir> -r <readdir> -m <mapid_dir> -f fasta -t 24
magmax -b <binsdir> -r <readdir> -m <mapid_dir> -f fasta -t 24 -q quality_report.tsv # if CheckM2 result is already available
magmax -b <binsdir> -r <readdir> -m <mapid_dir> -f fasta -t 24 --split # if input bins are not already split by sample id
MAGmax provides an option to perform dereplication without reassembly using the --no-reassembly flag. In this mode, MAGmax selects, within each genomic cluster, the bin with the best quality score (defined as completeness - 5 * contamination) among the bins that meet the user-defined completeness and purity thresholds. When this option is enabled, only the bin directory (-b) is required as input.
magmax -b <binsdir> --no-reassembly -f fasta -t 24
magmax -b <binsdir> --no-reassembly -f fasta -t 24 -q quality_report.tsv # if CheckM2 result is already available
magmax -b <binsdir> --no-reassembly -f fasta -t 24 --split # if input bins are not already split by sample id
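The selection rule used in --no-reassembly mode can be sketched as follows. This is an illustration, not MAGmax's code: the helper select_best_bins and the four-column input layout (cluster, bin, completeness, contamination) are hypothetical, and purity is assumed to be 100 - contamination in percent.

```shell
#!/bin/sh
# Sketch only: pick the best bin per cluster by completeness - 5 * contamination.
# Input columns (hypothetical layout): cluster, bin, completeness, contamination.
# Arguments: minimum completeness, minimum purity, table file.
select_best_bins() {
    awk -F '\t' -v min_comp="$1" -v min_purity="$2" '
        {
            purity = 100 - $4   # assumes purity = 100 - contamination (%)
            if ($3 < min_comp || purity < min_purity) next
            score = $3 - 5 * $4
            if (!($1 in best) || score > best_score[$1]) {
                best[$1] = $2
                best_score[$1] = score
            }
        }
        END { for (c in best) print c "\t" best[c] "\t" best_score[c] }
    ' "$3" | sort
}

# Toy table: binC fails the purity threshold, binE fails completeness.
table=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    c1 binA 92.5 1.0 \
    c1 binB 98.0 4.0 \
    c1 binC 99.0 8.0 \
    c2 binD 60.0 0.5 \
    c2 binE 45.0 0.0 > "$table"

selected=$(select_best_bins 50 95 "$table")
echo "$selected"
```

In this toy table, binB is more complete than binA, but its higher contamination is penalized five-fold, so binA is kept for cluster c1.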
- Rust: Follow the instructions here to install Rust.
- Conda: You can install Conda via Miniconda or Anaconda.
- CheckM2: Install CheckM2, download the CheckM2 database, and set the CHECKM2DB environment variable correctly. CheckM2 must be installed and accessible in your PATH regardless of which option you use to install MAGmax.
Option 1: Use conda package
conda install -c bioconda magmax
or
mamba install -c bioconda magmax # faster installation
Option 2: Use the pre-built executable.
# For x86_64 Linux (glibc-based systems)
wget https://github.com/soedinglab/MAGmax/releases/download/v1.0.0/magmax-linux.tar.gz
tar -xzf magmax-linux.tar.gz # extract the downloaded archive
cd magmax-linux/bin
chmod +x magmax
./magmax -h
sudo cp magmax /usr/local/bin/ # to access globally
To use this option, skani, SPAdes, seqtk, and optionally MEGAHIT must be installed and available in your PATH, in addition to CheckM2. Alternatively, use environment.yml to create a conda environment and activate it before running magmax.
conda env create -f environment.yml
conda activate magmax_env
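Before running the pre-built executable, it can help to confirm that the required tools are reachable. The sketch below is one way to do that; the helper missing_tools is hypothetical, and the executable names (checkm2, skani, spades.py, seqtk) are the usual ones for these tools but are assumptions about your installation.

```shell
#!/bin/sh
# Sketch only: report which required executables are missing from PATH.
# missing_tools is a hypothetical helper, not part of MAGmax.
missing_tools() {
    missing_list=""
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || missing_list="$missing_list $tool"
    done
    # Strip the leading space before printing.
    echo "${missing_list# }"
}

# Executable names assumed here: checkm2, skani, spades.py, seqtk.
missing=$(missing_tools checkm2 skani spades.py seqtk)
if [ -n "$missing" ]; then
    echo "Missing from PATH: $missing"
else
    echo "All required tools found"
fi
```

Add megahit to the list if you plan to use --assembler megahit.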
Option 3: Build from source
git clone https://github.com/soedinglab/MAGmax.git
cd MAGmax
conda env create -f environment.yml
conda activate magmax_env
cargo install --path .
magmax -h
-b, --bindir <BINDIR>
Directory containing fasta files of bins
-r, --readdir <READDIR>
Directory containing read files
-m, --mapdir <MAPDIR>
Directory containing mapids files
-i, --ani <ANI>
ANI for clustering bins (%) [default: 99]
-c, --completeness <COMPLETENESS_CUTOFF>
Minimum completeness of bins (%) [default: 50]
-p, --purity <PURITY_CUTOFF>
Minimum purity (1- contamination) of bins (%) [default: 95]
-f, --format <FORMAT>
Bin file extension [default: fasta]
-t, --threads <THREADS>
Number of threads to use [default: 8]
--no-reassembly
Perform dereplication without bin merging and reassembly
--split
Split clusters into sample-wise bins before processing
-q, --qual <QUAL>
Quality file produced by CheckM2 (quality_report.tsv)
--assembler <ASSEMBLER>
Assembler choice for reassembly step (spades|megahit), spades is recommended [default: spades]
-h, --help
Print help
-V, --version
Print version
This example test run demonstrates dereplication of bins using the provided toy dataset. The test/bins directory contains example bins generated with MetaBAT2, the test/reads directory contains paired-end read files for two samples, and the test/mapids directory contains mapid files mapping reads to contigs for each sample. Precomputed CheckM2 quality scores for the input bins are given in test/quality_report.tsv. Run the following command to execute the test:
magmax -b test/bins -r test/reads -m test/mapids -t 24 -q test/quality_report.tsv
To run without reassembly:
magmax -b test/bins --no-reassembly -t 24 -q test/quality_report.tsv # run dereplication without reassembly
After running MAGmax, an output folder named mags_50comp_95purity will be created in the test directory. This folder contains the following files:
- bins_checkm2_qualities.tsv: Table summarizing the quality metrics of the dereplicated bins.
- sample_ERR3405607_metabat2_results.63.fasta: Final bin obtained after dereplication of the input bins.
- Input contigs must have IDs prefixed with the sample ID, separated by a C. This is a common practice for both single- and multi-sample binning. Ensure mapping and binning are performed on contig files with these updated contig IDs.
- Ensure that headers in the FASTQ files have the read ID separated from sequencer details by a space or tab, not by a dot. This is important for seqtk, which MAGmax uses to fetch reads.
  Correct format: @SRR25448374.1 A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1
  Wrong format: @SRR25448374.1.A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1
  To fix, use the bash command below:
    sed -i -E 's/^(@[^.]+\.[^.]+)\./\1 /' read.fastq
- Mapid files can be created using aligner2counts (https://github.com/soedinglab/binning_benchmarking/tree/main/util#aligner2counts) with the only-mapids option. An example file format is given below.
  File name: <sampleid>_mapids
    read1_id <sampleid>Ccontig1_id
    read2_id <sampleid>Ccontig2_id
    read2_id <sampleid>Ccontig4_id
    read3_id <sampleid>Ccontig2_id
    read4_id <sampleid>Ccontig3_id
    read4_id <sampleid>Ccontig4_id
- FASTQ and MAPID filenames must contain the sample ID (e.g., SRR25448374.fastq, SRR25448374_mapids). This is the default unless files are renamed manually.
- If input bins are not separated by sample IDs (e.g., when using MetaBAT2 or COMEBin on concatenated contigs), use the --split option to let MAGmax automatically separate bins by sample ID.
- We recommend SPAdes for reassembly, as it produces bins with higher purity than bins assembled using MEGAHIT.
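The contig-ID and FASTQ-header conventions in the notes above can be sketched as follows. Both helpers (prefix_contigs, fix_header) and the sample ID S1 are hypothetical. The header fix is a stricter variant of the sed command in the notes: it only replaces a dot that directly follows the read number, so that headers already separated by a space are left unchanged.

```shell
#!/bin/sh
# Sketch only: helper names and sample data are hypothetical, not part of MAGmax.

# Prefix every FASTA header with "<sample>C", e.g. >contig1 becomes >S1Ccontig1.
# Assumes the sample ID contains no characters special to sed (such as / or &).
prefix_contigs() {
    sed "s/^>/>${1}C/" "$2"
}

# Replace the dot after the read number with a space. The fields before that
# dot may not contain whitespace, so already-correct headers do not match.
fix_header() {
    sed -E 's/^(@[^.[:space:]]+\.[^.[:space:]]+)\./\1 /'
}

workdir=$(mktemp -d)
printf '%s\n' '>contig1' 'ACGT' > "$workdir/contigs.fasta"
renamed=$(prefix_contigs S1 "$workdir/contigs.fasta")

wrong='@SRR25448374.1.A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1'
right='@SRR25448374.1 A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1'
fixed=$(printf '%s\n' "$wrong" | fix_header)
unchanged=$(printf '%s\n' "$right" | fix_header)

printf '%s\n' "$renamed" "$fixed" "$unchanged"
```

Like the original command, this pattern is applied to every line of the input, so it is worth spot-checking the result on a few records before mapping.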