-
Notifications
You must be signed in to change notification settings - Fork 1
Home
This repository contains a series of programs for the post-processing of SPRITE data. The SPRITE workflow results in a BAM file in which the names of each record have had a corresponding barcode (consisting of tags) appended. As an example, suppose a SAM record has the following name:
HISEQ:623:HY5KHBCXX:2:2206:7231:7108::[DPM6A5][NYBot35_Stg][NOT_FOUND][Even2Bo19][Odd2Bo69]
The read's original name is HISEQ:623:HY5KHBCXX:2:2206:7231:7108
, and four tags were found: DPM6A5
, NYBot35_Stg
, Even2Bo19
, Odd2Bo69
. Note that this read was expected to have five tags, but the tag in the third position was not successfully identified. Unidentified tags are represented with [NOT_FOUND]
.
Simply clone the repository and run the scripts from the repo directory. A more official packaging and distribution method may be developed in the future, but it's very low priority. The NumPy and Pysam modules are required, as is the Hi-Corrector software here: http://zhoulab.usc.edu/Hi-Corrector/. The code currently works with NumPy v1.7.1 and Pysam v0.9.0. The code is not compatible with Python 3.
If the input BAM file has reads with unidentified tags, the pipeline will not handle it correctly. Remove these reads from your BAM file with filter_all_tags.py
.
python filter_all_tags.py --input input.bam --output output.bam
We assume that reads with the same barcode arose from nuclear components that had been crosslinked together due to physical proximity. The next step of the pipeline identifies reads belonging to the same crosslinked complex, or "cluster", and groups them together. Each cluster occupies one line of the resulting text file:
DPM6A5.NYBot35_Stg.Odd2Bo71.Even2Bo19.Odd2Bo6 chr1:18884355 chr1:18834000 chr2: 200041887
The above line represents a cluster of size three. Each column (aside from the barcode column) contains the alignment start-coordinate of a read that has the corresponding barcode.
python get_clusters.py --input input.bam --output output.clusters --num_tags N
where N
is the number of tags in a barcode.
The next step of the pipeline transforms a clusters file into a contacts file. A contacts file is a text file containing a simple square matrix of values representing the contact strength or contact frequency between any two points on the genome. This matrix is similar to an adjacency matrix, and can be easily plotted in MATLAB or with R. The basic steps are as follows:
1 Divide the genome or chromosome-of-interest into N bins of a given resolution. Finer resolutions will take longer to run and require much more memory. Initialize a contact matrix with dimensions N-by-N.
2 For each cluster in the input clusters file, generate all pairs of reads and record the implied contacts by incrementing the corresponding matrix cells. A value of one-per-contact will be added to the cells if no downweighting is applied; otherwise the value will be some value less than one that depends on the downweighting strategy.
3 Generate a vector of N bias factors corresponding to each row in the matrix. (The bias factor for a given row will be identical to the bias factor for a given column since the matrix is symmetric.) For each cell in the matrix, divide by (bias_factor(row_num) * bias_factor(col_num)). These bias factors are generated with Hi-Corrector. See Wenyuan Li, Ke Gong, Qingjiao Li, Frank Alber, Xianghong Jasmine Zhou; Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data. Bioinformatics 2015; 31 (6): 960-962. doi: 10.1093/bioinformatics/btu747 .
4 Transform all values in the matrix to a value between zero and one. Currently this is done by calculating the median value of all cells one-off from the diagonal and dividing by it. Any values which were originally greater than this median value (and so would be greater than one after division) are simply set to one.
Simply run get_contacts.py
with the following flags:
-
--clusters
: The input clusters file from the previous step. -
--raw_contacts
: The downweighted output (from step 2). -
--biases
: Hi-Corrector outputs a text file of bias factors. Save them to this location. -
--iced
: The ICEd output (from step 3). -
--output
: The final, transformed output (from step 4). -
--assembly
: Either "mm9" or "hg19". Used to initialize the contact matrix with the appropriate size. -
--chromosome
: For intrachromosomal heatmaps, one of "chr1", "chr2", ..., "chrX". For interchromosomal heatmaps, "genome". -
--min_cluster_size
and--max_cluster_size
: Ignore clusters that fall outside these parameters, i.e, clusters that are too big or too small. Default: 2 to 1000. -
--resolution
: The binning resolution in nt. Default: 1000000 (1 Mb). -
--downweighting
: The downweighting strategy, one of "none", "n_minus_one", and "two_over_n".- none: Each contact contributes a value of 1 to the contact matrix in step 2. For example, a contact from a 5-cluster will contribute 1.
- n_minus_one: Each contact contributes a value of 1/(N - 1), where N is the size of the corresponding cluster. For example, a contact from a 5-cluster will contribute 1/4.
- n_over_two: Each contact contributes a value of 2/N, where N is the size of the corresponding cluster. For example, a contact from a 5-cluster will contribute 2/5.
-
--hicorrector
: The path to the Hi-Corrector ic executable. This defaults to the correct location on the Guttman Lab workstation. -
--iterations
: The number of iterations to perform when running Hi-Corrector. Default: 100.