Skip to content

UOSbioinformaticslab/depmine

Repository files navigation

DepMine – v.GPEFS_2023Q2_6 – USER MANUAL

13/4/2024

INSTALLATION

Dependencies

DepMine is a Python 3.x program running on Python versions 3.8 and above. It requires the following Python libraries to be available :

pandas, numpy, matplotlib, collections, scipy, timeit, multiprocessing, re, requests, math, seaborn, random, glob

Modules not already included in your local Python should be installed with pip or conda.

Parallel Processing

DepMine makes extensive use of parallel processing on multiple CPUs, and the speed of time-sensitive operations will be greatly improved by running the program on a multi-CPU workstation.

Data Files

DepMine uses a number of datafiles which are either supplied with the code or can be downloaded from a range of different sources :

File Data Web Page Version
CRISPRGeneDependency.csv gene dependency https://depmap.org/portal/download/all/ Public 23Q2 
Model.csv cell line information https://depmap.org/portal/download/all/ Public 23Q2 
OmicsSomaticMutations.csv mutations https://depmap.org/portal/download/all/ Public 23Q2 
OmicsCNGene.csv copy number https://depmap.org/portal/download/all/ Public 23Q2 
OmicsFusionFiltered.csv gene fusions https://depmap.org/portal/download/all/ Public 23Q2 
OmicsExpressionProteinCodingGenesTPMLogp1.csv expression https://depmap.org/portal/download/all/ Public 23Q2 
Model.csv cell lines https://depmap.org/portal/download/all/ Public 23Q2 
c1.all.v2023.1.Hs.json.txt chromosomes https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.1.Hs/ 2023.1.Hs
hgnc_complete_set.json HUGO gene names https://www.genenames.org/download/archive/ Current
c2.cp.reactome.v2023.2.Hs.symbols.gmt pathways https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/ 2023.2.Hs
Homo_sapiens.GRCh38.110.refseq.tsv ENSEMBLE reference sequences https://www.ensembl.org/Homo_sapiens/Info/Index GRCh38
ChromosomeLengths.csv Chromosome lengths local
ExpressionThresholds.csv Expression thresholds local
Assignment_Solution_Activities.tsv Mutation signature assignments local
Final_GOF_all_table Annotated GoFs local
AB_Dep_Profiles cancer profiles local

On first run, DepMine will compile two further local files : AB_GeneDatabaseExpr2023Q2 and AB_CellDatabaseExpr2023Q2 which should be kept in the same directory as all the other data files and the program file.

**
RUNNING**

DepMine can be run from a Linux/UNIX shell command line (after making the distributed program file executable) :

chmod a+x DepMine.py

./DepMine.py

DepMine will print the number of CPUs it can detect and then give the option of the user specifying the individual data files to be used. Unless you have very good reasons not to, accept the default of ‘N’. for this, and DepMine will load the default set of data files. If this is the first run, two local databases will be generated – which takes a few minutes - and some commentary will be given on the progress of this.

As a general rule, when DepMine offers choices to the user, they will be given in a bracketed list, and the default choice – i.e. what will be used if you just tap return or type something that is not offered – will be given in square braces. For example :

'Aggregate by genes, residues, or not aggregate ? (G,R,[N]) : ')

Again as a general rule, DepMine only needs the first character of options to be specified and won’t care about upper or lower case.

Once the data files are loaded – usually within 60 seconds – DepMine will prompt for a command.

NOMENCLATURE

As a general principal, genes that are candidates for therapeutic intervention on the basis of their dependency in different cell lines, are referred to as GeneAs, while those that are altered in their sequence, copy or expression level in cancer cell lines are referred to as GeneBs. Synthetic sickness/lethality relationships are calculated between GeneAs and GeneBs. GeneAs are also referred to as targets.**
COMMANDS**

The available user commands are :

Commands
CHRO Show chromosomal location of a list of genes
CYTO Show chromosomal location of the CNV amplifications and deletions in a cell line
DEFI Define a cancer profile
EXPR Analyse expression profile of a gene
FIND Find genes whose disruptive mutation and/or deletion and/or low expression switches the dependency of a target gene
GOF Find genes whose dependency is switched by the presence of one or more documented gain-of-function mutations
LIST Apply FIND to a list of genes
LOCU Scan genes in a cytogenetic locus to find those where mutational disruption switches the dependency of a target gene
MINE Find target genes whose dependency is switched by a defined cancer profile, and optionally refine the cancer profile used
PARA Set various parameters and thresholds
PROF Read/write/organise cancer profiles and cancer profile collections
QUIT Does what it says
SIG Analyse the mutational signatures in cell lines matching a cancer profile
SWITCH Find gene whose dependency is switched in cell lines matching a cancer profile
TYPE Analyses the tissue type distribution of cell lines matching a cancer profile

The individual commands are described in detail below.

CHROChromosomal Gene Location

You will be prompted for a number of additional files - just press return – and then for a filename. This file will typically be generated by the FIND command – see below – but minimally needs to be a CSV file with a column called ‘GeneB’. This command needs to interrogate ENSEMBLE – so you need a good internet connection to utilise it. The command generates a graphic of human chromosomes, with the location of the GeneB genes indicated.

CYTOChromosomal CNV Location

You will be prompted for a number of additional files - just press return – and then for a the name of a cell line. This can be a DepMine identifier e.g. ACH-002968 or a common cell line name e.g. RPE1 or U2OS. If the name given is not unique a list of the matching cell lines is given so a unique name can be given.

The command generates a graphic of human chromosomes with amplified genes indicated in blue and deleted genes in red.

DEFICancer Profile Definition

You will be prompted for a genetic profile definition*,* which can be just the name of an existing cancer profile or an algebraic statement utilising one or more primitive postulates constructed from the following operators and combined with set operations : union |, intersection & and exclusion ^, and ( ).

Operator Operand Resultant set Example
+ gene cells where gene is amplified 1 +myc
+ cytogenetic locus cells where >=1 gene in locus is amplified 1 +17q12.4
gene cells where gene is deleted 1 −tp53
cytogenetic locus cells where >=1 gene in locus is deleted 1 −9p21
= gene cells where gene is diploid 1 =snca
~ gene Cells where gene is involved in a gene fusion ~alk
> gene cells where gene is over-expressed 2 >erbb2
< gene cells where gene is under-expressed 2 <tert
! gene cells where gene has a reading-frame-disruptive loss-of-function mutation !brca2
/ gene:mut

cells where gene has defined missense mutation

‘?’ matches any change

/braf:v600e

/kras:g12?

/ gene:mut,mut[,mut] cells where gene has one of several defined missense mutations /egfr:l834r,t766m
/ gene+ Cells where gene has an annotated gain-of-function mutation /kras+
/ gene Cells where gene has an annotated loss-of-function mutation /atm−
$ gene cells where gene has no non-synonymous mutations $rb1
{ Cosmic signature3 Cells displaying a defined Cosmic mutational signature {sbs7a
{ Cosmic signature aetiology Cells displaying one or more Cosmic mutational signature attributed to a defined aetiology {tobacco
# all cells (Universal Set)
% tissue cells derived from tissue %breast
*

sex

(Male,Female,Unknown)

cells with sex of patient *fem
;

age

(Adult, Paediatric,Unknown)

cells with age of patient ;p
@ Filename externally sourced CSV file of cell ACH codes @CHAN-MSI
  1. Definitions of amplification, deletion and diploid status are based on Mina et al 2020, adapted to the log2(CN ratio + 1) formalism used by DepMap/CCLE.

  2. Over- and under-expression are defined per gene, based on prior analysis of DepMap expression level values across the full population of cell lines for which data is available (see below).

  3. As defined in (Diaz-Gay et al, 2023).

A profile name can be specified directly e.g. PROFILE_NAME = <profile definition>, however if no profile name is provided one will be asked for before the definition can be saved.

In response to the definition, DepMine will report the number of cell lines that are matched by the cancer profile, and offer to save/list the matching cell lines. Finally, you will be prompted to save the profile by name. If the name is already in use, the choice of overwriting the previous definition will be given. If a profile is overwritten, any other profiles that reference it will still work, but their behaviour will reflect the new definition of the overwritten profile. If in doubt, do not overwrite existing profiles.

Defined profiles will only be saved to disc, by explicitly saving them in the PROF module.

EXPRExpression Profiles

You will be prompted for a gene name and histogram of the distribution of expression levels for that gene across the entire cell line collection is generated. Additionally, scatter plots of copy number versus expression level and gene dependency versus expression level, are generated.

FINDFind SSL Partners for a Specified Target Gene

You will be prompted for a target GeneA name, and then whether mutation and/or deletion and/or under-expression should be used to specify the ‘defect’ in GeneBs. This can be specified as 1, 2 or 3 of M, D and U, in any order.

The command will then run a full genome calculation which can take up to an hour, depending on whether 1, 2 or 3 defect types are being used and the number of processors in your computer. Once completed, the sorted list of ‘hits’ that pass the significance threshold is presented and for each gene in order, you will get the option to plot comparative dependency distributions for the cells where that gene is defective and a control set of cells in which it isn’t.

GOFFind SSL Targets for Gain-of-Function Mutations

You will be asked whether you want to use the GoF mutations annotated in the DepMap data, or wish to read in an external GoF dataset. Usually the DepMap annotations will be sufficient.

You will then be asked if you wish to :

  • aggregate GoF mutations at the gene level – i.e. KRAS-G12C, KRAS-G13C, KRAS-Q61R will count as equivalent;

  • aggregate GoF mutations at the residue level – i.e. KRAS-G12C, KRAS-G12V, KRAS-G12A will count as equivalent, but KRAS-G12? and KRAS-G13? will be considered separately.

  • not aggregate – so all mutations are counted individually.

The command will then run a full genome calculation for each mutation or aggregate looking for target GeneAs that show high dependency in cells harbouring those mutations relative to cells in which these mutations do not occur. For statistical power, only those mutations or aggregates which occur in > 5 cell lines are considered.

Depending on whether the command uses aggregated or non-aggregated mutations and on the number of processors in your computer this can take a couple of hours to complete. At the end of the run, a p-value sorted list of ‘hits’ is written to disk for subsequent analysis.

A subordinate command PGOF reads back the GOF generated hit-list and plots comparative dependency distributions for each target ‘hit’ in cells harbouring the GoF mutations, and a control set of cells that don’t.

LISTBatch SSL Finding

You will be prompted for a CSV file that minimally contains pairs of gene names (GeneA, GeneB) for which SSL will be calculated. An additional ancillary column can be specified to be carried over into the output, and the calculation can optionally be restricted to a specified tissue type.

The command will generate an output CSV file with p-value and Cohen d value for the comparison of dependency distributions for GeneA in cells in which GeneB is disruptively mutated, CNV deleted and under-expressed. The calculation is repeated with GeneA and GeneB swapped.

LOCUIdentifying Direct Targets in Locus Deletions

This command tries to identify individual GeneBs within a potentially deleted cytogenetic locus that show a direct SSL with a target GeneA.

The command will prompt for a target gene name and a cytogenetic locus, and will respond with the names of the genes at that locus. It will then calculate SSL between the target gene and each gene in the locus but only in cells where that gene is CNV diploid but has frame-disruptive or missense LoF mutations, and report the results as a waterfall plot based on Cohen d-value, marked up to indicate those genes where the SSL with the target also has a p-value better than the threshold parameter.

MINEOptimise Cancer Profiles

The command will prompt for a target gene name and a defined cancer profile, and will then plot comparative dependency distributions for the target gene in those cells that match the cancer profile and a control set of cells that don’t.

You will then be offered the option to mine the cells matching the cancer profile for factors additional to the original profile that display significant over/under-representation in the high/low dependency ends of that distribution. These are in order : sex (female ,male, unknown), age (paediatric, adult, unknown), tissue, mutational signatures, frame-disruptive / missense LoF mutations, GoF mutations, copy number deletion/amplification (singleton gene and locus), high/low-expression level (aggregated to pathways).

At each stage the distribution of identified factors can be plotted overlayed on the parental matching distribution. Once all secondary factors have been identified these can be reviewed and then amended cancer profiles embodying all possible combinations of secondary factors added to the original profile, are evaluated in terms of effect size and the number of cell lines matched. The Pareto optimal surface for these two parameters is plotted, and the refined profiles providing optimal solutions are optionally written to a profile collection, which can be loaded for subsequent use using PROF.

PARASpecify Parameters

Allows you to override default values for various parameters : dependency ‘high’ threshold,

minimal mutational frequency, maximum p-value for significance, minimum Cohen d-value for a strong effect.

PROFCancer Profile Management

In this and all modules that utilise cancer profiles, if no cancer profiles have been loaded previously you will be prompted to load a cancer profile collection from a list. DepMine will build this list from files in the launch directory of the type filename.profiles if any are present. The default profile collection provided with DepMine is called AB_Dep_Profiles and comes initialised with a single profile – ALL – which encompasses all the cell lines in the DepMap dataset for which any data is present.

Once a profile collection is loaded, the following subcommands are available :

Show Lists the top-level definition of a profile including references to other profiles
Delete Removes a profile from the loaded collection – if this is referenced by other profiles this will be require confirmation that all affected profiles will be removed. NB this only affects the loaded copy of the profiles
Name Allows the name of a profile to be changed. If this is referenced by other profiles the references will be automatically updated to the new profile name
Flatten Lists the definition of a profile with all references to other profiles flattened to primitive set postulate functions
Read Reads in profiles from a file – optionally these are added to the profiles already loaded, or replaces them. NB any changes to the previously loaded profiles will not be automatically saved prior to replacement by the new set
Write Writes the current loaded profiles to a file.
Quit Quits this module

SIGMutational Signature Distribution

You will be prompted for a named genetic profile and a histogram of the distribution of mutational signatures within the cell lines matching that profile will be generated.

SWITCHIdentify Target Genes for Cancer Profiles

You will be prompted for a named cancer profile and the command will then perform a full genome calculation for looking for target GeneAs that show high dependency in cells matching the cancer profile relative to cells that do not match the profile. Depending on the cancer profile and the number of processors in your computer the analysis will take 10-20 minutes. On completion a waterfall plot of GeneA ‘hits’ based on p-value is generated.

You will then be offered the opportunity to plot comparative dependency distributions for the target genes in those cells that match the cancer profile and a control set of cells that don’t.

The list of GeneAs that pass the requirement for a strong effect with statistical significance can be saved as a CSV file.

TYPETissue Type Distribution

You will be prompted for a named genetic profile and histograms of the absolute and relative tissue distribution of the matching cell lines will be generated

APPENDIX – Generating Local Files

ExpressionThresholds.csv can be regenerated from OmicsExpressionProteinCodingGenesTPMLogp1.csv with the local program expr_scorer2.py

Assignment_Solution_Activities.tsv can be regenerated from cell line vcf files by https://cancer.sanger.ac.uk/signatures/assignment/

The directory of vcf files required can be generated from OmicsSomaticMutations.csv with the local program MakeVCF.py

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages