CHiCANE is a toolkit for identifying chromatin regions that interact more often than expected by chance given the specific interaction dynamics of a capture Hi-C (CHi-C) experiment. It offers extensive flexibility in implementing interaction calling via numerous optional parameters so that it can be tailored to a wide range of CHi-C experiments. Additionally, CHiCANE includes multiple helper functions for use in both pre- and post-processing/analysis of the interaction calls.
For details of statistical concepts implemented by CHiCANE and their application to both the region CHi-C and promoter CHi-C datasets, please see our manuscript: Holgerson, Erle, Gillespie, Andrea, et al. “Identifying High Confidence Capture Hi-C Interactions Using CHiCANE” (Nature Protocols 2021).
The main function for running the CHiCANE method is
chicane()
. This takes a bam file and bed files with all
restriction fragments and targeted restriction fragments as input, and
produces a table with fragment interactions and associated p-values.
The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.
library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#>
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#>
#> sleep
#> Loading required package: nlme
#> Loading required package: parallel
#> ********** GAMLSS Version 5.4-22 **********
#> For more on GAMLSS look at https://www.gamlss.com/
#> Type gamlssNews() to see new features/changes/bug fixes.
#> Loading required package: data.table
# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2
if( bedtools.installed() ) {
chicane.results <- chicane(
bam = bam,
baits = baits,
fragments = fragments
);
print( chicane.results[ 1:10 ] );
}
The chicane
function returns a data.table
object sorted by q-value. Each row contains information about the target
and bait fragments, and the observed and expected number of reads
linking the two fragments.
Before running the main chicane()
function, a minumum of
three input files must be obtained. The
chicane()
function, which is the main interaction calling
step, can be run on these three input files directly as it will work as
a wrapper function to prepare the data and the run interaction calling
in a single step. Alternatively, there is a prepare.data()
function which allows a user to run data preparation on these same three
input files to make an interactions data table, which can subsequently
be used as input for interaction calling separately in a step-wise
fashion (as well as a number of other useful functions as discussed
below). This section dicusses how to obtain each of the
three required inputs, followed by a
discussion of the optional preparation of an
interactions data table.
The first required input is at least one BAM file
containing mapped interaction reads. To convert the BAM file into a text
file, R calls the shell script prepare_bam.sh
. This script
relies on bedtools
to identify the restriction fragments corresponding to each read (based
on the second required input, the in silico
genome digest) and select reads with at least one end overlapping the
baitmap file (the third required input). The reads are also filtered to
remove those in which both reads map to the same fragment. The resulting
data is read into a data.table
object, and the number of reads linking any two fragments is
calculated.
The second required input is the in silico
digest file, which contains the coordinates of the fragments created by
the restriction enzyme (e.g. HindIII) used in the
experiment, in BED format. For users that have a digested genome output
from HiCUP, this needs to be converted to a strict 3 column, header-free
BED format to be compatible with bedtools. CHiCANE includes a helper
function, convert.hicup.digest.bed()
, which can be used to
convert a digested genome in HiCUP format to a BED file of the
restiction digest fragments. Any BED file containing the coordinates of
restriction enzyme digested fragments of the genome build used according
to the experiment are appropriate to be used with CHiCANE. For the
purposes of this vignette, the fragments
variable is
assigned to the HindIII fragments of GRCh38 chromosome 2 (sample data
included in the package) at the top of this document. Here is an example
of what this should look like showing the top six entries in the example
fragments file included in the package:
chr2 0 15298
chr2 15298 16049
chr2 16049 18132
chr2 18132 24992
chr2 24992 28070
chr2 28070 28179
The third and final input required is a baitmap in
BED format containing the coordinates of only the restriction digested
fragments which contain a bait from the capture step of the experiment.
This can be obtained by conducting a bedtools intersect
command on the fragment digest above and a BED file with the coordinates
of the individual probes included on the array used in the capture step.
The following is an example command to create the baitmap, then limit it
to only unique fragments with an overlap of at least 20bp:
# intersect restriction fragments file with coordinates for probes included on the array
bedtools intersect -wo -a /path/to/fragments.bed -b /path/to/capture_regions.bed \
> /path/to/fragments_overlap_full.bed
# limit to overlap of at least 20bp and unique fragments only
awk '$7>=20 {print $1 "\t" $2 "\t" $3}' /path/to/fragments_overlap_full.bed | uniq > /path/to/baitmap.bed
The final output of the above command (baitmap.bed)
is
then used as the baits
parameter in either the
chicane
or prepare.data
function calls. At the
top of this vignette we have set baits
to the sample
baitmap at the 2q35 locus which is included in the package. As an
example of how this file should look, this is how the BED file for the
sample baits at 2q35 would appear:
chr2 217035649 217042355
chr2 217430817 217437011
NOTE: The BED format used for the fragments and baits parameters is a header-free, tab-separated, 3 column (chr, start, end) format. Containing strictly 3 columns is necessary both as input to the CHiCANE function calls and for the snippet of code above which creates the baitmap to function properly.
To speed up model fitting, it is possible to pre-process the data
with the prepare.data()
function and run
chicane
directly on the resulting data table. Furthermore,
running in this step-wise manner is useful for those wanting to add
custom covariates not already included in the data table or test
variations of parameters and/or distibutions in the interaction calling.
To first compare biological replicates using the
compare.replicates()
function, prepare.data()
should be run on each replicate separately to assess concordance before
deciding to merge them. This processing can take a while (~45 minutes
for a 19GB BAM file on a commodity computer with 16GB RAM), and only
needs to be done once for each BAM.
if( bedtools.installed() ) {
interaction.data <- prepare.data(
bam = bam,
baits = baits,
fragments = fragments
);
}
The interaction data object contains details of the fragments, and the number of reads linking the two fragments.
The interactions
parameter of the chicane()
function can take either a data.table
object from
prepare.data
or the path to such a table.
CHiCANE models the expected number of paired reads containing two restriction fragments as a function of distance between them and how frequently the bait interacts with other fragments.
In the CHiCANE model, Yij represents the number of reads linking bait i and target fragment j, such that it is assumed to follow a distribution with a mean μij as:
where dij represents the distance between the two fragments, and ti represents the number of reads linking the bait with a trans fragment.
For bait to bait interactions, terms are added to the model to adjust for trans counts of the other end, fragment j, and the product of trans counts of both fragments as:
Each possible interaction is assaigned an expected μij and a p-value for the observed counts yij versus what is expected by chance is calculated as:
This probability is dependent on how the distribution is modelled.
The CHiCANE method supports several different distributions of the
counts Yij
conditional on the mean μij. By
default the counts are assumed to follow a negative binomial
distribution with E(Yij) = exp(μij).
After fitting the model by a maximum likelihood over all possible
interactions, the fitted model is used to estimate a p-value for each pair. Poisson,
truncated Poisson, or truncated negative binomial distributions are also
supported and can override the default negative binomial by being passed
as the distribution
parameter.
Negative binomial is the default distribution in CHiCANE as it is deemed more appropriate for modelling datasets with high variance in comparison to Poisson due to its inclusion of a term for overdispersion.
Using this default distribution, the counts are modelled as Y ∼ NB(μ, θ), with mean μ and dispersion term θ. The variance of the counts conditional on the mean is then
This model implements the glm.nb()
function in the MASS
package.
If the model fails to converge due to a lack of overdispersion in the
data chicane()
will attempt to diagnose the error and if
this is found to be the case a Poisson distribution will be implemented
instead.
The Poisson distribution assumes that Y ∼ Poisson(μ), such that Var(Y) = μ. As the Poisson model does not allow variance of counts to exceed the mean, capture Hi-C counts usually possesses more variability than can be explained using this distribution, the result of which is false positives.
By default CHiCANE only includes observed interactions, rather than
all possible interactions, i.e. leaving out zero
counts (see Inclusion of Zeros). One way to
account for this is in the statistical model, by fitting a truncated
negative binomial distribution in the chicane()
call which
implements the GAMLSS R package
(Stasinopoulos and Rigby 2016). It assigns
P(Y = 0) = 0, and
assumes proportional probabilities to the negative binomial model for
values y > 0.
The truncated Poisson distribution is also supported, which behaves much like the truncated negative binomial model, but rather assumes the probabilities are proportional to Poisson probabilities for values y > 0.
It is worth noting that specifying either of the truncated distributions will significantly slow down the model fitting.
CHiCANE allows users to specify additional covariates that can be
added to the model with the adjustment.terms
parameter. For
example, we can add a term to adjust for the chromosome of the target
fragment as shown below.
data(bre80);
adjusted.results <- chicane(
interactions = bre80,
adjustment.terms = 'target.chr'
);
print( adjusted.results[ 1:5 ] );
#> target.id bait.id target.chr target.start
#> <char> <char> <char> <int>
#> 1: chr19:1155128-1228414 chr2:217035649-217042355 chr19 1155128
#> 2: chr1:117125276-117135137 chr2:217035649-217042355 chr1 117125276
#> 3: chr1:14950583-14951791 chr2:217035649-217042355 chr1 14950583
#> 4: chr1:49076960-49080684 chr2:217035649-217042355 chr1 49076960
#> 5: chr3:145680459-145682483 chr2:217035649-217042355 chr3 145680459
#> target.end bait.chr bait.start bait.end bait.to.bait bait.trans.count
#> <int> <char> <int> <int> <lgcl> <int>
#> 1: 1228414 chr2 217035649 217042355 FALSE 9494
#> 2: 117135137 chr2 217035649 217042355 FALSE 9494
#> 3: 14951791 chr2 217035649 217042355 FALSE 9494
#> 4: 49080684 chr2 217035649 217042355 FALSE 9494
#> 5: 145682483 chr2 217035649 217042355 FALSE 9494
#> target.trans.count distance count expected p.value q.value
#> <int> <num> <int> <num> <num> <num>
#> 1: 2 NA 2 1.003963 0.2656989 0.6417435
#> 2: 2 NA 2 1.003992 0.2657096 0.6417435
#> 3: 2 NA 2 1.003992 0.2657096 0.6417435
#> 4: 2 NA 2 1.003992 0.2657096 0.6417435
#> 5: 2 NA 2 1.004014 0.2657179 0.6417435
The adjustment.terms
parameter also supports expressions
such as log transformations. Multiple adjustment terms can be added by
passing as a vector.
Any adjustment term must correspond to a column already present in the data. If a user would like to adjust for anything that is not already present in the CHiCANE output (e.g. GC content), there’s a three step approach:
prepare.data()
on your BAM file(s)chicane()
by setting the interactions
parameter to the output from step 2By default, all baits and targets are included when fitting the
CHiCANE model. An alternative approach is to filter out baits and
fragments with low (Dryden et al. 2014) or
high (Cairns et al. 2016) degree of
“interactibility”, which is inferred via trans
counts for each bait. CHiCANE also supports this filtering through the
bait.filters
and target.filters
parameters.
Each of these take a vector of length two, where each element
corresponds to the proportion of fragments that should be considered to
fall below the “lower limit” and “upper limit.” For example, passing in
bait.filters = c(0.3, 0.9)
means that the baits with the
lowest 30% or highest 10% of trans
counts will be removed before fitting the model.
Filtering fragments may affect multiple testing correction by changing the number of tests performed.
The helper function compare.replicates()
can be used to
assess concordance between biological replicates before deciding whether
to merge them to improve signal to noise ratio. This is done by
conducting the prepare.data()
step on each BAM separately
before passing to the compare.replicates()
function.
CHiCANE can merge replicates at the data processing step. If more
than one BAM file is available, these can be passed as a vector to the
bam
parameter of the chicane()
and
prepare.data()
functions. The
replicate.merging.method
parameter determines how
replicates are merged. Available options are ‘sum’ and ‘weighted-sum’,
with the former being the default.
In order to assess goodness of fit for the model implemented, CHiCANE
outputs rootogram model fit plots and statistics to the directory
provided through the interim.data.directory
parameter in
the chicane()
function. The default NULL
will
skip these so it is important for a user to set this directory in order
assess the model fit.
Adjusting for distance is done in a piecewise linear fashion. cis data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.
By default, CHiCANE will split the data into distance quantiles,
i.e. 100 equally
sized groups. However, if the resulting datasets are too small, the
number of distance bins is halved until all resulting datasets are large
enough for the model to be fit. To override this default, pass the
desired number of distance bins to distance.bins
. For
example, setting distance.bins = 4
results in the count
model being fit separately in each quartile. trans
and cis
interactions are fit separately.
Finally, CHiCANE applies the Benjamini-Hochberg method of multiple
testing correction separately for each bait. To change this to a global
multiple testing correction, set
multiple.testing.correction = 'global'
in the main
chicane()
function.
The chicane()
function’s final output is a q-value
sorted data table of all interactions included in the model fitting.
This full interactions table is useful in some instances such as the locus plot and background interactions
functionality discussed below. However, its large size is rather
unwieldy for most downstream analysis, so it is useful to also filter
for only interactions below a desired q-value threshold and save this
more workable table as well.
In order to assist users with interpretation of interaction calls, CHiCANE also includes several helper functions for this purpose designed to be applicable to a myriad of experimental designs as delineated below and in more detail in our aforementioned manuscript.
The first step in interpretation of the interaction calls is likely
to be annotation of the target fragments. Selecting just columns 3-5
from the CHiCANE interactions data table and saving the resulting
coordinates as a BED file will then allow a
bedtools intersect
to be done between the target fragments
and a genome annotation file also in BED format. The resulting output
will indicate the genes that any target fragments overlap with.
Similarly, an intersect could be performed with VCF files containing
variant calls or BED files containing ChIP-Seq peak calls for histone
marks or CTCF binding sites. There are many disparate datasets which use
a file type compatible with bedtools intersect
and numerous
options for the command to customise the output.
CHiCANE also contains a function to convert its results into a format
compatible with the WashU Epigenome Browser (Li
et al. 2019), convert.standard.format()
. The path to
significant interaction calling results is provided to the
chicane.results
parameter and the desired name for the
output file is passed to file.name
. Once standard format is
obtained it can be uploaded to the browser from the ‘Tracks’ dropdown,
selecting ‘Local Text tracks’. In the upload window select ‘long-range
format by CHiCANE’ as the text file type and then your standard format
file in ‘Browse’. Once the track is loaded in the browser, you can right
click on the track and set the Display Mode to ‘ARC’.
CHiCANE results are also usable with the Gviz (Hahne and Ivanek 2016) package to visualise
along with highly customisable genome anootation tracks. Additionally,
the create.locus.plot()
function in the CHiCANE package
will make a basic implementation of such a plot with 6 tracks:
The user needs to provide the genome build to the genome
parameter and coordinates of the locus desired to be plotted to the
chr
, start
, and end
parameters. A
genome annotation file in GTF format needs to be provided to
gene.data
. The genomic feature chosen to be included will
be passed as a BED format file to genomic.features
and the
desired name for this track is passed as feature.name
.
Finally, the path to the FULL interaction calls table is passed to
interaction.data
and the name for the output plot is set by
file.name
. By default only interactions below an FDR
threshold of 0.05 will be included in the interactions track; a user can
override this default by setting the `fdr.filter’ parameter. A brief
example keeping all other parameters at default is shown below:
if( bedtools.installed() ) {
create.locus.plot(
genome = 'hg38',
chr = 'chr2', start = 111100000, end = 112200000,
gene.data = '/path/to/gene_data.gtf',
genomic.features = '/path/to/baitmap.bed',
feature.name = 'baits',
interaction.data = '/path/to/interaction_calls.txt',
file.name = '/path/to/locus_plot.pdf'
);
}
In order to accurately estimate histone marker enrichment, the helper
function stratified.enrichment.sample()
will randomise
interactions which match the number and proportion of observed
interactions which serve as background for comparison, stratified by
distance bins: 0-100kbp, 100kbp-1Mbp, 1Mbp-10Mbp, 10Mbp+, and trans.
Full paths to data tables containing significant and insignificant calls
are passed to significant.results
and
nonsignificant.results
, respectively.
The code in this workflow is intended to be run on Baxter T-47D from the Zenodo directory of Supplementary Data for our manuscript at: https://zenodo.org/record/4073433
First pre-processing of input files needs to be completed as
discussed above (see Pre-Processing Data).
The BAM files referred to in this vignette have been created using
HiCUP, the digested fragments were processed with the
convert.hicup.digest.bed()
function, and the baitmap was
created via bedtools intersect
between the aforementioned
fragments and array bait probes used in the capture step. The fragments
used here (hg38, HindIII digested) are found in Zenodo directory under
data/
and the baitmap is found in
data/captured_fragments/T47D/
as used in the
prepare.data()
call below.
The interactions data table can then be prepared with these processed
inputs (aligned BAMs, fragments BED, and baitmap BED). The resulting
output can then be saved before being passed to the main
chicane()
function for interaction calling. Once completed
save both the full interaction calls results as well as calls filtered
for significance to be used for different purposes.
if( bedtools.installed() ) {
# initial prep of interactions data table
interaction.data <- prepare.data(
bam = list.files('data/bams/T47D', pattern = '.bam', full.names = TRUE),
baits = 'data/captured_fragments/T47D/Baxter_captured_fragments.bed',
fragments = 'data/GRCh38_HindIII_fragments.bed'
);
# save interactions data table
file.name <- tempfile('interaction_data');
write.table(interaction.data, file.name, row.names = FALSE);
# run interaction calling and save model fit plots and statistics
chicane.results <- chicane(
interactions = file.name,
interim.data.dir = 'graphs/T47D/model_fits'
);
# save full interaction calls table
write.table(
chicane.results,
file = 'data/chicane/T47D/interaction_calls.txt',
row.names = FALSE,
quote = FALSE,
sep = '\t'
);
# filter for significant only and save these results separately
significant.results <- chicane.results[q.value < 0.05]
write.table(
significant.results,
file = 'data/chicane/T47D/interaction_calls_significant.txt',
row.names = FALSE,
quote = FALSE,
sep = '\t'
);
}
After completion of interaction calling, the model fit plots and
statistics in the interim.data.dir
should be examined to
ensure goodness of fit (detailed discussion of this in manuscript).
Additionally, it is useful to examine the proportion of bait-to-bait,
cis and
trans
interactions in the significant calls and the number of interaction
calls by distance bins, which can be done with the following code:
if( bedtools.installed() ) {
# calculate proportions by interaction type
total <- nrow(significant.results);
trans.prop <- sum(is.na(significant.results$distance))/total;
cis.prop <- sum(!is.na(significant.results$distance))/total;
b2b.prop <- sum(significant.results$bait.to.bait)/total;
int.data <- c('trans' = trans.prop, 'cis' = cis.prop, 'bait.to.bait' = b2b.prop);
print(int.data);
# get number of interactions by distance bins
binned.data <- NULL;
distance.bins <- list(
"0-10kbp" = c(0, 1e4),
"10-100kbp" = c(1e4, 1e5),
"100kbp-1Mbp" = c(1e5, 1e6),
"1-10Mbp" = c(1e6, 1e7),
"10-100Mbp" = c(1e7, 1e8),
"100-1000Mbp" = c(1e8, 1e9)
);
for(dist.i in names(distance.bins)){
bin.sum <- length(which(significant.results$distance >= distance.bins[[dist.i]][1] & significant.results$distance < distance.bins[[dist.i]][2]));
binned.data <- cbind(binned.data, bin.sum);
}
colnames(binned.data) <- names(distance.bins);
print(binned.data);
}
Once the interaction calls are determined to be reasonable, it is useful to examine the significant interactions in context. Uploading the files to the WashU Epigenome Browser is a quick and user-friendly way to do so. Use the CHiCANE helper function to easily convert to the correct format for the browser.
if( bedtools.installed() ) {
# make a browser compatible file from significant interactions
browser.file <- tempfile('significant_calls_standard.txt')
convert.standard.format(
chicane.results = 'data/chicane/T47D/interaction_calls_significant.txt',
file.name = browser.file
);
}
Viewing in the browser will help to elucidate interaction peaks for loci of interest, then a user can make a plot of interactions at a particular locus using another of CHiCANE’s helper functions. In the following example we plot interactions at the 2q35 locus.
if( bedtools.installed() ) {
# set genome annotation file
gene.ann <- system.file('extdata', 'gencode_2q35.gtf', package = 'chicane');
# generate 2q35 locus plot
create.locus.plot(
genome = 'hg38',
chr = 'chr2',
start = 216600000,
end = 217140000,
gene.data = gene.ann,
genomic.features = 'data/captured_fragments/T47D/Baxter_captured_fragments.bed',
feature.name = 'baits',
interaction.data = 'data/chicane/T47D/interaction_calls.txt',
file.name = 'graphs/T47D/2q35.pdf'
);
}
sessionInfo();
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel splines stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] MASS_7.3-61 chicane_0.1.8 data.table_1.16.2 gamlss.tr_5.1-9
#> [5] gamlss_5.4-22 nlme_3.1-166 gamlss.data_6.0-6 gamlss.dist_6.1-1
#> [9] rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.8-0 jsonlite_1.8.9 futile.logger_1.4.3
#> [4] compiler_4.4.2 brio_1.1.5 jquerylib_0.1.4
#> [7] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
#> [10] R6_2.5.1 knitr_1.49 iterators_1.0.14
#> [13] maketools_1.3.1 bslib_0.8.0 R.utils_2.12.3
#> [16] rlang_1.1.4 testthat_3.2.1.1 cachem_1.1.0
#> [19] bedr_1.0.7 xfun_0.49 sass_0.4.9
#> [22] sys_3.4.3 VennDiagram_1.7.3 cli_3.6.3
#> [25] magrittr_2.0.3 formatR_1.14 futile.options_1.0.1
#> [28] foreach_1.5.2 digest_0.6.37 grid_4.4.2
#> [31] lifecycle_1.0.4 R.oo_1.27.0 R.methodsS3_1.8.2
#> [34] evaluate_1.0.1 lambda.r_1.2.4 codetools_0.2-20
#> [37] buildtools_1.0.0 survival_3.7-0 tools_4.4.2
#> [40] htmltools_0.5.8.1
Development of CHiCANE was supported by Breast Cancer Now.