vignettes/shaman-package.Rmd
shaman-package.Rmd
Shaman package impelements functions for resampling Hi-C matrices in order to generate expected contact distributions given constraints on marginal coverage and contact-distance probability distributions. The package also provides support for visualizing normalized matrices and statistical analysis of contact distributions around selected landmarks. It is an R package embedding algorithms implemened in C++, which is built over support from the Tanay's lab Misha genomic database system that is provided with the package.
The normalization workflow consists of the following steps:
In the example misha database provided in this package we have created a low-footprint matrix to examplify the shaman workflow. We included 4.6 million contacts from ELA K562 dataset covering the hoxd locus (chr2:175e06-178e06) and convergent CTCF regions. Processing the complete matrix from this study requires downloading the full contact list and regenerating the reshuffled matrix.
Once an observed HiC dataset is available in the misha db (refere to Import) for additional information), there are two possible modes to continue analysis: 1. Distributed computation - relying on sun grid engine (sge), or using a single machine with multi-cores. 2. Local computation - for computing normalized matrices in small genomic regions
options(shaman.mc_support=1)
shaman_shuffle_hic_track(track_db=shaman_get_test_track_db(), obs_track_nm="hic_obs", work_dir=tempdir())
shaman_score_hic_track(track_db=shaman_get_test_track_db(), work_dir=tempdir(), score_track_nm="hic_score_new", obs_track_nms="hic_obs")
gsetroot(shaman_get_test_track_db())
point_score = gextract("hic_score", gintervals.2d(2, 175e06, 178e06, 2, 175e06, 178e06), colnames="score")
shaman_plot_map_score_with_annotations("hg19", point_score, gintervals(2, 175e06, 178e06), misha_tracks=list("K562.k27ac", "rna-seq"), annotations=list("ctcf_forward", "ctcf_reverse"), a_colors=c("#4572A7", "#AA4643"))
point_score = shaman_shuffle_and_score_hic_mat(obs_track_nms="obs", interval=interval, work_dir="work_dir")
shaman_plot_map_score_with_annotations("hg19", point_score$points, region, misha_tracks=list("K56.k27ac", "rna-seq"), annotations=list("ctcf_pos", "ctcf_neg"), a_colors=c("#4572A7", "#AA4643"))
Relies on an existance of an expected (shuffled) 2d track. Builds a grid comprising of all combinations of intervals from feature 1 and feature 2 that fall within a band defined by min_dist and max_dist. For each point on the grid, look at th surrounding window, defined by range parameter. Discard all windows that do not contain a point with a score (defined in scotre_track_nm) above the score_filter parameter. This allows for focusing on potentially enriched pairs. Discect the window into small bins, size in base pairs defined by the resolution parameter, and count the number of observed contacts, and the number of expected contacts in each bin. All windows are then summed together, generating a single matrix of observed and expected contacts, which is returned by function. Note that grid contains only points in which feature 1 position is smaller than feature 2 position.
Create data grid for two sets of features, and visualize it:
grid = shaman_generate_feature_grid(feature1, feature2, obs_track, exp_track, range=25000, resolution=500)
shaman_plot_feature_grid(grid, range=25000, grid_resolution=500, plot_resolution=1000)
The following options are available for this package: