misha package is intended to help users to efficiently analyze genomic data achieved from various experiments. The data must be stored in Genomic Database in certain format that is described later in this document. In addition the document describes fundamental concepts of the package such as track expression, iterators, etc.
Genomic Database starts with a root (also frequently
referred as GROOT), i.e. top directory containing certain
subdirectories and files. A new database can be created using
gdb.create_genome
and gdb.create
functions.
This is the easiest way to do it, see the “Genomes” vignette for more
details. One can also build a database manually by generating all the
necessary components that will be described later in this document.
Before the data in a Genomic Database can be accessed one must
establish connection with it by calling gdb.init
function.
On launch the package connects to a Genomic Database located in
PACKAGEDIR/trackdb/test
which serves all the examples in
the reference manual.
A valid Genomic Database should contain the following files and subdirectories:
chrom_sizes.txt
: is a file containing the list of
chromosomes and their sizes.tracks
: is a directory that servers as a repository for
all tracks and interval sets. May contain other
subdirectories.pssms
: is a directory containing PSSM sets (PSSM data
and PSSM key files).seq
: is a directory containing full genomic
sequences.pssms
and seq
directories are optional and
are required only by a subset of functions in the package.
An example of a Genomic Database file structure:
hg38/ <- Genomic Database root directory
chrom_sizes.txt
.ro_attributes <- List of read-only attributes
pssms/ <- (optional)
motif1.data <- pssm data file
motif1.key <- pssm key file
mypssm.data <- ...
mypssm.key <- ...
seq/ <- (optional)
chr1.seq <- seq (sequence) files
chr2.seq <- ...
chr3.seq <- ...
tracks/
tss.interv <- small intervals set = tss
big_data.interv/ <- big intervals set = big_data
.meta <- summary of the intervals set
chr1 <- chrom files
chr5 <- ...
rpt.track/ <- track = rpt
.attributes <- track attributes (optional)
chr1 <- chrom files
chr2 <- ...
chr3 <- ...
vars/ <- track variables (optional)
myresult <- track variable
test/
intervals1.interv <- intervals = test.intervals1
track1.track/ <- track = test.track1
.attributes <- track attributes (optional)
chr1 <- chrom files
chr2 <- ...
chr3 <- ...
savta/
fourC.track/ <- track = savtra.fourC
chr1 <- chrom files
chr2 <- ...
chr3 <- ...
chrom_sizes.txt
chrom_sizes.txt
file must be located under the root
directory of Genomic Database. This file lists the chromosomes and their
sizes. The chromosome name appears in the first column, the size is
indicated in the second column. The chromosome name should appear
without “chr” prefix. The two columns are separated by tab character.
Example:
1 247249719
2 242951149
3 199501827
X 154913754
Y 57772954
Seq (aka sequence) files are located in seq
directory. Each of the Seq files contains a genomic sequence for a given
chromosome as a contiguous string of ASCII characters. The length of the
string should match the length of the chromosome. The file must be
called chrXXX.seq
where XXX
indicates the name
of the chromosome as it appears in chrom_sizes.txt
file.
Here is an example of an unusually short (25 base pairs) Seq file:
ggtgaAGccctggagattcttatta
Each PSSM Set consists of two files: PSSM key and
PSSM data. The files should be named XXX.key
and
XXX.data
accordingly, where XXX
is the name of
PSSM set. Both files must be placed into pssms
directory.
PSSM Key file contains description of PSSMs in the following format (columns are separated by tab character):
Column | Type | Description |
---|---|---|
ID | Integer | Unique ID (referenced in PSSM Data file) |
Sequence | String | PSSM sequence |
Bidirectional | ‘0’ or ‘1’ | If Bidirectional is ‘1’ energy is calculated on complementary strand as well |
Example:
0 *************ATTAAT************** 1
1 *********A*ACACACACA*****A******* 1
2 *************AAAATGGC*G********** 1
3 *************ACTGCTTG************ 1
4 ****WW**GTWGCATACTTTT*GGCG******* 1
5 *********C*RGCAACATKTTG********** 1
6 ****G*G*G*G*GAGCGAGA*RG********** 1
7 **************CCGAAG************* 1
PSSM Data file contains probability matrices for each PSSM key in the following format (columns are separated by tab character):
Column | Type | Description |
---|---|---|
ID | Integer | Unique ID (must appear in PSSM Key file) |
Position | Integer | Zero based position in the range of [0, length(PSSM sequence)-1] |
Probability of ‘A’ | Numeric | Probability of ‘A’ in the range of [0, 1] |
Probability of ‘C’ | Numeric | Probability of ‘C’ in the range of [0, 1] |
Probability of ‘G’ | Numeric | Probability of ‘G’ in the range of [0, 1] |
Probability of ‘T’ | Numeric | Probability of ‘T’ in the range of [0, 1] |
A 1D interval (or one-dimensional interval)
represents a genomic section. It is defined by
(chrom, start, end)
where start
and
end
are genomic coordinates (start < end
).
The coordinates are zero-based, meaning the chromosome starts at
coordinate 0. The end coordinate marks the last coordinate in the
section plus 1. To represent a point in the genome at coordinate
X
, one should create an interval with start coordinate set
to X
and end coordinate set to X + 1
.
A 2D interval (or two-dimensional interval)
represents a rectangle in a genomic space. It is defined by
(chrom_1, start_1, end_1, chrom_2, start_2, end_2)
, where
start_1, start_2, end_1,
and end_2
are start
and end coordinates that mark the limits of a rectangle.
Multiple intervals can be combined into a table, known as an
interval set or often simply referred to as
intervals. This table is represented by a data frame.
In the case of 1D intervals, the data frame must have the first three
columns named chrom
, start
, and
end
. Similarly, 2D intervals must have the first six
columns named chrom1
, start1
,
end1
, chrom2
, start2
, and
end2
.
Additional columns might be added to the intervals, and some of them
might be utilized by various functions. For instance, the
gintervals.neighbors
function uses the strand
column if it is presented in 1D intervals (should come after the regular
three columns). Use gintervals
and
gintervals.2d
functions to create 1D and 2D intervals,
respectively.
Both 1D and 2D intervals are prevalent in various functions. Some of these functions manipulate the intervals (unify, intersect, etc.). Others use the intervals to limit the function’s scope. There are also functions that perform their calculation for each interval in the interval set.
Dual intervals is a list containing two elements. The first element is a 1D interval set, while the second element is a 2D interval set.
The .misha$ALLGENOME
variable is frequently used as a
default value for the intervals argument. .misha$ALLGENOME
is an interval set of a dual type. .misha$ALLGENOME[[1]]
represents a set of intervals that cover the entire genome (1D), while
.misha$ALLGENOME[[2]]
contains all the possible pairs
between the chromosomes (2D). One can also use
gintervals.all
and gintervals.2d.all
functions
to return all 1D or 2D intervals.
Interval sets can be saved in the Genomic Database. Use the
gintervals.save
and gintervals.load
functions
to save or load an interval set from the database, and
gintervals.update
to update/add/delete a certain chromosome
from the set.
Internally, interval sets can be stored in two different formats:
small interval set or big interval
set. The specific format is chosen depending on the size of the
interval set. Big format is selected for interval sets that contain more
than gbig.intervals.size
intervals
(gbig.intervals.size
is set via options
),
while smaller sets are stored in the small format. Use
gintervals.is.bigset
to determine the format of the stored
interval set.
Saved interval sets in the small format can be seamlessly used in all functions and track expressions without the need to explicitly load them.
# 'annotations' is an intervals set saved in Genomic Database
gintervals.intersect("annotations", gintervals(2))
Likewise, big interval sets can be used in many but not all
functions. A notable exception is gintervals.load
that
allows loading only a single chromosome (or a chromosome pair for 2D
cases) of a big intervals set.
A Track is a data structure that allows binding numeric data (floating point values) to a genomic space (a set of genomic intervals). The data in the tracks can typically be accessed through track expressions that are widely used by various functions of the package.
Two fundamental types of tracks exist: 1D and 2D.
A 1D track (or one-dimensional track) maps numeric values to non-overlapping 1D intervals. The package supports two formats of 1D tracks: Dense (sometimes also referred to as Fixed Bin) and Sparse.
For a Dense track, the size of the genomic interval is always fixed
and called bin size. Numeric values are stored for all
genomic intervals that cover the genome, although some values can be
NaN
. A Dense track file appears as a continuous chunk of
values
,
where
maps to an interval
.
Dense track files do not store interval coordinates, allowing them to
represent large amounts of numeric data compactly. The size of a Dense
track is inversely proportional to the bin size. Random access to a
value at a given coordinate has constant complexity, i.e.,
.
Sparse tracks offer more flexibility compared to Dense tracks. Each
numeric value can map to a genomic interval of any size. The size of a
Sparse track is proportional to the number of numeric values (excluding
NaN
s). On the downside, the complexity of random access to
a value at a given coordinate is
,
where
is the number of values in the track.
To summarize the differences between Dense and Sparse tracks:
Dense | Sparse | |
---|---|---|
Optimal use case | Data covering nearly the whole genome | Data covering a limited portion of the genome |
Values stored | Per bin (interval of a fixed size) | Per interval of an arbitrary size |
Random access complexity | ||
Disk usage | 4 bytes per bin | 20 bytes per value |
1D tracks can be created by various functions, such as
gtrack.create
, gtrack.create_sparse
,
gtrack.import_set
, and more.
An Array track is similar to a Sparse track in that it maps data to one-dimensional intervals of any size. However, unlike Sparse tracks, an Array track can map multiple values to each interval. Array tracks thus store large amounts of data in one track - a task that would otherwise require numerous tracks.
The values in an Array track are organized into
columns, each with a name and an index. You can view it
as an NxM table, where N is the number of intervals, and M is the number
of columns. The size of an Array track is proportional to the total
number of numeric values stored (excluding NaN
s).
As useful as they are, Array tracks should not replace Dense or Sparse tracks. A single Sparse track will always be more compact and efficient than an Array track holding only one column.
You can create Array tracks with the gtrack.array.import
function.
A 2D track (or two-dimensional track) maps numeric values to non-overlapping 2D intervals. These are often used to represent interactions between different parts of the genome.
2D tracks are stored internally in chunks with each
chunk containing multiple track values. When accessing a track value,
its entire chunk is loaded into memory. The byte size of a chunk,
determined by the gtrack.chunk.size
option, is a tradeoff
between single value access (smaller chunk) and multiple value access
(larger chunk).
When accessing multiple track values, multiple chunks might be loaded
into memory. Given the potential size of 2D tracks, the total number of
chunks in memory can be limited using the gtrack.num.chunks
parameter.
Typically, 2D tracks use the Rectangles format. However, a more space-efficient Points format exists, similar in behavior to Rectangles. There’s also a Computed format, which is beyond the scope of this documentation.
Rectangles tracks: gtrack.create
,
gtrack.2d.create
.
Points tracks: gtrack.2d.import_contacts
.
Tracks represent sets of intervals, augmented with values, and can
therefore replace interval sets in functions like gextract
,
gintervals.neighbors
, and
gintervals.chrom_sizes
. The only exception is
Dense tracks which can’t be used in place of interval
sets.
Beyond numeric data, tracks can store metadata, such as descriptions
or sources. This metadata is stored as name-value pairs or attributes
with the value being a string. Tracks created using
gtrack.create
, gtrack.smooth
, etc.,
automatically have created.by
, created.date
,
and description
attributes.
While there’s no strict rule, attributes typically store short strings. For other data formats, consider using track variables.
Attribute management: - Retrieval/Modification:
gtrack.attr.get
, gtrack.attr.set
. - Bulk
Actions: gtrack.attr.export
,
gtrack.attr.import
. - Search by Pattern:
gtrack.ls
.
Some attributes are read-only, like created.by
and
created.date
. Use gdb.get_readonly_attrs
and
gdb.set_readonly_attrs
to manage these.
Track variables store statistics, computation results, historical data, etc., related to a track. Unlike attributes, they can store data in any format.
Variable management: - Retrieval/Modification/Removal:
gtrack.var.get
, gtrack.var.set
,
gtrack.var.rm
. - List variables:
gtrack.var.ls
.
Both track attributes and variables store track metadata, but they have distinct uses:
Track Attributes | Track Variables | |
---|---|---|
Use Case | Track metadata as short strings (e.g., description) | Arbitrary track-associated data |
Value Type | String | Any |
Single Value Retrieval | gtrack.attr.get |
gtrack.var.get |
Single Value Modification | gtrack.attr.set |
gtrack.var.set |
Bulk Retrieval | gtrack.attr.export |
N/A |
Bulk Modification | gtrack.attr.import |
N/A |
Object Names Retrieval | gtrack.attr.export |
gtrack.var.ls |
Object Removal |
gtrack.attr.set (with an empty string) |
gtrack.var.rm |
Search by Value | gtrack.ls |
N/A |
Track expression is a key concept of the package. Track
expressions are widely used in various functions (gscreen
,
gextract
, gdist
, …).
Track expression is a character string that closely resembles a valid
R expression. Just like any other R expression it may include
conditions, functions and variables defined beforehand.
"1 > 2"
, "mean(1:10)"
and
"myvar < 17"
are all valid track expressions. Unlike
regular R expressions track expression might also contain track names or
virtual track names.
How does a track expression get evaluated? A track expression is
accompanied by an iterator that determines a set of intervals
over which the expression iterator goes. For each each iterator interval
the track expression is evaluated. The value of a track expression
"mean(1:10)"
is constant regardless the iterator interval.
However suppose the track expression contains a track name
mytrack
, like: "mytrack * 3"
, and the whole
story becomes very different. The library first recognizes that
mytrack
is not a regular R variable but rather a track
name. A new R variable named mytrack
is added then to R
environment. For each iterator interval this variable is assigned the
corresponding value of the track. This value obviously depends on the
iterator interval. Once mytrack
is assigned the
corresponding value, the track expression is evaluated in R.
So how exactly the value of mytrack
variable is
determined given the iterator interval? We will demonstrate the answer
by the following example. Suppose the track mytrack
is in
sparse format. It consists of a single chromosome with the following
values:
chrom | start | end | value |
---|---|---|---|
chr1 | 100 | 200 | 10 |
chr1 | 200 | 250 | 25 |
chr1 | 500 | 560 | 17 |
chr1 | 600 | 700 | 44 |
What would be the value of the variable mytrack
given an
iterator interval? The resulted value is an average of all values of
track mytrack
covered by the iterator interval. For
example, if the iterator interval is [230, 620)
then the
resulted value is an average of values 25, 17 and 44. Similarly if the
iterator interval is [0, 300)
then the resulted value is an
average of 10 and 25. Lastly if the iterator intervals is
[300, 400)
then the resulted value is
.
Same evaluation logics is applied for Dense and Array tracks. (In the
latter case the values from all columns are averaged.) On contrary
Rectangles track value is calculated as a weighted average of
the values covered by the iterator interval. The weight equals to the
intersection area of the iterator interval and the 2D interval that
contains the value.
See the table below:
Track Type | Value |
---|---|
Dense | Average of non values covered by iterator interval. |
Sparse | Average of non values covered by iterator interval. |
Array | Average of non values from all columns covered by iterator interval. |
Rectangles | Weighted average of non values covered by iterator interval. Each weight equals to the intersection area between iterator interval and track interval that contains the value. |
So far we showed that the value of a mytrack
variable is
set to be the average (or weighted average) of the track values that are
covered by the iterator interval. But what if we do not want to average
the values but rather pick up the maximal or minimal values? What if we
want to use the percentile of a track value rather than the value
itself? And maybe we even want to alter the iterator interval itself on
the fly? This is where virtual tracks become useful.
Virtual track is a set of rules that describe how the “source” (a
real track or intervals) should be proceeded, and how the iterator
interval should be modified. Virtual tracks are created with
gvtrack.create
function:
gvtrack.create("myvtrack", "dense_track")
This call creates a new virtual track named myvtrack
.
This virtual track can be used in the track expression instead of a real
track dense_track
. In our example myvtrack
is
just an alias of dense_track
. Yet we can go on and create a
more complicated virtual track if we specify a “function”, i.e. instruct
the virtual track of what should be its value in track expression.
gvtrack.create("myvtrack", "dense_track", "global.percentile")
In this example when myvtrack
is evaluated in the track
expression it will return the percentile of
among the values of dense_track
where
is an average (or weighted average) of the track values that are covered
by the iterator interval.
Virtual tracks are especially useful for Array tracks. By default if
an Array track is used in a track expressions, its interval value would
be the average of all non-NaN column values covered by an iterator
interval. gvtrack.array.slice
allows to select specific
columns and to specify the function applied to the values of each track
interval.
gvtrack.create("myvtrack", "array_track", "sum")
gvtrack.array.slice("myvtrack", c("col2", "col5"), "max")
In this example we create a virtual track based on
array_track
. Assume that an iterator interval
covers
different intervals in array_track
:
.
The value of myvtrack
in a track expression would be then:
where
is a value of the track in column
for interval
.
Virtual tracks allow also to alter the iterator interval “on the fly”:
gvtrack.iterator("myvtrack", sshift = -100, eshift = 200)
In this example we expand each iterator interval by adding -100 to
its start
coordinate and 200 to its end
coordinate.
Similarly, iterator modifiers can be defined for 2D intervals. Moreover, an iterator modifier can create a 1D interval from a 2D iterator interval by projecting one of its axes.
gvtrack.create("myvtrack", "dense_track")
gvtrack.iterator("myvtrack", dim = 2)
It is important to remember that iterator modifiers transform the iterator interval only for the given virtual tracks. Assume an iterator interval and two virtual tracks and . If is a 2D interval then band rules are applied first to it. is transformed then to and according to the modification rules defined by the virtual tracks. Finally, and are passed to and accordingly as the iterator intervals.
So far we have used a track dense_track
as a “source” of
a virtual track. We can also use intervals as a source. In this case,
the value of the virtual track will be some function that takes into
account the “source” intervals and the current iterator interval.
gvtrack.create("myvtrack", "annotations", "distance")
intervs <- gscreen("dense_track > 0.45")
gextract("myvtrack", .misha$ALLGENOME, iterator = intervs)
In this example myvtrack
returns the minimal distance
between intervals from an interval set annotations
and the
center of the current iterator interval from intervs
.
For a full list of supported functions please see
gvtrack.create
and gvtrack.array.slice
functions.
As described in the previous chapter virtual tracks define a set of
rules of how to access and proceed the values of the “source” object.
The connection between the virtual track and the source object is done
via “soft link”, i.e. by name and not by reference. For example, a
virtual track will continue to exist until explicitly removed by
gvtrack.rm
even if the physical track that it is pointing
to is deleted or renamed.
Operations such as gdb.init
and gdir.cd
alter the list of available tracks and intervals sets. Since these
objects are referenced by virtual tracks, these latter are always
defined in the context of the current working directory in Genomic
Database (not to be confused with shell’s current working directory).
Changing the current working directory using gdb.init
or
gdir.cd
will also change the list of available virtual
tracks.
Another issue to bare in mind is that unlike regular tracks whose
data is stored on disk virtual tracks are non-persistent objects in
current R environment. Their definition is stored in
GVTRACKS
R variable. In particular a virtual track named
“vtrack” that was created within a context of “/home/user/trackdb”
Genomic Database working directory would reside in
GVTRACKS[["/home/user/trackdb"]][["vtrack"]]
. One can also
use gvtrack.info
function that provides a more convenient
access to virtual track definitions.
As the virtual tracks are stored in an R variable their behavior hence complies with the rules of other R variables: a virtual track defined by one user will not be seen by another one, virtual tracks might disappear once R is relaunched, etc.
To preserve the definition of virtual tracks between the sessions one
would need to save GVTRACKS
variable on disk. The
serialization of GVTRACKS
is under user’s responsibility.
The standard suit of functions for saving / loading R variables can be
used for that purpose.
Note that if GVTRACKS
is loaded from a file or changed
manually by a user the auto-completion list (in case it is
turned on) might need to be refreshed by calling
gdb.reload
.
Previously we described how a track expression
"mytrack * 3"
(where mytrack
is a track name)
leads to an implicit definition of mytrack
variable in R
environment. To make our explanation easier we presented this variable
as a scalar whose value is altered each time the iterator interval
changes. It’s time to admit that that was oversimplification. In reality
the library defines mytrack
variable as a vector (i.e. an
array) and not as a single scalar. The vector is filled then with the
corresponding values of the track. Finally the track expression is
evaluated in R and the result is expected to be also a vector of the
same size as mytrack
vector. Working with vectors rather
than single scalars reduces the number of evaluations within R and hence
improves run-times.
The size of the vector is controlled via gbuf.size
option. By default it equals to 1000. Altering this value (for instance
setting it to 1) might significantly affect the run-time of various
functions in the library. If you still wish to force the functions to
define scalars rather than vectors, set gbuf.size
to 1:
options(gbuf.size = 1)
One might wonder why should we care about the fact that
mytrack
is not a scalar but rather a vector? Indeed in many
cases it does not really matter. For example mytrack * 3
expression produces exactly the same results regardless whether
mytrack
is defined internally as a vector or as a scalar.
This is due to the fact that the expression V * 3
(V
stands for a vector) results in each value of
V
being multiplied by 3.
Multiplication is a good example of “parallel” operation in R (works
on each element in vector separately). On contrary some functions that
accept a vector might return a scalar rather than a vector. Such is, for
example, min
function.
Let’s look at the following track expression:
track1 + min(track1, track2)
. This expression was probably
meant to produce a sum of track1
track and a minimum value
between track1
and track2
tracks for each
iterator interval. However the library defines the variables
track1
and track2
to be vectors of
gbuf.size
size (by default: 1000). min
is not
a “parallel” operation. Given two vectors of any size it returns a
single scalar that is the minimal value of all values in both
of the vectors. Therefore track1 + min(track1, track2)
will
be interpreted as track1 + M
, where M is minimum of 2000
values (1000 values from track1
track, and another 1000 -
from track2
track). We can hardly imagine that a user would
have really meant this! Sadly enough the expression will be seamlessly
evaluated and produce a valid, but meaningless result. The solution for
our example is to use pmin
rather than min
function.
The library always verifies that the evaluation of the track expression produces a vector of the same size as the size of a track variable. In many cases this procedure is able to reveal faulty track expressions. Yet in more tricky examples like the one that we used before the library will not warn the user.
Make sure your track expressions work correctly on vectors!
During the evaluation of a track expression one can access a
specially defined variable named GITERATOR.INTERVALS
. This
variable contains a set of iterator intervals for which the track
expression is evaluated. GITERATOR.INTERVALS
contains the
same number of intervals as the size of mytrack
vector from
our previous example. The value of a track mytrack
for an
interval i
is stored at mytrack[i]
.
Note that some intervals in GITERATOR.INTERVALS
might
have a start coordinate equal to -1. Skip those intervals and the values
of mytrack
at the corresponding index.
So far we have discussed in details how the track expression is evaluated given the iterator interval. Yet how the iterator intervals can be controlled?
Most of the functions that accept track expressions have an
additional parameter named iterator
. The value of this
parameter determines the iterator intervals which is also sometimes
called an iterator policy:
Value | Iterator Policy Type | Example | Description |
---|---|---|---|
Integer | Fixed Bin | 50 |
Iterator intervals will advance by a fixed step (bin) starting from
zero coordinate up to chromosome’s length:
[0,50), [50,100), [100,150), ...
|
Dense track | Fixed Bin | "dense_track" |
Use the bin size of the track as a fixed step. |
1D intervals | 1D Intervals | "annotations" |
Iterate over the supplied intervals. Note: the intervals are sorted and overlapping intervals are unified. |
Sparse track | 1D Intervals | "sparse_track" |
Iterate over the intervals of a sparse track. |
Array track | 1D Intervals | "array_track" |
Iterate over the intervals of an array track. |
c(integer, integer) | 2D Intervals | c(1000, 2000) |
2D iterator intervals will cover the whole 2D chromosomal space by rectangles of fixed size: Width X Height. Please keep in mind that small rectangles used without a limiting scope might result in immense number of iterator intervals. |
2D intervals | 2D Intervals | gintervals.2d(c(1, 2)) |
Iterate over the supplied intervals. Note: the intervals are sorted and overlapping is forbidden. |
Rectangles track | 2D Intervals | "rects_track" |
Iterate over the intervals of a Rectangles track |
Cartesian grid iterator | 2D Intervals | giterator.cartesian_grid( intervals1, intervals2, c(10, 20, 30)) |
Iterate over 2D cartesian grid (see
giterator.cartesian_grid function) |
NULL | Fixed Bin OR 1D Intervals OR 2D Intervals | NULL | Implicitly determine the iterator policy based on the tracks that appear in the track expression. If no track names presented or two different tracks determine different iterator policy, an error is reported. |
Many functions that accept a track expressions and iterator policy accept an additional set of intervals that limit the scope of a function. This scope also limits the iterator intervals. For instance:
gextract("dense_track", gintervals(2, 340, 520))
As one can notice the first and the last intervals in the result are
truncated by the scope [340, 520)
.
In some cases the combination of iterator policy and scope might
result in nontrivial set of iterator intervals. Use
giterator.intervals
function to retrieve the iterator
intervals given a track expression, scope and an iterator.
As explained before track expression iterator can be determined
implicitly or through an iterator
parameter. In either case
the result is a set of 1D or 2D intervals depending on how the iterator
was defined. If iterator intervals are 2D an additional filter can be
applied to them: a band.
A band is a pair of integers: . We say that a 2D iterator interval intersects a band if and only if the next two conditions are true:
In a less formal way we can see a band as a space between two 45-degrees diagonals where determine where these diagonals cross axis. An iterator interval represents a rectangle in a 2D space and can be therefore intersected with S. The result of the intersection can be a rectangle, a trapeze, a triangle, a hexagon or it can be empty if the interval does not intersect with the band. If the intersection is non empty, the resulted figure, whatever it is, can be bound by some larger rectangle. The rectangle that has the minimal space and yet containing the intersected shape is called the minimal rectangle.
After the formal definitions it’s time to say how band is actually applied.
If the intersection between the 2D iterator interval and the band is non-empty and , the minimal rectangle replaces the original iterator interval. Otherwise the iterator interval is skipped as it lies outside of the band or the two chromosomes are not equal.
The gintervals.2d.band_intersect
function can help one
better understand the concept:
intervs <- gintervals.2d(1, 200, 800, 1, 100, 1000)
intervs <- rbind(intervs, gintervals.2d(1, 900, 950, 1, 0, 200))
intervs <- rbind(intervs, gintervals.2d(1, 0, 100, 1, 0, 400))
intervs <- rbind(intervs, gintervals.2d(1, 900, 950, 2, 0, 200))
intervs
gintervals.2d.band_intersect
intersects the intervals
with the band and returns the intervals shrunk to the minimal rectangle.
As you can see we have four different intervals. The first one
(chr1, 200, 800, chr1, 100, 1000)
intersects the band and
after shrinking to the minimal rectangle it becomes
(chr1, 600, 800, chr1, 100, 300)
. The second interval lies
entirely within the band and hence is returned without any change. The
third interval lies entirely outside of the band, and hence is
eliminated from the result. The last interval is coming from two
different chromosomes and therefore is also filtered out.
As said band filters out and alters 2D iterator intervals. Yet it also affects the result of 2D tracks. Let’s look at the following example:
intervs <- gintervals.2d(1, c(100, 400), c(300, 490), 1, c(120, 180), c(200, 500))
gtrack.2d.create("test2d", "test 2D track", intervs, c(10, 20))
gextract("test2d", .misha$ALLGENOME)
gextract("test2d", .misha$ALLGENOME, iterator = gintervals.2d(1, 0, 1000, 1, 0, 1000))
gintervals.2d.band_intersect(intervs, band = c(150, 1000))
gextract("test2d", .misha$ALLGENOME, iterator = gintervals.2d(1, 0, 1000, 1, 0, 1000), band = c(150, 1000))
gtrack.rm("test2d", force = TRUE)
We created a 2D track test2d
and inserted two values
into it:
and
.
If an iterator interval covers all the track’s rectangles, the resulted
value of the track would be a weighted average of its values where the
weight is equal to the intersected area. In our example it is
.
We added a band then. gintervals.2d.band_intersect
shows
the minimal rectangles: the intersection result of the original
rectangles with the band. The output of the new gextract
has been changed accordingly: the new weights in the weighted average
are equal to the new and smaller intersected area. The value has changed
therefore to:
.
Note, however, that the space used in the calculation of the weighted average is the actual space of the intersection and not the space occupied by the minimal rectangles!
Various functions in the library such as gsample
make
use of pseudo-random number generator. Each time the function is invoked
a unique series of random numbers is issued. Hence two identical calls
might produce different results. To guarantee reproducible results call
set.seed
before invoking the function.
To boost the run time performance various functions in the library
support multitasking mode, i.e. parallel computation of the result by
several concurrent processes. The exact number of processes internally
launched depends on the specific call however the upper bound can be
controlled by a few parameters such as gmax.processes
(absolute upper bound), gmax.processes2core
(maximal number
of processes per CPU core) and gmin.scope4process
(minimal
scope range / surface assigned to a process). Multitasking can also be
completely switched off by setting gmultitasking
parameter
to FALSE
.
For certain functions multitasking might result in higher memory
consumption. Users who have per process virtual memory limit (see:
ulimit -v
) might be the first to suffer from memory
allocation errors.
Various factors can affect the memory usage such as the number of
running processes used for parallel computation, the value of
gmax.data.size
option or the combination of both. Some of
the functions such as gscreen
or gextract
consume in multitasking mode amount of memory proportional to
gmax.data.size
. Please be aware of it while altering the
value of this option.
To limit memory consumption in multitasking mode one might lower down
the values of gmax.data.size
and
gmax.mem.usage
options or even switch off multitasking mode
completely. gmax.mem.usage
indicates the upper limit in KB
of memory consumed cumulatively by the child processes. Once this limit
is breached an internal mechanism tries to pause some of the running
child processes, thereby preventing them from allocating more memory.
The paused processes are resumed once the memory consumption drops or
other sibling processes end.
One should not expect the internal limiting mechanism to be the
panacea for memory hungry tasks. First, the memory consumption of some
of the functions is proportional to gmax.data.size
option
regardless of the number of running processes. Second, even when the
memory limit is exceeded at least one process is still left to run and
to potentially increase the memory consumption further. Third, the
mechanism is mainly periodic, i.e. excessive memory consumption is
detected only once in a while. The decision to pause running processes
is thus periodic as well. The memory that has already been consumed in
the time gap between the checks will not be release up until the whole
task is complete.
It is worth to say a word about memory consumption. Deducting real memory usage of the process based on “top”, “ps” or other utilities of similar kind might be highly misleading. Since all the processes are spawned from R, their memory usage as reported by these utilities will be at least as high as that of their parent process. If, for example, R process uses 5 Gb of memory and 10 processes are spawned from it, the virtual memory of all these 11 processes will top 55 Gb. Yet the majority of the consumed memory will be shared and unless the child processes start modifying this memory or allocating new one, the physical free memory of the machine will remain almost unaltered. The internal memory consumption limiting mechanism tries to estimate the drop of system free memory and hence deducts its data from counting “Private Dirty” bytes (on Linux) or from internal estimation (on other platforms) - a very different datum from what “top” is reporting.