Naryn allows efficient access and analysis of medical records that are maintained in a custom database.
Naryn can work under R (as a package) or Python (as a module). The vast majority of the functions and concepts are shared between the two implementations, yet certain differences still exist and are summarized in a table below. Code examples and function names in this document are presented for R, but they can equally be run in Python after applying the interface changes listed in that table.
Naryn allows accessing the data that resides in tracks, where each track holds a certain type of medical data, such as patients' diagnoses or their hemoglobin level at certain points in time. The track files can be aggregated from one or more directories. Before the tracks can be accessed, Naryn needs to establish a connection to these directories, also referred to as db dirs. Call the `emr_db.connect` function to establish access to the tracks in the db dirs. `emr_db.connect` requires at least one db dir to be specified and optionally accepts additional db dirs, which may contain additional tracks. In case two or more db dirs contain the same track name (a namespace collision), the track is taken from the db dir that was passed last in the order of connections. For example, if we have two db dirs `/db1` and `/db2` which both contain a track named `track1`, the call `emr_db.connect(c('/db1', '/db2'))` will result in Naryn using `track1` from `/db2`. As you might expect, the overriding is consistent not only for the track's data itself, but also for any other Naryn entity using or pointing to the track.
Even though all db directories may contain track files, their designation differs. All the db dirs except the last one in the order of connections are mainly read-only. The directory that was connected last is termed the user dir, and it is intended to store volatile data like the results of intermediate calculations. New tracks can be created only in the db dir that was last in the order of connections, using `emr_track.import` or `emr_track.create`. To write tracks to a db dir which is not last in the connection order, you must explicitly pass the path of the required db dir, and this should be done only for a well-justified reason.
A track may be marked as read-only to prevent its accidental deletion or modification. Use `emr_track.readonly` to set or get the read-only property of a track. A newly created track is always writable; if you wish to mark it as read-only, do so in a separate call.
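For instance, a minimal sketch (the track name is hypothetical, and the get/set call forms are assumed from the description above; see the documentation of `emr_track.readonly` for the exact signature):

```r
# Query the current read-only state of the track
emr_track.readonly("mytrack")

# Mark the track as read-only in a separate call, after it has been created
emr_track.readonly("mytrack", TRUE)
```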
`emr_db.connect` supports two modes of work: 'load on demand' and 'pre-load'. In 'load on demand' mode tracks are loaded into memory only when they are accessed. Tracks stay in memory until the R session ends or the package is unloaded (Python: since modules cannot be forced to unload, `db_unload` is introduced). In 'pre-load' mode, all the tracks are pre-loaded into memory, making subsequent track access significantly faster. As loaded tracks reside in shared memory, other R sessions running on the same machine may also enjoy a significant run-time boost. On the flip side, pre-loading all the tracks prolongs the execution of `emr_db.connect` and requires enough memory to accommodate all the data.
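A minimal sketch of the two modes (the db dir paths are hypothetical; `load_on_demand` is the flag discussed below):

```r
# Load tracks lazily, only when they are first accessed
emr_db.connect(c("/db1", "/db2"), load_on_demand = TRUE)

# Pre-load all tracks into shared memory: slower connect, faster subsequent access
emr_db.connect(c("/db1", "/db2"), load_on_demand = FALSE)
```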
Choosing between the two modes depends on the specific needs. While `load_on_demand=TRUE` seems to be a solid default choice, in an environment with frequent short-lived R sessions, each accessing a track, one might opt for running a "daemon": an additional permanent R session. The daemon would pre-load all the tracks in advance and stay alive, thus boosting the run-time of the later emerging sessions.
Naryn caches certain data on disk to maintain fast run-times. In particular, two files (`.naryn` and `.ids`) are created in any database, and another file called `.logical_tracks` is created in global databases. The `.naryn` file contains a list of all tracks in the current root directory and their last modification dates. This file spares a full root directory rescan when `emr_db.connect` is called. The recorded modification dates make it possible to efficiently synchronize track changes induced by concurrently running R sessions. `.logical_tracks` implements the same mechanism for logical tracks, which store their properties (source and values) under a folder called `logical`. The `.ids` file contains the available ids that are used to run certain types of track expression iterators (see below). The source of these ids is the `patients.dob` (i.e. Date Of Birth) track, which must be present in the global root directory before these iterators may be used.
Various functions such as `emr_track.import` modify these files according to the changes that the DB undergoes (addition / removal / modification of tracks). Thus manual (outside of Naryn) modification, replacement, addition or deletion of track files causes the cache files to go out of sync. Various problems might arise as a consequence, such as run-time errors, outdated data from modified tracks and sub-optimal run-time performance.

Manual modifications of the database files can still be performed, yet they must be ratified by running `emr_db.reload`.
Naryn creates files and directories with a umask of `007` (except for read-only tracks), which means that files and directories would have permissions of `660 (rw-rw----)` and `770 (rwxrwx---)` respectively. This means that in order to access a database that someone outside the group created, the file and folder permissions need to be changed first.
Each track is stored in a binary file with the `.nrtrack` file extension. One of two internal formats, dense or sparse, is automatically selected during track creation. The choice of the exact format is based on the optimal run-time performance.
A track is a data structure that stores a set of records of the form `(id, time, ref, numeric value)`. For example, the hemoglobin level of patients can be stored in this way, where `id` would be the id of the patient and `time` would indicate the moment when the blood test was taken. Another track can contain the code of the laboratory which carried out the test. If the times of the records from the two tracks match, one can determine which lab performed a given test.
Time resolution is always in hours. It might happen that two different blood tests are carried out by two different labs for the same patient within the same hour. Assuming that each lab has a certain bias due to the different equipment used, the hemoglobin readings might come out different. Since both of the tests are carried out in exactly the same hour, it would later be impossible to link each result to the lab that performed it.
In those cases when two or more values share an identical `id` and `time`, Naryn requires them to use different `ref` (reference) values. A reference is an integer number in the range [-1, 254], which is normally set to -1 when no time collision occurs. However, in cases of ambiguity it can give additional resolution to the time. In our blood-test example, the results of the first lab could be recorded with `ref = 0` and those of the second lab with `ref = 1`. This way the two hemoglobin readings could later be separated and correctly linked to their originating labs.
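To illustrate, the two colliding hemoglobin records could look like this (a purely illustrative data frame; the id, time and values are made up):

```r
# Two hemoglobin readings of the same patient at the same hour,
# disambiguated by the reference column
data.frame(
  id    = c(101, 101),        # same patient
  time  = c(403200, 403200),  # same hour
  ref   = c(0, 1),            # first lab vs. second lab
  value = c(13.9, 14.3)       # the two hemoglobin readings
)
```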
Tracks store numerical values assigned to patients and times. The numerical data, however, can have different meanings and hence call for different sets of operations to be applied to it. Laboratory codes, diagnosis codes and binary information such as date of birth or doctor visits are one type of data, which we call categorical. Another type of data usually indicates readings of instruments, such as the heartbeat rate or glucose level. This type of data is called quantitative.

The operations that can be applied to these two types can be very different. One might want to search for a specific diagnosis code, yet it makes little sense to search for a very specific heartbeat rate, say "68". On the contrary, heartbeat rate readings from different times can be averaged - something that has no meaning in the case of categorical data.

During track creation one must specify the type of the track: categorical or quantitative. Various operations that can later be applied to the track are bound to the track type.
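As a sketch only (the argument names `categorical` and `src` are assumptions here; consult the documentation of `emr_track.import` for the exact signature), creating a quantitative versus a categorical track could look roughly like this:

```r
# Hypothetical source data frames in the Id-Time Values format (id, time, ref, value)
hb <- data.frame(id = c(101, 102), time = c(403200, 510000), ref = -1, value = c(13.9, 15.2))
dx <- data.frame(id = c(101, 102), time = c(403200, 510000), ref = -1, value = c(250, 401))

# A quantitative track for instrument readings ...
emr_track.import("hemoglobin", categorical = FALSE, src = hb)

# ... and a categorical track for diagnosis codes
emr_track.import("diagnosis_codes", categorical = TRUE, src = dx)
```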
In addition to the physical tracks which are stored in binary files, naryn supports the concept of a logical track, which is an alias to a physical track. For example, assume we have a track called `lab.103` which contains hemoglobin levels of patients. It would be more convenient to refer to it explicitly as `hemoglobin` instead of remembering the lab code. Logical tracks do exactly this: we can create a logical track called `hemoglobin` which refers to the physical `lab.103`:

```r
emr_track.logical.create("hemoglobin", "lab.103")
emr_extract("hemoglobin")
```
You can also use logical tracks to create an alias for specific values from a categorical track. For example, suppose we have a track called `diagnosis.250` which contains the diagnosis times of ICD code 250 ("250.*"), with the values being the sub-diagnosis (e.g. `1` for 250.1 and `4` for 250.4). Logical tracks allow us to create an alias for specific sub-diagnosis values and then refer to it as a regular track:

```r
emr_track.logical.create("dx.250.1_4", "diagnosis.250", values = c(1, 4))
emr_extract("dx.250.1_4")
```
Under the hood, logical tracks are implemented using the virtual tracks mechanism (see below), but unlike virtual tracks they are part of the database and persist between sessions. You can delete a logical track by calling `emr_track.logical.rm` and list logical tracks using `emr_track.logical.ls`.
In addition to numeric data, a track may store arbitrary meta-data such as a description, source, etc. The meta-data is stored in the form of name-value pairs, or attributes, where the value is a character string. Though not officially enforced, attributes are intended to store relatively short character strings. Please use track variables to store data in any other format.
A single attribute can be retrieved, added, modified or deleted using the `emr_track.attr.get` and `emr_track.attr.set` functions. Bulk access to more than one attribute is facilitated by the `emr_track.attr.export` function.

Track names whose attribute values match a pattern can be retrieved using the `emr_track.ls`, `emr_track.global.ls` and `emr_track.user.ls` functions.
Track statistics, results of time-consuming per-track calculations, historical data and any other data in arbitrary format can be stored in a track's supplementary data in the form of track variables. A track variable can be retrieved, added, modified or deleted using the `emr_track.var.get`, `emr_track.var.set` and `emr_track.var.rm` functions. The list of track variables can be retrieved using the `emr_track.var.ls` function.

Note: track variables created in R are not visible in Python and vice versa.
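A minimal sketch of the attribute and variable calls (the track name and stored values are hypothetical; the `(track, name, value)` argument order is assumed from the description above):

```r
# Store and read a short descriptive attribute
emr_track.attr.set("hemoglobin", "description", "Hemoglobin blood test, lab 103")
emr_track.attr.get("hemoglobin", "description")

# Store and read arbitrary R data as a track variable
emr_track.var.set("hemoglobin", "notes", list(created = Sys.Date(), comment = "intermediate stats"))
emr_track.var.get("hemoglobin", "notes")
```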
Though both track attributes and track variables can be used to store meta-data of a track, there are a few important differences between the two that are summed up in the following table:
|  | Track Attributes | Track Variables |
|---|---|---|
| Optimal use case | Track properties as short, non-empty character strings (description, source, …) | Arbitrary data associated with the track |
| Value type | Character string | Arbitrary |
| Single value retrieval | `emr_track.attr.get` | `emr_track.var.get` |
| Bulk value retrieval | `emr_track.attr.export` | — |
| Single value modification | `emr_track.attr.set` | `emr_track.var.set` |
| Object names retrieval | `emr_track.attr.export` | `emr_track.var.ls` |
| Object removal | `emr_track.attr.rm` | `emr_track.var.rm` |
| Search by value | R: `emr_track.ls`, `emr_track.global.ls`, `emr_track.user.ls` | — |
| R vs. Python compatibility | Yes | No |
The analysis of data often involves dividing the data into train and test sets. Naryn allows subsetting the data via the `emr_db.subset` function. `emr_db.subset` either accepts a list of ids or samples the ids randomly. These ids constitute the subset. The ids that are not in the subset are skipped by all the iterators, filters and various functions.

One may think of a subset as an additional layer, a "viewport", that filters out some of the ids. Some lower-level functions such as `emr_track.info` or `emr_track.unique` ignore the subset. The same applies to the `percentile.*` functions of the virtual tracks.
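As an illustration only (the ids are made up, and passing them as the first argument of `emr_db.subset` is an assumption; see its documentation for the exact arguments, including random sampling):

```r
# Restrict the session to three patients; all iterators and filters will skip other ids
emr_db.subset(c(101, 102, 205))
```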
A track expression allows retrieving numerical data that is recorded in the tracks. Track expressions are widely used in various functions (`emr_screen`, `emr_extract`, `emr_dist`, …).

A track expression is a character string that closely resembles a valid R/Python expression. Just like any other R/Python expression it may include conditions, function calls and variables defined beforehand. `"1 > 2"`, `"mean(1:10)"` and `"myvar < 17"` are all valid track expressions. Unlike regular R/Python expressions, a track expression might also contain track names and / or virtual track names.
To understand how a track expression accesses the tracks, we must explain how the track expression gets evaluated.
Every track expression is accompanied by an iterator that produces a set of id-time points of the form `(id, time, ref)`. For each iterator point the track expression is evaluated. The value of the track expression `"mean(1:10)"` is constant regardless of the iterator point. However, the track expression might contain a track name `mytrack`, like `"mytrack * 3"`. Naryn then recognizes that `mytrack` is not a regular R/Python variable but rather a track name. A new run-time track variable named `mytrack` is then added to the R environment (or the Python module's local dictionary). For each iterator point this variable is assigned the value of the track that matches `(id, time, ref)` (or NaN if no matching value exists in the track). Once `mytrack` is assigned the corresponding value, the track expression is evaluated in R/Python.
To boost the performance of track expression evaluation, run-time track variables are actually defined as vectors in R rather than scalars. The result of the evaluation is expected to be a vector of a similar size. One should always keep this vector semantics in mind and write the track expressions accordingly.
For example, at first glance the track expression `"min(mytrack, 10)"` seems to be perfectly fine. However, the evaluation of this expression always produces a scalar, i.e. a single number, even if `mytrack` is actually a vector. The way to correct this specific track expression so that it works on vectors is to use the `pmin` function instead of `min`.
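A short illustration, using the hypothetical `mytrack` name from above:

```r
# Wrong: min() collapses the whole evaluation buffer into a single number
emr_extract("min(mytrack, 10)")

# Right: pmin() caps each value at 10, element by element
emr_extract("pmin(mytrack, 10)")
```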
Python
Similarly to R, a track variable in Python is not a scalar but rather an instance of `numpy.ndarray`. The evaluation of a track expression must therefore produce a `numpy.ndarray` as well. Various operations on numpy arrays indeed work the same way as with scalars; logical operations, however, require a different syntax. For instance, combining conditions on `mytrack1` and `mytrack2` with Python's `and`/`or` keywords will produce an error, given that `mytrack1` and `mytrack2` are numpy arrays. The correct way is to write the expression with the element-wise operators `&` and `|` instead.
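A sketch of the two forms as track expression strings (the track names `mytrack1` and `mytrack2` are hypothetical, and the thresholds are made up):

```python
# Raises "ValueError: The truth value of an array ... is ambiguous",
# because `and` asks for a single truth value of a whole numpy array
"(mytrack1 > 5) and (mytrack2 > 5)"

# Works: `&` is applied element-wise to the numpy arrays
"(mytrack1 > 5) & (mytrack2 > 5)"
```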
One may coerce the track variable to behave like a scalar by setting the `emr_eval.buf.size` option to `1` (see Appendix for more details). Beware though that this might take a heavy toll on run-time.
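For example, using R's standard options mechanism (see the Appendix):

```r
# Evaluate track expressions one point at a time (scalar-like), at a significant run-time cost
options(emr_eval.buf.size = 1)
```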
If the track expression contains a track (or virtual track) name, then the values from the track are fetched one by one into the identically named R variable based on the `id`, `time` and `ref` of the iterator point. If, however, the `ref` of the iterator point equals `-1`, it is treated as a "wildcard": matching is then required only for `id` and `time`.
A "wildcard" reference in the iterator might create a new issue: more than one track value might then match a single iterator point. In this case the value placed in the track variable (e.g. `mytrack`) depends on the type of the track. If the track is categorical, the track variable is set to `-1`; otherwise it is set to the average of all matching values.
So far we have shown that in some situations the `mytrack` variable can be set to the average of the matching track values. But what if we do not want to average the values but rather pick the maximal, minimal or median value? What if we want to use the percentile of a track value rather than the value itself? And maybe we even want to alter the time of the iterator point: shift it or expand it to a time window, and thereby look at a different set of track values? For instance, given an iterator point we might want to know what was the maximal level of glucose during the last year that preceded the time of the point.

This is where virtual tracks come into play.
A virtual track is a named set of rules that describe how the source track values should be processed, and how the time of the iterator point should be modified. Virtual tracks are created by the `emr_vtrack.create` function:

```r
emr_vtrack.create("annual_glucose",
  src = "glucose_track", func = "quantile",
  param = 0.5, time.shift = c(-year(), 0)
)
```
This call creates a new virtual track named `annual_glucose` based on the underlying physical source track `glucose_track`. For each iterator point with time `T` we look at the values of `glucose_track` in the time window `[T-365*24, T]`, i.e. one year prior to `T`. We then calculate the median over these values (`func="quantile"`, `param=0.5`).
There is a rich set of functions besides "quantile" that can be applied to the track values. Some of these functions can be used only with categorical tracks, others only with quantitative tracks, and some can be applied to both types of track. Please refer to the documentation of `emr_vtrack.create`.
Once a virtual track is created it can be used in a track expression:

```r
emr_extract("annual_glucose", iterator = list(year(), "patients.dob"))
```

This would give us the median annual glucose level in year-long steps starting from the patient's birthday. (This example makes use of the Extended Beat Iterator, which is explained later.)
Let's expand our example further and ignore in our calculations the glucose readings that were made within a week after steroids had been prescribed. We can use an additional `filter` parameter to do that:

```r
emr_filter.create("steroids_filter", "steroids_track", time.shift = c(-week(), 0))
emr_vtrack.create("annual_glucose",
  src = "glucose_track", func = "quantile",
  param = 0.5, time.shift = c(-year(), 0), filter = "!steroids_filter"
)
emr_extract("annual_glucose", iterator = list(year(), "date_of_birth_track"))
```
The filter is applied to the ID-Time points of the source track (e.g. `glucose_track` in our example). The virtual track function (`quantile`, …) is then applied only to the points that pass the filter. The concept of filters is explained extensively in a separate chapter.
Virtual tracks also allow remapping the patient ids. This is done via the `id.map` parameter, which accepts a data frame that defines the id mapping. Remapping ids might be useful when family ties are explored: for example, instead of the glucose level of the patient we might be interested in the glucose level of one of his family members.
So far we have discussed the track expressions and how they are evaluated given the iterator point. In this section we will show how the iterator points are generated.
An iterator is defined via the `iterator` parameter. There are a few types of iterators, such as the track iterator, the beat iterator, etc. The type determines which points are generated by the iterator. The information about each type is listed below.
An iterator is always accompanied by four additional parameters: `stime`, `etime`, `keepref` and `filter`. `stime` and `etime` delimit the time scope of the iterator: the points that the iterator generates always lie within these boundaries. The effect of `keepref=TRUE` depends on the iterator type; however, for all iterator types, if `keepref=FALSE` the reference of all iterator points is set to `-1`. The `filter` parameter sets the iterator filter, which is discussed thoroughly later in the document in a separate chapter.
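For instance, a sketch combining these parameters (the track and the filter are the hypothetical ones used elsewhere in this document):

```r
stime <- emr_date2time(1, 1, 2016, 0)     # 1 Jan 2016, 00:00
etime <- emr_date2time(31, 12, 2016, 23)  # 31 Dec 2016, 23:00

# Iterate over glucose_track points within 2016, dropping references, and keep
# only the points approved by the (previously created) steroids_filter
emr_extract("glucose_track", iterator = "glucose_track",
            stime = stime, etime = etime, keepref = FALSE,
            filter = "!steroids_filter")
```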
The track iterator returns the points (including the reference) from the specified track. The track name is specified as a string. If `keepref=FALSE` the reference of each point is set to `-1`.
Example:

```r
# Returns the level of glucose one hour after the insulin shot was made
emr_vtrack.create("glucose", "glucose_track", func = "avg", time.shift = 1)
emr_extract("glucose", iterator = "insulin_shot_track")
```
The Id-Time points iterator generates points from an id-time points table (see: Appendix). If `keepref=FALSE` the reference of each point is set to `-1`.

Example:

```r
# Returns the level of glucose one hour after the insulin shot was made
emr_vtrack.create("glucose", "glucose_track", func = "avg", time.shift = 1)
r <- emr_extract("insulin_shot_track") # <-- implicit iterator is used here
emr_extract("glucose", iterator = r)
```
The Ids iterator generates points with ids taken from an ids table (see: Appendix) and times that run from `stime` to `etime` with a step of 1.

If `keepref=TRUE`, for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE`, only one point is generated for the given id and time, and its reference is set to `-1`.
Example:

```r
# Returns the level of glucose for each hour in year 2016 for ids 2 and 5
stime <- emr_date2time(1, 1, 2016, 0)
etime <- emr_date2time(31, 12, 2016, 23)
emr_extract("glucose", iterator = data.frame(id = c(2, 5)), stime = stime, etime = etime)
```
The Time intervals iterator generates points for all the ids that appear in the 'patients.dob' track, with times taken from a time intervals table (see: Appendix). Each time starts at the beginning of the time interval and runs to the end of it with a step of 1. That being said, points that lie outside of the `[stime, etime]` range are skipped.

If `keepref=TRUE`, for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE`, only one point is generated for the given id and time, and its reference is set to `-1`.
Example:

```r
# Returns the level of hangover for all patients the next day after New Year's Eve
# for the years 2015 and 2016
stime1 <- emr_date2time(1, 1, 2015, 0)
etime1 <- emr_date2time(1, 1, 2015, 23)
stime2 <- emr_date2time(1, 1, 2016, 0)
etime2 <- emr_date2time(1, 1, 2016, 23)
emr_extract("alcohol_level_track", iterator = data.frame(stime = c(stime1, stime2),
                                                         etime = c(etime1, etime2)))
```
The Id-Time intervals iterator generates, for each id, points that cover the `['stime', 'etime']` time range as specified in an id-time intervals table (see: Appendix). Each time starts at the beginning of the time interval and runs to the end of it with a step of 1. That being said, points that lie outside of the `[stime, etime]` range are skipped.

If `keepref=TRUE`, for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE`, only one point is generated for the given id and time, and its reference is set to `-1`.
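A plausible sketch, assuming the id-time intervals table has 'id', 'stime' and 'etime' columns (see the Appendix for the exact format):

```r
# Glucose levels of patient 2 during one week of 2015 and of patient 5 during one week of 2016
intervals <- data.frame(id    = c(2, 5),
                        stime = c(emr_date2time(1, 1, 2015, 0), emr_date2time(1, 1, 2016, 0)),
                        etime = c(emr_date2time(7, 1, 2015, 23), emr_date2time(7, 1, 2016, 23)))
emr_extract("glucose_track", iterator = intervals)
```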
The Beat Iterator generates a "time beat" with the given period for each id that appears in the 'patients.dob' track. The period is always given in hours.
Example:

```r
emr_extract("glucose_track", iterator = 10, stime = 1000, etime = 2000)
```
This will create a beat iterator with a period of 10 hours, starting at `stime` and running until `etime` is reached. If, for example, `stime` equals `1000`, then the beat iterator will create for each id iterator points at times: 1000, 1010, 1020, …
If `keepref=TRUE`, for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE`, only one point is generated for the given id and time, and its reference is set to `-1`.
The Extended beat iterator is, as its name suggests, a variation on the beat iterator. It works by the same principle of creating time points with the given period; however, instead of basing the time count on `stime`, it accepts an additional parameter, a track or an Id-Time Points table, that determines the initial time point for each of the ids. The two parameters (period and mapping) should come in a list. Each id is required to appear only once, and if a certain id does not appear at all, it is skipped by the iterator. In any case, points that lie outside of the `[stime, etime]` range are not generated.
Example:

```r
# Returns the maximal weight of patients in a one-year span starting from their birthdays
emr_vtrack.create("weight", "weight_track", func = "max", time.shift = c(0, year()))
emr_extract("weight", iterator = list(year(), "birthday_track"), stime = 1000, etime = 2000)
```
The periodic iterator goes over every year/month. You can use it by running `emr_monthly_iterator` or `emr_yearly_iterator`.

Example:

```r
iter <- emr_yearly_iterator(emr_date2time(1, 1, 2002), emr_date2time(1, 1, 2017))
emr_extract("dense_track", iterator = iter, stime = 1, etime = 3)

iter <- emr_monthly_iterator(emr_date2time(1, 1, 2002), n = 15)
emr_extract("dense_track", iterator = iter, stime = 1, etime = 3)
```
The iterator is set implicitly if its value remains `NULL` (which is the default). In that case the track expression is analyzed and searched for track names. If all the track variables or virtual track variables point to the same track, this track is used as the source for a track iterator. If more than one track appears in the track expression, an error message is printed, notifying about the ambiguity.
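For example, with the hypothetical glucose track used earlier:

```r
# No iterator given: the points of glucose_track itself are used as the iterator
emr_extract("glucose_track * 2")
```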
During the evaluation of a track expression one can access a specially defined variable named `EMR_TIME` (Python: `TIME`). This variable contains a vector (`numpy.ndarray` in Python) of the current iterator times. The length of the vector matches the length of the track variable (which is a vector too).

Note that some values in `EMR_TIME` might be set to 0. Skip those intervals and the values of the track variables at the corresponding indices.
```r
# Returns times of the current iterator as a day of month
emr_extract("emr_time2dayofmonth(EMR_TIME)", iterator = "sparse_track")
```
A filter is used to approve / reject an ID-Time point. It can be applied to an iterator, in which case the iterator points are required to be approved by the filter before they are passed further to the track expression. A filter may also be used by a virtual track; in this case the virtual track function (see the `func` parameter of `emr_vtrack.create`) is applied only to the points from the source track (the `src` parameter) that pass the filter.

A filter has the form of a logical expression consisting of named or unnamed elementary filters (the "building bricks" of the filter) connected with the logical operators `&`, `|`, `!` (`and`, `or` and `not` in Python) and brackets `()`.
Suppose we are interested in hemoglobin levels of patients who were prescribed either drugX or drugY, but not drugZ, within a time window of one week before the test. Assume that drugX, drugY and drugZ each reside in a separate track. Without filters we would need to call `emr_extract` four times, store potentially huge data frame results in memory and finally merge the tables within R while taking care of the time windows. With filters we can do it much more easily:

```r
emr_filter.create("filterX", "drugX", time.shift = c(week(), 0))
emr_filter.create("filterY", "drugY", time.shift = c(week(), 0))
emr_filter.create("filterZ", "drugZ", time.shift = c(week(), 0))
emr_extract("hemoglobin", filter = "(filterX | filterY) & !filterZ")
```
We can further expand the example above by specifying the 'operator' argument on filter creation. Suppose we wish to extract the same information as before, but this time we are interested only in patients who have a hemoglobin level above 16 (in addition to our drug treatment requirements). Under the same assumptions as in the previous example, our code would look like:

```r
emr_filter.create("filterX", "drugX", time.shift = c(week(), 0))
emr_filter.create("filterY", "drugY", time.shift = c(week(), 0))
emr_filter.create("filterZ", "drugZ", time.shift = c(week(), 0))
emr_filter.create("hemoglobin_gt_16", "hemoglobin", val = 16, operator = ">")
emr_extract("hemoglobin", filter = "(filterX | filterY) & !filterZ & hemoglobin_gt_16")
```
Python

A filter with logical conditions uses Python's notation (`and`, `or`, `not`) instead of R's `&`, `|`, `!`.
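For instance, the R filter from the example above would, as a sketch, be written as:

```python
# Same filter as in the R example, in Python notation
"(filterX or filterY) and not filterZ and hemoglobin_gt_16"
```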
Each call to `emr_filter.create` creates a named elementary filter (or simply: a named filter) with a unique name. The named filter can then be used in the `filter` parameter of an iterator and be combined with other named filters using the logical operators.

In our previous example we created three named filters based on three tracks. If a time window were not required, we could have used the names of the tracks directly in the filter, like: `filter = "(drugX | drugY) & !drugZ"`.
In addition to track names, other types of objects can be used within the filter. These are: Id-Time Points Table, Ids Table, Time Intervals Table and Id-Time Intervals Table (see Appendix for the format of these tables). When used in a filter, the object should be constructed in advance and referred to by its name. "In place" construction (i.e. `filter = "data.frame(...)"`) is not allowed.
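For example (a sketch; the object name and ids are made up):

```r
# Construct the Ids Table in advance and refer to it by name inside the filter
high_risk_ids <- data.frame(id = c(2, 5, 12))
emr_extract("hemoglobin", filter = "high_risk_ids")
```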
An ID-Time point embeds within itself a reference value. Named filters allow specifying whether the reference should be used for matching or not. When `keepref=TRUE` is set within `emr_filter.create`, the candidate point's reference is matched against the filter's reference. Otherwise the references are ignored.

It is important to remember that references are always ignored when any object other than a named filter is used within a filter. For instance, if `filter = "drug"` and `drug` is the name of a track (and not the name of a named filter), then the references will be ignored during the matching. To ensure the filter matches the references of the `drug` track, one must define a named filter with the `keepref=TRUE` parameter:

```r
emr_filter.create("drug_filter", "drug", keepref = TRUE)
emr_extract(my.track.expression, filter = "drug_filter", keepref = TRUE)
```
Various functions in the library, such as `emr_quantiles`, make use of a pseudo-random number generator. Each time such a function is invoked, a unique series of random numbers is issued; hence two identical calls might produce different results. To guarantee reproducible results, call `set.seed` (Python: `seed`) before invoking the function.

Note: the R and Python implementations of Naryn use different pseudo-random number generator algorithms. Sadly this means that a result achieved in R cannot be reproduced in Python if randomness is used, even if an identical seed is shared between the two platforms.
To boost run-time performance, various functions in the library support a multitasking mode, i.e. parallel computation by several concurrent processes. Multitasking is not invoked immediately: approximately 0.3 seconds after the function launch the actual progress is measured and the total run-time is estimated. If the estimated run-time exceeds a limit (currently: 2 seconds), multitasking kicks in.

The number of processes launched in multitasking mode depends on the total run-time estimation (a longer run-time will use more processes) and on the values of the `emr_min.processes` and `emr_max.processes` R options. In any case the number of processes never exceeds the number of CPU cores available.
Multitasking can significantly boost the performance; however, it utilizes more CPU. When CPU utilization is the priority, it is advisable to switch off multitasking by setting the `emr_multitasking` R option to `FALSE`.
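For example:

```r
# Disable parallel computation for subsequent Naryn calls
options(emr_multitasking = FALSE)
```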
In addition to increased CPU usage, multitasking might also alter the behavior of functions that return ID-Time points, such as `emr_extract` and `emr_screen`. When multitasking is not invoked, these functions always return the results sorted by id, time and reference. In multitasking mode, however, the result might come out unsorted; moreover, subsequent calls might return results reshuffled differently. One might use the `sort` parameter in these functions to ensure the points come out sorted. Please bear in mind that sorting the results takes its toll, especially on particularly large data frames. That is why `sort` is set to `FALSE` by default.
|  | R | Python |
|---|---|---|
| Naming conventions (except for virtual track `func`, which stays unchanged) | `emr_xxx.yyy.zzz` | `xxx_yyy_zzz` |
| Variables | Defined in the `.naryn` environment: `EMR_GROOT`, `EMR_UROOT` | Defined in the module's environment: `_GROOT`, `_UROOT` |
| Run-time variables (available only during track expression evaluation) | `EMR_TIME` | `TIME` |
| Package / module options | Controlled via the standard options mechanism: `options(emr_xxx.yyy=zzz)`, `getOption("emr_xxx.yyy")` | Controlled by the module's `CONFIG` variable: `CONFIG['xxx_yyy']=zzz`, `CONFIG['xxx_yyy']` |
| Data types (used as function parameters) | `data.frame`, `list`, vector of strings, vector of numerics, `NULL` | `pandas.DataFrame`, `list`, list of strings, `numpy.ndarray` of numerics, `None` |
| Data types (return value) | `data.frame`, `list`, vector of strings, vector of numerics (no labels), vector of numerics (with labels), `NULL` | `pandas.DataFrame`, `dict`, `numpy.ndarray` of objects (strings), `numpy.ndarray` of numerics, `pandas.DataFrame` with two columns (label, numeric), `None` |
| Database management | Database is unloaded when the package is detached. | `db_unload()` must be called explicitly to unload the database. |
| Setting the seed of the random number generator (note: R and Python use different random generators, results are therefore not reproducible between them) | `set.seed` | `seed` |
| Track variables | Variables saved in Python are not visible in R. | Variables saved in R are not visible in Python. |
| Setting track variables | `emr_track.set` creates a directory named `.trackname.var` | `track_set` creates a directory named `.trackname.pyvar` |
| Named filters and virtual tracks | Named filters and virtual tracks may be saved along with the rest of R's environment. | `filter_export`, `filter_import`, `vtrack_export`, `vtrack_import` must be explicitly called to save / restore named filters or virtual tracks. |
| Pattern matching | `emr_track.ls`, `emr_track.global.ls`, `emr_track.user.ls`, `emr_track.var.ls`, `emr_filter.ls` accept pattern matching parameters. Return: vector of strings that match the pattern. | `track_ls`, `track_global_ls`, `track_user_ls`, `track_var_ls`, `filter_ls` do not support pattern matching. Return: `numpy.ndarray` of objects (strings) that contains all the objects (tracks, …). |
| Time shift parameter (used in various functions) | `time.shift` is a numeric or a vector of two numerics. | `time_shift` is a numeric or a list of two numerics. |
| Calculating distribution | `emr_dist` returns an N-dimensional vector with labels (dimension names). | `dist` returns an N-dimensional `numpy.ndarray` without labels. |
| Calculating correlation statistics | `emr_cor`: for N-dimensional binning the returned value `r` may be addressed as `r$cor[bin1,...,binN,i,j]`, where `i` and `j` are indices of `cor.exprs`. | `cor`: for N-dimensional binning the returned value `r` may be addressed as `r['cor'][bin1,...,binN,i,j]`, where `i` and `j` are indices of `cor_exprs`. |
| Others | `emr_annotate` | Not implemented; use `pandas.DataFrame.merge` or `pandas.merge_sorted` instead. |
Naryn supports the following options. The options can be set/examined via R's `options` and `getOption`. (Use `CONFIG['option_name']` to control the module options in Python. Please also mind Python's naming convention: R's `emr_xxx.yyy` option changes its name to `xxx_yyy`.)
| Option | Default Value | Description |
|---|---|---|
| `emr_multitasking` | `TRUE` | Should multitasking be allowed? |
| `emr_min.processes` | `8` | Minimal number of processes launched when multitasking is invoked. |
| `emr_max.processes` | `20` | Maximal number of processes launched when multitasking is invoked. |
| `emr_max.data.size` | `10000000` | Maximal size of data sets (rows of a data frame, length of a vector, …) stored in memory. Prevents excessive memory usage. |
| `emr_eval.buf.size` | `1000` | Size of the track expression evaluation buffer. |
| `emr_warning.itr.no.filter.size` | `100000` | Threshold above which the "beat iterator used without filter" warning is issued. |
An Id-Time Points table is a data frame whose first two columns are named 'id' and 'time'. References may be specified by a third column named 'ref'. If the 'ref' column is missing or named differently, references are set to `-1`. Additional columns, if present, are ignored.

An Id-Time Values table is an extension of the Id-Time Points table with an additional column named 'value'. Additional columns, if present, are ignored.

An Ids table is a data frame whose first column is named 'id'. Each id must appear only once. Additional columns of the data frame, if present, are ignored.
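For illustration, minimal instances of these tables could look like this (ids and times are made up):

```r
# Id-Time Points table: 'id' and 'time' columns, with an optional 'ref' column
points <- data.frame(id = c(2, 5), time = c(403200, 510000), ref = c(-1, -1))

# Id-Time Values table: the same plus a 'value' column
values <- data.frame(id = c(2, 5), time = c(403200, 510000), ref = c(-1, -1), value = c(13.9, 15.2))

# Ids table: a single 'id' column, each id appearing only once
ids <- data.frame(id = c(2, 5, 12))
```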