Calculates correlation statistics for pairs of track expressions

Calculates correlation statistics for pairs of track expressions.

emr_cor(
  ...,
  cor.exprs = NULL,
  include.lowest = FALSE,
  right = TRUE,
  stime = NULL,
  etime = NULL,
  iterator = NULL,
  keepref = FALSE,
  filter = NULL,
  dataframe = FALSE,
  names = NULL
)

Arguments

...: pairs of [factor.expr, breaks], where factor.expr is a track expression and breaks is the vector of break points that determine the bins, or 'NULL'.
cor.exprs: vector of track expressions for which correlation statistics are calculated.
include.lowest: if 'TRUE', the lowest (or highest, for 'right = FALSE') value of the range determined by breaks is included.
right: if 'TRUE' the intervals are closed on the right (and open on the left), otherwise vice versa.
stime: start time scope.
etime: end time scope.
iterator: track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. See also 'iterator' section.
keepref: If 'TRUE' references are preserved in the iterator.
filter: Iterator filter.
dataframe: return a data frame instead of an N-dimensional vector.
names: names for track expressions in the returned dataframe (only relevant when dataframe == TRUE)
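
The 'include.lowest' and 'right' arguments are described in the same terms as the arguments of the same name in base R's cut(). As a rough sketch under that assumption (this page does not state that emr_cor calls cut() internally), the base R snippet below shows how the break vector c(0, 2, 5) used in the Examples section partitions values. The sample values in x are made up for illustration.

```r
# Hypothetical sample values; breaks are the ones used in the Examples section
x <- c(0, 1, 2, 3, 5)
breaks <- c(0, 2, 5)

# right = TRUE: bins are (a, b]; include.lowest = TRUE also closes the first bin at 0
closed_right <- cut(x, breaks, include.lowest = TRUE, right = TRUE)
levels(closed_right) # "[0,2]" "(2,5]"

# include.lowest = FALSE: the value 0 falls outside (0, 2] and becomes NA
open_lowest <- cut(x, breaks, include.lowest = FALSE, right = TRUE)

# right = FALSE: bins are [a, b); include.lowest = TRUE closes the last bin at 5
closed_left <- cut(x, breaks, include.lowest = TRUE, right = FALSE)
levels(closed_left) # "[0,2)" "[2,5]"
```

The bin labels produced with include.lowest = TRUE and right = TRUE match the row names "[0,2]" and "(2,5]" printed in the Examples section.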

Value

A list of 5 elements, each containing an N-dimensional vector (N is the number of 'expr'-'breaks' pairs). Each member of a vector is a specific statistics matrix. If 'dataframe = TRUE', a data frame with a column for each track expression, additional columns 'i' and 'j' identifying the pair of 'cor.exprs', and another 5 columns: 'n', 'e', 'var', 'cov', 'cor'; see Details.

Details

This function works in a similar manner to 'emr_dist'. However, instead of returning a single counter for each bin, 'emr_cor' returns 5 matrices of size 'length(cor.exprs) X length(cor.exprs)'. Each matrix holds one correlation statistic for every pair of track expressions from 'cor.exprs'. Given a 'bin' and a pair of track expressions 'cor.exprs[i]' and 'cor.exprs[j]', the matrices contain the following information:

$n[bin,i,j] - number of times when both 'cor.exprs[i]' and 'cor.exprs[j]' exist
$e[bin,i,j] - expectation (average) of values from 'cor.exprs[i]' when 'cor.exprs[j]' exists
$var[bin,i,j] - variance of values from 'cor.exprs[i]' when 'cor.exprs[j]' exists
$cov[bin,i,j] - covariance of 'cor.exprs[i]' and 'cor.exprs[j]'
$cor[bin,i,j] - correlation of 'cor.exprs[i]' and 'cor.exprs[j]'

Like 'emr_dist', 'emr_cor' supports multi-dimensional binning. With N-dimensional binning, individual values in the matrices can be accessed as $cor[bin1, ..., binN, i, j].
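
To make the [bin, i, j] indexing concrete, here is a small base R sketch that rebuilds the '$cor' array printed in the Examples section by hand and reads single entries from it. The array is typed in from that printed output rather than produced by emr_cor, and indexing by name assumes the dimension names are the ones shown there.

```r
# Labels as printed in the Examples section
bins <- c("[0,2]", "(2,5]")
exprs <- c("sparse_track", "1/dense_track")

# Values copied from the printed $cor output; R fills arrays with the
# first index (bin) varying fastest, then i, then j
cor_arr <- array(
  c(1, 1, -0.9995979, -1, # j = sparse_track
    -0.9995979, -1, 1, 1), # j = 1/dense_track
  dim = c(2, 2, 2),
  dimnames = list(bins, exprs, exprs)
)

# One entry: bin "[0,2]", i = sparse_track, j = 1/dense_track
by_name <- cor_arr["[0,2]", "sparse_track", "1/dense_track"]

# The same entry by position
by_position <- cor_arr[1, 1, 2]

# The full i x j matrix for the second bin
second_bin <- cor_arr["(2,5]", , ]
```

With a real result, the same indexing would apply to each of the five list members, for example res$n or res$cov, where res is a hypothetical name for the value returned by emr_cor.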

If 'dataframe = TRUE', the return value is a data frame with a column for each track expression, additional columns 'i' and 'j' identifying the pair of 'cor.exprs', and another 5 columns: 'n', 'e', 'var', 'cov', 'cor', holding the same values as the matrices described above.
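
The sketch below shows one way to work with the data frame form. It types in the four "[0,2]" rows from the 'dataframe = TRUE' output in the Examples section (it does not call emr_cor) and checks that the printed 'cor' value is consistent with cov / sqrt(var[i,j] * var[j,i]). That relationship is inferred from the printed numbers and is an assumption, not a statement about how emr_cor computes it. The helper pick() is hypothetical.

```r
# Rows for bin "[0,2]" copied from the printed data frame in the Examples section
df <- data.frame(
  categorical_track = "[0,2]",
  i = c("sparse_track", "1/dense_track", "sparse_track", "1/dense_track"),
  j = c("sparse_track", "sparse_track", "1/dense_track", "1/dense_track"),
  n = c(3, 3, 3, 4),
  var = c(338.6667, 1.876685e-04, 338.6667, 6.253773e-04),
  cov = c(338.6666666667, -0.2520039101, -0.2520039101, 0.0006253773),
  cor = c(1, -0.9995979, -0.9995979, 1)
)

# Hypothetical helper: select the row for one (i, j) pair
pick <- function(d, i, j) d[d$i == i & d$j == j, ]

a <- pick(df, "sparse_track", "1/dense_track") # var of sparse_track given 1/dense_track
b <- pick(df, "1/dense_track", "sparse_track") # var of 1/dense_track given sparse_track

# Recompute the correlation from the covariance and the two conditional variances
recomputed <- a$cov / sqrt(a$var * b$var)

# Compare with the printed value, allowing for rounding in the printed output
agrees <- abs(recomputed - a$cor) < 1e-5
```

Selecting by the 'i' and 'j' columns, rather than by row position, keeps the lookup valid if the row order differs.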

iterator

There are a few types of iterators:

Track iterator:: Track iterator returns the points (including the reference) from the specified track. Track name is specified as a string. If `keepref=FALSE` the reference of each point is set to `-1`.
Example:

# Returns the level of glucose one hour after the insulin shot was made
emr_vtrack.create("glucose", "glucose_track", func="avg", time.shift=1)
emr_extract("glucose", iterator="insulin_shot_track")
Id-Time Points Iterator:: Id-Time points iterator generates points from an *id-time points table*. If `keepref=FALSE` the reference of each point is set to `-1`.
Example:

# Returns the level of glucose one hour after the insulin shot was made
emr_vtrack.create("glucose", "glucose_track", func = "avg", time.shift = 1)
r <- emr_extract("insulin_shot_track") # <-- implicit iterator is used here
emr_extract("glucose", iterator = r)
Ids Iterator:: Ids iterator generates points with ids taken from an *ids table* and times that run from `stime` to `etime` with a step of 1. If `keepref=TRUE` for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE` only one point is generated for the given id and time, and its reference is set to `-1`.
Example:

stime <- emr_date2time(1, 1, 2016, 0)
etime <- emr_date2time(31, 12, 2016, 23)
emr_extract("glucose", iterator = data.frame(id = c(2, 5)), stime = stime, etime = etime)
Time Intervals Iterator:: *Time intervals iterator* generates points for all the ids that appear in the 'patients.dob' track with times taken from a *time intervals table* (see: Appendix). Each time starts at the beginning of the time interval and runs to the end of it with a step of 1. However, points that lie outside the `[stime, etime]` range are skipped.
If `keepref=TRUE` for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE` only one point is generated for the given id and time, and its reference is set to `-1`.
Example:
# Returns the level of hangover for all patients on the day after New Year's Eve for the years 2015 and 2016
stime1 <- emr_date2time(1, 1, 2015, 0)
etime1 <- emr_date2time(1, 1, 2015, 23)
stime2 <- emr_date2time(1, 1, 2016, 0)
etime2 <- emr_date2time(1, 1, 2016, 23)
emr_extract("alcohol_level_track", iterator = data.frame(
stime = c(stime1, stime2),
etime = c(etime1, etime2)
))
Id-Time Intervals Iterator:: *Id-Time intervals iterator* generates, for each id, points that cover the `['stime', 'etime']` time range specified in the *id-time intervals table* (see: Appendix). Each time starts at the beginning of the time interval and runs to the end of it with a step of 1. However, points that lie outside the `[stime, etime]` range are skipped.
If `keepref=TRUE` for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE` only one point is generated for the given id and time, and its reference is set to `-1`.
Beat Iterator:: *Beat Iterator* generates a "time beat" at the given period for each id that appears in the 'patients.dob' track. The period is always given in hours.
Example:
emr_extract("glucose_track", iterator=10, stime=1000, etime=2000)
This will create a beat iterator with a period of 10 hours starting at `stime` up until `etime` is reached. If, for example, `stime` equals `1000` then the beat iterator will create for each id iterator points at times: 1000, 1010, 1020, ...
If `keepref=TRUE` for each id-time pair the iterator generates 255 points with references running from `0` to `254`. If `keepref=FALSE` only one point is generated for the given id and time, and its reference is set to `-1`.
Extended Beat Iterator:: *Extended beat iterator* is, as its name suggests, a variation on the beat iterator. It works on the same principle of creating time points with the given period. However, instead of counting times from `stime`, it accepts an additional parameter - a track or an *Id-Time Points table* - that specifies the initial time point for each of the ids. The two parameters (period and mapping) should be passed in a list. Each id is required to appear only once, and if a certain id does not appear at all, it is skipped by the iterator.
In any case, points that lie outside the `[stime, etime]` range are not generated.
Example:
# Returns the maximal weight of patients over a one-year span starting from their birthdays
emr_vtrack.create("weight", "weight_track", func = "max", time.shift = c(0, year()))
emr_extract("weight", iterator = list(year(), "birthday_track"), stime = 1000, etime = 2000)
Periodic Iterator:: Periodic iterator goes over every year or month. You can create one with `emr_monthly_iterator` or `emr_yearly_iterator`.
Example:
iter <- emr_yearly_iterator(emr_date2time(1, 1, 2002), emr_date2time(1, 1, 2017))
emr_extract("dense_track", iterator = iter, stime = 1, etime = 3)
iter <- emr_monthly_iterator(emr_date2time(1, 1, 2002), n = 15)
emr_extract("dense_track", iterator = iter, stime = 1, etime = 3)
Implicit Iterator:: The iterator is set implicitly if its value remains `NULL` (which is the default). In that case the track expression is analyzed and searched for track names. If all the track variables or virtual track variables point to the same track, this track is used as a source for a track iterator. If more than one track appears in the track expression, an error message is printed reporting the ambiguity.

Revealing Current Iterator Time: During the evaluation of a track expression one can access a specially defined variable named `EMR_TIME` (Python: `TIME`). This variable contains a vector (`numpy.ndarray` in Python) of current iterator times. The length of the vector matches the length of the track variable (which is a vector too).
Note that some values in `EMR_TIME` might be set to 0. Skip those intervals and the values of the track variables at the corresponding indices.
# Returns times of the current iterator as a day of month
emr_extract("emr_time2dayofmonth(EMR_TIME)", iterator = "sparse_track")

Examples


emr_db.init_examples()
#> NULL
emr_cor("categorical_track", c(0, 2, 5),
    cor.exprs = c("sparse_track", "1/dense_track"),
    include.lowest = TRUE, iterator = "categorical_track",
    keepref = TRUE
)
#> $n
#> , , sparse_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]            3             3
#> (2,5]            2             2
#> 
#> , , 1/dense_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]            3             4
#> (2,5]            2             2
#> 
#> 
#> $e
#> , , sparse_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]           48    0.02608341
#> (2,5]          130    0.02136752
#> 
#> , , 1/dense_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]           48    0.03879333
#> (2,5]          130    0.02136752
#> 
#> 
#> $var
#> , , sparse_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]     338.6667  0.0001876685
#> (2,5]   10816.0000  0.0002922054
#> 
#> , , 1/dense_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]     338.6667  0.0006253773
#> (2,5]   10816.0000  0.0002922054
#> 
#> 
#> $cov
#> , , sparse_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]     338.6667    -0.2520039
#> (2,5]   10816.0000    -1.7777778
#> 
#> , , 1/dense_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]   -0.2520039  0.0006253773
#> (2,5]   -1.7777778  0.0002922054
#> 
#> 
#> $cor
#> , , sparse_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]            1    -0.9995979
#> (2,5]            1    -1.0000000
#> 
#> , , 1/dense_track
#> 
#>       sparse_track 1/dense_track
#> [0,2]   -0.9995979             1
#> (2,5]   -1.0000000             1
#> 
#> 
#> attr(,"breaks")
#> attr(,"breaks")[[1]]
#> [1] 0 2 5
#> 
emr_cor("categorical_track", c(0, 2, 5),
    cor.exprs = c("sparse_track", "1/dense_track"),
    include.lowest = TRUE, iterator = "categorical_track",
    keepref = TRUE,
    dataframe = TRUE
)
#>   categorical_track             i             j n            e          var
#> 1             [0,2]  sparse_track  sparse_track 3  48.00000000 3.386667e+02
#> 2             (2,5]  sparse_track  sparse_track 2 130.00000000 1.081600e+04
#> 3             [0,2] 1/dense_track  sparse_track 3   0.02608341 1.876685e-04
#> 4             (2,5] 1/dense_track  sparse_track 2   0.02136752 2.922054e-04
#> 5             [0,2]  sparse_track 1/dense_track 3  48.00000000 3.386667e+02
#> 6             (2,5]  sparse_track 1/dense_track 2 130.00000000 1.081600e+04
#> 7             [0,2] 1/dense_track 1/dense_track 4   0.03879333 6.253773e-04
#> 8             (2,5] 1/dense_track 1/dense_track 2   0.02136752 2.922054e-04
#>                cov        cor
#> 1   338.6666666667  1.0000000
#> 2 10816.0000000000  1.0000000
#> 3    -0.2520039101 -0.9995979
#> 4    -1.7777777778 -1.0000000
#> 5    -0.2520039101 -0.9995979
#> 6    -1.7777777778 -1.0000000
#> 7     0.0006253773  1.0000000
#> 8     0.0002922054  1.0000000