library(naryn)
#> 
#> Attaching package: 'naryn'
#> The following object is masked from 'package:base':
#> 
#>     months
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
theme_set(theme_classic())

Getting started with naryn

Naryn is an implementation of a time-series database, for efficient storage, retrieval and analysis of electronic health records (EHR).

Download the example database

Towards this vignette we are going to use a small database which was simulated to include an example of a typical EMR database. It can be downloaded from here or using the following code:

Note that although smaller than the real database, this example database is still quite large (~1.2GB) and will take a few minutes to download.

Tracks

The basic element of Naryn is a track - a single numerical data element (e.g. RBC lab test result) that is recorded for many patients at various time points. A track can be thought of as a very sparse two-dimensional matrix with a row for each patient in the database, and a column for each timepoint in the resolution of hours. Another way to think of a track is as a table with triplets of patient, timepoint and value:

id time lab.RBC
1 1218708 4.784053
1 1223571 4.534783
1 1227064 4.114022
2 1230857 4.588766
2 1272632 4.507858
3 1218461 4.620333
3 1242600 5.412831
3 1246149 5.555339

The value can be any numerical value, but it is usually a lab test result, or categorical variable representing a diagnosis. The time is a number representing the number of hours since 1/3/1867 00:00, and the id is a unique identifier for each patient which is defined in a special track called ‘patients.dob’, which contains the time of birth for each patient (see below).

For more information on tracks, see the ‘Tracks’ section in the advanced vignette.

Connect to database

A naryn database is a folder containing a number of tracks. To connect to a database, use the emr_db.connect function:

emr_db.connect("sample_db")

Now we can use emr_track.ls to list all the tracks in the database:

head(emr_track.ls())
#> [1] "dx.icd9_000" "dx.icd9_001" "dx.icd9_002" "dx.icd9_003" "dx.icd9_004"
#> [6] "dx.icd9_005"

# show number of available tracks in database
length(emr_track.ls())
#> [1] 2995

We can also use emr_track.ls to list all the tracks that match a certain pattern:

emr_track.ls("RBC")
#> [1] "lab.RBC"      "lab.RBC.URIN"

Note: Naryn supports connecting to multiple databases at the same time, by giving a vector of paths to emr_db.connect, see more in the ‘Database’ section of the advanced vignette.

Extract data from tracks

We can now go back to the ‘RBC’ example and extract the data from the track. This can be done using the emr_extract() function which is the ‘Swiss army knife’ of the package. It can be used to extract data from a single track, or from multiple tracks, while applying various filters and transformations, but we will start with the simplest example of extracting the data from a single track:

rbc_df <- emr_extract("lab.RBC")
head(rbc_df)
#>   id    time ref  lab.RBC
#> 1  1 1218708  -1 4.784053
#> 2  1 1223571  -1 4.534783
#> 3  1 1227064  -1 4.114022
#> 4  1 1227864  -1 4.002392
#> 5  1 1232022  -1 4.384029
#> 6  1 1234887  -1 5.053118

We can see that the data is returned as a data frame with four columns: id, time, ref and value. The id column contains the patient id, the time column contains the time of the measurement in hours since 1/3/1867 00:00, and the value column contains the Red Blood Cell count of the patient at each timepoint. For information regarding the ref column, see the ‘Records and References’ section in the advanced vignette.

The data frame is sorted by id and time, so that the data for each patient is consecutive.

Track expression

The first argument of emr_extract was, in the example above, a string representing the name of the track. However, it can also be a track expression, which is a string that can contain functions that will be applied to the track after extracting the data. For example, we can extract the lab value multiplied by two:

rbc_df <- emr_extract("lab.RBC * 2")
head(rbc_df)
#>   id    time ref lab.RBC * 2
#> 1  1 1218708  -1    9.568107
#> 2  1 1223571  -1    9.069567
#> 3  1 1227064  -1    8.228045
#> 4  1 1227864  -1    8.004785
#> 5  1 1232022  -1    8.768059
#> 6  1 1234887  -1   10.106236

Note that the functions applied to the track in the track expression should be functions that can be applied to a vector of values, and should return a vector of the same length.

Changing time to date

We can transform the time to year, month day and hour using the emr_time2date() function:

rbc_df %>%
    mutate(emr_time2date(time)) %>%
    head()
#>   id    time ref lab.RBC * 2 year month day hour
#> 1  1 1218708  -1    9.568107 2006     3  11   12
#> 2  1 1223571  -1    9.069567 2006     9  30    3
#> 3  1 1227064  -1    8.228045 2007     2  22   16
#> 4  1 1227864  -1    8.004785 2007     3  28    0
#> 5  1 1232022  -1    8.768059 2007     9  17    6
#> 6  1 1234887  -1   10.106236 2008     1  14   15

Iterators and virtual tracks

In the previous example we extracted the data from a single track, what happens if we want to extract data from multiple tracks at once? For example, we would want to extract the RBC and WBC (White Blood cell Count) of patients which had a heart disease (ICD9 code 411), 5 years before their diagnosis. In order to achieve this we need to introduce the concept of iterators.

An iterator is a set of points in the patient-time space1 that defines the way in which naryn traverses the database. In the RBC example, the iterator was simply the set of all patient-time points that were included in the track, but now - we would want our point-of-view to be the heart disease diagnosis so we would set the iterator to the track of ICD code 411:

heart_blood <- emr_extract(c("lab.RBC", "lab.WBC"), iterator = "dx.icd9_411")
head(heart_blood)
#>   id    time ref lab.RBC lab.WBC
#> 1  1 1220308  -1     NaN     NaN
#> 2  1 1220947  -1     NaN     NaN
#> 3  1 1241362  -1     NaN     NaN
#> 4  4 1280579  -1     NaN     NaN
#> 5  4 1329136  -1     NaN     NaN
#> 6  4 1335742  -1     NaN     NaN
nrow(na.omit(heart_blood))
#> [1] 19

We can see that the “lab.RBC” and “lab.WBC” are all NA. This is because the iterator is set to the “dx.icd9_411” diagnosis, and the RBC and WBC tests were not performed at the exact time the diagnosis was recorded. In order to get the RBC test that was done 5 years prior to the diagnosis, we would have to:

  1. Tell naryn to look back 5 years before the diagnosis
  2. Tell naryn what to do if there is more than one RBC test in the 5 years prior to the diagnosis, for example - take the earliest, or more generally - tell naryn which function to apply to values of the track.

This can be done using a virtual track. A virtual track is a way to tell naryn how to compute the value of a track when at a specific point in the patient-time space. It is created using the emr_virtual_track() function, which tells naryn how to shift the time for a given track (1, time.shift) and which function to apply to the values of the track (2, func). Note that the time shift is always in reference to the iterator. So for example, in our case we would like to look at an RBC / WBC measurement 5 years before the diagnosis. If our iterator is at the time of heart disease diagnosis, we want to look at a time window that starts 5 years before and ends at the time of diagnosis:

emr_vtrack.create("rbc_5y", "lab.RBC", time.shift = c(-years(5), 0), func = "earliest")
emr_vtrack.create("wbc_5y", "lab.WBC", time.shift = c(-years(5), 0), func = "earliest")

Now, we can extract the data from the virtual tracks (instead of the tracks themselves):

heart_blood <- emr_extract(c("rbc_5y", "wbc_5y"), iterator = "dx.icd9_411", names = c("RBC", "WBC"))
head(heart_blood)
#>   id    time ref      RBC      WBC
#> 1  1 1220308  -1 4.784053 8.008108
#> 2  1 1220947  -1 4.784053 8.008108
#> 3  1 1241362  -1 4.784053 8.008108
#> 4  4 1280579  -1 4.617556 5.105508
#> 5  4 1329136  -1 4.421784 6.366830
#> 6  4 1335742  -1 4.619302 5.197200

We would like to also know how long before the diagnosis each test was performed, so we will create an additional virtual track, this time with a function that computes the difference between the time of the diagnosis and the earliest blood test. Also, we would like the time difference to be in the resolution of months, so we will give emr_extract a track expression that divides the time difference by 30 * 24 (the number of hours in a month), which is equivalent to the month function:

emr_vtrack.create("rbc_5y_d", "lab.RBC", time.shift = c(-years(5), 0), func = "dt2.earliest")
emr_vtrack.create("wbc_5y_d", "lab.WBC", time.shift = c(-years(5), 0), func = "dt2.earliest")
heart_blood <- emr_extract(
    c("rbc_5y", "wbc_5y", "rbc_5y_d/month()", "wbc_5y_d/month()"),
    iterator = "dx.icd9_411",
    names = c("RBC", "WBC", "RBC_d", "WBC_d")
)
head(heart_blood)
#>   id    time ref      RBC      WBC     RBC_d     WBC_d
#> 1  1 1220308  -1 4.784053 8.008108  2.222222  2.222222
#> 2  1 1220947  -1 4.784053 8.008108  3.109722  3.109722
#> 3  1 1241362  -1 4.784053 8.008108 31.463889 31.463889
#> 4  4 1280579  -1 4.617556 5.105508 57.434722 57.434722
#> 5  4 1329136  -1 4.421784 6.366830 59.809722 59.809722
#> 6  4 1335742  -1 4.619302 5.197200 52.718056 52.718056

Yay! we got what we wanted, but something is still weird - we can see that some patients (and in this sampled database - most patients) have more than one diagnosis of a heart disease. Many times, this is indeed the case - a patient can be diagnosed with heart disease multiple times, but more commonly in EMR data - the same diagnosis is recorded multiple times, and what we actually want is the earliest diagnosis. In order to achieve that we would have to use filters.

Filters

A filter is a logical condition that is applied to the iterator in order to decide which points to include and which to exclude. In our case we want to include only the earliest diagnosis of heart disease, so we would use the emr_filter.create function to create a filter that would exclude all points that had a diagnosis of heart disease prior to the current point:

emr_filter.create("heart_disease_in_past", "dx.icd9_411", time.shift = c(-years(120), -1))

We will now apply the filter to our query:

heart_blood <- emr_extract(
    c("rbc_5y", "wbc_5y", "rbc_5y_d/month()", "wbc_5y_d/month()"),
    iterator = "dx.icd9_411",
    names = c("RBC", "WBC", "RBC_d", "WBC_d"),
    filter = "!heart_disease_in_past"
)
head(heart_blood)
#>   id    time ref      RBC      WBC     RBC_d     WBC_d
#> 1  1 1220308  -1 4.784053 8.008108  2.222222  2.222222
#> 2  4 1280579  -1 4.617556 5.105508 57.434722 57.434722
#> 3  5 1306345  -1 6.196034 9.277723 57.438889 57.438889
#> 4  6 1259009  -1 4.271791 7.004467 43.415278 43.415278
#> 5 19 1322741  -1 4.385486 7.695645 53.869444 53.869444
#> 6 22 1227666  -1 3.858857 4.801625 13.520833 13.520833

Voila! every patient now has only the earliest diagnosis of heart disease.

Value filters

Filters can not only be used to exclude or include points by the mere existence of a point in the patient-time space (like the previous example), but also by the value of the point. For example, we can create a filter that would include only points where the RBC was abnormal (say, above 6):

emr_filter.create("abnormal_rbc", "rbc_5y", val = 6, operator = ">")
heart_blood_abnormal_rbc <- emr_extract(
    c("rbc_5y", "wbc_5y"),
    iterator = "dx.icd9_411",
    filter = "!heart_disease_in_past & abnormal_rbc",
    names = c("RBC", "WBC")
)
head(heart_blood_abnormal_rbc)
#>     id    time ref      RBC       WBC
#> 1    5 1306345  -1 6.196034  9.277723
#> 2 2230 1320077  -1 7.511171  7.800000
#> 3 3312 1276025  -1 6.453111 11.903167
#> 4 3432 1332812  -1 6.169343  7.750395
#> 5 3529 1269751  -1 6.435419  8.229447
#> 6 4326 1234490  -1 6.006401  9.864933

Another example is to include or exclude based on the value of a categorical track. For example, in order to include only patients that were diagnosed with Postmyocardial infarction syndrome (ICD9 code 411.0) we filter “dx.icd9_411” to include only points with value of “10” (see note below):

emr_filter.create("postmyocardial_infarction", "dx.icd9_411", val = 10)
postmyo_blood <- emr_extract(
    c("rbc_5y", "wbc_5y", "dx.icd9_411"),
    iterator = "dx.icd9_411",
    filter = "!heart_disease_in_past & postmyocardial_infarction",
    names = c("RBC", "WBC", "icd9_411")
)
head(postmyo_blood)
#>     id    time ref      RBC       WBC icd9_411
#> 1  252 1243750  -1 4.303543  5.099193       10
#> 2 1537 1277787  -1 4.400787  5.095808       10
#> 3 3226 1280679  -1 4.061680  4.804326       10
#> 4 3320 1349528  -1 4.672515  5.503540       10
#> 5 3407 1263521  -1 5.311868 10.404092       10
#> 6 3529 1269751  -1 6.435419  8.229447       10

NOTE: since ICD9 diagnosis codes have a tree like structure, and X.0, X.00 are both valid codes and must be distinguishable, the diagnosis tracks all include a prefix of 1 for the minor code, so X.0 will be translated to icd9_X and a value of 10 will be stored instead of 0.

Extract all patients in database

As noted above, all patients must be listed in a track called ‘patients.dob’. This track contains for each patient a single time point at the time of birth. In this mock database, the value per patient reflect their sex: 1 for male, 2 for female:

patients <- emr_extract("patients.dob", names = "sex")
head(patients)
#>   id    time ref sex
#> 1  1  535998  -1   1
#> 2  2 1024040  -1   2
#> 3  3  807917  -1   1
#> 4  4  864871  -1   1
#> 5  5  826123  -1   1
#> 6  6  817547  -1   2
patients %>%
    count(sex) %>%
    mutate(sex = c("male", "female")[sex])
#>      sex     n
#> 1   male 48621
#> 2 female 51379

Define virtual tracks for age and sex

Given what we learned about iterators, virtual tracks and filters, how would we extract the age of a patient at a given timepoint? For example, say that we want to know the age of a patient at the time of their first diagnosis of heart disease. We can do this by creating a virtual track that computes the difference between the time of the diagnosis and the time of birth.

A virtual track needs 4 things - name, source, time shift and function. We should set the name to “age”, the source to “patients.dob” and at each point of the iterator we would like to go backward in time a maximal amount (say - 120 years) and compute the time difference between the point and the birth, so we will set the time.shift to c(-years(120), 0) and the function to dt2.earliest:

emr_vtrack.create("age", "patients.dob", time.shift = c(-years(120), 0), func = "dt2.earliest")

We can now extract the age of the patient at the time of their first diagnosis of heart disease in the resolution of years:

age_at_heart_diag <- emr_extract("age/year()", iterator = "dx.icd9_411", name = "age")
head(age_at_heart_diag)
#>   id    time ref      age
#> 1  1 1220308  -1 78.11758
#> 2  1 1220947  -1 78.19053
#> 3  1 1241362  -1 80.52100
#> 4  4 1280579  -1 47.45525
#> 5  4 1329136  -1 52.99829
#> 6  4 1335742  -1 53.75240

Sex can be defined in a similar way going backward in time and taking the earliest value at the “patients.dob” track:

emr_vtrack.create("sex", "patients.dob", time.shift = c(-years(120), 0), func = "earliest")

Define a virtual track for survival - time until death

Assume that patients.dod is a track with a single entry (at most) for each patient in the EHR database at time of death. Define a virtual track that computes the time until death for each patient in the current iterator timepoint:

emr_vtrack.create("survival", "patients.dod", time.shift = c(0, 120 * 365 * 24), func = "dt1.earliest")

Time range

Until now we learned how to use filters to exclude or include datapoints by their value or existence, but what if we want to include only points that are within a certain time range? For example, say that we want to extract the age and WBC test values of tests that were performed in 2010. We can do this by using the stime and etime arguments of emr_extract which limit the query to points that are between the given start and end times:

wbc_2010 <- emr_extract(c("age/year()", "lab.WBC"), iterator = "lab.WBC", stime = emr_date2time(1, 1, 2010), etime = emr_date2time(1, 1, 2011), names = c("age", "wbc"))
head(wbc_2010)
#>   id    time ref      age       wbc
#> 1  1 1252814  -1 81.82831  8.058747
#> 2  1 1253515  -1 81.90833  7.199924
#> 3  3 1257273  -1 51.29635  6.600398
#> 4  4 1253349  -1 44.34680  8.300706
#> 5  4 1254029  -1 44.42443  7.102493
#> 6  4 1256304  -1 44.68413 11.600000

Computing distributions

emr_extract is not the only function that can be used to extract data from the database. Another function that can be used to extract data is emr_dist, which is used to compute distributions of values of a track. For example, say that we want to compute the distribution of Hemoglobin (HGB) values in the database, stratified by age and sex. One way to do it would be to extract the values of HGB, age and sex, and then compute the distribution using standard R functions. However, this would require loading the entire track into memory, which is not always possible, and moreover - it is less efficient to go over the same data twice. emr_dist can to it in one shot, by accepting pairs of tracks expression and breaks and counting the number of values in each strata:

hgb_dist <- emr_dist(
    "age/year()", c(20, 50, 90),
    "sex", NULL,
    "lab.HGB", seq(10, 16, by = 0.5),
    iterator = "lab.HGB",
    dataframe = TRUE,
    names = c("age", "sex", "hgb"),
    right = FALSE
)
head(hgb_dist)
#>       age sex       hgb    n
#> 1 [20,50)   1 [10,10.5)  266
#> 2 [50,90)   1 [10,10.5) 2948
#> 3 [20,50)   2 [10,10.5) 6930
#> 4 [50,90)   2 [10,10.5) 8402
#> 5 [20,50)   1 [10.5,11)  344
#> 6 [50,90)   1 [10.5,11) 4355

Note that for the categorical track “sex” the breaks were set to NULL which implicitly uses all the possible values in the track.

We can now plot the distributions:

hgb_dist %>%
    mutate(sex = factor(c("male", "female")[sex])) %>%
    group_by(age, sex) %>%
    mutate(p = n / sum(n)) %>%
    ggplot(aes(x = hgb, y = p, colour = age, group = age)) +
    scale_color_manual(values = c("darkred", "blue")) +
    geom_line() +
    facet_wrap(~sex) +
    ylab("Fraction of tests") +
    xlab("HGB") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Additional examples

In the next few sections we will present a few additional examples of how to use naryn for analysis of EMR data. It is recommended to try and implement a solution to each example before looking at the code below it.

Extract survival time for patients with pancreatic cancer

Extract the survival time (time until death) for patients with pancreatic cancer (ICD9 code 157.9):

Define a filter for just Pancreatic cancer in the 157 icd9 code:

emr_filter.create("pancreatic_cancer", "dx.icd9_157", val = 19)

Define a filter for previous pancreatic cancer (sometime in the past):

emr_filter.create("pancreatic_cancer_in_past", "dx.icd9_157", val = 19, time.shift = c(-years(120), -1))

To find the first diagnosis of pancreatic cancer we will go over all 157 diagnosis, filter out those that are not pancreatic cancer and make sure there wasn’t a prior diagnosis of pancreatic cancer:

pancreatic_cancer_survival <- emr_extract("survival", iterator = "dx.icd9_157", filter = "pancreatic_cancer & !pancreatic_cancer_in_past")
nrow(pancreatic_cancer_survival)
#> [1] 25
head(pancreatic_cancer_survival)
#>      id    time ref survival
#> 1 12192 1256866  -1    19910
#> 2 17022 1311645  -1      NaN
#> 3 24217 1309964  -1    14332
#> 4 27971 1280769  -1     3927
#> 5 28135 1263030  -1    23106
#> 6 34252 1341200  -1     9016

Note that NA in survival means that the patient still has not died.

We can now compute the Kaplan-Meier survival curve for pancreatic cancer patients. Censoring is applied to reflect latest update of database (Jan 6, 2022):

pancreatic_cancer_survival <- pancreatic_cancer_survival %>%
    mutate(
        follow_time = ifelse(!is.na(survival), survival, emr_date2time(6, 1, 2022) - time),
        status = ifelse(is.na(survival), 0, 1)
    ) %>%
    mutate(follow_time = follow_time / month())

Fit the survival curve:

library(survminer)
#> Loading required package: ggpubr
library(survival)
#> 
#> Attaching package: 'survival'
#> The following object is masked from 'package:survminer':
#> 
#>     myeloma
fit <- survminer::surv_fit(survival::Surv(follow_time, status) ~ 1, data = pancreatic_cancer_survival)
survminer::ggsurvplot(fit, data = pancreatic_cancer_survival, xlab = "Time (months)")

Extract HGB and age for all patients between the ages 60 and 70

Extract all Hemoglobin tests (HGB), age (in years) and sex for patients between the ages of 60 and 70:

emr_filter.create("age_60_70", "patients.dob", time.shift = c(-years(70), -years(60)))
hgb_60_70 <- emr_extract(c("age/year()", "lab.HGB", "sex"), iterator = "lab.HGB", filter = "age_60_70", names = c("age", "HGB", "sex"))

We can now plot the distribution:

hgb_60_70 %>%
    mutate(sex = factor(c("male", "female")[sex])) %>%
    ggplot(aes(x = HGB, color = sex, group = sex)) +
    geom_density()

Extract patients age at time of diagnosis of heart disease that did not have diabetes in the past

Extract patients age at the time of their first diagnosis of heart disease (diagnosis.411) that did not have diabetes (diagnosis.250) in the past.

This is very similar to the example above, in which we used a filter to exclude patients that had a previous diagnosis of heart disease. We are now going to use a slightly different approach that would first extract a the patient-time points of the heart disease onset and then use an additional query with these points as the iterator to filter out the patients that had diabetes in the past. This approach is useful in cases where we want to use the onset for other purposes, and it is convenient to have it in a separate table.

What would be our iterator? We can iterate the heart disease diagnosis track, and in many cases this would be the right choice. However, sometimes this track can be quite large, for example if it is a very common lab test. In such cases we can iterate over the “patients.dob” track instead, while shifting the time forward by the maximum amount (e.g. 120 years):

# Find onset of heart disease
emr_vtrack.create("heart_onset", "dx.icd9_411", time.shift = c(0, years(120)), func = "earliest.time")
emr_filter.create("has_heart_disease", "dx.icd9_411", time.shift = c(0, years(120)))

# Retreive the earliest time of heart disease for all patients in db that have a heart disease
heart_onset <- emr_extract("heart_onset", iterator = "patients.dob", filter = "has_heart_disease")
head(heart_onset)
#>   id   time ref heart_onset
#> 1  1 535998  -1     1220308
#> 2  4 864871  -1     1280579
#> 3  5 826123  -1     1306345
#> 4  6 817547  -1     1259009
#> 5 19 734365  -1     1322741
#> 6 22 469544  -1     1227666

Now we can use the data frame we created as an iterator:

# filter patients that already have diabetes before heart disease
emr_filter.create("has_diabetes", "dx.icd9_250", time.shift = c(-years(120), 0))
heart_onset_no_diabetes <- emr_extract("age/year()", iterator = heart_onset %>% select(id, time = heart_onset), filter = "!has_diabetes", names = "age")

nrow(heart_onset)
#> [1] 25596
nrow(heart_onset_no_diabetes)
#> [1] 5137

head(heart_onset_no_diabetes)
#>    id    time ref      age
#> 1   6 1259009  -1 50.39521
#> 2  31 1229936  -1 69.97317
#> 3  36 1236550  -1 42.01963
#> 4  66 1237208  -1 71.25753
#> 5 114 1320426  -1 68.11872
#> 6 132 1352258  -1 68.12420

Extract median Hemoglobin for all males between ages 60 to 70

Extract for each male patient the median Hemoglobin value between ages 60 and 70.

Again we would use “patients.dob” as our iterator - this is very efficient as every patient would be examined only once, and define a virtual track with a time shift of 60 to 70 years with a function of median:

emr_vtrack.create("median_hgb", "lab.HGB", time.shift = c(years(60), years(70)), func = "quantile", params = c(0.5))

Define a filter for males only:

emr_filter.create("is_male", "patients.dob", val = 1)

Add another filter for having an HGB test in the relevant ages:

emr_filter.create("has_hgb", "lab.HGB", time.shift = c(years(60), years(70)))

Extract median_hgb for males only:

male_hgb_60_70_q50 <- emr_extract("median_hgb", iterator = "patients.dob", filter = "is_male & has_hgb")

Compare with female median hgb:

female_hgb_60_70_q50 <- emr_extract("median_hgb", iterator = "patients.dob", filter = "!is_male & has_hgb")
plot(density(male_hgb_60_70_q50$median_hgb, na.rm = TRUE), col = "blue", main = "median HGB in ages 60-70")
lines(density(female_hgb_60_70_q50$median_hgb, na.rm = TRUE), col = "red")

Count number of patients by age that were in the system in January 2020

What does it mean to be “in the system”? Every EHR system would have its own definition of this, but in general - we want the patients that have already been born, have not died yet, have registered with the EHR system and haven’t left the system (for good) yet.

We will start by creating a set of filters that will define the above conditions:

emr_filter.create("born", "patients.dob", time.shift = c(-years(120), 0))
emr_filter.create("dead", "patients.dod", time.shift = c(-years(120), 0))
emr_filter.create("registered", "patients.status.register", time.shift = c(-years(120), 0))
emr_filter.create("left_for_good", "patients.status.lfg", time.shift = c(-years(120), 0))

We can now use an iterator of a single point while applying all the filters to count the number of patients that were in the system in January 2020:

age_dist_2020 <- emr_dist("age/year()", 0:120, "sex", NULL,
    iterator = 1,
    stime = emr_date2time(1, 1, 2020),
    etime = emr_date2time(1, 1, 2020),
    filter = "born & !dead & registered & !left_for_good",
    names = c("age", "sex"),
    dataframe = TRUE,
    right = FALSE
)
head(age_dist_2020 %>% filter(n > 0))
#>       age sex   n
#> 1 [19,20)   1  38
#> 2 [20,21)   1 105
#> 3 [21,22)   1 176
#> 4 [22,23)   1 250
#> 5 [23,24)   1 370
#> 6 [24,25)   1 436

We can plot the distribution:

age_dist_2020 %>%
    mutate(sex = factor(c("male", "female")[sex])) %>%
    mutate(age = as.numeric(age) - 1) %>%
    ggplot(aes(x = age, y = n, colour = sex, fill = sex, group = sex)) +
    geom_col() +
    facet_wrap(~sex) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))


  1. Actually, patient-time-reference space, but this is explained in the advanced vignette.↩︎