Build an xgboost classification model with k-fold cross-validation for each feature set provided; for each age, the target classification is assumed to be defined by the model for the previous (older) age.

Usage

mldpEHR.mortality_multi_age_predictors(
  patients,
  features,
  step,
  nfolds,
  required_conditions = "id==id",
  q_thresh = 0.05,
  xgboost_params = list(booster = "gbtree", objective = "binary:logistic", subsample =
    0.7, max_depth = 3, colsample_bytree = 1, eta = 0.05, min_child_weight = 1, gamma =
    0, eval_metric = "auc"),
  nrounds = 1000
)

Arguments

patients
  • list of data.frames of all the patients in the system, going back in time. For example, the first data.frame represents age 80, the next age 75, and so forth. Each patient data.frame contains the following columns:

  • patient id

  • sex

  • age

  • death - age at death, NA if unknown

  • followup - available follow-up time (in years) for this patient: time until the end of the database, or until the patient exits the system for reasons other than death. Any additional columns required for patient filtering may also be included.

features
  • list of data.frames of features. Must contain a patient id column.

step
  • time step (in years of age) between consecutive prediction models (e.g. 5 for models at ages 80, 75, 70, ...)

nfolds
  • number of folds used for k-fold cross validation

required_conditions
  • filter applied to the patients data to select training/testing samples (e.g. to exclude patients with missing data)

q_thresh
  • score quantile threshold used to assign a target classification of 1

xgboost_params
  • parameters used for xgboost model training

nrounds
  • number of training rounds
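The defaults shown in Usage can serve as a base for tuning. A minimal sketch using base R's `modifyList()` to override individual entries while keeping the rest of the defaults (the specific overrides here are arbitrary examples, not recommendations):

```r
# Default xgboost parameters as documented for this function
default_params <- list(
    booster = "gbtree", objective = "binary:logistic", subsample = 0.7,
    max_depth = 3, colsample_bytree = 1, eta = 0.05,
    min_child_weight = 1, gamma = 0, eval_metric = "auc"
)

# Override selected entries; untouched entries keep their default values
my_params <- modifyList(default_params, list(max_depth = 4, eta = 0.03))
```

The resulting `my_params` list can then be passed as the `xgboost_params` argument.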

Value

The full list of predictors, one per provided patients data.frame. Each predictor is a list with the following members:

  • model - list of xgboost models, one per fold

  • train - data.frame containing the patient id, fold, target class and predicted value in training (each id is used for training in nfolds-1 of the folds)

  • test - data.frame containing the patient id, fold, target class and predicted value in testing

  • xgboost_params - the set of parameters used in xgboost

  • nrounds - number of training iterations conducted
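A toy list mirroring the documented return shape may help orient downstream code. The column names `target` and `predict` follow the Examples below; the values and fold bookkeeping here are fabricated for illustration only:

```r
# Fabricated stand-in for one age-level predictor (real $model entries are
# xgboost model handles, one per fold; NULL placeholders are used here)
predictor <- list(
    model = list(NULL, NULL, NULL),
    train = data.frame(id = c(1, 2, 3), fold = c(1, 2, 3),
                       target = c(0, 1, 0), predict = c(0.1, 0.8, 0.2)),
    test = data.frame(id = c(1, 2, 3), fold = c(1, 2, 3),
                      target = c(0, 1, 0), predict = c(0.2, 0.7, 0.1)),
    xgboost_params = list(eval_metric = "auc"),
    nrounds = 1000
)

# Held-out predictions are pooled across folds in $test, so they can be
# evaluated directly against the target class
with(predictor$test, tapply(predict, target, mean))
```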

Examples


library(dplyr)
library(ggplot2)
# build base predictor
N <- 1000
patients <- purrr::map(0:5, ~ data.frame(
    id = 1:N,
    sex = rep(c(1, 2), N / 2),
    age = 80 - .x * 5,
    death = c(rep(NA, 0.2 * N), rep(82, 0.8 * N)),
    followup = .x * 5 + 5
)) %>%
    setNames(seq(80, by = -5, length.out = 6))
features <- purrr::map(0:5, ~ data.frame(
    id = 1:N,
    a = c(rnorm(0.2 * N), rnorm(0.8 * N, mean = 2, sd = 0.5))
)) %>% setNames(seq(80, by = -5, length.out = 6))
predictors <- mldpEHR.mortality_multi_age_predictors(patients, features, 5, 3, q_thresh = 0.2)
#>   Training [=============================] 3/3 (100%) in  1s
#>   (per-fold training progress bars, repeated for each of the six age models, elided)
test <- purrr::map2_df(predictors, names(predictors), ~ .x$test %>%
    mutate(n = .y) %>%
    arrange(id) %>%
    mutate(
        outcome =
            c(
                rep("alive", 0.2 * N),
                rep("death", 0.8 * N)
            )
    ))
ggplot(test, aes(x = predict, colour = factor(outcome))) +
    facet_wrap(~n, nrow = 1) +
    stat_ecdf() +
    theme_bw()
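The ECDF plot shows the separation between outcomes qualitatively; a rank-based (Wilcoxon) AUC can quantify it. The `auc()` helper below is a hypothetical base-R sketch, not part of the package, demonstrated on synthetic scores shaped like the toy feature `a`:

```r
# Rank-based AUC: probability that a random positive outranks a random negative
auc <- function(score, positive) {
    r <- rank(score)
    n_pos <- sum(positive)
    n_neg <- sum(!positive)
    (sum(r[positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(c(1, 2, 3, 4), c(FALSE, FALSE, TRUE, TRUE)) # perfectly separated groups give 1

set.seed(1)
score <- c(rnorm(200), rnorm(800, mean = 2, sd = 0.5)) # mimics the toy feature `a`
auc(score, c(rep(FALSE, 200), rep(TRUE, 800)))
```

On the `test` data.frame built in the example above, the analogous call would be `auc(test$predict, test$outcome == "death")`, optionally computed separately for each age group `n`.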