
Build an xgboost classification model with k-fold cross-validation for each featureset provided, assuming that the classification target is defined by the previous (older-age) model.

Usage

mldpEHR.disease_multi_age_predictors(
  patients,
  features,
  step,
  nfolds,
  required_conditions = "id==id",
  xgboost_params = list(booster = "gbtree", objective = "binary:logistic", subsample =
    0.7, max_depth = 3, colsample_bytree = 1, eta = 0.05, min_child_weight = 1, gamma =
    0, eval_metric = "auc"),
  nrounds = 1000
)

Arguments

patients
  • list of data.frames of all the patients in the system, going back in time. For example, the first data.frame represents age 80, the next age 75, and so forth. Each patient data.frame contains the following columns:

  • patient id

  • sex

  • age

  • death - age at death, NA if unknown

  • disease - age at disease, NA if unknown

  • followup - available followup time (in years) for this patient: time until the end of the database or until the patient exits the system (not due to death). Any additional columns required for patient filtering may also be included.

features
  • list of data.frames of features, one per patients data.frame

step
  • time (in years) between consecutive prediction models

nfolds
  • number of folds used for k-fold cross validation

required_conditions
  • filter applied to the features to exclude training/testing samples (e.g. rows with missing data); the default "id==id" is always true and keeps all rows

xgboost_params
  • parameters used for xgboost model training

nrounds
  • number of training rounds
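The required_conditions string is evaluated as a filter expression against the feature columns. A minimal base-R sketch of the idea, using a hypothetical condition string — the package's actual evaluation mechanism may differ:

```r
# Toy feature table with some missing data
features_80 <- data.frame(
    id = 1:6,
    a  = c(1.2, NA, 0.7, 2.1, NA, 0.3)
)

# A hypothetical condition string, in the spirit of required_conditions
cond <- "!is.na(a)"

# Evaluate the string as a filter on the data.frame (base-R equivalent
# of filtering on a parsed expression)
keep <- eval(parse(text = cond), envir = features_80)
filtered <- features_80[keep, ]

nrow(filtered) # 4 rows survive the filter
```

With the default "id==id" every row evaluates to TRUE, so no samples are excluded.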

Value

The full list of predictors, one per provided patients data.frame. Each predictor is a list with the following members:

  • model - list of xgboost models, for each fold

  • train - data.frame containing the patient id, fold, target class and predicted value in training (each id was used for training in nfolds-1 folds)

  • test - data.frame containing the patient id, fold, target class and predicted value in testing (each id was tested once, in the fold where it was not used for training)

  • xgboost_params - the set of parameters used in xgboost

  • nrounds - number of training iterations conducted
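The train/test structure above can be illustrated with a mock fold assignment: each id is held out in exactly one fold (its test row) and used for training in the other nfolds-1 folds. A base-R sketch, not the package's internals:

```r
set.seed(1)
nfolds <- 3
ids <- 1:9
fold <- sample(rep(1:nfolds, length.out = length(ids)))

# Mock the train/test members of one predictor: the held-out rows of
# fold k form its "test" rows; the remaining ids form its "train" rows
test  <- data.frame(id = ids, fold = fold, predict = runif(length(ids)))
train <- do.call(rbind, lapply(1:nfolds, function(k) {
    data.frame(id = ids[fold != k], fold = k)
}))

table(train$id)         # every id trained in nfolds-1 = 2 folds
anyDuplicated(test$id)  # 0: every id tested exactly once
```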

Examples


library(dplyr)
library(ggplot2)
# build base predictor
N <- 1000
patients <- purrr::map(0:5, ~ data.frame(
    id = 1:N,
    sex = rep(1, N),
    age = 80 - .x * 5,
    death = c(rep(NA, 0.4 * N), rep(82, 0.6 * N)),
    disease = rep(rep(c(NA, 81), each = N / 4), 2),
    followup = .x * 5 + 5
)) %>%
    setNames(seq(80, by = -5, length.out = 6))
features <- purrr::map(0:5, ~ data.frame(
    id = 1:N,
    a = c(rnorm(0.4 * N), rnorm(0.6 * N, mean = 2, sd = 1)),
    b = rep(c(rnorm(N / 4), rnorm(N / 4, mean = 3)), 2)
)) %>% setNames(seq(80, by = -5, length.out = 6))
predictors <- mldpEHR.disease_multi_age_predictors(patients, features, 5, 3)
#> 
#>   Training [=============================] 3/3 (100%) in 1s
#>   (progress bar repeated for each of the 6 age-specific predictors)
#> 

test <- purrr::map2_df(predictors, names(predictors), ~ .x$test %>%
    mutate(n = .y) %>%
    arrange(id) %>%
    mutate(
        outcome =
            c(
                rep("healthy", 0.25 * N),
                rep("disease", 0.15 * N),
                rep("disease_death", 0.1 * N),
                rep("death", 0.25 * N),
                rep("disease_death", 0.25 * N)
            )
    ))
ggplot(test, aes(x = predict, colour = factor(outcome))) +
    facet_wrap(~n, nrow = 1) +
    stat_ecdf() +
    theme_bw()
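Since eval_metric is "auc", a natural check on the out-of-fold predictions is the ROC AUC. A base-R sketch using the rank-sum (Wilcoxon) identity on simulated scores — with real output you would substitute the predict column and a binary label derived from the target class (the simulated data here is illustrative):

```r
# Simulated out-of-fold scores: positives score higher on average
set.seed(7)
score <- c(rnorm(100, mean = 1), rnorm(100, mean = 0))
label <- rep(c(1, 0), each = 100)

# AUC via the rank-sum (Wilcoxon) identity, no extra packages needed
auc <- function(score, label) {
    n1 <- sum(label == 1)
    n0 <- sum(label == 0)
    r  <- rank(score)
    (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(score, label) # well above 0.5 for separated classes
```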