CONSORT diagrams are part of the requirements in reporting parallel group clinical trials or case control designs in observational studies. They are described in the updated 2010 CONSORT statement (Schulz, Altman, and Moher 2010). They clarify how patients were recruited, selected, randomised and followed up. For observational studies an equivalent requirement is the STROBE statement (von Elm et al. 2008). There are other similar requirements for other types of study such as the TRIPOD statement that are applicable for multivariate models (Collins et al. 2015).
For this demonstration we use the cgd
data set from the
survival
package (Terry M. Therneau
and Patricia M. Grambsch 2000; Therneau 2022), which is from a
placebo controlled trial of gamma interferon in Chronic Granulomatous
Disease.
In this example the treatment
column contains the
intervention.
For the analysis we are considering only the first observations for each patient, and our study criteria are as follows:
This can be coded into the dplyr
pipeline, with
additional dtrackr
functions:
# Some useful formatting options
old = options(
dtrackr.strata_glue="{tolower(.value)}",
dtrackr.strata_sep=", ",
dtrackr.default_message = "{.count} records",
dtrackr.default_headline = NULL
)
demo_data = survival::cgd %>%
track() %>%
filter(enum == 1, .type="inclusion", .messages="{.count.out} first observation") %>%
include_any(
hos.cat == "US:NIH" ~ "{.included} NIH patients",
hos.cat == "US:other" ~ "{.included} other US patients"
) %>%
group_by(treat, .messages="cases versus controls") %>%
comment() %>%
capture_exclusions() %>%
exclude_all(
age<5 ~ "{.excluded} subjects under 5",
age>35 ~ "{.excluded} subjects over 35",
steroids == 1 ~ "{.excluded} on steroids at admission"
) %>%
comment(.messages = "{.count} after exclusions") %>%
status(
mean_height = sprintf("%1.2f \u00B1 %1.2f",mean(height),sd(height)),
mean_weight = sprintf("%1.2f \u00B1 %1.2f",mean(weight),sd(weight)),
.messages = c(
"average height: {mean_height}",
"average weight: {mean_weight}"
)
) %>%
ungroup(.messages = "{.count} in final data set")
# restore to originals
options(old)
With a bit of experimentation the flowchart needed for a
STROBE/CONSORT checklist can be generated. One option to output the
flowchart is svg
which can then be manually formatted as
required, but for publication ready output pdf
is usually
preferred.
During this pipeline, we may be keen to understand why certain data
items are being rejected. This would enable us to examine the source
data, and potentially correct it during the data collection process.
We’ve used it to allow continuous quality checks on the data to feed
back to the data curators, as we regularly conduct analyses. By tracking
the exclusions, not only do we track the data flow through the pipeline
we also retain all excluded items, with the reason for exclusion. Thus
we can reassure ourselves that the exclusions are as expected. We
enabled this by calling capture_exclusions()
in the
pipeline above. Having tracked the exclusions we can retrieve them by
calling excluded()
which gives a data frame with the
excluded records and the reasons. If the exclusions happened over
multiple stages as the dataframe format change in between then this will
be held as a nested dataframe (i.e. see ?tidyr::nest
):
# here we filter out the majority of the actual content of the excluded data to focus on the
# metadata recovered during the exclusion.
demo_data %>% excluded() %>% select(.stage,.message,.filter,age, steroids)
#> # A tibble: 15 × 5
#> .stage .message .filter age steroids
#> <chr> <glue> <chr> <chr> <chr>
#> 1 stage 1 7 subjects under 5 age < 5 2 0
#> 2 stage 1 7 subjects under 5 age < 5 1 0
#> 3 stage 1 7 subjects under 5 age < 5 1 0
#> 4 stage 1 7 subjects under 5 age < 5 1 0
#> 5 stage 1 7 subjects under 5 age < 5 1 0
#> 6 stage 1 7 subjects under 5 age < 5 4 0
#> 7 stage 1 7 subjects under 5 age < 5 3 0
#> 8 stage 1 4 subjects under 5 age < 5 3 0
#> 9 stage 1 4 subjects under 5 age < 5 4 0
#> 10 stage 1 4 subjects under 5 age < 5 1 0
#> 11 stage 1 4 subjects under 5 age < 5 4 0
#> 12 stage 1 3 subjects over 35 age > 35 44 0
#> 13 stage 1 3 subjects over 35 age > 35 38 0
#> 14 stage 1 3 subjects over 35 age > 35 37 0
#> 15 stage 1 1 on steroids at admission steroids == 1 6 1
This list may have multiple entries for a single data item, if for example something is excluded in any one step for many reasons.
For reporting results it is useful to have the numbers from the
flowchart to embed into the text of the results section of the write up.
Here we show the same pipeline as above, but with 4 uses of the
.tag
system for labelling part of the pipeline. This
captures data in a tag-value list during the pipeline, and retains it as
metadata for later reuse.
demo_data = survival::cgd %>%
track(.messages = NULL) %>%
filter(enum == 1, .type="inclusion", .messages="{.count.out} first observation") %>%
comment(.tag = "initial cohort") %>%
# ^^^^^^^^^^^^^^^^^^^^^
# TAGS DEFINED
include_any(
hos.cat == "US:NIH" ~ "{.included} NIH patients",
hos.cat == "US:other" ~ "{.included} other US patients"
) %>%
group_by(treat, .messages="cases versus controls") %>%
comment(.tag="study cohort") %>%
# ^^^^^^^^^^^^^^^^^^^
# SECOND SET OF TAGS DEFINED
capture_exclusions() %>%
exclude_all(
age<5 ~ "{.excluded} subjects under 5",
age>35 ~ "{.excluded} subjects over 35",
steroids == 1 ~ "{.excluded} on steroids at admission"
) %>%
comment(.messages = "{.count} after exclusions") %>%
status(
mean_height = sprintf("%1.2f \u00B1 %1.2f",mean(height),sd(height)),
mean_weight = sprintf("%1.2f \u00B1 %1.2f",mean(weight),sd(weight)),
.messages = c(
"average height: {mean_height}",
"average weight: {mean_weight}"
),
.tag = "qualifying patients"
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# THIRD SET TAGS DEFINED
) %>%
ungroup(.messages = "{.count} in final data set", .tag="final set")
# ^^^^^^^^^^^^^^^^
# LAST TAGS DEFINED
The tagged data can be retrieved as follows, which will give you all tagged data for all 4 points in the pipeline:
demo_data %>% tagged() %>% tidyr::unnest(.content)
#> # A tibble: 6 × 7
#> .tag .count .total .strata treat mean_height mean_weight
#> <chr> <int> <int> <chr> <fct> <chr> <chr>
#> 1 initial cohort 128 128 "" <NA> <NA> <NA>
#> 2 study cohort 43 89 "treat:placeb… plac… <NA> <NA>
#> 3 study cohort 46 89 "treat:rIFN-g" rIFN… <NA> <NA>
#> 4 qualifying patients 36 74 "treat:placeb… plac… 145.58 ± 2… 45.46 ± 23…
#> 5 qualifying patients 38 74 "treat:rIFN-g" rIFN… 143.14 ± 2… 40.70 ± 20…
#> 6 final set 74 74 "" <NA> <NA> <NA>
More often though you will want to retrieve specific values from specific points for the results text for example:
initialSet = demo_data %>% tagged(.tag = "initial cohort", .glue = "{.count} patients")
finalSet = demo_data %>% tagged(.tag = "final set", .glue = "{.count} patients")
# there were `r initialSet` in the study, of whom `r finalSet` met the eligibility criteria.
For example there were 128 patients in the study, of whom 74 patients met the eligibility criteria.
More complex formatting and calculations are made possible by use of
the glue
specification, including those that happen on a
per group basis, and we can also pull in values from elsewhere in our
analysis.
demo_data %>% tagged(
.tag = "qualifying patients",
.glue = "{.strata}: {.count}/{.total} ({sprintf('%1.1f', .count/.total*100)}%) patients on {sysDate}, with a mean height of {mean_height}",
sysDate = Sys.Date()
# we could have included any number of other parameters here from the global environment
) %>% dplyr::pull(.label)
#> treat:placebo: 36/74 (48.6%) patients on 2024-11-18, with a mean height of 145.58 ± 29.10
#> treat:rIFN-g: 38/74 (51.4%) patients on 2024-11-18, with a mean height of 143.14 ± 25.12
Sometimes it will be necessary to operate on all tagged content at
once. This is possible but be aware that the content available depends
somewhat on where the tag was set in the pipeline so not all fields will
always be present (although .count
and .total
will be). The .total
is the overall number of cases at that
point in the pipeline. .count
is the number of cases in
each strata.
demo_data %>% tagged(.glue = "{.count}/{.total} patients")
#> # A tibble: 6 × 3
#> .tag .strata .label
#> <chr> <chr> <glue>
#> 1 initial cohort "" 128/128 patients
#> 2 study cohort "treat:placebo" 43/89 patients
#> 3 study cohort "treat:rIFN-g" 46/89 patients
#> 4 qualifying patients "treat:placebo" 36/74 patients
#> 5 qualifying patients "treat:rIFN-g" 38/74 patients
#> 6 final set "" 74/74 patients
For comparing inclusions and exclusions at different stages in the pipeline using tags the following example may be useful:
demo_data %>%
tagged() %>% # selects only top level content
tidyr::unnest(.content) %>%
dplyr::select(.tag, .total) %>%
dplyr::distinct() %>%
tidyr::pivot_wider(values_from=.total, names_from=.tag) %>%
glue::glue_data("Out of {`initial cohort`} patients, {`study cohort`} were eligible for inclusion on the basis of their age
but {`study cohort`-`qualifying patients`} were outside the age limits.
This left {`final set`} patients included in the final study (i.e. overall {`initial cohort`-`final set`} were removed).")
#> Out of 128 patients, 89 were eligible for inclusion on the basis of their age
#> but 15 were outside the age limits.
#> This left 74 patients included in the final study (i.e. overall 54 were removed).
Composing dtrackr
inclusion and exclusion criteria into
functions lets you reuse them across different studies, but to be useful
these functions need to be parametrised. As dtrackr
uses
formulae to specify the inclusion and exclusion criteria, to construct
an appropriate formulae for dtrackr
using values that are
parametrised from a helper function requires injection support, using
rlang::inject()
. This is an advanced R topic but the
following example gives a sense of what is possible.
# This is a reusable function to restrict ages
age_restrict = function(df, age_col, min_age = 18, max_age = 65) {
age_col = rlang::ensym(age_col)
message = sprintf("{.included} between\n%d and %d years", min_age, max_age)
dtrackr::include_any(df,
# injection support for parameters must be made explicit using
# rlang::inject in any functions using include_any or exclude_all
rlang::inject(min_age <= !!age_col & max_age >= !!age_col ~ !!message)
)
}
survival::cgd %>%
# the `age` column is in the cgd dataset:
age_restrict(age, max_age = 30) %>%
# demonstrating that this works in 2 stages
age_restrict(age, min_age = 20) %>%
flowchart()