| Title: | Track your Data Pipelines |
|---|---|
| Description: | Track and document 'dplyr' data pipelines. As you filter, mutate, and join your way through a data set, 'dtrackr' seamlessly keeps track of your data flow and makes publication ready documentation of a data pipeline simple. |
| Authors: | Robert Challen [aut, cre] (ORCID: <https://orcid.org/0000-0002-5504-7768>) |
| Maintainer: | Robert Challen |
| License: | MIT + file LICENSE |
| Version: | 0.5.0 |
| Built: | 2026-06-01 08:25:09 UTC |
| Source: | https://github.com/terminological/dtrackr |
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally add nothing to the
provenance of data, so by default they are left out of the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().

    ## S3 method for class 'trackr_df'
    add_count(x, ..., .messages = "", .headline = "", .tag = NULL)

| Argument | Description |
|---|---|
| x | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
| ... | < Named arguments passed on to |
| .messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
| .headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
| .tag | if you want the summary data from this step in the future then give it a name with .tag. |

the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::add_count()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # add_count
    # adding in a count or tally column as a new column
    iris %>%
      track() %>%
      add_count(Species, name="new_count_total",
                .messages="{.new_cols}",
                # .messages="{.cols}",
                .headline="New columns from add_count:") %>%
      history()

    # add_tally
    iris %>%
      track() %>%
      group_by(Species) %>%
      dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
                .messages="{.new_cols}",
                .headline="New columns from add_tally:") %>%
      history()

See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally add nothing to the
provenance of data, so by default they are left out of the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().

    add_tally(x, ..., .messages = "", .headline = "", .tag = NULL)

| Argument | Description |
|---|---|
| x | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
| ... | < |
| .messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
| .headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
| .tag | if you want the summary data from this step in the future then give it a name with .tag. |

the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::add_tally()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # add_count
    # adding in a count or tally column as a new column
    iris %>%
      track() %>%
      add_count(Species, name="new_count_total",
                .messages="{.new_cols}",
                # .messages="{.cols}",
                .headline="New columns from add_count:") %>%
      history()

    # add_tally
    iris %>%
      track() %>%
      group_by(Species) %>%
      dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
                .messages="{.new_cols}",
                .headline="New columns from add_tally:") %>%
      history()

Joins behave as the equivalent dplyr joins, except that the history graphs of
the two sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::anti_join() for more details
on the underlying functions.

    ## S3 method for class 'trackr_df'
    anti_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} not matched"),
      .headline = "Semi join by {.keys}"
    )

| Argument | Description |
|---|---|
| x, y | A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
| ... | Other parameters passed onto methods. |
| .messages | a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
| .headline | a glue spec. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |

the join of the two dataframes with the history graph updated.
dplyr::anti_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Anti join
    join = lhs %>% anti_join(rhs, by="name") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history() %>% print()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

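The .messages and .headline glue specs described above can be overridden per call. The following is a minimal sketch, assuming two small hypothetical tibbles (`lhs` and `rhs` with an `id` column, not part of the package) and using only the placeholders documented for this method:

```r
library(dplyr)
library(dtrackr)

# Hypothetical toy data, for illustration only
lhs <- tibble(id = 1:5) %>% track()
rhs <- tibble(id = 3:7) %>% track()

# Override the default headline and messages using the documented
# {.keys}, {.count.lhs} and {.count.out} placeholders
unmatched <- lhs %>%
  anti_join(
    rhs,
    by = "id",
    .headline = "Anti join by {.keys}",
    .messages = "{.count.out} of {.count.lhs} LHS rows not matched"
  )

# Inspect the merged history of both inputs
unmatched %>% history()
```

Only ids 1 and 2 have no match in `rhs`, so the result has two rows.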
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally add nothing to the
provenance of data, so by default they are left out of the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().

    ## S3 method for class 'trackr_df'
    arrange(.data, ..., .messages = "", .headline = "", .tag = NULL)

| Argument | Description |
|---|---|
| .data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
| ... | < The value can be: Named arguments passed on to |
| .messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
| .headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
| .tag | if you want the summary data from this step in the future then give it a name with .tag. |

the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::arrange()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # arrange
    # In this case we sort the data descending and show the first value
    # is the same as the maximum value.
    iris %>%
      track() %>%
      arrange(
        desc(Petal.Width),
        .messages="{.count} items, columns: {.cols}",
        .headline="Reordered dataframe:") %>%
      history()

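To make the default behaviour concrete, the sketch below runs the same sort twice: once without .messages or .headline (which, per the description above, should add no stage to the history) and once with them. The variable names `silent` and `recorded` are illustrative only.

```r
library(dplyr)
library(dtrackr)

# No .messages / .headline: the sort is applied but, as documented above,
# no new stage is expected in the history
silent <- iris %>%
  track() %>%
  arrange(Sepal.Length)

# With a headline and message the same sort is recorded like a comment()
recorded <- iris %>%
  track() %>%
  arrange(
    Sepal.Length,
    .headline = "Sorted by sepal length",
    .messages = "{.count} items"
  )

# Compare the two histories
silent %>% history()
recorded %>% history()
```

Both pipelines return the same sorted data; only the recorded history differs.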
These functions perform set operations on tracked dataframes. They merge the
history of 2 (or more) dataframes and combine the rows (or columns). The total
number of resulting rows is calculated as {.count.out}; in other respects they
perform exactly the same operation as the equivalent dplyr function. See
dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.

    bind_cols(
      ...,
      .messages = "{.count.out} in combined set",
      .headline = "Bind columns"
    )

| Argument | Description |
|---|---|
| ... | a collection of tracked data frames to combine. Named arguments passed on to |
| .messages | a set of glue specs. The glue code can use any global variable, or {.count.out} |
| .headline | a glue spec. The glue code can use any global variable, or {.count.out} |

the dplyr output with the history graph updated.
dplyr::bind_cols()

    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)
    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

These functions perform set operations on tracked dataframes. They merge the
history of 2 (or more) dataframes and combine the rows (or columns). The total
number of resulting rows is calculated as {.count.out}; in other respects they
perform exactly the same operation as the equivalent dplyr function. See
dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.

    bind_rows(..., .messages = "{.count.out} in union", .headline = "Union")

| Argument | Description |
|---|---|
| ... | a collection of tracked data frames to combine. Named arguments passed on to |
| .messages | a set of glue specs. The glue code can use any global variable, or {.count.out} |
| .headline | a glue spec. The glue code can use any global variable, or {.count.out} |

the dplyr output with the history graph updated.
dplyr::bind_rows()

    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)
    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

Start capturing exclusions on a tracked dataframe.

    capture_exclusions(.data, .capture = TRUE)

| Argument | Description |
|---|---|
| .data | a tracked dataframe |
| .capture | Should we capture exclusions (things removed from the data set). This is useful for debugging data issues but comes at a significant cost. Defaults to the value of |

the .data dataframe with the exclusions flag set (or cleared if
.capture=FALSE).

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% capture_exclusions()
    tmp %>% filter(Species!="versicolor") %>% history()

A comment can be any kind of note and is added once for every current
grouping, as defined by the .messages field. It can be made context specific
by including variables such as {.count} and {.total} in .messages, which
refer to the grouped and ungrouped counts at the current stage of the
pipeline respectively. It can also pull in any global variable.

    comment(
      .data,
      .messages = .defaultMessage(),
      .headline = .defaultHeadline(),
      .type = "info",
      .asOffshoot = (.type == "exclusion"),
      .tag = NULL
    )

| Argument | Description |
|---|---|
| .data | a dataframe which may be grouped |
| .messages | a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} of all rows, the {.count} variable which is the count in each group and {.strata} a description of the group |
| .headline | a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment, or the {.total} variable (which is |
| .type | one of "info","...","exclusion": used to define formatting |
| .asOffshoot | do you want this comment to be an offshoot of the main flow (default = FALSE). |
| .tag | if you want the summary data from this step in the future then give it a name with .tag. |

the same .data dataframe with the history graph updated with the comment

    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% comment("hello {.total} rows") %>% history()

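Because a comment is added once per current grouping, a grouped pipeline is a useful illustration. This is a minimal sketch that uses only the {.strata}, {.count} and {.total} placeholders documented for .messages; the variable name `by_species` is illustrative.

```r
library(dplyr)
library(dtrackr)

# One message is generated per group: {.count} is the group size,
# {.total} the overall row count and {.strata} a description of the group
by_species <- iris %>%
  track() %>%
  group_by(Species) %>%
  comment("{.strata}: {.count} of {.total} rows")

by_species %>% history()
```

The data itself is unchanged by comment(); only the history graph gains a stage.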
A frequent use case for more detailed description is to have a subgroup count within a flowchart. This works best for factor subgroup columns but other data will be converted to a factor automatically. The count of the items in each subgroup is added as a new stage in the flowchart.

    count_subgroup(
      .data,
      .subgroup,
      ...,
      .messages = .defaultCountSubgroup(),
      .headline = .defaultHeadline(),
      .type = "info",
      .asOffshoot = FALSE,
      .tag = NULL,
      .maxsubgroups = .defaultMaxSupportedGroupings()
    )

| Argument | Description |
|---|---|
| .data | a dataframe which may be grouped |
| .subgroup | a column with a small number of levels (e.g. a factor) |
| ... | passed to |
| .messages | a character vector of glue specifications. A glue specification can refer to anything from the calling environment, {.subgroup} for the subgroup column name and {.name} for the subgroup column value, {.count} for the subgroup column count, {.subtotal} for the current stratification grouping count and {.total} for the whole dataset count |
| .headline | a glue specification which can refer to grouping variables of .data, {.subtotal} for the current grouping count, or any variables defined in the calling environment |
| .type | one of "info","exclusion": used to define formatting |
| .asOffshoot | do you want this comment to be an offshoot of the main flow (default = FALSE). |
| .tag | if you want to use the summary data from this step in the future then give it a name with .tag. |
| .maxsubgroups | the maximum number of discrete values allowed in .subgroup is configurable with |

the same .data dataframe with the history graph updated with a subgroup count as a new stage

    library(dplyr)
    library(dtrackr)

    survival::cgd %>%
      track() %>%
      dplyr::group_by(treat) %>%
      count_subgroup(center) %>%
      history()

Distinct acts in the same way as dplyr::distinct(). The size of each group is
calculated before the operation as {.count.in} and after the operation as
{.count.out}. The group {.strata} is also available (if
grouped) for reporting. See dplyr::distinct().

    ## S3 method for class 'trackr_df'
    distinct(
      .data,
      ...,
      .messages = "removing {.count.in-.count.out} duplicates",
      .headline = .defaultHeadline(),
      .tag = NULL
    )

| Argument | Description |
|---|---|
| .data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
| ... | < |
| .messages | a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
| .headline | a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
| .tag | if you want the summary data from this step in the future then give it a name with .tag. |

the .data dataframe with distinct values and history graph updated.
dplyr::distinct()

    library(dplyr)
    library(dtrackr)

    tmp = bind_rows(iris %>% track(), iris %>% track() %>% filter(Petal.Length > 5))
    tmp %>% dplyr::group_by(Species) %>% dplyr::distinct() %>% history()

Graphviz dot content to SVG: convert a graphviz dot digraph, given as a string, to SVG as a string.

    dot2svg(dot)

| Argument | Description |
|---|---|
| dot | a |

the SVG as a string

    dot2svg("digraph { A->B }")

Apply a set of filters and summarise the actions of the filters in the dtrackr
history graph. Because of the ... filter specification, all parameters MUST BE
NAMED. The filters work in a combinatorial manner, i.e. the results EXCLUDE ALL
rows that match any of the criteria. If na.rm = TRUE they also remove
anything that cannot be evaluated by any criteria.

    exclude_all(
      .data,
      ...,
      .headline = .defaultHeadline(),
      na.rm = FALSE,
      .type = "exclusion",
      .asOffshoot = TRUE,
      .stage = (if (is.null(.tag)) "" else .tag),
      .tag = NULL
    )

| Argument | Description |
|---|---|
| .data | a dataframe which may be grouped |
| ... | a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match any of the predicates will be excluded. The RHS is a glue specification, defining the message to be entered in the history graph for each predicate. This can refer to grouping variables, variables from the environment, {.excluded}, {.matched} or {.missing} (excluded = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
| .headline | a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
| na.rm | (default FALSE) if the filter cannot be evaluated for a row, count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
| .type | default "exclusion": used to define formatting |
| .asOffshoot | do you want this comment to be an offshoot of the main flow (default = TRUE). |
| .stage | a name for this step in the pathway |
| .tag | if you want the summary data from this step in the future then give it a name with .tag. |

the filtered .data dataframe with the history graph updated with the summary of excluded items as a new offshoot stage

    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% capture_exclusions() %>% exclude_all(
      Petal.Length > 5 ~ "{.excluded} long ones",
      Petal.Length < 2 ~ "{.excluded} short ones"
    ) %>% history()

    # simultaneous evaluation of criteria:
    data.frame(a = 1:10) %>%
      track() %>%
      exclude_all(
        # These two criteria identify the same value and one item is excluded
        a > 9 ~ "{.excluded} value > 9",
        a == max(a) ~ "{.excluded} max value",
      ) %>%
      status() %>%
      history()

    # the behaviour is equivalent to the inverse of dplyr's filter function:
    data.frame(a=1:10) %>%
      dplyr::filter(a <= 9, a != max(a)) %>%
      nrow()

    # step-wise evaluation of criteria results in a different output
    data.frame(a = 1:10) %>%
      track() %>%
      # Performing the same exclusion sequentially results in 2 items
      # being excluded as the criteria no longer identify the same
      # item.
      exclude_all(a > 9 ~ "{.excluded} value > 9") %>%
      exclude_all(a == max(a) ~ "{.excluded} max value") %>%
      status() %>%
      history()

    # the behaviour is equivalent to the inverse of dplyr's filter function:
    data.frame(a=1:10) %>%
      dplyr::filter(a <= 9) %>%
      dplyr::filter(a != max(a)) %>%
      nrow()

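The effect of na.rm on rows that cannot be evaluated can be sketched in base R. This is an illustration of the documented semantics only, under the assumption that a row counts as "missing" when any criterion evaluates to NA for it; it is not dtrackr's implementation, and the vector `a` and both criteria are hypothetical.

```r
# Hypothetical data with one value that cannot be evaluated
a <- c(1, 5, NA, 10)

# Two exclusion criteria, evaluated on every row at once
criteria <- list(a > 9, a < 2)

# A row is matched when any criterion is TRUE for it (NA counts as not TRUE)
matched <- Reduce(`|`, lapply(criteria, function(x) x %in% TRUE))

# Assumption: a row is unevaluated when any criterion returns NA for it
unevaluated <- Reduce(`|`, lapply(criteria, is.na))

# na.rm = FALSE: only matched rows are excluded, the NA row is retained
kept_default <- a[!matched]                # 5 NA

# na.rm = TRUE: rows that could not be evaluated are excluded as well
kept_na_rm <- a[!(matched | unevaluated)]  # 5
```

Under this reading, excluded = matched + missing when na.rm = TRUE, which lines up with the {.excluded}, {.matched} and {.missing} placeholders described for the RHS glue specification.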
Get the dtrackr excluded data record

    excluded(.data, simplify = TRUE)

| Argument | Description |
|---|---|
| .data | a dataframe which may be grouped |
| simplify | return a single summary dataframe of all exclusions. |

a new dataframe of the excluded data up to this point in the workflow. This dataframe is flattened by default, but if simplify=FALSE it has a nested structure containing the records excluded at each part of the pipeline.

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% capture_exclusions()
    tmp %>% exclude_all(
      Petal.Length > 5.8 ~ "{.excluded} long ones",
      Petal.Length < 1.3 ~ "{.excluded} short ones",
      .stage = "petal length exclusion"
    ) %>% excluded()

Filter acts in the same way as in dplyr, where predicates which evaluate to
TRUE act to select items to include, and items for which the predicate cannot
be evaluated are excluded. For tracking, the size of each group is calculated
before the filter operation as {.count.in} and after the operation as
{.count.out}. The grouping {.strata} is also
available (if grouped) for reporting. See dplyr::filter().
## S3 method for class 'trackr_df' filter( .data, ..., .messages = "excluded {.excluded} items", .headline = .defaultHeadline(), .type = "exclusion", .asOffshoot = (.type == "exclusion"), .stage = (if (is.null(.tag)) "" else .tag), .tag = NULL )## S3 method for class 'trackr_df' filter( .data, ..., .messages = "excluded {.excluded} items", .headline = .defaultHeadline(), .type = "exclusion", .asOffshoot = (.type == "exclusion"), .stage = (if (is.null(.tag)) "" else .tag), .tag = NULL )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
the format type of the action typically an exclusion |
.asOffshoot |
if the type is exclusion, |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with history graph updated
dplyr::filter()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% dplyr::group_by(Species)
tmp %>% filter(Petal.Length > 5) %>% history()
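The default message reports only the number of excluded items. As a hedged sketch of how the documented {.count.in} and {.count.out} variables might be used instead, the example below overrides .messages on a grouped data set; the message wording and the name `long_petals` are illustrative choices.

```r
library(dplyr)
library(dtrackr)

long_petals = iris %>%
  track() %>%
  group_by(Species) %>%
  filter(
    Petal.Length > 5,
    # per-group sizes before and after the filter
    .messages = "{.count.in} before, {.count.out} after"
  )

# the message is recorded in the history graph
long_petals %>% history()

# the data itself is filtered exactly as dplyr::filter() would do
nrow(long_petals)
```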
Generate a flowchart of the history of the dataframe(s), with each tracked step of the data pipeline shown as a stage in the flowchart. Multiple dataframes can be plotted together, in which case an attempt is made to determine which parts are common.
flowchart(
  .data,
  filename = NULL,
  size = std_size$full,
  maxWidth = size$width,
  maxHeight = size$height,
  formats = c("dot", "png", "pdf", "svg"),
  defaultToHTML = TRUE,
  landscape = size$rot != 0,
  ...
)
.data |
the tracked dataframe(s) either as a single dataframe or as a list of dataframes. |
filename |
a file name which will be where the formatted flowcharts are
saved. If no extension is specified the output formats are determined by
the |
size |
a named list with 3 elements, length and width in inches and rotation. A predefined set of standard sizes is available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
defaultToHTML |
if the correct output format is not easy to determine
from the context, default providing |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
ignored
Named arguments passed on to
|
the nature of the flowchart output depends on the context in which
the function is called. It will be some form of browsable HTML output if
called from an interactive session, or a PNG/PDF link if called in knitr when
knitting LaTeX or Word type outputs. If a file name is specified the output
will also be saved at the given location.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
tmp %>% dplyr::group_by(Species) %>% comment(.tag="step2") %>% flowchart()
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::full_join() for more details
on the underlying functions.
## S3 method for class 'trackr_df'
full_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Full join by {.keys}"
)
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::full_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Full join
join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
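For a case small enough to check by hand, the sketch below joins two toy tibbles and overrides the default .headline and .messages using the documented {.keys}, {.count.lhs}, {.count.rhs} and {.count.out} variables. The toy data and the message wording are assumptions made for illustration.

```r
library(dplyr)
library(dtrackr)

# two small tracked data sets sharing the key column `id`
lhs = tibble(id = 1:4, x = letters[1:4]) %>%
  track() %>%
  comment("LHS {.total}")
rhs = tibble(id = c(2L, 3L, 5L), y = c("p", "q", "r")) %>%
  track() %>%
  comment("RHS {.total}")

joined = lhs %>%
  full_join(
    rhs,
    by = "id",
    .headline = "Full join on {.keys}",
    .messages = "{.count.lhs} and {.count.rhs} rows give {.count.out}"
  )

# the merged history contains the stages of both inputs plus the join
joined %>% history()

# ids 1 to 5 are all retained by a full join
nrow(joined)
```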
Grouping a data set acts in the normal way. When tracking a dataframe, a
group_by() operation will sometimes create a lot of groups. This happens,
for example, if you are doing a group_by(), summarise() step that is
aggregating data on a fine scale, e.g. by day in a time-series. This is
generally a terrible idea when tracking a dataframe as the resulting
flowchart will have a great many branches and be illegible. dtrackr will
detect this issue and pause tracking the dataframe with a warning. It is up
to the user to resume() tracking when the large number of groups has
been resolved, e.g. using a dplyr::ungroup(). This limit is configurable
with options("dtrackr.max_supported_groupings"=XX). The default is 16. See
dplyr::group_by().
## S3 method for class 'trackr_df'
group_by(
  .data,
  ...,
  .messages = "stratify by {.cols}",
  .headline = NULL,
  .tag = NULL,
  .maxgroups = .defaultMaxSupportedGroupings()
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by. |
.headline |
a headline glue spec. The glue code can use any global variable, or {.cols}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
.maxgroups |
the maximum number of subgroups allowed before the tracking is paused. |
the .data but grouped.
dplyr::group_by()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% dplyr::group_by(Species, .messages="stratify by {.cols}")
tmp %>% comment("{.strata}") %>% history()
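The pause and resume behaviour described for large numbers of groups can be sketched by deliberately lowering the limit. This example assumes that passing .maxgroups = 2 has the same effect for a single call as setting options("dtrackr.max_supported_groupings" = 2) globally, and that resume() can be piped directly after ungroup(); the comment text is illustrative.

```r
library(dplyr)
library(dtrackr)

# Species has 3 levels, which exceeds the artificially low limit of 2,
# so dtrackr is expected to warn and pause tracking at this step
paused = iris %>%
  track() %>%
  group_by(Species, .maxgroups = 2)

# once the fine-grained grouping is resolved, tracking is resumed by hand
resumed = paused %>%
  ungroup() %>%
  resume() %>%
  comment("tracking resumed with {.total} rows")

resumed %>% history()
```

Lowering the limit per call keeps the global option untouched, which may be preferable in a shared analysis script.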
Group modifying a data set acts in the normal way. The internal mechanics of
the modify function are opaque to the history. This means it can be used
to wrap any unsupported operation without losing the history (e.g. df %>% track() %>% group_modify(function(d,...) { d %>% unsupported_operation() })).
Prior to the operation the size of the group is calculated as {.count.in},
and after the operation the output size as {.count.out}. The group {.strata}
is also available (if grouped) for reporting. See dplyr::group_modify().
## S3 method for class 'trackr_df'
group_modify(
  .data,
  ...,
  .messages = NULL,
  .headline = .defaultHeadline(),
  .type = "modify",
  .tag = NULL
)
.data |
A grouped tibble |
... |
Additional arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
default "modify": used to define formatting |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the transformed .data dataframe with the history graph updated.
dplyr::group_modify()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% dplyr::group_by(Species)
tmp %>%
  dplyr::group_modify(
    function(d,g,...) { return(tibble::tibble(x=stats::runif(10))) },
    .messages="{.count.in} in, {.count.out} out"
  ) %>%
  history()
This provides the raw history graph and is not really intended for mainstream use. The internal structure of the graph is explained below. print and plot S3 methods exist for the dtrackr history graph.
history(.data)
.data |
a dataframe which may be grouped |
the history graph. This is a list, of class trackr_graph, containing the following named items:
excluded - the data items that have been excluded thus far as a nested dataframe
tags - a dataframe of tag-value pairs containing the summary of the data at named points in the data flow (see tagged())
nodes - a dataframe of the nodes of the flow chart
edges - an edge list (as a dataframe) of the relationships between the nodes in the flow chart
head - the current most recent nodes added into the graph as a dataframe.
The format of this data may grow over time but these fields are unlikely to be changed.
library(dplyr)
library(dtrackr)

graph = iris %>% track() %>% comment("A comment") %>% history()
print(graph)
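Because the history graph is a plain named list, its documented components can be inspected directly. The following is a minimal sketch that relies only on the item names listed above; the columns inside each dataframe are not assumed.

```r
library(dplyr)
library(dtrackr)

graph = iris %>%
  track() %>%
  comment("A comment") %>%
  history()

# the graph is a list of class trackr_graph
class(graph)
names(graph)

# nodes and edges together describe the flowchart
graph$nodes
graph$edges

# the most recently added node(s)
graph$head
```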
Apply a set of inclusion criteria and record the actions of the
filter to the dtrackr history graph. Because of the ... filter specification,
all parameters MUST BE NAMED. This function is the opposite of
exclude_all(): the filtering criteria work to identify rows to
include, i.e. the results include anything that matches any of the criteria. If
na.rm=TRUE they also keep anything that cannot be evaluated by the criteria.
include_any(
  .data,
  ...,
  .headline = .defaultHeadline(),
  na.rm = TRUE,
  .type = "inclusion",
  .asOffshoot = FALSE,
  .tag = NULL
)
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against, items that match at least one of the predicates will be included. The RHS is a glue specification, defining the message, to be entered in the history graph for each predicate matched. This can refer to grouping variables, variables from the environment and {.included} and {.matched} or {.missing} (included = matched+missing), {.count} and {.total} - group and overall counts respectively, e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default TRUE) if the filter cannot be evaluated for a row count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "inclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of included items as a new stage
library(dplyr)
library(dtrackr)

iris %>% track() %>% dplyr::group_by(Species) %>%
  include_any(
    Petal.Length > 5 ~ "{.included} long ones",
    Petal.Length < 2 ~ "{.included} short ones"
  ) %>%
  history()

# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
  track() %>%
  include_any(
    # These two criteria identify the same value and one item is excluded
    a > 1 ~ "{.included} value > 1",
    a != min(a) ~ "{.included} everything but the smallest value",
  ) %>%
  status() %>%
  history()

# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a > 1, a != min(a)) %>%
  nrow()

# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
  track() %>%
  # Performing the same exclusion sequentially results in 2 items
  # being excluded as the criteria no longer identify the same
  # item.
  include_any(a > 1 ~ "{.included} value > 1") %>%
  include_any(a != min(a) ~ "{.included} everything but the smallest value") %>%
  status() %>%
  history()

# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a > 1) %>%
  dplyr::filter(a != min(a)) %>%
  nrow()
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::inner_join() for more details
on the underlying functions.
## S3 method for class 'trackr_df'
inner_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Inner join by {.keys}"
)
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::inner_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Inner join
join = lhs %>% inner_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}; in other respects each performs
exactly the same operation as the equivalent dplyr operation. See
dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
## S3 method for class 'trackr_df'
intersect(
  x,
  y,
  ...,
  .messages = "{.count.out} in intersection",
  .headline = "Intersection"
)
x, y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::intersect()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::left_join() for more details
on the underlying functions.
## S3 method for class 'trackr_df'
left_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Left join by {.keys}"
)
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::left_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Left join
join = lhs %>% left_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data, so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
mutate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::mutate()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# mutate
# In this example we compare the column names of the input and the
# output to identify the new columns created by the mutate operation as
# the `.new_cols` variable
iris %>%
  track() %>%
  mutate(extra_col = NA_real_,
         .messages="{.new_cols}",
         .headline="Extra columns from mutate:") %>%
  history()
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::nest_join() for more details
on the underlying functions.
## S3 method for class 'trackr_df'
nest_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} matched"),
  .headline = "Nest join by {.keys}"
)
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::nest_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Nest join
join = lhs %>% nest_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
tidyr::nest
A drop-in replacement for tidyr::nest() which optionally takes a
message and headline to store in the history graph.
## S3 method for class 'trackr_df'
nest(
  .data,
  ...,
  .by = NULL,
  .key = NULL,
  .names_sep = NULL,
  .messages = c("{.count.out} items"),
  .headline = ""
)
.data |
A data frame. |
... |
< Specified using name-variable pairs of the form
If not supplied, then
|
.by |
<
If not supplied, then |
.key |
The name of the resulting nested column. Only applicable when
If |
.names_sep |
If |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
the dataframe resulting from the tidyr::nest function but with
the history graph updated.
tidyr::nest()
library(dplyr)
library(dtrackr)

starwars %>%
  track() %>%
  tidyr::unnest(starships, keep_empty = TRUE) %>%
  tidyr::nest(world_data = c(-homeworld)) %>%
  history()

# There is a problem with `tidyr::unnest` that means if you want to override the
# `.messages` option at the moment it will most likely fail. Forcing the use of
# the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
# resolved in `tidyr`:
starwars %>%
  track() %>%
  p_unnest(
    films,
    .messages = c("{.count.in} characters", "{.count.out} appearances")
  ) %>%
  dplyr::group_by(gender) %>%
  tidyr::nest(
    people = c(-gender, -species, -homeworld),
    .messages = c("{.count.in} appearances", "{.count.out} planets")
  ) %>%
  status() %>%
  history()

# This example includes pivoting and nesting. The CMS patient care data
# has multiple tests per institution in a long format, and observed /
# denominator types. Firstly we pivot the data to allow us to easily calculate
# a total percentage for each institution. This is duplicated for every test
# so we nest the tests to get to one row per institution. Those institutions
# with invalid scores are excluded.
cms_history = tidyr::cms_patient_care %>%
  track() %>%
  tidyr::pivot_wider(names_from = type, values_from = score) %>%
  dplyr::mutate(
    percentage = sum(observed) / sum(denominator) * 100,
    .by = c(ccn, facility_name)
  ) %>%
  tidyr::nest(
    results = c(measure_abbr, observed, denominator),
    .messages = c("{.count.in} test results", "{.count.out} facilities")
  ) %>%
  exclude_all(
    percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
    na.rm = TRUE
  )

print(cms_history %>% dtrackr::history())

# not run in examples:
if (interactive()) {
  cms_history %>% flowchart()
}
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data, so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
p_add_count(x, ..., .messages = "", .headline = "", .tag = NULL)
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::add_count()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# add_count
# adding in a count or tally column as a new column
iris %>%
  track() %>%
  add_count(Species, name="new_count_total",
    .messages="{.new_cols}",
    # .messages="{.cols}",
    .headline="New columns from add_count:") %>%
  history()

# add_tally
iris %>%
  track() %>%
  group_by(Species) %>%
  dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
    .messages="{.new_cols}",
    .headline="New columns from add_tally:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data, so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
p_add_tally(x, ..., .messages = "", .headline = "", .tag = NULL)
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
< |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::add_tally()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# add_count
# adding in a count or tally column as a new column
iris %>%
  track() %>%
  add_count(Species, name="new_count_total",
    .messages="{.new_cols}",
    # .messages="{.cols}",
    .headline="New columns from add_count:") %>%
  history()

# add_tally
iris %>%
  track() %>%
  group_by(Species) %>%
  dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
    .messages="{.new_cols}",
    .headline="New columns from add_tally:") %>%
  history()
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::anti_join() for more details
on the underlying functions.
p_anti_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} not matched"),
  .headline = "Semi join by {.keys}"
)
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::anti_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Anti join
join = lhs %>% anti_join(rhs, by="name") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
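The three counts available to `.messages` can be pictured with a small base R sketch. It only illustrates what an anti join keeps, assuming a single key column. The data frames `lhs_demo` and `rhs_demo` are made up for the example, and the code does not call dtrackr itself.

```r
# Hypothetical data, for illustration only.
lhs_demo <- data.frame(name = c("Luke", "Leia", "Han", "Yoda"))
rhs_demo <- data.frame(
  name = c("Luke", "Luke", "Han"),
  film = c("A New Hope", "Return of the Jedi", "A New Hope")
)

# An anti join keeps the LHS rows whose key has no match on the RHS.
out_demo <- lhs_demo[!(lhs_demo$name %in% rhs_demo$name), , drop = FALSE]

count_lhs <- nrow(lhs_demo)  # the number {.count.lhs} refers to
count_rhs <- nrow(rhs_demo)  # the number {.count.rhs} refers to
count_out <- nrow(out_demo)  # the number {.count.out} refers to

print(c(lhs = count_lhs, rhs = count_rhs, out = count_out))
```

Duplicate keys on the right-hand side do not multiply rows in the output, which is why `count_out` can never exceed `count_lhs`.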
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data, so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
p_arrange(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::arrange()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# arrange
# In this case we sort the data descending and show the first value
# is the same as the maximum value.
iris %>%
  track() %>%
  arrange(
    desc(Petal.Width),
    .messages="{.count} items, columns: {.cols}",
    .headline="Reordered dataframe:") %>%
  history()
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). They calculate the total number of
resulting rows as {.count.out}; in other respects they perform exactly the same
operation as the equivalent dplyr operation. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
p_bind_cols(
  ...,
  .messages = "{.count.out} in combined set",
  .headline = "Bind columns"
)
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::bind_cols()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). They calculate the total number of
resulting rows as {.count.out}; in other respects they perform exactly the same
operation as the equivalent dplyr operation. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
p_bind_rows(..., .messages = "{.count.out} in union", .headline = "Union")
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::bind_rows()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Start capturing exclusions on a tracked dataframe.
p_capture_exclusions(.data, .capture = TRUE)
.data |
a tracked dataframe |
.capture |
Should we capture exclusions (things removed from the data
set). This is useful for debugging data issues but comes at a significant
cost. Defaults to the value of |
the .data dataframe with the exclusions flag set (or cleared if
.capture=FALSE).
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% capture_exclusions()
tmp %>% filter(Species!="versicolor") %>% history()
This is unlikely to be needed directly and is mostly an internal function.
p_clear(.data)
.data |
a dataframe which may be grouped |
the .data dataframe with the history graph removed
library(dplyr)
library(dtrackr)

mtcars %>% track() %>% comment("A comment") %>% p_clear() %>% history()
A comment can be any kind of note and is added once for every current
grouping as defined by the .messages field. It can be made context specific
by including variables such as {.count} and {.total} in .messages, which
refer to the grouped and ungrouped counts at this current stage of the
pipeline respectively. It can also pull in any global variable.
p_comment(
  .data,
  .messages = .defaultMessage(),
  .headline = .defaultHeadline(),
  .type = "info",
  .asOffshoot = (.type == "exclusion"),
  .tag = NULL
)
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} of all rows, the {.count} variable which is the count in each group and {.strata} a description of the group |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable (which is |
.type |
one of "info", ..., "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE, or TRUE when .type is "exclusion"). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the same .data dataframe with the history graph updated with the comment
library(dplyr)
library(dtrackr)

iris %>% track() %>% comment("hello {.total} rows") %>% history()
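To make the difference between `{.count}` and `{.total}` concrete, here is a base R sketch of the two numbers for `iris` grouped by `Species`. It uses `sprintf()` as a stand-in for the glue substitution and is not how dtrackr builds its messages.

```r
# Group sizes: the per-group number that {.count} refers to.
group_counts <- table(iris$Species)

# Overall size: the ungrouped number that {.total} refers to.
total <- nrow(iris)

# One message per group, mimicking a spec like "{.count} of {.total} rows".
msgs <- sprintf(
  "%s: %d of %d rows",
  names(group_counts), as.integer(group_counts), total
)
print(msgs)
```

On an ungrouped dataframe there is a single group, so the two numbers coincide.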
Copy the dtrackr history graph from one dataframe to another
p_copy(.data, from)
.data |
a dataframe which may be grouped |
from |
the dataframe to copy the history graph from |
the .data dataframe with the history graph of "from"
library(dplyr)
library(dtrackr)

mtcars %>% p_copy(iris %>% comment("A comment")) %>% history()
Simple count_if dplyr summary function
p_count_if(..., na.rm = TRUE)
... |
expression to be evaluated |
na.rm |
ignore NA values? |
a count of the number of times the expression evaluated to true, in the current context
library(dplyr)
library(dtrackr)

tmp = iris %>% dplyr::group_by(Species)
tmp %>% dplyr::summarise(long_ones = p_count_if(Petal.Length > 4))
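Conceptually, counting how often a condition holds is a sum over a logical vector. The helper `count_if_sketch()` below is a hypothetical stand-in written for this illustration, not dtrackr's implementation; it assumes `na.rm` is simply passed on to `sum()`.

```r
# Hypothetical stand-in for illustration; not part of dtrackr.
count_if_sketch <- function(condition, na.rm = TRUE) {
  sum(condition, na.rm = na.rm)
}

# Count long petals within each species using base R grouping.
long_ones <- tapply(iris$Petal.Length > 4, iris$Species, count_if_sketch)
print(long_ones)

# With na.rm = TRUE a missing value is skipped rather than propagated.
print(count_if_sketch(c(TRUE, NA, TRUE)))
```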
A frequent use case for more detailed description is to have a subgroup count within a flowchart. This works best for factor subgroup columns but other data will be converted to a factor automatically. The count of the items in each subgroup is added as a new stage in the flowchart.
p_count_subgroup(
  .data,
  .subgroup,
  ...,
  .messages = .defaultCountSubgroup(),
  .headline = .defaultHeadline(),
  .type = "info",
  .asOffshoot = FALSE,
  .tag = NULL,
  .maxsubgroups = .defaultMaxSupportedGroupings()
)
.data |
a dataframe which may be grouped |
.subgroup |
a column with a small number of levels (e.g. a factor) |
... |
passed to |
.messages |
a character vector of glue specifications. A glue specification can refer to anything from the calling environment, {.subgroup} for the subgroup column name and {.name} for the subgroup column value, {.count} for the subgroup column count, {.subtotal} for the current stratification grouping count and {.total} for the whole dataset count |
.headline |
a glue specification which can refer to grouping variables of .data, {.subtotal} for the current grouping count, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want to use the summary data from this step in the future then give it a name with .tag. |
.maxsubgroups |
the maximum number of discrete values allowed in
.subgroup is configurable with
|
the same .data dataframe with the history graph updated with a subgroup count as a new stage
library(dplyr)
library(dtrackr)

survival::cgd %>%
  track() %>%
  dplyr::group_by(treat) %>%
  count_subgroup(center) %>%
  history()
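The numbers reported by a subgroup count amount to a cross-tabulation of the grouping column against the subgroup column. The following base R sketch shows that idea on `mtcars`, with `am` playing the role of the grouping and `cyl` the subgroup. Both choices are only for illustration, and the code does not call dtrackr.

```r
# Cross-tabulate the grouping (am) against the subgroup (cyl).
subgroup_counts <- table(am = mtcars$am, cyl = mtcars$cyl)
print(subgroup_counts)

# Row sums correspond to the per-group {.subtotal};
# the grand total corresponds to {.total}.
subtotals <- rowSums(subgroup_counts)
total <- sum(subgroup_counts)
print(subtotals)
print(total)
```

Each cell of the table is the number a `{.count}` placeholder would show for one subgroup value within one group.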
Distinct acts in the same way as dplyr::distinct(). Prior to the operation
the size of the group is calculated as {.count.in}, and after the operation the
output size as {.count.out}. The group {.strata} is also available (if
grouped) for reporting. See dplyr::distinct().
p_distinct(
  .data,
  ...,
  .messages = "removing {.count.in-.count.out} duplicates",
  .headline = .defaultHeadline(),
  .tag = NULL
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with distinct values and history graph updated.
dplyr::distinct()
library(dplyr)
library(dtrackr)

tmp = bind_rows(
  iris %>% track(),
  iris %>% track() %>% filter(Petal.Length > 5)
)
tmp %>%
  dplyr::group_by(Species) %>%
  dplyr::distinct() %>%
  history()
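The default message, "removing {.count.in-.count.out} duplicates", is simple arithmetic on the row counts before and after de-duplication. A base R sketch with made-up data, using `unique()` and `sprintf()` in place of dplyr and glue:

```r
# Hypothetical data with repeated rows.
df <- data.frame(id = c(1, 1, 2, 3, 3, 3))

count_in <- nrow(df)           # rows before: the number {.count.in} refers to
count_out <- nrow(unique(df))  # rows after: the number {.count.out} refers to

msg <- sprintf("removing %d duplicates", count_in - count_out)
print(msg)
```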
Apply a set of filters and summarise the actions of the filter to the dtrackr
history graph. Because of the ... filter specification, all parameters MUST BE
NAMED. The filters work in a combinatorial manner, i.e. the results EXCLUDE ALL
rows that match any of the criteria. If na.rm = TRUE they also remove
anything that cannot be evaluated by any criteria.
p_exclude_all(
  .data,
  ...,
  .headline = .defaultHeadline(),
  na.rm = FALSE,
  .type = "exclusion",
  .asOffshoot = TRUE,
  .stage = (if (is.null(.tag)) "" else .tag),
  .tag = NULL
)
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match any of the predicates will be excluded. The RHS is a glue specification defining the message to be entered in the history graph for each predicate. This can refer to grouping variables, variables from the environment, {.excluded}, {.matched} or {.missing} (excluded = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default FALSE) if the filter cannot be evaluated for a row, count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = TRUE). |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of excluded items as a new offshoot stage
library(dplyr)
library(dtrackr)

iris %>%
  track() %>%
  capture_exclusions() %>%
  exclude_all(
    Petal.Length > 5 ~ "{.excluded} long ones",
    Petal.Length < 2 ~ "{.excluded} short ones"
  ) %>%
  history()

# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
  track() %>%
  exclude_all(
    # These two criteria identify the same value and one item is excluded
    a > 9 ~ "{.excluded} value > 9",
    a == max(a) ~ "{.excluded} max value"
  ) %>%
  status() %>%
  history()

# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a <= 9, a != max(a)) %>%
  nrow()

# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
  track() %>%
  # Performing the same exclusion sequentially results in 2 items
  # being excluded as the criteria no longer identify the same
  # item.
  exclude_all(a > 9 ~ "{.excluded} value > 9") %>%
  exclude_all(a == max(a) ~ "{.excluded} max value") %>%
  status() %>%
  history()

# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a <= 9) %>%
  dplyr::filter(a != max(a)) %>%
  nrow()
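How `na.rm` treats rows where a criterion cannot be evaluated can be sketched in base R. This only illustrates the behaviour described for the `na.rm` argument, using a made-up data frame; it is not dtrackr's implementation.

```r
# Hypothetical data: the criterion a > 9 is NA for the second row.
d <- data.frame(a = c(1, NA, 3, 10))
matched <- d$a > 9  # FALSE, NA, FALSE, TRUE

# na.rm = FALSE: drop only the rows that definitely match the criterion.
kept_na_rm_false <- d[!(matched %in% TRUE), , drop = FALSE]

# na.rm = TRUE: additionally drop rows where the criterion is NA.
kept_na_rm_true <- d[matched %in% FALSE, , drop = FALSE]

print(nrow(kept_na_rm_false))
print(nrow(kept_na_rm_true))
```

Using `%in%` rather than `==` keeps the comparison itself free of `NA`, so each row falls cleanly into "matched", "not matched" or "missing".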
Get the dtrackr excluded data record
p_excluded(.data, simplify = TRUE)
.data |
a dataframe which may be grouped |
simplify |
return a single summary dataframe of all exclusions. |
a new dataframe of the excluded data up to this point in the workflow. This dataframe is by default flattened, but if simplify=FALSE it has a nested structure containing the records excluded at each part of the pipeline.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% capture_exclusions()
tmp %>%
  exclude_all(
    Petal.Length > 5.8 ~ "{.excluded} long ones",
    Petal.Length < 1.3 ~ "{.excluded} short ones",
    .stage = "petal length exclusion"
  ) %>%
  excluded()
Filter acts in the same way as in dplyr where predicates which evaluate to
TRUE act to select items to include, and items for which the predicate cannot
be evaluated are excluded. For tracking prior to the filter operation the
size of each group is calculated {.count.in} and after the operation the
output size of each group {.count.out}. The grouping {.strata} is also
available (if grouped) for reporting. See dplyr::filter().
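The counting described above can be sketched in base R. This is only an illustration of the semantics (rows whose predicate is NA are dropped, and group sizes are recorded before and after), not dtrackr's implementation; the data frame and the names `count_in`, `count_out` and `excluded` are made up for the example.

```r
# Toy data: two groups, one row whose predicate cannot be evaluated (NA).
df <- data.frame(
  grp = c("a", "a", "a", "b", "b"),
  x   = c(1, 6, NA, 7, 2)
)

# The predicate is TRUE, FALSE or NA for each row.
keep <- df$x > 5
# As with dplyr::filter(), rows where the predicate is NA are not kept.
keep[is.na(keep)] <- FALSE

# Group sizes before and after, analogous to {.count.in} and {.count.out}.
count_in  <- vapply(split(keep, df$grp), length, integer(1))
count_out <- vapply(split(keep, df$grp), sum, integer(1))
excluded  <- count_in - count_out

print(rbind(count_in, count_out, excluded))
```

Each column of the printed matrix corresponds to one stratum, which is roughly the information a per-group message such as "excluded {.excluded} items" reports.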
p_filter( .data, ..., .messages = "excluded {.excluded} items", .headline = .defaultHeadline(), .type = "exclusion", .asOffshoot = (.type == "exclusion"), .stage = (if (is.null(.tag)) "" else .tag), .tag = NULL )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
the format type of the action, typically an exclusion |
.asOffshoot |
if the type is exclusion, |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then
give it a name with .tag. |
the filtered .data dataframe with history graph updated
dplyr::filter()

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% dplyr::group_by(Species)
    tmp %>% filter(Petal.Length > 5) %>% history()

Generate a flowchart of the history of the dataframe(s), with each step of the tracked data pipeline shown as a stage in the flowchart. Multiple dataframes can be plotted together, in which case an attempt is made to determine which parts are common.
p_flowchart( .data, filename = NULL, size = std_size$full, maxWidth = size$width, maxHeight = size$height, formats = c("dot", "png", "pdf", "svg"), defaultToHTML = TRUE, landscape = size$rot != 0, ... )
.data |
the tracked dataframe(s) either as a single dataframe or as a list of dataframes. |
filename |
a file name which will be where the formatted flowcharts are
saved. If no extension is specified the output formats are determined by
the |
size |
a named list with 3 elements: width and height in inches, and rotation (used in the defaults above as size$width, size$height and size$rot). A predefined set of standard sizes is available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
defaultToHTML |
if the correct output format is not easy to determine
from the context, default providing |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
ignored
Named arguments passed on to
|
the nature of the flowchart output depends on the context in which
the function is called. It will be some form of browsable HTML output if
called from an interactive session, or a PNG/PDF link if in knitr and
knitting LaTeX or Word type outputs. If a file name is specified the output
will also be saved at the given location.

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
    tmp %>% dplyr::group_by(Species) %>% comment(.tag="step2") %>% flowchart()

Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::full_join() for more details
on the underlying functions.
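As a rough illustration of the row counts that the join messages report, the following base R sketch uses `merge()` in place of the dplyr joins. The toy data frames and the `counts` vector are invented for the example, and `merge()` orders rows differently from dplyr, so only the counts are compared.

```r
# Toy inputs: key "c" only appears on the left, key "d" only on the right,
# and key "a" matches twice on the right.
lhs <- data.frame(key = c("a", "b", "c"), left_value = 1:3)
rhs <- data.frame(key = c("a", "a", "b", "d"), right_value = 11:14)

# Base R counterparts of the mutating joins (row counts agree with dplyr).
full  <- merge(lhs, rhs, by = "key", all = TRUE)    # cf. dplyr::full_join()
inner <- merge(lhs, rhs, by = "key")                # cf. dplyr::inner_join()
left  <- merge(lhs, rhs, by = "key", all.x = TRUE)  # cf. dplyr::left_join()

# Analogous to {.count.lhs}, {.count.rhs} and {.count.out} for each join type.
counts <- c(
  lhs   = nrow(lhs),
  rhs   = nrow(rhs),
  full  = nrow(full),
  inner = nrow(inner),
  left  = nrow(left)
)
print(counts)
```

Comparing the input sizes with the output size is what makes unexpected row multiplication or loss in a join visible in the history.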
p_full_join( x, y, ..., .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"), .headline = "Full join by {.keys}" )
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::full_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Full join
    join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

This provides the raw history graph and is not really intended for mainstream use. The internal structure of the graph is explained below. print and plot S3 methods exist for the dtrackr history graph.
p_get(.data)
.data |
a dataframe which may be grouped |
the history graph. This is a list, of class trackr_graph, containing the following named items:
excluded - the data items that have been excluded thus far as a nested dataframe
tags - a dataframe of tag-value pairs containing the summary of the data at named points in the data flow (see tagged())
nodes - a dataframe of the nodes of the flow chart
edges - an edge list (as a dataframe) of the relationships between the nodes in the flow chart
head - the current most recent nodes added into the graph as a dataframe.
The format of this data may grow over time but these fields are unlikely to be changed.

    library(dplyr)
    library(dtrackr)

    graph = iris %>% track() %>% comment("A comment") %>% history()
    print(graph)

(advanced usage) Outputs a dtrackr history graph as a DOT string for rendering with Graphviz.
p_get_as_dot( .data, fill = .defaultFill(), fontsize = .defaultFontSize(), colour = .defaultColour(), rankdir = .defaultDirection(), rounded = .defaultRounded(), fontname = .defaultFontName(), bgcolour = .defaultBgColour(), ... )
.data |
the tracked dataframe |
fill |
the default node fill colour, any R colour or hex value |
fontsize |
the default font size in points |
colour |
the default font colour, any R colour or hex value |
rankdir |
the dot rank direction (one of |
rounded |
should the node corners be rounded? |
fontname |
the font to use. Must exist on the system. |
bgcolour |
the background, may be "transparent", any R colour or hex value |
... |
not used |
a representation of the history graph in Graphviz dot format.

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
    dot = tmp %>% dplyr::group_by(Species) %>% comment(.tag="step2") %>% p_get_as_dot()
    cat(dot)

Grouping a data set acts in the normal way. When tracking a dataframe,
a group_by() operation will sometimes create a lot of groups. This happens,
for example, if you are doing a group_by(), summarise() step that is
aggregating data on a fine scale, e.g. by day in a time-series. This is
generally a terrible idea when tracking a dataframe, as the resulting
flowchart will have very many branches and be illegible. dtrackr will
detect this issue and pause tracking the dataframe with a warning. It is up
to the user to resume() tracking when the large number of groups has
been resolved, e.g. using dplyr::ungroup(). This limit is configurable
with options("dtrackr.max_supported_groupings"=XX). The default is 16. See
dplyr::group_by().
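One way to anticipate the pause is to count the groups a grouping column would create and compare that with the configured limit before grouping. The sketch below does this in base R; the option name is the one given above, while the toy time series and the variable names are assumptions made for the example.

```r
# Read the configured limit; fall back to 16, the default stated above,
# when the option has not been set.
max_groups <- getOption("dtrackr.max_supported_groupings", default = 16)

# Toy daily time series covering 60 consecutive days.
ts_data <- data.frame(
  day   = seq(as.Date("2024-01-01"), by = "day", length.out = 60),
  value = seq_len(60)
)

# Grouping by day would create one group per row of this series.
n_groups_by_day <- length(unique(ts_data$day))

# A coarser grouping variable keeps the number of strata small.
ts_data$month <- format(ts_data$day, "%Y-%m")
n_groups_by_month <- length(unique(ts_data$month))

too_many_by_day   <- n_groups_by_day > max_groups
too_many_by_month <- n_groups_by_month > max_groups
print(c(by_day = n_groups_by_day, by_month = n_groups_by_month))
```

With the default limit, grouping by day exceeds the threshold while grouping by month does not, which suggests either coarsening the grouping or raising the limit through options() for that step.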
p_group_by( .data, ..., .messages = "stratify by {.cols}", .headline = NULL, .tag = NULL, .maxgroups = .defaultMaxSupportedGroupings() )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by. |
.headline |
a headline glue spec. The glue code can use any global variable, or {.cols}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
.maxgroups |
the maximum number of subgroups allowed before the tracking is paused. |
the .data but grouped.
dplyr::group_by()

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% dplyr::group_by(Species, .messages="stratify by {.cols}")
    tmp %>% comment("{.strata}") %>% history()

Group modifying a data set acts in the normal way. The internal mechanics of
the modify function are opaque to the history. This means these can be used
to wrap any unsupported operation without losing the history (e.g. df %>% track() %>% group_modify(function(d,...) { d %>% unsupported_operation() })).
Prior to the operation the size of the group is calculated as {.count.in},
and after the operation the output size as {.count.out}. The group {.strata}
is also available (if grouped) for reporting. See dplyr::group_modify().
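The idea of treating the per-group function as a black box and recording only the group sizes on either side of it can be sketched with base R split/apply/combine. This is an illustration of the bookkeeping only, not how dtrackr is implemented; `top_row()` is a hypothetical stand-in for an unsupported operation.

```r
df <- data.frame(grp = c("a", "a", "a", "b", "b"), x = 1:5)

# Hypothetical opaque operation: keep the row with the largest x in a group.
top_row <- function(d) d[which.max(d$x), , drop = FALSE]

# Split into groups and record the size of each group going in.
pieces   <- split(df, df$grp)
count_in <- vapply(pieces, nrow, integer(1))    # cf. {.count.in}

# Apply the opaque operation; only its output size is inspected.
results   <- lapply(pieces, top_row)
count_out <- vapply(results, nrow, integer(1))  # cf. {.count.out}

# Recombine the per-group results into one data frame.
out <- do.call(rbind, results)
print(rbind(count_in, count_out))
```

Because only the sizes before and after are observed, any function can be wrapped this way, at the cost of the history saying nothing about what happened inside it.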
p_group_modify( .data, ..., .messages = NULL, .headline = .defaultHeadline(), .type = "modify", .tag = NULL )
.data |
A grouped tibble |
... |
Additional arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
default "modify": used to define formatting |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the transformed .data dataframe with the history graph updated.
dplyr::group_modify()

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% dplyr::group_by(Species)
    tmp %>%
      dplyr::group_modify(
        function(d,g,...) { return(tibble::tibble(x=stats::runif(10))) },
        .messages="{.count.in} in, {.count.out} out"
      ) %>%
      history()

Apply a set of inclusion criteria and record the actions of the
filter to the dtrackr history graph. Because of the ... filter specification,
all parameters MUST BE NAMED. This function is the opposite of
exclude_all() and the filtering criteria work to identify rows to
include i.e. the results include anything that match any of the criteria. If
na.rm=TRUE they also keep anything that cannot be evaluated by the criteria.
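The "match any of the criteria" logic can be sketched in base R as follows. The sketch only counts rows that match at least one criterion and rows that no criterion can evaluate; what happens to the latter in dtrackr is governed by na.rm and is deliberately not modelled here. The data and all variable names are made up for the example.

```r
df <- data.frame(len = c(1.2, 5.5, 3.0, NA, 6.1))

# Each criterion evaluates to TRUE, FALSE or NA per row.
criteria <- list(
  long  = df$len > 5,
  short = df$len < 2
)

# Rows definitely matched by each criterion (NA does not count as a match).
matched_per_criterion <- vapply(criteria, function(p) sum(p %in% TRUE), integer(1))

# A row matches overall if ANY criterion is TRUE for it.
any_match <- Reduce(`|`, lapply(criteria, function(p) p %in% TRUE))

# A row cannot be evaluated if every criterion returns NA for it.
cannot_evaluate <- Reduce(`&`, lapply(criteria, is.na))

n_matched <- sum(any_match)
n_missing <- sum(cannot_evaluate)
print(c(matched = n_matched, missing = n_missing))
```

Keeping the matched and unevaluable rows as separate counts mirrors the distinction between {.matched} and {.missing} in the message templates.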
p_include_any( .data, ..., .headline = .defaultHeadline(), na.rm = TRUE, .type = "inclusion", .asOffshoot = FALSE, .tag = NULL )
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against, items that match at least one of the predicates will be included. The RHS is a glue specification, defining the message, to be entered in the history graph for each predicate matched. This can refer to grouping variables, variables from the environment and {.included} and {.matched} or {.missing} (included = matched+missing), {.count} and {.total} - group and overall counts respectively, e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default TRUE) if the filter cannot be evaluated for a row count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "inclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of included items as a new stage

    library(dplyr)
    library(dtrackr)

    iris %>%
      track() %>%
      dplyr::group_by(Species) %>%
      include_any(
        Petal.Length > 5 ~ "{.included} long ones",
        Petal.Length < 2 ~ "{.included} short ones"
      ) %>%
      history()

    # simultaneous evaluation of criteria:
    data.frame(a = 1:10) %>%
      track() %>%
      include_any(
        # These two criteria identify the same value and one item is excluded
        a > 1 ~ "{.included} value > 1",
        a != min(a) ~ "{.included} everything but the smallest value",
      ) %>%
      status() %>%
      history()

    # the behaviour is equivalent to dplyr's filter function:
    data.frame(a=1:10) %>%
      dplyr::filter(a > 1, a != min(a)) %>%
      nrow()

    # step-wise evaluation of criteria results in a different output
    data.frame(a = 1:10) %>%
      track() %>%
      # Performing the same exclusion sequentially results in 2 items
      # being excluded as the criteria no longer identify the same
      # item.
      include_any(a > 1 ~ "{.included} value > 1") %>%
      include_any(a != min(a) ~ "{.included} everything but the smallest value") %>%
      status() %>%
      history()

    # the behaviour is equivalent to dplyr's filter function:
    data.frame(a=1:10) %>%
      dplyr::filter(a > 1) %>%
      dplyr::filter(a != min(a)) %>%
      nrow()

Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::inner_join() for more details
on the underlying functions.
p_inner_join( x, y, ..., .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"), .headline = "Inner join by {.keys}" )
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::inner_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Inner join
    join = lhs %>% inner_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history() %>% print()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}. In other respects they perform exactly the same
operation as the equivalent dplyr operation. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
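To make the expected output sizes concrete, the following base R sketch applies the same set operations to two vectors of row identifiers. It stands in for the data frame versions only in terms of counts; the identifiers and the `sizes` vector are invented for the example.

```r
# Row identifiers for two overlapping subsets of the same data:
# lhs holds two humans and a droid, rhs the same two humans and a gungan.
lhs <- c("human_1", "human_2", "droid_1")
rhs <- c("human_1", "human_2", "gungan_1")

sizes <- c(
  bind_rows = length(c(lhs, rhs)),        # duplicates kept, like union_all
  union     = length(union(lhs, rhs)),    # duplicates removed
  intersect = length(intersect(lhs, rhs)),
  setdiff   = length(setdiff(lhs, rhs))   # in lhs but not in rhs
)
print(sizes)
```

Note that the difference is asymmetric: only the droid survives `setdiff(lhs, rhs)`, because the gungan is not in `lhs` to begin with.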
p_intersect( x, y, ..., .messages = "{.count.out} in intersection", .headline = "Intersection" )
x, y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::intersect()

    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)
    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    # setdiff(lhs, rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
    set = setdiff(lhs,rhs) %>% comment("{.count} droids")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::left_join() for more details
on the underlying functions.
p_left_join( x, y, ..., .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"), .headline = "Left join by {.keys}" )
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::left_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Left join
    join = lhs %>% left_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data, so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
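When a message is supplied, the column changes it can report come from comparing column names before and after the operation. A minimal base R sketch of that comparison is shown below; it is an illustration rather than dtrackr's code, and `before`, `after`, `new_cols` and `dropped_cols` are names chosen for the example.

```r
before <- iris

# Add one column and drop another, standing in for a mutate() / select() step.
after <- transform(before, extra_col = NA_real_)
after$Species <- NULL

# Columns present only in the output, cf. {.new_cols}.
new_cols <- setdiff(names(after), names(before))
# Columns present only in the input, cf. {.dropped_cols}.
dropped_cols <- setdiff(names(before), names(after))

print(list(new_cols = new_cols, dropped_cols = dropped_cols))
```

Since the comparison is by name, a column that is overwritten in place would appear in neither list.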
p_mutate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::mutate()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # mutate
    # In this example we compare the column names of the input and the
    # output to identify the new columns created by the mutate operation as
    # the `.new_cols` variable
    iris %>%
      track() %>%
      mutate(extra_col = NA_real_,
             .messages="{.new_cols}",
             .headline="Extra columns from mutate:") %>%
      history()

tidyr::nest
A drop-in replacement for tidyr::nest() which optionally takes a
message and headline to store in the history graph.
p_nest( .data, ..., .by = NULL, .key = NULL, .names_sep = NULL, .messages = c("{.count.out} items"), .headline = "" )
.data |
A data frame. |
... |
< Specified using name-variable pairs of the form
If not supplied, then
|
.by |
<
If not supplied, then |
.key |
The name of the resulting nested column. Only applicable when
If |
.names_sep |
If |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
the dataframe result of the tidyr::nest function but with
the history graph updated.
tidyr::nest()

    library(dplyr)
    library(dtrackr)

    starwars %>%
      track() %>%
      tidyr::unnest(starships, keep_empty = TRUE) %>%
      tidyr::nest(world_data = c(-homeworld)) %>%
      history()

    # There is a problem with `tidyr::unnest` that means if you want to override the
    # `.messages` option at the moment it will most likely fail. Forcing the use of
    # the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
    # resolved in `tidyr`:
    starwars %>%
      track() %>%
      p_unnest(
        films,
        .messages = c("{.count.in} characters", "{.count.out} appearances")
      ) %>%
      dplyr::group_by(gender) %>%
      tidyr::nest(
        people = c(-gender, -species, -homeworld),
        .messages = c("{.count.in} appearances", "{.count.out} planets")
      ) %>%
      status() %>%
      history()

    # This example includes pivoting and nesting. The CMS patient care data
    # has multiple tests per institution in a long format, and observed /
    # denominator types. Firstly we pivot the data to allow us to easily calculate
    # a total percentage for each institution. This is duplicated for every test
    # so we nest the tests to get to one row per institution. Those institutions
    # with invalid scores are excluded.
    cms_history = tidyr::cms_patient_care %>%
      track() %>%
      tidyr::pivot_wider(names_from = type, values_from = score) %>%
      dplyr::mutate(
        percentage = sum(observed) / sum(denominator) * 100,
        .by = c(ccn, facility_name)
      ) %>%
      tidyr::nest(
        results = c(measure_abbr, observed, denominator),
        .messages = c("{.count.in} test results", "{.count.out} facilities")
      ) %>%
      exclude_all(
        percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
        na.rm = TRUE
      )

    print(cms_history %>% dtrackr::history())

    # not run in examples:
    if (interactive()) {
      cms_history %>% flowchart()
    }

Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::nest_join() for more details
on the underlying functions.

    p_nest_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} matched"),
      .headline = "Nest join by {.keys}"
    )

x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::nest_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets
    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name, films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Nest join
    join = lhs %>% nest_join(rhs, by = "name") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history() %>% print()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

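The `.messages` and `.headline` glue specs documented for this join can be overridden. The following is a minimal sketch rather than an official example: it calls `p_nest_join()` directly and uses only the `{.keys}`, `{.count.lhs}`, `{.count.rhs}` and `{.count.out}` variables listed in the arguments. The wording of the messages is purely illustrative.

```r
library(dplyr)
library(dtrackr)

# Same example data as in the join examples: one row per person,
# and one row per person / film appearance.
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name, films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track()
rhs = films %>% track()

# Override the default messages and headline with custom glue specs.
join = lhs %>%
  p_nest_join(
    rhs,
    by = "name",
    .messages = c(
      "{.count.lhs} people",
      "{.count.rhs} film appearances",
      "{.count.out} rows after nesting"
    ),
    .headline = "Films nested by {.keys}"
  )

join %>% history()
```

A nest join keeps every row of the left-hand side, so the output row count here should equal the number of rows in `people`.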
Pausing tracking of a data frame may be required if an operation is about to
be performed that creates a lot of groupings, or that you otherwise do not
want recorded in the history graph (e.g. selecting something using
an anti-join). Once paused, the history is not updated until resume() is
called or, if auto is enabled, until the data frame is ungrouped.

    p_pause(.data, auto = FALSE)

.data |
a tracked dataframe |
auto |
if |
the .data dataframe with history graph tracking paused

    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% pause() %>% history()

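A slightly fuller sketch of the pause / resume pattern described above, assuming the usual `library(dplyr)` and `library(dtrackr)` setup. The filter condition and comment text are illustrative only. The filter runs while tracking is paused, so it should not add its own stage to the history.

```r
library(dplyr)
library(dtrackr)

# Track the data and record a starting comment.
tracked = iris %>%
  track() %>%
  comment("{.total} flowers")

# Pause tracking, carry out an operation we do not want in the history,
# then resume tracking before continuing the pipeline.
result = tracked %>%
  pause() %>%
  filter(Sepal.Length > 5) %>%
  resume() %>%
  comment("{.total} flowers after an untracked filter")

result %>% history()
```

The data itself is still filtered while paused; only the recording of the step in the history graph is suspended.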
tidyr::pivot_longer
A drop in replacement for tidyr::pivot_longer() which optionally takes a
message and headline to store in the history graph.

    p_pivot_longer(
      data,
      cols,
      ...,
      cols_vary = "fastest",
      names_to = "name",
      names_prefix = NULL,
      names_sep = NULL,
      names_pattern = NULL,
      names_ptypes = NULL,
      names_transform = NULL,
      names_repair = "check_unique",
      values_to = "value",
      values_drop_na = FALSE,
      values_ptypes = NULL,
      values_transform = NULL,
      .messages = c("long format", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

data |
A data frame to pivot. |
cols |
< |
... |
Additional arguments passed on to methods. |
cols_vary |
When pivoting
|
names_to |
A character vector specifying the new column or columns to
create from the information stored in the column names of
|
names_prefix |
A regular expression used to remove matching text from the start of each variable name. |
names_sep, names_pattern
|
If
If these arguments do not give you enough control, use
|
names_ptypes, values_ptypes
|
Optionally, a list of column name-prototype
pairs. Alternatively, a single empty prototype can be supplied, which will
be applied to all columns. A prototype (or ptype for short) is a
zero-length vector (like |
names_transform, values_transform
|
Optionally, a list of column
name-function pairs. Alternatively, a single function can be supplied,
which will be applied to all columns. Use these arguments if you need to
change the types of specific columns. For example, If not specified, the type of the columns generated from |
names_repair |
What happens if the output has invalid column names?
The default, |
values_to |
A string specifying the name of the column to create
from the data stored in cell values. If |
values_drop_na |
If |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
the result of tidyr::pivot_longer() with the history graph updated.
tidyr::pivot_longer()

    library(dplyr)
    library(dtrackr)

    starwars %>%
      track() %>%
      tidyr::unnest(starships, keep_empty = TRUE) %>%
      tidyr::nest(world_data = c(-homeworld)) %>%
      history()

    # There is a problem with `tidyr::unnest` that means if you want to override the
    # `.messages` option at the moment it will most likely fail. Forcing the use of
    # the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
    # resolved in `tidyr`:
    starwars %>%
      track() %>%
      p_unnest(
        films,
        .messages = c("{.count.in} characters", "{.count.out} appearances")
      ) %>%
      dplyr::group_by(gender) %>%
      tidyr::nest(
        people = c(-gender, -species, -homeworld),
        .messages = c("{.count.in} appearances", "{.count.out} planets")
      ) %>%
      status() %>%
      history()

    # This example includes pivoting and nesting. The CMS patient care data
    # has multiple tests per institution in a long format, and observed /
    # denominator types. Firstly we pivot the data to allow us to easily calculate
    # a total percentage for each institution. This is duplicated for every test
    # so we nest the tests to get to one row per institution. Those institutions
    # with invalid scores are excluded.
    cms_history = tidyr::cms_patient_care %>%
      track() %>%
      tidyr::pivot_wider(names_from = type, values_from = score) %>%
      dplyr::mutate(
        percentage = sum(observed) / sum(denominator) * 100,
        .by = c(ccn, facility_name)
      ) %>%
      tidyr::nest(
        results = c(measure_abbr, observed, denominator),
        .messages = c("{.count.in} test results", "{.count.out} facilities")
      ) %>%
      exclude_all(
        percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
        na.rm = TRUE
      )

    print(cms_history %>% dtrackr::history())

    # not run in examples:
    if (interactive()) {
      cms_history %>% flowchart()
    }

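The shared example above does not call the long-format pivot itself, so here is a minimal sketch of `p_pivot_longer()` on the built-in `iris` data. The column names `measurement` and `value` and the message wording are illustrative choices, and the `{.count.in}` / `{.count.out}` variables are taken from the default `.messages` shown in the usage.

```r
library(dplyr)
library(dtrackr)

# Pivot the four measurement columns of iris into a long format,
# recording the row counts before and after in the history graph.
long = iris %>%
  track() %>%
  p_pivot_longer(
    cols = -Species,
    names_to = "measurement",
    values_to = "value",
    .messages = c("{.count.in} flowers", "{.count.out} measurements"),
    .headline = "Pivot to long format"
  )

long %>% history()
```

Each of the 150 flowers contributes four measurements, so the pivoted data frame has four times as many rows as the input.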
tidyr::pivot_wider
A drop in replacement for tidyr::pivot_wider() which optionally takes a
message and headline to store in the history graph.

    p_pivot_wider(
      data,
      ...,
      id_cols = NULL,
      id_expand = FALSE,
      names_from = name,
      names_prefix = "",
      names_sep = "_",
      names_glue = NULL,
      names_sort = FALSE,
      names_vary = "fastest",
      names_expand = FALSE,
      names_repair = "check_unique",
      values_from = value,
      values_fill = NULL,
      values_fn = NULL,
      unused_fn = NULL,
      .messages = c("wide format", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

data |
A data frame to pivot. |
... |
Additional arguments passed on to methods. |
id_cols |
< Defaults to all columns in |
id_expand |
Should the values in the |
names_from, values_from
|
< If |
names_prefix |
String added to the start of every variable name. This is
particularly useful if |
names_sep |
If |
names_glue |
Instead of |
names_sort |
Should the column names be sorted? If |
names_vary |
When
|
names_expand |
Should the values in the |
names_repair |
What happens if the output has invalid column names?
The default, |
values_fill |
Optionally, a (scalar) value that specifies what each
This can be a named list if you want to apply different fill values to different value columns. |
values_fn |
Optionally, a function applied to the value in each cell
in the output. You will typically use this when the combination of
This can be a named list if you want to apply different aggregations
to different |
unused_fn |
Optionally, a function applied to summarize the values from
the unused columns (i.e. columns not identified by The default drops all unused columns from the result. This can be a named list if you want to apply different aggregations to different unused columns.
This is similar to grouping by the |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
the result of tidyr::pivot_wider() applied to the data dataframe, with
the history graph updated.
tidyr::pivot_wider()

    library(dplyr)
    library(dtrackr)

    starwars %>%
      track() %>%
      tidyr::unnest(starships, keep_empty = TRUE) %>%
      tidyr::nest(world_data = c(-homeworld)) %>%
      history()

    # There is a problem with `tidyr::unnest` that means if you want to override the
    # `.messages` option at the moment it will most likely fail. Forcing the use of
    # the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
    # resolved in `tidyr`:
    starwars %>%
      track() %>%
      p_unnest(
        films,
        .messages = c("{.count.in} characters", "{.count.out} appearances")
      ) %>%
      dplyr::group_by(gender) %>%
      tidyr::nest(
        people = c(-gender, -species, -homeworld),
        .messages = c("{.count.in} appearances", "{.count.out} planets")
      ) %>%
      status() %>%
      history()

    # This example includes pivoting and nesting. The CMS patient care data
    # has multiple tests per institution in a long format, and observed /
    # denominator types. Firstly we pivot the data to allow us to easily calculate
    # a total percentage for each institution. This is duplicated for every test
    # so we nest the tests to get to one row per institution. Those institutions
    # with invalid scores are excluded.
    cms_history = tidyr::cms_patient_care %>%
      track() %>%
      tidyr::pivot_wider(names_from = type, values_from = score) %>%
      dplyr::mutate(
        percentage = sum(observed) / sum(denominator) * 100,
        .by = c(ccn, facility_name)
      ) %>%
      tidyr::nest(
        results = c(measure_abbr, observed, denominator),
        .messages = c("{.count.in} test results", "{.count.out} facilities")
      ) %>%
      exclude_all(
        percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
        na.rm = TRUE
      )

    print(cms_history %>% dtrackr::history())

    # not run in examples:
    if (interactive()) {
      cms_history %>% flowchart()
    }

Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().

    p_reframe(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::reframe()

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% dplyr::group_by(Species) %>% track()

    tmp %>%
      dplyr::reframe(
        dplyr::tibble(
          param = c("mean", "min", "max"),
          value = c(mean(Petal.Length), min(Petal.Length), max(Petal.Length))
        ),
        .messages = "length {param}: {value}"
      ) %>%
      history()

See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages, or
.headline values in which case they behave just like a comment().

    p_relocate(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::relocate()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # relocate, this shows how the columns can be reordered
    iris %>%
      track() %>%
      group_by(Species) %>%
      relocate(
        tidyselect::starts_with("Sepal"),
        .after = Species,
        .messages = "{.cols}",
        .headline = "Order of columns from relocate:"
      ) %>%
      history()

See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages, or
.headline values in which case they behave just like a comment().

    p_rename(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::rename()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # rename can show us which columns are new and which have been
    # removed (with .dropped_cols)
    iris %>%
      track() %>%
      group_by(Species) %>%
      rename(
        Stamen.Width = Sepal.Width,
        Stamen.Length = Sepal.Length,
        .messages = c("added {.new_cols}", "dropped {.dropped_cols}"),
        .headline = "Renamed columns:"
      ) %>%
      history()

See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages, or
.headline values in which case they behave just like a comment().

    p_rename_with(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::rename_with()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # rename can show us which columns are new and which have been
    # removed (with .dropped_cols)
    iris %>%
      track() %>%
      group_by(Species) %>%
      rename(
        Stamen.Width = Sepal.Width,
        Stamen.Length = Sepal.Length,
        .messages = c("added {.new_cols}", "dropped {.dropped_cols}"),
        .headline = "Renamed columns:"
      ) %>%
      history()

Resuming may reset the grouping of the tracked data if the grouping structure
has changed since the data frame was paused. If you try to resume tracking a
data frame with too many groups (as defined by options("dtrackr.max_supported_groupings"=XX))
then the resume will fail and the data frame will remain paused. This limit can
be overridden by specifying a value for the .maxgroups parameter.

    p_resume(.data, ...)

.data |
a tracked dataframe |
... |
Named arguments passed on to
|
the .data data frame with history graph tracking resumed

    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% pause() %>% resume() %>% history()

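The grouping limit described above can be sketched as follows. This is a hedged illustration, not an official example: the option value of 16 and the `.maxgroups` value of 10 are arbitrary, and passing `.maxgroups` through `...` is an assumption based on the description of `p_resume()` above.

```r
library(dplyr)
library(dtrackr)

# Remember the current option so it can be restored afterwards.
# The value 16 is an illustrative choice.
old = options("dtrackr.max_supported_groupings" = 16)

resumed = iris %>%
  track() %>%
  pause() %>%
  # Grouping happens while tracking is paused, so it is not recorded yet.
  group_by(Species) %>%
  # Assumption: .maxgroups is forwarded through `...` as described above.
  resume(.maxgroups = 10)

resumed %>% history()

# Restore the previous option value.
options(old)
```

With only three species the grouping is well below either limit, so the resume is expected to succeed and the data keeps all of its rows.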
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::right_join() for more details
on the underlying functions.

    p_right_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
      .headline = "Right join by {.keys}"
    )

x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::right_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets
    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name, films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Right join
    join = lhs %>%
      right_join(rhs, by = "name", multiple = "all") %>%
      comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages, or
.headline values in which case they behave just like a comment().

    p_select(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::select()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # select
    # The output of the select verb (here using tidyselect syntax) can be captured
    # and here all column names are being reported with the .cols variable.
    iris %>%
      track() %>%
      group_by(Species) %>%
      select(
        tidyselect::starts_with("Sepal"),
        .messages = "{.cols}",
        .headline = "Output columns from select:"
      ) %>%
      history()

Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::semi_join() for more details
on the underlying functions.

    p_semi_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in intersection"),
      .headline = "Semi join by {.keys}"
    )

x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::semi_join()

    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Semi join
    join = lhs %>% semi_join(rhs, by="name") %>% comment("joined {.total}")
    # See what the history of the graph is:
    join %>% history() %>% print()
    nrow(join)
    # Display the tracked graph (not run in examples)
    # join %>% flowchart()

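The default messages can be replaced with glue specs built from the variables listed in the arguments above. The following is a small sketch on made-up data; the tables patients and consented and their columns are hypothetical and only serve to make the counts easy to check by hand.

```r
library(dplyr)
library(dtrackr)

# Two small illustrative tables (hypothetical data, not from the package)
patients = tibble(id = 1:5, age = c(34, 51, 47, 29, 62)) %>% track()
consented = tibble(id = c(2L, 3L, 5L)) %>% track()

# Keep only patients with a matching consent record, and describe the
# join in the history using the documented glue variables
kept = patients %>%
  semi_join(
    consented,
    by = "id",
    .messages = c(
      "{.count.lhs} patients",
      "{.count.rhs} consent records",
      "{.count.out} patients retained"
    ),
    .headline = "Consented only (by {.keys})"
  )

nrow(kept)
kept %>% history()
```

The rows returned are those dplyr::semi_join() would return; only the recorded history differs.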
This is unlikely to be useful to an end user and is called automatically by many of the other functions here. It is available on the off chance that you need to copy history metadata from one dataframe to another.

    p_set(.data, .graph)

.data |
a dataframe which may be grouped |
.graph |
a history graph list (consisting of nodes, edges, and head) see examples |
the .data dataframe with the history graph metadata set to the provided value

    library(dplyr)
    library(dtrackr)

    mtcars %>% p_set(iris %>% comment("A comment") %>% p_get()) %>% history()

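To make the shape of the .graph argument more concrete, the sketch below splits the example into two steps and inspects the intermediate object. It assumes only what the argument description states, namely that the graph is a list containing nodes, edges and head; any other elements it may hold are not relied on here.

```r
library(dplyr)
library(dtrackr)

# Extract the history graph from a tracked dataframe
graph = iris %>% track() %>% comment("A comment") %>% p_get()

# The graph is a plain list; the documented elements are nodes, edges and head
names(graph)

# Attach that history to an unrelated dataframe
copied = mtcars %>% p_set(graph)
copied %>% history()
```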
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is made available as {.count.out}; in all other respects they perform exactly the same
operation as the equivalent dplyr operation. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.

    p_setdiff(
      x,
      y,
      ...,
      .messages = "{.count.out} items in difference",
      .headline = "Difference"
    )

x, y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::setdiff()

    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)

    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    # rows in lhs that are not in rhs, i.e. the droids
    set = setdiff(lhs,rhs) %>% comment("{.count} droids")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

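For a difference that can be checked by hand, a smaller sketch may help. The two single-column tables below are invented for illustration; the .messages and .headline values show how {.count.out} can be reported with custom wording.

```r
library(dplyr)
library(dtrackr)

# Two overlapping id ranges (hypothetical data): 4, 5 and 6 appear in both
a = tibble(id = 1:6) %>% track()
b = tibble(id = 4:9) %>% track()

# Rows of `a` that do not appear in `b`
only_a = setdiff(
  a, b,
  .messages = "{.count.out} rows found only in a",
  .headline = "a minus b"
)

nrow(only_a)
only_a %>% history()
```

The history of the result should contain the stages of both inputs followed by the difference stage.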
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), and dplyr::slice_sample()
for more details on the underlying functions.

    p_slice(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice()

    library(dplyr)
    library(dtrackr)

    # an arbitrary 50 items from the iris dataframe is selected. The
    # history is tracked
    iris %>% track() %>% slice(51:100) %>% history()

Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), and dplyr::slice_sample()
for more details on the underlying functions.

    p_slice_head(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_head()

    library(dplyr)
    library(dtrackr)

    # the first 50% of the data frame is taken and the history tracked
    iris %>% track() %>% group_by(Species) %>%
      slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
                 .headline="First {sprintf('%1.0f',prop*100)}%") %>%
      history()

    # The last 100 items:
    iris %>% track() %>% group_by(Species) %>%
      slice_tail(n=100,.messages="{.count.out} / {.count.in}",
                 .headline="Last 100") %>%
      history()

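Because the slice is applied within each group, the before and after counts are per-stratum. The sketch below, which uses only arguments shown in the examples above, takes a fixed number of rows per species so the resulting size is easy to verify.

```r
library(dplyr)
library(dtrackr)

# Take the first 10 rows of each species; iris has 3 species of 50 rows each
per_group = iris %>%
  track() %>%
  group_by(Species) %>%
  slice_head(
    n = 10,
    .messages = "{.count.out} of {.count.in} kept",
    .headline = "First 10 {Species}"
  )

# 3 groups x 10 rows
nrow(per_group)
per_group %>% history()
```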
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), and dplyr::slice_sample()
for more details on the underlying functions.

    p_slice_max(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_max()

    library(dplyr)
    library(dtrackr)

    # Subset the data by the maximum of a given value
    iris %>% track() %>% group_by(Species) %>%
      slice_max(prop=0.5, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} = {prop} (with ties)",
                .headline="Widest 50% Sepals") %>%
      history()

    # The narrowest 25% of the iris data set by group can be calculated in the
    # slice_min() function. Recording this is a matter of tracking and
    # using glue specs.
    iris %>% track() %>% group_by(Species) %>%
      slice_min(prop=0.25, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} (with ties)",
                .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
      history()

Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), and dplyr::slice_sample()
for more details on the underlying functions.

    p_slice_min(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_min()

    library(dplyr)
    library(dtrackr)

    # Subset the data by the maximum of a given value
    iris %>% track() %>% group_by(Species) %>%
      slice_max(prop=0.5, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} = {prop} (with ties)",
                .headline="Widest 50% Sepals") %>%
      history()

    # The narrowest 25% of the iris data set by group can be calculated in the
    # slice_min() function. Recording this is a matter of tracking and
    # using glue specs.
    iris %>% track() %>% group_by(Species) %>%
      slice_min(prop=0.25, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} (with ties)",
                .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
      history()

Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), and dplyr::slice_sample()
for more details on the underlying functions.

    p_slice_sample(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_sample()

    library(dplyr)
    library(dtrackr)

    # In this example the iris dataframe is resampled with replacement,
    # taking 100 rows within each group.
    iris %>% track() %>% group_by(Species) %>%
      slice_sample(n=100, replace=TRUE,
                   .messages="{.count.out} / {.count.in} = {n}",
                   .headline="100 {Species}") %>%
      history()

Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), and dplyr::slice_sample()
for more details on the underlying functions.

    p_slice_tail(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_tail()

    library(dplyr)
    library(dtrackr)

    # the first 50% of the data frame is taken and the history tracked
    iris %>% track() %>% group_by(Species) %>%
      slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
                 .headline="First {sprintf('%1.0f',prop*100)}%") %>%
      history()

    # The last 100 items:
    iris %>% track() %>% group_by(Species) %>%
      slice_tail(n=100,.messages="{.count.out} / {.count.in}",
                 .headline="Last 100") %>%
      history()

In the middle of a pipeline you may wish to document something about the data
that is more complex than the simple counts. status is essentially a
dplyr summarisation step which is connected to a glue specification
output, that is recorded in the data frame history. This means you can do an
arbitrary interim summarisation and put the result into the flowchart without
disrupting the pipeline flow.

    p_status(
      .data,
      ...,
      .messages = .defaultMessage(),
      .headline = .defaultHeadline(),
      .type = "info",
      .asOffshoot = FALSE,
      .tag = NULL
    )

.data |
a dataframe which may be grouped |
... |
any normal dplyr::summarise specification, e.g. |
.messages |
a character vector of glue specifications. A glue specification can refer to the summary outputs, any grouping variables of .data, the {.strata}, or any variables defined in the calling environment |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Because they are passed through ..., the summary specification parameters MUST BE NAMED.
the same .data dataframe with the history metadata updated with the status inserted as a new stage

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% dplyr::group_by(Species)
    tmp %>% status(
      long = p_count_if(Petal.Length>5),
      short = p_count_if(Petal.Length<2),
      .messages="{Species}: {long} long ones & {short} short ones"
    ) %>% history()

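The summary expressions are not limited to counts. As a sketch, assuming any ordinary dplyr::summarise expression is accepted as the argument description states, the example below mixes a count with a mean and formats the result in the glue spec. The names heavy and mean_mpg are illustrative, and each must be given explicitly because the summaries are passed through the dots.

```r
library(dplyr)
library(dtrackr)

out = mtcars %>%
  track() %>%
  group_by(cyl) %>%
  status(
    # named summaries, evaluated per group
    heavy = p_count_if(wt > 3.5),
    mean_mpg = mean(mpg),
    .messages = "{heavy} heavy cars; mean mpg {sprintf('%1.1f', mean_mpg)}",
    .headline = "{cyl} cylinders"
  )

# status() only annotates the history: the data passes through unchanged
nrow(out)
out %>% history()
```

Since the dataframe itself is returned untouched, the pipeline can continue with further filtering or summarising after the status() call.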
Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().

    p_summarise(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::summarise()

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% dplyr::group_by(Species) %>% track()
    tmp %>% dplyr::summarise(avg = mean(Petal.Length), .messages="{avg} length") %>% history()

Any counts or other details that were stored with a .tag option at an individual pipeline step can be recovered here. The idea is to provide a quick way to access a single value
for the counts or other details tagged in a pipeline, in a format that can be reported in the text of a document (e.g. for a results section). The consort statement vignette
has further examples of use.

    p_tagged(.data, .tag = NULL, .strata = NULL, .glue = NULL, ...)

.data |
the tracked dataframe. |
.tag |
(optional) the tag to retrieve. |
.strata |
(optional) filter the tagged data by the strata. set to "" to filter just the top level ungrouped data. |
.glue |
(optional) a glue specification which will be applied to the tagged content to generate a .label column |
... |
(optional) any other named parameters will be passed to |
various things depending on what is requested.
By default a tibble with a .tag column and all associated summary values in a nested .content column.
If .strata is specified the results are filtered to just those that match the given .strata grouping (i.e. this will be the grouping label on the flowchart). Ungrouped content will have an empty "" as .strata.
If .tag is specified the result will be for a single tag and .content will be automatically un-nested to give a single un-nested dataframe of the content captured at the .tag tagged step.
This could be single or multiple rows depending on whether the original data was grouped at the point of tagging.
If both .tag and .glue are specified, a .label column will be computed from .glue and the tagged content. If the result of this is a single row then just the string value of .label is returned.
If just .glue is specified, the result is an un-nested dataframe with .tag, .strata and .label columns, with a label for each tag in each strata.
If this seems complex then the best thing is to experiment until you get the output you want, leaving any .glue options until you think you know what you are doing. It made sense at the time.

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% comment(.tag = "step1")
    tmp = tmp %>% filter(Species!="versicolor") %>% dplyr::group_by(Species)
    tmp %>% comment(.tag="step2") %>% tagged(.glue = "{.count}/{.total}")

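The different return shapes described above can be compared side by side. This is a sketch that relies only on the behaviour stated in the value description; the tag names start and after_filter are arbitrary, and the pipeline is kept ungrouped so that the third call yields a single row.

```r
library(dplyr)
library(dtrackr)

tmp = iris %>%
  track() %>%
  comment(.tag = "start") %>%
  filter(Species != "versicolor") %>%
  comment(.tag = "after_filter")

# 1. No arguments: one entry per tag, with the captured values nested in .content
all_tags = tmp %>% tagged()

# 2. A single tag: the content captured at that step, un-nested
one_tag = tmp %>% tagged(.tag = "after_filter")

# 3. A single tag plus a glue spec: for ungrouped data this is a single row,
#    so just the label string is expected back
label = tmp %>% tagged(.tag = "after_filter", .glue = "{.count}/{.total}")

all_tags
one_tag
label
```

The third form is the one most directly usable in inline text of a report, since it reduces the tagged step to a formatted string.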
Start tracking the dtrackr history graph

    p_track(
      .data,
      .messages = .defaultMessage(),
      .headline = .defaultHeadline(),
      .tag = NULL
    )

.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue
specification can refer to any grouping variables of .data, or any
variables defined in the calling environment, the {.total} variable which
is the count of all rows, the {.count} variable which is the count of
rows in the current group and the {.strata} which describes the current
group. Defaults to the value of |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable which is |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with additional history graph metadata, to allow tracking.

    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% history()

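The first stage of the history can be given custom wording. A minimal sketch, assuming the {.total} variable behaves as described in the .messages argument; the message and headline text are illustrative only.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>%
  track(
    .messages = "{.total} flowers measured",
    .headline = "Iris data"
  )

# Tracking adds metadata only; the rows and columns are untouched
dim(tracked)
tracked %>% history()
```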
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to leave these out of the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().

    p_transmute(.data, ..., .messages = "", .headline = "", .tag = NULL)

.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::transmute()

    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # In this example we compare the column names of the input and the
    # output to identify the new columns created by the transmute operation as
    # the `.new_cols` variable
    # Here we do the same for a transmute()
    iris %>%
      track() %>%
      group_by(Species, .add=TRUE) %>%
      transmute(
        sepal.w = Sepal.Width-1,
        sepal.l = Sepal.Length+1,
        .messages="{.new_cols}",
        .headline="New columns from transmute:") %>%
      history()

Un-grouping a data set logically combines the different arms. In the history
this joins any stratified branches and acts as a specific type of status(),
allowing you to generate some summary statistics about the un-grouped data.
See dplyr::ungroup().

    p_ungroup(
      x,
      ...,
      .messages = .defaultMessage(),
      .headline = .defaultHeadline(),
      .tag = NULL
    )

x |
A |
... |
variables to remove from the grouping. |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count}. The default is "total {.count} items" |
.headline |
a headline glue spec. The glue code can use {.count} and {.strata}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe but ungrouped with the history graph updated showing the ungroup operation as a new stage.
dplyr::ungroup()

    library(dplyr)
    library(dtrackr)

    tmp = iris %>% dplyr::group_by(Species) %>% comment("A test")
    tmp %>% dplyr::ungroup(.messages="{.count} items in combined") %>% history()

These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is made available as {.count.out}; in all other respects they perform exactly the same
operation as the equivalent dplyr operation. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.

    p_union(
      x,
      y,
      ...,
      .messages = "{.count.out} unique items in union",
      .headline = "Distinct union"
    )

x, y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::union()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
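The difference between the distinct union and the other set operations can be seen on a much smaller input. The following is a minimal sketch using plain dplyr only; the tibbles `a` and `b` are hypothetical and exist purely for illustration. The tracked versions are described above as performing the same operation as their dplyr equivalents, so the row counts here are what {.count.out} would be expected to report.

```r
library(dplyr)

# Two small, hypothetical data frames that share exactly one row (id = 2)
a = tibble(id = c(1, 2))
b = tibble(id = c(2, 3))

n_union = nrow(union(a, b))          # duplicates removed
n_union_all = nrow(union_all(a, b))  # duplicates kept
n_intersect = nrow(intersect(a, b))  # rows present in both
n_setdiff = nrow(setdiff(a, b))      # rows in a that are not in b

c(union = n_union, union_all = n_union_all,
  intersect = n_intersect, setdiff = n_setdiff)
```

Note that setdiff() is not symmetric: swapping the arguments returns the rows of `b` that are missing from `a`.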
These perform set operations on tracked dataframes. They merge the histories
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}; in all other respects each performs
exactly the same operation as the equivalent dplyr function. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
p_union_all(
  x,
  y,
  ...,
  .messages = "{.count.out} items in union",
  .headline = "Union"
)
x, y
|
Pair of compatible data frames. A pair of data frames is compatible if they have the same column names (possibly in different orders) and compatible types. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::union_all()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
tidyr::unnest
A drop in replacement for tidyr::unnest() which optionally takes a
message and headline to store in the history graph.
p_unnest(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  .drop = deprecated(),
  .id = deprecated(),
  .sep = deprecated(),
  .preserve = deprecated(),
  .messages = "",
  .headline = ""
)
data |
A data frame. |
cols |
< When selecting multiple columns, values from the same row will be recycled to their common size. |
... |
|
keep_empty |
By default, you get one row of output for each element
of the list that you are unchopping/unnesting. This means that if there's a
size-0 element (like |
ptype |
Optionally, a named list of column name-prototype pairs to
coerce |
names_sep |
If |
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
.drop, .preserve
|
|
.id |
|
.sep |
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
the result of the tidyr::unnest but with a history graph
updated.
tidyr::unnest()
library(dplyr)
library(dtrackr)

starwars %>%
  track() %>%
  tidyr::unnest(starships, keep_empty = TRUE) %>%
  tidyr::nest(world_data = c(-homeworld)) %>%
  history()

# There is a problem with `tidyr::unnest` that means if you want to override the
# `.messages` option at the moment it will most likely fail. Forcing the use of
# the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
# resolved in `tidyr`:

starwars %>%
  track() %>%
  p_unnest(
    films,
    .messages = c("{.count.in} characters", "{.count.out} appearances")
  ) %>%
  dplyr::group_by(gender) %>%
  tidyr::nest(
    people = c(-gender, -species, -homeworld),
    .messages = c("{.count.in} appearances", "{.count.out} planets")
  ) %>%
  status() %>%
  history()

# This example includes pivoting and nesting. The CMS patient care data
# has multiple tests per institution in a long format, and observed /
# denominator types. Firstly we pivot the data to allow us to easily calculate
# a total percentage for each institution. This is duplicated for every test
# so we nest the tests to get to one row per institution. Those institutions
# with invalid scores are excluded.

cms_history = tidyr::cms_patient_care %>%
  track() %>%
  tidyr::pivot_wider(names_from = type, values_from = score) %>%
  dplyr::mutate(
    percentage = sum(observed) / sum(denominator) * 100,
    .by = c(ccn, facility_name)
  ) %>%
  tidyr::nest(
    results = c(measure_abbr, observed, denominator),
    .messages = c("{.count.in} test results", "{.count.out} facilities")
  ) %>%
  exclude_all(
    percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
    na.rm = TRUE
  )

print(cms_history %>% dtrackr::history())

# not run in examples:
if (interactive()) {
  cms_history %>% flowchart()
}
Remove tracking from the dataframe
p_untrack(.data)
.data |
a tracked dataframe |
the .data dataframe with history graph metadata removed.
library(dplyr)
library(dtrackr)

iris %>% track() %>% untrack() %>% class()
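One way to confirm that tracking has been removed is to test for the S3 class used by the tracked methods. This is a sketch that assumes a tracked data frame carries the 'trackr_df' class (the class named in the S3 method signatures elsewhere in this manual) and that untrack() removes it; the variable names `tracked` and `plain` are illustrative only.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>% track()
plain = tracked %>% untrack()

# Assumption: tracked data frames are identified by the 'trackr_df' S3 class
was_tracked = inherits(tracked, "trackr_df")
still_tracked = inherits(plain, "trackr_df")

# Untracking is only expected to drop the history metadata, not any rows
same_rows = nrow(plain) == nrow(iris)

c(was_tracked = was_tracked, still_tracked = still_tracked, same_rows = same_rows)
```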
Pausing tracking of a data frame may be required if an operation is about to
be performed that creates a lot of groupings, or that you otherwise don't
want polluting the history graph (e.g. maybe selecting something using
an anti-join). Once paused, the history is not updated until resume() is
called, or until the data frame is ungrouped (if auto is enabled).
pause(.data, auto = FALSE)
.data |
a tracked dataframe |
auto |
if |
the .data dataframe with history graph tracking paused
iris %>% track() %>% pause() %>% history()
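The example above only pauses. A fuller sketch of the pause and resume pattern described earlier might look like the following; it assumes that resume() can be called with the paused data frame as its only argument, and the filter step and message text are illustrative choices rather than part of the package documentation.

```r
library(dplyr)
library(dtrackr)

paused_history = iris %>%
  track() %>%
  comment("before pausing") %>%
  pause() %>%
  # this step runs while tracking is paused, so it is not expected
  # to add a stage to the history graph
  filter(Species != "setosa") %>%
  resume() %>%
  comment("after resuming: {.count} rows") %>%
  history()

print(paused_history)
```

Comparing the printed history with and without the pause() and resume() calls is a simple way to see which stages are suppressed.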
tidyr::pivot_longer
A drop in replacement for tidyr::pivot_longer() which optionally takes a
message and headline to store in the history graph.
## S3 method for class 'trackr_df'
pivot_longer(
  data,
  cols,
  ...,
  cols_vary = "fastest",
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL,
  .messages = c("long format", "{.count.in} before", "{.count.out} after"),
  .headline = .defaultHeadline()
)
data |
A data frame to pivot. |
cols |
< |
... |
Additional arguments passed on to methods. |
cols_vary |
When pivoting
|
names_to |
A character vector specifying the new column or columns to
create from the information stored in the column names of
|
names_prefix |
A regular expression used to remove matching text from the start of each variable name. |
names_sep, names_pattern
|
If
If these arguments do not give you enough control, use
|
names_ptypes, values_ptypes
|
Optionally, a list of column name-prototype
pairs. Alternatively, a single empty prototype can be supplied, which will
be applied to all columns. A prototype (or ptype for short) is a
zero-length vector (like |
names_transform, values_transform
|
Optionally, a list of column
name-function pairs. Alternatively, a single function can be supplied,
which will be applied to all columns. Use these arguments if you need to
change the types of specific columns. For example, If not specified, the type of the columns generated from |
names_repair |
What happens if the output has invalid column names?
The default, |
values_to |
A string specifying the name of the column to create
from the data stored in cell values. If |
values_drop_na |
If |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to c("long format", "{.count.in} before", "{.count.out} after"). |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to .defaultHeadline(). |
the result of the tidyr::pivot_longer but with a history graph
updated.
tidyr::pivot_longer()
library(dplyr)
library(dtrackr)

starwars %>%
  track() %>%
  tidyr::unnest(starships, keep_empty = TRUE) %>%
  tidyr::nest(world_data = c(-homeworld)) %>%
  history()

# There is a problem with `tidyr::unnest` that means if you want to override the
# `.messages` option at the moment it will most likely fail. Forcing the use of
# the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
# resolved in `tidyr`:

starwars %>%
  track() %>%
  p_unnest(
    films,
    .messages = c("{.count.in} characters", "{.count.out} appearances")
  ) %>%
  dplyr::group_by(gender) %>%
  tidyr::nest(
    people = c(-gender, -species, -homeworld),
    .messages = c("{.count.in} appearances", "{.count.out} planets")
  ) %>%
  status() %>%
  history()

# This example includes pivoting and nesting. The CMS patient care data
# has multiple tests per institution in a long format, and observed /
# denominator types. Firstly we pivot the data to allow us to easily calculate
# a total percentage for each institution. This is duplicated for every test
# so we nest the tests to get to one row per institution. Those institutions
# with invalid scores are excluded.

cms_history = tidyr::cms_patient_care %>%
  track() %>%
  tidyr::pivot_wider(names_from = type, values_from = score) %>%
  dplyr::mutate(
    percentage = sum(observed) / sum(denominator) * 100,
    .by = c(ccn, facility_name)
  ) %>%
  tidyr::nest(
    results = c(measure_abbr, observed, denominator),
    .messages = c("{.count.in} test results", "{.count.out} facilities")
  ) %>%
  exclude_all(
    percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
    na.rm = TRUE
  )

print(cms_history %>% dtrackr::history())

# not run in examples:
if (interactive()) {
  cms_history %>% flowchart()
}
tidyr::pivot_wider
A drop in replacement for tidyr::pivot_wider() which optionally takes a
message and headline to store in the history graph.
## S3 method for class 'trackr_df'
pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL,
  .messages = c("wide format", "{.count.in} before", "{.count.out} after"),
  .headline = .defaultHeadline()
)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods. |
id_cols |
< Defaults to all columns in |
id_expand |
Should the values in the |
names_from, values_from
|
< If |
names_prefix |
String added to the start of every variable name. This is
particularly useful if |
names_sep |
If |
names_glue |
Instead of |
names_sort |
Should the column names be sorted? If |
names_vary |
When
|
names_expand |
Should the values in the |
names_repair |
What happens if the output has invalid column names?
The default, |
values_fill |
Optionally, a (scalar) value that specifies what each
This can be a named list if you want to apply different fill values to different value columns. |
values_fn |
Optionally, a function applied to the value in each cell
in the output. You will typically use this when the combination of
This can be a named list if you want to apply different aggregations
to different |
unused_fn |
Optionally, a function applied to summarize the values from
the unused columns (i.e. columns not identified by The default drops all unused columns from the result. This can be a named list if you want to apply different aggregations to different unused columns.
This is similar to grouping by the |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to c("wide format", "{.count.in} before", "{.count.out} after"). |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to .defaultHeadline(). |
the result of the tidyr::pivot_wider function but with the
history graph updated.
tidyr::pivot_wider()
library(dplyr)
library(dtrackr)

starwars %>%
  track() %>%
  tidyr::unnest(starships, keep_empty = TRUE) %>%
  tidyr::nest(world_data = c(-homeworld)) %>%
  history()

# There is a problem with `tidyr::unnest` that means if you want to override the
# `.messages` option at the moment it will most likely fail. Forcing the use of
# the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
# resolved in `tidyr`:

starwars %>%
  track() %>%
  p_unnest(
    films,
    .messages = c("{.count.in} characters", "{.count.out} appearances")
  ) %>%
  dplyr::group_by(gender) %>%
  tidyr::nest(
    people = c(-gender, -species, -homeworld),
    .messages = c("{.count.in} appearances", "{.count.out} planets")
  ) %>%
  status() %>%
  history()

# This example includes pivoting and nesting. The CMS patient care data
# has multiple tests per institution in a long format, and observed /
# denominator types. Firstly we pivot the data to allow us to easily calculate
# a total percentage for each institution. This is duplicated for every test
# so we nest the tests to get to one row per institution. Those institutions
# with invalid scores are excluded.

cms_history = tidyr::cms_patient_care %>%
  track() %>%
  tidyr::pivot_wider(names_from = type, values_from = score) %>%
  dplyr::mutate(
    percentage = sum(observed) / sum(denominator) * 100,
    .by = c(ccn, facility_name)
  ) %>%
  tidyr::nest(
    results = c(measure_abbr, observed, denominator),
    .messages = c("{.count.in} test results", "{.count.out} facilities")
  ) %>%
  exclude_all(
    percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
    na.rm = TRUE
  )

print(cms_history %>% dtrackr::history())

# not run in examples:
if (interactive()) {
  cms_history %>% flowchart()
}
Plots a history graph as html
## S3 method for class 'trackr_graph'
plot(x, ...)
x |
a dtrackr history graph (e.g. output from |
... |
Named arguments passed on to
|
HTML displayed
library(dplyr)
library(dtrackr)

iris %>% comment("hello {.total} rows") %>% history() %>% plot()
Print a history graph to the console
## S3 method for class 'trackr_graph'
print(x, ...)
x |
a dtrackr history graph (e.g. output from |
... |
not used |
nothing
library(dplyr)
library(dtrackr)

iris %>% comment("hello {.total} rows") %>% history() %>% print()
Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().
## S3 method for class 'trackr_df'
reframe(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::reframe()
library(dplyr)
library(dtrackr)

tmp = iris %>% dplyr::group_by(Species) %>% track()
tmp %>%
  dplyr::reframe(
    dplyr::tibble(
      param = c("mean","min","max"),
      value = c(mean(Petal.Length), min(Petal.Length), max(Petal.Length))
    ),
    .messages="length {param}: {value}"
  ) %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
relocate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::relocate()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# relocate, this shows how the columns can be reordered
iris %>%
  track() %>%
  group_by(Species) %>%
  relocate(
    tidyselect::starts_with("Sepal"),
    .after=Species,
    .messages="{.cols}",
    .headline="Order of columns from relocate:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on the underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
rename_with(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::rename_with()
    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # rename can show us which columns are new and which have been
    # removed (with .dropped_cols)
    iris %>%
      track() %>%
      group_by(Species) %>%
      rename(
        Stamen.Width = Sepal.Width,
        Stamen.Length = Sepal.Length,
        .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
        .headline="Renamed columns:") %>%
      history()
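The example above demonstrates rename(). As a minimal sketch of rename_with() itself, the following assumes that the renaming function (here base R's toupper) is forwarded to dplyr::rename_with() through the ... argument, and that {.new_cols} and {.dropped_cols} are populated as described for .messages. The headline text is illustrative only.

```r
library(dplyr)
library(dtrackr)

# Upper-case every column name and record the change in the history.
# Assumption: the first unnamed argument in ... is used as the renaming
# function by the underlying dplyr::rename_with() call.
renamed = iris %>%
  track() %>%
  rename_with(
    toupper,
    .messages = c("added {.new_cols}", "dropped {.dropped_cols}"),
    .headline = "Upper-cased columns:")

# The data is renamed exactly as plain dplyr would do it
colnames(renamed)

# and the history should now contain an extra stage for the rename
renamed %>% history()
```

Leaving .messages and .headline at their defaults would perform the same rename without adding a stage to the history.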
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages, or
.headline values in which case they behave just like a comment().
    ## S3 method for class 'trackr_df'
    rename(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::rename()
    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # rename can show us which columns are new and which have been
    # removed (with .dropped_cols)
    iris %>%
      track() %>%
      group_by(Species) %>%
      rename(
        Stamen.Width = Sepal.Width,
        Stamen.Length = Sepal.Length,
        .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
        .headline="Renamed columns:") %>%
      history()
This may reset the grouping of the tracked data if the grouping structure
has changed since the data frame was paused. If you try and resume tracking a
data frame with too many groups (as defined by options("dtrackr.max_supported_groupings"=XX))
then the resume will fail and the data frame will still be paused. This can
be overridden by specifying a value for the .maxgroups parameter.
    resume(.data, ...)
.data |
a tracked dataframe |
... |
Named arguments passed on to
|
the .data data frame with history graph tracking resumed
    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% pause() %>% resume() %>% history()
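A slightly fuller sketch of the situation described above: tracking is paused before an operation that creates a large number of groups, and resumed once the data has been ungrouped again. The column name n_in_group and the comment text are illustrative, and how much detail is recorded while paused is not asserted here.

```r
library(dplyr)
library(dtrackr)

out = iris %>%
  track() %>%
  # pause before creating many small groups, which would otherwise
  # produce an unwieldy history graph
  pause() %>%
  group_by(Sepal.Length, Sepal.Width) %>%
  mutate(n_in_group = n()) %>%
  ungroup() %>%
  # the grouping is back to a supported size, so tracking can resume
  resume() %>%
  comment("{.count} rows after per-group work")

out %>% history()
```

Ungrouping before resume() keeps the number of groups below the limit set by options("dtrackr.max_supported_groupings"), so the resume should not fail.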
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::right_join() for more details
on the underlying functions.
    ## S3 method for class 'trackr_df'
    right_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
      .headline = "Right join by {.keys}"
    )
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::right_join()
    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Full join
    join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()
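The example above uses full_join(). As a small sketch of right_join() specifically, the following uses two hand-made tibbles (people and visits are hypothetical data, not part of the package) so that the row counts are easy to check by eye. It relies on the default .messages and .headline shown in the usage.

```r
library(dplyr)
library(dtrackr)

# Hypothetical example data: Cat has no visits, Dan is not a known person
people = tibble(name = c("Ann", "Bob", "Cat"), age = c(30, 41, 25))
visits = tibble(name = c("Ann", "Ann", "Dan"), clinic = c("A", "B", "A"))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = visits %>% track() %>% comment("Visits df {.total}")

# A right join keeps every row of the right hand side (visits)
join = lhs %>%
  right_join(rhs, by = "name") %>%
  comment("joined {.total}")

# The history contains the stages of both inputs plus the join itself
join %>% history()
nrow(join)
```

Because every row of visits is retained and each name appears at most once in people, the result has one row per visit.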
Convert a digraph in dot format to SVG and save it to a range of output file types
    save_dot(
      dot,
      filename,
      size = std_size$half,
      maxWidth = size$width,
      maxHeight = size$height,
      formats = c("dot", "png", "pdf", "svg"),
      landscape = size$rot != 0,
      ...
    )
dot |
a |
filename |
the full path of the file name (minus extension for multiple formats) |
size |
a named list with 3 elements, length and width in inches and rotation. A predefined set of standard sizes are available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
ignored |
a list with items paths with the absolute paths of the saved files
as a named list, and svg as the SVG string of the rendered dot file.
    save_dot("digraph {A->B}",tempfile())
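As a sketch of how the return value described above might be inspected, the following restricts the output to two of the listed formats and looks at the paths and svg items. The choice of formats is illustrative, and the rendering backends available on a given system are an assumption.

```r
library(dtrackr)

# Write only the dot and svg versions of a tiny graph to a temporary location
out = save_dot(
  "digraph {A->B}",
  tempfile(),
  formats = c("dot", "svg")
)

# A named list of the files that were written
out$paths

# The rendered SVG as a string, e.g. for embedding in HTML output
substr(out$svg, 1, 60)
```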
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages, or
.headline values in which case they behave just like a comment().
    ## S3 method for class 'trackr_df'
    select(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::select()
    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # select
    # The output of the select verb (here using tidyselect syntax) can be captured
    # and here all column names are being reported with the .cols variable.
    iris %>%
      track() %>%
      group_by(Species) %>%
      select(
        tidyselect::starts_with("Sepal"),
        .messages="{.cols}",
        .headline="Output columns from select:") %>%
      history()
Mutating joins behave as dplyr joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::semi_join() for more details
on the underlying functions.
    ## S3 method for class 'trackr_df'
    semi_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in intersection"),
      .headline = "Semi join by {.keys}"
    )
x, y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::semi_join()
    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Semi join
    join = lhs %>% semi_join(rhs, by="name") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history() %>% print()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()
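The .messages and .headline defaults can be replaced with custom glue specs. The sketch below uses hypothetical people and visits tibbles and combines {.count.out}, {.count.lhs} and {.keys} from the parameter descriptions above; the wording of the messages is illustrative only.

```r
library(dplyr)
library(dtrackr)

# Hypothetical example data: only Ann has any visits recorded
people = tibble(name = c("Ann", "Bob", "Cat"), age = c(30, 41, 25))
visits = tibble(name = c("Ann", "Ann", "Dan"), clinic = c("A", "B", "A"))

lhs = people %>% track()
rhs = visits %>% track()

# Keep only people with at least one visit, and describe the step
# in terms that make sense for this particular analysis
with_visits = lhs %>%
  semi_join(
    rhs,
    by = "name",
    .messages = "{.count.out} of {.count.lhs} people have at least one visit",
    .headline = "People matched to visits by {.keys}")

with_visits %>% history()
nrow(with_visits)
```

A semi join never duplicates or adds rows from the right hand side, so the result contains a single row for Ann even though she has two visits.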
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is available as {.count.out}; in all other respects each performs
exactly the same operation as the equivalent dplyr function. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
    ## S3 method for class 'trackr_df'
    setdiff(
      x,
      y,
      ...,
      .messages = "{.count.out} items in difference",
      .headline = "Difference"
    )
x, y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::setdiff()
    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)

    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    # (rows in lhs that are not in rhs, i.e. the droids)
    set = setdiff(lhs,rhs) %>% comment("{.count} droids")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()
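For a case small enough to verify by hand, the following sketch applies setdiff() to two hypothetical one-column tibbles and overrides the default .messages and .headline. Only {.count.out} is used in the glue spec, as described in the arguments above.

```r
library(dplyr)
library(dtrackr)

# Hypothetical example data: ids 4 and 5 appear on both sides
a = tibble(id = 1:5) %>% track() %>% comment("{.count} ids in a")
b = tibble(id = 4:8) %>% track() %>% comment("{.count} ids in b")

# Rows of a that do not appear in b: ids 1, 2 and 3
only_a = setdiff(
  a, b,
  .messages = "{.count.out} ids only in a",
  .headline = "a minus b")

# The history merges the stages of both inputs
only_a %>% history()
nrow(only_a)
```

Note that setdiff() is not symmetric: swapping the arguments would instead return ids 6, 7 and 8.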
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample(),
for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    slice_head(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_head()
    library(dplyr)
    library(dtrackr)

    # the first 50% of the data frame, is taken and the history tracked
    iris %>% track() %>% group_by(Species) %>%
      slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
                 .headline="First {sprintf('%1.0f',prop*100)}%") %>%
      history()

    # The last 100 items:
    iris %>% track() %>% group_by(Species) %>%
      slice_tail(n=100,.messages="{.count.out} / {.count.in}",
                 .headline="Last 100") %>%
      history()
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample(),
for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    slice_max(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_max()
    library(dplyr)
    library(dtrackr)

    # Subset the data by the maximum of a given value
    iris %>% track() %>% group_by(Species) %>%
      slice_max(prop=0.5, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} = {prop} (with ties)",
                .headline="Widest 50% Sepals") %>%
      history()

    # The narrowest 25% of the iris data set by group can be calculated in the
    # slice_min() function. Recording this is a matter of tracking and
    # using glue specs.
    iris %>%
      track() %>%
      group_by(Species) %>%
      slice_min(prop=0.25, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} (with ties)",
                .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
      history()
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample(),
for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    slice_min(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_min()
    library(dplyr)
    library(dtrackr)

    # Subset the data by the maximum of a given value
    iris %>% track() %>% group_by(Species) %>%
      slice_max(prop=0.5, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} = {prop} (with ties)",
                .headline="Widest 50% Sepals") %>%
      history()

    # The narrowest 25% of the iris data set by group can be calculated in the
    # slice_min() function. Recording this is a matter of tracking and
    # using glue specs.
    iris %>%
      track() %>%
      group_by(Species) %>%
      slice_min(prop=0.25, order_by = Sepal.Width,
                .messages="{.count.out} / {.count.in} (with ties)",
                .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
      history()
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample(),
for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    slice_sample(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_sample()
    library(dplyr)
    library(dtrackr)

    # In this example the iris dataframe is resampled 100 times with replacement
    # within each group.
    iris %>%
      track() %>%
      group_by(Species) %>%
      slice_sample(n=100, replace=TRUE,
                   .messages="{.count.out} / {.count.in} = {n}",
                   .headline="100 {Species}") %>%
      history()
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample(),
for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    slice_tail(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_tail()
    library(dplyr)
    library(dtrackr)

    # the first 50% of the data frame, is taken and the history tracked
    iris %>% track() %>% group_by(Species) %>%
      slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
                 .headline="First {sprintf('%1.0f',prop*100)}%") %>%
      history()

    # The last 100 items:
    iris %>% track() %>% group_by(Species) %>%
      slice_tail(n=100,.messages="{.count.out} / {.count.in}",
                 .headline="Last 100") %>%
      history()
Slice operations behave as in dplyr, except that the history graph of the
tracked dataframe is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample(),
for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    slice(
      .data,
      ...,
      .messages = c("subset data", "{.count.in} before", "{.count.out} after"),
      .headline = .defaultHeadline()
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice()
    library(dplyr)
    library(dtrackr)

    # an arbitrary 50 items from the iris dataframe is selected. The
    # history is tracked
    iris %>% track() %>% slice(51:100) %>% history()
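The .messages description above also lists {.excluded} for the difference between input and output sizes. A minimal sketch of a slice step reporting all three counts follows; it assumes {.excluded} is populated for ungrouped data in the same way as {.count.in} and {.count.out}, and the message wording is illustrative.

```r
library(dplyr)
library(dtrackr)

# Keep the first 10 rows and report how many rows went in,
# how many were kept, and how many were dropped by the slice.
first_ten = iris %>%
  track() %>%
  slice_head(
    n = 10,
    .messages = c(
      "{.count.in} before",
      "{.count.out} kept",
      "{.excluded} dropped"),
    .headline = "First 10 rows")

first_ten %>% history()
nrow(first_ten)
```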
In the middle of a pipeline you may wish to document something about the data
that is more complex than the simple counts. status is essentially a
dplyr summarisation step which is connected to a glue specification
output, that is recorded in the data frame history. This means you can do an
arbitrary interim summarisation and put the result into the flowchart without
disrupting the pipeline flow.
status(
  .data,
  ...,
  .messages = .defaultMessage(),
  .headline = .defaultHeadline(),
  .type = "info",
  .asOffshoot = FALSE,
  .tag = NULL
)
.data |
a dataframe which may be grouped |
... |
any normal dplyr::summarise specification, e.g. |
.messages |
a character vector of glue specifications. A glue specification can refer to the summary outputs, any grouping variables of .data, the {.strata}, or any variables defined in the calling environment |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Because of the ... summary specification parameters MUST BE NAMED.
the same .data dataframe with the history metadata updated with the status inserted as a new stage
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% dplyr::group_by(Species)
tmp %>% status(
  long = p_count_if(Petal.Length>5),
  short = p_count_if(Petal.Length<2),
  .messages="{Species}: {long} long ones & {short} short ones"
) %>% history()
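The .tag parameter described above can be paired with tagged() to pull the interim summary back out of the history later in an analysis. The following is a minimal sketch rather than a definitive recipe: the tag name "petal_check" and the variable names tracked and petal_summary are hypothetical labels chosen for illustration, and the exact columns of the returned content are not asserted here.

```r
library(dplyr)
library(dtrackr)

# Tag the status step so its summary values can be retrieved later.
# "petal_check" is an arbitrary, illustrative tag name.
tracked <- iris %>%
  track() %>%
  group_by(Species) %>%
  status(
    long = p_count_if(Petal.Length > 5),
    .messages = "{Species}: {long} long ones",
    .tag = "petal_check"
  )

# Retrieve the content captured at the tagged step. Because the data was
# grouped when it was tagged, this may contain one row per group.
petal_summary <- tracked %>% tagged(.tag = "petal_check")
print(petal_summary)
```

Tagging at the point of summarisation keeps the reported numbers tied to the pipeline step that produced them, rather than recomputing them separately.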
A list of standard paper sizes for outputting flowcharts or other dot
graphs. These include width and height dimensions in inches and can be
used as one way to specify the output size of a dot graph, including
flowcharts (see the size parameter of flowchart()).
std_size
An object of class list of length 12.
The sizes available are A4, A5, full (fits a portrait A4 with margins), half (half an
A4 with margins), third, two_third, quarter, sixth (all with reference to
an A4 page with margins). There are 2 landscape sizes A4_landscape and full_landscape which
fit an A4 page with or without margins. There are also 2 slide dimensions,
to fit with standard presentation software dimensions.
This is just a convenience. Similar effects can be achieved by providing width and height
parameters to flowchart() directly.
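As a rough illustration of the convenience described above, the sketch below lists the available size names and passes one of them to the size parameter of flowchart(). It assumes that the list elements are accessed by name with `$` (for example std_size$half); the rendering step is guarded with interactive() because it depends on the local graphics setup.

```r
library(dplyr)
library(dtrackr)

# List the names of the available standard sizes
size_names <- names(std_size)
print(size_names)

# Pass one of the standard sizes to flowchart() through its size parameter.
# Rendering is only attempted in an interactive session.
if (interactive()) {
  iris %>%
    track() %>%
    group_by(Species) %>%
    flowchart(size = std_size$half)
}
```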
Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().
## S3 method for class 'trackr_df'
summarise(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::summarise()
library(dplyr)
library(dtrackr)

tmp = iris %>% dplyr::group_by(Species) %>% track()
tmp %>% dplyr::summarise(avg = mean(Petal.Length), .messages="{avg} length") %>% history()
Any counts at the individual stages that were stored with a .tag option in a pipeline step can be recovered here. The idea is to provide a quick way to access a single value
for the counts or other details tagged in a pipeline, in a format that can be reported in the text of a document (e.g. for a results section). The consort statement vignette
has further examples of use.
tagged(.data, .tag = NULL, .strata = NULL, .glue = NULL, ...)
.data |
the tracked dataframe. |
.tag |
(optional) the tag to retrieve. |
.strata |
(optional) filter the tagged data by the strata. set to "" to filter just the top level ungrouped data. |
.glue |
(optional) a glue specification which will be applied to the tagged content to generate a |
... |
(optional) any other named parameters will be passed to |
The return value depends on what is requested.
By default a tibble with a .tag column and all associated summary values in a nested .content column.
If a .strata column is specified the results are filtered to just those that match a given .strata grouping (i.e. this will be the grouping label on the flowchart). Ungrouped content will have an empty "" as .strata.
If .tag is specified the result will be for a single tag and .content will be automatically un-nested to give a single un-nested dataframe of the content captured at the .tag tagged step.
This could be single or multiple rows depending on whether the original data was grouped at the point of tagging.
If both .tag and .glue are specified a .label column will be computed from .glue and the tagged content. If the result of this is a single row then just the string value of .label is returned.
If just .glue is specified, an un-nested dataframe with .tag, .strata and .label columns is returned, with a label for each tag in each strata.
If this seems complex then the best thing is to experiment until you get the output you want, leaving any .glue options until you think you know what you are doing. It made sense at the time.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1")
tmp = tmp %>% filter(Species!="versicolor") %>% dplyr::group_by(Species)
tmp %>% comment(.tag="step2") %>% tagged(.glue = "{.count}/{.total}")
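The different return shapes described above can be compared side by side. This is a small sketch under the assumption that an ungrouped tagged step yields a single row of content, so that supplying both .tag and .glue gives back a plain string; the variable names label and content are illustrative only.

```r
library(dplyr)
library(dtrackr)

tmp <- iris %>% track() %>% comment(.tag = "step1")

# Both .tag and .glue given, and the tagged step is ungrouped (a single row),
# so a single string is expected rather than a dataframe.
label <- tmp %>% tagged(.tag = "step1", .glue = "{.count}/{.total}")
print(label)

# Only .tag given: the un-nested content captured at that step.
content <- tmp %>% tagged(.tag = "step1")
print(content)
```

The string form is the one most convenient for inline reporting, for example inside an R Markdown results paragraph.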
Start tracking the dtrackr history graph
track(
  .data,
  .messages = .defaultMessage(),
  .headline = .defaultHeadline(),
  .tag = NULL
)
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue
specification can refer to any grouping variables of .data, or any
variables defined in the calling environment, the {.total} variable which
is the count of all rows, the {.count} variable which is the count of
rows in the current group and the {.strata} which describes the current
group. Defaults to the value of |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable which is |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with additional history graph metadata, to allow tracking.
library(dplyr)
library(dtrackr)

iris %>% track() %>% history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to omit these from the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
transmute(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
dplyr::transmute()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# In this example we compare the column names of the input and the
# output to identify the new columns created by the transmute operation as
# the `.new_cols` variable
iris %>%
  track() %>%
  group_by(Species, .add=TRUE) %>%
  transmute(
    sepal.w = Sepal.Width-1,
    sepal.l = Sepal.Length+1,
    .messages="{.new_cols}",
    .headline="New columns from transmute:") %>%
  history()
Un-grouping a data set logically combines the different arms. In the history
this joins any stratified branches and acts as a specific type of status(),
allowing you to generate some summary statistics about the un-grouped data.
See dplyr::ungroup().
## S3 method for class 'trackr_df'
ungroup(
  x,
  ...,
  .messages = .defaultMessage(),
  .headline = .defaultHeadline(),
  .tag = NULL
)
x |
A |
... |
variables to remove from the grouping. |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count}. The default is "total {.count} items" |
.headline |
a headline glue spec. The glue code can use {.count} and {.strata}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe but ungrouped with the history graph updated showing the ungroup operation as a new stage.
dplyr::ungroup()
library(dplyr)
library(dtrackr)

tmp = iris %>% dplyr::group_by(Species) %>% comment("A test")
tmp %>% dplyr::ungroup(.messages="{.count} items in combined") %>% history()
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}; in all other respects they perform exactly the same
operation as the equivalent dplyr function. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
## S3 method for class 'trackr_df'
union_all(
  x,
  y,
  ...,
  .messages = "{.count.out} items in union",
  .headline = "Union"
)
x, y
|
Pair of compatible data frames. A pair of data frames is compatible if they have the same column names (possibly in different orders) and compatible types. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::union_all()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs,rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}; in all other respects they perform exactly the same
operation as the equivalent dplyr function. See dplyr::bind_rows(),
dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
## S3 method for class 'trackr_df'
union(
  x,
  y,
  ...,
  .messages = "{.count.out} unique items in union",
  .headline = "Distinct union"
)
x, y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::union()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs,rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
tidyr::unnest
A drop-in replacement for tidyr::unnest() which optionally takes a
message and headline to store in the history graph.
## S3 method for class 'trackr_df'
unnest(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  .drop = deprecated(),
  .id = deprecated(),
  .sep = deprecated(),
  .preserve = deprecated(),
  .messages = "",
  .headline = ""
)
data |
A data frame. |
cols |
< When selecting multiple columns, values from the same row will be recycled to their common size. |
... |
|
keep_empty |
By default, you get one row of output for each element
of the list that you are unchopping/unnesting. This means that if there's a
size-0 element (like |
ptype |
Optionally, a named list of column name-prototype pairs to
coerce |
names_sep |
If |
names_repair |
Used to check that output data frame has valid names. Must be one of the following options:
See |
.drop, .preserve
|
|
.id |
|
.sep |
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
the result of tidyr::unnest() but with the history graph
updated.
tidyr::unnest()
library(dplyr)
library(dtrackr)

starwars %>%
  track() %>%
  tidyr::unnest(starships, keep_empty = TRUE) %>%
  tidyr::nest(world_data = c(-homeworld)) %>%
  history()

# There is a problem with `tidyr::unnest` that means if you want to override the
# `.messages` option at the moment it will most likely fail. Forcing the use of
# the specific `dtrackr::p_unnest` version solves this problem, until hopefully it is
# resolved in `tidyr`:
starwars %>%
  track() %>%
  p_unnest(
    films,
    .messages = c("{.count.in} characters", "{.count.out} appearances")
  ) %>%
  dplyr::group_by(gender) %>%
  tidyr::nest(
    people = c(-gender, -species, -homeworld),
    .messages = c("{.count.in} appearances", "{.count.out} planets")
  ) %>%
  status() %>%
  history()

# This example includes pivoting and nesting. The CMS patient care data
# has multiple tests per institution in a long format, and observed /
# denominator types. Firstly we pivot the data to allow us to easily calculate
# a total percentage for each institution. This is duplicated for every test
# so we nest the tests to get to one row per institution. Those institutions
# with invalid scores are excluded.
cms_history = tidyr::cms_patient_care %>%
  track() %>%
  tidyr::pivot_wider(names_from = type, values_from = score) %>%
  dplyr::mutate(
    percentage = sum(observed) / sum(denominator) * 100,
    .by = c(ccn, facility_name)
  ) %>%
  tidyr::nest(
    results = c(measure_abbr, observed, denominator),
    .messages = c("{.count.in} test results", "{.count.out} facilities")
  ) %>%
  exclude_all(
    percentage > 100 ~ "{.excluded} facilities with anomalous percentages",
    na.rm = TRUE
  )

print(cms_history %>% dtrackr::history())

# not run in examples:
if (interactive()) {
  cms_history %>% flowchart()
}
Remove tracking from the dataframe
untrack(.data)
.data |
a tracked dataframe |
the .data dataframe with history graph metadata removed.
library(dplyr)
library(dtrackr)

iris %>% track() %>% untrack() %>% class()