Title: | Track your Data Pipelines |
---|---|
Description: | Track and document 'dplyr' data pipelines. As you filter, mutate, and join your way through a data set, 'dtrackr' seamlessly keeps track of your data flow and makes publication ready documentation of a data pipeline simple. |
Authors: | Robert Challen [aut, cre] |
Maintainer: | Robert Challen <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.6 |
Built: | 2025-01-17 03:38:18 UTC |
Source: | https://github.com/terminological/dtrackr |
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as their dplyr counterparts. mutate / select / rename generally don't add anything in terms of the provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
    ## S3 method for class 'trackr_df'
    add_count(x, ..., .messages = "", .headline = "", .tag = NULL)
x | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... | < |
.messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::add_count()
    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # add_count
    # adding in a count or tally column as a new column
    iris %>%
      track() %>%
      add_count(Species, name="new_count_total",
                .messages="{.new_cols}",
                # .messages="{.cols}",
                .headline="New columns from add_count:") %>%
      history()

    # add_tally
    iris %>%
      track() %>%
      group_by(Species) %>%
      dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
                .messages="{.new_cols}",
                .headline="New columns from add_tally:") %>%
      history()
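The default of leaving these steps out of the history can be contrasted with an explicitly recorded step. The following is a minimal sketch, assuming dplyr and dtrackr are installed; the column name n_species and the headline and message wording are illustrative choices, not package defaults.

```r
library(dplyr)
library(dtrackr)

tracked <- iris %>% track()

# With the default empty .messages and .headline the step is applied to the
# data but, as described above, is not expected to add a stage to the history.
silent <- tracked %>% add_count(Species, name = "n_species")

# Supplying a headline and a message asks dtrackr to record the same step.
recorded <- tracked %>%
  add_count(
    Species, name = "n_species",
    .headline = "Added per-species count",
    .messages = "new columns: {.new_cols}"
  )

# The data are the same either way; compare the two histories to see the
# difference in what was documented.
silent %>% history()
recorded %>% history()
```

Comparing the two printed histories is a quick way to check which steps of a pipeline will appear in the final flowchart.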
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as their dplyr counterparts. mutate / select / rename generally don't add anything in terms of the provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
    add_tally(x, ..., .messages = "", .headline = "", .tag = NULL)
x | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... | < |
.messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::add_tally()
    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # add_count
    # adding in a count or tally column as a new column
    iris %>%
      track() %>%
      add_count(Species, name="new_count_total",
                .messages="{.new_cols}",
                # .messages="{.cols}",
                .headline="New columns from add_count:") %>%
      history()

    # add_tally
    iris %>%
      track() %>%
      group_by(Species) %>%
      dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
                .messages="{.new_cols}",
                .headline="New columns from add_tally:") %>%
      history()
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::anti_join() for more details on the underlying functions.
    ## S3 method for class 'trackr_df'
    anti_join(
      x,
      y,
      ...,
      .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} not matched"),
      .headline = "Semi join by {.keys}"
    )
x, y | A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | Other parameters passed onto methods. |
.messages | a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframe sizes respectively. |
.headline | a glue spec. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframe sizes respectively. |
the join of the two dataframes with the history graph updated.
dplyr::anti_join()
    library(dplyr)
    library(dtrackr)

    # Joins across data sets

    # example data uses the dplyr starwars data
    people = starwars %>% select(-films, -vehicles, -starships)
    films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

    lhs = people %>% track() %>% comment("People df {.total}")
    rhs = films %>% track() %>% comment("Films df {.total}") %>%
      comment("a test comment")

    # Anti join
    join = lhs %>% anti_join(rhs, by="name") %>% comment("joined {.total}")

    # See what the history of the graph is:
    join %>% history() %>% print()
    nrow(join)

    # Display the tracked graph (not run in examples)
    # join %>% flowchart()
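The glue variables listed for .messages and .headline are easier to follow on a small data set where the counts can be checked by eye. This is a sketch only: the patients and withdrawn tables are made-up illustrative data, and the headline is passed explicitly because the default in the usage above reads "Semi join by {.keys}".

```r
library(dplyr)
library(dtrackr)

# Hypothetical toy data: five patients, two of whom withdrew.
patients <- tibble(id = 1:5, age = c(34, 51, 29, 62, 45)) %>%
  track() %>%
  comment("{.total} patients")

withdrawn <- tibble(id = c(2L, 4L)) %>%
  track() %>%
  comment("{.total} withdrawals")

# Keep only the patients with no match in `withdrawn`, and describe the
# step using the documented join variables.
remaining <- patients %>%
  anti_join(
    withdrawn, by = "id",
    .headline = "Anti join by {.keys}",
    .messages = c(
      "{.count.lhs} patients",
      "{.count.rhs} withdrawals",
      "{.count.out} remaining"
    )
  )

# The history now contains both input branches and the join stage.
remaining %>% history()
nrow(remaining)
```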
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as their dplyr counterparts. mutate / select / rename generally don't add anything in terms of the provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
    ## S3 method for class 'trackr_df'
    arrange(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | < The value can be: Named arguments passed on to |
.messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::arrange()
    library(dplyr)
    library(dtrackr)

    # mutate and other functions are unitary operations that generally change
    # the structure but not size of a dataframe. In dtrackr these are ignored
    # by default but we can change that so that their behaviour is obvious.

    # arrange
    # In this case we sort the data descending and show the first value
    # is the same as the maximum value.
    iris %>%
      track() %>%
      arrange(
        desc(Petal.Width),
        .messages="{.count} items, columns: {.cols}",
        .headline="Reordered dataframe:") %>%
      history()
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects they perform exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
    bind_cols(
      ...,
      .messages = "{.count.out} in combined set",
      .headline = "Bind columns"
    )
... | a collection of tracked data frames to combine Named arguments passed on to |
.messages | a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline | a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::bind_cols()
    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)
    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    # (rows of lhs that are not in rhs, i.e. the droids)
    set = setdiff(lhs,rhs) %>% comment("{.count} droids")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()
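The shared example above exercises the row-wise operations only, so a column-wise case may be useful alongside it. The following is a hedged sketch, assuming bind_cols() accepts two tracked dataframes with the same number of rows in the same way as the other set operations; the measurements and labels objects are illustrative names.

```r
library(dplyr)
library(dtrackr)

# Split iris into two tracked pieces with the same rows but different columns.
measurements <- iris %>%
  select(-Species) %>%
  track() %>%
  comment("{.total} measurement rows")

labels <- iris %>%
  select(Species) %>%
  track() %>%
  comment("{.total} label rows")

# Bind the columns back together; the histories of both inputs are merged
# and the default message reports {.count.out}.
combined <- bind_cols(measurements, labels) %>%
  comment("{.total} rows after binding columns")

combined %>% history()
dim(combined)
```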
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects they perform exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
    bind_rows(..., .messages = "{.count.out} in union", .headline = "Union")
... | a collection of tracked data frames to combine Named arguments passed on to |
.messages | a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline | a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::bind_rows()
    library(dplyr)
    library(dtrackr)

    # Set operations
    people = starwars %>% select(-films, -vehicles, -starships)
    chrs = people %>% track("start")

    lhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Droid" ~ "{.included} droids"
    )

    # these are different subsets of the same data
    rhs = chrs %>% include_any(
      species == "Human" ~ "{.included} humans",
      species == "Gungan" ~ "{.included} gungans"
    ) %>% comment("{.count} gungans & humans")

    # Unions
    set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    # Intersections and differences
    # (rows of lhs that are not in rhs, i.e. the droids)
    set = setdiff(lhs,rhs) %>% comment("{.count} droids")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()

    set = intersect(lhs,rhs) %>% comment("{.count} humans")
    # display the history of the result:
    set %>% history()
    nrow(set)
    # not run - display the flowchart:
    # set %>% flowchart()
Start capturing exclusions on a tracked dataframe.
    capture_exclusions(.data, .capture = TRUE)
.data | a tracked dataframe |
.capture | Should we capture exclusions (things removed from the data set). This is useful for debugging data issues but comes at a significant cost. Defaults to the value of |
the .data dataframe with the exclusions flag set (or cleared if .capture=FALSE).
    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% capture_exclusions()
    tmp %>% filter(Species!="versicolor") %>% history()
A comment can be any kind of note, and is added once for every current grouping, as defined by the .messages field. It can be made context specific by including variables such as {.count} and {.total} in .messages, which refer to the grouped and ungrouped counts respectively at the current stage of the pipeline. It can also pull in any global variable.
    comment(
      .data,
      .messages = .defaultMessage(),
      .headline = .defaultHeadline(),
      .type = "info",
      .asOffshoot = (.type == "exclusion"),
      .tag = NULL
    )
.data | a dataframe which may be grouped |
.messages | a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} of all rows, the {.count} variable which is the count in each group and {.strata} a description of the group |
.headline | a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment, or the {.total} variable (which is |
.type | one of "info", ..., "exclusion": used to define formatting |
.asOffshoot | do you want this comment to be an offshoot of the main flow (default FALSE, unless .type is "exclusion"). |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the same .data dataframe with the history graph updated with the comment
    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% comment("hello {.total} rows") %>% history()
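The difference between {.count} and {.total} only shows up on grouped data, which the example above does not cover. As a minimal sketch (the headline and message wording are illustrative), grouping iris by Species before commenting should give one message per group:

```r
library(dplyr)
library(dtrackr)

by_species <- iris %>%
  track() %>%
  group_by(Species) %>%
  comment(
    .headline = "Stratified by species",
    # {.strata} describes the group, {.count} is the size of that group
    # and {.total} is the size of the whole data set.
    .messages = "{.strata}: {.count} of {.total} rows"
  )

by_species %>% history()
```

The data themselves are unchanged by comment(); only the history graph gains a stage.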
A frequent use case for more detailed description is to have a subgroup count within a flowchart. This works best for factor subgroup columns but other data will be converted to a factor automatically. The count of the items in each subgroup is added as a new stage in the flowchart.
    count_subgroup(
      .data,
      .subgroup,
      ...,
      .messages = .defaultCountSubgroup(),
      .headline = .defaultHeadline(),
      .type = "info",
      .asOffshoot = FALSE,
      .tag = NULL,
      .maxsubgroups = .defaultMaxSupportedGroupings()
    )
.data | a dataframe which may be grouped |
.subgroup | a column with a small number of levels (e.g. a factor) |
... | passed to |
.messages | a character vector of glue specifications. A glue specification can refer to anything from the calling environment, {.subgroup} for the subgroup column name and {.name} for the subgroup column value, {.count} for the subgroup column count, {.subtotal} for the current stratification grouping count and {.total} for the whole dataset count |
.headline | a glue specification which can refer to grouping variables of .data, {.subtotal} for the current grouping count, or any variables defined in the calling environment |
.type | one of "info","exclusion": used to define formatting |
.asOffshoot | do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag | if you want to use the summary data from this step in the future then give it a name with .tag. |
.maxsubgroups | the maximum number of discrete values allowed in .subgroup is configurable with |
the same .data dataframe with the history graph updated with a subgroup count as a new stage
    library(dplyr)
    library(dtrackr)

    survival::cgd %>% track() %>% group_by(treat) %>%
      count_subgroup(center) %>% history()
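The glue variables documented for .messages and .headline can be combined to report each subgroup against its stratum. This sketch uses the built-in mtcars data rather than survival::cgd, converts the columns to factors up front as an illustrative choice, and uses message wording that is not the package default.

```r
library(dplyr)
library(dtrackr)

# Illustrative preparation: make the subgroup and stratum explicit factors.
mt <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am, labels = c("automatic", "manual"))
  )

counted <- mt %>%
  track() %>%
  group_by(am) %>%
  count_subgroup(
    cyl,
    # {.subtotal} is the size of the current stratum (here, each `am` group)
    .headline = "{.subtotal} cars",
    # {.name} is the subgroup value and {.count} its count within the stratum
    .messages = "{.name} cylinders: {.count} of {.subtotal}"
  )

counted %>% history()
```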
Distinct acts in the same way as dplyr::distinct(). Prior to the operation the size of the group is calculated as {.count.in}, and after the operation the output size is calculated as {.count.out}. The group {.strata} is also available (if grouped) for reporting. See dplyr::distinct().
    ## S3 method for class 'trackr_df'
    distinct(
      .data,
      ...,
      .messages = "removing {.count.in-.count.out} duplicates",
      .headline = .defaultHeadline(),
      .tag = NULL
    )
.data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | < |
.messages | a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline | a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with distinct values and history graph updated.
dplyr::distinct()
    library(dplyr)
    library(dtrackr)

    tmp = bind_rows(iris %>% track(), iris %>% track() %>% filter(Petal.Length > 5))
    tmp %>% group_by(Species) %>% distinct() %>% history()
Convert a Graphviz dot digraph, given as a string, to SVG as a string.
    dot2svg(dot)
dot | a |
the SVG as a string
    dot2svg("digraph { A->B }")
Apply a set of filters and summarise the actions of the filters in the dtrackr history graph. Because of the ... filter specification, all parameters MUST BE NAMED. The filters work in a combinatorial manner, i.e. the results EXCLUDE ALL rows that match any of the criteria. If na.rm = TRUE they also remove anything that cannot be evaluated by any criteria.
    exclude_all(
      .data,
      ...,
      .headline = .defaultHeadline(),
      na.rm = FALSE,
      .type = "exclusion",
      .asOffshoot = TRUE,
      .stage = (if (is.null(.tag)) "" else .tag),
      .tag = NULL
    )
.data | a dataframe which may be grouped |
... | a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match any of the predicates will be excluded. The RHS is a glue specification, defining the message to be entered in the history graph for each predicate. This can refer to grouping variables, variables from the environment, {.excluded} and {.matched} or {.missing} (excluded = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline | a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm | (default FALSE) if the filter cannot be evaluated for a row, count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type | default "exclusion": used to define formatting |
.asOffshoot | do you want this comment to be an offshoot of the main flow (default = TRUE). |
.stage | a name for this step in the pathway |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of excluded items as a new offshoot stage
    library(dplyr)
    library(dtrackr)

    iris %>% track() %>% capture_exclusions() %>% exclude_all(
      Petal.Length > 5 ~ "{.excluded} long ones",
      Petal.Length < 2 ~ "{.excluded} short ones"
    ) %>% history()

    # simultaneous evaluation of criteria:
    data.frame(a = 1:10) %>%
      track() %>%
      exclude_all(
        # These two criteria identify the same value and one item is excluded
        a > 9 ~ "{.excluded} value > 9",
        a == max(a) ~ "{.excluded} max value",
      ) %>%
      status() %>%
      history()

    # the behaviour is equivalent to the inverse of dplyr's filter function:
    data.frame(a=1:10) %>%
      dplyr::filter(a <= 9, a != max(a)) %>%
      nrow()

    # step-wise evaluation of criteria results in a different output
    data.frame(a = 1:10) %>%
      track() %>%
      # Performing the same exclusion sequentially results in 2 items
      # being excluded as the criteria no longer identify the same
      # item.
      exclude_all(a > 9 ~ "{.excluded} value > 9") %>%
      exclude_all(a == max(a) ~ "{.excluded} max value") %>%
      status() %>%
      history()

    # the behaviour is equivalent to the inverse of dplyr's filter function:
    data.frame(a=1:10) %>%
      dplyr::filter(a <= 9) %>%
      dplyr::filter(a != max(a)) %>%
      nrow()
Get the dtrackr excluded data record
    excluded(.data, simplify = TRUE)
.data | a dataframe which may be grouped |
simplify | return a single summary dataframe of all exclusions. |
a new dataframe of the excluded data up to this point in the workflow. This dataframe is by default flattened, but if simplify = FALSE it has a nested structure containing the records excluded at each part of the pipeline.
    library(dplyr)
    library(dtrackr)

    tmp = iris %>% track() %>% capture_exclusions()
    tmp %>% exclude_all(
      Petal.Length > 5.8 ~ "{.excluded} long ones",
      Petal.Length < 1.3 ~ "{.excluded} short ones",
      .stage = "petal length exclusion"
    ) %>% excluded()
Filter acts in the same way as in dplyr, where predicates which evaluate to TRUE act to select items to include, and items for which the predicate cannot be evaluated are excluded. For tracking, the size of each group is calculated prior to the filter operation as {.count.in}, and the output size of each group after the operation as {.count.out}. The grouping {.strata} is also available (if grouped) for reporting. See dplyr::filter().
    ## S3 method for class 'trackr_df'
    filter(
      .data,
      ...,
      .messages = "excluded {.excluded} items",
      .headline = .defaultHeadline(),
      .type = "exclusion",
      .asOffshoot = (.type == "exclusion"),
      .stage = (if (is.null(.tag)) "" else .tag),
      .tag = NULL
    )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
the format type of the action, typically an exclusion |
.asOffshoot |
if the type is exclusion, |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated
dplyr::filter()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species)
tmp %>% filter(Petal.Length > 5) %>% history()
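As a further sketch, the glue variables described for .messages can be combined into a custom message. The message wording below is an illustrative assumption, not a package default:

```r
library(dplyr)
library(dtrackr)

# Track iris and group by species so that {.strata} is available
tracked = iris %>% track() %>% group_by(Species)

# Custom message built from the per-group counts before and after filtering
long_petals = tracked %>%
  filter(
    Petal.Length > 5,
    .messages = "{.strata}: {.count.in} in, {.count.out} kept"
  )

# The rows returned are the same as dplyr::filter() would give
nrow(long_petals)

# Inspect the stage recorded in the history graph
long_petals %>% history()
```

The row count can be checked against sum(iris$Petal.Length > 5), since tracking should not change which rows are kept.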
Generate a flowchart of the history of the dataframe(s), with each tracked step of the data pipeline shown as a stage in the flowchart. Multiple dataframes can be plotted together, in which case an attempt is made to determine which parts are common.
flowchart(
  .data,
  filename = NULL,
  size = std_size$full,
  maxWidth = size$width,
  maxHeight = size$height,
  formats = c("dot", "png", "pdf", "svg"),
  defaultToHTML = TRUE,
  landscape = size$rot != 0,
  ...
)
.data |
the tracked dataframe(s) either as a single dataframe or as a list of dataframes. |
filename |
a file name which will be where the formatted flowcharts are
saved. If no extension is specified the output formats are determined by
the |
size |
a named list with 3 elements: width and height in inches, and rot (rotation), as used by the default values of maxWidth, maxHeight and landscape. A predefined set of standard sizes is available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
defaultToHTML |
if the correct output format is not easy to determine
from the context, default providing |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
other parameters passed onto either |
the nature of the flowchart output depends on the context in which the function is called. It will be some form of browsable HTML output if called from an interactive session, or a PNG / PDF link if called in knitr when knitting LaTeX or Word type outputs. If a file name is specified the output will also be saved at the given location.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
tmp %>% group_by(Species) %>% comment(.tag="step2") %>% flowchart()
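The size argument can be explored before rendering anything. The following is a minimal sketch, assuming std_size is available once dtrackr is attached; the custom_size values are hypothetical and chosen only for illustration:

```r
library(dplyr)
library(dtrackr)

# Inspect one of the predefined sizes; the flowchart() defaults read its
# width, height and rot elements
str(std_size$full)

# A hypothetical custom size with the same three elements
custom_size = list(width = 6, height = 4, rot = 0)

tmp = iris %>% track() %>% comment(.tag = "step1")

# Not run: render the flowchart at the custom size
# tmp %>% flowchart(size = custom_size)
```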
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::full_join() for more details on the underlying functions.
## S3 method for class 'trackr_df'
full_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Full join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
the join of the two dataframes with the history graph updated.
dplyr::full_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Full join
join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
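A smaller sketch may make the join glue variables easier to follow. The two data frames and the message wording below are illustrative assumptions:

```r
library(dplyr)
library(dtrackr)

# Two small tracked data frames sharing an id column (illustrative data)
lhs = tibble::tibble(id = 1:3, x = c("a", "b", "c")) %>% track()
rhs = tibble::tibble(id = 2:4, y = c("B", "C", "D")) %>% track()

# Override the default headline and messages using the documented variables
joined = lhs %>%
  full_join(
    rhs,
    by = "id",
    .headline = "Full join on {.keys}",
    .messages = "{.count.lhs} left and {.count.rhs} right give {.count.out} rows"
  )

# A full join keeps every id found on either side
nrow(joined)

joined %>% history()
```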
Grouping a data set acts in the normal way. When tracking a dataframe, a group_by() operation will sometimes create a lot of groups. This happens, for example, if you are doing a group_by(), summarise() step that is aggregating data on a fine scale, e.g. by day in a time series. This is generally a terrible idea when tracking a dataframe, as the resulting flowchart will have a great many branches and be illegible. dtrackr will detect this issue and pause tracking the dataframe with a warning. It is up to the user to resume() tracking when the large number of groups has been resolved, e.g. using dplyr::ungroup(). This limit is configurable with options("dtrackr.max_supported_groupings"=XX). The default is 16. See dplyr::group_by().
## S3 method for class 'trackr_df'
group_by(
  .data,
  ...,
  .messages = "stratify by {.cols}",
  .headline = NULL,
  .tag = NULL,
  .maxgroups = .defaultMaxSupportedGroupings()
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by. |
.headline |
a headline glue spec. The glue code can use any global variable, or {.cols}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
.maxgroups |
the maximum number of subgroups allowed before the tracking is paused. |
the .data but grouped.
dplyr::group_by()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species, .messages="stratify by {.cols}")
tmp %>% comment("{.strata}") %>% history()
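The pause and resume behaviour described above can be sketched as follows. The data is illustrative, and the assumption is that tracking pauses with a warning once the default limit of 16 groups is exceeded, as the description states:

```r
library(dplyr)
library(dtrackr)

# 30 distinct days: more groups than the default limit of 16 (illustrative data)
daily = tibble::tibble(day = rep(1:30, each = 2), value = seq_len(60)) %>% track()

# dtrackr is expected to warn and pause tracking here
by_day = suppressWarnings(daily %>% group_by(day))

# Aggregate, remove the grouping, then resume tracking
totals = by_day %>%
  summarise(total = sum(value)) %>%
  ungroup() %>%
  resume()

nrow(totals)

# Alternatively, raise the limit for a single call with .maxgroups
by_day_tracked = daily %>% group_by(day, .maxgroups = 50)
n_groups(by_day_tracked)
```

Raising .maxgroups keeps tracking active but produces one branch per group, so it is only sensible when the flowchart will not be drawn at that stage.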
Group modifying a data set acts in the normal way. The internal mechanics of the modify function are opaque to the history. This means it can be used to wrap any unsupported operation without losing the history, e.g. df %>% track() %>% group_modify(function(d,...) { d %>% unsupported_operation() }). Prior to the operation the size of the group is calculated as {.count.in}, and after the operation the output size as {.count.out}. The group {.strata} is also available (if grouped) for reporting. See dplyr::group_modify().
## S3 method for class 'trackr_df'
group_modify(
  .data,
  ...,
  .messages = NULL,
  .headline = .defaultHeadline(),
  .type = "modify",
  .tag = NULL
)
.data |
A grouped tibble |
... |
Additional arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
default "modify": used to define formatting |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the transformed .data dataframe with the history graph updated.
dplyr::group_modify()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species)
tmp %>%
  group_modify(
    function(d,g,...) { return(tibble::tibble(x=runif(10))) },
    .messages="{.count.in} in, {.count.out} out"
  ) %>%
  history()
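The wrapping pattern mentioned in the description can be sketched with a deterministic function. Here head() merely stands in for an operation that has no dtrackr equivalent; it is an illustrative choice, not a statement about what dtrackr supports:

```r
library(dplyr)
library(dtrackr)

# head() stands in for any operation without a tracked equivalent
first_two = iris %>%
  track() %>%
  group_by(Species) %>%
  group_modify(
    function(d, g, ...) head(d, 2),
    .messages = "{.count.in} in, {.count.out} out"
  )

# Two rows are kept for each of the three species
nrow(first_two)

first_two %>% history()
```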
This provides the raw history graph and is not really intended for mainstream use. The internal structure of the graph is explained below. print and plot S3 methods exist for the dtrackr history graph.
history(.data)
.data |
a dataframe which may be grouped |
the history graph. This is a list, of class trackr_graph, containing the following named items:
excluded - the data items that have been excluded thus far as a nested dataframe
tags - a dataframe of tag-value pairs containing the summary of the data at named points in the data flow (see tagged())
nodes - a dataframe of the nodes of the flow chart
edges - an edge list (as a dataframe) of the relationships between the nodes in the flow chart
head - the current most recent nodes added into the graph as a dataframe.
The format of this data may grow over time but these fields are unlikely to be changed.
library(dplyr)
library(dtrackr)

graph = iris %>% track() %>% comment("A comment") %>% history()
print(graph)
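The named items listed above can be inspected directly. This is a minimal sketch that relies only on the structure described in the return value; the columns inside each dataframe are not documented here, so none are assumed:

```r
library(dplyr)
library(dtrackr)

graph = iris %>% track() %>% comment("A comment") %>% history()

# The graph is a list of class trackr_graph
class(graph)
names(graph)

# Nodes and edges of the flowchart are plain dataframes
graph$nodes
graph$edges
```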
Apply a set of inclusion criteria and record the actions of the filter to the dtrackr history graph. Because of the ... filter specification, all parameters MUST BE NAMED. This function is the opposite of exclude_all(): the filtering criteria identify rows to include, i.e. the results include anything that matches any of the criteria. If na.rm=TRUE they also keep anything that cannot be evaluated by the criteria.
include_any(
  .data,
  ...,
  .headline = .defaultHeadline(),
  na.rm = TRUE,
  .type = "inclusion",
  .asOffshoot = FALSE,
  .tag = NULL
)
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae, where the LHS are predicates to test the data set against. Items that match at least one of the predicates will be included. The RHS is a glue specification defining the message to be entered in the history graph for each predicate matched. This can refer to grouping variables, variables from the environment, {.included}, {.matched} or {.missing} (included = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default TRUE) if the filter cannot be evaluated for a row count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "inclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of included items as a new stage
library(dplyr)
library(dtrackr)

iris %>% track() %>% group_by(Species) %>%
  include_any(
    Petal.Length > 5 ~ "{.included} long ones",
    Petal.Length < 2 ~ "{.included} short ones"
  ) %>%
  history()

# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
  track() %>%
  include_any(
    # These two criteria identify the same value and one item is excluded
    a > 1 ~ "{.included} value > 1",
    a != min(a) ~ "{.included} everything but the smallest value"
  ) %>%
  status() %>%
  history()

# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a > 1, a != min(a)) %>%
  nrow()

# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
  track() %>%
  # Performing the same exclusion sequentially results in 2 items
  # being excluded as the criteria no longer identify the same
  # item.
  include_any(a > 1 ~ "{.included} value > 1") %>%
  include_any(a != min(a) ~ "{.included} everything but the smallest value") %>%
  status() %>%
  history()

# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a > 1) %>%
  dplyr::filter(a != min(a)) %>%
  nrow()
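Because rows matching any one criterion are kept, two non-overlapping criteria give a clearer picture of how this differs from passing several predicates to dplyr::filter(). The following is a small sketch with illustrative data and message text:

```r
library(dplyr)
library(dtrackr)

# Keep the values at either end of the range (illustrative criteria)
kept = data.frame(a = 1:10) %>%
  track() %>%
  include_any(
    a <= 2 ~ "{.included} small values",
    a >= 9 ~ "{.included} large values"
  )

# Rows matching either criterion are retained
nrow(kept)

# By contrast, dplyr::filter() requires all predicates to hold at once
both = data.frame(a = 1:10) %>% dplyr::filter(a <= 2, a >= 9)
nrow(both)
```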
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::inner_join() for more details on the underlying functions.
## S3 method for class 'trackr_df'
inner_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Inner join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
the join of the two dataframes with the history graph updated.
dplyr::inner_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Inner join
join = lhs %>% inner_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
## S3 method for class 'trackr_df'
intersect(
  x,
  y,
  ...,
  .messages = "{.count.out} in intersection",
  .headline = "Intersection"
)
x, y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::intersect()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs, rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
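The same operations can be followed more easily on a toy data set where the expected row counts are obvious. This is a sketch with illustrative data and message text:

```r
library(dplyr)
library(dtrackr)

# Two tracked data frames with overlapping ids (illustrative data)
lhs = tibble::tibble(id = 1:4) %>% track()
rhs = tibble::tibble(id = 3:6) %>% track()

# ids 3 and 4 appear on both sides
shared = intersect(
  lhs, rhs,
  .messages = "{.count.out} ids on both sides",
  .headline = "Shared ids"
)
nrow(shared)

# ids 1 to 6 appear on at least one side
combined = union(lhs, rhs)
nrow(combined)

shared %>% history()
```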
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::left_join() for more details on the underlying functions.
## S3 method for class 'trackr_df'
left_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Left join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
the join of the two dataframes with the history graph updated.
dplyr::left_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Left join
join = lhs %>% left_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave them out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
mutate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::mutate()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# mutate

# In this example we compare the column names of the input and the
# output to identify the new columns created by the mutate operation as
# the `.new_cols` variable
iris %>%
  track() %>%
  mutate(extra_col = NA_real_, .messages="{.new_cols}", .headline="Extra columns from mutate:") %>%
  history()
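The {.dropped_cols} variable can be used in the same way. This sketch assumes the tracked select() accepts .messages and .headline as the description above indicates; the headline text is illustrative:

```r
library(dplyr)
library(dtrackr)

# Record which columns a select() removed
trimmed = iris %>%
  track() %>%
  select(
    -Species,
    .messages = "{.dropped_cols}",
    .headline = "Columns removed by select:"
  )

# The data itself is what dplyr::select() would return
colnames(trimmed)

trimmed %>% history()
```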
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::nest_join() for more details on the underlying functions.
## S3 method for class 'trackr_df'
nest_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} matched"),
  .headline = "Nest join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, and {.count.lhs}, {.count.rhs}, {.count.out} for the sizes of the two input dataframes and the output dataframe respectively |
the join of the two dataframes with the history graph updated.
dplyr::nest_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Nest join
join = lhs %>% nest_join(rhs, by="name") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave them out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
p_add_count(x, ..., .messages = "", .headline = "", .tag = NULL)
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::add_count()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# add_count

# adding in a count or tally column as a new column
iris %>%
  track() %>%
  add_count(Species, name="new_count_total",
    .messages="{.new_cols}",
    # .messages="{.cols}",
    .headline="New columns from add_count:") %>%
  history()

# add_tally
iris %>%
  track() %>%
  group_by(Species) %>%
  dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
    .messages="{.new_cols}",
    .headline="New columns from add_tally:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave them out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
p_add_tally(x, ..., .messages = "", .headline = "", .tag = NULL)
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
< |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::add_tally()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# add_count
# adding in a count or tally column as a new column
iris %>%
  track() %>%
  add_count(Species, name="new_count_total",
    .messages="{.new_cols}",
    # .messages="{.cols}",
    .headline="New columns from add_count:") %>%
  history()

# add_tally
iris %>%
  track() %>%
  group_by(Species) %>%
  dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
    .messages="{.new_cols}",
    .headline="New columns from add_tally:") %>%
  history()
Joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::anti_join() for more details on the underlying functions.
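The counts available to the join messages can be reproduced with plain dplyr. The sketch below uses two small made-up data frames (`lhs`, `rhs` and the `count_*` names are hypothetical) and only illustrates what {.count.lhs}, {.count.rhs} and {.count.out} are assumed to correspond to for an anti join; it does not show how dtrackr computes them internally.

```r
library(dplyr)

# Hypothetical inputs, for illustration only
lhs <- data.frame(name = c("a", "b", "c"))
rhs <- data.frame(name = c("a", "a", "d"), film = c("x", "y", "z"))

# An anti join keeps the rows of lhs that have no match in rhs
out <- anti_join(lhs, rhs, by = "name")

count_lhs <- nrow(lhs)  # assumed to correspond to {.count.lhs}
count_rhs <- nrow(rhs)  # assumed to correspond to {.count.rhs}
count_out <- nrow(out)  # assumed to correspond to {.count.out}
```

Here only "b" and "c" are unmatched, so the output count is smaller than the left-hand input.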
p_anti_join( x, y, ..., .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} not matched"), .headline = "Semi join by {.keys}" )
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::anti_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Anti join
join = lhs %>% anti_join(rhs, by="name") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to omit them from the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
p_arrange(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::arrange()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# arrange
# In this case we sort the data descending and show the first value
# is the same as the maximum value.
iris %>%
  track() %>%
  arrange(
    desc(Petal.Width),
    .messages="{.count} items, columns: {.cols}",
    .headline="Reordered dataframe:") %>%
  history()
These perform set operations on tracked dataframes. They merge the history of two (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects each function performs exactly the same operation as its dplyr equivalent. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
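As a rough illustration of how {.count.out} differs between the set operations, the sketch below runs the plain dplyr functions on two small made-up data frames (`lhs`, `rhs` and `counts` are hypothetical names). It assumes {.count.out} is simply the row count of the result, as described above, and does not involve any tracking.

```r
library(dplyr)

# Hypothetical inputs sharing the ids 2 and 3
lhs <- data.frame(id = c(1, 2, 3))
rhs <- data.frame(id = c(2, 3, 4))

# Row count of each result, i.e. what {.count.out} is assumed to report
counts <- c(
  bind_rows = nrow(dplyr::bind_rows(lhs, rhs)),  # keeps duplicates
  union_all = nrow(dplyr::union_all(lhs, rhs)),  # keeps duplicates
  union     = nrow(dplyr::union(lhs, rhs)),      # removes duplicates
  intersect = nrow(dplyr::intersect(lhs, rhs)),  # rows in both
  setdiff   = nrow(dplyr::setdiff(lhs, rhs))     # rows only in lhs
)
counts
```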
p_bind_cols( ..., .messages = "{.count.out} in combined set", .headline = "Bind columns" )
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::bind_cols()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs, rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
These perform set operations on tracked dataframes. They merge the history of two (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects each function performs exactly the same operation as its dplyr equivalent. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
p_bind_rows(..., .messages = "{.count.out} in union", .headline = "Union")
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::bind_rows()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs, rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Start capturing exclusions on a tracked dataframe.
p_capture_exclusions(.data, .capture = TRUE)
.data |
a tracked dataframe |
.capture |
Should we capture exclusions (things removed from the data
set). This is useful for debugging data issues but comes at a significant
cost. Defaults to the value of |
the .data dataframe with the exclusions flag set (or cleared if .capture=FALSE).
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% capture_exclusions()
tmp %>% filter(Species!="versicolor") %>% history()
This is unlikely to be needed directly and is mostly an internal function.
p_clear(.data)
.data |
a dataframe which may be grouped |
the .data dataframe with the history graph removed
library(dplyr)
library(dtrackr)

mtcars %>% track() %>% comment("A comment") %>% p_clear() %>% history()
A comment can be any kind of note and is added once for every current grouping, as defined by the .messages field. It can be made context specific by including variables such as {.count} and {.total} in .messages, which refer to the grouped and ungrouped counts at the current stage of the pipeline respectively. It can also pull in any global variable.
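To make the difference between the grouped and ungrouped counts concrete, the sketch below builds one message per group with plain dplyr and glue. The names `total`, `count` and `msgs` are hypothetical stand-ins for {.total} and {.count}; this is only an approximation of the kind of message a comment produces, not dtrackr's own code.

```r
library(dplyr)
library(glue)

# Ungrouped row count, standing in for {.total}
total <- nrow(iris)

# One row per group; `count` stands in for the per-group {.count}
msgs <- iris %>%
  count(Species, name = "count") %>%
  mutate(msg = glue("{count} of {total} rows"))

msgs
```

Each of the three species in iris yields its own message, which mirrors a comment being added once per current grouping.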
p_comment( .data, .messages = .defaultMessage(), .headline = .defaultHeadline(), .type = "info", .asOffshoot = (.type == "exclusion"), .tag = NULL )
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} of all rows, the {.count} variable which is the count in each group and {.strata} a description of the group |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable (which is |
.type |
one of "info", ..., "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default is FALSE unless .type is "exclusion"). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the same .data dataframe with the history graph updated with the comment
library(dplyr)
library(dtrackr)

iris %>% track() %>% comment("hello {.total} rows") %>% history()
Copy the dtrackr history graph from one dataframe to another
p_copy(.data, from)
.data |
a dataframe which may be grouped |
from |
the dataframe to copy the history graph from |
the .data dataframe with the history graph of "from"
library(dplyr)
library(dtrackr)

mtcars %>% p_copy(iris %>% comment("A comment")) %>% history()
Simple count_if dplyr summary function
p_count_if(..., na.rm = TRUE)
... |
expression to be evaluated |
na.rm |
ignore NA values? |
a count of the number of times the expression evaluated to true, in the current context
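For comparison, a count of this kind can be written in plain dplyr as a sum over a logical expression. The sketch below is an assumption about the behaviour described above (a count of TRUE values with NA ignored), using a made-up data frame `df`; it is not the package's implementation.

```r
library(dplyr)

# Hypothetical data with one missing value
df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 5, NA, 7))

# Summing a logical vector counts its TRUE values;
# na.rm = TRUE ignores rows where the comparison is NA
res <- df %>%
  group_by(g) %>%
  summarise(n_big = sum(x > 4, na.rm = TRUE))

res
```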
library(dplyr)
library(dtrackr)

tmp = iris %>% dplyr::group_by(Species)
tmp %>% dplyr::summarise(long_ones = p_count_if(Petal.Length > 4))
A frequent use case for more detailed description is to have a subgroup count within a flowchart. This works best for factor subgroup columns but other data will be converted to a factor automatically. The count of the items in each subgroup is added as a new stage in the flowchart.
p_count_subgroup( .data, .subgroup, ..., .messages = .defaultCountSubgroup(), .headline = .defaultHeadline(), .type = "info", .asOffshoot = FALSE, .tag = NULL, .maxsubgroups = .defaultMaxSupportedGroupings() )
.data |
a dataframe which may be grouped |
.subgroup |
a column with a small number of levels (e.g. a factor) |
... |
passed to |
.messages |
a character vector of glue specifications. A glue specification can refer to anything from the calling environment, {.subgroup} for the subgroup column name and {.name} for the subgroup column value, {.count} for the subgroup column count, {.subtotal} for the current stratification grouping count and {.total} for the whole dataset count |
.headline |
a glue specification which can refer to grouping variables of .data, {.subtotal} for the current grouping count, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want to use the summary data from this step in the future then give it a name with .tag. |
.maxsubgroups |
the maximum number of discrete values allowed in
.subgroup is configurable with
|
the same .data dataframe with the history graph updated with a subgroup count as a new stage
library(dplyr)
library(dtrackr)

survival::cgd %>% track() %>% group_by(treat) %>%
  count_subgroup(center) %>% history()
distinct acts in the same way as dplyr::distinct(). Prior to the operation the size of the group is calculated as {.count.in}, and after the operation the output size is calculated as {.count.out}. The group {.strata} is also available (if grouped) for reporting. See dplyr::distinct().
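The default message uses an expression inside the glue braces, not just a variable name. The sketch below shows how glue evaluates such an expression, with `.count.in` and `.count.out` defined by hand from a made-up data frame `df`; inside dtrackr these values are supplied by the package, so this is an illustration only.

```r
library(dplyr)
library(glue)

# Hypothetical data containing repeated rows
df <- data.frame(x = c(1, 1, 2, 3, 3, 3))

# Hand-made stand-ins for the values dtrackr provides
.count.in <- nrow(df)
.count.out <- nrow(distinct(df))

# glue evaluates the R expression between the braces
msg <- glue("removing {.count.in-.count.out} duplicates")
msg
```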
p_distinct( .data, ..., .messages = "removing {.count.in-.count.out} duplicates", .headline = .defaultHeadline(), .tag = NULL )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with distinct values and history graph updated.
dplyr::distinct()
library(dplyr)
library(dtrackr)

tmp = bind_rows(iris %>% track(), iris %>% track() %>% filter(Petal.Length > 5))
tmp %>% group_by(Species) %>% distinct() %>% history()
Apply a set of filters and record a summary of their actions in the dtrackr history graph. Because of the ... filter specification, all other parameters MUST BE NAMED. The filters work in a combinatorial manner, i.e. the results EXCLUDE ALL rows that match any of the criteria. If na.rm = TRUE they also remove anything that cannot be evaluated by any criteria.
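The combinatorial rule and the effect of na.rm can be pictured in base R. The sketch below is an interpretation of the description above under stated assumptions: a row is "matched" if any criterion is TRUE and "missing" if a criterion evaluates to NA. The names `df`, `crit1`, `crit2`, `kept_default` and `kept_na_rm` are hypothetical, and this is not how dtrackr is implemented.

```r
# Hypothetical data with one value that cannot be evaluated
df <- data.frame(a = c(1, 5, NA, 10))

crit1 <- df$a > 9   # FALSE FALSE NA TRUE
crit2 <- df$a == 5  # FALSE TRUE  NA FALSE

# A row is matched if ANY criterion is TRUE (NA is not a match)
matched <- (crit1 %in% TRUE) | (crit2 %in% TRUE)
# A row is missing if a criterion could not be evaluated
missing <- is.na(crit1) | is.na(crit2)

# na.rm = FALSE (assumed): only matched rows are excluded
kept_default <- df[!matched, , drop = FALSE]
# na.rm = TRUE (assumed): missing rows are excluded as well
kept_na_rm <- df[!matched & !missing, , drop = FALSE]
```

With this data the default keeps the rows with a = 1 and a = NA, while the na.rm variant keeps only a = 1.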
p_exclude_all( .data, ..., .headline = .defaultHeadline(), na.rm = FALSE, .type = "exclusion", .asOffshoot = TRUE, .stage = (if (is.null(.tag)) "" else .tag), .tag = NULL )
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match any of the predicates will be excluded. The RHS is a glue specification defining the message to be entered in the history graph for each predicate. This can refer to grouping variables, variables from the environment, {.excluded}, {.matched} or {.missing} (excluded = matched + missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default FALSE) if the filter cannot be evaluated for a row, count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = TRUE). |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of excluded items as a new offshoot stage
library(dplyr)
library(dtrackr)

iris %>% track() %>% capture_exclusions() %>% exclude_all(
  Petal.Length > 5 ~ "{.excluded} long ones",
  Petal.Length < 2 ~ "{.excluded} short ones"
) %>% history()

# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
  track() %>%
  exclude_all(
    # These two criteria identify the same value and one item is excluded
    a > 9 ~ "{.excluded} value > 9",
    a == max(a) ~ "{.excluded} max value",
  ) %>%
  status() %>%
  history()

# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a <= 9, a != max(a)) %>%
  nrow()

# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
  track() %>%
  # Performing the same exclusion sequentially results in 2 items
  # being excluded as the criteria no longer identify the same
  # item.
  exclude_all(a > 9 ~ "{.excluded} value > 9") %>%
  exclude_all(a == max(a) ~ "{.excluded} max value") %>%
  status() %>%
  history()

# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a <= 9) %>%
  dplyr::filter(a != max(a)) %>%
  nrow()
Get the dtrackr excluded data record
p_excluded(.data, simplify = TRUE)
.data |
a dataframe which may be grouped |
simplify |
return a single summary dataframe of all exclusions. |
a new dataframe of the excluded data up to this point in the workflow. This dataframe is flattened by default, but if simplify = FALSE it has a nested structure containing the records excluded at each part of the pipeline.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% capture_exclusions()
tmp %>% exclude_all(
  Petal.Length > 5.8 ~ "{.excluded} long ones",
  Petal.Length < 1.3 ~ "{.excluded} short ones",
  .stage = "petal length exclusion"
) %>% excluded()
filter acts in the same way as in dplyr: predicates which evaluate to TRUE select items to include, and items for which the predicate cannot be evaluated are excluded. For tracking, the size of each group is calculated prior to the filter operation as {.count.in}, and the output size of each group after the operation as {.count.out}. The grouping {.strata} is also available (if grouped) for reporting. See dplyr::filter().
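The per-group input and output counts can be approximated with plain dplyr. In the sketch below, `df`, `count_in`, `count_out` and `summary_tbl` are hypothetical names and the data is made up; it illustrates how a predicate that evaluates to NA drops the row, and is not a description of dtrackr's internals.

```r
library(dplyr)

# Hypothetical grouped data; the NA cannot be evaluated by x > 5
df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 6, NA, 7))

# Group sizes before and after the filter
count_in <- df %>% count(g, name = "count_in")
count_out <- df %>% filter(x > 5) %>% count(g, name = "count_out")

# Rows excluded per group, including the row where x is NA
summary_tbl <- left_join(count_in, count_out, by = "g") %>%
  mutate(excluded = count_in - count_out)

summary_tbl
```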
p_filter( .data, ..., .messages = "excluded {.excluded} items", .headline = .defaultHeadline(), .type = "exclusion", .asOffshoot = (.type == "exclusion"), .stage = (if (is.null(.tag)) "" else .tag), .tag = NULL )
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.type |
the format type of the action, typically an exclusion |
.asOffshoot |
if the type is exclusion, |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then
give it a name with |
the filtered .data dataframe with history graph updated
dplyr::filter()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species)
tmp %>% filter(Petal.Length > 5) %>% history()
Generate a flowchart of the history of the dataframe(s), with each step of the tracked data pipeline shown as a stage in the flowchart. Multiple dataframes can be plotted together, in which case an attempt is made to determine which parts are common.
p_flowchart( .data, filename = NULL, size = std_size$full, maxWidth = size$width, maxHeight = size$height, formats = c("dot", "png", "pdf", "svg"), defaultToHTML = TRUE, landscape = size$rot != 0, ... )
.data |
the tracked dataframe(s) either as a single dataframe or as a list of dataframes. |
filename |
a file name which will be where the formatted flowcharts are
saved. If no extension is specified the output formats are determined by
the |
size |
a named list with 3 elements, length and width in inches and rotation. A predefined set of standard sizes are available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
defaultToHTML |
if the correct output format is not easy to determine
from the context, default providing |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
other parameters passed onto either |
the nature of the flowchart output depends on the context in which the function is called. It will be some form of browsable HTML output if called from an interactive session, or a PNG/PDF link if called in knitr when knitting LaTeX or Word type outputs. If a file name is specified the output will also be saved at the given location.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
tmp %>% group_by(Species) %>% comment(.tag="step2") %>% flowchart()
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::full_join() for more details on the underlying functions.
p_full_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Full join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::full_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Full join
join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
This provides the raw history graph and is not really intended for mainstream use. The internal structure of the graph is explained below. print and plot S3 methods exist for the dtrackr history graph.
p_get(.data)
.data |
a dataframe which may be grouped |
the history graph. This is a list, of class trackr_graph, containing the following named items:
excluded - the data items that have been excluded thus far as a nested dataframe
tags - a dataframe of tag-value pairs containing the summary of the data at named points in the data flow (see tagged())
nodes - a dataframe of the nodes of the flow chart
edges - an edge list (as a dataframe) of the relationships between the nodes in the flow chart
head - the current most recent nodes added into the graph as a dataframe.
The format of this data may grow over time but these fields are unlikely to be changed.
library(dplyr)
library(dtrackr)

graph = iris %>% track() %>% comment("A comment") %>% history()
print(graph)
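Because the history graph is a plain list, its named items can be inspected directly. The following is a minimal sketch that relies only on the nodes, edges and head items described above; the columns of each dataframe are not documented here, so they are printed rather than relied upon.

```r
library(dplyr)
library(dtrackr)

graph = iris %>%
  track() %>%
  comment("A comment") %>%
  history()

# The graph is a list with an S3 class
class(graph)
names(graph)

# Nodes and edges are ordinary dataframes and can be explored as such
graph$nodes
graph$edges

# The most recently added node(s) in the graph
graph$head
```

Working with these items directly is only likely to be useful for debugging or for building a custom rendering of the history.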
(Advanced usage) Outputs a dtrackr history graph as a DOT string for rendering with Graphviz.
p_get_as_dot(.data, fill = "lightgrey", fontsize = "8", colour = "black", ...)
.data |
the tracked dataframe |
fill |
the default node fill colour |
fontsize |
the default font size |
colour |
the default font colour |
... |
not used |
a representation of the history graph in Graphviz dot format.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
dot = tmp %>% group_by(Species) %>% comment(.tag="step2") %>% p_get_as_dot()
cat(dot)
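The DOT string can also be written to disk and rendered outside of R. This is a sketch only: the temporary file name is illustrative, and the rendering step assumes the Graphviz command line tools are installed separately, so it is shown as a comment.

```r
library(dplyr)
library(dtrackr)

dot = iris %>%
  track() %>%
  comment(.tag = "step1") %>%
  filter(Species != "versicolor") %>%
  p_get_as_dot(fill = "white", fontsize = "10")

# Write the DOT source to a temporary file (illustrative path)
dot_file = tempfile(fileext = ".dot")
writeLines(dot, dot_file)

# The file can then be rendered with the Graphviz CLI, for example:
#   dot -Tsvg <dot_file> -o flowchart.svg
file.exists(dot_file)
```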
Grouping a data set acts in the normal way. When tracking a dataframe, sometimes a group_by() operation will create a lot of groups. This happens, for example, if you are doing a group_by(), summarise() step that is aggregating data on a fine scale, e.g. by day in a time-series. This is generally a terrible idea when tracking a dataframe, as the resulting flowchart will have a great many branches and be illegible. dtrackr will detect this issue and pause tracking the dataframe with a warning. It is up to the user to resume() tracking when the large number of groups has been resolved, e.g. using a dplyr::ungroup(). This limit is configurable with options("dtrackr.max_supported_groupings"=XX). The default is 16. See dplyr::group_by().
p_group_by(
  .data,
  ...,
  .messages = "stratify by {.cols}",
  .headline = NULL,
  .tag = NULL,
  .maxgroups = .defaultMaxSupportedGroupings()
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by. |
.headline |
a headline glue spec. The glue code can use any global variable, or {.cols}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
.maxgroups |
the maximum number of subgroups allowed before the tracking is paused. |
the .data but grouped.
dplyr::group_by()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species, .messages="stratify by {.cols}")
tmp %>% comment("{.strata}") %>% history()
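The pause-on-many-groups behaviour described above can be provoked deliberately by lowering .maxgroups. This is a sketch under the assumption that the limit is compared against the number of groups created; the exact notification text is not documented here, so a message or warning should be expected when it runs.

```r
library(dplyr)
library(dtrackr)

# Grouping by a near-continuous variable creates many groups. With
# .maxgroups = 5 the limit is exceeded and tracking is paused.
paused = iris %>%
  track() %>%
  group_by(Petal.Width, .maxgroups = 5)

# Resolve the large number of groups, then resume tracking explicitly
resumed = paused %>%
  ungroup() %>%
  resume() %>%
  comment("{.count} rows after resuming")

resumed %>% history()
```

Setting options("dtrackr.max_supported_groupings"=XX) changes the default limit for every subsequent group_by(), whereas .maxgroups applies to a single call.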
Group modifying a data set acts in the normal way. The internal mechanics of the modify function are opaque to the history. This means it can be used to wrap any unsupported operation without losing the history, e.g. df %>% track() %>% group_modify(function(d,...) { d %>% unsupported_operation() }). Prior to the operation the size of the group is calculated as {.count.in}, and after the operation the output size is calculated as {.count.out}. The group {.strata} is also available (if grouped) for reporting. See dplyr::group_modify().
p_group_modify(
  .data,
  ...,
  .messages = NULL,
  .headline = .defaultHeadline(),
  .type = "modify",
  .tag = NULL
)
.data |
A grouped tibble |
... |
Additional arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
default "modify": used to define formatting |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the transformed .data dataframe with the history graph updated.
dplyr::group_modify()
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species)
tmp %>% group_modify(
  function(d,g,...) { return(tibble::tibble(x=runif(10))) },
  .messages="{.count.in} in, {.count.out} out"
) %>% history()
Apply a set of inclusion criteria and record the actions of the filter to the dtrackr history graph. Because of the ... filter specification, all parameters MUST BE NAMED. This function is the opposite of exclude_all() and the filtering criteria work to identify rows to include, i.e. the results include anything that matches any of the criteria. If na.rm=TRUE they also keep anything that cannot be evaluated by the criteria.
p_include_any(
  .data,
  ...,
  .headline = .defaultHeadline(),
  na.rm = TRUE,
  .type = "inclusion",
  .asOffshoot = FALSE,
  .tag = NULL
)
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against, items that match at least one of the predicates will be included. The RHS is a glue specification, defining the message, to be entered in the history graph for each predicate matched. This can refer to grouping variables, variables from the environment and {.included} and {.matched} or {.missing} (included = matched+missing), {.count} and {.total} - group and overall counts respectively, e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default TRUE) if the filter cannot be evaluated for a row count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "inclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the filtered .data dataframe with the history graph updated with the summary of included items as a new stage
library(dplyr)
library(dtrackr)

iris %>% track() %>% group_by(Species) %>%
  include_any(
    Petal.Length > 5 ~ "{.included} long ones",
    Petal.Length < 2 ~ "{.included} short ones"
  ) %>%
  history()

# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
  track() %>%
  include_any(
    # These two criteria identify the same value and one item is excluded
    a > 1 ~ "{.included} value > 1",
    a != min(a) ~ "{.included} everything but the smallest value",
  ) %>%
  status() %>%
  history()

# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a > 1, a != min(a)) %>%
  nrow()

# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
  track() %>%
  # Performing the same exclusion sequentially results in 2 items
  # being excluded as the criteria no longer identify the same
  # item.
  include_any(a > 1 ~ "{.included} value > 1") %>%
  include_any(a != min(a) ~ "{.included} everything but the smallest value") %>%
  status() %>%
  history()

# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
  dplyr::filter(a > 1) %>%
  dplyr::filter(a != min(a)) %>%
  nrow()
Mutating joins behave as dplyr joins, except the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::inner_join() for more details on the underlying functions.
p_inner_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Inner join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::inner_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Inner join
join = lhs %>% inner_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
p_intersect(
  x,
  y,
  ...,
  .messages = "{.count.out} in intersection",
  .headline = "Intersection"
)
x, y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::intersect()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Mutating joins behave as dplyr joins, except the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::left_join() for more details on the underlying functions.
p_left_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Left join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::left_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Left join
join = lhs %>% left_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
p_mutate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::mutate()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# mutate
# In this example we compare the column names of the input and the
# output to identify the new columns created by the mutate operation as
# the `.new_cols` variable
iris %>%
  track() %>%
  mutate(extra_col = NA_real_,
    .messages="{.new_cols}",
    .headline="Extra columns from mutate:") %>%
  history()
Mutating joins behave as dplyr joins, except the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::nest_join() for more details on the underlying functions.
p_nest_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} matched"),
  .headline = "Nest join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::nest_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Nest join
join = lhs %>% nest_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Pausing tracking of a data frame may be required if an operation is about to be performed that creates a lot of groupings, or that you otherwise don't want polluting the history graph (e.g. maybe selecting something using an anti-join). Once paused the history is not updated until resume() is called, or when the data frame is ungrouped (if auto is enabled).
p_pause(.data, auto = FALSE)
.data |
a tracked dataframe |
auto |
if |
the .data dataframe with history graph tracking paused
iris %>% track() %>% pause() %>% history()
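A paused pipeline is usually resumed later on. The sketch below brackets a filter between pause() and resume(); it assumes, following the description above, that steps taken while paused change the data but are not written to the history graph.

```r
library(dplyr)
library(dtrackr)

tmp = iris %>%
  track() %>%
  comment("{.count} rows at start") %>%
  pause() %>%
  # this filter is applied to the data while tracking is paused
  filter(Species != "setosa") %>%
  resume() %>%
  comment("{.count} rows after an untracked filter")

tmp %>% history()
nrow(tmp)
```

The comment added after resume() documents the effect of the untracked step, so the change in row count is not left unexplained in the flowchart.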
tidyr::pivot_longer
A drop-in replacement for tidyr::pivot_longer() which optionally takes a message and headline to store in the history graph.
p_pivot_longer(data, ..., .messages = "", .headline = "", .tag = NULL)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the result of tidyr::pivot_longer() but with the history graph updated.
tidyr::pivot_longer()
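No example accompanies this function, so the following is a hedged sketch of typical use. It assumes that calling pivot_longer() on a tracked dataframe dispatches to the dtrackr version, in the same way as the dplyr verbs in the other examples; the column names measurement and value are illustrative only.

```r
library(dplyr)
library(tidyr)
library(dtrackr)

long = iris %>%
  track() %>%
  pivot_longer(
    cols = -Species,
    names_to = "measurement",
    values_to = "value",
    .messages = "one row per measurement",
    .headline = "Reshape to long format"
  )

long %>% history()
nrow(long)
```

Without .messages or .headline the reshape still happens, but nothing is added to the history graph.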
tidyr::pivot_wider
A drop-in replacement for tidyr::pivot_wider() which optionally takes a message and headline to store in the history graph.
p_pivot_wider(data, ..., .messages = "", .headline = "", .tag = NULL)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the data dataframe resulting from the tidyr::pivot_wider() function, but with the history graph updated with a message from .messages if requested.
tidyr::pivot_wider()
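No example accompanies this function either. The sketch below uses a small made-up dataframe and assumes, as for the dplyr verbs elsewhere, that pivot_wider() on a tracked dataframe dispatches to the dtrackr version.

```r
library(dplyr)
library(tidyr)
library(dtrackr)

# A small illustrative data set in long format
long = tibble::tibble(
  id = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

wide = long %>%
  track() %>%
  pivot_wider(
    names_from = key,
    values_from = value,
    .messages = "one row per id",
    .headline = "Reshape to wide format"
  )

wide %>% history()
nrow(wide)
```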
Summarising a data set acts in the normal dplyr manner to collapse groups to individual rows. Any columns resulting from the summary can be added to the history graph. In the history this also joins any stratified branches and allows you to generate some summary statistics about the un-grouped data. See dplyr::summarise().
p_reframe(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Returning values with size 0 or >1 was
deprecated as of 1.1.0. Please use
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::reframe()
library(dplyr)
library(dtrackr)

tmp = iris %>% group_by(Species) %>% track()
tmp %>% reframe(tibble(
  param = c("mean","min","max"),
  value = c(mean(Petal.Length), min(Petal.Length), max(Petal.Length))
  ), .messages="length {param}: {value}") %>% history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
p_relocate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::relocate()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# relocate, this shows how the columns can be reordered
iris %>%
  track() %>%
  group_by(Species) %>%
  relocate(
    tidyselect::starts_with("Sepal"),
    .after=Species,
    .messages="{.cols}",
    .headline="Order of columns from relocate:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with() and dplyr::arrange() for more details on the underlying functions.
dtrackr provides equivalent functions for mutating, selecting and renaming a data set, which act in the same way as their dplyr counterparts.
mutate / select / rename generally add nothing to the provenance of the data, so by default these steps are omitted from the dtrackr history.
This can be overridden by supplying the .messages or .headline values, in which case they behave just like a comment().
p_rename(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::rename(); see its documentation for the accepted values. |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::rename()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
  track() %>%
  group_by(Species) %>%
  rename(
    Stamen.Width = Sepal.Width,
    Stamen.Length = Sepal.Length,
    .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
    .headline="Renamed columns:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with() and dplyr::arrange() for more details on the underlying functions.
dtrackr provides equivalent functions for mutating, selecting and renaming a data set, which act in the same way as their dplyr counterparts.
mutate / select / rename generally add nothing to the provenance of the data, so by default these steps are omitted from the dtrackr history.
This can be overridden by supplying the .messages or .headline values, in which case they behave just like a comment().
p_rename_with(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::rename_with(); see its documentation for the accepted values. |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::rename_with()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
  track() %>%
  group_by(Species) %>%
  rename(
    Stamen.Width = Sepal.Width,
    Stamen.Length = Sepal.Length,
    .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
    .headline="Renamed columns:") %>%
  history()
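The shipped example exercises rename() rather than rename_with(). The following is a minimal sketch of the function-based variant, assuming that the tracked method forwards its unnamed arguments to dplyr::rename_with() unchanged (as the usage above suggests). The choice of toupper and of the Sepal columns is purely illustrative.

```r
library(dplyr)
library(dtrackr)

# Upper-case the Sepal columns and record the change in the history.
# {.new_cols} and {.dropped_cols} are the glue variables documented for
# the .messages argument.
renamed = iris %>%
  track() %>%
  rename_with(
    toupper,
    tidyselect::starts_with("Sepal"),
    .messages = c("added {.new_cols}", "dropped {.dropped_cols}"),
    .headline = "Renamed columns:"
  )

# Inspect the recorded stage and the resulting column names
renamed %>% history()
colnames(renamed)
```

Because a headline and messages are supplied, the rename step should appear as its own stage in the history; without them it would be skipped by default.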
This may reset the grouping of the tracked data if the grouping structure has changed since the data frame was paused.
If you try to resume tracking a data frame with too many groups (as defined by options("dtrackr.max_supported_groupings"=XX)) then the resume will fail and the data frame will remain paused.
This can be overridden by specifying a value for the .maxgroups parameter.
p_resume(.data, ...)
.data |
a tracked dataframe |
... |
Named arguments passed on to
|
the .data data frame with history graph tracking resumed
library(dplyr)
library(dtrackr)

iris %>% track() %>% pause() %>% resume() %>% history()
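The example above pauses and resumes without doing anything in between. As a rough sketch of why the pair is useful, the pipeline below performs a filter while tracking is paused, on the assumption that steps run between pause() and resume() are left out of the history. The filter condition and the comment texts are illustrative only.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>%
  track() %>%
  comment("{.count} rows at start") %>%
  pause() %>%
  # While paused, this step is expected to run as plain dplyr and
  # not to add a stage to the history graph.
  filter(Species != "setosa") %>%
  resume() %>%
  comment("{.count} rows after resume")

tracked %>% history()
```

The two comment() stages bracket the paused region, so the change in row count is still visible in the history even though the filter itself is not recorded.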
Mutating joins behave as dplyr joins, except the history graph of the two sides of the joins is merged resulting in a tracked dataframe with the history of both input dataframes. See dplyr::right_join() for more details on the underlying functions.
p_right_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Right join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed on to methods, including named arguments passed on to dplyr::right_join(). |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::right_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Right join
join = lhs %>% right_join(rhs, by="name", multiple = "all") %>%
  comment("joined {.total}")

# See what the history of the graph is:
join %>% history()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with() and dplyr::arrange() for more details on the underlying functions.
dtrackr provides equivalent functions for mutating, selecting and renaming a data set, which act in the same way as their dplyr counterparts.
mutate / select / rename generally add nothing to the provenance of the data, so by default these steps are omitted from the dtrackr history.
This can be overridden by supplying the .messages or .headline values, in which case they behave just like a comment().
p_select(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::select(); see its documentation for the accepted values. |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::select()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# select
# The output of the select verb (here using tidyselect syntax) can be captured
# and here all column names are being reported with the .cols variable.
iris %>%
  track() %>%
  group_by(Species) %>%
  select(
    tidyselect::starts_with("Sepal"),
    .messages="{.cols}",
    .headline="Output columns from select:") %>%
  history()
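To make the default behaviour described for this function concrete, here is a small sketch that runs the same select() twice, once silently and once with a headline and message. The column choice and the headline text are illustrative; the data returned should be the same in both cases, and only the recorded history is expected to differ.

```r
library(dplyr)
library(dtrackr)

# Default: .messages and .headline are empty, so no stage is added
silent = iris %>%
  track() %>%
  select(Species, Petal.Length)

# With a headline and a message the step is recorded like a comment()
recorded = iris %>%
  track() %>%
  select(
    Species, Petal.Length,
    .headline = "Columns kept:",
    .messages = "{.cols}"
  )

# Compare the two histories side by side
silent %>% history()
recorded %>% history()
```

Comparing the two printed histories is a quick way to check whether a given step is contributing a stage to the eventual flowchart.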
Mutating joins behave as dplyr joins, except the history graph of the two sides of the joins is merged resulting in a tracked dataframe with the history of both input dataframes. See dplyr::semi_join() for more details on the underlying functions.
p_semi_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in intersection"),
  .headline = "Semi join by {.keys}"
)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed on to methods, including named arguments passed on to dplyr::semi_join(). |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::semi_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Semi join
join = lhs %>% semi_join(rhs, by="name") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
This is unlikely to be useful to an end user and is called automatically by many of the other functions here. It is available on the off chance that you need to copy history metadata from one dataframe to another.
p_set(.data, .graph)
.data |
a dataframe which may be grouped |
.graph |
a history graph list (consisting of nodes, edges, and head) see examples |
the .data dataframe with the history graph metadata set to the provided value
library(dplyr)
library(dtrackr)

mtcars %>% p_set(iris %>% comment("A comment") %>% p_get()) %>% history()
These perform set operations on tracked dataframes. They merge the history of two (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects each function performs exactly the same operation as the equivalent dplyr operation.
See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff() or dplyr::union_all() for the underlying function details.
p_setdiff(
  x,
  y,
  ...,
  .messages = "{.count.out} items in difference",
  .headline = "Difference"
)
x, y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::setdiff()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs, rhs) keeps rows in lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
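The example above covers all of the set operations at once. For the difference on its own, a smaller sketch may be easier to follow. The two toy tibbles, their track() labels and the headline text are made up for illustration; {.count.out} is the glue variable documented for .messages.

```r
library(dplyr)
library(dtrackr)

# Two small tracked data sets with overlapping ids (illustrative data)
a = tibble(id = 1:5) %>% track("set a")
b = tibble(id = 4:8) %>% track("set b")

# Rows of a that do not appear in b, with a custom headline and message
only_a = setdiff(
  a, b,
  .headline = "In a but not in b",
  .messages = "{.count.out} ids"
)

only_a %>% history()
nrow(only_a)
```

The resulting history should contain the stages of both inputs followed by the difference stage, which is what makes the merged provenance visible in a flowchart.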
Slice operations behave as in dplyr, except that the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max() and dplyr::slice_sample() for more details on the underlying functions.
p_slice(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice()
library(dplyr)
library(dtrackr)

# an arbitrary 50 items from the iris dataframe are selected. The
# history is tracked
iris %>% track() %>% slice(51:100) %>% history()
Slice operations behave as in dplyr, except that the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max() and dplyr::slice_sample() for more details on the underlying functions.
p_slice_head(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_head()
library(dplyr)
library(dtrackr)

# the first 50% of the data frame is taken and the history tracked
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_head(prop=0.5,
             .messages="{.count.out} / {.count.in}",
             .headline="First {sprintf('%1.0f',prop*100)}%") %>%
  history()

# The last 100 items:
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_tail(n=100,
             .messages="{.count.out} / {.count.in}",
             .headline="Last 100") %>%
  history()
Slice operations behave as in dplyr, except that the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max() and dplyr::slice_sample() for more details on the underlying functions.
p_slice_max(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_max()
library(dplyr)
library(dtrackr)

# Subset the data by the maximum of a given value
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_max(prop=0.5, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} = {prop} (with ties)",
            .headline="Widest 50% Sepals") %>%
  history()

# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_min(prop=0.25, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} (with ties)",
            .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
  history()
Slice operations behave as in dplyr, except that the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max() and dplyr::slice_sample() for more details on the underlying functions.
p_slice_min(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_min()
library(dplyr)
library(dtrackr)

# Subset the data by the maximum of a given value
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_max(prop=0.5, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} = {prop} (with ties)",
            .headline="Widest 50% Sepals") %>%
  history()

# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_min(prop=0.25, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} (with ties)",
            .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
  history()
Slice operations behave as in dplyr, except that the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max() and dplyr::slice_sample() for more details on the underlying functions.
p_slice_sample(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_sample()
library(dplyr)
library(dtrackr)

# In this example the iris dataframe is resampled 100 times with replacement
# within each group.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_sample(n=100, replace=TRUE,
               .messages="{.count.out} / {.count.in} = {n}",
               .headline="100 {Species}") %>%
  history()
Slice operations behave as in dplyr, except that the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max() and dplyr::slice_sample() for more details on the underlying functions.
p_slice_tail(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, or {.count.in} and {.count.out} for the input and output dataframe sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_tail()
library(dplyr)
library(dtrackr)

# The first 50% of the data frame is taken and the history tracked
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_head(
    prop = 0.5,
    .messages = "{.count.out} / {.count.in}",
    .headline = "First {sprintf('%1.0f',prop*100)}%"
  ) %>%
  history()

# The last 100 items:
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_tail(
    n = 100,
    .messages = "{.count.out} / {.count.in}",
    .headline = "Last 100"
  ) %>%
  history()
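As a further sketch, the difference between the input and output sizes can be computed directly inside a glue spec from {.count.in} and {.count.out}. The variable name top10 and the wording of the messages are illustrative assumptions, not part of the package documentation.

```r
library(dplyr)
library(dtrackr)

# Keep the first 10 rows of each species. The number of dropped rows is
# computed inside the glue spec from .count.in and .count.out.
top10 = iris %>%
  track() %>%
  group_by(Species) %>%
  slice_head(
    n = 10,
    .messages = c(
      "{.count.in} before",
      "{.count.out} after",
      "{.count.in - .count.out} dropped"
    ),
    .headline = "First 10 of {Species}"
  )

# Inspect the recorded stages
top10 %>% history()
```

Because glue evaluates arbitrary R expressions inside the braces, this avoids relying on any variable beyond the two counts.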
In the middle of a pipeline you may wish to document something about the data that is more complex than the simple counts. status is essentially a dplyr summarisation step which is connected to a glue specification output that is recorded in the data frame history. This means you can do an arbitrary interim summarisation and put the result into the flowchart without disrupting the pipeline flow.
p_status( .data, ..., .messages = .defaultMessage(), .headline = .defaultHeadline(), .type = "info", .asOffshoot = FALSE, .tag = NULL )
.data |
a dataframe which may be grouped |
... |
any normal dplyr::summarise specification, e.g. |
.messages |
a character vector of glue specifications. A glue specification can refer to the summary outputs, any grouping variables of .data, the {.strata}, or any variables defined in the calling environment |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Because the summary specification is passed through ..., all other parameters MUST BE NAMED.
the same .data dataframe with the history metadata updated with the status inserted as a new stage
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species)
tmp %>%
  status(
    long = p_count_if(Petal.Length>5),
    short = p_count_if(Petal.Length<2),
    .messages="{Species}: {long} long ones & {short} short ones"
  ) %>%
  history()
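Any ordinary summary function can be used in the same way. The following is a minimal sketch, assuming a per-group mean is the quantity of interest; the names mean_len and checked are hypothetical.

```r
library(dplyr)
library(dtrackr)

# status() records an interim summary in the history but passes the data
# through unchanged, so the pipeline can carry on afterwards.
checked = iris %>%
  track() %>%
  group_by(Species) %>%
  status(
    mean_len = mean(Petal.Length),
    .messages = "{Species}: mean petal length {round(mean_len, 2)}",
    .headline = "Interim summary"
  )

checked %>% history()
nrow(checked)
```

Note that .messages and .headline are passed by name, as required by the ... summary specification.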
Summarising a data set acts in the normal dplyr manner to collapse groups to individual rows. Any columns resulting from the summary can be added to the history graph. In the history this also joins any stratified branches and allows you to generate some summary statistics about the un-grouped data. See dplyr::summarise().
p_summarise(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
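<data-masking> Name-value pairs of summary functions. The name will be the name of the variable in the result. The value can be: a vector of length 1, e.g. min(x), n(), or sum(is.na(y)); or a data frame, to add multiple columns from a single expression. Returning values with size 0 or >1 was deprecated as of 1.1.0. Please use reframe() for this instead. |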
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::summarise()
library(dplyr)
library(dtrackr)

tmp = iris %>% group_by(Species) %>% track()
tmp %>%
  summarise(avg = mean(Petal.Length), .messages="{avg} length") %>%
  history()
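A slightly fuller sketch, assuming two summary columns and a headline are wanted; the column names avg and rows and the message wording are illustrative only.

```r
library(dplyr)
library(dtrackr)

by_species = iris %>% group_by(Species) %>% track()

# Each summary column defined in ... is available to the glue specs,
# alongside {.strata}, which describes the current group.
result = by_species %>%
  summarise(
    avg = mean(Petal.Length),
    rows = n(),
    .messages = "{.strata}: mean {round(avg, 2)} from {rows} rows",
    .headline = "Petal length"
  )

result %>% history()
result
```

The returned data frame has one row per species, exactly as dplyr::summarise() would produce.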
Any counts at the individual stages that were stored with a .tag option in a pipeline step can be recovered here. The idea is to provide a quick way to pull a single value for the counts, or other details tagged in a pipeline, into a format that can be reported in the text of a document (e.g. for a results section). The consort statement vignette has further examples of use.
p_tagged(.data, .tag = NULL, .strata = NULL, .glue = NULL, ...)
.data |
the tracked dataframe. |
.tag |
(optional) the tag to retrieve. |
.strata |
(optional) filter the tagged data by the strata. set to "" to filter just the top level ungrouped data. |
.glue |
(optional) a glue specification which will be applied to the tagged content to generate a |
... |
(optional) any other named parameters will be passed to |
Various things depending on what is requested.
By default a tibble with a .tag column and all associated summary values in a nested .content column.
If .strata is specified the results are filtered to just those that match the given .strata grouping (i.e. the grouping label on the flowchart). Ungrouped content has an empty string "" as .strata.
If .tag is specified the result is for a single tag, and .content is automatically un-nested to give a single dataframe of the content captured at the tagged step. This could be single or multiple rows depending on whether the original data was grouped at the point of tagging.
If both .tag and .glue are specified a .label column is computed from .glue and the tagged content. If the result of this is a single row then just the string value of .label is returned.
If just .glue is specified, an un-nested dataframe with .tag, .strata and .label columns is returned, with a label for each tag in each strata.
If this seems complex then the best thing is to experiment until you get the output you want, leaving any .glue options until you think you know what you are doing. It made sense at the time.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1")
tmp = tmp %>% filter(Species!="versicolor") %>% group_by(Species)
tmp %>% comment(.tag="step2") %>% tagged(.glue = "{.count}/{.total}")
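The case where both .tag and .glue are given can be sketched as follows. This assumes the tagged step was ungrouped, so that a single label is produced; the name label is hypothetical.

```r
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1")

# Asking for one tag with a glue spec: with ungrouped data at the tagged
# step there is a single row, so a single label is expected back rather
# than a dataframe.
label = tmp %>% tagged(.tag = "step1", .glue = "{.count}/{.total}")
print(label)
```

A value like this can be dropped straight into inline text of a report, which is the use case described above.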
Start tracking the dtrackr history graph
p_track( .data, .messages = .defaultMessage(), .headline = .defaultHeadline(), .tag = NULL )
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue
specification can refer to any grouping variables of .data, or any
variables defined in the calling environment, the {.total} variable which
is the count of all rows, the {.count} variable which is the count of
rows in the current group and the {.strata} which describes the current
group. Defaults to the value of |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable which is |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with additional history graph metadata, to allow tracking.
library(dplyr)
library(dtrackr)

iris %>% track() %>% history()
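The first stage can also be given its own text. This is a minimal sketch using the documented {.count} variable; the message wording and the name tracked are assumptions made for illustration.

```r
library(dplyr)
library(dtrackr)

# Start tracking with a custom headline and message for the first stage
tracked = iris %>%
  track(
    .messages = "{.count} flowers loaded",
    .headline = "Iris data"
  )

tracked %>% history()
```

The data itself is unchanged; only the history graph metadata is attached.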
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
p_transmute(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::transmute()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# In this example we compare the column names of the input and the
# output to identify the new columns created by the transmute operation as
# the `.new_cols` variable
iris %>%
  track() %>%
  group_by(Species, .add=TRUE) %>%
  transmute(
    sepal.w = Sepal.Width-1,
    sepal.l = Sepal.Length+1,
    .messages="{.new_cols}",
    .headline="New columns from transmute:") %>%
  history()
Un-grouping a data set logically combines the different arms. In the history this joins any stratified branches and acts as a specific type of status(), allowing you to generate some summary statistics about the un-grouped data. See dplyr::ungroup().
p_ungroup( x, ..., .messages = .defaultMessage(), .headline = .defaultHeadline(), .tag = NULL )
x |
A |
... |
variables to remove from the grouping. |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count}. The default is "total {.count} items". |
.headline |
a headline glue spec. The glue code can use {.count} and {.strata}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe but ungrouped with the history graph updated showing the ungroup operation as a new stage.
dplyr::ungroup()
library(dplyr)
library(dtrackr)

tmp = iris %>% group_by(Species) %>% comment("A test")
tmp %>% ungroup(.messages="{.count} items in combined") %>% history()
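The same operation with an explicit track() call can be sketched as below, checking that the result is no longer grouped. The names by_species and combined are hypothetical.

```r
library(dplyr)
library(dtrackr)

by_species = iris %>% track() %>% group_by(Species)

# Ungrouping joins the per-species branches back into a single stage
combined = by_species %>%
  ungroup(.messages = "{.count} items in combined")

combined %>% history()

# The data is ungrouped, as with dplyr::ungroup()
is_grouped_df(combined)
```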
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects they perform exactly the same operation as the equivalent dplyr function. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
p_union( x, y, ..., .messages = "{.count.out} unique items in union", .headline = "Distinct union" )
x, y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::union()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
set %>% history()
nrow(set)
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
set %>% history()
nrow(set)
# set %>% flowchart()

# Intersections and differences
# rows in lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
set %>% history()
nrow(set)
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
set %>% history()
nrow(set)
# set %>% flowchart()
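The row-level behaviour of the four set operations is easier to see on a toy input. The sketch below assumes two small, independently tracked tibbles a and b (both hypothetical) and relies on the statement above that the tracked versions return the same rows as their dplyr equivalents.

```r
library(dplyr)
library(dtrackr)

# Two small tracked inputs that overlap on id 2 and 3
a = tibble(id = 1:3) %>% track()
b = tibble(id = 2:5) %>% track()

distinct_rows = union(a, b)      # each id kept once
all_rows      = union_all(a, b)  # duplicates kept
common_rows   = intersect(a, b)  # ids present in both
only_in_a     = setdiff(a, b)    # ids in a but not in b

# Compare the sizes of the four results
c(
  union = nrow(distinct_rows),
  union_all = nrow(all_rows),
  intersect = nrow(common_rows),
  setdiff = nrow(only_in_a)
)

# The merged history of both inputs is carried on the result
distinct_rows %>% history()
```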
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects they perform exactly the same operation as the equivalent dplyr function. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
p_union_all( x, y, ..., .messages = "{.count.out} items in union", .headline = "Union" )
x, y |
Pair of compatible data frames. A pair of data frames is compatible if they have the same column names (possibly in different orders) and compatible types. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::union_all()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
set %>% history()
nrow(set)
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
set %>% history()
nrow(set)
# set %>% flowchart()

# Intersections and differences
# rows in lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
set %>% history()
nrow(set)
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
set %>% history()
nrow(set)
# set %>% flowchart()
Remove tracking from the dataframe
p_untrack(.data)
.data |
a tracked dataframe |
the .data dataframe with history graph metadata removed.
library(dplyr)
library(dtrackr)

iris %>% track() %>% untrack() %>% class()
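To make the effect visible, the class of the data frame can be compared before and after. This sketch assumes that tracked data frames carry the trackr_df class named in the S3 method signatures of this manual; tracked and plain are hypothetical names.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>% track()
plain = tracked %>% untrack()

# Tracked data frames are expected to carry the trackr_df class
inherits(tracked, "trackr_df")

# After untrack() the class should be gone, while the data is unchanged
inherits(plain, "trackr_df")
nrow(plain)
```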
Pausing tracking of a data frame may be required if an operation is about to be performed that creates a lot of groupings, or that would otherwise pollute the history graph (e.g. selecting something using an anti-join). Once paused the history is not updated until resume() is called, or until the data frame is ungrouped (if auto is enabled).
pause(.data, auto = FALSE)
.data |
a tracked dataframe |
auto |
if |
the .data dataframe with history graph tracking paused
iris %>% track() %>% pause() %>% history()
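A pause is normally paired with the resume() mentioned above. The sketch below assumes a filter step that should stay out of the history; the names paused and resumed and the comment text are illustrative.

```r
library(dplyr)
library(dtrackr)

paused = iris %>% track() %>% pause()

# The filter is applied to the data while tracking is paused, so it is
# not expected to appear as a stage in the history.
resumed = paused %>%
  filter(Species != "setosa") %>%
  resume() %>%
  comment("{.count} rows after resuming")

resumed %>% history()
nrow(resumed)
```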
tidyr::pivot_longer
A drop-in replacement for tidyr::pivot_longer() which optionally takes a message and headline to store in the history graph.
## S3 method for class 'trackr_df'
pivot_longer(data, ..., .messages = "", .headline = "", .tag = NULL)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the result of tidyr::pivot_longer() but with the history graph updated.
tidyr::pivot_longer()
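This section has no example of its own, so the following is a minimal sketch of a tracked reshape. The column names measure and value and the message text are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(dtrackr)

# Reshape the four measurement columns into name / value pairs and
# record the step in the history.
long = iris %>%
  track() %>%
  pivot_longer(
    cols = -Species,
    names_to = "measure",
    values_to = "value",
    .messages = "reshaped to long format",
    .headline = "pivot_longer"
  )

long %>% history()
dim(long)
```

Without .messages or .headline the reshape is still performed, but nothing is added to the history.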
tidyr::pivot_wider
A drop-in replacement for tidyr::pivot_wider() which optionally takes a message and headline to store in the history graph.
## S3 method for class 'trackr_df'
pivot_wider(data, ..., .messages = "", .headline = "", .tag = NULL)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the result of tidyr::pivot_wider() but with the history graph updated with a .messages entry if requested.
tidyr::pivot_wider()
Plots a history graph as html
## S3 method for class 'trackr_graph'
plot(x, fill = "lightgrey", fontsize = "8", colour = "black", ...)
x |
a dtrackr history graph (e.g. output from |
fill |
the default node fill colour |
fontsize |
the default font size |
colour |
the default font colour |
... |
not used |
HTML displayed
library(dplyr)
library(dtrackr)

iris %>% comment("hello {.total} rows") %>% history() %>% plot()
Print a history graph to the console
## S3 method for class 'trackr_graph'
print(x, ...)
x |
a dtrackr history graph (e.g. output from |
... |
not used |
nothing
library(dplyr)
library(dtrackr)

iris %>% comment("hello {.total} rows") %>% history() %>% print()
Summarising a data set acts in the normal dplyr manner to collapse groups to individual rows. Any columns resulting from the summary can be added to the history graph. In the history this also joins any stratified branches and allows you to generate some summary statistics about the un-grouped data. See dplyr::summarise().
## S3 method for class 'trackr_df'
reframe(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
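<data-masking> Name-value pairs of summary functions. The name will be the name of the variable in the result. The value can be: a vector of length 1, e.g. min(x), n(), or sum(is.na(y)); or a data frame, to add multiple columns from a single expression. Returning values with size 0 or >1 was deprecated as of 1.1.0. Please use reframe() for this instead. |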
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::reframe()
library(dplyr)
library(dtrackr)

tmp = iris %>% group_by(Species) %>% track()
tmp %>%
  reframe(
    tibble(
      param = c("mean","min","max"),
      value = c(mean(Petal.Length), min(Petal.Length), max(Petal.Length))
    ),
    .messages="length {param}: {value}"
  ) %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
relocate(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::relocate()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# relocate, this shows how the columns can be reordered
iris %>%
  track() %>%
  group_by(Species) %>%
  relocate(
    tidyselect::starts_with("Sepal"),
    .after=Species,
    .messages="{.cols}",
    .headline="Order of columns from relocate:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
rename_with(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::rename_with()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
  track() %>%
  group_by(Species) %>%
  rename(
    Stamen.Width = Sepal.Width,
    Stamen.Length = Sepal.Length,
    .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
    .headline="Renamed columns:") %>%
  history()
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
rename(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::rename()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
  track() %>%
  group_by(Species) %>%
  rename(
    Stamen.Width = Sepal.Width,
    Stamen.Length = Sepal.Length,
    .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
    .headline="Renamed columns:") %>%
  history()
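The description above says that mutate, like rename and select, is left out of the history unless .messages or .headline is given. A minimal sketch of that override for mutate follows; the derived column name Sepal.Area is purely illustrative and not part of the package.

```r
library(dplyr)
library(dtrackr)

# Sepal.Area is a hypothetical derived column used only for illustration.
tracked = iris %>%
  track() %>%
  mutate(
    Sepal.Area = Sepal.Length * Sepal.Width,
    # {.new_cols} is documented as the glue variable for added columns
    .messages = "added {.new_cols}",
    .headline = "Derived columns:"
  )

# The mutate step should now appear as a stage in the history
tracked %>% history()
```

Leaving .messages and .headline at their empty defaults would be expected to perform the same mutate without adding a stage.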
This may reset the grouping of the tracked data if the grouping structure has changed since the data frame was paused. If you try to resume tracking a data frame with too many groups (as defined by options("dtrackr.max_supported_groupings"=XX)) then the resume will fail and the data frame will still be paused. This can be overridden by specifying a value for the .maxgroups parameter.
resume(.data, ...)
.data |
a tracked dataframe |
... |
Named arguments passed on to
|
the .data data frame with history graph tracking resumed
library(dplyr)
library(dtrackr)

iris %>%
  track() %>%
  pause() %>%
  resume() %>%
  history()
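As a further sketch, the pipeline below does some work while tracking is paused and then resumes. It assumes that steps run between pause() and resume() are not recorded in the history; the option value of 16 is an arbitrary illustrative choice for the grouping limit mentioned above.

```r
library(dplyr)
library(dtrackr)

# Grouping limit consulted when resuming; 16 is an arbitrary example value.
options("dtrackr.max_supported_groupings" = 16)

resumed = iris %>%
  track() %>%
  comment("start: {.count} rows") %>%
  pause() %>%
  # assumed not to be recorded, because tracking is paused here
  filter(Petal.Length > 1.5) %>%
  resume() %>%
  comment("after paused filter: {.count} rows")

resumed %>% history()
```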
Mutating joins behave as dplyr joins, except the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::right_join() for more details on the underlying functions.
## S3 method for class 'trackr_df'
right_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in linked set"),
  .headline = "Right join by {.keys}"
)
x , y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::right_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Full join
join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
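The example above demonstrates a full join. A sketch of the right join documented in this entry, built on the same starwars data, might look as follows; the wording of the custom .messages is illustrative, while the glue variables are those listed in the arguments.

```r
library(dplyr)
library(dtrackr)

people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name, films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}")

# Right join: every row of the RHS (film appearances) is retained
joined = lhs %>%
  right_join(
    rhs,
    by = "name",
    .messages = c(
      "{.count.lhs} people",
      "{.count.rhs} film appearances",
      "{.count.out} rows out"
    ),
    .headline = "Right join by {.keys}"
  )

joined %>% history()
nrow(joined)
```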
Convert a digraph in dot format to SVG and save it to a range of output file types
save_dot(
  dot,
  filename,
  size = std_size$half,
  maxWidth = size$width,
  maxHeight = size$height,
  formats = c("dot", "png", "pdf", "svg"),
  landscape = size$rot != 0,
  ...
)
dot |
a |
filename |
the full path of the file name (minus extension for multiple formats) |
size |
a named list with 3 elements: width and height in inches, and rotation. A predefined set of standard sizes is available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
ignored |
a list with items paths, with the absolute paths of the saved files as a named list, and svg, as the SVG string of the rendered dot file.
save_dot("digraph {A->B}",tempfile())
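A slightly fuller sketch restricts the output formats and picks one of the predefined sizes. The choice of std_size$quarter and of the "dot" and "svg" formats is illustrative; both are taken from the defaults and size names listed in this manual.

```r
library(dtrackr)

# Write only the dot source and the SVG rendering, at quarter-page size
out = save_dot(
  "digraph {A->B}",
  tempfile(),
  size = std_size$quarter,
  formats = c("dot", "svg")
)

# Named list of the absolute paths that were written
out$paths
```

The svg item of the result holds the rendered SVG as a string, which can be reused without rereading the file.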
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data, so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
## S3 method for class 'trackr_df'
select(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::select()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# select
# The output of the select verb (here using tidyselect syntax) can be captured
# and here all column names are being reported with the .cols variable.
iris %>%
  track() %>%
  group_by(Species) %>%
  select(
    tidyselect::starts_with("Sepal"),
    .messages="{.cols}",
    .headline="Output columns from select:") %>%
  history()
Mutating joins behave as dplyr joins, except the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::semi_join() for more details on the underlying functions.
## S3 method for class 'trackr_df'
semi_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} in intersection"),
  .headline = "Semi join by {.keys}"
)
x , y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
the join of the two dataframes with the history graph updated.
dplyr::semi_join()
library(dplyr)
library(dtrackr)

# Joins across data sets

# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
  comment("a test comment")

# Semi join
join = lhs %>% semi_join(rhs, by="name") %>% comment("joined {.total}")

# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)

# Display the tracked graph (not run in examples)
# join %>% flowchart()
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
## S3 method for class 'trackr_df'
setdiff(
  x,
  y,
  ...,
  .messages = "{.count.out} items in difference",
  .headline = "Difference"
)
x , y
|
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::setdiff()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)

chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs,rhs) keeps rows of lhs that are absent from rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
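To isolate the set difference documented in this entry, the sketch below reuses the same two subsets and overrides the default message and headline. The message wording is illustrative; {.count.out} is the glue variable listed in the arguments.

```r
library(dplyr)
library(dtrackr)

people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
)

# Rows present in lhs but not in rhs: the humans cancel, the droids remain
only_lhs = setdiff(
  lhs, rhs,
  .messages = "{.count.out} rows only in LHS",
  .headline = "LHS minus RHS"
)

only_lhs %>% history()
unique(only_lhs$species)
```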
Slice operations behave as in dplyr, except the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe. See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more details on the underlying functions.
## S3 method for class 'trackr_df'
slice_head(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_head()
library(dplyr)
library(dtrackr)

# the first 50% of each group of the data frame is taken and the history tracked
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
             .headline="First {sprintf('%1.0f',prop*100)}%") %>%
  history()

# The last 100 items:
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_tail(n=100,.messages="{.count.out} / {.count.in}",
             .headline="Last 100") %>%
  history()
Slice operations behave as in dplyr, except the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe. See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more details on the underlying functions.
## S3 method for class 'trackr_df'
slice_max(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_max()
library(dplyr)
library(dtrackr)

# Subset the data by the maximum of a given value
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_max(prop=0.5, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} = {prop} (with ties)",
            .headline="Widest 50% Sepals") %>%
  history()

# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_min(prop=0.25, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} (with ties)",
            .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
  history()
Slice operations behave as in dplyr, except the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe. See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more details on the underlying functions.
## S3 method for class 'trackr_df'
slice_min(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_min()
library(dplyr)
library(dtrackr)

# Subset the data by the maximum of a given value
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_max(prop=0.5, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} = {prop} (with ties)",
            .headline="Widest 50% Sepals") %>%
  history()

# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_min(prop=0.25, order_by = Sepal.Width,
            .messages="{.count.out} / {.count.in} (with ties)",
            .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
  history()
Slice operations behave as in dplyr, except the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe. See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more details on the underlying functions.
## S3 method for class 'trackr_df'
slice_sample(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_sample()
library(dplyr)
library(dtrackr)

# In this example 100 rows are sampled with replacement
# within each group of the iris dataframe
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_sample(n=100, replace=TRUE,
               .messages="{.count.out} / {.count.in} = {n}",
               .headline="100 {Species}") %>%
  history()
Slice operations behave as in dplyr, except the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe. See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more details on the underlying functions.
## S3 method for class 'trackr_df'
slice_tail(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice_tail()
library(dplyr)
library(dtrackr)

# the first 50% of each group of the data frame is taken and the history tracked
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
             .headline="First {sprintf('%1.0f',prop*100)}%") %>%
  history()

# The last 100 items:
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_tail(n=100,.messages="{.count.out} / {.count.in}",
             .headline="Last 100") %>%
  history()
Slice operations behave as in dplyr, except the history graph of the tracked dataframe can be updated with the before and after sizes of the dataframe. See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(), dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more details on the underlying functions.
## S3 method for class 'trackr_df'
slice(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
the sliced dataframe with the history graph updated.
dplyr::slice()
library(dplyr)
library(dtrackr)

# an arbitrary 50 items from the iris dataframe are selected. The
# history is tracked
iris %>% track() %>% slice(51:100) %>% history()
In the middle of a pipeline you may wish to document something about the data that is more complex than the simple counts. status is essentially a dplyr summarisation step connected to a glue specification, the output of which is recorded in the data frame history. This means you can do an arbitrary interim summarisation and put the result into the flowchart without disrupting the pipeline flow.
status(
  .data,
  ...,
  .messages = .defaultMessage(),
  .headline = .defaultHeadline(),
  .type = "info",
  .asOffshoot = FALSE,
  .tag = NULL
)
.data |
a dataframe which may be grouped |
... |
any normal dplyr::summarise specification, e.g. |
.messages |
a character vector of glue specifications. A glue specification can refer to the summary outputs, any grouping variables of .data, the {.strata}, or any variables defined in the calling environment |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Because of the ... summary specification parameters MUST BE NAMED.
the same .data dataframe with the history metadata updated with the status inserted as a new stage
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species)
tmp %>% status(
  long = p_count_if(Petal.Length>5),
  short = p_count_if(Petal.Length<2),
  .messages="{Species}: {long} long ones & {short} short ones"
) %>% history()
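Since the summary specification accepts any normal dplyr::summarise expression, plain aggregates can be used in place of p_count_if(). The sketch below is one such variation; the summary names mean_petal and n_wide and the 1.5 threshold are illustrative only.

```r
library(dplyr)
library(dtrackr)

summarised = iris %>%
  track() %>%
  group_by(Species) %>%
  status(
    # named summary expressions, evaluated per group
    mean_petal = mean(Petal.Length),
    n_wide = sum(Petal.Width > 1.5),
    .messages = "{Species}: mean petal length {sprintf('%1.2f', mean_petal)}, {n_wide} wide petals",
    .headline = "Petal summary"
  )

# The data itself is unchanged; only the history gains a stage
summarised %>% history()
nrow(summarised)
```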
A list of standard paper sizes for outputting flowcharts or other dot graphs. These include width and height dimensions in inches and can be used as one way to specify the output size of a dot graph, including flowcharts (see the size parameter of flowchart()).
std_size
An object of class list of length 12.
The sizes available are A4, A5, full (fits a portrait A4 with margins), half (half an A4 with margins), third, two_third, quarter, sixth (all with reference to an A4 page with margins). There are 2 landscape sizes, A4_landscape and full_landscape, which fit an A4 page with or without margins. There are also 2 slide dimensions, to fit with standard presentation software dimensions.
This is just a convenience. Similar effects can be achieved by providing width and height parameters to flowchart() directly.
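A quick way to see what is available is to inspect the list itself. This is a small sketch; the exact element names inside each entry are not stated above, so `str()` is used to look at one entry and nothing is assumed about its layout.

```r
library(dtrackr)

# Names of the predefined sizes
names(std_size)

# Inspect one entry; per the description it should carry width and height
# dimensions in inches (the element names themselves are not assumed here).
str(std_size$A4)

# not run - an entry can be handed to the size parameter of flowchart():
# iris %>% track() %>% flowchart(size = std_size$half)
```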
Summarising a data set acts in the normal dplyr manner to collapse groups to individual rows. Any columns resulting from the summary can be added to the history graph. In the history this also joins any stratified branches and allows you to generate some summary statistics about the un-grouped data. See dplyr::summarise().
## S3 method for class 'trackr_df'
summarise(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | < The value can be: Returning values with size 0 or >1 was deprecated as of 1.1.0. Please use |
.messages | a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline | a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
dplyr::summarise()
library(dplyr)
library(dtrackr)

tmp = iris %>% group_by(Species) %>% track()
tmp %>%
  summarise(avg = mean(Petal.Length), .messages="{avg} length") %>%
  history()
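The `.tag` parameter can be combined with `tagged()` to pull the summary back out later. The following is a sketch only: the tag name `petal_summary`, the headline and the message wording are hypothetical, and the exact columns returned by `tagged()` are not shown because they are not specified above.

```r
library(dplyr)
library(dtrackr)

summary_df = iris %>%
  track() %>%
  group_by(Species) %>%
  summarise(
    avg = mean(Petal.Length),
    # the glue spec refers to the summary variable defined above
    .messages = "mean petal length {round(avg, 2)}",
    .headline = "Petal length by species",
    .tag = "petal_summary"  # hypothetical tag name
  )

# One row per species after the groups are collapsed
summary_df

# Recover the content stored at the tagged step
tagged_content = summary_df %>% tagged(.tag = "petal_summary")
tagged_content
```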
Any counts at the individual stages that were stored with a .tag option in a pipeline step can be recovered here. The idea is to provide a quick way to access a single value for the counts or other details tagged in a pipeline, in a format that can be reported in the text of a document (e.g. for a results section). The consort statement vignette has more examples of use.
tagged(.data, .tag = NULL, .strata = NULL, .glue = NULL, ...)
.data | the tracked dataframe. |
.tag | (optional) the tag to retrieve. |
.strata | (optional) filter the tagged data by the strata. Set to "" to filter just the top level ungrouped data. |
.glue | (optional) a glue specification which will be applied to the tagged content to generate a |
... | (optional) any other named parameters will be passed to |
Various things depending on what is requested.
By default a tibble with a .tag column and all associated summary values in a nested .content column.
If .strata is specified the results are filtered to just those that match the given .strata grouping (i.e. this will be the grouping label on the flowchart). Ungrouped content will have an empty "" as .strata.
If .tag is specified the result will be for a single tag and .content will be automatically un-nested to give a single un-nested dataframe of the content captured at the .tag tagged step. This could be single or multiple rows depending on whether the original data was grouped at the point of tagging.
If both .tag and .glue are specified a .label column will be computed from .glue and the tagged content. If the result of this is a single row then just the string value of .label is returned.
If just .glue is specified, the result is an un-nested dataframe with .tag, .strata and .label columns, with a label for each tag in each strata.
If this seems complex then the best thing is to experiment until you get the output you want, leaving any .glue options until you think you know what you are doing. It made sense at the time.
library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% comment(.tag = "step1")
tmp = tmp %>% filter(Species!="versicolor") %>% group_by(Species)
tmp %>% comment(.tag="step2") %>% tagged(.glue = "{.count}/{.total}")
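The different return shapes described above can be compared side by side. This is a minimal sketch on ungrouped data; the tag names `start` and `filtered` and the label wording are illustrative assumptions.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>%
  track() %>%
  comment(.tag = "start") %>%
  filter(Species != "versicolor") %>%
  comment(.tag = "filtered")

# Default: a tibble with a .tag column and a nested .content column
all_tags = tracked %>% tagged()
all_tags

# A single tag: the content is un-nested into one dataframe
one_tag = tracked %>% tagged(.tag = "filtered")
one_tag

# Tag plus glue on ungrouped data: a single row, so just the label string
label = tracked %>% tagged(.tag = "filtered", .glue = "{.count} flowers remain")
label
```

The last form is the one intended for inline reporting, for example inside an R Markdown results paragraph.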
Start tracking the dtrackr history graph
track( .data, .messages = .defaultMessage(), .headline = .defaultHeadline(), .tag = NULL )
.data | a dataframe which may be grouped |
.messages | a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} variable which is the count of all rows, the {.count} variable which is the count of rows in the current group and the {.strata} which describes the current group. Defaults to the value of |
.headline | a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment, or the {.total} variable which is |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe with additional history graph metadata, to allow tracking.
library(dplyr)
library(dtrackr)

iris %>% track() %>% history()
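The default message and headline can be replaced at the point tracking starts. The following sketch uses grouped data so that the grouping variable, `{.count}` and `{.total}` are all available to the glue specification; the wording of the message and headline is an illustrative choice.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>%
  group_by(Species) %>%
  track(
    # Species is the grouping variable; .count is per group, .total overall
    .messages = "{.count} of {.total} flowers are {Species}",
    .headline = "Iris data set"
  )

tracked %>% history()
```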
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data so the default behaviour is to miss these out of the dtrackr history. This can be overridden with the .messages, or .headline values in which case they behave just like a comment().
## S3 method for class 'trackr_df'
transmute(.data, ..., .messages = "", .headline = "", .tag = NULL)
.data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | < The value can be: |
.messages | a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline | a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
dplyr::transmute()
library(dplyr)
library(dtrackr)

# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.

# In this example we compare the column names of the input and the
# output to identify the new columns created by the transmute operation as
# the `.new_cols` variable
# Here we do the same for a transmute()

iris %>%
  track() %>%
  group_by(Species, .add=TRUE) %>%
  transmute(
    sepal.w = Sepal.Width-1,
    sepal.l = Sepal.Length+1,
    .messages="{.new_cols}",
    .headline="New columns from transmute:") %>%
  history()
Un-grouping a data set logically combines the different arms. In the history this joins any stratified branches and acts as a specific type of status(), allowing you to generate some summary statistics about the un-grouped data. See dplyr::ungroup().
## S3 method for class 'trackr_df'
ungroup( x, ..., .messages = .defaultMessage(), .headline = .defaultHeadline(), .tag = NULL )
x | A |
... | variables to remove from the grouping. |
.messages | a set of glue specs. The glue code can use any global variable, or {.count}. The default is "total {.count} items" |
.headline | a headline glue spec. The glue code can use {.count} and {.strata}. |
.tag | if you want the summary data from this step in the future then give it a name with .tag. |
the .data dataframe but ungrouped with the history graph updated showing the ungroup operation as a new stage.
dplyr::ungroup()
library(dplyr)
library(dtrackr)

tmp = iris %>% group_by(Species) %>% comment("A test")
tmp %>% ungroup(.messages="{.count} items in combined") %>% history()
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
## S3 method for class 'trackr_df'
union_all( x, y, ..., .messages = "{.count.out} items in union", .headline = "Union" )
x, y | Pair of compatible data frames. A pair of data frames is compatible if they have the same column names (possibly in different orders) and compatible types. |
... | a collection of tracked data frames to combine |
.messages | a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline | a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
dplyr::union_all()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs,rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
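The default message and headline of a set operation can be overridden using the `{.count.out}` variable described above. This is a small sketch on two overlapping subsets of `iris`; the message and headline wording are illustrative choices.

```r
library(dplyr)
library(dtrackr)

flowers = iris %>% track()
lhs = flowers %>% filter(Species != "setosa")
rhs = flowers %>% filter(Species != "virginica")

# union_all keeps duplicates, so the versicolor rows appear twice
stacked = union_all(
  lhs, rhs,
  .messages = "{.count.out} rows after stacking",
  .headline = "Stacked subsets"
)

nrow(stacked)
stacked %>% history()
```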
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
## S3 method for class 'trackr_df'
union( x, y, ..., .messages = "{.count.out} unique items in union", .headline = "Distinct union" )
x, y | Vectors to combine. |
... | a collection of tracked data frames to combine |
.messages | a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline | a glue spec. The glue code can use any global variable, or {.count.out} |
the dplyr output with the history graph updated.
generics::union()
library(dplyr)
library(dtrackr)

# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")

lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)

# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")

# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

# Intersections and differences
# setdiff(lhs,rhs) keeps the rows of lhs that are not in rhs, i.e. the droids
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()

set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Remove tracking from the dataframe
untrack(.data)
.data | a tracked dataframe |
the .data dataframe with history graph metadata removed.
library(dplyr)
library(dtrackr)

iris %>% track() %>% untrack() %>% class()
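One way to confirm that tracking has been removed is to check for the `trackr_df` class named in the S3 method signatures above. This is a sketch that assumes `untrack()` drops that class along with the history metadata; the variable names are illustrative.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>% track()
plain = tracked %>% untrack()

# Tracked data frames carry the trackr_df class
inherits(tracked, "trackr_df")

# Assumption: the class is removed together with the history graph
inherits(plain, "trackr_df")

# The rows themselves are untouched
nrow(plain)
```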