stats_by_group.Rd
A function to compute assorted univariate statistics for a specified variable in a data frame over desired grouping factors.
stats_by_group(
  dtf,
  column,
  groupings = NULL,
  statistics = c("M", "SD"),
  method = "Student's T",
  categories = 1,
  width = 0.95,
  na.rm = TRUE
)
dtf
A data frame.
column
A character string, the column in dtf to compute statistics for.
groupings
A character vector, the columns in dtf to use as grouping factors (it is recommended that they all be categorical variables).
statistics
A character vector, the set of statistics to compute over groups.
method
A character string, the method to use when computing uncertainty intervals. Options include:
"Student's T" for means or
"Beta-binomial" for proportions.
categories
An optional vector of elements to match over when computing frequencies, proportions, or percentages.
width
A numeric value between 0 and 1, the width for uncertainty intervals.
na.rm
A logical value; if TRUE, NA values are removed.
A data frame with separate rows for each combination of grouping factors and separate columns for each statistic to compute.
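For orientation, the layout of the returned data frame can be approximated in base R. The sketch below is not the package's implementation; it only uses `aggregate()` to build one row per group and one column per statistic, mirroring the default `statistics = c("M", "SD")`. The object names `m`, `s`, and `res` are illustrative.

```r
# Base R approximation of the default output (mean and SD by group).
# This illustrates the result layout only; it is not how
# stats_by_group() itself is implemented.
dat <- datasets::iris

# One row per level of the grouping factor
m <- aggregate(Sepal.Length ~ Species, data = dat, FUN = mean)
s <- aggregate(Sepal.Length ~ Species, data = dat, FUN = sd)

# One column per statistic
res <- data.frame(
  Species = m$Species,
  M = m$Sepal.Length,
  SD = s$Sepal.Length
)
res
```

With more than one grouping factor, the same idea yields one row per observed combination of levels.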
Possible univariate statistics that can be computed:
'N' = Sample size;
'M' = Mean;
'Md' = Median;
'SD' = Standard deviation;
'SE' = Standard error of the mean;
'C' = Counts/frequencies;
'Pr' = Proportions;
'Pe' = Percentages;
'Mn' = Minimum;
'Mx' = Maximum;
'Q1' = First quartile;
'Q3' = Third quartile;
'Q___' = Quantile for a specified percentage
(substitute a three-digit value from '001' to '100' for '___', e.g., 'Q025' for the 25% quantile);
'CNA' = Counts/frequencies for NA values;
'PrNA' = Proportions for NA values;
'PeNA' = Percentages for NA values.
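The examples on this page suggest that the three digits in a 'Q___' code are read as a percentage (for instance, 'Q025' returns the same values as 'Q1'). A minimal sketch of that mapping follows, assuming the code is parsed this way and that quantiles follow the default method of base R's `quantile()`. The helper name `quantile_from_code` is hypothetical and not part of the package.

```r
# Hypothetical helper: convert a code such as "Q025" into a probability
# and compute the matching quantile with base R.
quantile_from_code <- function(x, code, na.rm = TRUE) {
  # Drop the leading "Q" and read the remaining digits as a percentage
  prob <- as.numeric(substring(code, 2)) / 100
  unname(quantile(x, probs = prob, na.rm = na.rm))
}

x <- datasets::iris$Sepal.Length[datasets::iris$Species == "setosa"]
quantile_from_code(x, "Q025")  # first quartile
quantile_from_code(x, "Q050")  # median
```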
Additionally, specifying 'UI' in combination with the
argument method computes the lower and upper limits
of an uncertainty interval, returned in the columns UI_LB and UI_UB.
The width of the interval is controlled by the argument width.
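For the "Student's T" method, the limits shown in the examples are consistent with the usual t-based interval for a mean, that is, M plus or minus the critical t value times SD / sqrt(N), with N - 1 degrees of freedom. The sketch below assumes that form; the helper name `t_interval` is hypothetical, and the package's internal code may differ.

```r
# Hypothetical helper: t-based uncertainty interval for a mean.
# Assumes the interval is M +/- t * SD / sqrt(N) with N - 1 degrees of freedom.
t_interval <- function(x, width = 0.95, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  # Two-sided critical value for the requested interval width
  crit <- qt(1 - (1 - width) / 2, df = n - 1)
  c(UI_LB = mean(x) - crit * se, UI_UB = mean(x) + crit * se)
}

ui <- t_interval(datasets::iris$Sepal.Length[datasets::iris$Species == "setosa"])
ui
```

A wider interval (for example `width = 0.99`) uses a larger critical value and therefore moves both limits further from the mean.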
# Example data set
data("iris")
# Mean/SD for sepal length by species
iris |> stats_by_group( Sepal.Length, nq(Species) )
#>      Species     M        SD
#> 1     setosa 5.006 0.3524897
#> 2 versicolor 5.936 0.5161711
#> 3  virginica 6.588 0.6358796
# Define categorical variable for long petals based on median split
iris$Long.Petal <- assign_by_interval(
  iris$Petal.Length, median(iris$Petal.Length), values = c( 'No', 'Yes' )
)
# Sample size, mean, and confidence intervals using Student's T
# distribution by species and whether petals are long
iris |> stats_by_group(
  Sepal.Length, nq( Species, Long.Petal ), nq( N, M, UI )
)
#>      Species Long.Petal  N     M    UI_LB    UI_UB
#> 1     setosa         No 50 5.006 4.905824 5.106176
#> 2 versicolor         No 25 5.616 5.462622 5.769378
#> 3 versicolor        Yes 25 6.256 6.074862 6.437138
#> 4  virginica        Yes 50 6.588 6.407285 6.768715
# Define categorical variable for long sepal based on median split
iris$Long.Sepal <- assign_by_interval(
  iris$Sepal.Length, median(iris$Sepal.Length), values = c( 'No', 'Yes' )
)
# Proportion and uncertainty intervals based on the beta-binomial
# distribution for long sepal by long petal
iris |> stats_by_group(
  Long.Sepal, nq( Long.Petal ), nq( Pr, UI ),
  categories = 'Yes', method = 'Beta-binomial'
)
#>   Long.Petal         Pr     UI_LB     UI_UB
#> 1         No 0.06666667 0.0258918 0.1399343
#> 2        Yes 0.86666667 0.7763800 0.9293352
# Standard cut-offs for boxplots (min, Q1, median, Q3, max)
iris |> stats_by_group(
  Sepal.Length, nq( Species ), nq( Mn, Q1, Md, Q3, Mx )
)
#>      Species  Md  Mn  Mx    Q1  Q3
#> 1     setosa 5.0 4.3 5.8 4.800 5.2
#> 2 versicolor 5.9 4.9 7.0 5.600 6.3
#> 3  virginica 6.5 4.9 7.9 6.225 6.9
# Custom quantiles
iris |> stats_by_group(
  Sepal.Length, nq( Species ), nq( Q005, Q025, Q050, Q075, Q095 )
)
#>      Species  Q005  Q025 Q050 Q075  Q095
#> 1     setosa 4.400 4.800  5.0  5.2 5.610
#> 2 versicolor 5.045 5.600  5.9  6.3 6.755
#> 3  virginica 5.745 6.225  6.5  6.9 7.700
# Missing values
iris$Sepal.Length[ c( 10, 30, 55:60, 120:123 ) ] <- NA
iris |> stats_by_group(
  Sepal.Length, nq( Species ), nq( CNA, PrNA, PeNA )
)
#>      Species CNA PrNA PeNA
#> 1     setosa   2 0.04    4
#> 2 versicolor   6 0.12   12
#> 3  virginica   4 0.08    8
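
The missing-value statistics in the last example can be cross-checked with base R alone. The sketch below assumes that 'CNA' is the count of NA values per group, 'PrNA' is that count divided by the group size, and 'PeNA' is the proportion times 100. The object names `dat` and `na_stats` are illustrative, and the row labels differ from the package output.

```r
# Cross-check of CNA, PrNA, and PeNA using base R only
dat <- datasets::iris
dat$Sepal.Length[c(10, 30, 55:60, 120:123)] <- NA

# Split the column by group, then summarise the NA values in each piece
na_stats <- do.call(
  rbind,
  lapply(split(dat$Sepal.Length, dat$Species), function(x) {
    data.frame(
      CNA = sum(is.na(x)),           # count of NA values
      PrNA = mean(is.na(x)),         # proportion of NA values
      PeNA = 100 * mean(is.na(x))    # percentage of NA values
    )
  })
)
na_stats
```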