Computes assorted univariate statistics for a specified variable in a data frame, optionally within each combination of grouping factors.

stats_by_group(
  dtf,
  column,
  groupings = NULL,
  statistics = c("M", "SD"),
  method = "Student's T",
  categories = 1,
  width = 0.95,
  na.rm = TRUE
)

Arguments

dtf

A data frame.

column

A character string, the column in dtf to compute statistics for.

groupings

A character vector, the columns in dtf to use as grouping factors (ideally all categorical variables).

statistics

A character vector, the labels of the statistics to compute within each group (see Details for the available labels).

method

A character string, the method to use when computing uncertainty intervals. Options include "Student's T" for means and "Beta-binomial" for proportions.

categories

An optional vector of values to match when computing frequencies, proportions, or percentages.

width

A numeric value between 0 and 1, the width for uncertainty intervals.

na.rm

A logical value; if TRUE, NA values are removed before statistics are computed.

Value

A data frame with one row for each combination of grouping factors and separate columns for each requested statistic.
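
To make the shape of the returned data frame concrete, the following base R sketch builds a comparable table (one row per group, one column per statistic) for the mean and standard deviation of sepal length by species. It is only an approximation written with aggregate(); it is not the implementation of stats_by_group(), and the object names iris_df and out are illustrative.

```r
# Base R sketch only; not the implementation of stats_by_group()
iris_df <- datasets::iris

out <- aggregate(
  Sepal.Length ~ Species,
  data = iris_df,
  FUN = function(x) c(M = mean(x), SD = sd(x))
)
# aggregate() stores the two statistics in a single matrix column;
# flatten it so that each statistic gets its own column
out <- do.call(data.frame, out)
names(out) <- c("Species", "M", "SD")
out
```

The result has three rows (one per species) and the columns Species, M, and SD, which mirrors the layout of the first example in the Examples section.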

Details

Possible univariate statistics that can be computed:

  • 'N' = Sample size;

  • 'M' = Mean;

  • 'Md' = Median;

  • 'SD' = Standard deviation;

  • 'SE' = Standard error of the mean;

  • 'C' = Counts/frequencies;

  • 'Pr' = Proportions;

  • 'Pe' = Percentages;

  • 'Mn' = Minimum;

  • 'Mx' = Maximum;

  • 'Q1' = First quartile;

  • 'Q3' = Third quartile;

  • 'Q___' = Quantile for a specified percentage (replace '___' with a three-digit, zero-padded percentage from '001' to '100', e.g., 'Q025' for the 25th percentile);

  • 'CNA' = Counts/frequencies for NA values;

  • 'PrNA' = Proportions for NA values;

  • 'PeNA' = Percentages for NA values.
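
As an illustration of the 'Q___' labels, the sketch below converts labels such as 'Q025' into probabilities and passes them to quantile(). The helper label_to_prob() is hypothetical and not part of the package, and the use of quantile() with its default settings is an assumption about how such labels could be evaluated, not a statement about the function's internals.

```r
# Hypothetical helper (not part of the package): turn a label such
# as "Q025" into the probability 0.25
label_to_prob <- function(label) {
  stopifnot(grepl("^Q[0-9]{3}$", label))
  as.numeric(substr(label, 2, 4)) / 100
}

labels <- c("Q005", "Q025", "Q050", "Q075", "Q095")
probs <- unname(vapply(labels, label_to_prob, numeric(1)))

# Sepal length for one species, taken from the built-in data set
setosa <- datasets::iris$Sepal.Length[datasets::iris$Species == "setosa"]

# Assumption: base R quantile() with its default type
q <- quantile(setosa, probs = probs, names = FALSE)
setNames(q, labels)
```

For the setosa group this gives 4.4, 4.8, 5.0, 5.2, and 5.61, which agrees with the custom quantile output shown in the Examples section.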

Additionally, specifying 'UI' together with the argument method computes the lower and upper limits of an uncertainty interval (returned as the columns UI_LB and UI_UB in the examples below). The width of the interval is controlled by the argument width.
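
For the "Student's T" method, an interval of this kind can be sketched in base R as the mean plus or minus a t quantile times the standard error. The code below assumes that standard construction; it is not taken from the package source, and the variable names are illustrative.

```r
# Sketch of a t-based interval for a mean, assuming the standard
# construction: mean +/- t quantile * standard error
x <- datasets::iris$Sepal.Length[datasets::iris$Species == "setosa"]
width <- 0.95

n <- length(x)
se <- sd(x) / sqrt(n)                        # standard error of the mean
crit <- qt(1 - (1 - width) / 2, df = n - 1)  # two-sided critical value

ui <- mean(x) + c(UI_LB = -1, UI_UB = 1) * crit * se
ui
```

With width = 0.95 the limits come out at approximately 4.905824 and 5.106176, which agrees with the setosa row of the second example in the Examples section.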

Examples

# Example data set
data("iris")

# Mean/SD for sepal length by species
iris |> stats_by_group( Sepal.Length, nq(Species) )
#>      Species     M        SD
#> 1     setosa 5.006 0.3524897
#> 2 versicolor 5.936 0.5161711
#> 3  virginica 6.588 0.6358796

# Define categorical variable for long petals based on median split
iris$Long.Petal <- assign_by_interval(
  iris$Petal.Length, median(iris$Petal.Length), values = c( 'No', 'Yes' )
)
# Sample size, mean, and confidence intervals using Student's T
# distribution by species and whether petals are long
iris |> stats_by_group(
  Sepal.Length, nq( Species, Long.Petal ), nq( N, M, UI )
)
#>      Species Long.Petal  N     M    UI_LB    UI_UB
#> 1     setosa         No 50 5.006 4.905824 5.106176
#> 2 versicolor         No 25 5.616 5.462622 5.769378
#> 3 versicolor        Yes 25 6.256 6.074862 6.437138
#> 4  virginica        Yes 50 6.588 6.407285 6.768715

# Define categorical variable for long sepal based on median split
iris$Long.Sepal <- assign_by_interval(
  iris$Sepal.Length, median(iris$Sepal.Length), values = c( 'No', 'Yes' )
)
# Proportion and confidence intervals based on beta-binomial
# distribution for long sepal by long petal
iris |> stats_by_group(
  Long.Sepal, nq( Long.Petal ), nq( Pr, UI ),
  categories = 'Yes', method = 'Beta-binomial'
)
#>   Long.Petal         Pr     UI_LB     UI_UB
#> 1         No 0.06666667 0.0258918 0.1399343
#> 2        Yes 0.86666667 0.7763800 0.9293352

# Standard cut-offs for boxplots (min, Q1, median, Q3, max)
iris |> stats_by_group(
  Sepal.Length, nq( Species ), nq( Mn, Q1, Md, Q3, Mx )
)
#>      Species  Md  Mn  Mx    Q1  Q3
#> 1     setosa 5.0 4.3 5.8 4.800 5.2
#> 2 versicolor 5.9 4.9 7.0 5.600 6.3
#> 3  virginica 6.5 4.9 7.9 6.225 6.9

# Custom quantiles
iris |> stats_by_group(
  Sepal.Length, nq( Species ), nq( Q005, Q025, Q050, Q075, Q095 )
)
#>      Species  Q005  Q025 Q050 Q075  Q095
#> 1     setosa 4.400 4.800  5.0  5.2 5.610
#> 2 versicolor 5.045 5.600  5.9  6.3 6.755
#> 3  virginica 5.745 6.225  6.5  6.9 7.700

# Missing values
iris$Sepal.Length[ c( 10, 30, 55:60, 120:123 ) ] <- NA
iris |> stats_by_group(
  Sepal.Length, nq( Species ), nq( CNA, PrNA, PeNA )
)
#>      Species CNA PrNA PeNA
#> 1     setosa   2 0.04    4
#> 2 versicolor   6 0.12   12
#> 3  virginica   4 0.08    8