Supporting Information: Statistical analyses and plots

1 INFO

1.1 BACKGROUND

This supporting file containes the codes for the statistical analyses and plots in our paper, Lactococcus cell envelope proteases enable lactococcal growth in minimal growth media supplemented with high molecular weight proteins of plant and animal origin.

1.2 Figures

The codes for the figures and supporting figures referred to in the paper can be found in this supporting file by searching for Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure S2, Figure S4, Figure S5A, and Figure S5B.

2 SETUP

2.1 PACKAGES

R packages applied for:

data loading: readxl

main analyses: tidyverse, rstatix, vegan, and kableExtra

visualization: ggpubr, ggsci, scales, pheatmap, and ggrepel

2.2 DATA

The excel file Data_collection is included as supporting file of the paper and includes all the data for the analyses. Different data sets are divided into separate excel sheets.

2.3 VARIABLES

Some variables in the excel file Data_collection were named differently in the paper, as explained further below.

Strain is also known as Protease in the data file. Otherwise, the names of the strains are as follows:

  • Non-inoculated = “No”

  • MS22418 (PrtP-) = “pAK80”

  • MS22421 (PrtP/SK11) = “SK11”

  • MS22425 (PrtP/Wg2) = “Wg2”

  • MS22427 (PrtP/MS22333) = “MS22333” or “33”

Medium is also referred to as Protein in the data file. Otherwise, the names of the media are as follows:

  • CCDM = “Casein”

  • PCDM = “Potato”

3 DATA SET: pH and cell counts

3.1 DATA

3.1.1 LOAD

dat <- read_excel("./data/Data_collection.xlsx", sheet="pH_and_CFU", na = "NA")

3.1.2 LOOK

# Take a glimpse
glimpse(dat)
## Rows: 154
## Columns: 7
## $ Sample  <chr> "pAK80-D0", "Wg2-D0", "SK11-D0", "33-D0", "pAK80-P1-D1", "pAK8~
## $ Strain  <chr> "pAK80", "Wg2", "SK11", "33", "pAK80", "pAK80", "pAK80", "Wg2"~
## $ Protein <chr> "Inoculation", "Inoculation", "Inoculation", "Inoculation", "P~
## $ Day     <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,~
## $ pH      <dbl> 7.20, 7.20, 7.20, 7.20, 6.96, 6.96, 6.96, 6.84, 6.85, 6.84, 6.~
## $ CFU     <dbl> 6.20e+06, 6.54e+06, 6.00e+06, 6.43e+06, 1.70e+08, 2.42e+08, 1.~
## $ logN    <dbl> 6.792392, 6.815578, 6.778151, 6.808211, 8.230449, 8.383815, 8.~
# Make a explorative summary 
skimr::skim(dat)
Data summary
Name dat
Number of rows 154
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1 5 11 0 154 0
Strain 0 1 2 5 0 5 0
Protein 0 1 5 11 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Day 0 1.00 3.18 2.30 0.00 1.00e+00 2.00e+00 6.00e+00 6.00e+00 <U+2587><U+2585><U+2581><U+2581><U+2587>
pH 9 0.94 6.36 0.86 4.93 5.19e+00 6.91e+00 6.99e+00 7.20e+00 <U+2583><U+2581><U+2581><U+2581><U+2587>
CFU 71 0.54 193698566.78 429785198.34 0.00 1.30e+06 7.08e+06 4.85e+07 1.78e+09 <U+2587><U+2581><U+2581><U+2581><U+2581>
logN 71 0.54 6.25 2.63 0.00 6.04e+00 6.85e+00 7.68e+00 9.25e+00 <U+2582><U+2581><U+2582><U+2587><U+2585>

3.1.3 SAVE

# Save cleaned data
save(dat, file = "R_objects/stat_test_data_growth.RData")

# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())

3.2 TEST INDEPENDENCE

3.2.1 CATEGORICAL VARIABLES

Chi squared test of independence.

# load data 
load("R_objects/stat_test_data_growth.RData")

# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
dat <- dat %>% mutate(across(where(is_character), as_factor))


# Create vector of categorical variables with more than two categories and less than half of the number of samples.
cat_var <- dat %>% 
  select_if(is.factor) %>% 
  select(where(~n_distinct(.) > 1)) %>% 
  select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
  colnames()

length(cat_var) # If less than 2 variables the following code should not be run
## [1] 2
# Test the variables pairwise 
cat_var %>% 
  combn(2) %>% 
  t() %>% 
  as_tibble() %>% 
  rowwise %>% 
  mutate(chisq_test = list(
    table(dat[[V1]], dat[[V2]]) %>% chisq.test()
    ),
    chisq_pval = chisq_test$p.value
    )
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## i Using compatibility `.name_repair`.
## Warning: There was 1 warning in `mutate()`.
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
##   chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## # A tibble: 1 x 4
## # Rowwise: 
##   V1     V2      chisq_test chisq_pval
##   <chr>  <chr>   <list>          <dbl>
## 1 Strain Protein <htest>          1.00

3.2.2 QUANTITATIVE VARIABLES

Spearman’s test without first testing the normality of the data

# create vector with the relevant variables 
CON.VARS <- dat %>% select(where(is.numeric)) %>% colnames()

# Plot all variables against each other
pairs(dat[,CON.VARS], pch = 19,  cex = 0.5,
      lower.panel=NULL)

# Run Spearman test
corrmat <- cor(dat[,CON.VARS], method = "spearman", use = "pairwise.complete.obs")

# Create heatmap
corrmat_rounded <- round(corrmat, 2)

melted_corrmat_rounded <- tibble(Var1 = rep(row.names(corrmat_rounded), length(row.names(corrmat_rounded))),
                                 Var2 = rep(row.names(corrmat_rounded), each = length(row.names(corrmat_rounded))),
                                 dist = as.numeric(matrix(corrmat_rounded)))

ggplot(melted_corrmat_rounded, aes(x = Var1, y = Var2, fill = dist)) + 
  labs(title = "Correlation between continous variables") + 
  geom_tile(color = "white") + 
  scale_fill_gradient2(low = "blue", mid = "white", 
                       high = "red", midpoint = 0, limit = c(-1,1), 
                       space = "Lab", name = "Correlation coefficient") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, size = 12, 
                                   hjust = 1, vjust = 1), 
        axis.text.y = element_text(size = 12, hjust = 1, vjust = 0.5),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.ticks = element_blank()) + 
  geom_text(aes(x = Var1, y = Var2, label = dist), 
            color = "black", size = 4) + 
  coord_fixed()

# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())

3.3 ALL - Strain, Protein and Day - pH

# load data 
load("R_objects/stat_test_data_growth.RData")

# Create subset
dat.clean <- dat %>% 
  filter(!is.na(pH) & Protein %in% c("Casein", "Potato", "Hydrolysate", "Water")) %>%
  mutate(Day = as.factor(Day))

3.3.1 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Day","Strain", "Protein")
OUTCOME <- "pH"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 52 x 7
##    Strain Protein     Day   variable     n  mean    sd
##    <chr>  <chr>       <fct> <fct>    <dbl> <dbl> <dbl>
##  1 33     Casein      1     pH           3  4.99 0.015
##  2 33     Hydrolysate 1     pH           2  6.97 0    
##  3 33     Potato      1     pH           3  6.74 0.032
##  4 33     Water       1     pH           2  7.02 0.007
##  5 No     Casein      1     pH           3  6.97 0.01 
##  6 No     Hydrolysate 1     pH           2  6.98 0.028
##  7 No     Potato      1     pH           3  7.00 0.023
##  8 No     Water       1     pH           2  7.04 0    
##  9 SK11   Casein      1     pH           3  6.96 0    
## 10 SK11   Hydrolysate 1     pH           2  7    0    
## # i 42 more rows
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

3.3.2 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 2 x 9
##   Strain Protein Day   Sample         pH       CFU  logN is.outlier is.extreme
##   <chr>  <chr>   <fct> <chr>       <dbl>     <dbl> <dbl> <lgl>      <lgl>     
## 1 Wg2    Potato  6     Wg2-P3-D6    5.02   6040000  6.78 TRUE       FALSE     
## 2 pAK80  Casein  6     pAK80-C4-D6  6.66 580000000  8.76 TRUE       FALSE

Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 residuals(model)     0.281 2.47e-23
# Compute Shapiro-Wilk test of normality for each data group
sets <- dat.clean %>%
  group_by(Day, Protein, Strain) %>% 
  summarise(n_unique = n_distinct(pH),
            n = n()) %>%
  filter(n > 2, n_unique > 1) %>% 
  transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
dat.clean %>% 
  mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
  filter(day_protein_strain %in% sets$day_protein_strain) %>%
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "pH") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 7 x 7
##   Strain Protein Day   variable statistic       p   p.adj
##   <chr>  <chr>   <fct> <chr>        <dbl>   <dbl>   <dbl>
## 1 No     Potato  1     pH           0.75  0       0      
## 2 Wg2    Potato  1     pH           0.75  0       0      
## 3 SK11   Casein  2     pH           0.75  0       0      
## 4 Wg2    Casein  2     pH           0.75  0       0      
## 5 Wg2    Potato  2     pH           0.75  0       0      
## 6 pAK80  Potato  2     pH           0.75  0       0      
## 7 Wg2    Potato  6     pH           0.630 0.00124 0.00443

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1    51    89      46.8 6.76e-47
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

3.3.3 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain, Day) %>%
  kruskal_test(pH ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 10 x 9
##    Strain Day   .y.       n statistic    df      p method          p.adj
##    <chr>  <fct> <chr> <int>     <dbl> <int>  <dbl> <chr>           <dbl>
##  1 33     1     pH       10      8.51     3 0.0366 Kruskal-Wallis 0.0683
##  2 33     2     pH        8      6.25     2 0.0439 Kruskal-Wallis 0.0683
##  3 33     6     pH       12      8.81     3 0.032  Kruskal-Wallis 0.0683
##  4 No     6     pH       12      7.85     3 0.0493 Kruskal-Wallis 0.0690
##  5 SK11   1     pH       10      8.18     3 0.0424 Kruskal-Wallis 0.0683
##  6 SK11   6     pH       12     10.2      3 0.0167 Kruskal-Wallis 0.0683
##  7 Wg2    1     pH       10      8.56     3 0.0358 Kruskal-Wallis 0.0683
##  8 Wg2    2     pH        8      6.40     2 0.0407 Kruskal-Wallis 0.0683
##  9 Wg2    6     pH       12      9.71     3 0.0212 Kruskal-Wallis 0.0683
## 10 pAK80  6     pH       12     10.2      3 0.0173 Kruskal-Wallis 0.0683
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  kruskal_test(pH ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 6 x 9
##   Protein Day   .y.       n statistic    df       p method           p.adj
##   <chr>   <fct> <chr> <int>     <dbl> <int>   <dbl> <chr>            <dbl>
## 1 Casein  1     pH       15     11.9      4 0.0182  Kruskal-Wallis 0.0400 
## 2 Potato  1     pH       15     13.4      4 0.00943 Kruskal-Wallis 0.0346 
## 3 Casein  2     pH       12     10.5      3 0.0151  Kruskal-Wallis 0.0400 
## 4 Potato  2     pH       11      9.54     3 0.0229  Kruskal-Wallis 0.0420 
## 5 Casein  6     pH       20     17.6      4 0.00151 Kruskal-Wallis 0.00830
## 6 Potato  6     pH       20     18.4      4 0.00105 Kruskal-Wallis 0.00830
pwc.strain <- dat.clean %>%
  filter(Protein %in% c("Casein", "Potato"), Day==6) %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(pH ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc.strain %>% filter(p.adj < 0.05)
## # A tibble: 19 x 10
##    Protein .y.   group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##    <chr>   <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
##  1 Casein  pH    33     No         4     4       0   0.028 0.0322 ns          
##  2 Casein  pH    33     pAK80      4     4       0   0.029 0.0322 ns          
##  3 Casein  pH    33     SK11       4     4       0   0.028 0.0322 ns          
##  4 Casein  pH    33     Wg2        4     4       0.5 0.041 0.0432 ns          
##  5 Casein  pH    No     pAK80      4     4      16   0.028 0.0322 ns          
##  6 Casein  pH    No     SK11       4     4      16   0.028 0.0322 ns          
##  7 Casein  pH    No     Wg2        4     4      16   0.028 0.0322 ns          
##  8 Casein  pH    pAK80  Wg2        4     4      16   0.029 0.0322 ns          
##  9 Casein  pH    SK11   Wg2        4     4      16   0.029 0.0322 ns          
## 10 Potato  pH    33     No         4     4       0   0.029 0.0322 ns          
## 11 Potato  pH    33     pAK80      4     4       0   0.029 0.0322 ns          
## 12 Potato  pH    33     SK11       4     4       0   0.029 0.0322 ns          
## 13 Potato  pH    33     Wg2        4     4       0   0.026 0.0322 ns          
## 14 Potato  pH    No     pAK80      4     4      16   0.029 0.0322 ns          
## 15 Potato  pH    No     SK11       4     4      16   0.028 0.0322 ns          
## 16 Potato  pH    No     Wg2        4     4      16   0.026 0.0322 ns          
## 17 Potato  pH    pAK80  SK11       4     4      16   0.029 0.0322 ns          
## 18 Potato  pH    pAK80  Wg2        4     4      16   0.026 0.0322 ns          
## 19 Potato  pH    SK11   Wg2        4     4      16   0.026 0.0322 ns

3.3.4 VISUALIZE

#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, 
                           levels = c("No", "pAK80", "SK11", "Wg2", "33"), 
                           labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)"))
                           
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato", "Hydrolysate", "Water"))

#Boxplot
bxp.pH <- dat.clean %>%
  filter(., Protein  %in% c("Casein", "Potato", "Hydrolysate", "Water")) %>%
  filter (., Day %in% c(1, 2, 6)) %>%
  ggboxplot(.,
            x = "Day",
            y = "pH",
            color = "Strain", 
            facet.by = "Protein")+
  labs(x="Time [days]", y="pH")+
  scale_y_continuous(limits = c(4.8, 7.2))+
  facet_wrap(~Protein, 
             nrow=2, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM", 
                                      "Hydrolysate"="HCDM", 
                                      "Water"="CDM")))+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                     )+
  geom_hline(yintercept=7.0, linetype="dashed", color = "#7d7d7dff")+
  theme_light()+
  theme(legend.position="bottom")
 
save(bxp.pH, file = "R_objects/Fig1A.RData")

3.4 PH CHANGE

The pH_change analyses are not presented or discussed in the paper because they did not appear to be relevant to the main conclusion of the paper.

3.4.1 DATA

# load data 
load("R_objects/stat_test_data_growth.RData")

# Add replicate identifier
dat$replicate <- str_split(dat$Sample, "-", simplify = T)[,2]

#Calculate pH change
dat.wide <- dat %>%
  pivot_wider(values_from = pH, names_from = Day, names_prefix = "Day",id_cols = c(replicate, Strain, Protein), values_fill = NA)

dat.wide <- dat.wide %>%
  mutate(pH_1 = Day1 - 7.2,
         pH_2 = Day2 - Day1,
         pH_6 = Day6 - Day2,
         norm_1 = 100*Day1/7.2,
         norm_2 = 100*Day2/7.2,
         norm_6 = 100*Day6/7.2
         )
ggplot(dat.wide, aes(x = Protein, y = pH_1, color = Strain)) +
  geom_boxplot(outlier.colour = NA) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.7))
## Warning: Removed 14 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 14 rows containing missing values (`geom_point()`).

ggplot(dat.wide, aes(x = Protein, y = pH_2, color = Strain)) +
  geom_boxplot(outlier.colour = NA) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.7))
## Warning: Removed 33 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 33 rows containing missing values (`geom_point()`).

ggplot(dat.wide, aes(x = Protein, y = pH_6, color = Strain)) +
  geom_boxplot(outlier.colour = NA) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.7))
## Warning: Removed 33 rows containing non-finite values (`stat_boxplot()`).
## Removed 33 rows containing missing values (`geom_point()`).

# Create subset
dat.clean.wide <- dat.wide %>% pivot_longer(names_to = "Day",values_to = "pH_change",cols = c(pH_1, pH_2, pH_6))

ggplot(dat.clean.wide, aes(x = Day, y = pH_change, color = Protein, group = replicate)) +
  geom_point() + geom_line() +
  facet_wrap("Strain")
## Warning: Removed 80 rows containing missing values (`geom_point()`).
## Warning: Removed 46 rows containing missing values (`geom_line()`).

Save data

# Save cleaned data
save(dat.clean.wide, file = "R_objects/stat_test_data_pH_change.RData")

# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())

3.4.2 ALL - Strain & Protein - pH_change

# load data 
load("R_objects/stat_test_data_pH_change.RData")

# Create subset
dat.clean <- 
  dat.clean.wide %>% 
  filter(!is.na(pH_change))

3.4.2.1 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain", "Day")
OUTCOME <- "pH_change"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 44 x 7
##    Strain Protein Day   variable      n   mean    sd
##    <chr>  <chr>   <chr> <fct>     <dbl>  <dbl> <dbl>
##  1 33     Casein  pH_1  pH_change     3 -2.21  0.015
##  2 33     Casein  pH_2  pH_change     3  0.207 0.021
##  3 33     Casein  pH_6  pH_change     3 -0.2   0    
##  4 No     Casein  pH_1  pH_change     3 -0.23  0.01 
##  5 SK11   Casein  pH_1  pH_change     3 -0.24  0    
##  6 SK11   Casein  pH_2  pH_change     3  0.117 0.006
##  7 SK11   Casein  pH_6  pH_change     3 -0.283 0.015
##  8 Wg2    Casein  pH_1  pH_change     3 -2.17  0.026
##  9 Wg2    Casein  pH_2  pH_change     3  0.083 0.023
## 10 Wg2    Casein  pH_6  pH_change     3 -0.073 0.012
## # i 34 more rows
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

3.4.2.2 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
##  [1] Strain     Protein    Day        replicate  Day0       Day1      
##  [7] Day2       Day6       norm_1     norm_2     norm_6     pH_change 
## [13] is.outlier is.extreme
## <0 rækker> (eller 0-længde row.names)

Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 residuals(model)     0.363 6.80e-20
# Compute Shapiro-Wilk test of normality for each data group
sets <- dat.clean %>%
  group_by(Day, Protein, Strain) %>% 
  summarise(n_unique = n_distinct(pH_change),
            n = n()) %>%
  filter(n > 2, n_unique > 1) %>% 
  transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
dat.clean %>% 
  mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
  filter(day_protein_strain %in% sets$day_protein_strain) %>%
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "pH_change") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 8 x 7
##   Strain Protein Day   variable  statistic     p p.adj
##   <chr>  <chr>   <chr> <chr>         <dbl> <dbl> <dbl>
## 1 No     Potato  pH_1  pH_change      0.75     0     0
## 2 Wg2    Potato  pH_1  pH_change      0.75     0     0
## 3 SK11   Casein  pH_2  pH_change      0.75     0     0
## 4 Wg2    Casein  pH_2  pH_change      0.75     0     0
## 5 Wg2    Potato  pH_2  pH_change      0.75     0     0
## 6 pAK80  Potato  pH_2  pH_change      0.75     0     0
## 7 Wg2    Casein  pH_6  pH_change      0.75     0     0
## 8 pAK80  Casein  pH_6  pH_change      0.75     0     0

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1    43    68      75.4 1.48e-43
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

3.4.2.3 PERFORM TEST

The data are not normally distributed. At least one data group is not normally distributed, according to the Shapiro-Wilk normality test. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon pairwise test.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain, Day) %>%
  kruskal_test(pH_change ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 8 x 9
##   Strain Day   .y.           n statistic    df      p method          p.adj
##   <chr>  <chr> <chr>     <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 33     pH_1  pH_change    10      8.51     3 0.0366 Kruskal-Wallis 0.0713
## 2 33     pH_2  pH_change     8      6.25     2 0.0439 Kruskal-Wallis 0.0713
## 3 33     pH_6  pH_change     8      6.56     2 0.0376 Kruskal-Wallis 0.0713
## 4 SK11   pH_1  pH_change    10      8.18     3 0.0424 Kruskal-Wallis 0.0713
## 5 Wg2    pH_1  pH_change    10      8.56     3 0.0358 Kruskal-Wallis 0.0713
## 6 Wg2    pH_2  pH_change     8      6.40     2 0.0407 Kruskal-Wallis 0.0713
## 7 Wg2    pH_6  pH_change     8      6.33     2 0.0423 Kruskal-Wallis 0.0713
## 8 pAK80  pH_6  pH_change     8      6.33     2 0.0423 Kruskal-Wallis 0.0713
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  kruskal_test(pH_change ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 6 x 9
##   Protein Day   .y.           n statistic    df       p method          p.adj
##   <chr>   <chr> <chr>     <int>     <dbl> <int>   <dbl> <chr>           <dbl>
## 1 Casein  pH_1  pH_change    15     11.9      4 0.0182  Kruskal-Wallis 0.0468
## 2 Potato  pH_1  pH_change    15     13.4      4 0.00943 Kruskal-Wallis 0.0468
## 3 Casein  pH_2  pH_change    12     10.3      3 0.0164  Kruskal-Wallis 0.0468
## 4 Potato  pH_2  pH_change    11      9.50     3 0.0234  Kruskal-Wallis 0.0468
## 5 Casein  pH_6  pH_change    12      9.84     3 0.02    Kruskal-Wallis 0.0468
## 6 Potato  pH_6  pH_change    11      8.77     3 0.0326  Kruskal-Wallis 0.0543
pwc.strain <- dat.clean %>%
  filter(Protein %in% c("Casein", "Potato") & Day %in% c("pH_1")) %>%
  group_by(Day, Protein) %>%
  pairwise_wilcox_test(pH_change ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc.strain %>% filter (p < 0.5)
## # A tibble: 19 x 11
##    Protein Day   .y.       group1 group2    n1    n2 statistic     p p.adj
##    <chr>   <chr> <chr>     <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
##  1 Casein  pH_1  pH_change 33     No         3     3       0   0.1   0.133
##  2 Casein  pH_1  pH_change 33     pAK80      3     3       0   0.1   0.133
##  3 Casein  pH_1  pH_change 33     SK11       3     3       0   0.064 0.133
##  4 Casein  pH_1  pH_change 33     Wg2        3     3       1   0.2   0.211
##  5 Casein  pH_1  pH_change No     SK11       3     3       7.5 0.197 0.211
##  6 Casein  pH_1  pH_change No     Wg2        3     3       9   0.1   0.133
##  7 Casein  pH_1  pH_change pAK80  SK11       3     3       7.5 0.197 0.211
##  8 Casein  pH_1  pH_change pAK80  Wg2        3     3       9   0.1   0.133
##  9 Casein  pH_1  pH_change SK11   Wg2        3     3       9   0.064 0.133
## 10 Potato  pH_1  pH_change 33     No         3     3       0   0.076 0.133
## 11 Potato  pH_1  pH_change 33     pAK80      3     3       0   0.064 0.133
## 12 Potato  pH_1  pH_change 33     SK11       3     3       0   0.1   0.133
## 13 Potato  pH_1  pH_change 33     Wg2        3     3       0   0.076 0.133
## 14 Potato  pH_1  pH_change No     pAK80      3     3       9   0.059 0.133
## 15 Potato  pH_1  pH_change No     SK11       3     3       9   0.076 0.133
## 16 Potato  pH_1  pH_change No     Wg2        3     3       9   0.072 0.133
## 17 Potato  pH_1  pH_change pAK80  SK11       3     3       7.5 0.197 0.211
## 18 Potato  pH_1  pH_change pAK80  Wg2        3     3       9   0.059 0.133
## 19 Potato  pH_1  pH_change SK11   Wg2        3     3       9   0.076 0.133
## # i 1 more variable: p.adj.signif <chr>

3.4.3 pH_1 - Strain & Protein - pH_change

This subset analysis focus on the first pH change, but the analysis did not contribute to the main conclusion of the paper. Therefore, this analysis is not presented or discussed in the paper.

# load data 
load("R_objects/stat_test_data_pH_change.RData")

# Create subset
dat.clean <- 
  dat.clean.wide %>% 
  filter(!is.na(pH_change) & Day == "pH_1")

3.4.3.1 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain")
OUTCOME <- "pH_change"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 20 x 6
##    Strain Protein     variable      n   mean    sd
##    <chr>  <chr>       <fct>     <dbl>  <dbl> <dbl>
##  1 33     Casein      pH_change     3 -2.21  0.015
##  2 No     Casein      pH_change     3 -0.23  0.01 
##  3 SK11   Casein      pH_change     3 -0.24  0    
##  4 Wg2    Casein      pH_change     3 -2.17  0.026
##  5 pAK80  Casein      pH_change     3 -0.207 0.049
##  6 33     Hydrolysate pH_change     2 -0.23  0    
##  7 No     Hydrolysate pH_change     2 -0.22  0.028
##  8 SK11   Hydrolysate pH_change     2 -0.2   0    
##  9 Wg2    Hydrolysate pH_change     2 -0.19  0.014
## 10 pAK80  Hydrolysate pH_change     2 -0.16  0.014
## 11 33     Potato      pH_change     3 -0.457 0.032
## 12 No     Potato      pH_change     3 -0.197 0.023
## 13 SK11   Potato      pH_change     3 -0.257 0.021
## 14 Wg2    Potato      pH_change     3 -0.357 0.006
## 15 pAK80  Potato      pH_change     3 -0.24  0    
## 16 33     Water       pH_change     2 -0.175 0.007
## 17 No     Water       pH_change     2 -0.16  0    
## 18 SK11   Water       pH_change     2 -0.155 0.021
## 19 Wg2    Water       pH_change     2 -0.12  0    
## 20 pAK80  Water       pH_change     2 -0.11  0.014
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

3.4.3.2 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
##  [1] Strain     Protein    replicate  Day0       Day1       Day2      
##  [7] Day6       norm_1     norm_2     norm_6     Day        pH_change 
## [13] is.outlier is.extreme
## <0 rækker> (eller 0-længde row.names)

Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic p.value
##   <chr>                <dbl>   <dbl>
## 1 residuals(model)     0.942  0.0156
# Compute Shapiro-Wilk test of normality for each data group
sets <- dat.clean %>%
  group_by(Day, Protein, Strain) %>% 
  summarise(n_unique = n_distinct(pH_change),
            n = n()) %>%
  filter(n > 2, n_unique > 1) %>% 
  transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
dat.clean %>% 
  mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
  filter(day_protein_strain %in% sets$day_protein_strain) %>%
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "pH_change") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 2 x 7
##   Strain Protein Day   variable  statistic     p p.adj
##   <chr>  <chr>   <chr> <chr>         <dbl> <dbl> <dbl>
## 1 No     Potato  pH_1  pH_change      0.75     0     0
## 2 Wg2    Potato  pH_1  pH_change      0.75     0     0

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1    19    30     0.708 0.782
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

3.4.3.3 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain, Day) %>%
  kruskal_test(pH_change ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 3 x 9
##   Strain Day   .y.           n statistic    df      p method          p.adj
##   <chr>  <chr> <chr>     <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 33     pH_1  pH_change    10      8.51     3 0.0366 Kruskal-Wallis 0.0707
## 2 SK11   pH_1  pH_change    10      8.18     3 0.0424 Kruskal-Wallis 0.0707
## 3 Wg2    pH_1  pH_change    10      8.56     3 0.0358 Kruskal-Wallis 0.0707
pwc.protein <- dat.clean %>%
  filter(Protein %in% c("Casein", "Potato")) %>%
  group_by(Strain, Day) %>%
  pairwise_wilcox_test(pH_change ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc.protein %>% filter(p < 0.05)
## # A tibble: 0 x 11
## # i 11 variables: Strain <chr>, Day <chr>, .y. <chr>, group1 <chr>,
## #   group2 <chr>, n1 <int>, n2 <int>, statistic <dbl>, p <dbl>, p.adj <dbl>,
## #   p.adj.signif <chr>
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  kruskal_test(pH_change ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 2 x 9
##   Protein Day   .y.           n statistic    df       p method          p.adj
##   <chr>   <chr> <chr>     <int>     <dbl> <int>   <dbl> <chr>           <dbl>
## 1 Casein  pH_1  pH_change    15      11.9     4 0.0182  Kruskal-Wallis 0.0364
## 2 Potato  pH_1  pH_change    15      13.4     4 0.00943 Kruskal-Wallis 0.0364
pwc.strain <- dat.clean %>%
  filter(Protein %in% c("Casein", "Potato")) %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(pH_change ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc.strain %>% filter(p < 0.05)
## # A tibble: 0 x 10
## # i 10 variables: Protein <chr>, .y. <chr>, group1 <chr>, group2 <chr>,
## #   n1 <int>, n2 <int>, statistic <dbl>, p <dbl>, p.adj <dbl>,
## #   p.adj.signif <chr>

3.5 CASEIN AND POTATO - Protein and Strain and Day - logN

3.5.1 DATA

# load data 
load("R_objects/stat_test_data_growth.RData")

# Create subset
dat.clean <- dat %>% 
  filter(!is.na(CFU) & Day != 0) %>%
  subset(Protein %in% c("Potato", "Casein"))%>%
  mutate(Day = as.factor(Day))

3.5.2 PREPARE VARIABLES

# Set names of variables

PREDICTOR <- c("Protein","Strain","Day")
OUTCOME <- "logN"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 16 x 7
##    Strain Protein Day   variable     n  mean    sd
##    <chr>  <chr>   <fct> <fct>    <dbl> <dbl> <dbl>
##  1 33     Casein  1     logN         3  9.21 0.056
##  2 33     Casein  6     logN         2  4.11 0.038
##  3 SK11   Casein  1     logN         3  7.79 0.077
##  4 SK11   Casein  6     logN         3  6.17 0.785
##  5 Wg2    Casein  1     logN         3  9.13 0.019
##  6 Wg2    Casein  6     logN         2  4.46 0.025
##  7 pAK80  Casein  1     logN         3  7.38 0.075
##  8 pAK80  Casein  6     logN         4  7.30 1.02 
##  9 33     Potato  1     logN         3  8.98 0.091
## 10 33     Potato  6     logN         4  6.82 0.034
## 11 SK11   Potato  1     logN         3  8.20 0.509
## 12 SK11   Potato  6     logN         4  7.15 0.14 
## 13 Wg2    Potato  1     logN         3  8.79 0.051
## 14 Wg2    Potato  6     logN         4  6.75 0.101
## 15 pAK80  Potato  1     logN         3  8.26 0.107
## 16 pAK80  Potato  6     logN         4  6.86 0.176
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

3.5.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 2 x 9
##   Strain Protein Day   Sample        pH     CFU  logN is.outlier is.extreme
##   <chr>  <chr>   <fct> <chr>      <dbl>   <dbl> <dbl> <lgl>      <lgl>     
## 1 SK11   Potato  6     SK11-P1-D6  5.99 8800000  6.94 TRUE       FALSE     
## 2 Wg2    Potato  6     Wg2-P2-D6   5    4020000  6.60 TRUE       FALSE

Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic      p.value
##   <chr>                <dbl>        <dbl>
## 1 residuals(model)     0.724 0.0000000182
# Compute Shapiro-Wilk test of normality for each data group
sets <- dat.clean %>%
  group_by(Day, Protein, Strain) %>% 
  summarise(n_unique = n_distinct(logN),
            n = n()) %>%
  filter(n > 2, n_unique > 1) %>% 
  transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
dat.clean %>% 
  mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
  filter(day_protein_strain %in% sets$day_protein_strain) %>%
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "logN") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 1 x 7
##   Strain Protein Day   variable statistic     p p.adj
##   <chr>  <chr>   <fct> <chr>        <dbl> <dbl> <dbl>
## 1 SK11   Casein  1     logN          0.75     0     0

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1    15    35      1.47 0.171
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

3.5.4 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Day, Strain) %>%
  kruskal_test(logN ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 3 x 9
##   Strain Day   .y.       n statistic    df      p method         p.adj
##   <chr>  <fct> <chr> <int>     <dbl> <int>  <dbl> <chr>          <dbl>
## 1 33     1     logN      6      3.86     1 0.0495 Kruskal-Wallis 0.103
## 2 Wg2    1     logN      6      3.86     1 0.0495 Kruskal-Wallis 0.103
## 3 pAK80  1     logN      6      3.86     1 0.0495 Kruskal-Wallis 0.103
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  kruskal_test(logN ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 4 x 9
##   Protein Day   .y.       n statistic    df      p method          p.adj
##   <chr>   <fct> <chr> <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 Casein  1     logN     12     10.2      3 0.0166 Kruskal-Wallis 0.0407
## 2 Potato  1     logN     12      9.46     3 0.0237 Kruskal-Wallis 0.0407
## 3 Casein  6     logN     11      8.21     3 0.0418 Kruskal-Wallis 0.0418
## 4 Potato  6     logN     16      8.91     3 0.0305 Kruskal-Wallis 0.0407
pwc <- dat.clean %>%
  group_by(Day, Protein) %>%
  pairwise_wilcox_test(logN ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 2 x 11
##   Protein Day   .y.   group1 group2    n1    n2 statistic     p p.adj
##   <chr>   <fct> <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
## 1 Potato  6     logN  33     SK11       4     4         0 0.029 0.200
## 2 Potato  6     logN  SK11   Wg2        4     4        16 0.029 0.200
## # i 1 more variable: p.adj.signif <chr>

3.5.5 VISUALIZE

All CFU data

# load data 
load("R_objects/stat_test_data_growth.RData")

# Create subset
dat.clean <- dat %>% filter(!is.na(CFU) & Day != 0) %>%
  mutate(Day = as.factor(Day))
# Set names of variables

PREDICTOR <- c("Protein","Strain","Day")
OUTCOME <- "logN"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 31 x 7
##    Strain Protein     Day   variable     n  mean    sd
##    <chr>  <chr>       <fct> <fct>    <dbl> <dbl> <dbl>
##  1 33     Casein      1     logN         3  9.21 0.056
##  2 33     Casein      6     logN         2  4.11 0.038
##  3 SK11   Casein      1     logN         3  7.79 0.077
##  4 SK11   Casein      6     logN         3  6.17 0.785
##  5 Wg2    Casein      1     logN         3  9.13 0.019
##  6 Wg2    Casein      6     logN         2  4.46 0.025
##  7 pAK80  Casein      1     logN         3  7.38 0.075
##  8 pAK80  Casein      6     logN         4  7.30 1.02 
##  9 33     Hydrolysate 6     logN         2  2.20 3.12 
## 10 SK11   Hydrolysate 1     logN         2  7.27 0.09 
## # i 21 more rows
#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, levels = c("pAK80", "SK11", "Wg2", "33"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato", "Hydrolysate", "Water"))

#Plot - boxplot
bxp.logN.sub <- dat.clean %>%
  filter(., Protein  %in% c("Casein", "Potato", "Water")) %>%
  filter (., Day %in% c(1, 2, 6)) %>%
  ggboxplot(.,
            x = "Day",
            y = "logN",
            color = "Strain", 
            facet.by = "Protein")+
  labs(x="Time [days]", y="log10(CFU/mL)")+
  scale_y_continuous(limits = c(0, 10))+
  geom_hline(yintercept=6.8, linetype="dashed", color = "#7d7d7dff")+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM", 
                                      "Water"="CDM")))+
  scale_color_manual(values = pal_rickandmorty("schwifty")(5)[2:5], 
                    labels=c("MS22418 (PrtP-)", 
                             "MS22421 (PrtP/SK11)", 
                             "MS22425 (PrtP/Wg2)", 
                             "MS22427 (PrtP/MS22333)"))+ 

  theme_light()+
  theme(legend.position="bottom")

bxp.logN.sub 

#Save plot

ggsave(filename = "plots/bxp.logN_all_sub.svg", plot = bxp.logN.sub, width = 16.5, height = 10, units =c("cm"), dpi = 600)

Figure S4

4 DATA SET: Organic acids

4.1 DATA

4.1.1 LOAD

dat <- read_excel("./data/Data_collection.xlsx", sheet="Organic_compounds", na = "NA")

4.1.2 LOOK

# Take a glimpse
glimpse(dat)
## Rows: 93
## Columns: 12
## $ `First Injection` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ Sample            <chr> "pAK80-P1-D1", "pAK80-P2-D1", "pAK80-P3-D1", "pAK80-~
## $ Strain            <chr> "pAK80", "pAK80", "pAK80", "pAK80", "pAK80", "pAK80"~
## $ Protein           <chr> "Potato", "Potato", "Potato", "Casein", "Casein", "C~
## $ Day               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ Lactose           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Glucose           <dbl> 26.2580, 26.7594, 27.0690, 30.0366, 30.3484, 30.0170~
## $ Pyruvate          <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00~
## $ Lactate           <dbl> 6.7332, 6.8342, 6.8214, 1.6452, 1.8118, 1.7192, 13.7~
## $ Formate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Acetate           <dbl> 1.5994, 1.6274, 1.6640, 1.4928, 1.3780, 1.2972, 1.68~
## $ Ethanol           <dbl> 3.8348, 3.1974, 2.7324, 3.0098, 3.3522, 2.8502, 2.93~
# Make a explorative summary 
skimr::skim(dat)
Data summary
Name dat
Number of rows 93
Number of columns 12
_______________________
Column type frequency:
character 3
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1 8 11 0 93 0
Strain 0 1 2 5 0 5 0
Protein 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
First Injection 0 1 47.00 26.99 1 24.00 47.00 70.00 93.00 <U+2587><U+2587><U+2587><U+2587><U+2587>
Day 0 1 3.40 2.30 1 1.00 2.00 6.00 6.00 <U+2587><U+2581><U+2581><U+2581><U+2586>
Lactose 0 1 0.01 0.01 0 0.00 0.00 0.02 0.03 <U+2587><U+2581><U+2581><U+2582><U+2583>
Glucose 0 1 15.40 12.76 0 0.02 20.87 27.91 30.81 <U+2587><U+2582><U+2581><U+2583><U+2587>
Pyruvate 0 1 0.04 0.06 0 0.00 0.00 0.08 0.22 <U+2587><U+2582><U+2581><U+2581><U+2581>
Lactate 0 1 25.67 21.85 0 4.57 17.74 49.36 57.80 <U+2587><U+2583><U+2581><U+2582><U+2587>
Formate 0 1 0.07 0.16 0 0.00 0.00 0.00 0.60 <U+2587><U+2581><U+2581><U+2581><U+2581>
Acetate 0 1 1.54 0.24 1 1.41 1.51 1.67 2.16 <U+2582><U+2586><U+2587><U+2582><U+2582>
Ethanol 0 1 2.78 0.48 0 2.60 2.84 3.01 4.23 <U+2581><U+2581><U+2582><U+2587><U+2581>

4.1.3 CLEAN

#Reorganization of data
dat <- pivot_longer(dat, cols=6:12, names_to="compound", values_to = "concentration")

# Remove column
out.var <- c("First Injection")

dat <- dat %>% select(-one_of(out.var))

# Change variables
dat <- dat %>% mutate(Day = as.factor(Day))

# Look at cleaned data
skimr::skim(dat)
Data summary
Name dat
Number of rows 651
Number of columns 6
_______________________
Column type frequency:
character 4
factor 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1 8 11 0 93 0
Strain 0 1 2 5 0 5 0
Protein 0 1 6 6 0 2 0
compound 0 1 7 8 0 7 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Day 0 1 FALSE 3 6: 280, 1: 210, 2: 161

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
concentration 0 1 6.5 13.34 0 0 1.16 2.93 57.8 <U+2587><U+2581><U+2581><U+2581><U+2581>

4.1.4 SAVE

# Save cleaned data
save(dat, file = "R_objects/stat_test_data_OA.RData")

# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())

4.2 TEST INDEPENDENCE

4.2.1 CATEGORICAL VARIABLES

Chi squared test of independence.

# load data 
load("R_objects/stat_test_data_OA.RData")

# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
dat <- dat %>% mutate(across(where(is_character), as_factor))


# Create vector of categorical variables with more than two categories and less than half of the number of samples.
cat_var <- dat %>% 
  select_if(is.factor) %>% 
  select(where(~n_distinct(.) > 1)) %>% 
  select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
  colnames()

length(cat_var) # If less than 2 variables the following code should not be run
## [1] 5
# Test the variables pairwise 
cat_var %>% 
  combn(2) %>% 
  t() %>% 
  as_tibble() %>% 
  rowwise %>% 
  mutate(chisq_test = list(
    table(dat[[V1]], dat[[V2]]) %>% chisq.test()
    ),
    chisq_pval = chisq_test$p.value
    )
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
##   chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## i Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 3 remaining warnings.
## # A tibble: 10 x 4
## # Rowwise: 
##    V1      V2       chisq_test chisq_pval
##    <chr>   <chr>    <list>          <dbl>
##  1 Sample  Strain   <htest>     0        
##  2 Sample  Protein  <htest>     4.88e- 85
##  3 Sample  Day      <htest>     1.76e-167
##  4 Sample  compound <htest>     1   e+  0
##  5 Strain  Protein  <htest>     9.90e-  1
##  6 Strain  Day      <htest>     5.72e-  6
##  7 Strain  compound <htest>     1   e+  0
##  8 Protein Day      <htest>     8.92e-  1
##  9 Protein compound <htest>     1   e+  0
## 10 Day     compound <htest>     1   e+  0

4.2.2 QUANTITATIVE VARIABLES

# create vector with the relevant variables 
CON.VARS <- dat %>% select(where(is.numeric)) %>% colnames()

CON.VARS
## [1] "concentration"

4.3 Glucose - Strain and Protein - Concentration

# load data 
load("R_objects/stat_test_data_OA.RData")

# Create subset
dat.clean <- dat %>% filter(compound == "Glucose", Day %in% c("1", "2", "6")) 

4.3.1 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain", "Day")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 28 x 7
##    Strain Protein Day   variable          n   mean    sd
##    <chr>  <chr>   <fct> <fct>         <dbl>  <dbl> <dbl>
##  1 33     Casein  1     concentration     3  0     0    
##  2 33     Casein  2     concentration     3  0.002 0.004
##  3 33     Casein  6     concentration     4  0.005 0.005
##  4 No     Casein  1     concentration     3 30.6   0.13 
##  5 No     Casein  6     concentration     4 29.4   0.592
##  6 SK11   Casein  1     concentration     3 29.3   1.10 
##  7 SK11   Casein  2     concentration     3 27.8   0.212
##  8 SK11   Casein  6     concentration     4 22.3   0.382
##  9 Wg2    Casein  1     concentration     3  0     0    
## 10 Wg2    Casein  2     concentration     3  0.009 0.007
## # i 18 more rows
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

4.3.2 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 5 x 8
##   Strain Protein Day   Sample      compound concentration is.outlier is.extreme
##   <chr>  <chr>   <fct> <chr>       <chr>            <dbl> <lgl>      <lgl>     
## 1 No     Casein  6     No-C1-D6    Glucose          30.2  TRUE       FALSE     
## 2 Wg2    Casein  6     Wg2-C3-D6   Glucose          29.8  TRUE       FALSE     
## 3 pAK80  Casein  6     pAK80-C4-D6 Glucose          16.7  TRUE       FALSE     
## 4 No     Potato  6     No-P4-D6    Glucose           0    TRUE       FALSE     
## 5 pAK80  Potato  6     pAK80-P4-D6 Glucose           7.92 TRUE       FALSE

Look at outliers
Outliers are derfined with a coefficient (coef=1.5), which specify how far the outlier is from the edge of their boxes.

Extreme outliers are defined with a coef=3.0. However, two outliers appear to be extreme in the boxplot (Samples Wg2-C3-D6 and No-P4-D6) why they are tested with different coefficients. Both points are outliers with high coefficient values that are close to 3. Therefore, we defined outliers as extreme with coef above 2.59. The extreme outliers are removed.

#Test coef for outlier Wg2-C3-D6 
dat.clean %>%
  filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.995)
## [1] FALSE FALSE  TRUE FALSE
#Test coef for outlier No-P4-D6
dat.clean %>%
  filter(Strain == "No" & Protein == "Potato" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.88)
## [1] FALSE FALSE FALSE  TRUE
#Test coef for outlier pAK80-C4-D6
dat.clean %>%
  filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE FALSE
#Remove extreme outliers

extreme_outliers <- c("Wg2-C3-D6", "No-P4-D6")

dat.clean <- subset(dat.clean, !Sample %in% extreme_outliers) 
# Create plot without extreme outliers
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 residuals(model)     0.468 1.50e-16
# Compute Shapiro-Wilk test of normality for each data group
sets <- dat.clean %>%
  group_by(Day, Protein, Strain) %>% 
  summarise(n_unique = n_distinct(concentration),
            n = n()) %>%
  filter(n > 2, n_unique > 1) %>% 
  transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
dat.clean %>% 
  mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
  filter(day_protein_strain %in% sets$day_protein_strain) %>%
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 4 x 7
##   Strain Protein Day   variable      statistic       p  p.adj
##   <chr>  <chr>   <fct> <chr>             <dbl>   <dbl>  <dbl>
## 1 33     Casein  2     concentration     0.75  0       0     
## 2 Wg2    Casein  2     concentration     0.75  0       0     
## 3 No     Casein  6     concentration     0.665 0.00415 0.0259
## 4 Wg2    Casein  6     concentration     0.75  0       0

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1    27    63      1.36 0.156
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

4.3.3 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Day, Strain) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 11 x 9
##    Strain Day   .y.               n statistic    df      p method         p.adj
##    <chr>  <fct> <chr>         <int>     <dbl> <int>  <dbl> <chr>          <dbl>
##  1 33     1     concentration     6      4.35     1 0.0369 Kruskal-Wallis 0.063
##  2 SK11   1     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.063
##  3 Wg2    1     concentration     6      4.35     1 0.0369 Kruskal-Wallis 0.063
##  4 pAK80  1     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.063
##  5 33     2     concentration     6      3.97     1 0.0463 Kruskal-Wallis 0.063
##  6 Wg2    2     concentration     6      3.97     1 0.0463 Kruskal-Wallis 0.063
##  7 pAK80  2     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.063
##  8 33     6     concentration     8      5.40     1 0.0202 Kruskal-Wallis 0.063
##  9 SK11   6     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.063
## 10 Wg2    6     concentration     7      4.58     1 0.0323 Kruskal-Wallis 0.063
## 11 pAK80  6     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.063
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  kruskal_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 6 x 9
##   Protein Day   .y.               n statistic    df       p method         p.adj
##   <chr>   <fct> <chr>         <int>     <dbl> <int>   <dbl> <chr>          <dbl>
## 1 Casein  1     concentration    15     13.2      4 0.0104  Kruskal-Wal~ 0.0182 
## 2 Potato  1     concentration    15     12.8      4 0.0121  Kruskal-Wal~ 0.0182 
## 3 Casein  2     concentration    12      9.84     3 0.02    Kruskal-Wal~ 0.024  
## 4 Potato  2     concentration    11      9.41     3 0.0243  Kruskal-Wal~ 0.0243 
## 5 Casein  6     concentration    19     16.1      4 0.00289 Kruskal-Wal~ 0.00867
## 6 Potato  6     concentration    19     16.3      4 0.00265 Kruskal-Wal~ 0.00867
pwc <- dat.clean %>%
  filter(Day == 6) %>%
  group_by(Day, Protein) %>%
  pairwise_wilcox_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 10 x 11
##    Protein Day   .y.           group1 group2    n1    n2 statistic     p p.adj
##    <chr>   <fct> <chr>         <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
##  1 Casein  6     concentration 33     No         4     4         0 0.029 0.058
##  2 Casein  6     concentration 33     pAK80      4     4         0 0.029 0.058
##  3 Casein  6     concentration 33     SK11       4     4         0 0.029 0.058
##  4 Casein  6     concentration No     pAK80      4     4        16 0.029 0.058
##  5 Casein  6     concentration No     SK11       4     4        16 0.029 0.058
##  6 Potato  6     concentration 33     pAK80      4     4         0 0.029 0.058
##  7 Potato  6     concentration 33     SK11       4     4         0 0.029 0.058
##  8 Potato  6     concentration pAK80  SK11       4     4        16 0.029 0.058
##  9 Potato  6     concentration pAK80  Wg2        4     4        16 0.029 0.058
## 10 Potato  6     concentration SK11   Wg2        4     4        16 0.029 0.058
## # i 1 more variable: p.adj.signif <chr>

4.3.4 VISUALIZE

#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, 
                           levels = c("No", "pAK80", "SK11", "Wg2", "33"), 
                           labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)")
                           )
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Plot - boxplot
bxp.glucose <- dat.clean %>%
  filter(., Protein  %in% c("Casein", "Potato")) %>%
  filter (., Day %in% c(1, 2, 6)) %>%
  ggboxplot(.,
            x = "Day",
            y = "concentration",
            color = "Strain", 
            facet.by = "Protein")+
  labs(x="Time [days]", y="Concentration [mM]")+
  scale_y_continuous()+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM")
                                    )
             )+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  geom_hline(yintercept=30.0, linetype="dashed", color = "#7d7d7dff")+
  theme_light()+
  theme(legend.position="bottom")

save(bxp.glucose, file = "R_objects/Fig1B.RData")

4.4 Lactate - Strain and Protein - Concentration

# load data 
load("R_objects/stat_test_data_OA.RData")

# Create subset
dat.clean <- dat %>% filter(compound == "Lactate", Day %in% c("1", "2", "6")) 

# Look at cleaned data
glimpse(dat.clean)
## Rows: 93
## Columns: 6
## $ Sample        <chr> "pAK80-P1-D1", "pAK80-P2-D1", "pAK80-P3-D1", "pAK80-C1-D~
## $ Strain        <chr> "pAK80", "pAK80", "pAK80", "pAK80", "pAK80", "pAK80", "W~
## $ Protein       <chr> "Potato", "Potato", "Potato", "Casein", "Casein", "Casei~
## $ Day           <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ compound      <chr> "Lactate", "Lactate", "Lactate", "Lactate", "Lactate", "~
## $ concentration <dbl> 6.7332, 6.8342, 6.8214, 1.6452, 1.8118, 1.7192, 13.7722,~
skimr::skim(dat.clean)
Data summary
Name dat.clean
Number of rows 93
Number of columns 6
_______________________
Column type frequency:
character 4
factor 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1 8 11 0 93 0
Strain 0 1 2 5 0 5 0
Protein 0 1 6 6 0 2 0
compound 0 1 7 7 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Day 0 1 FALSE 3 6: 40, 1: 30, 2: 23

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
concentration 0 1 25.67 21.85 0 4.57 17.74 49.36 57.8 <U+2587><U+2583><U+2581><U+2582><U+2587>

4.5 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain", "Day")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 28 x 7
##    Strain Protein Day   variable          n   mean    sd
##    <chr>  <chr>   <fct> <fct>         <dbl>  <dbl> <dbl>
##  1 33     Casein  1     concentration     3 54.1   1.90 
##  2 33     Casein  2     concentration     3 50.9   2.36 
##  3 33     Casein  6     concentration     4 49.5   0.838
##  4 No     Casein  1     concentration     3  0.115 0.021
##  5 No     Casein  6     concentration     4  0.162 0.039
##  6 SK11   Casein  1     concentration     3  2.28  0.113
##  7 SK11   Casein  2     concentration     3  4.90  0.305
##  8 SK11   Casein  6     concentration     4 12.1   0.738
##  9 Wg2    Casein  1     concentration     3 53.1   0.489
## 10 Wg2    Casein  2     concentration     3 54.8   2.38 
## # i 18 more rows
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

4.5.1 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 8
##   Strain Protein Day   Sample      compound concentration is.outlier is.extreme
##   <chr>  <chr>   <fct> <chr>       <chr>            <dbl> <lgl>      <lgl>     
## 1 Wg2    Casein  6     Wg2-C3-D6   Lactate           0.15 TRUE       FALSE     
## 2 pAK80  Casein  6     pAK80-C4-D6 Lactate          20.9  TRUE       FALSE     
## 3 No     Potato  6     No-P4-D6    Lactate          47.9  TRUE       FALSE

Look at outliers

Outliers are derfined with a coefficient (coef=1.5), which specify how far the outliers are from the edge of their box.

Points are extreme outliers if they have high coefficient values that are close to 3. Therefore, we defined outliers as extreme with coef above 2.59. The extreme outliers are removed.

#Test coef for outlier Wg2-C3-D6 
dat.clean %>%
  filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE  TRUE FALSE
#Test coef for outlier No-P4-D6
dat.clean %>%
  filter(Strain == "No" & Protein == "Potato" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.98)
## [1] FALSE FALSE FALSE  TRUE
#Test coef for outlier pAK80-C4-D6
dat.clean %>%
  filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.55)
## [1] FALSE FALSE FALSE FALSE
#Remove extreme outliers

extreme_outliers <- c("Wg2-C3-D6", "No-P4-D6")

dat.clean <- subset(dat.clean, !Sample %in% extreme_outliers) 
# Create plot without extreme outliers
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 residuals(model)     0.706 3.32e-12
# Compute Shapiro-Wilk test of normality for each data group
sets <- dat.clean %>%
  group_by(Day, Protein, Strain) %>% 
  summarise(n_unique = n_distinct(concentration),
            n = n()) %>%
  filter(n > 2, n_unique > 1) %>% 
  transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
dat.clean %>% 
  mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
  filter(day_protein_strain %in% sets$day_protein_strain) %>%
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 0 x 7
## # i 7 variables: Strain <chr>, Protein <chr>, Day <fct>, variable <chr>,
## #   statistic <dbl>, p <dbl>, p.adj <dbl>

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1    27    63     0.956 0.537
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

4.5.2 PERFORM TEST

The data are not normal distributed, but normality is found within each group by the Shapiro-Wilk normality test. However, the same data groups of the glucose data were not normal distributed.

The same statistical analysis is applied for the two data sets, including a Kruskal-Wallis test and a Wilcoxon pairwise test.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Day, Strain) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 10 x 9
##    Strain Day   .y.               n statistic    df      p method          p.adj
##    <chr>  <fct> <chr>         <int>     <dbl> <int>  <dbl> <chr>           <dbl>
##  1 33     1     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.0693
##  2 SK11   1     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.0693
##  3 Wg2    1     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.0693
##  4 pAK80  1     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.0693
##  5 Wg2    2     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.0693
##  6 pAK80  2     concentration     6      3.86     1 0.0495 Kruskal-Wallis 0.0693
##  7 33     6     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0693
##  8 SK11   6     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0693
##  9 Wg2    6     concentration     7      4.5      1 0.0339 Kruskal-Wallis 0.0693
## 10 pAK80  6     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0693
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  kruskal_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 6 x 9
##   Protein Day   .y.               n statistic    df       p method         p.adj
##   <chr>   <fct> <chr>         <int>     <dbl> <int>   <dbl> <chr>          <dbl>
## 1 Casein  1     concentration    15     12.8      4 0.0121  Kruskal-Wal~ 0.0182 
## 2 Potato  1     concentration    15     12.9      4 0.0118  Kruskal-Wal~ 0.0182 
## 3 Casein  2     concentration    12      9.67     3 0.0216  Kruskal-Wal~ 0.0243 
## 4 Potato  2     concentration    11      9.41     3 0.0243  Kruskal-Wal~ 0.0243 
## 5 Casein  6     concentration    19     16.1      4 0.00293 Kruskal-Wal~ 0.00879
## 6 Potato  6     concentration    19     16.3      4 0.00263 Kruskal-Wal~ 0.00879
pwc <- dat.clean %>%
  filter(Day == 6) %>%
  group_by(Day, Protein) %>%
  pairwise_wilcox_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 10 x 11
##    Protein Day   .y.           group1 group2    n1    n2 statistic     p p.adj
##    <chr>   <fct> <chr>         <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
##  1 Casein  6     concentration 33     No         4     4        16 0.029 0.058
##  2 Casein  6     concentration 33     pAK80      4     4        16 0.029 0.058
##  3 Casein  6     concentration 33     SK11       4     4        16 0.029 0.058
##  4 Casein  6     concentration No     pAK80      4     4         0 0.029 0.058
##  5 Casein  6     concentration No     SK11       4     4         0 0.029 0.058
##  6 Potato  6     concentration 33     pAK80      4     4        16 0.029 0.058
##  7 Potato  6     concentration 33     SK11       4     4        16 0.029 0.058
##  8 Potato  6     concentration pAK80  SK11       4     4         0 0.029 0.058
##  9 Potato  6     concentration pAK80  Wg2        4     4         0 0.029 0.058
## 10 Potato  6     concentration SK11   Wg2        4     4         0 0.029 0.058
## # i 1 more variable: p.adj.signif <chr>

4.5.3 VISUALIZE

#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, 
                           levels = c("No", "pAK80", "SK11", "Wg2", "33"),
                           labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)")
                           )
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Boxplot
bxp.lactate <- dat.clean %>%
  filter(., Protein  %in% c("Casein", "Potato")) %>%
  filter (., Day %in% c(1, 2, 6)) %>%
  ggboxplot(.,
            x = "Day",
            y = "concentration",
            color = "Strain", 
            facet.by = "Protein")+
  labs(x="Time [days]", y="Concentration [mM]")+
  scale_y_continuous()+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM")
                                    )
             )+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+

  theme_light()+
  theme(legend.position="bottom")

save(bxp.lactate, file = "R_objects/Fig1C.RData")

4.6 VISUALIZE: Organic acids together with pH

load(file = "R_objects/Fig1A.RData")
load(file = "R_objects/Fig1B.RData")
load(file = "R_objects/Fig1C.RData")

fig1 <- ggarrange(bxp.pH, bxp.glucose, bxp.lactate, heights = c(3,2,2),
          ncol = 1,
          labels = c("A","B", "C"),
          align = "v",
          legend = "right", common.legend = TRUE
          )

#Save Figure 1
ggsave(filename = "plots/fig1_pH_glucose_lactate.svg", plot = fig1, width = 16.5, height = 20, units =c("cm"), dpi = 600)

Figure 1

5 DATA SET: Amino acids

5.1 DATA

5.1.1 LOAD

dat <- read_excel("./data/Data_collection.xlsx", sheet="AA", na = "NA")

5.1.2 LOOK

# Take a glimpse
glimpse(dat)
## Rows: 94
## Columns: 27
## $ Sample_ID         <chr> "Wg2-C1-D1", "Wg2-C2-D1", "Wg2-C3-D1", "Sk11-C1-D1",~
## $ Sample            <chr> "DC1-1", "DC1-2", "DC1-3", "CC1-1", "CC1-2", "CC1-3"~
## $ `First Injection` <chr> "50Lise", "51Lise", "52Lise", "56Lise", "57Lise", "5~
## $ `Sample no`       <chr> "C1", "C2", "C3", "C1", "C2", "C3", "C1", "C2", "C3"~
## $ Day               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2~
## $ Protein           <chr> "Casein", "Casein", "Casein", "Casein", "Casein", "C~
## $ Strain            <chr> "Wg2", "Wg2", "Wg2", "SK11", "SK11", "SK11", "pAK80"~
## $ conc              <chr> "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM"~
## $ His               <dbl> 18.4537472, 17.1673140, 18.5421378, 0.7142539, 0.794~
## $ Asn               <dbl> 13.546499, 12.328580, 15.030006, 1.541399, 2.227930,~
## $ Ser               <dbl> 6.097704, 6.053979, 6.150434, 4.579292, 5.554308, 5.~
## $ `Gln/Arg`         <dbl> 3.3068133, 2.9704246, 3.1039723, 0.2580126, 0.286869~
## $ Gly               <dbl> 59.60045, 58.21350, 57.65550, 66.11592, 83.48793, 82~
## $ Asp               <dbl> 2.3255601, 2.5187825, 2.4126894, 0.5196090, 0.770297~
## $ Glu               <dbl> 31.6012219, 31.9529499, 29.6124309, 0.7158399, 1.459~
## $ Thr               <dbl> 3.6553179, 3.2991876, 3.7922678, 0.2722407, 0.605377~
## $ Ala               <dbl> 22.832045, 22.878213, 26.319410, 1.558777, 1.925683,~
## $ Pro               <dbl> 318.4714880, 309.0975224, 314.1835222, 4.9019494, 5.~
## $ Cys               <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000000~
## $ Lys               <dbl> 5.526448, 6.202491, 4.736371, 2.165367, 2.188678, 2.~
## $ Tyr               <dbl> 95.12991, 90.49597, 96.35402, 15.21761, 16.79873, 17~
## $ Met               <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.~
## $ Val               <dbl> 17.711121, 17.242679, 18.731475, 3.099217, 3.132582,~
## $ Ile               <dbl> 24.26801, 24.62159, 25.36100, 22.05858, 24.04160, 24~
## $ Leu               <dbl> 25.0888213, 23.7935900, 25.8345528, 0.3067792, 0.511~
## $ Phe               <dbl> 51.791264, 48.307623, 51.469286, 3.014791, 3.351971,~
## $ Trp               <dbl> 36.56362, 35.95126, 37.81154, 26.58388, 29.87836, 30~
# Make a explorative summary 
skimr::skim(dat)
Data summary
Name dat
Number of rows 94
Number of columns 27
_______________________
Column type frequency:
character 7
numeric 20
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample_ID 0 1 8 11 0 94 0
Sample 0 1 5 5 0 94 0
First Injection 0 1 5 6 0 94 0
Sample no 0 1 2 2 0 8 0
Protein 0 1 6 6 0 2 0
Strain 0 1 2 7 0 5 0
conc 0 1 2 2 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Day 0 1.00 3.38 2.30 1.00 1.00 2.00 6.00 6.00 <U+2587><U+2581><U+2581><U+2581><U+2586>
His 1 0.99 9.17 17.53 0.00 0.00 0.79 2.06 59.98 <U+2587><U+2581><U+2581><U+2581><U+2581>
Asn 1 0.99 9.70 12.56 0.61 2.06 2.76 12.97 56.05 <U+2587><U+2582><U+2581><U+2581><U+2581>
Ser 1 0.99 8.31 6.85 2.55 4.73 5.89 8.30 47.89 <U+2587><U+2581><U+2581><U+2581><U+2581>
Gln/Arg 1 0.99 4.18 4.86 0.00 0.87 3.01 4.67 33.49 <U+2587><U+2582><U+2581><U+2581><U+2581>
Gly 1 0.99 75.34 12.36 48.38 68.05 76.24 82.79 134.98 <U+2583><U+2587><U+2583><U+2581><U+2581>
Asp 1 0.99 10.23 10.64 0.19 2.10 5.96 15.02 37.30 <U+2587><U+2583><U+2581><U+2582><U+2581>
Glu 1 0.99 24.65 29.81 0.52 1.44 9.45 32.59 104.82 <U+2587><U+2582><U+2581><U+2581><U+2581>
Thr 1 0.99 2.83 3.09 0.27 0.60 1.09 4.37 18.32 <U+2587><U+2582><U+2581><U+2581><U+2581>
Ala 1 0.99 30.80 35.96 0.95 3.52 14.27 42.82 172.54 <U+2587><U+2583><U+2581><U+2581><U+2581>
Pro 1 0.99 111.49 174.20 0.00 6.27 20.78 62.16 647.38 <U+2587><U+2581><U+2581><U+2581><U+2581>
Cys 1 0.99 0.73 2.36 0.00 0.00 0.00 0.00 14.34 <U+2587><U+2581><U+2581><U+2581><U+2581>
Lys 1 0.99 18.36 27.02 0.00 2.17 5.48 20.61 114.74 <U+2587><U+2581><U+2581><U+2581><U+2581>
Tyr 1 0.99 43.20 43.24 0.82 16.80 26.27 51.81 229.70 <U+2587><U+2581><U+2582><U+2581><U+2581>
Met 1 0.99 0.78 1.82 0.00 0.00 0.00 0.00 11.50 <U+2587><U+2581><U+2581><U+2581><U+2581>
Val 1 0.99 9.63 12.18 0.31 2.56 3.17 13.16 65.19 <U+2587><U+2581><U+2581><U+2581><U+2581>
Ile 1 0.99 16.13 10.40 0.00 3.48 22.05 24.10 35.35 <U+2585><U+2581><U+2581><U+2587><U+2581>
Leu 1 0.99 14.64 25.21 0.00 0.51 1.57 19.97 134.90 <U+2587><U+2581><U+2581><U+2581><U+2581>
Phe 1 0.99 31.41 33.76 0.47 3.52 15.24 60.94 163.17 <U+2587><U+2583><U+2582><U+2581><U+2581>
Trp 1 0.99 22.47 14.84 0.00 2.28 29.48 30.87 44.19 <U+2585><U+2581><U+2581><U+2587><U+2582>

5.1.3 CLEAN DATA

# Reorganization of data
dat <-  pivot_longer(dat, cols=9:27, names_to="aa", values_to = "concentration")

# Change variables
dat <- dat %>% mutate(Day = as.factor(Day))

# Look at cleaned data
skimr::skim(dat)
Data summary
Name dat
Number of rows 1786
Number of columns 10
_______________________
Column type frequency:
character 8
factor 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample_ID 0 1 8 11 0 94 0
Sample 0 1 5 5 0 94 0
First Injection 0 1 5 6 0 94 0
Sample no 0 1 2 2 0 8 0
Protein 0 1 6 6 0 2 0
Strain 0 1 2 7 0 5 0
conc 0 1 2 2 0 1 0
aa 0 1 3 7 0 19 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Day 0 1 FALSE 3 6: 760, 1: 570, 2: 456

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
concentration 19 0.99 23.37 52.19 0 1.12 5.55 26.67 647.38 <U+2587><U+2581><U+2581><U+2581><U+2581>
glimpse(dat)
## Rows: 1,786
## Columns: 10
## $ Sample_ID         <chr> "Wg2-C1-D1", "Wg2-C1-D1", "Wg2-C1-D1", "Wg2-C1-D1", ~
## $ Sample            <chr> "DC1-1", "DC1-1", "DC1-1", "DC1-1", "DC1-1", "DC1-1"~
## $ `First Injection` <chr> "50Lise", "50Lise", "50Lise", "50Lise", "50Lise", "5~
## $ `Sample no`       <chr> "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1"~
## $ Day               <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ Protein           <chr> "Casein", "Casein", "Casein", "Casein", "Casein", "C~
## $ Strain            <chr> "Wg2", "Wg2", "Wg2", "Wg2", "Wg2", "Wg2", "Wg2", "Wg~
## $ conc              <chr> "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM"~
## $ aa                <chr> "His", "Asn", "Ser", "Gln/Arg", "Gly", "Asp", "Glu",~
## $ concentration     <dbl> 18.453747, 13.546499, 6.097704, 3.306813, 59.600447,~

5.1.4 SAVE

# Save cleaned data
save(dat, file = "R_objects/stat_test_data_AA.RData")

# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())

5.2 TEST INDEPENDENCE

5.2.1 CATEGORICAL VARIABLES

Chi squared test of independence

# load data 
load("R_objects/stat_test_data_AA.RData")

# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
dat <- dat %>% mutate(across(where(is_character), as_factor))


# Create vector of categorical variables with more than two categories and less than half of the number of samples.
cat_var <- dat %>% 
  select_if(is.factor) %>% 
  select(where(~n_distinct(.) > 1)) %>% 
  select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
  colnames()

length(cat_var) # If less than 2 variables the following code should not be run
## [1] 8
# Test the variables pairwise 
cat_var %>% 
  combn(2) %>% 
  t() %>% 
  as_tibble() %>% 
  rowwise %>% 
  mutate(chisq_test = list(
    table(dat[[V1]], dat[[V2]]) %>% chisq.test()
    ),
    chisq_pval = chisq_test$p.value
    )
## Warning: There were 15 warnings in `mutate()`.
## The first warning was:
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
##   chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## i Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 14 remaining warnings.
## # A tibble: 28 x 4
## # Rowwise: 
##    V1        V2              chisq_test chisq_pval
##    <chr>     <chr>           <list>          <dbl>
##  1 Sample_ID Sample          <htest>     0        
##  2 Sample_ID First Injection <htest>     0        
##  3 Sample_ID Sample no       <htest>     0        
##  4 Sample_ID Day             <htest>     0        
##  5 Sample_ID Protein         <htest>     3.58e-311
##  6 Sample_ID Strain          <htest>     0        
##  7 Sample_ID aa              <htest>     1   e+  0
##  8 Sample    First Injection <htest>     0        
##  9 Sample    Sample no       <htest>     0        
## 10 Sample    Day             <htest>     0        
## # i 18 more rows

5.2.2 QUANTITATIVE VARIABLES

# create vector with the relevant variables 
CON.VARS <- dat %>% select(where(is.numeric)) %>% colnames()

CON.VARS
## [1] "concentration"

5.3 Casein - ALL amino acids

5.3.1 DATA

# load data 
load("R_objects/stat_test_data_AA.RData")

# data subset

dat.clean <- dat %>%
  filter(., Protein == "Casein", !is.na(concentration))
# Set names of variables
PREDICTOR <- c("Strain", "Day", "aa")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 266 x 7
##    Day   Strain  aa      variable          n   mean    sd
##    <fct> <chr>   <chr>   <fct>         <dbl>  <dbl> <dbl>
##  1 1     MS22333 Ala     concentration     3 41.9   1.27 
##  2 1     MS22333 Asn     concentration     3 32.7   1.88 
##  3 1     MS22333 Asp     concentration     3  4.69  0.286
##  4 1     MS22333 Cys     concentration     3  0.394 0.161
##  5 1     MS22333 Gln/Arg concentration     3  3.07  0.123
##  6 1     MS22333 Glu     concentration     3 81.1   1.42 
##  7 1     MS22333 Gly     concentration     3 74.3   2.42 
##  8 1     MS22333 His     concentration     3 49.9   1.40 
##  9 1     MS22333 Ile     concentration     3 22.9   0.925
## 10 1     MS22333 Leu     concentration     3 28.9   0.328
## # i 256 more rows

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 53 x 12
##    Day   Strain  aa      Sample_ID Sample `First Injection` `Sample no` Protein
##    <fct> <chr>   <chr>   <chr>     <chr>  <chr>             <chr>       <chr>  
##  1 6     MS22333 Asn     33-C4-D6  EC6-4  32Lise            C4          Casein 
##  2 6     MS22333 Gln/Arg 33-C3-D6  EC6-3  31Lise            C3          Casein 
##  3 6     MS22333 His     33-C3-D6  EC6-3  31Lise            C3          Casein 
##  4 6     MS22333 Met     33-C4-D6  EC6-4  32Lise            C4          Casein 
##  5 6     MS22333 Phe     33-C3-D6  EC6-3  31Lise            C3          Casein 
##  6 6     MS22333 Thr     33-C3-D6  EC6-3  31Lise            C3          Casein 
##  7 6     MS22333 Tyr     33-C4-D6  EC6-4  32Lise            C4          Casein 
##  8 6     MS22333 Val     33-C4-D6  EC6-4  32Lise            C4          Casein 
##  9 6     No      Ala     No-C3-D6  AC6-3  39Lise            C3          Casein 
## 10 6     No      Asn     No-C2-D6  AC6-2  38Lise            C2          Casein 
## # i 43 more rows
## # i 4 more variables: conc <chr>, concentration <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

###VISUALIZE

#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, 
                           levels = c("No", "pAK80", "SK11", "Wg2", "MS22333"), 
                           labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Boxplots

bxp.AA.CCDM <- dat.clean %>%
  ggboxplot(.,
            x = "Day",
            y = "concentration",
            color = "Strain",
            facet.by = "aa")+
  labs(title="CCDM", x="Time [days]", y="Amino acid concentration [uM]")+
  scale_y_continuous()+
  facet_wrap(~aa, 
             nrow=5,
             scales = "free_y")+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  theme_light()+
  theme(legend.position="bottom")
#Save plot
ggsave(filename = "plots/bxp.AA.CCDM.svg", plot = bxp.AA.CCDM, width = 16.5, height = 20, units =c("cm"), dpi = 600)

Figure S5A

5.4 Potato - ALL amino acids

5.4.1 DATA

# load data 
load("R_objects/stat_test_data_AA.RData")

# data subset

dat.clean <- dat %>%
  filter(., Protein == "Potato", !is.na(concentration))
# Set names of variables
PREDICTOR <- c("Strain", "Day", "aa")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 266 x 7
##    Day   Strain  aa      variable          n   mean    sd
##    <fct> <chr>   <chr>   <fct>         <dbl>  <dbl> <dbl>
##  1 1     MS22333 Ala     concentration     3 10.3   0.274
##  2 1     MS22333 Asn     concentration     3  3.74  1.77 
##  3 1     MS22333 Asp     concentration     3  7.71  0.054
##  4 1     MS22333 Cys     concentration     3  0     0    
##  5 1     MS22333 Gln/Arg concentration     3  9.75  0.105
##  6 1     MS22333 Glu     concentration     3  9.10  0.297
##  7 1     MS22333 Gly     concentration     3 80.8   1.97 
##  8 1     MS22333 His     concentration     3  0.911 0.11 
##  9 1     MS22333 Ile     concentration     3 23.0   0.481
## 10 1     MS22333 Leu     concentration     3  1.74  0.057
## # i 256 more rows

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 17 x 12
##    Day   Strain  aa      Sample_ID  Sample `First Injection` `Sample no` Protein
##    <fct> <chr>   <chr>   <chr>      <chr>  <chr>             <chr>       <chr>  
##  1 6     MS22333 Ile     33_P1-D6   EP6-1  Lise25            P1          Potato 
##  2 6     MS22333 Leu     33_P1-D6   EP6-1  Lise25            P1          Potato 
##  3 6     MS22333 Val     33_P1-D6   EP6-1  Lise25            P1          Potato 
##  4 6     SK11    Asn     SK11_P1-D6 CP6-1  Lise17            P1          Potato 
##  5 6     SK11    Glu     SK11_P1-D6 CP6-1  Lise17            P1          Potato 
##  6 6     SK11    Gly     SK11_P1-D6 CP6-1  Lise17            P1          Potato 
##  7 6     SK11    Thr     SK11_P4-D6 CP6-4  Lise20            P4          Potato 
##  8 6     SK11    Val     SK11_P1-D6 CP6-1  Lise17            P1          Potato 
##  9 6     Wg2     Gln/Arg Wg2_P3-D6  DP6-3  Lise11            P3          Potato 
## 10 6     Wg2     Ser     Wg2_P1-D6  DP6-1  Lise9             P1          Potato 
## 11 6     Wg2     Thr     Wg2_P1-D6  DP6-1  Lise9             P1          Potato 
## 12 6     Wg2     Tyr     Wg2_P3-D6  DP6-3  Lise11            P3          Potato 
## 13 6     pAK80   Gly     pAK80_P4-~ BP6-4  Lise4             P4          Potato 
## 14 6     pAK80   Leu     pAK80_P3-~ BP6-3  Lise3             P3          Potato 
## 15 6     pAK80   Lys     pAK80_P3-~ BP6-3  Lise3             P3          Potato 
## 16 6     pAK80   Trp     pAK80_P2-~ BP6-2  Lise2             P2          Potato 
## 17 6     pAK80   Val     pAK80_P3-~ BP6-3  Lise3             P3          Potato 
## # i 4 more variables: conc <chr>, concentration <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

###VISUALIZE

#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, 
                           levels = c("No", "pAK80", "SK11", "Wg2", "MS22333"), 
                           labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Boxplots

bxp.AA.PCDM <- dat.clean %>%
  ggboxplot(.,
            x = "Day",
            y = "concentration",
            color = "Strain",
            facet.by = "aa")+
  labs(title="PCDM", x="Time [days]", y="Amino acid concentration [uM]")+
  scale_y_continuous()+
  facet_wrap(~aa, 
             nrow=5,
             scales = "free_y")+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  theme_light()+
  theme(legend.position="bottom")
#Save plot

ggsave(filename = "plots/bxp.AA.PCDM.svg", plot = bxp.AA.PCDM, width = 16.5, height = 20, units =c("cm"), dpi = 600)

Figure S5B

5.5 ALL - Create normalized heatmap

# load data 
load("R_objects/stat_test_data_AA.RData")

5.5.1 NORMALIZED DATA

dat.norm  <- dat %>%
  group_by(Sample) %>%
  mutate(conc_norm = 100*concentration/sum(concentration))

table(is.na(dat.norm$concentration))
## 
## FALSE  TRUE 
##  1767    19
# Extract metadata and aa table
dat.wide <- dat.norm %>%
  filter(!is.na(concentration)) %>%
  pivot_wider(values_from = conc_norm, 
              names_from = aa, 
              id_cols = c(`Sample no`, Strain, Protein, Day, Sample), 
              values_fill = 0)

meta <- dat.wide %>% 
  ungroup() %>%
  select(`Sample no`:Sample) %>% 
  data.frame
row.names(meta) <- meta$Sample

#aa.num defines the aa data in prercent
aa.num <- dat.wide %>%
  ungroup() %>%
  select(His:Trp) %>%
  data.frame
row.names(aa.num) <- meta$Sample

#aa.norm is the calculated z-score used in the heatmap

aa.norm <- aa.num

for (i in 1:ncol(aa.num)){
  x.no <- mean(aa.num[meta$Strain == "No",i])
  aa.norm[,i] <- (aa.num[,i]-x.no)/sd(aa.num[,i])
}

5.5.2 VISUALIZE: Heatmap

#Annotations
ano_df2 =data.frame("Day"=meta$Day, "Medium" = meta$Protein, "Strain" = meta$Strain)
rownames(ano_df2) = rownames(meta)
ano_df2$Day <- as.factor(ano_df2$Day)
ano_df2$Medium <- factor(ano_df2$Medium, levels = c("Casein", "Potato"), labels=c("CCDM", "PCDM"))
ano_df2$Strain <- factor(ano_df2$Strain, levels = c("No", "pAK80", "SK11", "Wg2", "MS22333"), labels=c("A: Non-inoculated", "B: MS22418 (PrtP-)", "C: MS22421 (PrtP/SK11)", "D: MS22425 (PrtP/Wg2)", "E: MS22427 (PrtP/MS22333)"))

annot_colors=list("Medium"=c("CCDM"="#A6EEE6FF", "PCDM"="#FAE48BFF"), "Day" = c("1"= "#d2d3daff", "2"= "#86889bff", "6"="#282931ff"), "Strain" =c("A: Non-inoculated"="#E89242FF", "B: MS22418 (PrtP-)"="#82491EFF", "C: MS22421 (PrtP/SK11)"="#24325FFF", "D: MS22425 (PrtP/Wg2)"="#B7E4F9FF", "E: MS22427 (PrtP/MS22333)"="#FB6467FF"))

#Plot
heatmap <- pheatmap(aa.norm,
         annotation_row = ano_df2, 
         cutree_cols = 3, cutree_rows = 5, border_color = NA,
         color=colorRampPalette(c("navy", "white", "#d50c00ff"))(50), annotation_colors = annot_colors)

#Save
ggsave(filename = "plots/pmap_aa.svg", plot = heatmap,width = 20, height = 32, units = "cm",dpi = 600)

Figure 3

5.6 ALL - PERMANOVA and PCoA

The heatmap’s normalized amino acid concentrations are used, with the same distances between samples.

5.6.1 PERMANOVA

# Calculate distances

ed <- vegdist(aa.num, method = "euclidian")

adonis2(ed ~ Day, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed ~ Day, data = ano_df2)
##          Df SumOfSqs      R2      F Pr(>F)   
## Day       2     5324 0.08926 4.4103  0.005 **
## Residual 90    54328 0.91074                 
## Total    92    59652 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed ~ Strain, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed ~ Strain, data = ano_df2)
##          Df SumOfSqs     R2      F Pr(>F)    
## Strain    4    29063 0.4872 20.902  0.001 ***
## Residual 88    30589 0.5128                  
## Total    92    59652 1.0000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed ~ Medium, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed ~ Medium, data = ano_df2)
##          Df SumOfSqs      R2      F Pr(>F)   
## Medium    1     4415 0.07401 7.2736  0.006 **
## Residual 91    55237 0.92599                 
## Total    92    59652 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed ~ Strain*Day*Medium, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed ~ Strain * Day * Medium, data = ano_df2)
##                   Df SumOfSqs      R2        F Pr(>F)    
## Strain             4    29063 0.48720 123.0467  0.001 ***
## Day                2     5072 0.08502  42.9452  0.001 ***
## Medium             1     4496 0.07536  76.1354  0.001 ***
## Strain:Day         7     3214 0.05388   7.7763  0.001 ***
## Strain:Medium      4    11093 0.18596  46.9646  0.001 ***
## Day:Medium         2     1298 0.02177  10.9943  0.001 ***
## Strain:Day:Medium  7     1579 0.02647   3.8194  0.001 ***
## Residual          65     3838 0.06434                    
## Total             92    59652 1.00000                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.6.1.1 VISUALIZE: PCoA plot

# load data 
load("R_objects/stat_test_data.RData")

# Create subset
dat.clean <- dat
dat.num <- dat.clean %>% select(all_of(ends_with("acid")))

Eigenvalues for axes

# Calculating Bray-Curtis PCoA with capscale
tmp2 <- capscale(aa.num ~ 1, comm = aa.num, distance = "euclidian", metaMDS = TRUE)
## Square root transformation
## Wisconsin double standardization
tmp2
## Call: capscale(formula = aa.num ~ 1, distance = "euclidian", comm =
## aa.num, metaMDSdist = TRUE)
## 
##               Inertia Rank
## Total           1.257     
## Unconstrained   1.257   18
## Inertia is squared Euclidean distance 
## Species scores projected from 'aa.num' 
## 
## Eigenvalues for unconstrained axes:
##   MDS1   MDS2   MDS3   MDS4   MDS5   MDS6   MDS7   MDS8 
## 0.6189 0.2642 0.1496 0.0648 0.0433 0.0377 0.0204 0.0165 
## (Showing 8 of 18 unconstrained eigenvalues)
## 
## metaMDSdist transformed data: wisconsin(sqrt(aa.num))

Prepare data for plotting

# Collect data for plotting
mds.samples <- data.frame(tmp2$CA$u)
mds.scfa <- data.frame(tmp2$CA$v)

# Prepare point zero and labels for arrows
mds.scfa$label1 <- row.names(mds.scfa)
mds.scfa$x <- 0
mds.scfa$y <- 0

# Bind with main data
dat.mds <- cbind(ano_df2, mds.samples)
plots <- list()
sets <- data.frame(Day = c(1,2,6))
for (i in 1:nrow(sets)) {
  tmp <- dat.mds[dat.mds$Day == sets$Day[i],]
  plots[[i]] <- ggplot() +
  geom_point(data = tmp, 
             mapping = aes(x = MDS1, y = MDS2, 
                           color = Strain, 
                           shape = Medium),
             size = 3, 
             alpha = 0.7) +
  theme_pubr(legend = "top") + 
  labs(x = "Axis 1", y = "Axis 2") +
  guides(fill = "none") +
  guides(size = "none") +
    scale_color_manual(values = c("A: Non-inoculated"="#E89242FF", 
                                  "B: MS22418 (PrtP-)"="#82491EFF", 
                                  "C: MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "D: MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "E: MS22427 (PrtP/MS22333)"="#FB6467FF"
                                  )
                       ) +
    coord_cartesian(xlim = c(min(mds.samples$MDS1)*1.1,
                              max(mds.samples$MDS1)*1.1),
                    ylim = c(min(mds.samples$MDS2)*1.1,
                             max(mds.samples$MDS2)*1.1))
  names(plots)[i] <- paste(sets$Medium[i], "day", sets$Day[i], sep = " ")
}
  plots[[4]] <- ggplot() +
  geom_segment(data = mds.scfa, mapping = aes(x=x, y=y, xend=0.35*MDS1, yend=0.35*MDS2),
               lineend = "butt",
               linejoin = "round",
               linewidth = 0.5,
               arrow = arrow(length = unit(0.3, 'cm'))) +
  geom_label_repel(data = mds.scfa,
             mapping = aes(x = 0.35*MDS1, y = 0.35*MDS2), #
             label = mds.scfa$label1,
             size = 4,
             min.segment.length = 0,
             segment.alpha = 0.8,
             box.padding = 0.3,
             force = 1) +
  theme_pubr(legend = "top") + 
  labs(x = "Axis 1", y = "Axis 2") +
  guides(fill = "none") +
    coord_cartesian(xlim = c(min(mds.samples$MDS1)*1.1, max(mds.samples$MDS1)*1.1), ylim = c(min(mds.samples$MDS2)*1.1,max(mds.samples$MDS2)*1.1))
  names(plots)[4] <- "AA weights"

p.out <- ggarrange(plotlist = plots, widths = 12, heights = 8,
          common.legend = TRUE, 
          legend = "bottom", 
          labels = names(plots)
          )
#Save plot
ggsave(filename = "plots/SCFA_PCoA.svg", plot = p.out, width = 30.48, height = 20.32, units = "cm",bg = "white", dpi = 600)

Figure 4

5.6.2 Proteolytic strains - PERMANOVA and PCOA plot

5.6.2.1 PERMANOVA

# Create subset to compare Wg2 and MS22333 samples 
sub.aa <- dat.wide %>%
  filter(Strain %in% c("Wg2", "MS22333")) 

# Extract metadata and aa table for subset
meta <- sub.aa %>% 
  ungroup() %>%
  select(`Sample no`:Sample) %>% 
  data.frame
row.names(meta) <- meta$Sample

# sub.aa.num defines the aa data in prercent
sub.aa.num <- sub.aa %>%
  ungroup() %>%
  select(His:Trp) %>%
  data.frame
row.names(sub.aa.num) <- meta$Sample

# Calculate distances

ed.sub <- vegdist(sub.aa.num, method = "euclidian")

adonis2(ed.sub ~ as.factor(Day), data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed.sub ~ as.factor(Day), data = meta)
##                Df SumOfSqs      R2      F Pr(>F)  
## as.factor(Day)  2   1595.3 0.11949 2.5106  0.072 .
## Residual       37  11755.4 0.88051                
## Total          39  13350.7 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed.sub ~ Strain, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed.sub ~ Strain, data = meta)
##          Df SumOfSqs      R2      F Pr(>F)
## Strain    1    257.4 0.01928 0.7471  0.401
## Residual 38  13093.3 0.98072              
## Total    39  13350.7 1.00000
adonis2(ed.sub ~ Protein, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed.sub ~ Protein, data = meta)
##          Df SumOfSqs    R2      F Pr(>F)    
## Protein   1  10306.8 0.772 128.67  0.001 ***
## Residual 38   3043.9 0.228                  
## Total    39  13350.7 1.000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed.sub ~ Protein*Day*Strain, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed.sub ~ Protein * Day * Strain, data = meta)
##                    Df SumOfSqs      R2         F Pr(>F)    
## Protein             1  10306.8 0.77200 2517.3348  0.001 ***
## Day                 2   1595.3 0.11949  194.8173  0.001 ***
## Strain              1    257.4 0.01928   62.8747  0.001 ***
## Protein:Day         2    879.2 0.06586  107.3707  0.001 ***
## Protein:Strain      1     55.4 0.00415   13.5394  0.001 ***
## Day:Strain          2     86.4 0.00648   10.5573  0.001 ***
## Protein:Day:Strain  2     55.5 0.00415    6.7738  0.001 ***
## Residual           28    114.6 0.00859                     
## Total              39  13350.7 1.00000                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed.sub ~ Protein:Day, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = ed.sub ~ Protein:Day, data = meta)
##             Df SumOfSqs      R2      F Pr(>F)    
## Protein:Day  5  12781.3 0.95735 152.63  0.001 ***
## Residual    34    569.4 0.04265                  
## Total       39  13350.7 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.6.2.2 VISUALIZE: PCoA plot

Eigenvalues for axes

# Calculating Bray-Curtis PCoA with capscale
tmp.sub <- capscale(sub.aa.num ~ 1, comm = sub.aa.num, distance = "euclidian", metaMDS = TRUE)
## Wisconsin double standardization
tmp.sub
## Call: capscale(formula = sub.aa.num ~ 1, distance = "euclidian", comm =
## sub.aa.num, metaMDSdist = TRUE)
## 
##               Inertia Rank
## Total          0.6945     
## Unconstrained  0.6945   18
## Inertia is squared Euclidean distance 
## Species scores projected from 'sub.aa.num' 
## 
## Eigenvalues for unconstrained axes:
##   MDS1   MDS2   MDS3   MDS4   MDS5   MDS6   MDS7   MDS8 
## 0.3510 0.1746 0.0584 0.0360 0.0279 0.0123 0.0107 0.0075 
## (Showing 8 of 18 unconstrained eigenvalues)
## 
## metaMDSdist transformed data: wisconsin(sub.aa.num)
# Collect data for plotting
mds.samples.sub <- data.frame(tmp.sub$CA$u)

# Bind with main data
dat.mds.sub <- cbind(ano_df2[ano_df2$Strain %in% c("D: MS22425 (PrtP/Wg2)", "E: MS22427 (PrtP/MS22333)"),], mds.samples.sub)

# Plot
p.sub <- ggplot() +
  geom_point(data = dat.mds.sub, 
             mapping = aes(x = MDS1, y = MDS2, 
                           color = Strain, 
                           shape = Medium),
             size = 3,
             alpha = 0.7) +
  theme_pubr(legend = "top") + 
  labs(x = "Axis 1", y = "Axis 2") +
  guides(fill = "none") +
  guides(size = "none") +
  facet_wrap("Day", ncol = 1) +
  scale_color_manual(values = c("D: MS22425 (PrtP/Wg2)"="#B7E4F9FF",
                                "E: MS22427 (PrtP/MS22333)"="#FB6467FF")
                     ) +
  coord_cartesian(xlim = c(min(mds.samples.sub$MDS1)*1.1,
                           max(mds.samples.sub$MDS1)*1.1),
                  ylim = c(min(mds.samples.sub$MDS2)*1.1,
                           max(mds.samples.sub$MDS2)*1.1))+
  theme(legend.position="bottom")

p.sub

5.7 ALL - Total free amino acid concentraiton (TFAA)

5.7.1 DATA

# load data 
load("R_objects/stat_test_data_AA.RData")

#Calculate TFAA for each sample

dat <- dat %>%
  group_by(Sample, Protein, Strain, Day)  %>%
  summarise(TFAA = sum(concentration))
## `summarise()` has grouped output by 'Sample', 'Protein', 'Strain'. You can
## override using the `.groups` argument.
dat.clean <- dat %>%
  filter(!is.na(TFAA))

5.7.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Strain",  "Protein", "Day")
OUTCOME <- "TFAA"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 28 x 7
##    Protein Strain  Day   variable     n  mean    sd
##    <chr>   <chr>   <fct> <fct>    <dbl> <dbl> <dbl>
##  1 Casein  MS22333 1     TFAA         3 1145. 19.8 
##  2 Casein  MS22333 2     TFAA         3 1292. 32.2 
##  3 Casein  MS22333 6     TFAA         4 1376. 79.7 
##  4 Potato  MS22333 1     TFAA         3  275.  7.73
##  5 Potato  MS22333 2     TFAA         3  479. 25.4 
##  6 Potato  MS22333 6     TFAA         4  451. 12.9 
##  7 Casein  No      1     TFAA         3  175. 10.5 
##  8 Casein  No      6     TFAA         4  172. 16.4 
##  9 Potato  No      1     TFAA         3  200. 12.1 
## 10 Potato  No      6     TFAA         3  201.  3.38
## # i 18 more rows
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

5.7.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 5 x 7
##   Protein Strain  Day   Sample  TFAA is.outlier is.extreme
##   <chr>   <chr>   <fct> <chr>  <dbl> <lgl>      <lgl>     
## 1 Potato  MS22333 6     EP6-1   432. TRUE       FALSE     
## 2 Casein  No      6     AC6-3   197. TRUE       FALSE     
## 3 Potato  SK11    6     CP6-4   249. TRUE       FALSE     
## 4 Casein  Wg2     6     DC6-4  1982. TRUE       FALSE     
## 5 Casein  pAK80   6     BC6-4   622. TRUE       FALSE

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals. The Shapiro-Wilk test is performed on all data groups.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Day, Protein, Strain) %>% 
  shapiro_test(vars = "TFAA") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 0 x 7
## # i 7 variables: Protein <chr>, Strain <chr>, Day <fct>, variable <chr>,
## #   statistic <dbl>, p <dbl>, p.adj <dbl>

Test homogneity of variance assumption

plot(model, 1)

#Levene_test
dat.clean %>% 
  group_by(Protein, Strain) %>% 
  levene_test(TFAA ~ Day) %>% 
  adjust_pvalue(method = "fdr")
## # A tibble: 10 x 7
##    Protein Strain    df1   df2 statistic     p p.adj
##    <chr>   <chr>   <int> <int>     <dbl> <dbl> <dbl>
##  1 Casein  MS22333     2     7    0.827  0.476 0.748
##  2 Casein  No          1     5    0.0669 0.806 0.806
##  3 Casein  SK11        2     7    0.406  0.681 0.757
##  4 Casein  Wg2         2     7    0.541  0.605 0.756
##  5 Casein  pAK80       2     7    1.22   0.350 0.748
##  6 Potato  MS22333     2     7    0.798  0.487 0.748
##  7 Potato  No          1     4    0.543  0.502 0.748
##  8 Potato  SK11        2     7    1.19   0.358 0.748
##  9 Potato  Wg2         2     7    0.711  0.524 0.748
## 10 Potato  pAK80       2     7    1.62   0.263 0.748

5.7.4 PERFORM TEST

Tests on the whole data set did not find normality, but data groups were normal distributed according to the Shapiro-Wilk test of normality. Therefore, ANOVA and t-test are used for the statistics of this data set.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain, Day) %>%
  anova_test(TFAA ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein 
## # A tibble: 14 x 10
##    Strain  Day   Effect    DFn   DFd        F           p `p<.05`   ges    p.adj
##    <chr>   <fct> <chr>   <dbl> <dbl>    <dbl>       <dbl> <chr>   <dbl>    <dbl>
##  1 MS22333 1     Protein     1     4 5005.    0.000000239 "*"     0.999  2.12e-6
##  2 MS22333 2     Protein     1     4 1181.    0.00000428  "*"     0.997  1.20e-5
##  3 MS22333 6     Protein     1     6  524.    0.000000454 "*"     0.989  2.12e-6
##  4 No      1     Protein     1     4    7.29  0.054       ""      0.646  6.87e-2
##  5 No      6     Protein     1     5    8.43  0.034       "*"     0.628  4.76e-2
##  6 SK11    1     Protein     1     4    2.44  0.193       ""      0.379  2.25e-1
##  7 SK11    2     Protein     1     4    1.65  0.268       ""      0.292  2.89e-1
##  8 SK11    6     Protein     1     6  361.    0.00000138  "*"     0.984  4.83e-6
##  9 Wg2     1     Protein     1     4 4141.    0.000000349 "*"     0.999  2.12e-6
## 10 Wg2     2     Protein     1     4   64.9   0.001       "*"     0.942  1.75e-3
## 11 Wg2     6     Protein     1     6   11.4   0.015       "*"     0.654  2.33e-2
## 12 pAK80   1     Protein     1     4  111.    0.000462    "*"     0.965  9.24e-4
## 13 pAK80   2     Protein     1     4  166.    0.000209    "*"     0.976  4.88e-4
## 14 pAK80   6     Protein     1     6    0.246 0.637       ""      0.039  6.37e-1
pwc.protein <- dat.clean %>%
  filter(Protein %in% c("Casein", "Potato")) %>%
  group_by(Strain, Day) %>%
   pairwise_t_test(TFAA ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc.protein %>% filter(p < 0.05)
## # A tibble: 10 x 11
##    Strain  Day   .y.   group1 group2    n1    n2           p p.signif      p.adj
##    <chr>   <fct> <chr> <chr>  <chr>  <int> <int>       <dbl> <chr>         <dbl>
##  1 MS22333 1     TFAA  Casein Potato     3     3 0.000000239 ****     0.00000212
##  2 MS22333 2     TFAA  Casein Potato     3     3 0.00000428  ****     0.0000120 
##  3 MS22333 6     TFAA  Casein Potato     4     4 0.000000454 ****     0.00000212
##  4 No      6     TFAA  Casein Potato     4     3 0.0337      *        0.0472    
##  5 SK11    6     TFAA  Casein Potato     4     4 0.00000138  ****     0.00000483
##  6 Wg2     1     TFAA  Casein Potato     3     3 0.000000349 ****     0.00000212
##  7 Wg2     2     TFAA  Casein Potato     3     3 0.00129     **       0.00226   
##  8 Wg2     6     TFAA  Casein Potato     4     4 0.015       *        0.0233    
##  9 pAK80   1     TFAA  Casein Potato     3     3 0.000462    ***      0.000924  
## 10 pAK80   2     TFAA  Casein Potato     3     3 0.000209    ***      0.000488  
## # i 1 more variable: p.adj.signif <chr>
# for each strain
res.strain <- dat.clean %>% 
  group_by(Day, Protein) %>%
  anova_test(TFAA ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain
## # A tibble: 6 x 10
##   Protein Day   Effect   DFn   DFd        F        p `p<.05`   ges    p.adj
##   <chr>   <fct> <chr>  <dbl> <dbl>    <dbl>    <dbl> <chr>   <dbl>    <dbl>
## 1 Casein  1     Strain     4    10 2945.    2.63e-15 "*"     0.999 1.58e-14
## 2 Potato  1     Strain     4    10   34.7   7.78e- 6 "*"     0.933 9.34e- 6
## 3 Casein  2     Strain     3     8  233.    3.99e- 8 "*"     0.989 7.98e- 8
## 4 Potato  2     Strain     3     8    0.926 4.71e- 1 ""      0.258 4.71e- 1
## 5 Casein  6     Strain     4    15   23.9   2.35e- 6 "*"     0.864 3.52e- 6
## 6 Potato  6     Strain     4    14  355.    6.67e-14 "*"     0.99  2.00e-13
pwc.strain <- dat.clean %>%
  filter(Protein %in% c("Casein", "Potato")) %>%
  group_by(Day, Protein) %>%
  pairwise_t_test(TFAA ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc.strain %>% filter(p.adj < 0.05)
## # A tibble: 33 x 11
##    Protein Day   .y.   group1  group2    n1    n2        p p.signif    p.adj
##    <chr>   <fct> <chr> <chr>   <chr>  <int> <int>    <dbl> <chr>       <dbl>
##  1 Casein  1     TFAA  MS22333 No         3     3 1.49e-15 ****     2.58e-14
##  2 Casein  1     TFAA  MS22333 pAK80      3     3 1.22e-15 ****     2.58e-14
##  3 Casein  1     TFAA  MS22333 SK11       3     3 1.48e-15 ****     2.58e-14
##  4 Casein  1     TFAA  MS22333 Wg2        3     3 6.82e-12 ****     3.22e-11
##  5 Casein  1     TFAA  No      Wg2        3     3 4.02e-13 ****     2.61e-12
##  6 Casein  1     TFAA  pAK80   Wg2        3     3 2.84e-13 ****     2.46e-12
##  7 Casein  1     TFAA  SK11    Wg2        3     3 3.96e-13 ****     2.61e-12
##  8 Potato  1     TFAA  MS22333 No         3     3 8.01e- 6 ****     1.81e- 5
##  9 Potato  1     TFAA  MS22333 pAK80      3     3 6.58e- 7 ****     1.90e- 6
## 10 Potato  1     TFAA  No      pAK80      3     3 2.45e- 2 *        3.86e- 2
## # i 23 more rows
## # i 1 more variable: p.adj.signif <chr>

5.7.5 VISUALIZE

#Name order in plot
dat.clean$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, 
                           levels = c("No", "pAK80", "SK11", "Wg2", "MS22333"), 
                           labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Plot - boxplot
bxp.TFAA <- dat.clean %>%
  filter(., Protein  %in% c("Casein", "Potato")) %>%
  filter (., Day %in% c(1, 2, 6)) %>%
  ggboxplot(.,
            x = "Day",
            y = "TFAA",
            color = "Strain",
            facet.by = "Protein")+
  labs(x="Time [days]", y="TFAA [uM]")+
  scale_y_continuous()+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM"))
             )+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  theme_light()+
  theme(legend.position="bottom")
#Save plot

ggsave(filename = "plots/bxp.TFAA.svg", plot = bxp.TFAA, width = 13, height = 10, units =c("cm"), dpi = 600)

Figure 2

5.8 Sum of Phe for all days - Strain and Protein- Concentration

5.8.1 DATA

# load data 
load("R_objects/stat_test_data_AA.RData")

# Create subset
dat.clean <- dat %>% filter(aa %in% c("Phe"), !is.na(concentration))

5.8.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 10 x 6
##    Protein Strain  variable          n  mean     sd
##    <chr>   <chr>   <fct>         <dbl> <dbl>  <dbl>
##  1 Casein  MS22333 concentration    10 89.4   8.72 
##  2 Casein  No      concentration     7  3.32  0.194
##  3 Casein  SK11    concentration    10  2.28  1.22 
##  4 Casein  Wg2     concentration    10 67.7  34.3  
##  5 Casein  pAK80   concentration    10  5.83  8.78 
##  6 Potato  MS22333 concentration    10 50.9  20.1  
##  7 Potato  No      concentration     6  4.75  0.304
##  8 Potato  SK11    concentration    10 19.7  17.4  
##  9 Potato  Wg2     concentration    10 42.0  19.8  
## 10 Potato  pAK80   concentration    10  9.22  4.31
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

5.8.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 7 x 12
##   Protein Strain Sample_ID   Sample `First Injection` `Sample no` Day   conc 
##   <chr>   <chr>  <chr>       <chr>  <chr>             <chr>       <fct> <chr>
## 1 Casein  Wg2    Wg2_C4-D6   DC6-4  Lise16            C4          6     uM   
## 2 Casein  pAK80  pAK80_C1-D6 BC6-1  Lise5             C1          6     uM   
## 3 Casein  pAK80  pAK80_C2-D6 BC6-2  Lise6             C2          6     uM   
## 4 Casein  pAK80  pAK80_C3-D6 BC6-3  Lise7             C3          6     uM   
## 5 Casein  pAK80  pAK80_C4-D6 BC6-4  Lise8             C4          6     uM   
## 6 Potato  No     No-P2-D1    AP1-2  66Lise            P2          1     uM   
## 7 Potato  SK11   Sk11-P3-D2  CP2-3  85Lise            P3          2     uM   
## # i 4 more variables: aa <chr>, concentration <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>
#Test coef for outlier
dat.clean %>%
  filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.8)
## [1] FALSE FALSE FALSE  TRUE
dat.clean %>%
  filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE FALSE
dat.clean %>%
  filter(Strain == "SK11" & Protein == "Potato" & Day == 2) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE
#Remove extreme outliers with coef > 2.59 
extreme_outliers <- c("Wg2_C4-D6")

dat.clean <- subset(dat.clean, !Sample %in% extreme_outliers)
# Create plot without extreme outliers
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 residuals(model)     0.756 3.87e-11
# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Protein, Strain) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 7 x 6
##   Protein Strain  variable      statistic          p     p.adj
##   <chr>   <chr>   <chr>             <dbl>      <dbl>     <dbl>
## 1 Casein  No      concentration     0.755 0.0143     0.0205   
## 2 Casein  SK11    concentration     0.775 0.00720    0.0133   
## 3 Casein  Wg2     concentration     0.549 0.0000137  0.0000687
## 4 Casein  pAK80   concentration     0.456 0.00000110 0.0000110
## 5 Potato  MS22333 concentration     0.709 0.00112    0.00280  
## 6 Potato  No      concentration     0.711 0.00798    0.0133   
## 7 Potato  SK11    concentration     0.600 0.0000556  0.000185

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic      p
##   <int> <int>     <dbl>  <dbl>
## 1     9    83      1.99 0.0509
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

5.8.4 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 5 x 8
##   Strain  .y.               n statistic    df        p method            p.adj
##   <chr>   <chr>         <int>     <dbl> <int>    <dbl> <chr>             <dbl>
## 1 MS22333 concentration    20     14.3      1 0.000157 Kruskal-Wallis 0.000392
## 2 No      concentration    13      9        1 0.0027   Kruskal-Wallis 0.004   
## 3 SK11    concentration    20     14.3      1 0.000157 Kruskal-Wallis 0.000392
## 4 Wg2     concentration    20      4.17     1 0.0413   Kruskal-Wallis 0.0413  
## 5 pAK80   concentration    20      8.69     1 0.0032   Kruskal-Wallis 0.004
pwc <- dat.clean %>% 
  group_by(Strain) %>%
  pairwise_wilcox_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 5 x 10
##   Strain  .y.   group1 group2    n1    n2 statistic       p   p.adj p.adj.signif
##   <chr>   <chr> <chr>  <chr>  <int> <int>     <dbl>   <dbl>   <dbl> <chr>       
## 1 MS22333 conc~ Casein Potato    10    10       100 1.08e-5 2.70e-5 ****        
## 2 No      conc~ Casein Potato     7     6         0 1   e-3 1.67e-3 **          
## 3 SK11    conc~ Casein Potato    10    10         0 1.08e-5 2.70e-5 ****        
## 4 Wg2     conc~ Casein Potato    10    10        77 4.3 e-2 4.3 e-2 *           
## 5 pAK80   conc~ Casein Potato    10    10        11 2   e-3 2.5 e-3 **
# for each strain
res.strain <- dat.clean %>% 
  group_by(Protein) %>%
  kruskal_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 2 x 8
##   Protein .y.               n statistic    df           p method           p.adj
##   <chr>   <chr>         <int>     <dbl> <int>       <dbl> <chr>            <dbl>
## 1 Casein  concentration    47      37.2     4 0.000000166 Kruskal-Wallis 3.32e-7
## 2 Potato  concentration    46      34.5     4 0.00000058  Kruskal-Wallis 5.80e-7
pwc <- dat.clean %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 16 x 10
##    Protein .y.           group1  group2    n1    n2 statistic         p    p.adj
##    <chr>   <chr>         <chr>   <chr>  <int> <int>     <dbl>     <dbl>    <dbl>
##  1 Casein  concentration MS22333 No        10     7        70 0.000103  0.000257
##  2 Casein  concentration MS22333 pAK80     10    10       100 0.0000108 0.000036
##  3 Casein  concentration MS22333 SK11      10    10       100 0.0000108 0.000036
##  4 Casein  concentration MS22333 Wg2       10    10        90 0.002     0.00308 
##  5 Casein  concentration No      Wg2        7    10         0 0.000103  0.000257
##  6 Casein  concentration pAK80   Wg2       10    10         0 0.0000108 0.000036
##  7 Casein  concentration SK11    Wg2       10    10         0 0.0000108 0.000036
##  8 Potato  concentration MS22333 No        10     6        60 0.00025   0.000455
##  9 Potato  concentration MS22333 pAK80     10    10       100 0.0000108 0.000036
## 10 Potato  concentration MS22333 SK11      10    10        91 0.001     0.00167 
## 11 Potato  concentration No      pAK80      6    10         6 0.007     0.00933 
## 12 Potato  concentration No      SK11       6    10         0 0.00025   0.000455
## 13 Potato  concentration No      Wg2        6    10         0 0.00025   0.000455
## 14 Potato  concentration pAK80   SK11      10    10        14 0.005     0.00714 
## 15 Potato  concentration pAK80   Wg2       10    10         0 0.0000108 0.000036
## 16 Potato  concentration SK11    Wg2       10    10        20 0.023     0.0288  
## # i 1 more variable: p.adj.signif <chr>

5.9 Sum of Thr for all days - Strain and Protein- Concentration

5.9.1 DATA

# load data 
load("R_objects/stat_test_data_AA.RData")

# Create subset
dat.clean <- dat %>% filter(aa %in% c("Thr"), !is.na(concentration))

5.9.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 10 x 6
##    Protein Strain  variable          n  mean    sd
##    <chr>   <chr>   <fct>         <dbl> <dbl> <dbl>
##  1 Casein  MS22333 concentration    10 6.58  1.86 
##  2 Casein  No      concentration     7 0.614 0.68 
##  3 Casein  SK11    concentration    10 0.586 0.239
##  4 Casein  Wg2     concentration    10 6.74  4.33 
##  5 Casein  pAK80   concentration    10 0.765 0.445
##  6 Potato  MS22333 concentration    10 5.05  2.59 
##  7 Potato  No      concentration     6 2.79  0.347
##  8 Potato  SK11    concentration    10 1.35  2.40 
##  9 Potato  Wg2     concentration    10 2.37  1.56 
## 10 Potato  pAK80   concentration    10 0.743 0.176
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

5.9.3 PRELIMINARY TEST

Identify outliers 

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 6 x 12
##   Protein Strain Sample_ID   Sample `First Injection` `Sample no` Day   conc 
##   <chr>   <chr>  <chr>       <chr>  <chr>             <chr>       <fct> <chr>
## 1 Casein  No     No-C3-D6    AC6-3  39Lise            C3          6     uM   
## 2 Casein  SK11   SK11_C1-D6  CC6-1  Lise21            C1          6     uM   
## 3 Casein  SK11   SK11_C2-D6  CC6-2  Lise22            C2          6     uM   
## 4 Casein  Wg2    Wg2_C4-D6   DC6-4  Lise16            C4          6     uM   
## 5 Casein  pAK80  pAK80_C3-D6 BC6-3  Lise7             C3          6     uM   
## 6 Potato  SK11   Sk11-P3-D2  CP2-3  85Lise            P3          2     uM   
## # i 4 more variables: aa <chr>, concentration <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>
#Test coef for outliers (No-C3-D6 Wg2_C4-D6 pAK80_C3-D6 Sk11-P3-D2)
dat.clean %>%
  filter(Strain == "No" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE  TRUE FALSE
dat.clean %>%
  filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE  TRUE
dat.clean %>%
  filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE FALSE
dat.clean %>%
  filter(Strain == "SK11" & Protein == "Potato" & Day == 2) %>%
  pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE
#Remove extreme outliers with coef > 2.59 
extreme_outliers <- c("Wg2_C4-D6")

dat.clean <- subset(dat.clean, !Sample %in% extreme_outliers)
# Create plot without extreme outliers
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 residuals(model)     0.752 3.20e-11
# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Protein, Strain) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 4 x 6
##   Protein Strain variable      statistic           p      p.adj
##   <chr>   <chr>  <chr>             <dbl>       <dbl>      <dbl>
## 1 Casein  No     concentration     0.538 0.0000483   0.000242  
## 2 Casein  Wg2    concentration     0.683 0.000543    0.00181   
## 3 Casein  pAK80  concentration     0.703 0.000967    0.00242   
## 4 Potato  SK11   concentration     0.442 0.000000770 0.00000770

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic       p
##   <int> <int>     <dbl>   <dbl>
## 1     9    83      3.10 0.00292
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

5.9.4 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 2 x 8
##   Strain .y.               n statistic    df       p method           p.adj
##   <chr>  <chr>         <int>     <dbl> <int>   <dbl> <chr>            <dbl>
## 1 No     concentration    13       9       1 0.0027  Kruskal-Wallis 0.00675
## 2 Wg2    concentration    20      10.6     1 0.00115 Kruskal-Wallis 0.00575
pwc <- dat.clean %>% 
  group_by(Strain) %>%
  pairwise_wilcox_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 2 x 10
##   Strain .y.    group1 group2    n1    n2 statistic       p   p.adj p.adj.signif
##   <chr>  <chr>  <chr>  <chr>  <int> <int>     <dbl>   <dbl>   <dbl> <chr>       
## 1 No     conce~ Casein Potato     7     6         0 1   e-3 0.0025  **          
## 2 Wg2    conce~ Casein Potato    10    10        93 4.87e-4 0.00244 ***
# for each strain
res.strain <- dat.clean %>% 
  group_by(Protein) %>%
  kruskal_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 2 x 8
##   Protein .y.               n statistic    df           p method           p.adj
##   <chr>   <chr>         <int>     <dbl> <int>       <dbl> <chr>            <dbl>
## 1 Casein  concentration    47      35.6     4 0.000000353 Kruskal-Wallis 7.06e-7
## 2 Potato  concentration    46      27.6     4 0.0000148   Kruskal-Wallis 1.48e-5
pwc <- dat.clean %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 14 x 10
##    Protein .y.           group1  group2    n1    n2 statistic         p    p.adj
##    <chr>   <chr>         <chr>   <chr>  <int> <int>     <dbl>     <dbl>    <dbl>
##  1 Casein  concentration MS22333 No        10     7        70 0.000103   2.94e-4
##  2 Casein  concentration MS22333 pAK80     10    10       100 0.0000108  4.32e-5
##  3 Casein  concentration MS22333 SK11      10    10       100 0.0000108  4.32e-5
##  4 Casein  concentration No      pAK80      7    10        14 0.043      6.14e-2
##  5 Casein  concentration No      Wg2        7    10         0 0.000103   2.94e-4
##  6 Casein  concentration pAK80   Wg2       10    10         0 0.0000108  4.32e-5
##  7 Casein  concentration SK11    Wg2       10    10         0 0.0000108  4.32e-5
##  8 Potato  concentration MS22333 pAK80     10    10       100 0.0000108  4.32e-5
##  9 Potato  concentration MS22333 SK11      10    10        90 0.002      4   e-3
## 10 Potato  concentration MS22333 Wg2       10    10        85 0.007      1.08e-2
## 11 Potato  concentration No      pAK80      6    10        60 0.00025    6.25e-4
## 12 Potato  concentration No      SK11       6    10        54 0.007      1.08e-2
## 13 Potato  concentration pAK80   Wg2       10    10         6 0.000325   7.22e-4
## 14 Potato  concentration SK11    Wg2       10    10        13 0.004      7.27e-3
## # i 1 more variable: p.adj.signif <chr>

5.10 Phe & Day6 - Strain and Protein- Concentration

5.10.1 DATA

# load data 
load("R_objects/stat_test_data_AA.RData")

# Create subset
dat.clean <- dat %>% filter(aa %in% c("Phe"), !is.na(concentration) & Day == 6)

5.10.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Strain")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 10 x 6
##    Protein Strain  variable          n   mean     sd
##    <chr>   <chr>   <fct>         <dbl>  <dbl>  <dbl>
##  1 Casein  MS22333 concentration     4 97.5    7.15 
##  2 Casein  No      concentration     4  3.28   0.162
##  3 Casein  SK11    concentration     4  0.907  0.519
##  4 Casein  Wg2     concentration     4 87.9   50.2  
##  5 Casein  pAK80   concentration     4  9.68  14.1  
##  6 Potato  MS22333 concentration     4 63.9    2.89 
##  7 Potato  No      concentration     3  4.64   0.113
##  8 Potato  SK11    concentration     4 15.8    2.57 
##  9 Potato  Wg2     concentration     4 61.4    3.42 
## 10 Potato  pAK80   concentration     4  5.67   1.66
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

5.10.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 4 x 12
##   Protein Strain  Sample_ID   Sample `First Injection` `Sample no` Day   conc 
##   <chr>   <chr>   <chr>       <chr>  <chr>             <chr>       <fct> <chr>
## 1 Casein  MS22333 33-C3-D6    EC6-3  31Lise            C3          6     uM   
## 2 Casein  No      No-C3-D6    AC6-3  39Lise            C3          6     uM   
## 3 Casein  Wg2     Wg2_C4-D6   DC6-4  Lise16            C4          6     uM   
## 4 Casein  pAK80   pAK80_C4-D6 BC6-4  Lise8             C4          6     uM   
## # i 4 more variables: aa <chr>, concentration <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic       p.value
##   <chr>                <dbl>         <dbl>
## 1 residuals(model)     0.582 0.00000000239
# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Protein, Strain) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 1 x 6
##   Protein Strain variable      statistic       p  p.adj
##   <chr>   <chr>  <chr>             <dbl>   <dbl>  <dbl>
## 1 Casein  Wg2    concentration     0.644 0.00206 0.0206

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     9    29     0.872 0.560
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

5.10.4 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Strain) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 3 x 8
##   Strain  .y.               n statistic    df      p method          p.adj
##   <chr>   <chr>         <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 MS22333 concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0522
## 2 No      concentration     7      4.5      1 0.0339 Kruskal-Wallis 0.0565
## 3 SK11    concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0522
pwc <- dat.clean %>% 
  group_by(Strain) %>%
  pairwise_wilcox_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 2 x 10
##   Strain  .y.      group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##   <chr>   <chr>    <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
## 1 MS22333 concent~ Casein Potato     4     4        16 0.029 0.0725 *           
## 2 SK11    concent~ Casein Potato     4     4         0 0.029 0.0725 *
# for each strain
res.strain <- dat.clean %>% 
  group_by(Protein) %>%
  kruskal_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 2 x 8
##   Protein .y.               n statistic    df       p method           p.adj
##   <chr>   <chr>         <int>     <dbl> <int>   <dbl> <chr>            <dbl>
## 1 Casein  concentration    20      16.4     4 0.00258 Kruskal-Wallis 0.00277
## 2 Potato  concentration    19      16.2     4 0.00277 Kruskal-Wallis 0.00277
pwc <- dat.clean %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(concentration ~ Strain) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 12 x 10
##    Protein .y.     group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##    <chr>   <chr>   <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
##  1 Casein  concen~ MS223~ No         4     4        16 0.029 0.0483 ns          
##  2 Casein  concen~ MS223~ pAK80      4     4        16 0.029 0.0483 ns          
##  3 Casein  concen~ MS223~ SK11       4     4        16 0.029 0.0483 ns          
##  4 Casein  concen~ No     SK11       4     4        16 0.029 0.0483 ns          
##  5 Casein  concen~ No     Wg2        4     4         0 0.029 0.0483 ns          
##  6 Casein  concen~ pAK80  Wg2        4     4         0 0.029 0.0483 ns          
##  7 Casein  concen~ SK11   Wg2        4     4         0 0.029 0.0483 ns          
##  8 Potato  concen~ MS223~ pAK80      4     4        16 0.029 0.0483 ns          
##  9 Potato  concen~ MS223~ SK11       4     4        16 0.029 0.0483 ns          
## 10 Potato  concen~ pAK80  SK11       4     4         0 0.029 0.0483 ns          
## 11 Potato  concen~ pAK80  Wg2        4     4         0 0.029 0.0483 ns          
## 12 Potato  concen~ SK11   Wg2        4     4         0 0.029 0.0483 ns

6 DATA SET: Volatile organic compounds

6.1 DATA

LOAD

dat <- read_excel("./data/Data_collection.xlsx", sheet="VOC", na = "NA")

LOOK

# Take a glimpse
glimpse(dat)
## Rows: 42
## Columns: 19
## $ Sample                                            <chr> "Blanc", "pAK80-P1-D~
## $ Protease                                          <chr> NA, "pAK80", "pAK80"~
## $ Protein                                           <chr> NA, "Potato", "Potat~
## $ `First Injection`                                 <chr> "blanc", "PAK80 P1",~
## $ `MS Quantitation Peak`                            <chr> "MS Quantitation Pea~
## $ Acetaldehyde                                      <dbl> 1192.328, 7963.007, ~
## $ Acetone                                           <dbl> 139.560, 5813.107, 6~
## $ Ethanol                                           <dbl> 33.586, 48647.408, 4~
## $ `Pyrimidine-2,4(1H,3H)-dione, 5-amino-6-nitroso-` <dbl> 1535.197, 22074.072,~
## $ Hexanal                                           <dbl> 46.831, 392.587, 294~
## $ `Butane, 2-nitro-`                                <dbl> 0.000, 4532.694, 422~
## $ `1-Heptyn-6-one`                                  <dbl> 44.339, 3133.171, 28~
## $ `2-Heptanone, 4-methyl-`                          <dbl> 49.768, 5853.148, 60~
## $ `Benzene, 1,3-bis(1,1-dimethylethyl)-`            <dbl> 0.000, 7768.256, 797~
## $ `2-Decenal, (Z)-`                                 <dbl> 44.544, 3821.739, 37~
## $ Benzaldehyde                                      <dbl> 19.609, 49008.878, 4~
## $ `Benzaldehyde, 3,5-dimethyl-`                     <dbl> 41.008, 292.719, 264~
## $ Dodecanal                                         <dbl> 68.303, 55.985, 49.4~
## $ `Phenol, 3,5-bis(1,1-dimethylethyl)-`             <dbl> 15.394, 27498.312, 2~
# Make a explorative summary 
skimr::skim(dat)
Data summary
Name dat
Number of rows 42
Number of columns 19
_______________________
Column type frequency:
character 5
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1.00 5 11 0 41 0
Protease 2 0.95 2 5 0 5 0
Protein 2 0.95 6 6 0 2 0
First Injection 0 1.00 4 9 0 39 0
MS Quantitation Peak 0 1.00 20 20 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Acetaldehyde 0 1 5806.49 3793.70 875.40 1804.22 6193.44 9202.37 11517.26 <U+2587><U+2581><U+2582><U+2583><U+2585>
Acetone 0 1 2554.57 1994.45 82.69 940.60 1938.98 3980.78 7033.34 <U+2587><U+2583><U+2582><U+2582><U+2582>
Ethanol 0 1 44058.68 10287.32 33.59 43942.42 46138.78 48479.88 50492.31 <U+2581><U+2581><U+2581><U+2581><U+2587>
Pyrimidine-2,4(1H,3H)-dione, 5-amino-6-nitroso- 0 1 24576.67 10220.58 437.84 18565.65 25421.17 33446.90 36567.13 <U+2581><U+2582><U+2582><U+2583><U+2587>
Hexanal 0 1 564.65 1151.96 15.13 76.80 153.09 388.92 4369.41 <U+2587><U+2581><U+2581><U+2581><U+2581>
Butane, 2-nitro- 0 1 4023.55 947.46 0.00 4011.18 4154.18 4380.86 4973.10 <U+2581><U+2581><U+2581><U+2581><U+2587>
1-Heptyn-6-one 0 1 10583.28 4996.73 44.34 8080.94 11324.10 15299.75 16982.06 <U+2583><U+2582><U+2583><U+2586><U+2587>
2-Heptanone, 4-methyl- 0 1 5886.08 1505.63 49.77 5565.87 6036.96 6659.30 7868.71 <U+2581><U+2581><U+2581><U+2587><U+2585>
Benzene, 1,3-bis(1,1-dimethylethyl)- 0 1 7400.38 2070.80 0.00 6563.71 7782.64 9009.80 9940.67 <U+2581><U+2581><U+2581><U+2587><U+2587>
2-Decenal, (Z)- 0 1 3630.04 890.75 26.88 3505.29 3742.92 4046.71 4788.62 <U+2581><U+2581><U+2581><U+2587><U+2586>
Benzaldehyde 0 1 62949.62 102111.00 19.61 4096.76 22991.31 61164.68 380542.86 <U+2587><U+2581><U+2581><U+2581><U+2581>
Benzaldehyde, 3,5-dimethyl- 0 1 2941.93 2647.16 41.01 302.84 2168.34 5696.22 7707.58 <U+2587><U+2582><U+2581><U+2583><U+2583>
Dodecanal 0 1 1680.07 2568.86 24.72 56.97 116.02 2614.14 8272.26 <U+2587><U+2581><U+2581><U+2581><U+2581>
Phenol, 3,5-bis(1,1-dimethylethyl)- 0 1 16717.16 9217.09 15.39 9063.44 14473.79 24839.21 29888.71 <U+2583><U+2585><U+2583><U+2582><U+2587>

CLEAN

# Remove blanc samples
dat <- dat %>%
  filter(Sample != "Blanc")

# reorganisation of data
dat <- dat %>%
  pivot_longer(.,cols=6:19, names_to="compound", values_to = "concentration")

# Look at cleaned data
skimr::skim(dat)
Data summary
Name dat
Number of rows 560
Number of columns 7
_______________________
Column type frequency:
character 6
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1 8 11 0 40 0
Protease 0 1 2 5 0 5 0
Protein 0 1 6 6 0 2 0
First Injection 0 1 4 9 0 38 0
MS Quantitation Peak 0 1 20 20 0 1 0
compound 0 1 7 47 0 14 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
concentration 0 1 14492.58 33347.63 15.13 3162.34 5867.55 12984.02 380542.9 <U+2587><U+2581><U+2581><U+2581><U+2581>

SAVE

# Save cleaned data
save(dat, file = "R_objects/stat_test_data_VOC.RData")

# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())

6.2 TEST INDEPENDENCE

6.2.1 CATEGORICAL VARIABLES

Chi squared test of independence

# load data 
load("R_objects/stat_test_data_VOC.RData")

# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
dat <- dat %>% mutate(across(where(is_character), as_factor))


# Create vector of categorical variables with more than two categories and less than half of the number of samples.
cat_var <- dat %>% 
  select_if(is.factor) %>% 
  select(where(~n_distinct(.) > 1)) %>% 
  select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
  colnames()

length(cat_var) # If less than 2 variables the following code should not be run
## [1] 5
# Test the variables pairwise 
cat_var %>% 
  combn(2) %>% 
  t() %>% 
  as_tibble() %>% 
  rowwise %>% 
  mutate(chisq_test = list(
    table(dat[[V1]], dat[[V2]]) %>% chisq.test()
    ),
    chisq_pval = chisq_test$p.value
    )
## Warning: There were 5 warnings in `mutate()`.
## The first warning was:
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
##   chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## i Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 4 remaining warnings.
## # A tibble: 10 x 4
## # Rowwise: 
##    V1              V2              chisq_test chisq_pval
##    <chr>           <chr>           <list>          <dbl>
##  1 Sample          Protease        <htest>      0       
##  2 Sample          Protein         <htest>      1.81e-93
##  3 Sample          First Injection <htest>      0       
##  4 Sample          compound        <htest>      1   e+ 0
##  5 Protease        Protein         <htest>      1   e+ 0
##  6 Protease        First Injection <htest>      0       
##  7 Protease        compound        <htest>      1   e+ 0
##  8 Protein         First Injection <htest>      2.74e-83
##  9 Protein         compound        <htest>      1   e+ 0
## 10 First Injection compound        <htest>      1   e+ 0

6.2.2 QUANTITATIVE VARIABLES

Spearman’s test without first testing the normality of the data

# create vector with the relevant variables 
CON.VARS <- dat %>% select(where(is.numeric)) %>% colnames()

CON.VARS
## [1] "concentration"

6.3 Acetaldehyde - Protease and Protein - Concentration

6.3.1 DATA

# load data 
load("R_objects/stat_test_data_VOC.RData")

# Create subset
dat.clean <- dat %>% filter(compound %in% c("Acetaldehyde"))

6.3.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Protease")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 10 x 6
##    Protease Protein variable          n   mean    sd
##    <chr>    <chr>   <fct>         <dbl>  <dbl> <dbl>
##  1 33       Casein  concentration     4  9982.  896.
##  2 No       Casein  concentration     4  1373.  337.
##  3 SK11     Casein  concentration     4  2421.  212.
##  4 Wg2      Casein  concentration     4  9696. 1182.
##  5 pAK80    Casein  concentration     4  2819. 1809.
##  6 33       Potato  concentration     4  6675. 1685.
##  7 No       Potato  concentration     4  1469.  628.
##  8 SK11     Potato  concentration     4 10803.  570.
##  9 Wg2      Potato  concentration     4  6684. 2403.
## 10 pAK80    Potato  concentration     4  8344. 1728.
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

6.3.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 9
##   Protease Protein Sample     `First Injection` `MS Quantitation Peak` compound 
##   <chr>    <chr>   <chr>      <chr>             <chr>                  <chr>    
## 1 No       Casein  No-C3-D6   NO C3             MS Quantitation Peak   Acetalde~
## 2 SK11     Casein  SK11-C2-D6 SK11 C2           MS Quantitation Peak   Acetalde~
## 3 No       Potato  No-P2-D6   NO P2             MS Quantitation Peak   Acetalde~
## # i 3 more variables: concentration <dbl>, is.outlier <lgl>, is.extreme <lgl>

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic p.value
##   <chr>                <dbl>   <dbl>
## 1 residuals(model)     0.969   0.342
# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Protein, Protease) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 1 x 6
##   Protease Protein variable      statistic       p  p.adj
##   <chr>    <chr>   <chr>             <dbl>   <dbl>  <dbl>
## 1 No       Potato  concentration     0.652 0.00275 0.0275

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic      p
##   <int> <int>     <dbl>  <dbl>
## 1     9    30      1.86 0.0975
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

6.3.4 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Protease) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 4 x 8
##   Protease .y.               n statistic    df      p method          p.adj
##   <chr>    <chr>         <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 33       concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0348
## 2 SK11     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0348
## 3 Wg2      concentration     8      4.08     1 0.0433 Kruskal-Wallis 0.0541
## 4 pAK80    concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0348
pwc <- dat.clean %>% 
  group_by(Protease) %>%
  pairwise_wilcox_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 3 x 10
##   Protease .y.     group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##   <chr>    <chr>   <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
## 1 33       concen~ Casein Potato     4     4        16 0.029 0.0483 *           
## 2 SK11     concen~ Casein Potato     4     4         0 0.029 0.0483 *           
## 3 pAK80    concen~ Casein Potato     4     4         0 0.029 0.0483 *
# for each strain
res.strain <- dat.clean %>% 
  group_by(Protein) %>%
  kruskal_test(concentration ~ Protease) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 2 x 8
##   Protein .y.               n statistic    df       p method           p.adj
##   <chr>   <chr>         <int>     <dbl> <int>   <dbl> <chr>            <dbl>
## 1 Casein  concentration    20      15.6     4 0.00358 Kruskal-Wallis 0.00407
## 2 Potato  concentration    20      15.3     4 0.00407 Kruskal-Wallis 0.00407
pwc <- dat.clean %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(concentration ~ Protease) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p.adj < 0.05)
## # A tibble: 14 x 10
##    Protein .y.     group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##    <chr>   <chr>   <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
##  1 Casein  concen~ 33     No         4     4        16 0.029 0.0414 ns          
##  2 Casein  concen~ 33     pAK80      4     4        16 0.029 0.0414 ns          
##  3 Casein  concen~ 33     SK11       4     4        16 0.029 0.0414 ns          
##  4 Casein  concen~ No     SK11       4     4         0 0.029 0.0414 ns          
##  5 Casein  concen~ No     Wg2        4     4         0 0.029 0.0414 ns          
##  6 Casein  concen~ pAK80  Wg2        4     4         0 0.029 0.0414 ns          
##  7 Casein  concen~ SK11   Wg2        4     4         0 0.029 0.0414 ns          
##  8 Potato  concen~ 33     No         4     4        16 0.029 0.0414 ns          
##  9 Potato  concen~ 33     SK11       4     4         0 0.029 0.0414 ns          
## 10 Potato  concen~ No     pAK80      4     4         0 0.029 0.0414 ns          
## 11 Potato  concen~ No     SK11       4     4         0 0.029 0.0414 ns          
## 12 Potato  concen~ No     Wg2        4     4         0 0.029 0.0414 ns          
## 13 Potato  concen~ pAK80  SK11       4     4         0 0.029 0.0414 ns          
## 14 Potato  concen~ SK11   Wg2        4     4        16 0.029 0.0414 ns

6.3.5 VISUALIZE

#Name order in plot
dat.clean$Protease <- factor(dat.clean$Protease, 
                             levels = c("No", "pAK80", "SK11", "Wg2", "33"),
                             labels =c ("A: Non-inoculated", 
                                      "B: MS22418 (PrtP-)", 
                                      "C: MS22421 (PrtP/SK11)", 
                                      "D: MS22425 (PrtP/Wg2)", 
                                      "E: MS22427 (PrtP/MS22333)"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Plot - boxplot
bxp.Acetaldehyde <- dat.clean %>%
  ggboxplot(.,
            x = "Protease",
            y = "concentration",
            color = "Protease",
            facet.by = "Protein")+
  labs(, x="Strain", y="Relative amount")+
  scale_y_continuous()+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM"))
             )+
scale_color_manual(values = c("A: Non-inoculated"="#E89242FF", 
                                  "B: MS22418 (PrtP-)"="#82491EFF", 
                                  "C: MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "D: MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "E: MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  scale_x_discrete(labels=c("A", "B", "C", "D", "E")
                   )+
  theme_light()

bxp.Acetaldehyde

save(bxp.Acetaldehyde, file = "R_objects/Acetaldehyde.RData")

6.4 Benzaldehyde - Protease and Protein - Concentration

6.4.1 DATA

# load data 
load("R_objects/stat_test_data_VOC.RData")

# Create subset
dat.clean <- dat %>% filter(compound %in% c("Benzaldehyde"))

6.4.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Protease")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 10 x 6
##    Protease Protein variable          n    mean     sd
##    <chr>    <chr>   <fct>         <dbl>   <dbl>  <dbl>
##  1 33       Casein  concentration     4 352750. 29410.
##  2 No       Casein  concentration     4   1518.   143.
##  3 SK11     Casein  concentration     4   4684.  1498.
##  4 Wg2      Casein  concentration     4 120599.  9077.
##  5 pAK80    Casein  concentration     4  12228. 15317.
##  6 33       Potato  concentration     4  27207.  3723.
##  7 No       Potato  concentration     4   4013.  1030.
##  8 SK11     Potato  concentration     4  69328. 13373.
##  9 Wg2      Potato  concentration     4  21156.  1706.
## 10 pAK80    Potato  concentration     4  47467.  6500.
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

6.4.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 9
##   Protease Protein Sample      `First Injection` `MS Quantitation Peak` compound
##   <chr>    <chr>   <chr>       <chr>             <chr>                  <chr>   
## 1 pAK80    Casein  pAK80-C4-D6 4 PAK80 C         MS Quantitation Peak   Benzald~
## 2 No       Potato  No-P1-D6    NO P1             MS Quantitation Peak   Benzald~
## 3 SK11     Potato  SK11-P4-D6  4 SK11 P          MS Quantitation Peak   Benzald~
## # i 3 more variables: concentration <dbl>, is.outlier <lgl>, is.extreme <lgl>

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic   p.value
##   <chr>                <dbl>     <dbl>
## 1 residuals(model)     0.843 0.0000628
# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Protein, Protease) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 2 x 6
##   Protease Protein variable      statistic       p  p.adj
##   <chr>    <chr>   <chr>             <dbl>   <dbl>  <dbl>
## 1 pAK80    Casein  concentration     0.649 0.00251 0.0251
## 2 No       Potato  concentration     0.683 0.00714 0.0357

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic       p
##   <int> <int>     <dbl>   <dbl>
## 1     9    30      3.41 0.00539
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

6.4.4 PERFORM TEST

The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Protease) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 5 x 8
##   Protease .y.               n statistic    df      p method          p.adj
##   <chr>    <chr>         <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 33       concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0209
## 2 No       concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0209
## 3 SK11     concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0209
## 4 Wg2      concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0209
## 5 pAK80    concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0209
pwc <- dat.clean %>% 
  group_by(Protease) %>%
  pairwise_wilcox_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 5 x 10
##   Protease .y.      group1 group2    n1    n2 statistic     p p.adj p.adj.signif
##   <chr>    <chr>    <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>       
## 1 33       concent~ Casein Potato     4     4        16 0.029 0.029 *           
## 2 No       concent~ Casein Potato     4     4         0 0.029 0.029 *           
## 3 SK11     concent~ Casein Potato     4     4         0 0.029 0.029 *           
## 4 Wg2      concent~ Casein Potato     4     4        16 0.029 0.029 *           
## 5 pAK80    concent~ Casein Potato     4     4         0 0.029 0.029 *
# for each strain
res.strain <- dat.clean %>% 
  group_by(Protein) %>%
  kruskal_test(concentration ~ Protease) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 2 x 8
##   Protein .y.               n statistic    df       p method           p.adj
##   <chr>   <chr>         <int>     <dbl> <int>   <dbl> <chr>            <dbl>
## 1 Casein  concentration    20      17.5     4 0.00154 Kruskal-Wallis 0.00154
## 2 Potato  concentration    20      18.3     4 0.00109 Kruskal-Wallis 0.00154
pwc <- dat.clean %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(concentration ~ Protease) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 19 x 10
##    Protein .y.     group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##    <chr>   <chr>   <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
##  1 Casein  concen~ 33     No         4     4        16 0.029 0.0305 ns          
##  2 Casein  concen~ 33     pAK80      4     4        16 0.029 0.0305 ns          
##  3 Casein  concen~ 33     SK11       4     4        16 0.029 0.0305 ns          
##  4 Casein  concen~ 33     Wg2        4     4        16 0.029 0.0305 ns          
##  5 Casein  concen~ No     pAK80      4     4         0 0.029 0.0305 ns          
##  6 Casein  concen~ No     SK11       4     4         0 0.029 0.0305 ns          
##  7 Casein  concen~ No     Wg2        4     4         0 0.029 0.0305 ns          
##  8 Casein  concen~ pAK80  Wg2        4     4         0 0.029 0.0305 ns          
##  9 Casein  concen~ SK11   Wg2        4     4         0 0.029 0.0305 ns          
## 10 Potato  concen~ 33     No         4     4        16 0.029 0.0305 ns          
## 11 Potato  concen~ 33     pAK80      4     4         0 0.029 0.0305 ns          
## 12 Potato  concen~ 33     SK11       4     4         0 0.029 0.0305 ns          
## 13 Potato  concen~ 33     Wg2        4     4        16 0.029 0.0305 ns          
## 14 Potato  concen~ No     pAK80      4     4         0 0.029 0.0305 ns          
## 15 Potato  concen~ No     SK11       4     4         0 0.029 0.0305 ns          
## 16 Potato  concen~ No     Wg2        4     4         0 0.029 0.0305 ns          
## 17 Potato  concen~ pAK80  SK11       4     4         0 0.029 0.0305 ns          
## 18 Potato  concen~ pAK80  Wg2        4     4        16 0.029 0.0305 ns          
## 19 Potato  concen~ SK11   Wg2        4     4        16 0.029 0.0305 ns

6.4.5 VISUALIZE

#Name order in plot
dat.clean$Protease <- factor(dat.clean$Protease, 
                             levels = c("No", "pAK80", "SK11", "Wg2", "33"),
                             labels =c ("A: Non-inoculated", 
                                      "B: MS22418 (PrtP-)", 
                                      "C: MS22421 (PrtP/SK11)", 
                                      "D: MS22425 (PrtP/Wg2)", 
                                      "E: MS22427 (PrtP/MS22333)"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Plot - boxplot
bxp.Benzaldehyde <- dat.clean %>%
  ggboxplot(.,
            x = "Protease",
            y = "concentration",
            color = "Protease",
            facet.by = "Protein")+
  labs(x="Strain", y="Relative amount")+
  scale_y_continuous()+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM"))
             )+
scale_color_manual(values = c("A: Non-inoculated"="#E89242FF", 
                                  "B: MS22418 (PrtP-)"="#82491EFF", 
                                  "C: MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "D: MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "E: MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  scale_x_discrete(labels=c("A", "B", "C", "D", "E")
                   )+
  theme_light()

bxp.Benzaldehyde

save(bxp.Benzaldehyde, file = "R_objects/Benzaldehyde.RData")

6.5 Hexanal - Protease and Protein - Concentration

6.5.1 DATA

# load data 
load("R_objects/stat_test_data_VOC.RData")

# Create subset
dat.clean <- dat %>% filter(compound %in% c("Hexanal"))

6.5.2 PREPARE VARIABLES

# Set names of variables
PREDICTOR <- c("Protein","Protease")
OUTCOME <- "concentration"
SUBJECT <- NULL

# Create formula
PREDICTOR.F <- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
FORMULA <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))

# Summary samples in groups
dat.clean %>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd")
## # A tibble: 10 x 6
##    Protease Protein variable          n   mean    sd
##    <chr>    <chr>   <fct>         <dbl>  <dbl> <dbl>
##  1 33       Casein  concentration     4  157.  158. 
##  2 No       Casein  concentration     4  145.  120. 
##  3 SK11     Casein  concentration     4   84.8  55.0
##  4 Wg2      Casein  concentration     4  106.   67.1
##  5 pAK80    Casein  concentration     4   53.7  37.5
##  6 33       Potato  concentration     4  451.  233. 
##  7 No       Potato  concentration     4 4015.  462. 
##  8 SK11     Potato  concentration     4  175.   58.9
##  9 Wg2      Potato  concentration     4  327.  171. 
## 10 pAK80    Potato  concentration     4  394.   89.9
# Create plot
bxp <- dat.clean %>%
  ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
            y = OUTCOME,
            color = PREDICTOR[1],
            facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
            palette = "jco")
bxp

6.5.3 PRELIMINARY TEST

Identify outliers

# Test for outliers
dat.clean %>% 
  group_by(across(all_of(PREDICTOR))) %>% 
  identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 9
##   Protease Protein Sample    `First Injection` `MS Quantitation Peak` compound
##   <chr>    <chr>   <chr>     <chr>             <chr>                  <chr>   
## 1 No       Casein  No-C2-D6  NO C2             MS Quantitation Peak   Hexanal 
## 2 Wg2      Casein  Wg2-C4-D6 4 W2G C           MS Quantitation Peak   Hexanal 
## 3 33       Potato  33-P2-D6  33P2              MS Quantitation Peak   Hexanal 
## # i 3 more variables: concentration <dbl>, is.outlier <lgl>, is.extreme <lgl>

Check normality

QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.

# Build the linear model
model  <- lm(FORMULA, data = dat.clean)

# Create a QQ plot of residuals
ggqqplot(residuals(model))

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
##   variable         statistic p.value
##   <chr>                <dbl>   <dbl>
## 1 residuals(model)     0.897 0.00161
# Compute Shapiro-Wilk test of normality for data groups
dat.clean %>% 
  group_by(Protein, Protease, compound) %>% 
  shapiro_test(vars = "concentration") %>% 
  adjust_pvalue(method = "fdr") %>%
  filter(p.adj < 0.05)
## # A tibble: 0 x 7
## # i 7 variables: Protease <chr>, Protein <chr>, compound <chr>, variable <chr>,
## #   statistic <dbl>, p <dbl>, p.adj <dbl>

Test homogneity of variance assumption

plot(model, 1)

dat.clean %>% levene_test(FORMULA)
## # A tibble: 1 x 4
##     df1   df2 statistic      p
##   <int> <int>     <dbl>  <dbl>
## 1     9    30      2.83 0.0154
# Save result
EQUAL.VAR <- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05

6.5.4 PERFORM TEST

The data are not normal distributed, but normality is found within each group by the Shapiro-Wilk normality test. However, the same groups of the VOC data were not normal distributed. Therefore, the analysis of this data is performed as a Kruskal-Wallis test following a Wilcoxon pairwise test.

# for each protein source
res.protein <- dat.clean %>% 
  group_by(Protease) %>%
  kruskal_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
res.protein %>% filter(p < 0.05)
## # A tibble: 3 x 8
##   Protease .y.               n statistic    df      p method          p.adj
##   <chr>    <chr>         <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 No       concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0522
## 2 Wg2      concentration     8      4.08     1 0.0433 Kruskal-Wallis 0.0722
## 3 pAK80    concentration     8      5.33     1 0.0209 Kruskal-Wallis 0.0522
pwc <- dat.clean %>% 
  group_by(Protease) %>%
  pairwise_wilcox_test(concentration ~ Protein) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 2 x 10
##   Protease .y.     group1 group2    n1    n2 statistic     p  p.adj p.adj.signif
##   <chr>    <chr>   <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <chr>       
## 1 No       concen~ Casein Potato     4     4         0 0.029 0.0725 *           
## 2 pAK80    concen~ Casein Potato     4     4         0 0.029 0.0725 *
# for each strain
res.strain <- dat.clean %>% 
  group_by(Protein) %>%
  kruskal_test(concentration ~ Protease) %>%
  adjust_pvalue(method = "fdr")
res.strain %>% filter(p < 0.05)
## # A tibble: 1 x 8
##   Protein .y.               n statistic    df      p method          p.adj
##   <chr>   <chr>         <int>     <dbl> <int>  <dbl> <chr>           <dbl>
## 1 Potato  concentration    20      12.6     4 0.0133 Kruskal-Wallis 0.0266
pwc <- dat.clean %>%
  filter(Protein == "Potato") %>%
  group_by(Protein) %>%
  pairwise_wilcox_test(concentration ~ Protease) %>%
  adjust_pvalue(method = "fdr")
pwc %>% filter(p < 0.05)
## # A tibble: 5 x 10
##   Protein .y.       group1 group2    n1    n2 statistic     p p.adj p.adj.signif
##   <chr>   <chr>     <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>       
## 1 Potato  concentr~ 33     No         4     4         0 0.029 0.058 ns          
## 2 Potato  concentr~ No     pAK80      4     4        16 0.029 0.058 ns          
## 3 Potato  concentr~ No     SK11       4     4        16 0.029 0.058 ns          
## 4 Potato  concentr~ No     Wg2        4     4        16 0.029 0.058 ns          
## 5 Potato  concentr~ pAK80  SK11       4     4        16 0.029 0.058 ns

6.5.5 VISUALIZE

#Name order in plot
dat.clean$Protease <- factor(dat.clean$Protease, 
                             levels = c("No", "pAK80", "SK11", "Wg2", "33"),
                             labels =c ("A: Non-inoculated", 
                                      "B: MS22418 (PrtP-)", 
                                      "C: MS22421 (PrtP/SK11)", 
                                      "D: MS22425 (PrtP/Wg2)", 
                                      "E: MS22427 (PrtP/MS22333)"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))

#Plot - boxplot
bxp.Hexanal <- dat.clean %>%
  ggboxplot(.,
            x = "Protease",
            y = "concentration",
            color = "Protease",
            facet.by = "Protein")+
  labs(x="Strain", y="Relative amount")+
  scale_y_continuous()+
  facet_wrap(~Protein, 
             nrow=1, 
             labeller = as_labeller(c("Casein"="CCDM", 
                                      "Potato"="PCDM"))
             )+
scale_color_manual(values = c("A: Non-inoculated"="#E89242FF", 
                                  "B: MS22418 (PrtP-)"="#82491EFF", 
                                  "C: MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "D: MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "E: MS22427 (PrtP/MS22333)"="#FB6467FF")
                           )+
  scale_x_discrete(labels=c("A", "B", "C", "D", "E")
                   )+
  theme_light()

bxp.Hexanal

save(bxp.Hexanal, file = "R_objects/Hexanal.RData")

6.6 VISUALIZE: All VOC plots

load(file = "R_objects/Acetaldehyde.RData")
load(file = "R_objects/Benzaldehyde.RData")
load(file = "R_objects/Hexanal.RData")



fig5 <- ggarrange(bxp.Acetaldehyde, bxp.Benzaldehyde, bxp.Hexanal, heights = c(1,1,1),
          ncol = 1,
          labels = c("A","B","C"),
          align = "v",
          legend = "right", common.legend = TRUE
          )

#Save Figure 5
ggsave(filename = "plots/fig5_VOCs.svg", plot = fig5, width = 16.5, height = 20, units =c("cm"), dpi = 600)

Figure 5

7 DATA SET: Milk acidification

7.1 DATA

7.1.1 LOAD

dat1 <- read_excel("./data/Data_collection.xlsx", sheet="Supl_fig_rep20191013", na = "NA")
dat2 <- read_excel("./data/Data_collection.xlsx", sheet="Supl_fig_rep20191031", na = "NA")

7.1.2 LOOK

# Take a glimpse
glimpse(dat1)
## Rows: 602
## Columns: 7
## $ Time           <dbl> 0.00000000, 0.03333333, 0.06666667, 0.10000000, 0.13333~
## $ pAK80          <dbl> 6.701807, 6.691638, 6.690138, 6.692273, 6.693289, 6.694~
## $ `PrtP/Wg2`     <dbl> 6.674124, 6.674505, 6.674023, 6.673349, 6.672475, 6.671~
## $ `PrtP/SK11`    <dbl> 6.663352, 6.664178, 6.663682, 6.662663, 6.661853, 6.660~
## $ `PrtP/MS22333` <dbl> 6.678489, 6.679384, 6.678436, 6.677651, 6.676611, 6.674~
## $ Milk           <dbl> 6.701210, 6.700569, 6.699893, 6.699427, 6.698843, 6.698~
## $ Milk_with_ery  <dbl> 6.710694, 6.714046, 6.716038, 6.717559, 6.718817, 6.719~
glimpse(dat2)
## Rows: 1,353
## Columns: 7
## $ Time           <dbl> 0.00000000, 0.01666667, 0.03333333, 0.05000000, 0.06666~
## $ pAK80          <dbl> 6.611264, 6.614421, 6.616817, 6.619895, 6.623252, 6.626~
## $ `PrtP/Wg2`     <dbl> 6.679517, 6.686099, 6.688319, 6.690095, 6.690137, 6.689~
## $ `PrtP/SK11`    <dbl> 6.699851, 6.699850, 6.699850, 6.699849, 6.699849, 6.699~
## $ `PrtP/MS22333` <dbl> 6.667879, 6.665835, 6.664838, 6.663531, 6.662640, 6.661~
## $ Milk           <dbl> 6.652904, 6.659897, 6.663321, 6.666914, 6.668894, 6.670~
## $ Milk_with_ery  <dbl> 6.696377, 6.695598, 6.694867, 6.694411, 6.693895, 6.693~
# Make a explorative summary 
skimr::skim(dat1)
Data summary
Name dat1
Number of rows 602
Number of columns 7
_______________________
Column type frequency:
numeric 7
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 10.02 5.80 0.00 5.01 10.02 15.03 20.03 <U+2587><U+2587><U+2587><U+2587><U+2587>
pAK80 0 1 6.34 0.22 5.99 6.15 6.32 6.54 6.70 <U+2587><U+2587><U+2586><U+2586><U+2587>
PrtP/Wg2 0 1 5.06 0.97 4.15 4.23 4.48 6.11 6.67 <U+2587><U+2581><U+2581><U+2581><U+2583>
PrtP/SK11 0 1 5.08 0.98 4.12 4.22 4.51 6.17 6.66 <U+2587><U+2581><U+2581><U+2581><U+2583>
PrtP/MS22333 0 1 5.00 1.02 4.02 4.10 4.48 6.11 6.68 <U+2587><U+2582><U+2581><U+2581><U+2583>
Milk 0 1 6.49 0.23 5.95 6.35 6.57 6.69 6.70 <U+2582><U+2581><U+2581><U+2582><U+2587>
Milk_with_ery 0 1 6.72 0.00 6.71 6.72 6.72 6.72 6.72 <U+2581><U+2587><U+2583><U+2585><U+2587>
skimr::skim(dat2)
Data summary
Name dat2
Number of rows 1353
Number of columns 7
_______________________
Column type frequency:
numeric 7
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 11.27 6.51 0.00 5.63 11.27 16.90 22.54 <U+2587><U+2587><U+2587><U+2587><U+2587>
pAK80 0 1 6.19 0.35 5.61 5.84 6.25 6.52 6.65 <U+2586><U+2582><U+2583><U+2585><U+2587>
PrtP/Wg2 0 1 4.90 0.94 4.12 4.19 4.34 5.65 6.69 <U+2587><U+2581><U+2581><U+2581><U+2582>
PrtP/SK11 0 1 4.90 0.96 4.09 4.16 4.33 5.73 6.70 <U+2587><U+2581><U+2581><U+2581><U+2582>
PrtP/MS22333 0 1 4.76 0.97 3.99 4.03 4.19 5.41 6.67 <U+2587><U+2581><U+2581><U+2581><U+2582>
Milk 0 1 5.77 0.65 4.77 5.18 5.73 6.39 6.68 <U+2586><U+2585><U+2585><U+2583><U+2587>
Milk_with_ery 0 1 6.40 0.36 5.77 5.95 6.65 6.69 6.70 <U+2583><U+2581><U+2581><U+2581><U+2587>

7.1.3 CLEAN DATA

#Reorganize data

dat1 <- dat1 %>%
  pivot_longer(.,cols=2:7, names_to="Sample", values_to = "Rep1")

dat2 <- dat2 %>%
  pivot_longer(.,cols=2:7, names_to="Sample", values_to = "Rep2")

dat <- full_join(dat1, dat2, by=c("Time", "Sample"))

dat=pivot_longer(dat, cols=3:4, names_to = "Replicate", values_to = "pH")

dat <- dat %>%
  filter(., !is.na(pH))

dat.clean <- filter(dat, Sample %in% c("pAK80", "PrtP/Wg2", "PrtP/SK11", "PrtP/MS22333", "Milk_with_ery"))

# Look at cleaned data
skimr::skim(dat.clean)
Data summary
Name dat.clean
Number of rows 9775
Number of columns 4
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sample 0 1 5 13 0 5 0
Replicate 0 1 4 4 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 10.88 6.32 0.00 5.42 10.85 16.28 22.54 <U+2587><U+2587><U+2587><U+2587><U+2586>
pH 0 1 5.49 1.06 3.99 4.25 5.84 6.57 6.72 <U+2587><U+2581><U+2581><U+2583><U+2587>

7.2 VISUALIZE

dat.clean$Sample <- factor(dat.clean$Sample, 
                     levels = c("Milk_with_ery", "pAK80", "PrtP/SK11", "PrtP/Wg2", "PrtP/MS22333"),
                     labels =c ("Non-inoculated", 
                                      "MS22418 (PrtP-)", 
                                      "MS22421 (PrtP/SK11)", 
                                      "MS22425 (PrtP/Wg2)", 
                                      "MS22427 (PrtP/MS22333)"))

p.out <- dat.clean %>%
  ggplot(., aes(x=Time, y=pH, color=Sample, linetype=Replicate))+
  geom_line(size=1)+
   ylim(3.5,7)+
   theme_light()+
  scale_color_manual(values = c("Non-inoculated"="#E89242FF", 
                                  "MS22418 (PrtP-)"="#82491EFF", 
                                  "MS22421 (PrtP/SK11)"="#24325FFF", 
                                  "MS22425 (PrtP/Wg2)"="#B7E4F9FF", 
                                  "MS22427 (PrtP/MS22333)"="#FB6467FF")
                     )+
  labs(color="Strain")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
# SAVE plot

ggsave(filename = "plots/Milk_acidification.svg", plot = p.out, width = 16.5, height = 9, units =c("cm"), dpi = 600)

Figure S2

8 INFO

8.1 FINAL COMMENT

This completes the fundamental statistical test of the sample variables.

8.2 REFERENCES

These are the references of R and the applied R packages.

citation()
## 
## To cite R in publications use:
## 
##   R Core Team (2021). R: A language and environment for statistical
##   computing. R Foundation for Statistical Computing, Vienna, Austria.
##   URL https://www.R-project.org/.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2021},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
citation("tidyverse")
## 
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
## Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
## K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
## tidyverse." _Journal of Open Source Software_, *4*(43), 1686. doi:
## 10.21105/joss.01686 (URL: https://doi.org/10.21105/joss.01686).
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }
citation("rstatix")
## 
## To cite package 'rstatix' in publications use:
## 
##   Alboukadel Kassambara (2023). rstatix: Pipe-Friendly Framework for
##   Basic Statistical Tests. R package version 0.7.2.
##   https://CRAN.R-project.org/package=rstatix
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {rstatix: Pipe-Friendly Framework for Basic Statistical Tests},
##     author = {Alboukadel Kassambara},
##     year = {2023},
##     note = {R package version 0.7.2},
##     url = {https://CRAN.R-project.org/package=rstatix},
##   }
citation("pheatmap")
## 
## To cite package 'pheatmap' in publications use:
## 
##   Raivo Kolde (2019). pheatmap: Pretty Heatmaps. R package version
##   1.0.12. https://CRAN.R-project.org/package=pheatmap
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {pheatmap: Pretty Heatmaps},
##     author = {Raivo Kolde},
##     year = {2019},
##     note = {R package version 1.0.12},
##     url = {https://CRAN.R-project.org/package=pheatmap},
##   }
## 
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
citation("vegan")
## 
## To cite package 'vegan' in publications use:
## 
##   Jari Oksanen, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt,
##   Pierre Legendre, Peter R. Minchin, R.B. O'Hara, Peter Solymos, M.
##   Henry H. Stevens, Eduard Szoecs, Helene Wagner, Matt Barbour, Michael
##   Bedward, Ben Bolker, Daniel Borcard, Gustavo Carvalho, Michael
##   Chirico, Miquel De Caceres, Sebastien Durand, Heloisa Beatriz
##   Antoniazi Evangelista, Rich FitzJohn, Michael Friendly, Brendan
##   Furneaux, Geoffrey Hannigan, Mark O. Hill, Leo Lahti, Dan McGlinn,
##   Marie-Helene Ouellette, Eduardo Ribeiro Cunha, Tyler Smith, Adrian
##   Stier, Cajo J.F. Ter Braak and James Weedon (2022). vegan: Community
##   Ecology Package. R package version 2.6-4.
##   https://CRAN.R-project.org/package=vegan
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {vegan: Community Ecology Package},
##     author = {Jari Oksanen and Gavin L. Simpson and F. Guillaume Blanchet and Roeland Kindt and Pierre Legendre and Peter R. Minchin and R.B. O'Hara and Peter Solymos and M. Henry H. Stevens and Eduard Szoecs and Helene Wagner and Matt Barbour and Michael Bedward and Ben Bolker and Daniel Borcard and Gustavo Carvalho and Michael Chirico and Miquel {De Caceres} and Sebastien Durand and Heloisa Beatriz Antoniazi Evangelista and Rich FitzJohn and Michael Friendly and Brendan Furneaux and Geoffrey Hannigan and Mark O. Hill and Leo Lahti and Dan McGlinn and Marie-Helene Ouellette and Eduardo {Ribeiro Cunha} and Tyler Smith and Adrian Stier and Cajo J.F. {Ter Braak} and James Weedon},
##     year = {2022},
##     note = {R package version 2.6-4},
##     url = {https://CRAN.R-project.org/package=vegan},
##   }
citation("kableExtra")
## 
## To cite package 'kableExtra' in publications use:
## 
##   Hao Zhu (2021). kableExtra: Construct Complex Table with 'kable' and
##   Pipe Syntax. R package version 1.3.4.
##   https://CRAN.R-project.org/package=kableExtra
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
##     author = {Hao Zhu},
##     year = {2021},
##     note = {R package version 1.3.4},
##     url = {https://CRAN.R-project.org/package=kableExtra},
##   }
citation("readxl")
## 
## To cite package 'readxl' in publications use:
## 
##   Hadley Wickham and Jennifer Bryan (2022). readxl: Read Excel Files. R
##   package version 1.4.1. https://CRAN.R-project.org/package=readxl
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {readxl: Read Excel Files},
##     author = {Hadley Wickham and Jennifer Bryan},
##     year = {2022},
##     note = {R package version 1.4.1},
##     url = {https://CRAN.R-project.org/package=readxl},
##   }
citation("ggpubr")
## 
## To cite package 'ggpubr' in publications use:
## 
##   Alboukadel Kassambara (2023). ggpubr: 'ggplot2' Based Publication
##   Ready Plots. R package version 0.6.0.
##   https://CRAN.R-project.org/package=ggpubr
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggpubr: 'ggplot2' Based Publication Ready Plots},
##     author = {Alboukadel Kassambara},
##     year = {2023},
##     note = {R package version 0.6.0},
##     url = {https://CRAN.R-project.org/package=ggpubr},
##   }
citation("ggsci")
## 
## To cite package 'ggsci' in publications use:
## 
##   Nan Xiao (2018). ggsci: Scientific Journal and Sci-Fi Themed Color
##   Palettes for 'ggplot2'. R package version 2.9.
##   https://CRAN.R-project.org/package=ggsci
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggsci: Scientific Journal and Sci-Fi Themed Color Palettes for
## 'ggplot2'},
##     author = {Nan Xiao},
##     year = {2018},
##     note = {R package version 2.9},
##     url = {https://CRAN.R-project.org/package=ggsci},
##   }
citation("scales")
## 
## To cite package 'scales' in publications use:
## 
##   Hadley Wickham and Dana Seidel (2022). scales: Scale Functions for
##   Visualization. R package version 1.2.1.
##   https://CRAN.R-project.org/package=scales
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {scales: Scale Functions for Visualization},
##     author = {Hadley Wickham and Dana Seidel},
##     year = {2022},
##     note = {R package version 1.2.1},
##     url = {https://CRAN.R-project.org/package=scales},
##   }
citation("ggrepel")
## 
## To cite package 'ggrepel' in publications use:
## 
##   Kamil Slowikowski (2023). ggrepel: Automatically Position
##   Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.3.
##   https://CRAN.R-project.org/package=ggrepel
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggrepel: Automatically Position Non-Overlapping Text Labels with
## 'ggplot2'},
##     author = {Kamil Slowikowski},
##     year = {2023},
##     note = {R package version 0.9.3},
##     url = {https://CRAN.R-project.org/package=ggrepel},
##   }

8.3 SESSION INFO

The analysis was run in the following environment:

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252   
## [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                   
## [5] LC_TIME=Danish_Denmark.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.9.3    pheatmap_1.0.12  scales_1.2.1     ggsci_2.9       
##  [5] ggpubr_0.6.0     kableExtra_1.3.4 vegan_2.6-4      lattice_0.20-44 
##  [9] permute_0.9-7    rstatix_0.7.2    forcats_1.0.0    stringr_1.5.0   
## [13] dplyr_1.1.0      purrr_1.0.1      readr_2.1.3      tidyr_1.3.0     
## [17] tibble_3.1.8     ggplot2_3.4.0    tidyverse_1.3.2  readxl_1.4.1    
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-152        fs_1.6.0            lubridate_1.9.1    
##  [4] webshot_0.5.5       RColorBrewer_1.1-3  httr_1.4.7         
##  [7] repr_1.1.6          tools_4.1.1         backports_1.4.1    
## [10] bslib_0.5.1         utf8_1.2.2          R6_2.5.1           
## [13] DBI_1.1.3           mgcv_1.8-36         colorspace_2.1-0   
## [16] withr_2.5.0         gridExtra_2.3       tidyselect_1.2.0   
## [19] compiler_4.1.1      textshaping_0.3.6   cli_3.6.0          
## [22] rvest_1.0.3         xml2_1.3.3          labeling_0.4.2     
## [25] bookdown_0.32       sass_0.4.7          systemfonts_1.0.4  
## [28] digest_0.6.31       rmarkdown_2.24      svglite_2.1.1      
## [31] base64enc_0.1-3     pkgconfig_2.0.3     htmltools_0.5.4    
## [34] highr_0.10          dbplyr_2.3.3        fastmap_1.1.0      
## [37] rlang_1.0.6         rstudioapi_0.15.0   farver_2.1.1       
## [40] jquerylib_0.1.4     generics_0.1.3      jsonlite_1.8.4     
## [43] car_3.1-2           googlesheets4_1.0.1 magrittr_2.0.3     
## [46] Matrix_1.5-3        Rcpp_1.0.10         munsell_0.5.0      
## [49] fansi_1.0.4         abind_1.4-5         lifecycle_1.0.3    
## [52] stringi_1.7.12      yaml_2.3.7          carData_3.0-5      
## [55] MASS_7.3-54         grid_4.1.1          parallel_4.1.1     
## [58] crayon_1.5.2        cowplot_1.1.1       haven_2.5.1        
## [61] splines_4.1.1       hms_1.1.3           knitr_1.42         
## [64] pillar_1.9.0        ggsignif_0.6.4      reprex_2.0.2       
## [67] glue_1.6.2          evaluate_0.21       modelr_0.1.11      
## [70] vctrs_0.5.2         rmdformats_1.0.4    tzdb_0.3.0         
## [73] cellranger_1.1.0    gtable_0.3.1        cachem_1.0.6       
## [76] xfun_0.36           skimr_2.1.5         broom_1.0.5        
## [79] ragg_1.2.5          googledrive_2.0.0   viridisLite_0.4.2  
## [82] gargle_1.2.1        cluster_2.1.2       timechange_0.2.0