Supporting Information: Statistical analyses and plots
1 INFO
1.1 BACKGROUND
This supporting file containes the codes for the statistical analyses and plots in our paper, Lactococcus cell envelope proteases enable lactococcal growth in minimal growth media supplemented with high molecular weight proteins of plant and animal origin.
1.2 Figures
The codes for the figures and supporting figures referred to in the paper can be found in this supporting file by searching for Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure S2, Figure S4, Figure S5A, and Figure S5B.
2 SETUP
2.1 PACKAGES
R packages applied for:
data loading: readxl
main analyses: tidyverse, rstatix, vegan, and kableExtra
visualization: ggpubr, ggsci, scales, pheatmap, and ggrepel
2.2 DATA
The excel file Data_collection is included as supporting file of the paper and includes all the data for the analyses. Different data sets are divided into separate excel sheets.
2.3 VARIABLES
Some variables in the excel file Data_collection were named differently in the paper, as explained further below.
Strain is also known as Protease in the data file. Otherwise, the names of the strains are as follows:
Non-inoculated = “No”
MS22418 (PrtP-) = “pAK80”
MS22421 (PrtP/SK11) = “SK11”
MS22425 (PrtP/Wg2) = “Wg2”
MS22427 (PrtP/MS22333) = “MS22333” or “33”
Medium is also referred to as Protein in the data file. Otherwise, the names of the media are as follows:
CCDM = “Casein”
PCDM = “Potato”
3 DATA SET: pH and cell counts
3.1 DATA
3.1.1 LOAD
<- read_excel("./data/Data_collection.xlsx", sheet="pH_and_CFU", na = "NA") dat
3.1.2 LOOK
# Take a glimpse
glimpse(dat)
## Rows: 154
## Columns: 7
## $ Sample <chr> "pAK80-D0", "Wg2-D0", "SK11-D0", "33-D0", "pAK80-P1-D1", "pAK8~
## $ Strain <chr> "pAK80", "Wg2", "SK11", "33", "pAK80", "pAK80", "pAK80", "Wg2"~
## $ Protein <chr> "Inoculation", "Inoculation", "Inoculation", "Inoculation", "P~
## $ Day <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,~
## $ pH <dbl> 7.20, 7.20, 7.20, 7.20, 6.96, 6.96, 6.96, 6.84, 6.85, 6.84, 6.~
## $ CFU <dbl> 6.20e+06, 6.54e+06, 6.00e+06, 6.43e+06, 1.70e+08, 2.42e+08, 1.~
## $ logN <dbl> 6.792392, 6.815578, 6.778151, 6.808211, 8.230449, 8.383815, 8.~
# Make a explorative summary
::skim(dat) skimr
Name | dat |
Number of rows | 154 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1 | 5 | 11 | 0 | 154 | 0 |
Strain | 0 | 1 | 2 | 5 | 0 | 5 | 0 |
Protein | 0 | 1 | 5 | 11 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Day | 0 | 1.00 | 3.18 | 2.30 | 0.00 | 1.00e+00 | 2.00e+00 | 6.00e+00 | 6.00e+00 | <U+2587><U+2585><U+2581><U+2581><U+2587> |
pH | 9 | 0.94 | 6.36 | 0.86 | 4.93 | 5.19e+00 | 6.91e+00 | 6.99e+00 | 7.20e+00 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
CFU | 71 | 0.54 | 193698566.78 | 429785198.34 | 0.00 | 1.30e+06 | 7.08e+06 | 4.85e+07 | 1.78e+09 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
logN | 71 | 0.54 | 6.25 | 2.63 | 0.00 | 6.04e+00 | 6.85e+00 | 7.68e+00 | 9.25e+00 | <U+2582><U+2581><U+2582><U+2587><U+2585> |
3.1.3 SAVE
# Save cleaned data
save(dat, file = "R_objects/stat_test_data_growth.RData")
# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())
3.2 TEST INDEPENDENCE
3.2.1 CATEGORICAL VARIABLES
Chi squared test of independence.
# load data
load("R_objects/stat_test_data_growth.RData")
# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
<- dat %>% mutate(across(where(is_character), as_factor))
dat
# Create vector of categorical variables with more than two categories and less than half of the number of samples.
<- dat %>%
cat_var select_if(is.factor) %>%
select(where(~n_distinct(.) > 1)) %>%
select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
colnames()
length(cat_var) # If less than 2 variables the following code should not be run
## [1] 2
# Test the variables pairwise
%>%
cat_var combn(2) %>%
t() %>%
as_tibble() %>%
%>%
rowwise mutate(chisq_test = list(
table(dat[[V1]], dat[[V2]]) %>% chisq.test()
),chisq_pval = chisq_test$p.value
)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## i Using compatibility `.name_repair`.
## Warning: There was 1 warning in `mutate()`.
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
## chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## # A tibble: 1 x 4
## # Rowwise:
## V1 V2 chisq_test chisq_pval
## <chr> <chr> <list> <dbl>
## 1 Strain Protein <htest> 1.00
3.2.2 QUANTITATIVE VARIABLES
Spearman’s test without first testing the normality of the data
# create vector with the relevant variables
<- dat %>% select(where(is.numeric)) %>% colnames()
CON.VARS
# Plot all variables against each other
pairs(dat[,CON.VARS], pch = 19, cex = 0.5,
lower.panel=NULL)
# Run Spearman test
<- cor(dat[,CON.VARS], method = "spearman", use = "pairwise.complete.obs")
corrmat
# Create heatmap
<- round(corrmat, 2)
corrmat_rounded
<- tibble(Var1 = rep(row.names(corrmat_rounded), length(row.names(corrmat_rounded))),
melted_corrmat_rounded Var2 = rep(row.names(corrmat_rounded), each = length(row.names(corrmat_rounded))),
dist = as.numeric(matrix(corrmat_rounded)))
ggplot(melted_corrmat_rounded, aes(x = Var1, y = Var2, fill = dist)) +
labs(title = "Correlation between continous variables") +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", mid = "white",
high = "red", midpoint = 0, limit = c(-1,1),
space = "Lab", name = "Correlation coefficient") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, size = 12,
hjust = 1, vjust = 1),
axis.text.y = element_text(size = 12, hjust = 1, vjust = 0.5),
axis.title.x = element_blank(), axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank()) +
geom_text(aes(x = Var1, y = Var2, label = dist),
color = "black", size = 4) +
coord_fixed()
# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())
3.3 ALL - Strain, Protein and Day - pH
# load data
load("R_objects/stat_test_data_growth.RData")
# Create subset
<- dat %>%
dat.clean filter(!is.na(pH) & Protein %in% c("Casein", "Potato", "Hydrolysate", "Water")) %>%
mutate(Day = as.factor(Day))
3.3.1 PREPARE VARIABLES
# Set names of variables
<- c("Day","Strain", "Protein")
PREDICTOR <- "pH"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 52 x 7
## Strain Protein Day variable n mean sd
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein 1 pH 3 4.99 0.015
## 2 33 Hydrolysate 1 pH 2 6.97 0
## 3 33 Potato 1 pH 3 6.74 0.032
## 4 33 Water 1 pH 2 7.02 0.007
## 5 No Casein 1 pH 3 6.97 0.01
## 6 No Hydrolysate 1 pH 2 6.98 0.028
## 7 No Potato 1 pH 3 7.00 0.023
## 8 No Water 1 pH 2 7.04 0
## 9 SK11 Casein 1 pH 3 6.96 0
## 10 SK11 Hydrolysate 1 pH 2 7 0
## # i 42 more rows
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
3.3.2 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 2 x 9
## Strain Protein Day Sample pH CFU logN is.outlier is.extreme
## <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 Wg2 Potato 6 Wg2-P3-D6 5.02 6040000 6.78 TRUE FALSE
## 2 pAK80 Casein 6 pAK80-C4-D6 6.66 580000000 8.76 TRUE FALSE
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model
residuals.The Shapiro-Wilk test is performed on all data groups when
possible with n > 2 and only with unique values.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.281 2.47e-23
# Compute Shapiro-Wilk test of normality for each data group
<- dat.clean %>%
sets group_by(Day, Protein, Strain) %>%
summarise(n_unique = n_distinct(pH),
n = n()) %>%
filter(n > 2, n_unique > 1) %>%
transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
%>%
dat.clean mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
filter(day_protein_strain %in% sets$day_protein_strain) %>%
group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "pH") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 7 x 7
## Strain Protein Day variable statistic p p.adj
## <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl>
## 1 No Potato 1 pH 0.75 0 0
## 2 Wg2 Potato 1 pH 0.75 0 0
## 3 SK11 Casein 2 pH 0.75 0 0
## 4 Wg2 Casein 2 pH 0.75 0 0
## 5 Wg2 Potato 2 pH 0.75 0 0
## 6 pAK80 Potato 2 pH 0.75 0 0
## 7 Wg2 Potato 6 pH 0.630 0.00124 0.00443
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 51 89 46.8 6.76e-47
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
3.3.3 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain, Day) %>%
kruskal_test(pH ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 10 x 9
## Strain Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 1 pH 10 8.51 3 0.0366 Kruskal-Wallis 0.0683
## 2 33 2 pH 8 6.25 2 0.0439 Kruskal-Wallis 0.0683
## 3 33 6 pH 12 8.81 3 0.032 Kruskal-Wallis 0.0683
## 4 No 6 pH 12 7.85 3 0.0493 Kruskal-Wallis 0.0690
## 5 SK11 1 pH 10 8.18 3 0.0424 Kruskal-Wallis 0.0683
## 6 SK11 6 pH 12 10.2 3 0.0167 Kruskal-Wallis 0.0683
## 7 Wg2 1 pH 10 8.56 3 0.0358 Kruskal-Wallis 0.0683
## 8 Wg2 2 pH 8 6.40 2 0.0407 Kruskal-Wallis 0.0683
## 9 Wg2 6 pH 12 9.71 3 0.0212 Kruskal-Wallis 0.0683
## 10 pAK80 6 pH 12 10.2 3 0.0173 Kruskal-Wallis 0.0683
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
kruskal_test(pH ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 6 x 9
## Protein Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein 1 pH 15 11.9 4 0.0182 Kruskal-Wallis 0.0400
## 2 Potato 1 pH 15 13.4 4 0.00943 Kruskal-Wallis 0.0346
## 3 Casein 2 pH 12 10.5 3 0.0151 Kruskal-Wallis 0.0400
## 4 Potato 2 pH 11 9.54 3 0.0229 Kruskal-Wallis 0.0420
## 5 Casein 6 pH 20 17.6 4 0.00151 Kruskal-Wallis 0.00830
## 6 Potato 6 pH 20 18.4 4 0.00105 Kruskal-Wallis 0.00830
<- dat.clean %>%
pwc.strain filter(Protein %in% c("Casein", "Potato"), Day==6) %>%
group_by(Protein) %>%
pairwise_wilcox_test(pH ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p.adj < 0.05) pwc.strain
## # A tibble: 19 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Casein pH 33 No 4 4 0 0.028 0.0322 ns
## 2 Casein pH 33 pAK80 4 4 0 0.029 0.0322 ns
## 3 Casein pH 33 SK11 4 4 0 0.028 0.0322 ns
## 4 Casein pH 33 Wg2 4 4 0.5 0.041 0.0432 ns
## 5 Casein pH No pAK80 4 4 16 0.028 0.0322 ns
## 6 Casein pH No SK11 4 4 16 0.028 0.0322 ns
## 7 Casein pH No Wg2 4 4 16 0.028 0.0322 ns
## 8 Casein pH pAK80 Wg2 4 4 16 0.029 0.0322 ns
## 9 Casein pH SK11 Wg2 4 4 16 0.029 0.0322 ns
## 10 Potato pH 33 No 4 4 0 0.029 0.0322 ns
## 11 Potato pH 33 pAK80 4 4 0 0.029 0.0322 ns
## 12 Potato pH 33 SK11 4 4 0 0.029 0.0322 ns
## 13 Potato pH 33 Wg2 4 4 0 0.026 0.0322 ns
## 14 Potato pH No pAK80 4 4 16 0.029 0.0322 ns
## 15 Potato pH No SK11 4 4 16 0.028 0.0322 ns
## 16 Potato pH No Wg2 4 4 16 0.026 0.0322 ns
## 17 Potato pH pAK80 SK11 4 4 16 0.029 0.0322 ns
## 18 Potato pH pAK80 Wg2 4 4 16 0.026 0.0322 ns
## 19 Potato pH SK11 Wg2 4 4 16 0.026 0.0322 ns
3.3.4 VISUALIZE
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "33"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato", "Hydrolysate", "Water"))
dat.clean
#Boxplot
<- dat.clean %>%
bxp.pH filter(., Protein %in% c("Casein", "Potato", "Hydrolysate", "Water")) %>%
filter (., Day %in% c(1, 2, 6)) %>%
ggboxplot(.,
x = "Day",
y = "pH",
color = "Strain",
facet.by = "Protein")+
labs(x="Time [days]", y="pH")+
scale_y_continuous(limits = c(4.8, 7.2))+
facet_wrap(~Protein,
nrow=2,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM",
"Hydrolysate"="HCDM",
"Water"="CDM")))+
scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)geom_hline(yintercept=7.0, linetype="dashed", color = "#7d7d7dff")+
theme_light()+
theme(legend.position="bottom")
save(bxp.pH, file = "R_objects/Fig1A.RData")
3.4 PH CHANGE
The pH_change analyses are not presented or discussed in the paper because they did not appear to be relevant to the main conclusion of the paper.
3.4.1 DATA
# load data
load("R_objects/stat_test_data_growth.RData")
# Add replicate identifier
$replicate <- str_split(dat$Sample, "-", simplify = T)[,2]
dat
#Calculate pH change
<- dat %>%
dat.wide pivot_wider(values_from = pH, names_from = Day, names_prefix = "Day",id_cols = c(replicate, Strain, Protein), values_fill = NA)
<- dat.wide %>%
dat.wide mutate(pH_1 = Day1 - 7.2,
pH_2 = Day2 - Day1,
pH_6 = Day6 - Day2,
norm_1 = 100*Day1/7.2,
norm_2 = 100*Day2/7.2,
norm_6 = 100*Day6/7.2
)
ggplot(dat.wide, aes(x = Protein, y = pH_1, color = Strain)) +
geom_boxplot(outlier.colour = NA) +
geom_point(position = position_jitterdodge(jitter.width = 0.1,
dodge.width = 0.7))
## Warning: Removed 14 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 14 rows containing missing values (`geom_point()`).
ggplot(dat.wide, aes(x = Protein, y = pH_2, color = Strain)) +
geom_boxplot(outlier.colour = NA) +
geom_point(position = position_jitterdodge(jitter.width = 0.1,
dodge.width = 0.7))
## Warning: Removed 33 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 33 rows containing missing values (`geom_point()`).
ggplot(dat.wide, aes(x = Protein, y = pH_6, color = Strain)) +
geom_boxplot(outlier.colour = NA) +
geom_point(position = position_jitterdodge(jitter.width = 0.1,
dodge.width = 0.7))
## Warning: Removed 33 rows containing non-finite values (`stat_boxplot()`).
## Removed 33 rows containing missing values (`geom_point()`).
# Create subset
<- dat.wide %>% pivot_longer(names_to = "Day",values_to = "pH_change",cols = c(pH_1, pH_2, pH_6))
dat.clean.wide
ggplot(dat.clean.wide, aes(x = Day, y = pH_change, color = Protein, group = replicate)) +
geom_point() + geom_line() +
facet_wrap("Strain")
## Warning: Removed 80 rows containing missing values (`geom_point()`).
## Warning: Removed 46 rows containing missing values (`geom_line()`).
Save data
# Save cleaned data
save(dat.clean.wide, file = "R_objects/stat_test_data_pH_change.RData")
# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())
3.4.2 ALL - Strain & Protein - pH_change
# load data
load("R_objects/stat_test_data_pH_change.RData")
# Create subset
<-
dat.clean %>%
dat.clean.wide filter(!is.na(pH_change))
3.4.2.1 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain", "Day")
PREDICTOR <- "pH_change"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 44 x 7
## Strain Protein Day variable n mean sd
## <chr> <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein pH_1 pH_change 3 -2.21 0.015
## 2 33 Casein pH_2 pH_change 3 0.207 0.021
## 3 33 Casein pH_6 pH_change 3 -0.2 0
## 4 No Casein pH_1 pH_change 3 -0.23 0.01
## 5 SK11 Casein pH_1 pH_change 3 -0.24 0
## 6 SK11 Casein pH_2 pH_change 3 0.117 0.006
## 7 SK11 Casein pH_6 pH_change 3 -0.283 0.015
## 8 Wg2 Casein pH_1 pH_change 3 -2.17 0.026
## 9 Wg2 Casein pH_2 pH_change 3 0.083 0.023
## 10 Wg2 Casein pH_6 pH_change 3 -0.073 0.012
## # i 34 more rows
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
3.4.2.2 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## [1] Strain Protein Day replicate Day0 Day1
## [7] Day2 Day6 norm_1 norm_2 norm_6 pH_change
## [13] is.outlier is.extreme
## <0 rækker> (eller 0-længde row.names)
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model
residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.363 6.80e-20
# Compute Shapiro-Wilk test of normality for each data group
<- dat.clean %>%
sets group_by(Day, Protein, Strain) %>%
summarise(n_unique = n_distinct(pH_change),
n = n()) %>%
filter(n > 2, n_unique > 1) %>%
transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
%>%
dat.clean mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
filter(day_protein_strain %in% sets$day_protein_strain) %>%
group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "pH_change") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 8 x 7
## Strain Protein Day variable statistic p p.adj
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 No Potato pH_1 pH_change 0.75 0 0
## 2 Wg2 Potato pH_1 pH_change 0.75 0 0
## 3 SK11 Casein pH_2 pH_change 0.75 0 0
## 4 Wg2 Casein pH_2 pH_change 0.75 0 0
## 5 Wg2 Potato pH_2 pH_change 0.75 0 0
## 6 pAK80 Potato pH_2 pH_change 0.75 0 0
## 7 Wg2 Casein pH_6 pH_change 0.75 0 0
## 8 pAK80 Casein pH_6 pH_change 0.75 0 0
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 43 68 75.4 1.48e-43
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
3.4.2.3 PERFORM TEST
The data are not normally distributed. At least one data group is not normally distributed, according to the Shapiro-Wilk normality test. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon pairwise test.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain, Day) %>%
kruskal_test(pH_change ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 8 x 9
## Strain Day .y. n statistic df p method p.adj
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 pH_1 pH_change 10 8.51 3 0.0366 Kruskal-Wallis 0.0713
## 2 33 pH_2 pH_change 8 6.25 2 0.0439 Kruskal-Wallis 0.0713
## 3 33 pH_6 pH_change 8 6.56 2 0.0376 Kruskal-Wallis 0.0713
## 4 SK11 pH_1 pH_change 10 8.18 3 0.0424 Kruskal-Wallis 0.0713
## 5 Wg2 pH_1 pH_change 10 8.56 3 0.0358 Kruskal-Wallis 0.0713
## 6 Wg2 pH_2 pH_change 8 6.40 2 0.0407 Kruskal-Wallis 0.0713
## 7 Wg2 pH_6 pH_change 8 6.33 2 0.0423 Kruskal-Wallis 0.0713
## 8 pAK80 pH_6 pH_change 8 6.33 2 0.0423 Kruskal-Wallis 0.0713
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
kruskal_test(pH_change ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 6 x 9
## Protein Day .y. n statistic df p method p.adj
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein pH_1 pH_change 15 11.9 4 0.0182 Kruskal-Wallis 0.0468
## 2 Potato pH_1 pH_change 15 13.4 4 0.00943 Kruskal-Wallis 0.0468
## 3 Casein pH_2 pH_change 12 10.3 3 0.0164 Kruskal-Wallis 0.0468
## 4 Potato pH_2 pH_change 11 9.50 3 0.0234 Kruskal-Wallis 0.0468
## 5 Casein pH_6 pH_change 12 9.84 3 0.02 Kruskal-Wallis 0.0468
## 6 Potato pH_6 pH_change 11 8.77 3 0.0326 Kruskal-Wallis 0.0543
<- dat.clean %>%
pwc.strain filter(Protein %in% c("Casein", "Potato") & Day %in% c("pH_1")) %>%
group_by(Day, Protein) %>%
pairwise_wilcox_test(pH_change ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter (p < 0.5) pwc.strain
## # A tibble: 19 x 11
## Protein Day .y. group1 group2 n1 n2 statistic p p.adj
## <chr> <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Casein pH_1 pH_change 33 No 3 3 0 0.1 0.133
## 2 Casein pH_1 pH_change 33 pAK80 3 3 0 0.1 0.133
## 3 Casein pH_1 pH_change 33 SK11 3 3 0 0.064 0.133
## 4 Casein pH_1 pH_change 33 Wg2 3 3 1 0.2 0.211
## 5 Casein pH_1 pH_change No SK11 3 3 7.5 0.197 0.211
## 6 Casein pH_1 pH_change No Wg2 3 3 9 0.1 0.133
## 7 Casein pH_1 pH_change pAK80 SK11 3 3 7.5 0.197 0.211
## 8 Casein pH_1 pH_change pAK80 Wg2 3 3 9 0.1 0.133
## 9 Casein pH_1 pH_change SK11 Wg2 3 3 9 0.064 0.133
## 10 Potato pH_1 pH_change 33 No 3 3 0 0.076 0.133
## 11 Potato pH_1 pH_change 33 pAK80 3 3 0 0.064 0.133
## 12 Potato pH_1 pH_change 33 SK11 3 3 0 0.1 0.133
## 13 Potato pH_1 pH_change 33 Wg2 3 3 0 0.076 0.133
## 14 Potato pH_1 pH_change No pAK80 3 3 9 0.059 0.133
## 15 Potato pH_1 pH_change No SK11 3 3 9 0.076 0.133
## 16 Potato pH_1 pH_change No Wg2 3 3 9 0.072 0.133
## 17 Potato pH_1 pH_change pAK80 SK11 3 3 7.5 0.197 0.211
## 18 Potato pH_1 pH_change pAK80 Wg2 3 3 9 0.059 0.133
## 19 Potato pH_1 pH_change SK11 Wg2 3 3 9 0.076 0.133
## # i 1 more variable: p.adj.signif <chr>
3.4.3 pH_1 - Strain & Protein - pH_change
This subset analysis focus on the first pH change, but the analysis did not contribute to the main conclusion of the paper. Therefore, this analysis is not presented or discussed in the paper.
# load data
load("R_objects/stat_test_data_pH_change.RData")
# Create subset
<-
dat.clean %>%
dat.clean.wide filter(!is.na(pH_change) & Day == "pH_1")
3.4.3.1 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain")
PREDICTOR <- "pH_change"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 20 x 6
## Strain Protein variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein pH_change 3 -2.21 0.015
## 2 No Casein pH_change 3 -0.23 0.01
## 3 SK11 Casein pH_change 3 -0.24 0
## 4 Wg2 Casein pH_change 3 -2.17 0.026
## 5 pAK80 Casein pH_change 3 -0.207 0.049
## 6 33 Hydrolysate pH_change 2 -0.23 0
## 7 No Hydrolysate pH_change 2 -0.22 0.028
## 8 SK11 Hydrolysate pH_change 2 -0.2 0
## 9 Wg2 Hydrolysate pH_change 2 -0.19 0.014
## 10 pAK80 Hydrolysate pH_change 2 -0.16 0.014
## 11 33 Potato pH_change 3 -0.457 0.032
## 12 No Potato pH_change 3 -0.197 0.023
## 13 SK11 Potato pH_change 3 -0.257 0.021
## 14 Wg2 Potato pH_change 3 -0.357 0.006
## 15 pAK80 Potato pH_change 3 -0.24 0
## 16 33 Water pH_change 2 -0.175 0.007
## 17 No Water pH_change 2 -0.16 0
## 18 SK11 Water pH_change 2 -0.155 0.021
## 19 Wg2 Water pH_change 2 -0.12 0
## 20 pAK80 Water pH_change 2 -0.11 0.014
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
3.4.3.2 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## [1] Strain Protein replicate Day0 Day1 Day2
## [7] Day6 norm_1 norm_2 norm_6 Day pH_change
## [13] is.outlier is.extreme
## <0 rækker> (eller 0-længde row.names)
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model
residuals.The Shapiro-Wilk test is performed on all data groups when
possible with n > 2 and only with unique values.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.942 0.0156
# Compute Shapiro-Wilk test of normality for each data group
<- dat.clean %>%
sets group_by(Day, Protein, Strain) %>%
summarise(n_unique = n_distinct(pH_change),
n = n()) %>%
filter(n > 2, n_unique > 1) %>%
transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
%>%
dat.clean mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
filter(day_protein_strain %in% sets$day_protein_strain) %>%
group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "pH_change") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 2 x 7
## Strain Protein Day variable statistic p p.adj
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 No Potato pH_1 pH_change 0.75 0 0
## 2 Wg2 Potato pH_1 pH_change 0.75 0 0
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 19 30 0.708 0.782
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
3.4.3.3 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain, Day) %>%
kruskal_test(pH_change ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 3 x 9
## Strain Day .y. n statistic df p method p.adj
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 pH_1 pH_change 10 8.51 3 0.0366 Kruskal-Wallis 0.0707
## 2 SK11 pH_1 pH_change 10 8.18 3 0.0424 Kruskal-Wallis 0.0707
## 3 Wg2 pH_1 pH_change 10 8.56 3 0.0358 Kruskal-Wallis 0.0707
<- dat.clean %>%
pwc.protein filter(Protein %in% c("Casein", "Potato")) %>%
group_by(Strain, Day) %>%
pairwise_wilcox_test(pH_change ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc.protein
## # A tibble: 0 x 11
## # i 11 variables: Strain <chr>, Day <chr>, .y. <chr>, group1 <chr>,
## # group2 <chr>, n1 <int>, n2 <int>, statistic <dbl>, p <dbl>, p.adj <dbl>,
## # p.adj.signif <chr>
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
kruskal_test(pH_change ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 2 x 9
## Protein Day .y. n statistic df p method p.adj
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein pH_1 pH_change 15 11.9 4 0.0182 Kruskal-Wallis 0.0364
## 2 Potato pH_1 pH_change 15 13.4 4 0.00943 Kruskal-Wallis 0.0364
<- dat.clean %>%
pwc.strain filter(Protein %in% c("Casein", "Potato")) %>%
group_by(Protein) %>%
pairwise_wilcox_test(pH_change ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc.strain
## # A tibble: 0 x 10
## # i 10 variables: Protein <chr>, .y. <chr>, group1 <chr>, group2 <chr>,
## # n1 <int>, n2 <int>, statistic <dbl>, p <dbl>, p.adj <dbl>,
## # p.adj.signif <chr>
3.5 CASEIN AND POTATO - Protein and Strain and Day - logN
3.5.1 DATA
# load data
load("R_objects/stat_test_data_growth.RData")
# Create subset
<- dat %>%
dat.clean filter(!is.na(CFU) & Day != 0) %>%
subset(Protein %in% c("Potato", "Casein"))%>%
mutate(Day = as.factor(Day))
3.5.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain","Day")
PREDICTOR <- "logN"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 16 x 7
## Strain Protein Day variable n mean sd
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein 1 logN 3 9.21 0.056
## 2 33 Casein 6 logN 2 4.11 0.038
## 3 SK11 Casein 1 logN 3 7.79 0.077
## 4 SK11 Casein 6 logN 3 6.17 0.785
## 5 Wg2 Casein 1 logN 3 9.13 0.019
## 6 Wg2 Casein 6 logN 2 4.46 0.025
## 7 pAK80 Casein 1 logN 3 7.38 0.075
## 8 pAK80 Casein 6 logN 4 7.30 1.02
## 9 33 Potato 1 logN 3 8.98 0.091
## 10 33 Potato 6 logN 4 6.82 0.034
## 11 SK11 Potato 1 logN 3 8.20 0.509
## 12 SK11 Potato 6 logN 4 7.15 0.14
## 13 Wg2 Potato 1 logN 3 8.79 0.051
## 14 Wg2 Potato 6 logN 4 6.75 0.101
## 15 pAK80 Potato 1 logN 3 8.26 0.107
## 16 pAK80 Potato 6 logN 4 6.86 0.176
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
3.5.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 2 x 9
## Strain Protein Day Sample pH CFU logN is.outlier is.extreme
## <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 SK11 Potato 6 SK11-P1-D6 5.99 8800000 6.94 TRUE FALSE
## 2 Wg2 Potato 6 Wg2-P2-D6 5 4020000 6.60 TRUE FALSE
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model
residuals.The Shapiro-Wilk test is performed on all data groups when
possible with n > 2 and only with unique values.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.724 0.0000000182
# Compute Shapiro-Wilk test of normality for each data group
<- dat.clean %>%
sets group_by(Day, Protein, Strain) %>%
summarise(n_unique = n_distinct(logN),
n = n()) %>%
filter(n > 2, n_unique > 1) %>%
transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
%>%
dat.clean mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
filter(day_protein_strain %in% sets$day_protein_strain) %>%
group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "logN") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 1 x 7
## Strain Protein Day variable statistic p p.adj
## <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl>
## 1 SK11 Casein 1 logN 0.75 0 0
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 15 35 1.47 0.171
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
3.5.4 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Day, Strain) %>%
kruskal_test(logN ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 3 x 9
## Strain Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 1 logN 6 3.86 1 0.0495 Kruskal-Wallis 0.103
## 2 Wg2 1 logN 6 3.86 1 0.0495 Kruskal-Wallis 0.103
## 3 pAK80 1 logN 6 3.86 1 0.0495 Kruskal-Wallis 0.103
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
kruskal_test(logN ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 4 x 9
## Protein Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein 1 logN 12 10.2 3 0.0166 Kruskal-Wallis 0.0407
## 2 Potato 1 logN 12 9.46 3 0.0237 Kruskal-Wallis 0.0407
## 3 Casein 6 logN 11 8.21 3 0.0418 Kruskal-Wallis 0.0418
## 4 Potato 6 logN 16 8.91 3 0.0305 Kruskal-Wallis 0.0407
<- dat.clean %>%
pwc group_by(Day, Protein) %>%
pairwise_wilcox_test(logN ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 2 x 11
## Protein Day .y. group1 group2 n1 n2 statistic p p.adj
## <chr> <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Potato 6 logN 33 SK11 4 4 0 0.029 0.200
## 2 Potato 6 logN SK11 Wg2 4 4 16 0.029 0.200
## # i 1 more variable: p.adj.signif <chr>
3.5.5 VISUALIZE
All CFU data
# load data
load("R_objects/stat_test_data_growth.RData")
# Create subset
<- dat %>% filter(!is.na(CFU) & Day != 0) %>%
dat.clean mutate(Day = as.factor(Day))
# Set names of variables
<- c("Protein","Strain","Day")
PREDICTOR <- "logN"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 31 x 7
## Strain Protein Day variable n mean sd
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein 1 logN 3 9.21 0.056
## 2 33 Casein 6 logN 2 4.11 0.038
## 3 SK11 Casein 1 logN 3 7.79 0.077
## 4 SK11 Casein 6 logN 3 6.17 0.785
## 5 Wg2 Casein 1 logN 3 9.13 0.019
## 6 Wg2 Casein 6 logN 2 4.46 0.025
## 7 pAK80 Casein 1 logN 3 7.38 0.075
## 8 pAK80 Casein 6 logN 4 7.30 1.02
## 9 33 Hydrolysate 6 logN 2 2.20 3.12
## 10 SK11 Hydrolysate 1 logN 2 7.27 0.09
## # i 21 more rows
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain, levels = c("pAK80", "SK11", "Wg2", "33"))
dat.clean$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato", "Hydrolysate", "Water"))
dat.clean
#Plot - boxplot
<- dat.clean %>%
bxp.logN.sub filter(., Protein %in% c("Casein", "Potato", "Water")) %>%
filter (., Day %in% c(1, 2, 6)) %>%
ggboxplot(.,
x = "Day",
y = "logN",
color = "Strain",
facet.by = "Protein")+
labs(x="Time [days]", y="log10(CFU/mL)")+
scale_y_continuous(limits = c(0, 10))+
geom_hline(yintercept=6.8, linetype="dashed", color = "#7d7d7dff")+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM",
"Water"="CDM")))+
scale_color_manual(values = pal_rickandmorty("schwifty")(5)[2:5],
labels=c("MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)"))+
theme_light()+
theme(legend.position="bottom")
bxp.logN.sub
#Save plot
ggsave(filename = "plots/bxp.logN_all_sub.svg", plot = bxp.logN.sub, width = 16.5, height = 10, units =c("cm"), dpi = 600)
Figure S4
4 DATA SET: Organic acids
4.1 DATA
4.1.1 LOAD
<- read_excel("./data/Data_collection.xlsx", sheet="Organic_compounds", na = "NA") dat
4.1.2 LOOK
# Take a glimpse
glimpse(dat)
## Rows: 93
## Columns: 12
## $ `First Injection` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ Sample <chr> "pAK80-P1-D1", "pAK80-P2-D1", "pAK80-P3-D1", "pAK80-~
## $ Strain <chr> "pAK80", "pAK80", "pAK80", "pAK80", "pAK80", "pAK80"~
## $ Protein <chr> "Potato", "Potato", "Potato", "Casein", "Casein", "C~
## $ Day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ Lactose <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Glucose <dbl> 26.2580, 26.7594, 27.0690, 30.0366, 30.3484, 30.0170~
## $ Pyruvate <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00~
## $ Lactate <dbl> 6.7332, 6.8342, 6.8214, 1.6452, 1.8118, 1.7192, 13.7~
## $ Formate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Acetate <dbl> 1.5994, 1.6274, 1.6640, 1.4928, 1.3780, 1.2972, 1.68~
## $ Ethanol <dbl> 3.8348, 3.1974, 2.7324, 3.0098, 3.3522, 2.8502, 2.93~
# Make a explorative summary
::skim(dat) skimr
Name | dat |
Number of rows | 93 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1 | 8 | 11 | 0 | 93 | 0 |
Strain | 0 | 1 | 2 | 5 | 0 | 5 | 0 |
Protein | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
First Injection | 0 | 1 | 47.00 | 26.99 | 1 | 24.00 | 47.00 | 70.00 | 93.00 | <U+2587><U+2587><U+2587><U+2587><U+2587> |
Day | 0 | 1 | 3.40 | 2.30 | 1 | 1.00 | 2.00 | 6.00 | 6.00 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
Lactose | 0 | 1 | 0.01 | 0.01 | 0 | 0.00 | 0.00 | 0.02 | 0.03 | <U+2587><U+2581><U+2581><U+2582><U+2583> |
Glucose | 0 | 1 | 15.40 | 12.76 | 0 | 0.02 | 20.87 | 27.91 | 30.81 | <U+2587><U+2582><U+2581><U+2583><U+2587> |
Pyruvate | 0 | 1 | 0.04 | 0.06 | 0 | 0.00 | 0.00 | 0.08 | 0.22 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Lactate | 0 | 1 | 25.67 | 21.85 | 0 | 4.57 | 17.74 | 49.36 | 57.80 | <U+2587><U+2583><U+2581><U+2582><U+2587> |
Formate | 0 | 1 | 0.07 | 0.16 | 0 | 0.00 | 0.00 | 0.00 | 0.60 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Acetate | 0 | 1 | 1.54 | 0.24 | 1 | 1.41 | 1.51 | 1.67 | 2.16 | <U+2582><U+2586><U+2587><U+2582><U+2582> |
Ethanol | 0 | 1 | 2.78 | 0.48 | 0 | 2.60 | 2.84 | 3.01 | 4.23 | <U+2581><U+2581><U+2582><U+2587><U+2581> |
4.1.3 CLEAN
#Reorganization of data
<- pivot_longer(dat, cols=6:12, names_to="compound", values_to = "concentration")
dat
# Remove column
<- c("First Injection")
out.var
<- dat %>% select(-one_of(out.var))
dat
# Change variables
<- dat %>% mutate(Day = as.factor(Day))
dat
# Look at cleaned data
::skim(dat) skimr
Name | dat |
Number of rows | 651 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 4 |
factor | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1 | 8 | 11 | 0 | 93 | 0 |
Strain | 0 | 1 | 2 | 5 | 0 | 5 | 0 |
Protein | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
compound | 0 | 1 | 7 | 8 | 0 | 7 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Day | 0 | 1 | FALSE | 3 | 6: 280, 1: 210, 2: 161 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
concentration | 0 | 1 | 6.5 | 13.34 | 0 | 0 | 1.16 | 2.93 | 57.8 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
4.1.4 SAVE
# Save cleaned data
save(dat, file = "R_objects/stat_test_data_OA.RData")
# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())
4.2 TEST INDEPENDENCE
4.2.1 CATEGORICAL VARIABLES
Chi squared test of independence.
# load data
load("R_objects/stat_test_data_OA.RData")
# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
<- dat %>% mutate(across(where(is_character), as_factor))
dat
# Create vector of categorical variables with more than two categories and less than half of the number of samples.
<- dat %>%
cat_var select_if(is.factor) %>%
select(where(~n_distinct(.) > 1)) %>%
select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
colnames()
length(cat_var) # If less than 2 variables the following code should not be run
## [1] 5
# Test the variables pairwise
%>%
cat_var combn(2) %>%
t() %>%
as_tibble() %>%
%>%
rowwise mutate(chisq_test = list(
table(dat[[V1]], dat[[V2]]) %>% chisq.test()
),chisq_pval = chisq_test$p.value
)
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
## chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## i Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 3 remaining warnings.
## # A tibble: 10 x 4
## # Rowwise:
## V1 V2 chisq_test chisq_pval
## <chr> <chr> <list> <dbl>
## 1 Sample Strain <htest> 0
## 2 Sample Protein <htest> 4.88e- 85
## 3 Sample Day <htest> 1.76e-167
## 4 Sample compound <htest> 1 e+ 0
## 5 Strain Protein <htest> 9.90e- 1
## 6 Strain Day <htest> 5.72e- 6
## 7 Strain compound <htest> 1 e+ 0
## 8 Protein Day <htest> 8.92e- 1
## 9 Protein compound <htest> 1 e+ 0
## 10 Day compound <htest> 1 e+ 0
4.2.2 QUANTITATIVE VARIABLES
# create vector with the relevant variables
<- dat %>% select(where(is.numeric)) %>% colnames()
CON.VARS
CON.VARS
## [1] "concentration"
4.3 Glucose - Strain and Protein - Concentration
# load data
load("R_objects/stat_test_data_OA.RData")
# Create subset
<- dat %>% filter(compound == "Glucose", Day %in% c("1", "2", "6")) dat.clean
4.3.1 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain", "Day")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 28 x 7
## Strain Protein Day variable n mean sd
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein 1 concentration 3 0 0
## 2 33 Casein 2 concentration 3 0.002 0.004
## 3 33 Casein 6 concentration 4 0.005 0.005
## 4 No Casein 1 concentration 3 30.6 0.13
## 5 No Casein 6 concentration 4 29.4 0.592
## 6 SK11 Casein 1 concentration 3 29.3 1.10
## 7 SK11 Casein 2 concentration 3 27.8 0.212
## 8 SK11 Casein 6 concentration 4 22.3 0.382
## 9 Wg2 Casein 1 concentration 3 0 0
## 10 Wg2 Casein 2 concentration 3 0.009 0.007
## # i 18 more rows
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
4.3.2 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 5 x 8
## Strain Protein Day Sample compound concentration is.outlier is.extreme
## <chr> <chr> <fct> <chr> <chr> <dbl> <lgl> <lgl>
## 1 No Casein 6 No-C1-D6 Glucose 30.2 TRUE FALSE
## 2 Wg2 Casein 6 Wg2-C3-D6 Glucose 29.8 TRUE FALSE
## 3 pAK80 Casein 6 pAK80-C4-D6 Glucose 16.7 TRUE FALSE
## 4 No Potato 6 No-P4-D6 Glucose 0 TRUE FALSE
## 5 pAK80 Potato 6 pAK80-P4-D6 Glucose 7.92 TRUE FALSE
Look at outliers
Outliers are derfined with a coefficient (coef=1.5), which specify how
far the outlier is from the edge of their boxes.
Extreme outliers are defined with a coef=3.0. However, two outliers appear to be extreme in the boxplot (Samples Wg2-C3-D6 and No-P4-D6) why they are tested with different coefficients. Both points are outliers with high coefficient values that are close to 3. Therefore, we defined outliers as extreme with coef above 2.59. The extreme outliers are removed.
#Test coef for outlier Wg2-C3-D6
%>%
dat.clean filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.995)
## [1] FALSE FALSE TRUE FALSE
#Test coef for outlier No-P4-D6
%>%
dat.clean filter(Strain == "No" & Protein == "Potato" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.88)
## [1] FALSE FALSE FALSE TRUE
#Test coef for outlier pAK80-C4-D6
%>%
dat.clean filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE FALSE
#Remove extreme outliers
<- c("Wg2-C3-D6", "No-P4-D6")
extreme_outliers
<- subset(dat.clean, !Sample %in% extreme_outliers) dat.clean
# Create plot without extreme outliers
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.468 1.50e-16
# Compute Shapiro-Wilk test of normality for each data group
<- dat.clean %>%
sets group_by(Day, Protein, Strain) %>%
summarise(n_unique = n_distinct(concentration),
n = n()) %>%
filter(n > 2, n_unique > 1) %>%
transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
%>%
dat.clean mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
filter(day_protein_strain %in% sets$day_protein_strain) %>%
group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 4 x 7
## Strain Protein Day variable statistic p p.adj
## <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl>
## 1 33 Casein 2 concentration 0.75 0 0
## 2 Wg2 Casein 2 concentration 0.75 0 0
## 3 No Casein 6 concentration 0.665 0.00415 0.0259
## 4 Wg2 Casein 6 concentration 0.75 0 0
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 27 63 1.36 0.156
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
4.3.3 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Day, Strain) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 11 x 9
## Strain Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 1 concentration 6 4.35 1 0.0369 Kruskal-Wallis 0.063
## 2 SK11 1 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.063
## 3 Wg2 1 concentration 6 4.35 1 0.0369 Kruskal-Wallis 0.063
## 4 pAK80 1 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.063
## 5 33 2 concentration 6 3.97 1 0.0463 Kruskal-Wallis 0.063
## 6 Wg2 2 concentration 6 3.97 1 0.0463 Kruskal-Wallis 0.063
## 7 pAK80 2 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.063
## 8 33 6 concentration 8 5.40 1 0.0202 Kruskal-Wallis 0.063
## 9 SK11 6 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.063
## 10 Wg2 6 concentration 7 4.58 1 0.0323 Kruskal-Wallis 0.063
## 11 pAK80 6 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.063
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
kruskal_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 6 x 9
## Protein Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein 1 concentration 15 13.2 4 0.0104 Kruskal-Wal~ 0.0182
## 2 Potato 1 concentration 15 12.8 4 0.0121 Kruskal-Wal~ 0.0182
## 3 Casein 2 concentration 12 9.84 3 0.02 Kruskal-Wal~ 0.024
## 4 Potato 2 concentration 11 9.41 3 0.0243 Kruskal-Wal~ 0.0243
## 5 Casein 6 concentration 19 16.1 4 0.00289 Kruskal-Wal~ 0.00867
## 6 Potato 6 concentration 19 16.3 4 0.00265 Kruskal-Wal~ 0.00867
<- dat.clean %>%
pwc filter(Day == 6) %>%
group_by(Day, Protein) %>%
pairwise_wilcox_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 10 x 11
## Protein Day .y. group1 group2 n1 n2 statistic p p.adj
## <chr> <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Casein 6 concentration 33 No 4 4 0 0.029 0.058
## 2 Casein 6 concentration 33 pAK80 4 4 0 0.029 0.058
## 3 Casein 6 concentration 33 SK11 4 4 0 0.029 0.058
## 4 Casein 6 concentration No pAK80 4 4 16 0.029 0.058
## 5 Casein 6 concentration No SK11 4 4 16 0.029 0.058
## 6 Potato 6 concentration 33 pAK80 4 4 0 0.029 0.058
## 7 Potato 6 concentration 33 SK11 4 4 0 0.029 0.058
## 8 Potato 6 concentration pAK80 SK11 4 4 16 0.029 0.058
## 9 Potato 6 concentration pAK80 Wg2 4 4 16 0.029 0.058
## 10 Potato 6 concentration SK11 Wg2 4 4 16 0.029 0.058
## # i 1 more variable: p.adj.signif <chr>
4.3.4 VISUALIZE
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "33"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)")
)$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Plot - boxplot
<- dat.clean %>%
bxp.glucose filter(., Protein %in% c("Casein", "Potato")) %>%
filter (., Day %in% c(1, 2, 6)) %>%
ggboxplot(.,
x = "Day",
y = "concentration",
color = "Strain",
facet.by = "Protein")+
labs(x="Time [days]", y="Concentration [mM]")+
scale_y_continuous()+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM")
)+
)scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)geom_hline(yintercept=30.0, linetype="dashed", color = "#7d7d7dff")+
theme_light()+
theme(legend.position="bottom")
save(bxp.glucose, file = "R_objects/Fig1B.RData")
4.4 Lactate - Strain and Protein - Concentration
# load data
load("R_objects/stat_test_data_OA.RData")
# Create subset
<- dat %>% filter(compound == "Lactate", Day %in% c("1", "2", "6"))
dat.clean
# Look at cleaned data
glimpse(dat.clean)
## Rows: 93
## Columns: 6
## $ Sample <chr> "pAK80-P1-D1", "pAK80-P2-D1", "pAK80-P3-D1", "pAK80-C1-D~
## $ Strain <chr> "pAK80", "pAK80", "pAK80", "pAK80", "pAK80", "pAK80", "W~
## $ Protein <chr> "Potato", "Potato", "Potato", "Casein", "Casein", "Casei~
## $ Day <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ compound <chr> "Lactate", "Lactate", "Lactate", "Lactate", "Lactate", "~
## $ concentration <dbl> 6.7332, 6.8342, 6.8214, 1.6452, 1.8118, 1.7192, 13.7722,~
::skim(dat.clean) skimr
Name | dat.clean |
Number of rows | 93 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 4 |
factor | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1 | 8 | 11 | 0 | 93 | 0 |
Strain | 0 | 1 | 2 | 5 | 0 | 5 | 0 |
Protein | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
compound | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Day | 0 | 1 | FALSE | 3 | 6: 40, 1: 30, 2: 23 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
concentration | 0 | 1 | 25.67 | 21.85 | 0 | 4.57 | 17.74 | 49.36 | 57.8 | <U+2587><U+2583><U+2581><U+2582><U+2587> |
4.5 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain", "Day")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 28 x 7
## Strain Protein Day variable n mean sd
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein 1 concentration 3 54.1 1.90
## 2 33 Casein 2 concentration 3 50.9 2.36
## 3 33 Casein 6 concentration 4 49.5 0.838
## 4 No Casein 1 concentration 3 0.115 0.021
## 5 No Casein 6 concentration 4 0.162 0.039
## 6 SK11 Casein 1 concentration 3 2.28 0.113
## 7 SK11 Casein 2 concentration 3 4.90 0.305
## 8 SK11 Casein 6 concentration 4 12.1 0.738
## 9 Wg2 Casein 1 concentration 3 53.1 0.489
## 10 Wg2 Casein 2 concentration 3 54.8 2.38
## # i 18 more rows
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
4.5.1 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 8
## Strain Protein Day Sample compound concentration is.outlier is.extreme
## <chr> <chr> <fct> <chr> <chr> <dbl> <lgl> <lgl>
## 1 Wg2 Casein 6 Wg2-C3-D6 Lactate 0.15 TRUE FALSE
## 2 pAK80 Casein 6 pAK80-C4-D6 Lactate 20.9 TRUE FALSE
## 3 No Potato 6 No-P4-D6 Lactate 47.9 TRUE FALSE
Look at outliers
Outliers are derfined with a coefficient (coef=1.5), which specify how far the outliers are from the edge of their box.
Points are extreme outliers if they have high coefficient values that are close to 3. Therefore, we defined outliers as extreme with coef above 2.59. The extreme outliers are removed.
#Test coef for outlier Wg2-C3-D6
%>%
dat.clean filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE TRUE FALSE
#Test coef for outlier No-P4-D6
%>%
dat.clean filter(Strain == "No" & Protein == "Potato" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.98)
## [1] FALSE FALSE FALSE TRUE
#Test coef for outlier pAK80-C4-D6
%>%
dat.clean filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.55)
## [1] FALSE FALSE FALSE FALSE
#Remove extreme outliers
<- c("Wg2-C3-D6", "No-P4-D6")
extreme_outliers
<- subset(dat.clean, !Sample %in% extreme_outliers) dat.clean
# Create plot without extreme outliers
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.The Shapiro-Wilk test is performed on all data groups when possible with n > 2 and only with unique values.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.706 3.32e-12
# Compute Shapiro-Wilk test of normality for each data group
<- dat.clean %>%
sets group_by(Day, Protein, Strain) %>%
summarise(n_unique = n_distinct(concentration),
n = n()) %>%
filter(n > 2, n_unique > 1) %>%
transmute(day_protein_strain = paste(Day, Protein, Strain))
## `summarise()` has grouped output by 'Day', 'Protein'. You can override using
## the `.groups` argument.
%>%
dat.clean mutate(day_protein_strain = paste(Day, Protein, Strain)) %>%
filter(day_protein_strain %in% sets$day_protein_strain) %>%
group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 0 x 7
## # i 7 variables: Strain <chr>, Protein <chr>, Day <fct>, variable <chr>,
## # statistic <dbl>, p <dbl>, p.adj <dbl>
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 27 63 0.956 0.537
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
4.5.2 PERFORM TEST
The data are not normal distributed, but normality is found within each group by the Shapiro-Wilk normality test. However, the same data groups of the glucose data were not normal distributed.
The same statistical analysis is applied for the two data sets, including a Kruskal-Wallis test and a Wilcoxon pairwise test.
# for each protein source
<- dat.clean %>%
res.protein group_by(Day, Strain) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 10 x 9
## Strain Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 1 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.0693
## 2 SK11 1 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.0693
## 3 Wg2 1 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.0693
## 4 pAK80 1 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.0693
## 5 Wg2 2 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.0693
## 6 pAK80 2 concentration 6 3.86 1 0.0495 Kruskal-Wallis 0.0693
## 7 33 6 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0693
## 8 SK11 6 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0693
## 9 Wg2 6 concentration 7 4.5 1 0.0339 Kruskal-Wallis 0.0693
## 10 pAK80 6 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0693
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
kruskal_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 6 x 9
## Protein Day .y. n statistic df p method p.adj
## <chr> <fct> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein 1 concentration 15 12.8 4 0.0121 Kruskal-Wal~ 0.0182
## 2 Potato 1 concentration 15 12.9 4 0.0118 Kruskal-Wal~ 0.0182
## 3 Casein 2 concentration 12 9.67 3 0.0216 Kruskal-Wal~ 0.0243
## 4 Potato 2 concentration 11 9.41 3 0.0243 Kruskal-Wal~ 0.0243
## 5 Casein 6 concentration 19 16.1 4 0.00293 Kruskal-Wal~ 0.00879
## 6 Potato 6 concentration 19 16.3 4 0.00263 Kruskal-Wal~ 0.00879
<- dat.clean %>%
pwc filter(Day == 6) %>%
group_by(Day, Protein) %>%
pairwise_wilcox_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 10 x 11
## Protein Day .y. group1 group2 n1 n2 statistic p p.adj
## <chr> <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Casein 6 concentration 33 No 4 4 16 0.029 0.058
## 2 Casein 6 concentration 33 pAK80 4 4 16 0.029 0.058
## 3 Casein 6 concentration 33 SK11 4 4 16 0.029 0.058
## 4 Casein 6 concentration No pAK80 4 4 0 0.029 0.058
## 5 Casein 6 concentration No SK11 4 4 0 0.029 0.058
## 6 Potato 6 concentration 33 pAK80 4 4 16 0.029 0.058
## 7 Potato 6 concentration 33 SK11 4 4 16 0.029 0.058
## 8 Potato 6 concentration pAK80 SK11 4 4 0 0.029 0.058
## 9 Potato 6 concentration pAK80 Wg2 4 4 0 0.029 0.058
## 10 Potato 6 concentration SK11 Wg2 4 4 0 0.029 0.058
## # i 1 more variable: p.adj.signif <chr>
4.5.3 VISUALIZE
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "33"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)")
)$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Boxplot
<- dat.clean %>%
bxp.lactate filter(., Protein %in% c("Casein", "Potato")) %>%
filter (., Day %in% c(1, 2, 6)) %>%
ggboxplot(.,
x = "Day",
y = "concentration",
color = "Strain",
facet.by = "Protein")+
labs(x="Time [days]", y="Concentration [mM]")+
scale_y_continuous()+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM")
)+
)scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)
theme_light()+
theme(legend.position="bottom")
save(bxp.lactate, file = "R_objects/Fig1C.RData")
4.6 VISUALIZE: Organic acids together with pH
load(file = "R_objects/Fig1A.RData")
load(file = "R_objects/Fig1B.RData")
load(file = "R_objects/Fig1C.RData")
<- ggarrange(bxp.pH, bxp.glucose, bxp.lactate, heights = c(3,2,2),
fig1 ncol = 1,
labels = c("A","B", "C"),
align = "v",
legend = "right", common.legend = TRUE
)
#Save Figure 1
ggsave(filename = "plots/fig1_pH_glucose_lactate.svg", plot = fig1, width = 16.5, height = 20, units =c("cm"), dpi = 600)
Figure 1
5 DATA SET: Amino acids
5.1 DATA
5.1.1 LOAD
<- read_excel("./data/Data_collection.xlsx", sheet="AA", na = "NA") dat
5.1.2 LOOK
# Take a glimpse
glimpse(dat)
## Rows: 94
## Columns: 27
## $ Sample_ID <chr> "Wg2-C1-D1", "Wg2-C2-D1", "Wg2-C3-D1", "Sk11-C1-D1",~
## $ Sample <chr> "DC1-1", "DC1-2", "DC1-3", "CC1-1", "CC1-2", "CC1-3"~
## $ `First Injection` <chr> "50Lise", "51Lise", "52Lise", "56Lise", "57Lise", "5~
## $ `Sample no` <chr> "C1", "C2", "C3", "C1", "C2", "C3", "C1", "C2", "C3"~
## $ Day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2~
## $ Protein <chr> "Casein", "Casein", "Casein", "Casein", "Casein", "C~
## $ Strain <chr> "Wg2", "Wg2", "Wg2", "SK11", "SK11", "SK11", "pAK80"~
## $ conc <chr> "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM"~
## $ His <dbl> 18.4537472, 17.1673140, 18.5421378, 0.7142539, 0.794~
## $ Asn <dbl> 13.546499, 12.328580, 15.030006, 1.541399, 2.227930,~
## $ Ser <dbl> 6.097704, 6.053979, 6.150434, 4.579292, 5.554308, 5.~
## $ `Gln/Arg` <dbl> 3.3068133, 2.9704246, 3.1039723, 0.2580126, 0.286869~
## $ Gly <dbl> 59.60045, 58.21350, 57.65550, 66.11592, 83.48793, 82~
## $ Asp <dbl> 2.3255601, 2.5187825, 2.4126894, 0.5196090, 0.770297~
## $ Glu <dbl> 31.6012219, 31.9529499, 29.6124309, 0.7158399, 1.459~
## $ Thr <dbl> 3.6553179, 3.2991876, 3.7922678, 0.2722407, 0.605377~
## $ Ala <dbl> 22.832045, 22.878213, 26.319410, 1.558777, 1.925683,~
## $ Pro <dbl> 318.4714880, 309.0975224, 314.1835222, 4.9019494, 5.~
## $ Cys <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000000~
## $ Lys <dbl> 5.526448, 6.202491, 4.736371, 2.165367, 2.188678, 2.~
## $ Tyr <dbl> 95.12991, 90.49597, 96.35402, 15.21761, 16.79873, 17~
## $ Met <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.~
## $ Val <dbl> 17.711121, 17.242679, 18.731475, 3.099217, 3.132582,~
## $ Ile <dbl> 24.26801, 24.62159, 25.36100, 22.05858, 24.04160, 24~
## $ Leu <dbl> 25.0888213, 23.7935900, 25.8345528, 0.3067792, 0.511~
## $ Phe <dbl> 51.791264, 48.307623, 51.469286, 3.014791, 3.351971,~
## $ Trp <dbl> 36.56362, 35.95126, 37.81154, 26.58388, 29.87836, 30~
# Make a explorative summary
::skim(dat) skimr
Name | dat |
Number of rows | 94 |
Number of columns | 27 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 20 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample_ID | 0 | 1 | 8 | 11 | 0 | 94 | 0 |
Sample | 0 | 1 | 5 | 5 | 0 | 94 | 0 |
First Injection | 0 | 1 | 5 | 6 | 0 | 94 | 0 |
Sample no | 0 | 1 | 2 | 2 | 0 | 8 | 0 |
Protein | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Strain | 0 | 1 | 2 | 7 | 0 | 5 | 0 |
conc | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Day | 0 | 1.00 | 3.38 | 2.30 | 1.00 | 1.00 | 2.00 | 6.00 | 6.00 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
His | 1 | 0.99 | 9.17 | 17.53 | 0.00 | 0.00 | 0.79 | 2.06 | 59.98 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Asn | 1 | 0.99 | 9.70 | 12.56 | 0.61 | 2.06 | 2.76 | 12.97 | 56.05 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Ser | 1 | 0.99 | 8.31 | 6.85 | 2.55 | 4.73 | 5.89 | 8.30 | 47.89 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Gln/Arg | 1 | 0.99 | 4.18 | 4.86 | 0.00 | 0.87 | 3.01 | 4.67 | 33.49 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Gly | 1 | 0.99 | 75.34 | 12.36 | 48.38 | 68.05 | 76.24 | 82.79 | 134.98 | <U+2583><U+2587><U+2583><U+2581><U+2581> |
Asp | 1 | 0.99 | 10.23 | 10.64 | 0.19 | 2.10 | 5.96 | 15.02 | 37.30 | <U+2587><U+2583><U+2581><U+2582><U+2581> |
Glu | 1 | 0.99 | 24.65 | 29.81 | 0.52 | 1.44 | 9.45 | 32.59 | 104.82 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Thr | 1 | 0.99 | 2.83 | 3.09 | 0.27 | 0.60 | 1.09 | 4.37 | 18.32 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
Ala | 1 | 0.99 | 30.80 | 35.96 | 0.95 | 3.52 | 14.27 | 42.82 | 172.54 | <U+2587><U+2583><U+2581><U+2581><U+2581> |
Pro | 1 | 0.99 | 111.49 | 174.20 | 0.00 | 6.27 | 20.78 | 62.16 | 647.38 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Cys | 1 | 0.99 | 0.73 | 2.36 | 0.00 | 0.00 | 0.00 | 0.00 | 14.34 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Lys | 1 | 0.99 | 18.36 | 27.02 | 0.00 | 2.17 | 5.48 | 20.61 | 114.74 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Tyr | 1 | 0.99 | 43.20 | 43.24 | 0.82 | 16.80 | 26.27 | 51.81 | 229.70 | <U+2587><U+2581><U+2582><U+2581><U+2581> |
Met | 1 | 0.99 | 0.78 | 1.82 | 0.00 | 0.00 | 0.00 | 0.00 | 11.50 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Val | 1 | 0.99 | 9.63 | 12.18 | 0.31 | 2.56 | 3.17 | 13.16 | 65.19 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Ile | 1 | 0.99 | 16.13 | 10.40 | 0.00 | 3.48 | 22.05 | 24.10 | 35.35 | <U+2585><U+2581><U+2581><U+2587><U+2581> |
Leu | 1 | 0.99 | 14.64 | 25.21 | 0.00 | 0.51 | 1.57 | 19.97 | 134.90 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Phe | 1 | 0.99 | 31.41 | 33.76 | 0.47 | 3.52 | 15.24 | 60.94 | 163.17 | <U+2587><U+2583><U+2582><U+2581><U+2581> |
Trp | 1 | 0.99 | 22.47 | 14.84 | 0.00 | 2.28 | 29.48 | 30.87 | 44.19 | <U+2585><U+2581><U+2581><U+2587><U+2582> |
5.1.3 CLEAN DATA
# Reorganization of data
<- pivot_longer(dat, cols=9:27, names_to="aa", values_to = "concentration")
dat
# Change variables
<- dat %>% mutate(Day = as.factor(Day))
dat
# Look at cleaned data
::skim(dat) skimr
Name | dat |
Number of rows | 1786 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 8 |
factor | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample_ID | 0 | 1 | 8 | 11 | 0 | 94 | 0 |
Sample | 0 | 1 | 5 | 5 | 0 | 94 | 0 |
First Injection | 0 | 1 | 5 | 6 | 0 | 94 | 0 |
Sample no | 0 | 1 | 2 | 2 | 0 | 8 | 0 |
Protein | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Strain | 0 | 1 | 2 | 7 | 0 | 5 | 0 |
conc | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
aa | 0 | 1 | 3 | 7 | 0 | 19 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Day | 0 | 1 | FALSE | 3 | 6: 760, 1: 570, 2: 456 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
concentration | 19 | 0.99 | 23.37 | 52.19 | 0 | 1.12 | 5.55 | 26.67 | 647.38 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
glimpse(dat)
## Rows: 1,786
## Columns: 10
## $ Sample_ID <chr> "Wg2-C1-D1", "Wg2-C1-D1", "Wg2-C1-D1", "Wg2-C1-D1", ~
## $ Sample <chr> "DC1-1", "DC1-1", "DC1-1", "DC1-1", "DC1-1", "DC1-1"~
## $ `First Injection` <chr> "50Lise", "50Lise", "50Lise", "50Lise", "50Lise", "5~
## $ `Sample no` <chr> "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1"~
## $ Day <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ Protein <chr> "Casein", "Casein", "Casein", "Casein", "Casein", "C~
## $ Strain <chr> "Wg2", "Wg2", "Wg2", "Wg2", "Wg2", "Wg2", "Wg2", "Wg~
## $ conc <chr> "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM", "uM"~
## $ aa <chr> "His", "Asn", "Ser", "Gln/Arg", "Gly", "Asp", "Glu",~
## $ concentration <dbl> 18.453747, 13.546499, 6.097704, 3.306813, 59.600447,~
5.1.4 SAVE
# Save cleaned data
save(dat, file = "R_objects/stat_test_data_AA.RData")
# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())
5.2 TEST INDEPENDENCE
5.2.1 CATEGORICAL VARIABLES
Chi squared test of independence
# load data
load("R_objects/stat_test_data_AA.RData")
# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
<- dat %>% mutate(across(where(is_character), as_factor))
dat
# Create vector of categorical variables with more than two categories and less than half of the number of samples.
<- dat %>%
cat_var select_if(is.factor) %>%
select(where(~n_distinct(.) > 1)) %>%
select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
colnames()
length(cat_var) # If less than 2 variables the following code should not be run
## [1] 8
# Test the variables pairwise
%>%
cat_var combn(2) %>%
t() %>%
as_tibble() %>%
%>%
rowwise mutate(chisq_test = list(
table(dat[[V1]], dat[[V2]]) %>% chisq.test()
),chisq_pval = chisq_test$p.value
)
## Warning: There were 15 warnings in `mutate()`.
## The first warning was:
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
## chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## i Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 14 remaining warnings.
## # A tibble: 28 x 4
## # Rowwise:
## V1 V2 chisq_test chisq_pval
## <chr> <chr> <list> <dbl>
## 1 Sample_ID Sample <htest> 0
## 2 Sample_ID First Injection <htest> 0
## 3 Sample_ID Sample no <htest> 0
## 4 Sample_ID Day <htest> 0
## 5 Sample_ID Protein <htest> 3.58e-311
## 6 Sample_ID Strain <htest> 0
## 7 Sample_ID aa <htest> 1 e+ 0
## 8 Sample First Injection <htest> 0
## 9 Sample Sample no <htest> 0
## 10 Sample Day <htest> 0
## # i 18 more rows
5.2.2 QUANTITATIVE VARIABLES
# create vector with the relevant variables
<- dat %>% select(where(is.numeric)) %>% colnames()
CON.VARS
CON.VARS
## [1] "concentration"
5.3 Casein - ALL amino acids
5.3.1 DATA
# load data
load("R_objects/stat_test_data_AA.RData")
# data subset
<- dat %>%
dat.clean filter(., Protein == "Casein", !is.na(concentration))
# Set names of variables
<- c("Strain", "Day", "aa")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 266 x 7
## Day Strain aa variable n mean sd
## <fct> <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 1 MS22333 Ala concentration 3 41.9 1.27
## 2 1 MS22333 Asn concentration 3 32.7 1.88
## 3 1 MS22333 Asp concentration 3 4.69 0.286
## 4 1 MS22333 Cys concentration 3 0.394 0.161
## 5 1 MS22333 Gln/Arg concentration 3 3.07 0.123
## 6 1 MS22333 Glu concentration 3 81.1 1.42
## 7 1 MS22333 Gly concentration 3 74.3 2.42
## 8 1 MS22333 His concentration 3 49.9 1.40
## 9 1 MS22333 Ile concentration 3 22.9 0.925
## 10 1 MS22333 Leu concentration 3 28.9 0.328
## # i 256 more rows
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 53 x 12
## Day Strain aa Sample_ID Sample `First Injection` `Sample no` Protein
## <fct> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6 MS22333 Asn 33-C4-D6 EC6-4 32Lise C4 Casein
## 2 6 MS22333 Gln/Arg 33-C3-D6 EC6-3 31Lise C3 Casein
## 3 6 MS22333 His 33-C3-D6 EC6-3 31Lise C3 Casein
## 4 6 MS22333 Met 33-C4-D6 EC6-4 32Lise C4 Casein
## 5 6 MS22333 Phe 33-C3-D6 EC6-3 31Lise C3 Casein
## 6 6 MS22333 Thr 33-C3-D6 EC6-3 31Lise C3 Casein
## 7 6 MS22333 Tyr 33-C4-D6 EC6-4 32Lise C4 Casein
## 8 6 MS22333 Val 33-C4-D6 EC6-4 32Lise C4 Casein
## 9 6 No Ala No-C3-D6 AC6-3 39Lise C3 Casein
## 10 6 No Asn No-C2-D6 AC6-2 38Lise C2 Casein
## # i 43 more rows
## # i 4 more variables: conc <chr>, concentration <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
###VISUALIZE
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "MS22333"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Boxplots
<- dat.clean %>%
bxp.AA.CCDM ggboxplot(.,
x = "Day",
y = "concentration",
color = "Strain",
facet.by = "aa")+
labs(title="CCDM", x="Time [days]", y="Amino acid concentration [uM]")+
scale_y_continuous()+
facet_wrap(~aa,
nrow=5,
scales = "free_y")+
scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)theme_light()+
theme(legend.position="bottom")
#Save plot
ggsave(filename = "plots/bxp.AA.CCDM.svg", plot = bxp.AA.CCDM, width = 16.5, height = 20, units =c("cm"), dpi = 600)
Figure S5A
5.4 Potato - ALL amino acids
5.4.1 DATA
# load data
load("R_objects/stat_test_data_AA.RData")
# data subset
<- dat %>%
dat.clean filter(., Protein == "Potato", !is.na(concentration))
# Set names of variables
<- c("Strain", "Day", "aa")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 266 x 7
## Day Strain aa variable n mean sd
## <fct> <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 1 MS22333 Ala concentration 3 10.3 0.274
## 2 1 MS22333 Asn concentration 3 3.74 1.77
## 3 1 MS22333 Asp concentration 3 7.71 0.054
## 4 1 MS22333 Cys concentration 3 0 0
## 5 1 MS22333 Gln/Arg concentration 3 9.75 0.105
## 6 1 MS22333 Glu concentration 3 9.10 0.297
## 7 1 MS22333 Gly concentration 3 80.8 1.97
## 8 1 MS22333 His concentration 3 0.911 0.11
## 9 1 MS22333 Ile concentration 3 23.0 0.481
## 10 1 MS22333 Leu concentration 3 1.74 0.057
## # i 256 more rows
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 17 x 12
## Day Strain aa Sample_ID Sample `First Injection` `Sample no` Protein
## <fct> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6 MS22333 Ile 33_P1-D6 EP6-1 Lise25 P1 Potato
## 2 6 MS22333 Leu 33_P1-D6 EP6-1 Lise25 P1 Potato
## 3 6 MS22333 Val 33_P1-D6 EP6-1 Lise25 P1 Potato
## 4 6 SK11 Asn SK11_P1-D6 CP6-1 Lise17 P1 Potato
## 5 6 SK11 Glu SK11_P1-D6 CP6-1 Lise17 P1 Potato
## 6 6 SK11 Gly SK11_P1-D6 CP6-1 Lise17 P1 Potato
## 7 6 SK11 Thr SK11_P4-D6 CP6-4 Lise20 P4 Potato
## 8 6 SK11 Val SK11_P1-D6 CP6-1 Lise17 P1 Potato
## 9 6 Wg2 Gln/Arg Wg2_P3-D6 DP6-3 Lise11 P3 Potato
## 10 6 Wg2 Ser Wg2_P1-D6 DP6-1 Lise9 P1 Potato
## 11 6 Wg2 Thr Wg2_P1-D6 DP6-1 Lise9 P1 Potato
## 12 6 Wg2 Tyr Wg2_P3-D6 DP6-3 Lise11 P3 Potato
## 13 6 pAK80 Gly pAK80_P4-~ BP6-4 Lise4 P4 Potato
## 14 6 pAK80 Leu pAK80_P3-~ BP6-3 Lise3 P3 Potato
## 15 6 pAK80 Lys pAK80_P3-~ BP6-3 Lise3 P3 Potato
## 16 6 pAK80 Trp pAK80_P2-~ BP6-2 Lise2 P2 Potato
## 17 6 pAK80 Val pAK80_P3-~ BP6-3 Lise3 P3 Potato
## # i 4 more variables: conc <chr>, concentration <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
###VISUALIZE
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "MS22333"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Boxplots
<- dat.clean %>%
bxp.AA.PCDM ggboxplot(.,
x = "Day",
y = "concentration",
color = "Strain",
facet.by = "aa")+
labs(title="PCDM", x="Time [days]", y="Amino acid concentration [uM]")+
scale_y_continuous()+
facet_wrap(~aa,
nrow=5,
scales = "free_y")+
scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)theme_light()+
theme(legend.position="bottom")
#Save plot
ggsave(filename = "plots/bxp.AA.PCDM.svg", plot = bxp.AA.PCDM, width = 16.5, height = 20, units =c("cm"), dpi = 600)
Figure S5B
5.5 ALL - Create normalized heatmap
# load data
load("R_objects/stat_test_data_AA.RData")
5.5.1 NORMALIZED DATA
<- dat %>%
dat.norm group_by(Sample) %>%
mutate(conc_norm = 100*concentration/sum(concentration))
table(is.na(dat.norm$concentration))
##
## FALSE TRUE
## 1767 19
# Extract metadata and aa table
<- dat.norm %>%
dat.wide filter(!is.na(concentration)) %>%
pivot_wider(values_from = conc_norm,
names_from = aa,
id_cols = c(`Sample no`, Strain, Protein, Day, Sample),
values_fill = 0)
<- dat.wide %>%
meta ungroup() %>%
select(`Sample no`:Sample) %>%
data.framerow.names(meta) <- meta$Sample
#aa.num defines the aa data in prercent
<- dat.wide %>%
aa.num ungroup() %>%
select(His:Trp) %>%
data.framerow.names(aa.num) <- meta$Sample
#aa.norm is the calculated z-score used in the heatmap
<- aa.num
aa.norm
for (i in 1:ncol(aa.num)){
<- mean(aa.num[meta$Strain == "No",i])
x.no <- (aa.num[,i]-x.no)/sd(aa.num[,i])
aa.norm[,i] }
5.5.2 VISUALIZE: Heatmap
#Annotations
=data.frame("Day"=meta$Day, "Medium" = meta$Protein, "Strain" = meta$Strain)
ano_df2 rownames(ano_df2) = rownames(meta)
$Day <- as.factor(ano_df2$Day)
ano_df2$Medium <- factor(ano_df2$Medium, levels = c("Casein", "Potato"), labels=c("CCDM", "PCDM"))
ano_df2$Strain <- factor(ano_df2$Strain, levels = c("No", "pAK80", "SK11", "Wg2", "MS22333"), labels=c("A: Non-inoculated", "B: MS22418 (PrtP-)", "C: MS22421 (PrtP/SK11)", "D: MS22425 (PrtP/Wg2)", "E: MS22427 (PrtP/MS22333)"))
ano_df2
=list("Medium"=c("CCDM"="#A6EEE6FF", "PCDM"="#FAE48BFF"), "Day" = c("1"= "#d2d3daff", "2"= "#86889bff", "6"="#282931ff"), "Strain" =c("A: Non-inoculated"="#E89242FF", "B: MS22418 (PrtP-)"="#82491EFF", "C: MS22421 (PrtP/SK11)"="#24325FFF", "D: MS22425 (PrtP/Wg2)"="#B7E4F9FF", "E: MS22427 (PrtP/MS22333)"="#FB6467FF"))
annot_colors
#Plot
<- pheatmap(aa.norm,
heatmap annotation_row = ano_df2,
cutree_cols = 3, cutree_rows = 5, border_color = NA,
color=colorRampPalette(c("navy", "white", "#d50c00ff"))(50), annotation_colors = annot_colors)
#Save
ggsave(filename = "plots/pmap_aa.svg", plot = heatmap,width = 20, height = 32, units = "cm",dpi = 600)
Figure 3
5.6 ALL - PERMANOVA and PCoA
The heatmap’s normalized amino acid concentrations are used, with the same distances between samples.
5.6.1 PERMANOVA
# Calculate distances
<- vegdist(aa.num, method = "euclidian")
ed
adonis2(ed ~ Day, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed ~ Day, data = ano_df2)
## Df SumOfSqs R2 F Pr(>F)
## Day 2 5324 0.08926 4.4103 0.005 **
## Residual 90 54328 0.91074
## Total 92 59652 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed ~ Strain, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed ~ Strain, data = ano_df2)
## Df SumOfSqs R2 F Pr(>F)
## Strain 4 29063 0.4872 20.902 0.001 ***
## Residual 88 30589 0.5128
## Total 92 59652 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed ~ Medium, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed ~ Medium, data = ano_df2)
## Df SumOfSqs R2 F Pr(>F)
## Medium 1 4415 0.07401 7.2736 0.006 **
## Residual 91 55237 0.92599
## Total 92 59652 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed ~ Strain*Day*Medium, data = ano_df2)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed ~ Strain * Day * Medium, data = ano_df2)
## Df SumOfSqs R2 F Pr(>F)
## Strain 4 29063 0.48720 123.0467 0.001 ***
## Day 2 5072 0.08502 42.9452 0.001 ***
## Medium 1 4496 0.07536 76.1354 0.001 ***
## Strain:Day 7 3214 0.05388 7.7763 0.001 ***
## Strain:Medium 4 11093 0.18596 46.9646 0.001 ***
## Day:Medium 2 1298 0.02177 10.9943 0.001 ***
## Strain:Day:Medium 7 1579 0.02647 3.8194 0.001 ***
## Residual 65 3838 0.06434
## Total 92 59652 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5.6.1.1 VISUALIZE: PCoA plot
# load data
load("R_objects/stat_test_data.RData")
# Create subset
<- dat
dat.clean <- dat.clean %>% select(all_of(ends_with("acid"))) dat.num
Eigenvalues for axes
# Calculating Bray-Curtis PCoA with capscale
<- capscale(aa.num ~ 1, comm = aa.num, distance = "euclidian", metaMDS = TRUE) tmp2
## Square root transformation
## Wisconsin double standardization
tmp2
## Call: capscale(formula = aa.num ~ 1, distance = "euclidian", comm =
## aa.num, metaMDSdist = TRUE)
##
## Inertia Rank
## Total 1.257
## Unconstrained 1.257 18
## Inertia is squared Euclidean distance
## Species scores projected from 'aa.num'
##
## Eigenvalues for unconstrained axes:
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
## 0.6189 0.2642 0.1496 0.0648 0.0433 0.0377 0.0204 0.0165
## (Showing 8 of 18 unconstrained eigenvalues)
##
## metaMDSdist transformed data: wisconsin(sqrt(aa.num))
Prepare data for plotting
# Collect data for plotting
<- data.frame(tmp2$CA$u)
mds.samples <- data.frame(tmp2$CA$v)
mds.scfa
# Prepare point zero and labels for arrows
$label1 <- row.names(mds.scfa)
mds.scfa$x <- 0
mds.scfa$y <- 0
mds.scfa
# Bind with main data
<- cbind(ano_df2, mds.samples) dat.mds
<- list()
plots <- data.frame(Day = c(1,2,6))
sets for (i in 1:nrow(sets)) {
<- dat.mds[dat.mds$Day == sets$Day[i],]
tmp <- ggplot() +
plots[[i]] geom_point(data = tmp,
mapping = aes(x = MDS1, y = MDS2,
color = Strain,
shape = Medium),
size = 3,
alpha = 0.7) +
theme_pubr(legend = "top") +
labs(x = "Axis 1", y = "Axis 2") +
guides(fill = "none") +
guides(size = "none") +
scale_color_manual(values = c("A: Non-inoculated"="#E89242FF",
"B: MS22418 (PrtP-)"="#82491EFF",
"C: MS22421 (PrtP/SK11)"="#24325FFF",
"D: MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"E: MS22427 (PrtP/MS22333)"="#FB6467FF"
)+
) coord_cartesian(xlim = c(min(mds.samples$MDS1)*1.1,
max(mds.samples$MDS1)*1.1),
ylim = c(min(mds.samples$MDS2)*1.1,
max(mds.samples$MDS2)*1.1))
names(plots)[i] <- paste(sets$Medium[i], "day", sets$Day[i], sep = " ")
}4]] <- ggplot() +
plots[[geom_segment(data = mds.scfa, mapping = aes(x=x, y=y, xend=0.35*MDS1, yend=0.35*MDS2),
lineend = "butt",
linejoin = "round",
linewidth = 0.5,
arrow = arrow(length = unit(0.3, 'cm'))) +
geom_label_repel(data = mds.scfa,
mapping = aes(x = 0.35*MDS1, y = 0.35*MDS2), #
label = mds.scfa$label1,
size = 4,
min.segment.length = 0,
segment.alpha = 0.8,
box.padding = 0.3,
force = 1) +
theme_pubr(legend = "top") +
labs(x = "Axis 1", y = "Axis 2") +
guides(fill = "none") +
coord_cartesian(xlim = c(min(mds.samples$MDS1)*1.1, max(mds.samples$MDS1)*1.1), ylim = c(min(mds.samples$MDS2)*1.1,max(mds.samples$MDS2)*1.1))
names(plots)[4] <- "AA weights"
<- ggarrange(plotlist = plots, widths = 12, heights = 8,
p.out common.legend = TRUE,
legend = "bottom",
labels = names(plots)
)
#Save plot
ggsave(filename = "plots/SCFA_PCoA.svg", plot = p.out, width = 30.48, height = 20.32, units = "cm",bg = "white", dpi = 600)
Figure 4
5.6.2 Proteolytic strains - PERMANOVA and PCOA plot
5.6.2.1 PERMANOVA
# Create subset to compare Wg2 and MS22333 samples
<- dat.wide %>%
sub.aa filter(Strain %in% c("Wg2", "MS22333"))
# Extract metadata and aa table for subset
<- sub.aa %>%
meta ungroup() %>%
select(`Sample no`:Sample) %>%
data.framerow.names(meta) <- meta$Sample
# sub.aa.num defines the aa data in prercent
<- sub.aa %>%
sub.aa.num ungroup() %>%
select(His:Trp) %>%
data.framerow.names(sub.aa.num) <- meta$Sample
# Calculate distances
<- vegdist(sub.aa.num, method = "euclidian")
ed.sub
adonis2(ed.sub ~ as.factor(Day), data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed.sub ~ as.factor(Day), data = meta)
## Df SumOfSqs R2 F Pr(>F)
## as.factor(Day) 2 1595.3 0.11949 2.5106 0.072 .
## Residual 37 11755.4 0.88051
## Total 39 13350.7 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed.sub ~ Strain, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed.sub ~ Strain, data = meta)
## Df SumOfSqs R2 F Pr(>F)
## Strain 1 257.4 0.01928 0.7471 0.401
## Residual 38 13093.3 0.98072
## Total 39 13350.7 1.00000
adonis2(ed.sub ~ Protein, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed.sub ~ Protein, data = meta)
## Df SumOfSqs R2 F Pr(>F)
## Protein 1 10306.8 0.772 128.67 0.001 ***
## Residual 38 3043.9 0.228
## Total 39 13350.7 1.000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed.sub ~ Protein*Day*Strain, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed.sub ~ Protein * Day * Strain, data = meta)
## Df SumOfSqs R2 F Pr(>F)
## Protein 1 10306.8 0.77200 2517.3348 0.001 ***
## Day 2 1595.3 0.11949 194.8173 0.001 ***
## Strain 1 257.4 0.01928 62.8747 0.001 ***
## Protein:Day 2 879.2 0.06586 107.3707 0.001 ***
## Protein:Strain 1 55.4 0.00415 13.5394 0.001 ***
## Day:Strain 2 86.4 0.00648 10.5573 0.001 ***
## Protein:Day:Strain 2 55.5 0.00415 6.7738 0.001 ***
## Residual 28 114.6 0.00859
## Total 39 13350.7 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
adonis2(ed.sub ~ Protein:Day, data = meta)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = ed.sub ~ Protein:Day, data = meta)
## Df SumOfSqs R2 F Pr(>F)
## Protein:Day 5 12781.3 0.95735 152.63 0.001 ***
## Residual 34 569.4 0.04265
## Total 39 13350.7 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5.6.2.2 VISUALIZE: PCoA plot
Eigenvalues for axes
# Calculating Bray-Curtis PCoA with capscale
<- capscale(sub.aa.num ~ 1, comm = sub.aa.num, distance = "euclidian", metaMDS = TRUE) tmp.sub
## Wisconsin double standardization
tmp.sub
## Call: capscale(formula = sub.aa.num ~ 1, distance = "euclidian", comm =
## sub.aa.num, metaMDSdist = TRUE)
##
## Inertia Rank
## Total 0.6945
## Unconstrained 0.6945 18
## Inertia is squared Euclidean distance
## Species scores projected from 'sub.aa.num'
##
## Eigenvalues for unconstrained axes:
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
## 0.3510 0.1746 0.0584 0.0360 0.0279 0.0123 0.0107 0.0075
## (Showing 8 of 18 unconstrained eigenvalues)
##
## metaMDSdist transformed data: wisconsin(sub.aa.num)
# Collect data for plotting
<- data.frame(tmp.sub$CA$u)
mds.samples.sub
# Bind with main data
<- cbind(ano_df2[ano_df2$Strain %in% c("D: MS22425 (PrtP/Wg2)", "E: MS22427 (PrtP/MS22333)"),], mds.samples.sub)
dat.mds.sub
# Plot
<- ggplot() +
p.sub geom_point(data = dat.mds.sub,
mapping = aes(x = MDS1, y = MDS2,
color = Strain,
shape = Medium),
size = 3,
alpha = 0.7) +
theme_pubr(legend = "top") +
labs(x = "Axis 1", y = "Axis 2") +
guides(fill = "none") +
guides(size = "none") +
facet_wrap("Day", ncol = 1) +
scale_color_manual(values = c("D: MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"E: MS22427 (PrtP/MS22333)"="#FB6467FF")
+
) coord_cartesian(xlim = c(min(mds.samples.sub$MDS1)*1.1,
max(mds.samples.sub$MDS1)*1.1),
ylim = c(min(mds.samples.sub$MDS2)*1.1,
max(mds.samples.sub$MDS2)*1.1))+
theme(legend.position="bottom")
p.sub
5.7 ALL - Total free amino acid concentraiton (TFAA)
5.7.1 DATA
# load data
load("R_objects/stat_test_data_AA.RData")
#Calculate TFAA for each sample
<- dat %>%
dat group_by(Sample, Protein, Strain, Day) %>%
summarise(TFAA = sum(concentration))
## `summarise()` has grouped output by 'Sample', 'Protein', 'Strain'. You can
## override using the `.groups` argument.
<- dat %>%
dat.clean filter(!is.na(TFAA))
5.7.2 PREPARE VARIABLES
# Set names of variables
<- c("Strain", "Protein", "Day")
PREDICTOR <- "TFAA"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 28 x 7
## Protein Strain Day variable n mean sd
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 Casein MS22333 1 TFAA 3 1145. 19.8
## 2 Casein MS22333 2 TFAA 3 1292. 32.2
## 3 Casein MS22333 6 TFAA 4 1376. 79.7
## 4 Potato MS22333 1 TFAA 3 275. 7.73
## 5 Potato MS22333 2 TFAA 3 479. 25.4
## 6 Potato MS22333 6 TFAA 4 451. 12.9
## 7 Casein No 1 TFAA 3 175. 10.5
## 8 Casein No 6 TFAA 4 172. 16.4
## 9 Potato No 1 TFAA 3 200. 12.1
## 10 Potato No 6 TFAA 3 201. 3.38
## # i 18 more rows
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
5.7.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 5 x 7
## Protein Strain Day Sample TFAA is.outlier is.extreme
## <chr> <chr> <fct> <chr> <dbl> <lgl> <lgl>
## 1 Potato MS22333 6 EP6-1 432. TRUE FALSE
## 2 Casein No 6 AC6-3 197. TRUE FALSE
## 3 Potato SK11 6 CP6-4 249. TRUE FALSE
## 4 Casein Wg2 6 DC6-4 1982. TRUE FALSE
## 5 Casein pAK80 6 BC6-4 622. TRUE FALSE
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals. The Shapiro-Wilk test is performed on all data groups.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Day, Protein, Strain) %>%
shapiro_test(vars = "TFAA") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 0 x 7
## # i 7 variables: Protein <chr>, Strain <chr>, Day <fct>, variable <chr>,
## # statistic <dbl>, p <dbl>, p.adj <dbl>
Test homogneity of variance assumption
plot(model, 1)
#Levene_test
%>%
dat.clean group_by(Protein, Strain) %>%
levene_test(TFAA ~ Day) %>%
adjust_pvalue(method = "fdr")
## # A tibble: 10 x 7
## Protein Strain df1 df2 statistic p p.adj
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Casein MS22333 2 7 0.827 0.476 0.748
## 2 Casein No 1 5 0.0669 0.806 0.806
## 3 Casein SK11 2 7 0.406 0.681 0.757
## 4 Casein Wg2 2 7 0.541 0.605 0.756
## 5 Casein pAK80 2 7 1.22 0.350 0.748
## 6 Potato MS22333 2 7 0.798 0.487 0.748
## 7 Potato No 1 4 0.543 0.502 0.748
## 8 Potato SK11 2 7 1.19 0.358 0.748
## 9 Potato Wg2 2 7 0.711 0.524 0.748
## 10 Potato pAK80 2 7 1.62 0.263 0.748
5.7.4 PERFORM TEST
Tests on the whole data set did not find normality, but data groups were normal distributed according to the Shapiro-Wilk test of normality. Therefore, ANOVA and t-test are used for the statistics of this data set.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain, Day) %>%
anova_test(TFAA ~ Protein) %>%
adjust_pvalue(method = "fdr")
res.protein
## # A tibble: 14 x 10
## Strain Day Effect DFn DFd F p `p<.05` ges p.adj
## <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 MS22333 1 Protein 1 4 5005. 0.000000239 "*" 0.999 2.12e-6
## 2 MS22333 2 Protein 1 4 1181. 0.00000428 "*" 0.997 1.20e-5
## 3 MS22333 6 Protein 1 6 524. 0.000000454 "*" 0.989 2.12e-6
## 4 No 1 Protein 1 4 7.29 0.054 "" 0.646 6.87e-2
## 5 No 6 Protein 1 5 8.43 0.034 "*" 0.628 4.76e-2
## 6 SK11 1 Protein 1 4 2.44 0.193 "" 0.379 2.25e-1
## 7 SK11 2 Protein 1 4 1.65 0.268 "" 0.292 2.89e-1
## 8 SK11 6 Protein 1 6 361. 0.00000138 "*" 0.984 4.83e-6
## 9 Wg2 1 Protein 1 4 4141. 0.000000349 "*" 0.999 2.12e-6
## 10 Wg2 2 Protein 1 4 64.9 0.001 "*" 0.942 1.75e-3
## 11 Wg2 6 Protein 1 6 11.4 0.015 "*" 0.654 2.33e-2
## 12 pAK80 1 Protein 1 4 111. 0.000462 "*" 0.965 9.24e-4
## 13 pAK80 2 Protein 1 4 166. 0.000209 "*" 0.976 4.88e-4
## 14 pAK80 6 Protein 1 6 0.246 0.637 "" 0.039 6.37e-1
<- dat.clean %>%
pwc.protein filter(Protein %in% c("Casein", "Potato")) %>%
group_by(Strain, Day) %>%
pairwise_t_test(TFAA ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc.protein
## # A tibble: 10 x 11
## Strain Day .y. group1 group2 n1 n2 p p.signif p.adj
## <chr> <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl>
## 1 MS22333 1 TFAA Casein Potato 3 3 0.000000239 **** 0.00000212
## 2 MS22333 2 TFAA Casein Potato 3 3 0.00000428 **** 0.0000120
## 3 MS22333 6 TFAA Casein Potato 4 4 0.000000454 **** 0.00000212
## 4 No 6 TFAA Casein Potato 4 3 0.0337 * 0.0472
## 5 SK11 6 TFAA Casein Potato 4 4 0.00000138 **** 0.00000483
## 6 Wg2 1 TFAA Casein Potato 3 3 0.000000349 **** 0.00000212
## 7 Wg2 2 TFAA Casein Potato 3 3 0.00129 ** 0.00226
## 8 Wg2 6 TFAA Casein Potato 4 4 0.015 * 0.0233
## 9 pAK80 1 TFAA Casein Potato 3 3 0.000462 *** 0.000924
## 10 pAK80 2 TFAA Casein Potato 3 3 0.000209 *** 0.000488
## # i 1 more variable: p.adj.signif <chr>
# for each strain
<- dat.clean %>%
res.strain group_by(Day, Protein) %>%
anova_test(TFAA ~ Strain) %>%
adjust_pvalue(method = "fdr")
res.strain
## # A tibble: 6 x 10
## Protein Day Effect DFn DFd F p `p<.05` ges p.adj
## <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Casein 1 Strain 4 10 2945. 2.63e-15 "*" 0.999 1.58e-14
## 2 Potato 1 Strain 4 10 34.7 7.78e- 6 "*" 0.933 9.34e- 6
## 3 Casein 2 Strain 3 8 233. 3.99e- 8 "*" 0.989 7.98e- 8
## 4 Potato 2 Strain 3 8 0.926 4.71e- 1 "" 0.258 4.71e- 1
## 5 Casein 6 Strain 4 15 23.9 2.35e- 6 "*" 0.864 3.52e- 6
## 6 Potato 6 Strain 4 14 355. 6.67e-14 "*" 0.99 2.00e-13
<- dat.clean %>%
pwc.strain filter(Protein %in% c("Casein", "Potato")) %>%
group_by(Day, Protein) %>%
pairwise_t_test(TFAA ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p.adj < 0.05) pwc.strain
## # A tibble: 33 x 11
## Protein Day .y. group1 group2 n1 n2 p p.signif p.adj
## <chr> <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl>
## 1 Casein 1 TFAA MS22333 No 3 3 1.49e-15 **** 2.58e-14
## 2 Casein 1 TFAA MS22333 pAK80 3 3 1.22e-15 **** 2.58e-14
## 3 Casein 1 TFAA MS22333 SK11 3 3 1.48e-15 **** 2.58e-14
## 4 Casein 1 TFAA MS22333 Wg2 3 3 6.82e-12 **** 3.22e-11
## 5 Casein 1 TFAA No Wg2 3 3 4.02e-13 **** 2.61e-12
## 6 Casein 1 TFAA pAK80 Wg2 3 3 2.84e-13 **** 2.46e-12
## 7 Casein 1 TFAA SK11 Wg2 3 3 3.96e-13 **** 2.61e-12
## 8 Potato 1 TFAA MS22333 No 3 3 8.01e- 6 **** 1.81e- 5
## 9 Potato 1 TFAA MS22333 pAK80 3 3 6.58e- 7 **** 1.90e- 6
## 10 Potato 1 TFAA No pAK80 3 3 2.45e- 2 * 3.86e- 2
## # i 23 more rows
## # i 1 more variable: p.adj.signif <chr>
5.7.5 VISUALIZE
#Name order in plot
$Day <- as.factor(dat.clean$Day)
dat.clean$Strain <- factor(dat.clean$Strain,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "MS22333"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Plot - boxplot
<- dat.clean %>%
bxp.TFAA filter(., Protein %in% c("Casein", "Potato")) %>%
filter (., Day %in% c(1, 2, 6)) %>%
ggboxplot(.,
x = "Day",
y = "TFAA",
color = "Strain",
facet.by = "Protein")+
labs(x="Time [days]", y="TFAA [uM]")+
scale_y_continuous()+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM"))
+
)scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)theme_light()+
theme(legend.position="bottom")
#Save plot
ggsave(filename = "plots/bxp.TFAA.svg", plot = bxp.TFAA, width = 13, height = 10, units =c("cm"), dpi = 600)
Figure 2
5.8 Sum of Phe for all days - Strain and Protein- Concentration
5.8.1 DATA
# load data
load("R_objects/stat_test_data_AA.RData")
# Create subset
<- dat %>% filter(aa %in% c("Phe"), !is.na(concentration)) dat.clean
5.8.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 10 x 6
## Protein Strain variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 Casein MS22333 concentration 10 89.4 8.72
## 2 Casein No concentration 7 3.32 0.194
## 3 Casein SK11 concentration 10 2.28 1.22
## 4 Casein Wg2 concentration 10 67.7 34.3
## 5 Casein pAK80 concentration 10 5.83 8.78
## 6 Potato MS22333 concentration 10 50.9 20.1
## 7 Potato No concentration 6 4.75 0.304
## 8 Potato SK11 concentration 10 19.7 17.4
## 9 Potato Wg2 concentration 10 42.0 19.8
## 10 Potato pAK80 concentration 10 9.22 4.31
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
5.8.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 7 x 12
## Protein Strain Sample_ID Sample `First Injection` `Sample no` Day conc
## <chr> <chr> <chr> <chr> <chr> <chr> <fct> <chr>
## 1 Casein Wg2 Wg2_C4-D6 DC6-4 Lise16 C4 6 uM
## 2 Casein pAK80 pAK80_C1-D6 BC6-1 Lise5 C1 6 uM
## 3 Casein pAK80 pAK80_C2-D6 BC6-2 Lise6 C2 6 uM
## 4 Casein pAK80 pAK80_C3-D6 BC6-3 Lise7 C3 6 uM
## 5 Casein pAK80 pAK80_C4-D6 BC6-4 Lise8 C4 6 uM
## 6 Potato No No-P2-D1 AP1-2 66Lise P2 1 uM
## 7 Potato SK11 Sk11-P3-D2 CP2-3 85Lise P3 2 uM
## # i 4 more variables: aa <chr>, concentration <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
#Test coef for outlier
%>%
dat.clean filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.8)
## [1] FALSE FALSE FALSE TRUE
%>%
dat.clean filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE FALSE
%>%
dat.clean filter(Strain == "SK11" & Protein == "Potato" & Day == 2) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE
#Remove extreme outliers with coef > 2.59
<- c("Wg2_C4-D6")
extreme_outliers
<- subset(dat.clean, !Sample %in% extreme_outliers) dat.clean
# Create plot without extreme outliers
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.756 3.87e-11
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Protein, Strain) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 7 x 6
## Protein Strain variable statistic p p.adj
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Casein No concentration 0.755 0.0143 0.0205
## 2 Casein SK11 concentration 0.775 0.00720 0.0133
## 3 Casein Wg2 concentration 0.549 0.0000137 0.0000687
## 4 Casein pAK80 concentration 0.456 0.00000110 0.0000110
## 5 Potato MS22333 concentration 0.709 0.00112 0.00280
## 6 Potato No concentration 0.711 0.00798 0.0133
## 7 Potato SK11 concentration 0.600 0.0000556 0.000185
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 9 83 1.99 0.0509
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
5.8.4 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 5 x 8
## Strain .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 MS22333 concentration 20 14.3 1 0.000157 Kruskal-Wallis 0.000392
## 2 No concentration 13 9 1 0.0027 Kruskal-Wallis 0.004
## 3 SK11 concentration 20 14.3 1 0.000157 Kruskal-Wallis 0.000392
## 4 Wg2 concentration 20 4.17 1 0.0413 Kruskal-Wallis 0.0413
## 5 pAK80 concentration 20 8.69 1 0.0032 Kruskal-Wallis 0.004
<- dat.clean %>%
pwc group_by(Strain) %>%
pairwise_wilcox_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 5 x 10
## Strain .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 MS22333 conc~ Casein Potato 10 10 100 1.08e-5 2.70e-5 ****
## 2 No conc~ Casein Potato 7 6 0 1 e-3 1.67e-3 **
## 3 SK11 conc~ Casein Potato 10 10 0 1.08e-5 2.70e-5 ****
## 4 Wg2 conc~ Casein Potato 10 10 77 4.3 e-2 4.3 e-2 *
## 5 pAK80 conc~ Casein Potato 10 10 11 2 e-3 2.5 e-3 **
# for each strain
<- dat.clean %>%
res.strain group_by(Protein) %>%
kruskal_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 2 x 8
## Protein .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein concentration 47 37.2 4 0.000000166 Kruskal-Wallis 3.32e-7
## 2 Potato concentration 46 34.5 4 0.00000058 Kruskal-Wallis 5.80e-7
<- dat.clean %>%
pwc group_by(Protein) %>%
pairwise_wilcox_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 16 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Casein concentration MS22333 No 10 7 70 0.000103 0.000257
## 2 Casein concentration MS22333 pAK80 10 10 100 0.0000108 0.000036
## 3 Casein concentration MS22333 SK11 10 10 100 0.0000108 0.000036
## 4 Casein concentration MS22333 Wg2 10 10 90 0.002 0.00308
## 5 Casein concentration No Wg2 7 10 0 0.000103 0.000257
## 6 Casein concentration pAK80 Wg2 10 10 0 0.0000108 0.000036
## 7 Casein concentration SK11 Wg2 10 10 0 0.0000108 0.000036
## 8 Potato concentration MS22333 No 10 6 60 0.00025 0.000455
## 9 Potato concentration MS22333 pAK80 10 10 100 0.0000108 0.000036
## 10 Potato concentration MS22333 SK11 10 10 91 0.001 0.00167
## 11 Potato concentration No pAK80 6 10 6 0.007 0.00933
## 12 Potato concentration No SK11 6 10 0 0.00025 0.000455
## 13 Potato concentration No Wg2 6 10 0 0.00025 0.000455
## 14 Potato concentration pAK80 SK11 10 10 14 0.005 0.00714
## 15 Potato concentration pAK80 Wg2 10 10 0 0.0000108 0.000036
## 16 Potato concentration SK11 Wg2 10 10 20 0.023 0.0288
## # i 1 more variable: p.adj.signif <chr>
5.9 Sum of Thr for all days - Strain and Protein- Concentration
5.9.1 DATA
# load data
load("R_objects/stat_test_data_AA.RData")
# Create subset
<- dat %>% filter(aa %in% c("Thr"), !is.na(concentration)) dat.clean
5.9.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 10 x 6
## Protein Strain variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 Casein MS22333 concentration 10 6.58 1.86
## 2 Casein No concentration 7 0.614 0.68
## 3 Casein SK11 concentration 10 0.586 0.239
## 4 Casein Wg2 concentration 10 6.74 4.33
## 5 Casein pAK80 concentration 10 0.765 0.445
## 6 Potato MS22333 concentration 10 5.05 2.59
## 7 Potato No concentration 6 2.79 0.347
## 8 Potato SK11 concentration 10 1.35 2.40
## 9 Potato Wg2 concentration 10 2.37 1.56
## 10 Potato pAK80 concentration 10 0.743 0.176
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
5.9.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 6 x 12
## Protein Strain Sample_ID Sample `First Injection` `Sample no` Day conc
## <chr> <chr> <chr> <chr> <chr> <chr> <fct> <chr>
## 1 Casein No No-C3-D6 AC6-3 39Lise C3 6 uM
## 2 Casein SK11 SK11_C1-D6 CC6-1 Lise21 C1 6 uM
## 3 Casein SK11 SK11_C2-D6 CC6-2 Lise22 C2 6 uM
## 4 Casein Wg2 Wg2_C4-D6 DC6-4 Lise16 C4 6 uM
## 5 Casein pAK80 pAK80_C3-D6 BC6-3 Lise7 C3 6 uM
## 6 Potato SK11 Sk11-P3-D2 CP2-3 85Lise P3 2 uM
## # i 4 more variables: aa <chr>, concentration <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
#Test coef for outliers (No-C3-D6 Wg2_C4-D6 pAK80_C3-D6 Sk11-P3-D2)
%>%
dat.clean filter(Strain == "No" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE TRUE FALSE
%>%
dat.clean filter(Strain == "Wg2" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE TRUE
%>%
dat.clean filter(Strain == "pAK80" & Protein == "Casein" & Day == 6) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE FALSE
%>%
dat.clean filter(Strain == "SK11" & Protein == "Potato" & Day == 2) %>%
pull(concentration) %>% is_outlier(coef = 2.59)
## [1] FALSE FALSE FALSE
#Remove extreme outliers with coef > 2.59
<- c("Wg2_C4-D6")
extreme_outliers
<- subset(dat.clean, !Sample %in% extreme_outliers) dat.clean
# Create plot without extreme outliers
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.752 3.20e-11
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Protein, Strain) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 4 x 6
## Protein Strain variable statistic p p.adj
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Casein No concentration 0.538 0.0000483 0.000242
## 2 Casein Wg2 concentration 0.683 0.000543 0.00181
## 3 Casein pAK80 concentration 0.703 0.000967 0.00242
## 4 Potato SK11 concentration 0.442 0.000000770 0.00000770
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 9 83 3.10 0.00292
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
5.9.4 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 2 x 8
## Strain .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 No concentration 13 9 1 0.0027 Kruskal-Wallis 0.00675
## 2 Wg2 concentration 20 10.6 1 0.00115 Kruskal-Wallis 0.00575
<- dat.clean %>%
pwc group_by(Strain) %>%
pairwise_wilcox_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 2 x 10
## Strain .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 No conce~ Casein Potato 7 6 0 1 e-3 0.0025 **
## 2 Wg2 conce~ Casein Potato 10 10 93 4.87e-4 0.00244 ***
# for each strain
<- dat.clean %>%
res.strain group_by(Protein) %>%
kruskal_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 2 x 8
## Protein .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein concentration 47 35.6 4 0.000000353 Kruskal-Wallis 7.06e-7
## 2 Potato concentration 46 27.6 4 0.0000148 Kruskal-Wallis 1.48e-5
<- dat.clean %>%
pwc group_by(Protein) %>%
pairwise_wilcox_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 14 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Casein concentration MS22333 No 10 7 70 0.000103 2.94e-4
## 2 Casein concentration MS22333 pAK80 10 10 100 0.0000108 4.32e-5
## 3 Casein concentration MS22333 SK11 10 10 100 0.0000108 4.32e-5
## 4 Casein concentration No pAK80 7 10 14 0.043 6.14e-2
## 5 Casein concentration No Wg2 7 10 0 0.000103 2.94e-4
## 6 Casein concentration pAK80 Wg2 10 10 0 0.0000108 4.32e-5
## 7 Casein concentration SK11 Wg2 10 10 0 0.0000108 4.32e-5
## 8 Potato concentration MS22333 pAK80 10 10 100 0.0000108 4.32e-5
## 9 Potato concentration MS22333 SK11 10 10 90 0.002 4 e-3
## 10 Potato concentration MS22333 Wg2 10 10 85 0.007 1.08e-2
## 11 Potato concentration No pAK80 6 10 60 0.00025 6.25e-4
## 12 Potato concentration No SK11 6 10 54 0.007 1.08e-2
## 13 Potato concentration pAK80 Wg2 10 10 6 0.000325 7.22e-4
## 14 Potato concentration SK11 Wg2 10 10 13 0.004 7.27e-3
## # i 1 more variable: p.adj.signif <chr>
5.10 Phe & Day6 - Strain and Protein- Concentration
5.10.1 DATA
# load data
load("R_objects/stat_test_data_AA.RData")
# Create subset
<- dat %>% filter(aa %in% c("Phe"), !is.na(concentration) & Day == 6) dat.clean
5.10.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Strain")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 10 x 6
## Protein Strain variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 Casein MS22333 concentration 4 97.5 7.15
## 2 Casein No concentration 4 3.28 0.162
## 3 Casein SK11 concentration 4 0.907 0.519
## 4 Casein Wg2 concentration 4 87.9 50.2
## 5 Casein pAK80 concentration 4 9.68 14.1
## 6 Potato MS22333 concentration 4 63.9 2.89
## 7 Potato No concentration 3 4.64 0.113
## 8 Potato SK11 concentration 4 15.8 2.57
## 9 Potato Wg2 concentration 4 61.4 3.42
## 10 Potato pAK80 concentration 4 5.67 1.66
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
5.10.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 4 x 12
## Protein Strain Sample_ID Sample `First Injection` `Sample no` Day conc
## <chr> <chr> <chr> <chr> <chr> <chr> <fct> <chr>
## 1 Casein MS22333 33-C3-D6 EC6-3 31Lise C3 6 uM
## 2 Casein No No-C3-D6 AC6-3 39Lise C3 6 uM
## 3 Casein Wg2 Wg2_C4-D6 DC6-4 Lise16 C4 6 uM
## 4 Casein pAK80 pAK80_C4-D6 BC6-4 Lise8 C4 6 uM
## # i 4 more variables: aa <chr>, concentration <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.582 0.00000000239
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Protein, Strain) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 1 x 6
## Protein Strain variable statistic p p.adj
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Casein Wg2 concentration 0.644 0.00206 0.0206
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 9 29 0.872 0.560
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
5.10.4 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Strain) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 3 x 8
## Strain .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 MS22333 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0522
## 2 No concentration 7 4.5 1 0.0339 Kruskal-Wallis 0.0565
## 3 SK11 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0522
<- dat.clean %>%
pwc group_by(Strain) %>%
pairwise_wilcox_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 2 x 10
## Strain .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 MS22333 concent~ Casein Potato 4 4 16 0.029 0.0725 *
## 2 SK11 concent~ Casein Potato 4 4 0 0.029 0.0725 *
# for each strain
<- dat.clean %>%
res.strain group_by(Protein) %>%
kruskal_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 2 x 8
## Protein .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein concentration 20 16.4 4 0.00258 Kruskal-Wallis 0.00277
## 2 Potato concentration 19 16.2 4 0.00277 Kruskal-Wallis 0.00277
<- dat.clean %>%
pwc group_by(Protein) %>%
pairwise_wilcox_test(concentration ~ Strain) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 12 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Casein concen~ MS223~ No 4 4 16 0.029 0.0483 ns
## 2 Casein concen~ MS223~ pAK80 4 4 16 0.029 0.0483 ns
## 3 Casein concen~ MS223~ SK11 4 4 16 0.029 0.0483 ns
## 4 Casein concen~ No SK11 4 4 16 0.029 0.0483 ns
## 5 Casein concen~ No Wg2 4 4 0 0.029 0.0483 ns
## 6 Casein concen~ pAK80 Wg2 4 4 0 0.029 0.0483 ns
## 7 Casein concen~ SK11 Wg2 4 4 0 0.029 0.0483 ns
## 8 Potato concen~ MS223~ pAK80 4 4 16 0.029 0.0483 ns
## 9 Potato concen~ MS223~ SK11 4 4 16 0.029 0.0483 ns
## 10 Potato concen~ pAK80 SK11 4 4 0 0.029 0.0483 ns
## 11 Potato concen~ pAK80 Wg2 4 4 0 0.029 0.0483 ns
## 12 Potato concen~ SK11 Wg2 4 4 0 0.029 0.0483 ns
6 DATA SET: Volatile organic compounds
6.1 DATA
LOAD
<- read_excel("./data/Data_collection.xlsx", sheet="VOC", na = "NA") dat
LOOK
# Take a glimpse
glimpse(dat)
## Rows: 42
## Columns: 19
## $ Sample <chr> "Blanc", "pAK80-P1-D~
## $ Protease <chr> NA, "pAK80", "pAK80"~
## $ Protein <chr> NA, "Potato", "Potat~
## $ `First Injection` <chr> "blanc", "PAK80 P1",~
## $ `MS Quantitation Peak` <chr> "MS Quantitation Pea~
## $ Acetaldehyde <dbl> 1192.328, 7963.007, ~
## $ Acetone <dbl> 139.560, 5813.107, 6~
## $ Ethanol <dbl> 33.586, 48647.408, 4~
## $ `Pyrimidine-2,4(1H,3H)-dione, 5-amino-6-nitroso-` <dbl> 1535.197, 22074.072,~
## $ Hexanal <dbl> 46.831, 392.587, 294~
## $ `Butane, 2-nitro-` <dbl> 0.000, 4532.694, 422~
## $ `1-Heptyn-6-one` <dbl> 44.339, 3133.171, 28~
## $ `2-Heptanone, 4-methyl-` <dbl> 49.768, 5853.148, 60~
## $ `Benzene, 1,3-bis(1,1-dimethylethyl)-` <dbl> 0.000, 7768.256, 797~
## $ `2-Decenal, (Z)-` <dbl> 44.544, 3821.739, 37~
## $ Benzaldehyde <dbl> 19.609, 49008.878, 4~
## $ `Benzaldehyde, 3,5-dimethyl-` <dbl> 41.008, 292.719, 264~
## $ Dodecanal <dbl> 68.303, 55.985, 49.4~
## $ `Phenol, 3,5-bis(1,1-dimethylethyl)-` <dbl> 15.394, 27498.312, 2~
# Make a explorative summary
::skim(dat) skimr
Name | dat |
Number of rows | 42 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 14 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1.00 | 5 | 11 | 0 | 41 | 0 |
Protease | 2 | 0.95 | 2 | 5 | 0 | 5 | 0 |
Protein | 2 | 0.95 | 6 | 6 | 0 | 2 | 0 |
First Injection | 0 | 1.00 | 4 | 9 | 0 | 39 | 0 |
MS Quantitation Peak | 0 | 1.00 | 20 | 20 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Acetaldehyde | 0 | 1 | 5806.49 | 3793.70 | 875.40 | 1804.22 | 6193.44 | 9202.37 | 11517.26 | <U+2587><U+2581><U+2582><U+2583><U+2585> |
Acetone | 0 | 1 | 2554.57 | 1994.45 | 82.69 | 940.60 | 1938.98 | 3980.78 | 7033.34 | <U+2587><U+2583><U+2582><U+2582><U+2582> |
Ethanol | 0 | 1 | 44058.68 | 10287.32 | 33.59 | 43942.42 | 46138.78 | 48479.88 | 50492.31 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
Pyrimidine-2,4(1H,3H)-dione, 5-amino-6-nitroso- | 0 | 1 | 24576.67 | 10220.58 | 437.84 | 18565.65 | 25421.17 | 33446.90 | 36567.13 | <U+2581><U+2582><U+2582><U+2583><U+2587> |
Hexanal | 0 | 1 | 564.65 | 1151.96 | 15.13 | 76.80 | 153.09 | 388.92 | 4369.41 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Butane, 2-nitro- | 0 | 1 | 4023.55 | 947.46 | 0.00 | 4011.18 | 4154.18 | 4380.86 | 4973.10 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
1-Heptyn-6-one | 0 | 1 | 10583.28 | 4996.73 | 44.34 | 8080.94 | 11324.10 | 15299.75 | 16982.06 | <U+2583><U+2582><U+2583><U+2586><U+2587> |
2-Heptanone, 4-methyl- | 0 | 1 | 5886.08 | 1505.63 | 49.77 | 5565.87 | 6036.96 | 6659.30 | 7868.71 | <U+2581><U+2581><U+2581><U+2587><U+2585> |
Benzene, 1,3-bis(1,1-dimethylethyl)- | 0 | 1 | 7400.38 | 2070.80 | 0.00 | 6563.71 | 7782.64 | 9009.80 | 9940.67 | <U+2581><U+2581><U+2581><U+2587><U+2587> |
2-Decenal, (Z)- | 0 | 1 | 3630.04 | 890.75 | 26.88 | 3505.29 | 3742.92 | 4046.71 | 4788.62 | <U+2581><U+2581><U+2581><U+2587><U+2586> |
Benzaldehyde | 0 | 1 | 62949.62 | 102111.00 | 19.61 | 4096.76 | 22991.31 | 61164.68 | 380542.86 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Benzaldehyde, 3,5-dimethyl- | 0 | 1 | 2941.93 | 2647.16 | 41.01 | 302.84 | 2168.34 | 5696.22 | 7707.58 | <U+2587><U+2582><U+2581><U+2583><U+2583> |
Dodecanal | 0 | 1 | 1680.07 | 2568.86 | 24.72 | 56.97 | 116.02 | 2614.14 | 8272.26 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Phenol, 3,5-bis(1,1-dimethylethyl)- | 0 | 1 | 16717.16 | 9217.09 | 15.39 | 9063.44 | 14473.79 | 24839.21 | 29888.71 | <U+2583><U+2585><U+2583><U+2582><U+2587> |
CLEAN
# Remove blanc samples
<- dat %>%
dat filter(Sample != "Blanc")
# reorganisation of data
<- dat %>%
dat pivot_longer(.,cols=6:19, names_to="compound", values_to = "concentration")
# Look at cleaned data
::skim(dat) skimr
Name | dat |
Number of rows | 560 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1 | 8 | 11 | 0 | 40 | 0 |
Protease | 0 | 1 | 2 | 5 | 0 | 5 | 0 |
Protein | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
First Injection | 0 | 1 | 4 | 9 | 0 | 38 | 0 |
MS Quantitation Peak | 0 | 1 | 20 | 20 | 0 | 1 | 0 |
compound | 0 | 1 | 7 | 47 | 0 | 14 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
concentration | 0 | 1 | 14492.58 | 33347.63 | 15.13 | 3162.34 | 5867.55 | 12984.02 | 380542.9 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
SAVE
# Save cleaned data
save(dat, file = "R_objects/stat_test_data_VOC.RData")
# clear the environment and release memory
rm(list = ls(all.names = TRUE)[ls(all.names = TRUE) != "params"])
invisible(gc())
6.2 TEST INDEPENDENCE
6.2.1 CATEGORICAL VARIABLES
Chi squared test of independence
# load data
load("R_objects/stat_test_data_VOC.RData")
# This converts all character columns to factor (if only specific columns should be converted replace "where(is_character)" with "c(<col1>,<col2>,...)"
<- dat %>% mutate(across(where(is_character), as_factor))
dat
# Create vector of categorical variables with more than two categories and less than half of the number of samples.
<- dat %>%
cat_var select_if(is.factor) %>%
select(where(~n_distinct(.) > 1)) %>%
select(where(~n_distinct(.) < (nrow(dat)/2))) %>%
colnames()
length(cat_var) # If less than 2 variables the following code should not be run
## [1] 5
# Test the variables pairwise
%>%
cat_var combn(2) %>%
t() %>%
as_tibble() %>%
%>%
rowwise mutate(chisq_test = list(
table(dat[[V1]], dat[[V2]]) %>% chisq.test()
),chisq_pval = chisq_test$p.value
)
## Warning: There were 5 warnings in `mutate()`.
## The first warning was:
## i In argument: `chisq_test = list(table(dat[[V1]], dat[[V2]]) %>%
## chisq.test())`.
## i In row 1.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## i Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 4 remaining warnings.
## # A tibble: 10 x 4
## # Rowwise:
## V1 V2 chisq_test chisq_pval
## <chr> <chr> <list> <dbl>
## 1 Sample Protease <htest> 0
## 2 Sample Protein <htest> 1.81e-93
## 3 Sample First Injection <htest> 0
## 4 Sample compound <htest> 1 e+ 0
## 5 Protease Protein <htest> 1 e+ 0
## 6 Protease First Injection <htest> 0
## 7 Protease compound <htest> 1 e+ 0
## 8 Protein First Injection <htest> 2.74e-83
## 9 Protein compound <htest> 1 e+ 0
## 10 First Injection compound <htest> 1 e+ 0
6.2.2 QUANTITATIVE VARIABLES
Spearman’s test without first testing the normality of the data
# create vector with the relevant variables
<- dat %>% select(where(is.numeric)) %>% colnames()
CON.VARS
CON.VARS
## [1] "concentration"
6.3 Acetaldehyde - Protease and Protein - Concentration
6.3.1 DATA
# load data
load("R_objects/stat_test_data_VOC.RData")
# Create subset
<- dat %>% filter(compound %in% c("Acetaldehyde")) dat.clean
6.3.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Protease")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 10 x 6
## Protease Protein variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein concentration 4 9982. 896.
## 2 No Casein concentration 4 1373. 337.
## 3 SK11 Casein concentration 4 2421. 212.
## 4 Wg2 Casein concentration 4 9696. 1182.
## 5 pAK80 Casein concentration 4 2819. 1809.
## 6 33 Potato concentration 4 6675. 1685.
## 7 No Potato concentration 4 1469. 628.
## 8 SK11 Potato concentration 4 10803. 570.
## 9 Wg2 Potato concentration 4 6684. 2403.
## 10 pAK80 Potato concentration 4 8344. 1728.
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
6.3.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 9
## Protease Protein Sample `First Injection` `MS Quantitation Peak` compound
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 No Casein No-C3-D6 NO C3 MS Quantitation Peak Acetalde~
## 2 SK11 Casein SK11-C2-D6 SK11 C2 MS Quantitation Peak Acetalde~
## 3 No Potato No-P2-D6 NO P2 MS Quantitation Peak Acetalde~
## # i 3 more variables: concentration <dbl>, is.outlier <lgl>, is.extreme <lgl>
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.969 0.342
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Protein, Protease) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 1 x 6
## Protease Protein variable statistic p p.adj
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 No Potato concentration 0.652 0.00275 0.0275
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 9 30 1.86 0.0975
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
6.3.4 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Protease) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 4 x 8
## Protease .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0348
## 2 SK11 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0348
## 3 Wg2 concentration 8 4.08 1 0.0433 Kruskal-Wallis 0.0541
## 4 pAK80 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0348
<- dat.clean %>%
pwc group_by(Protease) %>%
pairwise_wilcox_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 3 x 10
## Protease .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 33 concen~ Casein Potato 4 4 16 0.029 0.0483 *
## 2 SK11 concen~ Casein Potato 4 4 0 0.029 0.0483 *
## 3 pAK80 concen~ Casein Potato 4 4 0 0.029 0.0483 *
# for each strain
<- dat.clean %>%
res.strain group_by(Protein) %>%
kruskal_test(concentration ~ Protease) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 2 x 8
## Protein .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein concentration 20 15.6 4 0.00358 Kruskal-Wallis 0.00407
## 2 Potato concentration 20 15.3 4 0.00407 Kruskal-Wallis 0.00407
<- dat.clean %>%
pwc group_by(Protein) %>%
pairwise_wilcox_test(concentration ~ Protease) %>%
adjust_pvalue(method = "fdr")
%>% filter(p.adj < 0.05) pwc
## # A tibble: 14 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Casein concen~ 33 No 4 4 16 0.029 0.0414 ns
## 2 Casein concen~ 33 pAK80 4 4 16 0.029 0.0414 ns
## 3 Casein concen~ 33 SK11 4 4 16 0.029 0.0414 ns
## 4 Casein concen~ No SK11 4 4 0 0.029 0.0414 ns
## 5 Casein concen~ No Wg2 4 4 0 0.029 0.0414 ns
## 6 Casein concen~ pAK80 Wg2 4 4 0 0.029 0.0414 ns
## 7 Casein concen~ SK11 Wg2 4 4 0 0.029 0.0414 ns
## 8 Potato concen~ 33 No 4 4 16 0.029 0.0414 ns
## 9 Potato concen~ 33 SK11 4 4 0 0.029 0.0414 ns
## 10 Potato concen~ No pAK80 4 4 0 0.029 0.0414 ns
## 11 Potato concen~ No SK11 4 4 0 0.029 0.0414 ns
## 12 Potato concen~ No Wg2 4 4 0 0.029 0.0414 ns
## 13 Potato concen~ pAK80 SK11 4 4 0 0.029 0.0414 ns
## 14 Potato concen~ SK11 Wg2 4 4 16 0.029 0.0414 ns
6.3.5 VISUALIZE
#Name order in plot
$Protease <- factor(dat.clean$Protease,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "33"),
labels =c ("A: Non-inoculated",
"B: MS22418 (PrtP-)",
"C: MS22421 (PrtP/SK11)",
"D: MS22425 (PrtP/Wg2)",
"E: MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Plot - boxplot
<- dat.clean %>%
bxp.Acetaldehyde ggboxplot(.,
x = "Protease",
y = "concentration",
color = "Protease",
facet.by = "Protein")+
labs(, x="Strain", y="Relative amount")+
scale_y_continuous()+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM"))
+
)scale_color_manual(values = c("A: Non-inoculated"="#E89242FF",
"B: MS22418 (PrtP-)"="#82491EFF",
"C: MS22421 (PrtP/SK11)"="#24325FFF",
"D: MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"E: MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)scale_x_discrete(labels=c("A", "B", "C", "D", "E")
+
)theme_light()
bxp.Acetaldehyde
save(bxp.Acetaldehyde, file = "R_objects/Acetaldehyde.RData")
6.4 Benzaldehyde - Protease and Protein - Concentration
6.4.1 DATA
# load data
load("R_objects/stat_test_data_VOC.RData")
# Create subset
<- dat %>% filter(compound %in% c("Benzaldehyde")) dat.clean
6.4.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Protease")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 10 x 6
## Protease Protein variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein concentration 4 352750. 29410.
## 2 No Casein concentration 4 1518. 143.
## 3 SK11 Casein concentration 4 4684. 1498.
## 4 Wg2 Casein concentration 4 120599. 9077.
## 5 pAK80 Casein concentration 4 12228. 15317.
## 6 33 Potato concentration 4 27207. 3723.
## 7 No Potato concentration 4 4013. 1030.
## 8 SK11 Potato concentration 4 69328. 13373.
## 9 Wg2 Potato concentration 4 21156. 1706.
## 10 pAK80 Potato concentration 4 47467. 6500.
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
6.4.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 9
## Protease Protein Sample `First Injection` `MS Quantitation Peak` compound
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 pAK80 Casein pAK80-C4-D6 4 PAK80 C MS Quantitation Peak Benzald~
## 2 No Potato No-P1-D6 NO P1 MS Quantitation Peak Benzald~
## 3 SK11 Potato SK11-P4-D6 4 SK11 P MS Quantitation Peak Benzald~
## # i 3 more variables: concentration <dbl>, is.outlier <lgl>, is.extreme <lgl>
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.843 0.0000628
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Protein, Protease) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 2 x 6
## Protease Protein variable statistic p p.adj
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 pAK80 Casein concentration 0.649 0.00251 0.0251
## 2 No Potato concentration 0.683 0.00714 0.0357
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 9 30 3.41 0.00539
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
6.4.4 PERFORM TEST
The data are not normal distributed. The Shapiro-Wilk normality test found that at least one group was not normal distributed. As a result, for statistical data analysis, the Kruskal-Wallis one-way analysis of variance test is used, followed by a Wilcoxon test for pairwise group comparisons.
# for each protein source
<- dat.clean %>%
res.protein group_by(Protease) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 5 x 8
## Protease .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 33 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0209
## 2 No concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0209
## 3 SK11 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0209
## 4 Wg2 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0209
## 5 pAK80 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0209
<- dat.clean %>%
pwc group_by(Protease) %>%
pairwise_wilcox_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 5 x 10
## Protease .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 33 concent~ Casein Potato 4 4 16 0.029 0.029 *
## 2 No concent~ Casein Potato 4 4 0 0.029 0.029 *
## 3 SK11 concent~ Casein Potato 4 4 0 0.029 0.029 *
## 4 Wg2 concent~ Casein Potato 4 4 16 0.029 0.029 *
## 5 pAK80 concent~ Casein Potato 4 4 0 0.029 0.029 *
# for each strain
<- dat.clean %>%
res.strain group_by(Protein) %>%
kruskal_test(concentration ~ Protease) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 2 x 8
## Protein .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Casein concentration 20 17.5 4 0.00154 Kruskal-Wallis 0.00154
## 2 Potato concentration 20 18.3 4 0.00109 Kruskal-Wallis 0.00154
<- dat.clean %>%
pwc group_by(Protein) %>%
pairwise_wilcox_test(concentration ~ Protease) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 19 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Casein concen~ 33 No 4 4 16 0.029 0.0305 ns
## 2 Casein concen~ 33 pAK80 4 4 16 0.029 0.0305 ns
## 3 Casein concen~ 33 SK11 4 4 16 0.029 0.0305 ns
## 4 Casein concen~ 33 Wg2 4 4 16 0.029 0.0305 ns
## 5 Casein concen~ No pAK80 4 4 0 0.029 0.0305 ns
## 6 Casein concen~ No SK11 4 4 0 0.029 0.0305 ns
## 7 Casein concen~ No Wg2 4 4 0 0.029 0.0305 ns
## 8 Casein concen~ pAK80 Wg2 4 4 0 0.029 0.0305 ns
## 9 Casein concen~ SK11 Wg2 4 4 0 0.029 0.0305 ns
## 10 Potato concen~ 33 No 4 4 16 0.029 0.0305 ns
## 11 Potato concen~ 33 pAK80 4 4 0 0.029 0.0305 ns
## 12 Potato concen~ 33 SK11 4 4 0 0.029 0.0305 ns
## 13 Potato concen~ 33 Wg2 4 4 16 0.029 0.0305 ns
## 14 Potato concen~ No pAK80 4 4 0 0.029 0.0305 ns
## 15 Potato concen~ No SK11 4 4 0 0.029 0.0305 ns
## 16 Potato concen~ No Wg2 4 4 0 0.029 0.0305 ns
## 17 Potato concen~ pAK80 SK11 4 4 0 0.029 0.0305 ns
## 18 Potato concen~ pAK80 Wg2 4 4 16 0.029 0.0305 ns
## 19 Potato concen~ SK11 Wg2 4 4 16 0.029 0.0305 ns
6.4.5 VISUALIZE
#Name order in plot
$Protease <- factor(dat.clean$Protease,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "33"),
labels =c ("A: Non-inoculated",
"B: MS22418 (PrtP-)",
"C: MS22421 (PrtP/SK11)",
"D: MS22425 (PrtP/Wg2)",
"E: MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Plot - boxplot
<- dat.clean %>%
bxp.Benzaldehyde ggboxplot(.,
x = "Protease",
y = "concentration",
color = "Protease",
facet.by = "Protein")+
labs(x="Strain", y="Relative amount")+
scale_y_continuous()+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM"))
+
)scale_color_manual(values = c("A: Non-inoculated"="#E89242FF",
"B: MS22418 (PrtP-)"="#82491EFF",
"C: MS22421 (PrtP/SK11)"="#24325FFF",
"D: MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"E: MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)scale_x_discrete(labels=c("A", "B", "C", "D", "E")
+
)theme_light()
bxp.Benzaldehyde
save(bxp.Benzaldehyde, file = "R_objects/Benzaldehyde.RData")
6.5 Hexanal - Protease and Protein - Concentration
6.5.1 DATA
# load data
load("R_objects/stat_test_data_VOC.RData")
# Create subset
<- dat %>% filter(compound %in% c("Hexanal")) dat.clean
6.5.2 PREPARE VARIABLES
# Set names of variables
<- c("Protein","Protease")
PREDICTOR <- "concentration"
OUTCOME <- NULL
SUBJECT
# Create formula
<- ifelse(length(PREDICTOR) > 1, paste(PREDICTOR, collapse = "*"), PREDICTOR)
PREDICTOR.F <- as.formula(paste(OUTCOME,PREDICTOR.F, sep = " ~ "))
FORMULA
# Summary samples in groups
%>% group_by(across(all_of(PREDICTOR))) %>% get_summary_stats(!!sym(OUTCOME), type = "mean_sd") dat.clean
## # A tibble: 10 x 6
## Protease Protein variable n mean sd
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 33 Casein concentration 4 157. 158.
## 2 No Casein concentration 4 145. 120.
## 3 SK11 Casein concentration 4 84.8 55.0
## 4 Wg2 Casein concentration 4 106. 67.1
## 5 pAK80 Casein concentration 4 53.7 37.5
## 6 33 Potato concentration 4 451. 233.
## 7 No Potato concentration 4 4015. 462.
## 8 SK11 Potato concentration 4 175. 58.9
## 9 Wg2 Potato concentration 4 327. 171.
## 10 pAK80 Potato concentration 4 394. 89.9
# Create plot
<- dat.clean %>%
bxp ggboxplot(x = if_else(length(PREDICTOR) > 1, PREDICTOR[2],PREDICTOR[1]),
y = OUTCOME,
color = PREDICTOR[1],
facet.by = if(length(PREDICTOR) == 3) PREDICTOR[3],
palette = "jco")
bxp
6.5.3 PRELIMINARY TEST
Identify outliers
# Test for outliers
%>%
dat.clean group_by(across(all_of(PREDICTOR))) %>%
identify_outliers(!!sym(OUTCOME))
## # A tibble: 3 x 9
## Protease Protein Sample `First Injection` `MS Quantitation Peak` compound
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 No Casein No-C2-D6 NO C2 MS Quantitation Peak Hexanal
## 2 Wg2 Casein Wg2-C4-D6 4 W2G C MS Quantitation Peak Hexanal
## 3 33 Potato 33-P2-D6 33P2 MS Quantitation Peak Hexanal
## # i 3 more variables: concentration <dbl>, is.outlier <lgl>, is.extreme <lgl>
Check normality
QQ plot and Shapiro-Wilk test of normality are used to analyze the model residuals.
# Build the linear model
<- lm(FORMULA, data = dat.clean)
model
# Create a QQ plot of residuals
ggqqplot(residuals(model))
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.897 0.00161
# Compute Shapiro-Wilk test of normality for data groups
%>%
dat.clean group_by(Protein, Protease, compound) %>%
shapiro_test(vars = "concentration") %>%
adjust_pvalue(method = "fdr") %>%
filter(p.adj < 0.05)
## # A tibble: 0 x 7
## # i 7 variables: Protease <chr>, Protein <chr>, compound <chr>, variable <chr>,
## # statistic <dbl>, p <dbl>, p.adj <dbl>
Test homogneity of variance assumption
plot(model, 1)
%>% levene_test(FORMULA) dat.clean
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 9 30 2.83 0.0154
# Save result
<- dat.clean %>% levene_test(FORMULA) %>% pull(p) > 0.05 EQUAL.VAR
6.5.4 PERFORM TEST
The data are not normal distributed, but normality is found within each group by the Shapiro-Wilk normality test. However, the same groups of the VOC data were not normal distributed. Therefore, the analysis of this data is performed as a Kruskal-Wallis test following a Wilcoxon pairwise test.
# for each protein source
<- dat.clean %>%
res.protein group_by(Protease) %>%
kruskal_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.protein
## # A tibble: 3 x 8
## Protease .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 No concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0522
## 2 Wg2 concentration 8 4.08 1 0.0433 Kruskal-Wallis 0.0722
## 3 pAK80 concentration 8 5.33 1 0.0209 Kruskal-Wallis 0.0522
<- dat.clean %>%
pwc group_by(Protease) %>%
pairwise_wilcox_test(concentration ~ Protein) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 2 x 10
## Protease .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 No concen~ Casein Potato 4 4 0 0.029 0.0725 *
## 2 pAK80 concen~ Casein Potato 4 4 0 0.029 0.0725 *
# for each strain
<- dat.clean %>%
res.strain group_by(Protein) %>%
kruskal_test(concentration ~ Protease) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) res.strain
## # A tibble: 1 x 8
## Protein .y. n statistic df p method p.adj
## <chr> <chr> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 Potato concentration 20 12.6 4 0.0133 Kruskal-Wallis 0.0266
<- dat.clean %>%
pwc filter(Protein == "Potato") %>%
group_by(Protein) %>%
pairwise_wilcox_test(concentration ~ Protease) %>%
adjust_pvalue(method = "fdr")
%>% filter(p < 0.05) pwc
## # A tibble: 5 x 10
## Protein .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Potato concentr~ 33 No 4 4 0 0.029 0.058 ns
## 2 Potato concentr~ No pAK80 4 4 16 0.029 0.058 ns
## 3 Potato concentr~ No SK11 4 4 16 0.029 0.058 ns
## 4 Potato concentr~ No Wg2 4 4 16 0.029 0.058 ns
## 5 Potato concentr~ pAK80 SK11 4 4 16 0.029 0.058 ns
6.5.5 VISUALIZE
#Name order in plot
$Protease <- factor(dat.clean$Protease,
dat.cleanlevels = c("No", "pAK80", "SK11", "Wg2", "33"),
labels =c ("A: Non-inoculated",
"B: MS22418 (PrtP-)",
"C: MS22421 (PrtP/SK11)",
"D: MS22425 (PrtP/Wg2)",
"E: MS22427 (PrtP/MS22333)"))
$Protein <- factor(dat.clean$Protein, levels = c("Casein", "Potato"))
dat.clean
#Plot - boxplot
<- dat.clean %>%
bxp.Hexanal ggboxplot(.,
x = "Protease",
y = "concentration",
color = "Protease",
facet.by = "Protein")+
labs(x="Strain", y="Relative amount")+
scale_y_continuous()+
facet_wrap(~Protein,
nrow=1,
labeller = as_labeller(c("Casein"="CCDM",
"Potato"="PCDM"))
+
)scale_color_manual(values = c("A: Non-inoculated"="#E89242FF",
"B: MS22418 (PrtP-)"="#82491EFF",
"C: MS22421 (PrtP/SK11)"="#24325FFF",
"D: MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"E: MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)scale_x_discrete(labels=c("A", "B", "C", "D", "E")
+
)theme_light()
bxp.Hexanal
save(bxp.Hexanal, file = "R_objects/Hexanal.RData")
6.6 VISUALIZE: All VOC plots
load(file = "R_objects/Acetaldehyde.RData")
load(file = "R_objects/Benzaldehyde.RData")
load(file = "R_objects/Hexanal.RData")
<- ggarrange(bxp.Acetaldehyde, bxp.Benzaldehyde, bxp.Hexanal, heights = c(1,1,1),
fig5 ncol = 1,
labels = c("A","B","C"),
align = "v",
legend = "right", common.legend = TRUE
)
#Save Figure 5
ggsave(filename = "plots/fig5_VOCs.svg", plot = fig5, width = 16.5, height = 20, units =c("cm"), dpi = 600)
Figure 5
7 DATA SET: Milk acidification
7.1 DATA
7.1.1 LOAD
<- read_excel("./data/Data_collection.xlsx", sheet="Supl_fig_rep20191013", na = "NA")
dat1 <- read_excel("./data/Data_collection.xlsx", sheet="Supl_fig_rep20191031", na = "NA") dat2
7.1.2 LOOK
# Take a glimpse
glimpse(dat1)
## Rows: 602
## Columns: 7
## $ Time <dbl> 0.00000000, 0.03333333, 0.06666667, 0.10000000, 0.13333~
## $ pAK80 <dbl> 6.701807, 6.691638, 6.690138, 6.692273, 6.693289, 6.694~
## $ `PrtP/Wg2` <dbl> 6.674124, 6.674505, 6.674023, 6.673349, 6.672475, 6.671~
## $ `PrtP/SK11` <dbl> 6.663352, 6.664178, 6.663682, 6.662663, 6.661853, 6.660~
## $ `PrtP/MS22333` <dbl> 6.678489, 6.679384, 6.678436, 6.677651, 6.676611, 6.674~
## $ Milk <dbl> 6.701210, 6.700569, 6.699893, 6.699427, 6.698843, 6.698~
## $ Milk_with_ery <dbl> 6.710694, 6.714046, 6.716038, 6.717559, 6.718817, 6.719~
glimpse(dat2)
## Rows: 1,353
## Columns: 7
## $ Time <dbl> 0.00000000, 0.01666667, 0.03333333, 0.05000000, 0.06666~
## $ pAK80 <dbl> 6.611264, 6.614421, 6.616817, 6.619895, 6.623252, 6.626~
## $ `PrtP/Wg2` <dbl> 6.679517, 6.686099, 6.688319, 6.690095, 6.690137, 6.689~
## $ `PrtP/SK11` <dbl> 6.699851, 6.699850, 6.699850, 6.699849, 6.699849, 6.699~
## $ `PrtP/MS22333` <dbl> 6.667879, 6.665835, 6.664838, 6.663531, 6.662640, 6.661~
## $ Milk <dbl> 6.652904, 6.659897, 6.663321, 6.666914, 6.668894, 6.670~
## $ Milk_with_ery <dbl> 6.696377, 6.695598, 6.694867, 6.694411, 6.693895, 6.693~
# Make a explorative summary
::skim(dat1) skimr
Name | dat1 |
Number of rows | 602 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Time | 0 | 1 | 10.02 | 5.80 | 0.00 | 5.01 | 10.02 | 15.03 | 20.03 | <U+2587><U+2587><U+2587><U+2587><U+2587> |
pAK80 | 0 | 1 | 6.34 | 0.22 | 5.99 | 6.15 | 6.32 | 6.54 | 6.70 | <U+2587><U+2587><U+2586><U+2586><U+2587> |
PrtP/Wg2 | 0 | 1 | 5.06 | 0.97 | 4.15 | 4.23 | 4.48 | 6.11 | 6.67 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
PrtP/SK11 | 0 | 1 | 5.08 | 0.98 | 4.12 | 4.22 | 4.51 | 6.17 | 6.66 | <U+2587><U+2581><U+2581><U+2581><U+2583> |
PrtP/MS22333 | 0 | 1 | 5.00 | 1.02 | 4.02 | 4.10 | 4.48 | 6.11 | 6.68 | <U+2587><U+2582><U+2581><U+2581><U+2583> |
Milk | 0 | 1 | 6.49 | 0.23 | 5.95 | 6.35 | 6.57 | 6.69 | 6.70 | <U+2582><U+2581><U+2581><U+2582><U+2587> |
Milk_with_ery | 0 | 1 | 6.72 | 0.00 | 6.71 | 6.72 | 6.72 | 6.72 | 6.72 | <U+2581><U+2587><U+2583><U+2585><U+2587> |
::skim(dat2) skimr
Name | dat2 |
Number of rows | 1353 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Time | 0 | 1 | 11.27 | 6.51 | 0.00 | 5.63 | 11.27 | 16.90 | 22.54 | <U+2587><U+2587><U+2587><U+2587><U+2587> |
pAK80 | 0 | 1 | 6.19 | 0.35 | 5.61 | 5.84 | 6.25 | 6.52 | 6.65 | <U+2586><U+2582><U+2583><U+2585><U+2587> |
PrtP/Wg2 | 0 | 1 | 4.90 | 0.94 | 4.12 | 4.19 | 4.34 | 5.65 | 6.69 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
PrtP/SK11 | 0 | 1 | 4.90 | 0.96 | 4.09 | 4.16 | 4.33 | 5.73 | 6.70 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
PrtP/MS22333 | 0 | 1 | 4.76 | 0.97 | 3.99 | 4.03 | 4.19 | 5.41 | 6.67 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
Milk | 0 | 1 | 5.77 | 0.65 | 4.77 | 5.18 | 5.73 | 6.39 | 6.68 | <U+2586><U+2585><U+2585><U+2583><U+2587> |
Milk_with_ery | 0 | 1 | 6.40 | 0.36 | 5.77 | 5.95 | 6.65 | 6.69 | 6.70 | <U+2583><U+2581><U+2581><U+2581><U+2587> |
7.1.3 CLEAN DATA
#Reorganize data
<- dat1 %>%
dat1 pivot_longer(.,cols=2:7, names_to="Sample", values_to = "Rep1")
<- dat2 %>%
dat2 pivot_longer(.,cols=2:7, names_to="Sample", values_to = "Rep2")
<- full_join(dat1, dat2, by=c("Time", "Sample"))
dat
=pivot_longer(dat, cols=3:4, names_to = "Replicate", values_to = "pH")
dat
<- dat %>%
dat filter(., !is.na(pH))
<- filter(dat, Sample %in% c("pAK80", "PrtP/Wg2", "PrtP/SK11", "PrtP/MS22333", "Milk_with_ery"))
dat.clean
# Look at cleaned data
::skim(dat.clean) skimr
Name | dat.clean |
Number of rows | 9775 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sample | 0 | 1 | 5 | 13 | 0 | 5 | 0 |
Replicate | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Time | 0 | 1 | 10.88 | 6.32 | 0.00 | 5.42 | 10.85 | 16.28 | 22.54 | <U+2587><U+2587><U+2587><U+2587><U+2586> |
pH | 0 | 1 | 5.49 | 1.06 | 3.99 | 4.25 | 5.84 | 6.57 | 6.72 | <U+2587><U+2581><U+2581><U+2583><U+2587> |
7.2 VISUALIZE
$Sample <- factor(dat.clean$Sample,
dat.cleanlevels = c("Milk_with_ery", "pAK80", "PrtP/SK11", "PrtP/Wg2", "PrtP/MS22333"),
labels =c ("Non-inoculated",
"MS22418 (PrtP-)",
"MS22421 (PrtP/SK11)",
"MS22425 (PrtP/Wg2)",
"MS22427 (PrtP/MS22333)"))
<- dat.clean %>%
p.out ggplot(., aes(x=Time, y=pH, color=Sample, linetype=Replicate))+
geom_line(size=1)+
ylim(3.5,7)+
theme_light()+
scale_color_manual(values = c("Non-inoculated"="#E89242FF",
"MS22418 (PrtP-)"="#82491EFF",
"MS22421 (PrtP/SK11)"="#24325FFF",
"MS22425 (PrtP/Wg2)"="#B7E4F9FF",
"MS22427 (PrtP/MS22333)"="#FB6467FF")
+
)labs(color="Strain")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
# SAVE plot
ggsave(filename = "plots/Milk_acidification.svg", plot = p.out, width = 16.5, height = 9, units =c("cm"), dpi = 600)
Figure S2
8 INFO
8.1 FINAL COMMENT
This completes the fundamental statistical test of the sample variables.
8.2 REFERENCES
These are the references of R and the applied R packages.
citation()
##
## To cite R in publications use:
##
## R Core Team (2021). R: A language and environment for statistical
## computing. R Foundation for Statistical Computing, Vienna, Austria.
## URL https://www.R-project.org/.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2021},
## url = {https://www.R-project.org/},
## }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
citation("tidyverse")
##
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
## Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
## K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
## tidyverse." _Journal of Open Source Software_, *4*(43), 1686. doi:
## 10.21105/joss.01686 (URL: https://doi.org/10.21105/joss.01686).
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }
citation("rstatix")
##
## To cite package 'rstatix' in publications use:
##
## Alboukadel Kassambara (2023). rstatix: Pipe-Friendly Framework for
## Basic Statistical Tests. R package version 0.7.2.
## https://CRAN.R-project.org/package=rstatix
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {rstatix: Pipe-Friendly Framework for Basic Statistical Tests},
## author = {Alboukadel Kassambara},
## year = {2023},
## note = {R package version 0.7.2},
## url = {https://CRAN.R-project.org/package=rstatix},
## }
citation("pheatmap")
##
## To cite package 'pheatmap' in publications use:
##
## Raivo Kolde (2019). pheatmap: Pretty Heatmaps. R package version
## 1.0.12. https://CRAN.R-project.org/package=pheatmap
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {pheatmap: Pretty Heatmaps},
## author = {Raivo Kolde},
## year = {2019},
## note = {R package version 1.0.12},
## url = {https://CRAN.R-project.org/package=pheatmap},
## }
##
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
citation("vegan")
##
## To cite package 'vegan' in publications use:
##
## Jari Oksanen, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt,
## Pierre Legendre, Peter R. Minchin, R.B. O'Hara, Peter Solymos, M.
## Henry H. Stevens, Eduard Szoecs, Helene Wagner, Matt Barbour, Michael
## Bedward, Ben Bolker, Daniel Borcard, Gustavo Carvalho, Michael
## Chirico, Miquel De Caceres, Sebastien Durand, Heloisa Beatriz
## Antoniazi Evangelista, Rich FitzJohn, Michael Friendly, Brendan
## Furneaux, Geoffrey Hannigan, Mark O. Hill, Leo Lahti, Dan McGlinn,
## Marie-Helene Ouellette, Eduardo Ribeiro Cunha, Tyler Smith, Adrian
## Stier, Cajo J.F. Ter Braak and James Weedon (2022). vegan: Community
## Ecology Package. R package version 2.6-4.
## https://CRAN.R-project.org/package=vegan
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {vegan: Community Ecology Package},
## author = {Jari Oksanen and Gavin L. Simpson and F. Guillaume Blanchet and Roeland Kindt and Pierre Legendre and Peter R. Minchin and R.B. O'Hara and Peter Solymos and M. Henry H. Stevens and Eduard Szoecs and Helene Wagner and Matt Barbour and Michael Bedward and Ben Bolker and Daniel Borcard and Gustavo Carvalho and Michael Chirico and Miquel {De Caceres} and Sebastien Durand and Heloisa Beatriz Antoniazi Evangelista and Rich FitzJohn and Michael Friendly and Brendan Furneaux and Geoffrey Hannigan and Mark O. Hill and Leo Lahti and Dan McGlinn and Marie-Helene Ouellette and Eduardo {Ribeiro Cunha} and Tyler Smith and Adrian Stier and Cajo J.F. {Ter Braak} and James Weedon},
## year = {2022},
## note = {R package version 2.6-4},
## url = {https://CRAN.R-project.org/package=vegan},
## }
citation("kableExtra")
##
## To cite package 'kableExtra' in publications use:
##
## Hao Zhu (2021). kableExtra: Construct Complex Table with 'kable' and
## Pipe Syntax. R package version 1.3.4.
## https://CRAN.R-project.org/package=kableExtra
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
## author = {Hao Zhu},
## year = {2021},
## note = {R package version 1.3.4},
## url = {https://CRAN.R-project.org/package=kableExtra},
## }
citation("readxl")
##
## To cite package 'readxl' in publications use:
##
## Hadley Wickham and Jennifer Bryan (2022). readxl: Read Excel Files. R
## package version 1.4.1. https://CRAN.R-project.org/package=readxl
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {readxl: Read Excel Files},
## author = {Hadley Wickham and Jennifer Bryan},
## year = {2022},
## note = {R package version 1.4.1},
## url = {https://CRAN.R-project.org/package=readxl},
## }
citation("ggpubr")
##
## To cite package 'ggpubr' in publications use:
##
## Alboukadel Kassambara (2023). ggpubr: 'ggplot2' Based Publication
## Ready Plots. R package version 0.6.0.
## https://CRAN.R-project.org/package=ggpubr
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggpubr: 'ggplot2' Based Publication Ready Plots},
## author = {Alboukadel Kassambara},
## year = {2023},
## note = {R package version 0.6.0},
## url = {https://CRAN.R-project.org/package=ggpubr},
## }
citation("ggsci")
##
## To cite package 'ggsci' in publications use:
##
## Nan Xiao (2018). ggsci: Scientific Journal and Sci-Fi Themed Color
## Palettes for 'ggplot2'. R package version 2.9.
## https://CRAN.R-project.org/package=ggsci
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggsci: Scientific Journal and Sci-Fi Themed Color Palettes for
## 'ggplot2'},
## author = {Nan Xiao},
## year = {2018},
## note = {R package version 2.9},
## url = {https://CRAN.R-project.org/package=ggsci},
## }
citation("scales")
##
## To cite package 'scales' in publications use:
##
## Hadley Wickham and Dana Seidel (2022). scales: Scale Functions for
## Visualization. R package version 1.2.1.
## https://CRAN.R-project.org/package=scales
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {scales: Scale Functions for Visualization},
## author = {Hadley Wickham and Dana Seidel},
## year = {2022},
## note = {R package version 1.2.1},
## url = {https://CRAN.R-project.org/package=scales},
## }
citation("ggrepel")
##
## To cite package 'ggrepel' in publications use:
##
## Kamil Slowikowski (2023). ggrepel: Automatically Position
## Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.3.
## https://CRAN.R-project.org/package=ggrepel
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggrepel: Automatically Position Non-Overlapping Text Labels with
## 'ggplot2'},
## author = {Kamil Slowikowski},
## year = {2023},
## note = {R package version 0.9.3},
## url = {https://CRAN.R-project.org/package=ggrepel},
## }
8.3 SESSION INFO
The analysis was run in the following environment:
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
## [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
## [5] LC_TIME=Danish_Denmark.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggrepel_0.9.3 pheatmap_1.0.12 scales_1.2.1 ggsci_2.9
## [5] ggpubr_0.6.0 kableExtra_1.3.4 vegan_2.6-4 lattice_0.20-44
## [9] permute_0.9-7 rstatix_0.7.2 forcats_1.0.0 stringr_1.5.0
## [13] dplyr_1.1.0 purrr_1.0.1 readr_2.1.3 tidyr_1.3.0
## [17] tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2 readxl_1.4.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-152 fs_1.6.0 lubridate_1.9.1
## [4] webshot_0.5.5 RColorBrewer_1.1-3 httr_1.4.7
## [7] repr_1.1.6 tools_4.1.1 backports_1.4.1
## [10] bslib_0.5.1 utf8_1.2.2 R6_2.5.1
## [13] DBI_1.1.3 mgcv_1.8-36 colorspace_2.1-0
## [16] withr_2.5.0 gridExtra_2.3 tidyselect_1.2.0
## [19] compiler_4.1.1 textshaping_0.3.6 cli_3.6.0
## [22] rvest_1.0.3 xml2_1.3.3 labeling_0.4.2
## [25] bookdown_0.32 sass_0.4.7 systemfonts_1.0.4
## [28] digest_0.6.31 rmarkdown_2.24 svglite_2.1.1
## [31] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.4
## [34] highr_0.10 dbplyr_2.3.3 fastmap_1.1.0
## [37] rlang_1.0.6 rstudioapi_0.15.0 farver_2.1.1
## [40] jquerylib_0.1.4 generics_0.1.3 jsonlite_1.8.4
## [43] car_3.1-2 googlesheets4_1.0.1 magrittr_2.0.3
## [46] Matrix_1.5-3 Rcpp_1.0.10 munsell_0.5.0
## [49] fansi_1.0.4 abind_1.4-5 lifecycle_1.0.3
## [52] stringi_1.7.12 yaml_2.3.7 carData_3.0-5
## [55] MASS_7.3-54 grid_4.1.1 parallel_4.1.1
## [58] crayon_1.5.2 cowplot_1.1.1 haven_2.5.1
## [61] splines_4.1.1 hms_1.1.3 knitr_1.42
## [64] pillar_1.9.0 ggsignif_0.6.4 reprex_2.0.2
## [67] glue_1.6.2 evaluate_0.21 modelr_0.1.11
## [70] vctrs_0.5.2 rmdformats_1.0.4 tzdb_0.3.0
## [73] cellranger_1.1.0 gtable_0.3.1 cachem_1.0.6
## [76] xfun_0.36 skimr_2.1.5 broom_1.0.5
## [79] ragg_1.2.5 googledrive_2.0.0 viridisLite_0.4.2
## [82] gargle_1.2.1 cluster_2.1.2 timechange_0.2.0