# Statistics Courses

## Current training courses in statistics, data mining and machine learning for soil scientists, agricultural scientists, environmental scientists and natural scientists in 2024

- Two-day statistics intensive courses I and II for soil scientists from March 7th until March 8th 2024 (in English)
- Two-day statistics intensive course III for soil scientists from March 13th until March 14th 2024 (in English)
- One-day statistics intensive course IV for soil scientists on March 15th 2024 (in English)

## Background information on training courses (seminars, further education and training) in applied statistics, data mining and machine learning

The media and scientific journals across disciplines repeatedly report on flawed research caused by insufficient statistical knowledge (see, e.g., Ainsworth (2007, Nature 448, 849)). This suggests a demand for further statistical training in many areas where statistical analyses are required.

The Department of Environmental Chemistry regularly offers training in applied statistics, data mining and machine learning for soil scientists, agricultural scientists, environmental scientists and natural scientists. The following courses raise awareness of statistical pitfalls and offer training in different subareas:

###### I Fundamentals of statistics and introduction to R

###### II Exploratory statistics: statistical modelling and regressions using R

###### III Analyses of variance using R

###### IV Multivariate statistics I: PCA, PCR, PLSR and cluster analyses using R

###### V Multivariate statistics II, data mining and machine learning

## I. Fundamentals of statistics and introduction to R

Typical problem areas in the applied sciences include (I) research without hypotheses; (II) inappropriate experimental design; (III) a lack of understanding of pseudoreplication; (IV) inappropriate handling of outliers; (V) failure to check the assumptions of hypothesis tests; and (VI) insufficient description of statistical analyses in publications. Important training topics in the field of fundamental statistics are thus:

- Fundamentals of descriptive statistics
- Boxplots, histograms and Q-Q plots
- Distributions
- Scales

- Observational study vs. randomized controlled experiment
- Experimental designs
- Fundamentals of experimental design
- True replicates vs. pseudoreplicates
- Dealing with pseudoreplicates

- Fundamentals of inferential statistics
- Population and sample
- Tests of normality (e.g., the Shapiro-Wilk test) and of variance homogeneity (e.g., the F test)
- Confidence intervals
- Classical tests
- One-sample tests (t-test, Wilcoxon signed rank test)
- Two-sample tests for two independent samples (two-sample t-test, Welch test, Wilcoxon rank sum test)
- Tests for paired samples (paired t-test, matched-pairs Wilcoxon test)

- Correlations between variables of metric or ordinal data (Pearson correlation, Spearman rank correlation)
- Partial correlations
- Chi-square homogeneity test and chi-square goodness of fit test

R and RStudio are powerful tools in the field of applied statistics. Important training topics are the handling of scalars, vectors, matrices and data frames; the reading and writing of data; and the carrying out of the tests given above.
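
To make one of the classical tests listed above concrete, the following is a minimal sketch of the Welch test statistic for two independent samples with unequal variances. It is written in plain Python purely for illustration (the courses themselves use R), the function name `welch_t` and the measurement values are hypothetical, and the p-value is left out because it requires the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two sample means
    se2_x = variance(x) / nx
    se2_y = variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (nx - 1) + se2_y ** 2 / (ny - 1))
    return t, df


# Hypothetical measurements for two independent groups
group_a = [5.1, 4.9, 5.6, 5.8, 5.2]
group_b = [4.2, 4.8, 4.4, 4.1, 4.6, 4.3]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The degrees of freedom always lie between the smaller group size minus one and the pooled value `nx + ny - 2`; they reach the pooled value only when both groups have the same size and the same variance.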

**Selection of references**

**Textbooks**

- Dormann, C. 2020. Environmental Data Analysis. Springer, New York.
- Jones, E., Harden, S., Crawley, M.J. 2022. The R Book. Third Edition. Wiley, West Sussex.
- Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression, CRC Press, Boca Raton.

**Articles**

- Ludwig, B., Linsler, D., Höper, H., Schmidt, H., Piepho, H.-P., Vohland, M. 2016. Pitfalls in the use of middle-infrared spectroscopy: representativeness and ranking criteria for the estimation of soil properties. Geoderma **268**, 165-175.
- Piepho, H.P., Möhring, J., Williams, E.R. 2013. Why randomize agricultural experiments? Journal of Agronomy and Crop Science **199**, 374-383.

## II. Exploratory statistics: statistical modelling and regressions using R

Typical problem areas in this field include (I) a lack of knowledge of the importance of residual inspections; (II) a lack of understanding of the differences between a minimal adequate model and a maximal model; (III) a lack of knowledge of the differences between calibration, cross-validation and validation of a model; and (IV) a lack of understanding of important special topics such as Box-Cox transformations, polynomial and logistic regressions and model comparisons. Important training topics are thus:

- Comparison of correlation and regression
- Simple and multiple linear regressions
- Residual inspections
- Model simplifications
- Model criticism

- Statistical modelling: saturated model, maximal model, minimal adequate model and null model
- Lack of fit test
- Transformations (e.g., Box-Cox transformation)
- Model formulae in R
- Dealing with variability and predictions
- Linear models and matrices
- Non-linear regression
- Logistic regression
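
The difference between calibration and cross-validation mentioned above can be sketched with a simple linear regression. The example below is an illustrative Python sketch (the courses use R), with hypothetical data and helper names (`fit_line`, `rmse`). It fits the line on all data, inspects the residuals, and then repeats the fit in a leave-one-out cross-validation.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return (sum(e ** 2 for e in errors) / len(errors)) ** 0.5


# Hypothetical predictor and response values
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

# Calibration: fit on all data and inspect the residuals
a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
rmse_cal = rmse(residuals)

# Leave-one-out cross-validation: predict each point from a fit without it
loo_errors = []
for i in range(len(x)):
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo_errors.append(y[i] - (a_i + b_i * x[i]))
rmse_cv = rmse(loo_errors)

print(f"RMSE calibration = {rmse_cal:.3f}, RMSE cross-validation = {rmse_cv:.3f}")
```

For ordinary least squares the cross-validation error is never smaller than the calibration error, which is why a calibration result alone gives an optimistic picture of predictive performance.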

**Selection of references**

**Textbooks**

- Jones, E., Harden, S., Crawley, M.J. 2022. The R Book. Third Edition. Wiley, West Sussex.
- Mead, R., Curnow, R.N., Hasted, A.M. 2002. Statistical Methods in Agriculture and Experimental Biology. Third Edition. Chapman & Hall/CRC, Boca Raton.
- Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression, CRC Press, Boca Raton.

**Articles**

- Linsler, D., Nüsse, A., Buchen, C., Helfrich, M., Piepho, H.-P., Ludwig, B. 2018. Effects of chemical and physical grassland renovation on the temporal dynamics of organic carbon stocks and water-stable aggregate distribution in a temperate grassland soil. Soil Use and Management **34**, 490-499.
- Piepho, H.P. 2009. Data transformation in statistical analysis of field trials with changing treatment variance. Agronomy Journal **101**, 865-869.

## III. Analyses of variance using R

Typical problem areas in analyses of variance include (I) a lack of knowledge of the great importance of statistical independence of data as a condition for analyses of variance (dealing with spatially and/or temporally dependent data); (II) a lack of understanding of the importance of residual inspections and of how to deal with non-normality or variance heterogeneity; (III) research without hypotheses that focuses on mechanically applied post-hoc tests; (IV) a lack of knowledge of how to handle unbalanced designs; (V) a lack of understanding of how to handle more complicated designs (split plot); and (VI) inaccuracies in factor formulations. Training topics are thus:

- Fundamentals of one-way analysis of variance (ANOVA)
- Conditions and calculation
- Structure of ANOVA tables
- Residual inspections

- Post-hoc tests
- Multiple mean comparisons (pairwise t-tests with a correction for multiple testing, Tukey's HSD test, Fisher's LSD test)
- Problems with multiple mean comparisons and research without hypotheses

- Welch's ANOVA and Kruskal-Wallis test
- Multi-way ANOVA
- Consideration of block effects
- Importance of interactions of factors
- Model simplifications

- Contrasts instead of multiple mean comparisons
- Formulating factors and unbalanced models
- Combined ANOVA and regression analysis
- Split-plots
- Introduction to mixed effects models
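
The structure of a one-way ANOVA table listed above can be illustrated by computing its entries by hand. The following is a minimal Python sketch for illustration only (the courses use R); the function name `one_way_anova` and the yield values are hypothetical, and the p-value is omitted because it requires the F distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dictionary."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares: values around their group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": ms_between / ms_within,
    }


# Hypothetical yields for three treatments with unequal group sizes
treatments = [
    [4.1, 4.5, 4.3, 4.7],
    [5.0, 5.4, 5.1],
    [4.6, 4.9, 4.8, 5.0],
]
table = one_way_anova(treatments)
print(table)
```

The F value is only meaningful if the observations are independent, so pseudoreplicates would have to be aggregated or modelled (for example with a mixed effects model) before such a table is interpreted.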

**Selection of references**

**Textbooks**

- Bretz, F., Hothorn, T., Westfall, P. 2011. Multiple comparisons using R. CRC Press, Boca Raton.
- Jones, E., Harden, S., Crawley, M.J. 2022. The R Book. Third Edition. Wiley, West Sussex.
- Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression, CRC Press, Boca Raton.

**Articles**

- Kozak, M., Piepho, H.P. 2017. What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. Journal of Agronomy and Crop Science **203**.
- Ludwig, B., Song, X., Gunina, A., Greenberg, I., Dippold, M.A., Piepho, H.P. 2021. Importance of sources of variability, scales and experimental design: A case study about the effects of biochar and slurry application on soil properties in agricultural silty loam soils. European Journal of Soil Science **72**, 1954-1968. https://doi.org/10.1111/ejss.13120
- Vormstein, S., Kaiser, M., Piepho, H.-P., Ludwig, B. 2020. Aggregate formation and organo-mineral association affect characteristics of soil organic matter across soil horizons and parent materials in temperate broadleaf forest. Biogeochemistry **148**, 169-189.

## IV. Multivariate statistics I: PCA, PCR, PLSR and cluster analyses using R

Typical problem areas in multivariate statistics include (I) a lack of knowledge of the opportunities and limitations of multivariate approaches; and (II) insufficient description of the multivariate analyses in publications. Training topics are thus:

- Matrix operations
- Calculation of eigenvalues and eigenvectors
- Centering and z-transformation
- Variance-covariance and correlation
- Unsupervised learning: principal component analysis (PCA)
- Calculations, presentations and interpretations

- Covariance, correlation and Euclidean distance
- Unsupervised learning: partitioning and hierarchical cluster analyses

- Supervised learning: principal component regression (PCR)
- Pretreatment of data (use of the Savitzky-Golay filter)
- Supervised learning: Partial least squares regression (PLSR)
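
The links between z-transformation, correlation, eigenvalues and PCA listed above can be shown for the smallest possible case of two variables. The sketch below is illustrative Python (the courses use R) with hypothetical soil data and helper names. It relies on the fact that the correlation matrix of two variables, [[1, r], [r, 1]], has the eigenvalues 1 + r and 1 - r with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).

```python
from math import sqrt
from statistics import mean, stdev, variance


def z_transform(values):
    """Center the values and scale them to unit standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    """Pearson correlation computed from z-transformed values."""
    zx, zy = z_transform(x), z_transform(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical clay contents (%) and organic carbon contents (%)
clay = [12.0, 18.0, 25.0, 31.0, 40.0]
soc = [0.9, 1.4, 1.6, 2.3, 2.8]

r = pearson_r(clay, soc)

# Eigenvalues of the 2x2 correlation matrix, largest first
eigenvalues = sorted([1 + r, 1 - r], reverse=True)
explained = [ev / sum(eigenvalues) for ev in eigenvalues]

# Loadings of the first principal component; the sign follows the correlation
sign = 1.0 if r >= 0 else -1.0
loadings_pc1 = (1 / sqrt(2), sign / sqrt(2))

# Scores: projection of the z-transformed data onto the first component
z_clay, z_soc = z_transform(clay), z_transform(soc)
scores_pc1 = [loadings_pc1[0] * a + loadings_pc1[1] * b for a, b in zip(z_clay, z_soc)]

print(f"r = {r:.3f}, explained variance = {[round(e, 3) for e in explained]}")
print(f"variance of PC1 scores = {variance(scores_pc1):.3f}")
```

The variance of the scores on the first component equals the largest eigenvalue, which is what the proportion of explained variance in a PCA summary refers to.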

**Selection of references**

**Textbooks**

- Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
- Lantz, B. 2019. Machine Learning with R. Packt Publishing, Birmingham.
- Wehrens, R. 2020. Chemometrics with R. Second Edition. Springer, New York.

**Articles**

- Ludwig, B., Vormstein, S., Niebuhr, J., Heinze, S., Marschner, B., Vohland, M. 2017. Usefulness of near infrared spectroscopy for an estimation of general soil properties and enzyme activities for two forest sites along three transects. Geoderma **288**, 37-46.
- Ludwig, B., Wölfel, P., Greenberg, I., Piepho, H.-P., Spörlein, P. 2022. Application of mixed-effects modelling and rule-based models to explain copper variation in soil profiles of southern Germany. European Journal of Soil Science **73**, e13258. doi.org/10.1111/ejss.13258

## V. Multivariate statistics II, data mining and machine learning

Typical problem areas in multivariate statistics, data mining and machine learning include (I) a lack of knowledge of the benefits and the limitations of these approaches; (II) insufficient consideration of the importance of sample size and variability when selecting the algorithm; and (III) overfitting. Training topics in the areas of classification and regression are thus:

- Factor analyses
- Perceptron
- Hard-margin and soft-margin support vector machines
- Artificial neural networks
- Random forest
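
As an illustration of the simplest classifier in the list above, the following is a minimal sketch of the classical perceptron learning rule. It is written in plain Python for illustration only, the function names and the toy data are hypothetical, and the two classes were chosen to be linearly separable so that the algorithm is guaranteed to converge.

```python
def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=1000):
    """Train a perceptron; labels must be -1 or +1. Returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        n_errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # Move the decision boundary towards the misclassified sample
                weights = [w + learning_rate * target * xi for w, xi in zip(weights, x)]
                bias += learning_rate * target
                n_errors += 1
        if n_errors == 0:
            break  # every training sample is classified correctly
    return weights, bias


def predict(weights, bias, x):
    """Classify a single sample as -1 or +1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Hypothetical, linearly separable toy data with two features per sample
samples = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5), (4.0, 4.5), (5.0, 4.0), (4.5, 5.5)]
labels = [-1, -1, -1, 1, 1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(weights, bias, predictions)
```

A perfect fit on the training data, as obtained here, says nothing about performance on new samples; with small or noisy data sets an independent validation is needed to detect overfitting.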

**Selection of references**

**Textbooks**

- Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
- Lantz, B. 2019. Machine Learning with R. Packt Publishing, Birmingham.
- Wehrens, R. 2020. Chemometrics with R. Second Edition. Springer, New York.

**Articles**

- Cawley, G.C., Talbot, N.L.C. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research **11**, 2079-2107.
- Ludwig, B., Murugan, R., Parama, V.R.R., Vohland, M. 2019. Accuracy of estimating soil properties with mid-infrared spectroscopy: implications of different chemometric approaches and software packages related to calibration sample size. Soil Science Society of America Journal **83**, 1542-1552.
- Ludwig, B., Greenberg, I., Vohland, M., Michel, K. 2023. Optimised use of data fusion and memory-based learning with an Austrian soil library for predictions with infrared data. European Journal of Soil Science **74**, e13394. doi.org/10.1111/ejss.13394