# Statistics Courses

## Current training courses in statistics, data mining and machine learning for soil scientists, agricultural scientists, environmental scientists and natural scientists in 2020

## Background information on training courses (seminars, further education and training) in applied statistics, data mining and machine learning

The media and scientific journals of different disciplines repeatedly address the topic of erroneous research caused by insufficient statistical knowledge (see e.g., Ainsworth (2007, Nature 448, 849)). This suggests a demand for further statistical training in the many areas where statistical analyses are required.

- I. Fundamentals of statistics and introduction to R
- II. Exploratory statistics: statistical modelling and regressions using R
- III. Analyses of variance using R
- IV. Multivariate statistics I: PCA, PCR, PLSR and cluster analyses using R
- V. Multivariate statistics II, data mining and machine learning

## I. Fundamentals of statistics and introduction to R

Typical problem areas in the applied sciences may include (I) research without hypotheses; (II) inappropriate experimental design; (III) a lack of understanding of pseudoreplication; (IV) inappropriate handling of outliers; (V) failure to check the assumptions of hypothesis tests; and (VI) insufficient description of statistical analyses in publications. Important training topics in the field of fundamental statistics are thus:

- Fundamentals of descriptive statistics
- Boxplots, histograms and Q-Q plots
- Distributions
- Scales

- Observational study vs. randomized controlled experiment
- Experimental designs
- Fundamentals of experimental design
- True replicates vs. pseudoreplicates
- Dealing with pseudoreplicates

- Fundamentals of inferential statistics
- Population and sample
- Tests of normality (e.g., the Shapiro-Wilk test) and of variance homogeneity (e.g., the F test)
- Confidence intervals
- Classical tests
- One-sample tests (t-test, Wilcoxon signed rank test)
- Two-sample tests for two independent samples (two-sample t-test, Welch test, Wilcoxon rank sum test)
- Tests for paired samples (paired t-test, matched-pairs Wilcoxon test)

- Correlations between variables of metric or ordinal data (Pearson correlation, Spearman rank correlation)
- Partial correlations
- Chi-square homogeneity test and chi-square goodness of fit test
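
The distinction between true replicates and pseudoreplicates listed above can be illustrated with a small sketch. The courses themselves use R; the example below is plain Python, and the plot names, treatments and values are hypothetical. It shows one common way of dealing with pseudoreplicates, namely averaging the subsamples so that each experimental unit contributes a single value. It is not meant as the only valid approach.

```python
from statistics import mean

# Hypothetical measurements: (plot, treatment, value). Each plot is one
# experimental unit; the three values per plot are subsamples taken from
# the same plot and are therefore pseudoreplicates, not true replicates.
measurements = [
    ("plot1", "control", 4.1), ("plot1", "control", 4.3), ("plot1", "control", 4.2),
    ("plot2", "control", 3.8), ("plot2", "control", 3.9), ("plot2", "control", 4.0),
    ("plot3", "fertilized", 5.0), ("plot3", "fertilized", 5.2), ("plot3", "fertilized", 5.1),
    ("plot4", "fertilized", 4.6), ("plot4", "fertilized", 4.8), ("plot4", "fertilized", 4.7),
]

def aggregate_by_unit(rows):
    """Average subsamples so that each experimental unit contributes one value."""
    by_unit = {}
    for unit, treatment, value in rows:
        by_unit.setdefault((unit, treatment), []).append(value)
    return {key: mean(values) for key, values in by_unit.items()}

unit_means = aggregate_by_unit(measurements)

n_values = len(measurements)  # number of measured values
n_units = len(unit_means)     # number of independent experimental units
print(f"{n_values} measured values, but only {n_units} true replicates")
```

Any subsequent test would then be run on `unit_means`, so that the degrees of freedom reflect the number of plots rather than the number of subsamples.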

R and RStudio are powerful tools in the field of applied statistics. Important training topics are the handling of scalars, vectors, matrices and data frames; the reading and writing of data; and the carrying out of the tests given above.
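
As an illustration of one of the tests listed above, the following sketch computes the Welch test statistic and its approximate degrees of freedom (Welch-Satterthwaite) by hand. It is written in plain Python with hypothetical data and only demonstrates the arithmetic; it does not compute a p-value, which would require the t distribution. In R, `t.test()` applies the Welch correction by default.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of mean a
    se2_b = variance(sample_b) / n_b  # squared standard error of mean b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical data: two independent samples of unequal size
control = [4.2, 3.9, 4.5, 4.1, 4.0]
treated = [5.1, 4.7, 5.6, 4.9, 5.3, 5.0]

t_stat, df = welch_t(control, treated)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Unlike the classical two-sample t-test, this statistic does not assume equal variances, which is why the degrees of freedom are generally not an integer.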

**Selection of references**

**Textbooks**

- Crawley, M.J. 2012. The R Book. Second Edition. Wiley, West Sussex.
- Dalgaard, P. 2008. Introductory Statistics with R. Springer, New York.
- Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression, CRC Press, Boca Raton.

**Articles**

- Ludwig, B., Linsler, D., Höper, H., Schmidt, H., Piepho, H.-P., Vohland, M. 2016. Pitfalls in the use of middle-infrared spectroscopy: representativeness and ranking criteria for the estimation of soil properties. *Geoderma* **268**, 165-175.
- Piepho, H.P., Möhring, J., Williams, E.R. 2013. Why randomize agricultural experiments? *Journal of Agronomy and Crop Science* **199**, 374-383.

## II. Exploratory statistics: statistical modelling and regressions using R

Typical problem areas here may include (I) a lack of knowledge of the importance of residual inspections; (II) a lack of understanding of the differences between a minimal adequate model and a maximal model; (III) a lack of knowledge of the differences between calibration, cross-validation and validation of a model; and (IV) a lack of understanding of important special topics such as Box-Cox transformations, polynomial and logistic regressions and model comparisons. Important training topics are thus:

- Comparison of correlation and regression
- Simple and multiple linear regressions
- Residual inspections
- Model simplifications
- Model criticism

- Statistical modelling: saturated model, maximal model, minimal adequate model and null model
- Lack of fit test
- Transformations (e.g., Box-Cox transformation)
- Model formulae in R
- Dealing with variability and predictions
- Linear models and matrices
- Non-linear regression
- Logistic regression
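
The difference between the calibration and the cross-validation of a model, mentioned above, can be sketched for a simple linear regression. The example is plain Python rather than R, the data are hypothetical, and leave-one-out is only one of several possible cross-validation schemes. It also shows the residuals that a residual inspection would start from.

```python
from math import sqrt
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return sqrt(mean([e ** 2 for e in errors]))

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Calibration: fit and evaluate on the same samples
intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rmse_cal = rmse(residuals)

# Leave-one-out cross-validation: predict each sample from a model
# that was fitted without that sample
loo_errors = []
for i in range(len(x)):
    b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo_errors.append(y[i] - (b0 + b1 * x[i]))
rmse_cv = rmse(loo_errors)

print(f"slope = {slope:.3f}, RMSE calibration = {rmse_cal:.3f}, RMSE LOO-CV = {rmse_cv:.3f}")
```

For ordinary least squares the cross-validated error is never smaller than the calibration error. A validation in the strict sense would use independent samples that played no part in model building.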

**Selection of references**

**Textbooks**

- Crawley, M.J. 2012. The R Book. Second Edition. Wiley, West Sussex.
- Mead, R., Curnow, R.N., Hasted, A.M. 2002. Statistical Methods in Agriculture and Experimental Biology. Third Edition. Chapman & Hall/CRC, Boca Raton.
- Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression, CRC Press, Boca Raton.

**Articles**

- Linsler, D., Nüsse, A., Buchen, C., Helfrich, M., Piepho, H.-P., Ludwig, B. 2018. Effects of chemical and physical grassland renovation on the temporal dynamics of organic carbon stocks and water-stable aggregate distribution in a temperate grassland soil. *Soil Use and Management* **34**, 490-499.
- Piepho, H.P. 2009. Data transformation in statistical analysis of field trials with changing treatment variance. *Agronomy Journal* **101**, 865-869.

## III. Analyses of variance using R

Typical problem areas in analyses of variance may include (I) a lack of awareness that statistical independence of the data is a condition for analyses of variance (relevant when dealing with spatially and/or temporally dependent data); (II) a lack of understanding of the importance of residual inspections and of how to deal with non-normality or variance heterogeneity; (III) research without hypotheses that focuses on mechanically applied post-hoc tests; (IV) a lack of knowledge of how to handle unbalanced designs; (V) a lack of understanding of how to handle more complicated designs (e.g., split-plot designs); and (VI) inaccuracies in factor formulations. Training topics are thus:

- Fundamentals of one-way analysis of variance (ANOVA)
- Conditions and calculation
- Structure of ANOVA tables
- Residual inspections

- Post-hoc tests
- Multiple mean comparisons (pairwise t-tests with a correction for multiple testing, Tukey's HSD test, Fisher's LSD test)
- Problems with multiple mean comparisons and research without hypotheses

- Welch's ANOVA and Kruskal-Wallis test
- Multi-way ANOVA
- Consideration of block effects
- Importance of interactions of factors
- Model simplifications

- Contrasts instead of multiple mean comparisons
- Formulating factors and unbalanced models
- Combined ANOVA and regression analysis
- Split-plots
- Introduction to mixed effects models
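
The structure of a one-way ANOVA table, listed above, can be made concrete with a short calculation. The sketch below is plain Python with hypothetical group names and values. It computes the sums of squares, degrees of freedom, mean squares and the F statistic, but no p-value, and it assumes independent observations.

```python
from statistics import mean

def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table for a dict of groups."""
    all_values = [v for g in groups.values() for v in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean (treatment effect)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of observations around their own group mean (residual)
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between, "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

# Hypothetical data: one factor with three levels, four replicates each
yields = {
    "A": [20.1, 21.3, 19.8, 20.7],
    "B": [22.4, 23.1, 22.8, 23.5],
    "C": [20.9, 21.5, 21.1, 20.6],
}

table = one_way_anova(yields)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The two sums of squares add up to the total sum of squares, which is a quick consistency check when reading an ANOVA table.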

**Selection of references**

**Textbooks**

- Bretz, F., Hothorn, T., Westfall, P. 2011. Multiple comparisons using R. CRC Press, Boca Raton.
- Crawley, M.J. 2012. The R Book. Second Edition. Wiley, West Sussex.
- Mead, R., Curnow, R.N., Hasted, A.M. 2002. Statistical Methods in Agriculture and Experimental Biology. Third Edition. Chapman & Hall/CRC, Boca Raton.
- Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression, CRC Press, Boca Raton.

**Articles**

- Kozak, M., Piepho, H.P. 2017. What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. *Journal of Agronomy and Crop Science* **203**.
- Onofri, A., Carbonell, E.A., Mortimer, M., Piepho, H.P. 2010. Current statistical issues in weed research. *Weed Research* **50**, 5-24.
- Vormstein, S., Kaiser, M., Piepho, H.P., Joergensen, R.G., Ludwig, B. 2017. Effects of the concentration, size, and distribution of beech fine roots on the C turnover in homogenized and minimally disturbed top- and subsoil material of a sandy Cambisol. *European Journal of Soil Science* **68**, 177-188.
- Vormstein, S., Kaiser, M., Piepho, H.-P., Ludwig, B. 2020. Aggregate formation and organo-mineral association affect characteristics of soil organic matter across soil horizons and parent materials in temperate broadleaf forest. *Biogeochemistry* **148**, 169-189.

## IV. Multivariate statistics I: PCA, PCR, PLSR and cluster analyses using R

Typical problem areas in multivariate statistics may include (I) a lack of knowledge of the opportunities and limitations of multivariate approaches; and (II) insufficient description of the multivariate analyses in publications. Training topics are thus:

- Matrix operations
- Calculation of eigenvalues and eigenvectors
- Centering and z-transformation
- Variance-covariance and correlation
- Unsupervised learning: principal component analysis (PCA)
- Calculations, presentations and interpretations

- Covariance, correlation and Euclidean distance
- Unsupervised learning: partitioning and hierarchical cluster analyses

- Supervised learning: principal component regression (PCR)
- Pretreatment of data (use of the Savitzky-Golay filter)
- Supervised learning: Partial least squares regression (PLSR)
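
The link between z-transformation, the correlation matrix and the eigenvalues used in a PCA, listed above, can be sketched for the smallest possible case of two variables. The example is plain Python, the variable names and values are hypothetical, and the closed-form eigenvalue formula only applies to a symmetric 2x2 matrix. Real analyses with many variables would rely on a numerical routine instead.

```python
from math import sqrt
from statistics import mean, stdev

def z_transform(values):
    """Centre to mean 0 and scale to standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def covariance(a, b):
    """Sample covariance of two equally long lists."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    centre = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    return centre + radius, centre - radius

# Hypothetical data: two soil properties measured on six samples
clay = [12.0, 18.5, 25.0, 31.0, 22.5, 15.0]
soc = [0.9, 1.3, 1.9, 2.4, 1.6, 1.2]

z_clay, z_soc = z_transform(clay), z_transform(soc)

# The covariance of z-transformed variables equals their Pearson correlation,
# so the matrix below is the correlation matrix of the original data.
r = covariance(z_clay, z_soc)
eig1, eig2 = eigenvalues_2x2_symmetric(covariance(z_clay, z_clay), r, covariance(z_soc, z_soc))

explained_pc1 = eig1 / (eig1 + eig2)
print(f"r = {r:.3f}, eigenvalues = {eig1:.3f}, {eig2:.3f}, PC1 explains {explained_pc1:.1%}")
```

For two z-transformed variables the eigenvalues sum to 2, the number of variables. The more strongly the variables are correlated, the larger the share of variance captured by the first principal component.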

**Selection of references**

**Textbooks**

- Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
- Lantz, B. 2019. Machine Learning with R. Packt Publishing, Birmingham.
- Mark, H., Workman, J. 2007. Chemometrics in Spectroscopy. Elsevier, Amsterdam.
- Wehrens, R. 2011. Chemometrics with R. Springer, New York.

**Articles**

- Ludwig, B., Vormstein, S., Niebuhr, J., Heinze, S., Marschner, B., Vohland, M. 2017. Usefulness of near infrared spectroscopy for an estimation of general soil properties and enzyme activities for two forest sites along three transects. *Geoderma* **288**, 37-46.

## V. Multivariate statistics II, data mining and machine learning

Typical problem areas in multivariate statistics, data mining and machine learning may include (I) a lack of knowledge of the benefits and limitations of these approaches; (II) insufficient consideration of the importance of sample size and variability when selecting the algorithm; and (III) overfitting. Training topics in the areas of classification and regression are thus:

- Factor analyses
- Perceptron
- Hard-margin and soft-margin support vector machines
- Artificial neural networks
- Random forest
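
Of the topics above, the perceptron is simple enough to sketch in a few lines. The example below is plain Python and uses hypothetical, linearly separable data with class labels -1 and +1. It implements the classical perceptron learning rule and is meant only to illustrate the idea; the function names and default parameters are assumptions made for this sketch.

```python
def train_perceptron(samples, labels, epochs=100, learning_rate=1.0):
    """Classical perceptron rule: adjust weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # Move the decision boundary towards the misclassified sample
                weights = [w + learning_rate * target * xi for w, xi in zip(weights, x)]
                bias += learning_rate * target
                errors += 1
        if errors == 0:  # all training samples classified correctly
            break
    return weights, bias

def predict(weights, bias, x):
    """Return the predicted class (-1 or +1) for one sample."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Hypothetical, linearly separable training data with two features
samples = [(-2, -1), (-1, -2), (-2, -2), (-1, -1), (1, 1), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(weights, bias, predictions)
```

The perceptron stops at the first boundary that separates the training data, which need not generalize well. Support vector machines address this by maximizing the margin, and soft-margin variants additionally tolerate data that are not separable.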

**Selection of references**

**Textbooks**

- Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
- Lantz, B. 2019. Machine Learning with R. Packt Publishing, Birmingham.
- Wehrens, R. 2011. Chemometrics with R. Springer, New York.

**Articles**

- Cawley, G.C., Talbot, N.L.C. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. *Journal of Machine Learning Research* **11**, 2079-2107.
- Ludwig, B., Murugan, R., Parama, V.R.R., Vohland, M. 2018. Use of different chemometric approaches for an estimation of contents of soil properties in an Indian arable field with near infrared spectroscopy. *Journal of Plant Nutrition and Soil Science* **181**, 704-713.
- Ludwig, B., Murugan, R., Parama, V.R.R., Vohland, M. 2019. Accuracy of estimating soil properties with mid-infrared spectroscopy: implications of different chemometric approaches and software packages related to calibration sample size. *Soil Science Society of America Journal* **83**, 1542-1552.