# Statistics Courses

## Current training courses in statistics, data mining and machine learning for soil scientists, agricultural scientists, environmental scientists and natural scientists in 2020

## Background information on training courses (seminars, further education and training) in applied statistics, data mining and machine learning

The media and scientific journals across disciplines repeatedly report on flawed research that results from insufficient statistical knowledge (see, e.g., Ainsworth (2007, Nature 448, 849)). This suggests a demand for further statistical training in the many fields where statistical analyses are required.

###### I Fundamentals of statistics and introduction to R

###### II Exploratory statistics: statistical modelling and regressions using R

###### III Analyses of variance using R

###### IV Multivariate statistics I: PCA, PCR and PLSR using R

###### V Multivariate statistics II, data mining and machine learning

## I. Fundamentals of statistics and introduction to R

Typical problem areas in the applied sciences include (I) research without hypotheses; (II) inappropriate experimental design; (III) a lack of understanding of pseudoreplication; (IV) inappropriate handling of outliers; (V) failure to check the assumptions of hypothesis tests; and (VI) insufficient description of statistical analyses in publications. Important training topics in fundamental statistics are thus:

- Fundamentals of descriptive statistics
- Boxplots, histograms and Q-Q plots
- Distributions
- Scales

- Observational study vs. randomized controlled experiment
- Experimental designs
- Fundamentals of experimental design
- True replicates vs. pseudoreplicates
- Dealing with pseudoreplicates

- Fundamentals of inferential statistics
- Population and sample
- Tests of normality (e.g., the Shapiro-Wilk test) and of variance homogeneity (e.g., the F test)
- Confidence intervals
- Classical tests
- One-sample tests (t-test, Wilcoxon signed rank test)
- Two-sample tests for two independent samples (two-sample t-test, Welch test, Wilcoxon rank sum test)
- Tests for paired samples (paired t-test, matched-pairs Wilcoxon test)

- Correlations between variables of metric or ordinal data (Pearson correlation, Spearman rank correlation)
- Partial correlations
- Chi-square homogeneity test and chi-square goodness of fit test
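
As an illustration of one of the classical tests listed above, the following is a minimal sketch of the Welch test statistic for two independent samples. The courses themselves use R; Python and its standard library are used here only as an assumption for illustration, and the function name `welch_t` and the measurements are hypothetical. The sketch computes the statistic and the Welch-Satterthwaite degrees of freedom only; a p-value would additionally require the t distribution.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; it does not assume equal variances.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical measurements for two independent groups (illustration only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.0, 4.8, 4.4]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike the classical two-sample t-test, this variant does not pool the variances, which is why it is often preferred when a test of variance homogeneity gives reason for doubt.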

R and RStudio are powerful tools in the field of applied statistics. Important training topics are the handling of scalars, vectors, matrices and data frames; the reading and writing of data; and the carrying out of the tests given above.

**Selection of references**

- Crawley, M.J. 2012. The R Book. Second Edition. Wiley, West Sussex.
- Dalgaard, P. 2008. Introductory Statistics with R. Springer, New York.
- Ludwig, B., Linsler, D., Höper, H., Schmidt, H., Piepho, H.-P., Vohland, M. 2016. Pitfalls in the use of middle-infrared spectroscopy: representativeness and ranking criteria for the estimation of soil properties. *Geoderma* **268**, 165-175.
- Piepho, H.P., Möhring, J., Williams, E.R. 2013. Why randomize agricultural experiments? *Journal of Agronomy and Crop Science* **199**, 374-383.

## II. Exploratory statistics: statistical modelling and regressions using R

Typical problem areas here include (I) a lack of knowledge of the importance of residual inspections; (II) a lack of understanding of the difference between a minimal adequate model and a maximal model; (III) a lack of knowledge of the differences between calibration, cross-validation and validation of a model; and (IV) a lack of understanding of important special topics such as Box-Cox transformations, polynomial and logistic regressions and model comparisons. Important training topics are thus:

- Comparison of correlation and regression
- Simple and multiple linear regressions
- Residual inspections
- Model simplifications
- Model criticism

- Statistical modelling: saturated model, maximal model, minimal adequate model and null model
- Lack of fit test
- Transformations (e.g., Box-Cox transformation)
- Model formulae in R
- Dealing with variability and predictions
- Linear models and matrices
- Non-linear regression
- Logistic regression
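
To make the link between simple linear regression and residual inspection concrete, here is a minimal sketch of an ordinary least squares fit for a single predictor. The courses use R for this; Python is an assumption made here for illustration, and the function names (`fit_line`, `residuals`, `r_squared`) and the data are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values; these are what a residual plot shows."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def r_squared(y, resid):
    """Share of the total variation in y explained by the fitted line."""
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum(e ** 2 for e in resid)
    return 1 - ss_resid / ss_total


# Hypothetical data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(x, y)
resid = residuals(x, y, intercept, slope)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"R^2 = {r_squared(y, resid):.4f}")
```

The residuals of a least squares fit with an intercept always sum to zero, so a high coefficient of determination alone does not show that the model is adequate; patterns in the residuals still have to be inspected.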

**Selection of references**

- Crawley, M.J. 2012. The R Book. Second Edition. Wiley, West Sussex.
- Linsler, D., Nüsse, A., Buchen, C., Helfrich, M., Piepho, H.-P., Ludwig, B. 2018. Effects of chemical and physical grassland renovation on the temporal dynamics of organic carbon stocks and water-stable aggregate distribution in a temperate grassland soil. *Soil Use and Management* **34**, 490-499.
- Mead, R., Curnow, R.N., Hasted, A.M. 2002. Statistical Methods in Agriculture and Experimental Biology. Third Edition. Chapman & Hall/CRC, Boca Raton.
- Piepho, H.P. 2009. Data transformation in statistical analysis of field trials with changing treatment variance. *Agronomy Journal* **101**, 865-869.

## III. Analyses of variance using R

Typical problem areas in analyses of variance include (I) a lack of awareness that statistical independence of the data is a key condition for the analysis of variance (relevant when dealing with spatially and/or temporally dependent data); (II) a lack of understanding of the importance of residual inspections and of how to deal with non-normality or variance heterogeneity; (III) research without hypotheses that relies on mechanically applied post-hoc tests; (IV) a lack of knowledge of how to handle unbalanced designs; (V) a lack of understanding of how to handle more complicated designs (split plot); and (VI) inaccuracies in factor formulations. Training topics are thus:

- Fundamentals of one-way analysis of variance (ANOVA)
- Conditions and calculation
- Structure of ANOVA tables
- Residual inspections

- Post-hoc tests
- Multiple mean comparisons (pairwise t-tests with a correction for multiple testing, Tukey's HSD test, Fisher's LSD test)
- Problems with multiple mean comparisons and research without hypotheses

- Welch's ANOVA and Kruskal-Wallis test
- Multi-way ANOVA
- Consideration of block effects
- Importance of interactions of factors
- Model simplifications

- Contrasts instead of multiple mean comparisons
- Formulating factors and unbalanced model
- Combined ANOVA and regression analysis
- Split-plots
- Introduction to mixed effects models
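
The calculation behind a one-way ANOVA table can be sketched as follows. This is a minimal illustration under stated assumptions: the courses use R, Python is chosen here only for the example, and the function name `one_way_anova` and the three groups are hypothetical. The sketch returns the F statistic and its degrees of freedom; it does not compute a p-value and does not check the conditions (independence, normality of residuals, variance homogeneity).

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F statistic, df between groups, df within groups)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements for three treatments (illustration only).
treatments = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova(treatments)
print(f"F = {f_stat:.2f} on {df_b} and {df_w} degrees of freedom")
```

The two sums of squares and their degrees of freedom are the entries of the ANOVA table; the F statistic is the ratio of the resulting mean squares.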

**Selection of references**

- Bretz, F., Hothorn, T., Westfall, P. 2011. Multiple comparisons using R. CRC Press, Boca Raton.
- Crawley, M.J. 2012. The R Book. Second Edition. Wiley, West Sussex.
- Kozak, M., Piepho, H.P. 2017. What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. *Journal of Agronomy and Crop Science* **203**.
- Mead, R., Curnow, R.N., Hasted, A.M. 2002. Statistical Methods in Agriculture and Experimental Biology. Third Edition. Chapman & Hall/CRC, Boca Raton.
- Onofri, A., Carbonell, E.A., Mortimer, M., Piepho, H.P. 2010. Current statistical issues in weed research. *Weed Research* **50**, 5-24.
- Vormstein, S., Kaiser, M., Piepho, H.P., Joergensen, R.G., Ludwig, B. 2017. Effects of the concentration, size, and distribution of beech fine roots on the C turnover in homogenized and minimally disturbed top- and subsoil material of a sandy Cambisol. *European Journal of Soil Science* **68**, 177-188.

## IV. Multivariate statistics I: PCA, PCR and PLSR using R

Typical problem areas in multivariate statistics include (I) a lack of knowledge of the opportunities and limitations of multivariate approaches; and (II) insufficient description of the multivariate analyses in publications. Training topics are thus:

- Matrix operations
- Calculation of eigenvalues and eigenvectors
- Centering and z-transformation
- Variance-covariance and correlation
- Principal component analysis (PCA)
- Calculations, presentations and interpretations

- Principal component regression (PCR)
- Pretreatment of data (use of the Savitzky-Golay filter)
- Partial least squares regression (PLSR)
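
The connection between z-transformation, the correlation matrix and PCA can be illustrated for the simplest case of two variables. The following is a minimal sketch, not a general PCA implementation: the courses use R, Python is an assumption made here for illustration, and the function names and data are hypothetical. It relies on the fact that the 2 x 2 correlation matrix [[1, r], [r, 1]] has the eigenvalues 1 + r and 1 - r.

```python
from statistics import mean, stdev


def z_transform(values):
    """Center to mean 0 and scale to (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pca_two_variables(x, y):
    """PCA on the correlation matrix of two variables.

    Returns (r, eigenvalues, explained variance proportions).
    """
    zx, zy = z_transform(x), z_transform(y)
    n = len(x)
    # The covariance of z-transformed variables is the Pearson correlation.
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
    # Eigenvalues of [[1, r], [r, 1]], largest first.
    eigenvalues = sorted([1 + r, 1 - r], reverse=True)
    # The eigenvalues sum to the number of variables (here 2).
    explained = [ev / 2 for ev in eigenvalues]
    return r, eigenvalues, explained


# Hypothetical measurements of two correlated variables (illustration only).
x = [2.0, 3.1, 4.2, 5.0, 6.3]
y = [1.9, 3.5, 3.9, 5.4, 6.0]
r, eigenvalues, explained = pca_two_variables(x, y)
print(f"r = {r:.3f}")
print("explained variance:", [round(p, 3) for p in explained])
```

The stronger the correlation between the two variables, the larger the share of the total variance captured by the first principal component. With more than two variables the eigenvalues have to be computed numerically.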

**Selection of references**

- Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
- Ludwig, B., Vormstein, S., Niebuhr, J., Heinze, S., Marschner, B., Vohland, M. 2017. Usefulness of near infrared spectroscopy for an estimation of general soil properties and enzyme activities for two forest sites along three transects. *Geoderma* **288**, 37-46.
- Mark, H., Workman, J. 2007. Chemometrics in Spectroscopy. Elsevier, Amsterdam.
- Wehrens, R. 2011. Chemometrics with R. Springer, New York.

## V. Multivariate statistics II, data mining and machine learning

Typical problem areas in multivariate statistics, data mining and machine learning include (I) a lack of knowledge of the benefits and limitations of these approaches; (II) insufficient consideration of sample size and variability when selecting the algorithm; and (III) overfitting. Training topics in the areas of classification and regression are thus:

- Covariance, correlation and Euclidean distance
- Cluster analyses
- Factor analyses
- Artificial neural networks
- Random forest
- Support vector machine
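
Overfitting is commonly assessed by evaluating a model on data that were not used to fit it. One widely used scheme is k-fold cross-validation, sketched below for the index bookkeeping only. This is a minimal illustration under stated assumptions: the courses use R, Python is chosen here only for the example, and the function name `k_fold_indices` and its parameters are hypothetical. Fitting and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k folds; return a list of (train, test) pairs."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so that the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Every fold except the current one is used for fitting.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# Hypothetical data set of 10 samples, split into 5 folds.
for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each sample appears in exactly one test fold, so every prediction used for the performance estimate comes from a model that did not see that sample during fitting. If the same cross-validation is also used to tune the model, the resulting estimate can still be optimistic.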

**Selection of references**

- Cawley, G.C., Talbot, N.L.C. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. *Journal of Machine Learning Research* **11**, 2079-2107.
- Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
- Ludwig, B., Murugan, R., Parama, V.R.R., Vohland, M. 2018. Use of different chemometric approaches for an estimation of contents of soil properties in an Indian arable field with near infrared spectroscopy. *Journal of Plant Nutrition and Soil Science* **181**, 704-713.
- Ludwig, B., Murugan, R., Parama, V.R.R., Vohland, M. 2019. Accuracy of estimating soil properties with mid-infrared spectroscopy: implications of different chemometric approaches and software packages related to calibration sample size. *Soil Science Society of America Journal* **83**, 1542-1552.
- Wehrens, R. 2011. Chemometrics with R. Springer, New York.