Statistics Courses

Current training courses in statistics, data mining and machine learning for soil scientists, agricultural scientists, environmental scientists and natural scientists in 2024

Two-day statistics intensive courses I and II for soil scientists from March 7th until March 8th 2024 (in English)
Two-day statistics intensive course III for soil scientists from March 13th until March 14th 2024 (in English)
One-day statistics intensive course IV for soil scientists on March 15th 2024 (in English)

Background information on training courses (seminars, further education and training) in applied statistics, data mining and machine learning

The media and scientific journals of various disciplines repeatedly report on flawed research caused by insufficient statistical knowledge (see e.g., Ainsworth (2007, Nature 448, 849)). This suggests a demand for further statistical training in many areas where statistical analyses are required.

The Department of Environmental Chemistry regularly offers training in applied statistics, data mining and machine learning for soil scientists, agricultural scientists, environmental scientists and natural scientists. The following courses raise awareness of statistical pitfalls and offer training in different subareas:

I Fundamentals of statistics and introduction to R

II Exploratory statistics: statistical modelling and regressions using R

III Analyses of variance using R

IV Multivariate statistics I: PCA, PCR, PLSR and cluster analyses using R

V Multivariate statistics II, data mining and machine learning

I. Fundamentals of statistics and introduction to R

Typical problem areas in the applied sciences include (I) research without hypotheses; (II) inappropriate experimental design; (III) a lack of understanding of pseudoreplication; (IV) inappropriate handling of outliers; (V) failure to check the assumptions of hypothesis tests; and (VI) an insufficient description of statistical analyses in publications. Important training topics in the field of fundamental statistics are thus:

Fundamentals of descriptive statistics
- Boxplots, histograms and Q-Q plots
- Distributions
- Scales
Observational study vs. randomized controlled experiment
Experimental designs
- Fundamentals of experimental design
- True replicates vs. pseudoreplicates
- Dealing with pseudoreplicates
Fundamentals of inferential statistics
- Population and sample
- Tests of normality (e.g., the Shapiro-Wilk test) and of variance homogeneity (e.g., the F test)
- Confidence intervals
- Classical tests
  - One-sample tests (t-test, Wilcoxon signed rank test)
  - Two-sample tests for two independent samples (two-sample t-test, Welch test, Wilcoxon rank sum test)
  - Tests for paired samples (paired t-test, matched-pairs Wilcoxon test)
Correlations between variables of metric or ordinal data (Pearson correlation, Spearman rank correlation)
Partial correlations
Chi-square homogeneity test and chi-square goodness of fit test

R and RStudio are powerful tools in the field of applied statistics. Important training topics are handling scalars, vectors, matrices and data frames; reading and writing data; and carrying out the tests listed above.
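Computing one of the classical tests by hand shows what the software does internally. The following is a minimal sketch, written in Python purely for illustration (the courses themselves use R), with invented measurement values and a hypothetical helper name `welch_t`. It computes the Welch test statistic and the Welch-Satterthwaite degrees of freedom for two independent samples with possibly unequal variances:

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 in the denominator)
    se2 = vx / nx + vy / ny            # squared standard error of the mean difference
    t = (mean(x) - mean(y)) / sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


# Invented example data for two independent groups
group_a = [5.1, 4.9, 5.6, 5.8, 5.3]
group_b = [4.2, 4.8, 4.1, 4.6, 4.4, 4.5]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

In R, the corresponding call is `t.test(x, y)`, which applies the Welch correction by default.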

Selection of references
Textbooks

Dormann, C. 2020. Environmental Data Analysis. Springer, New York.
Jones, E., Harden, S., Crawley, M.J. 2022. The R Book. Third Edition. Wiley, West Sussex.
Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression. CRC Press, Boca Raton.

Articles

Ludwig, B., Linsler, D., Höper, H., Schmidt, H., Piepho, H.-P., Vohland, M. 2016. Pitfalls in the use of middle-infrared spectroscopy: representativeness and ranking criteria for the estimation of soil properties. Geoderma 268, 165-175.
Piepho, H.P., Möhring, J., Williams, E.R. 2013. Why randomize agricultural experiments? Journal of Agronomy and Crop Science 199, 374-383.

II. Exploratory statistics: statistical modelling and regressions using R

Typical problem areas in this field include (I) a lack of knowledge of the importance of residual inspections; (II) a lack of understanding of the differences between a minimal adequate model and a maximal model; (III) a lack of knowledge of the differences between calibration, cross-validation and validation of a model; and (IV) a lack of understanding of important special topics such as Box-Cox transformations, polynomial and logistic regressions and model comparisons. Important training topics are thus:

Comparison of correlation and regression
Simple and multiple linear regressions
- Residual inspections
- Model simplifications
- Model criticism
Statistical modelling: saturated model, maximal model, minimal adequate model and null model
Lack of fit test
Transformations (e.g., Box-Cox transformation)
Model formulae in R
Dealing with variability and predictions
Linear models and matrices
Non-linear regression
Logistic regression
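To illustrate why residual inspection belongs to every regression, here is a minimal sketch of a simple linear regression fitted by ordinary least squares. It is written in Python for illustration only (the course uses R), and the data and the helper name `fit_line` are invented for this example. It returns the fitted line, the residuals that would be inspected graphically, and the coefficient of determination:

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares estimates of intercept and slope for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Invented example data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Coefficient of determination from residual and total sums of squares
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, R^2 = {r_squared:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

In R, the same model is fitted with `lm(y ~ x)`, and calling `plot()` on the fitted model produces the usual residual diagnostic plots.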

Selection of references
Textbooks

Jones, E., Harden, S., Crawley, M.J. 2022. The R Book. Third Edition. Wiley, West Sussex.
Mead, R., Curnow, R.N., Hasted, A.M. 2002. Statistical Methods in Agriculture and Experimental Biology. Third Edition. Chapman & Hall/CRC, Boca Raton.
Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression. CRC Press, Boca Raton.

Articles

Linsler, D., Nüsse, A., Buchen, C., Helfrich, M., Piepho, H.-P., Ludwig, B. 2018. Effects of chemical and physical grassland renovation on the temporal dynamics of organic carbon stocks and water-stable aggregate distribution in a temperate grassland soil. Soil Use and Management 34, 490-499.
Piepho, H.P. 2009. Data transformation in statistical analysis of field trials with changing treatment variance. Agronomy Journal 101, 865-869.

III. Analyses of variance using R

Typical problem areas in analyses of variance include (I) a lack of knowledge of the great importance of statistical independence of data as a condition for analyses of variance (dealing with spatially and/or temporally dependent data); (II) a lack of understanding of the importance of residual inspections and of how to deal with non-normality or variance heterogeneity; (III) research without hypotheses that focuses on mechanically applied post-hoc tests; (IV) a lack of knowledge of how to handle unbalanced designs; (V) a lack of understanding of how to handle more complicated designs (split plot); and (VI) inaccuracies in factor formulations. Training topics are thus:

Fundamentals of one-way analysis of variance (ANOVA)
- Conditions and calculation
- Structure of ANOVA tables
- Residual inspections
Post-hoc tests
- Multiple mean comparisons (pairwise t-tests with a correction for multiple testing, Tukey's HSD test, Fisher's LSD test)
- Problems with multiple mean comparisons and research without hypotheses
Welch's ANOVA and Kruskal-Wallis test
Multi-way ANOVA
- Consideration of block effects
- Importance of interactions of factors
- Model simplifications
Contrasts instead of multiple mean comparisons
Formulating factors and unbalanced models
Combined ANOVA and regression analysis
Split-plots
Introduction to mixed effects models
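The structure of a one-way ANOVA table follows directly from partitioning the total sum of squares into a between-group and a within-group part. The following minimal sketch is written in Python for illustration only (the course uses R), and the data and the helper name `one_way_anova` are invented. It computes the entries of such a table up to the F value:

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and the F value of a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group variation: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f_value": f_value,
    }


# Invented example data: three treatments with four true replicates each
treatments = [
    [4.1, 3.8, 4.4, 4.0],
    [5.0, 5.3, 4.8, 5.1],
    [4.5, 4.7, 4.3, 4.6],
]
table = one_way_anova(treatments)
print(table)
```

The F value is only meaningful if the observations are independent, which is why pseudoreplicates must not be passed to such a function as if they were true replicates. In R, the corresponding table is produced by `summary(aov(response ~ treatment))`.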

Selection of references
Textbooks

Bretz, F., Hothorn, T., Westfall, P. 2011. Multiple comparisons using R. CRC Press, Boca Raton.
Jones, E., Harden, S., Crawley, M.J. 2022. The R Book. Third Edition. Wiley, West Sussex.
Welham, S.J., Gezan, S.A., Clark, S.J., Mead, A. 2014. Statistical Methods in Biology. Design and Analysis of Experiments and Regression. CRC Press, Boca Raton.

Articles

Kozak, M., Piepho, H.P. 2017. What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. Journal of Agronomy and Crop Science 203.
Ludwig, B., Song, X., Gunina, A., Greenberg, I., Dippold, M.A., Piepho, H.P. 2021. Importance of sources of variability, scales and experimental design: A case study about the effects of biochar and slurry application on soil properties in agricultural silty loam soils. European Journal of Soil Science 72, 1954-1968. https://doi.org/10.1111/ejss.13120
Vormstein, S., Kaiser, M., Piepho, H.-P., Ludwig, B. 2020. Aggregate formation and organo-mineral association affect characteristics of soil organic matter across soil horizons and parent materials in temperate broadleaf forest. Biogeochemistry 148, 169-189.

IV. Multivariate statistics I: PCA, PCR, PLSR and cluster analyses using R

Typical problem areas in multivariate statistics include (I) a lack of knowledge of the opportunities and limitations of multivariate approaches; and (II) an insufficient description of the multivariate analyses in publications. Training topics are thus:

Matrix operations
Calculation of eigenvalues and eigenvectors
Centering and z-transformation
Variance-covariance and correlation
Unsupervised learning: principal component analysis (PCA)
- Calculations, presentations and interpretations
Covariance, correlation and Euclidean distance
Unsupervised learning: partitioning and hierarchical cluster analyses

Supervised learning: principal component regression (PCR)
Pretreatment of data (use of the Savitzky-Golay filter)
Supervised learning: Partial least squares regression (PLSR)
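The link between z-transformation, the correlation matrix and PCA can be seen in the smallest possible case of two variables. After z-transformation, the correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The following minimal sketch is written in Python for illustration only (the course uses R), with invented data and hypothetical helper names:

```python
from math import sqrt
from statistics import mean, stdev, variance


def z_transform(values):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    """Pearson correlation as the covariance of the z-transformed variables."""
    zx, zy = z_transform(x), z_transform(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Invented example data for two correlated variables
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.5, 3.1, 4.4, 6.2, 7.4]

r = pearson_r(x, y)

# Eigenvalues of the 2 x 2 correlation matrix, largest first
eigenvalues = sorted([1 + r, 1 - r], reverse=True)
explained = [ev / 2 for ev in eigenvalues]  # total variance of two z-scores is 2

# Scores on the first principal component (the sign of a component is arbitrary)
zx, zy = z_transform(x), z_transform(y)
sign = 1 if r >= 0 else -1
pc1_scores = [(a + sign * b) / sqrt(2) for a, b in zip(zx, zy)]

print(f"r = {r:.4f}, explained variance = {[round(e, 4) for e in explained]}")
print(f"variance of PC1 scores = {variance(pc1_scores):.4f}")
```

The variance of the scores on the first component equals the largest eigenvalue, which is what a scree plot displays. In R, `prcomp(data, scale. = TRUE)` carries out the same analysis for any number of variables.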

Selection of references
Textbooks

Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
Lantz, B. 2019. Machine Learning with R. Packt Publishing, Birmingham.
Wehrens, R. 2020. Chemometrics with R. Second Edition. Springer, New York.

Articles

Ludwig, B., Vormstein, S., Niebuhr, J., Heinze, S., Marschner, B., Vohland, M. 2017. Usefulness of near infrared spectroscopy for an estimation of general soil properties and enzyme activities for two forest sites along three transects. Geoderma 288, 37-46.
Ludwig, B., Wölfel, P., Greenberg, I., Piepho, H.-P., Spörlein, P. 2022. Application of mixed-effects modelling and rule-based models to explain copper variation in soil profiles of southern Germany. European Journal of Soil Science 73, e13258. https://doi.org/10.1111/ejss.13258

V. Multivariate statistics II, data mining and machine learning

Typical problem areas in multivariate statistics, data mining and machine learning include (I) a lack of knowledge of the benefits and limitations of these approaches; (II) insufficient consideration of the importance of sample size and variability when selecting the algorithm; and (III) overfitting. Training topics in the areas of classification and regression are thus:

Factor analyses
Perceptron
Hard-margin and soft-margin support vector machines
Artificial neural networks
Random forest
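The perceptron is the simplest of the classifiers listed above and a useful stepping stone towards support vector machines and neural networks. The following minimal sketch is written in Python for illustration only, with invented, linearly separable data and hypothetical function names. It implements the classical perceptron learning rule, in which the weights are updated only when a sample is misclassified:

```python
def train_perceptron(samples, labels, epochs=100, learning_rate=1.0):
    """Train a perceptron on labels coded as +1 / -1 and return (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation >= 0 else -1
            if predicted != target:
                # Move the decision boundary towards the misclassified sample
                weights = [w + learning_rate * target * xi for w, xi in zip(weights, x)]
                bias += learning_rate * target
                errors += 1
        if errors == 0:  # all training samples classified correctly
            break
    return weights, bias


def predict(weights, bias, x):
    """Classify a single sample as +1 or -1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else -1


# Invented, linearly separable example data with two features
samples = [(2.0, 1.0), (3.0, 2.5), (2.5, 3.0), (-1.0, -2.0), (-2.0, -0.5), (-3.0, -2.0)]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_perceptron(samples, labels)
print("weights:", weights, "bias:", bias)
print("predictions:", [predict(weights, bias, x) for x in samples])
```

Perfect accuracy on the training data, as here, says nothing about performance on new samples; this is the overfitting problem that cross-validation and independent validation are meant to expose.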

Selection of references
Textbooks

Everitt, B., Hothorn, T. P. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
Lantz, B. 2019. Machine Learning with R. Packt Publishing, Birmingham.
Wehrens, R. 2020. Chemometrics with R. Second Edition. Springer, New York.

Articles

Cawley, G.C., Talbot, N.L.C. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, 2079-2107.
Ludwig, B., Murugan, R., Parama, V.R.R., Vohland, M. 2019. Accuracy of estimating soil properties with mid-infrared spectroscopy: implications of different chemometric approaches and software packages related to calibration sample size. Soil Science Society of America Journal 83, 1542-1552.
Ludwig, B., Greenberg, I., Vohland, M., Michel, K. 2023. Optimised use of data fusion and memory-based learning with an Austrian soil library for predictions with infrared data. European Journal of Soil Science 74, e13394. https://doi.org/10.1111/ejss.13394