Text and data mining

Text and data mining (TDM) is a collective term for various procedures for searching and evaluating large quantities of text or data (corpora) from different perspectives. With the help of computer-assisted analysis methods, corpora can be examined for patterns, correlations, and other research-relevant relationships. Examples of TDM projects would be


For licensing, technical, and other questions about text and data mining, please contact us by email:

Researchers are allowed to use text or data mining methods within the framework of legal and licensing requirements. Since the amendment of the Copyright Act (UrhG) by the 'Act on the Adaptation of Copyright Law to the Current Requirements of the Knowledge Society (UrhWissG)' in 2018, this right has been enshrined in law with § 60d UrhG. This also applies regardless of any conflicting clauses in individual license agreements if the agreements were concluded after February 28, 2018. Some publishers also have general regulations on the use of text and data mining in their publications (such as Cambridge University Press, Oxford University Press, Elsevier, SAGE, Springer Nature, Wiley). These generally do NOT go beyond the statutory right, but in some cases provide information on their own interfaces and their use (registration, specifications for loading and download rates, ...).

The right to TDM also includes the necessary steps of storage and processing that accompany corpus creation and enable analysis, such as digitizing, normalizing, structuring, categorizing, sorting, annotating, combining, etc. In turn, the underlying corpus may be transferred to privileged memory institutions (e.g., the library) for permanent preservation after the research is completed to ensure referencing and quality control.

Although TDM is generally permitted, the legal rules also set certain limits:

  • The research must serve exclusively non-commercial purposes.
  • Legal access to the data must be available, i.e. the data must be available on the basis of a license agreement concluded with the rights holder or as open access publications.
  • No existing copy protection may be circumvented. In some circumstances, there is a right to obtain the means to remove the protection from the party that applied it.
  • Access to the corpus is permitted only in the context of quality review; the material may not be made available for any follow-up research.

Many licensors also prohibit automated mass downloading of PDF files from their portals via crawlers, scripts, bots, etc. Such mass downloading may lead to the publisher blocking access for the whole university. Therefore, please find out in advance about alternative interfaces and/or contact us: tdm[at]bibliothek.uni-kassel[dot]de
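Where a publisher does offer a sanctioned interface with a stated request rate, a script can pace itself so that it stays below that rate. The following is only a minimal sketch: the class name `Throttle` and the interval value are illustrative assumptions, not a publisher requirement, and the actual limits must be taken from the respective publisher's TDM terms.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests.

    Hypothetical helper for illustration; the permitted rate is an
    assumption and must come from the publisher's own TDM terms.
    """

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the pacing logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, `wait()` would be called once before each request to the publisher's interface, for example inside the loop that iterates over the documents of a corpus.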

The DOI registry Crossref as well as some publishers offer special interfaces where you can obtain full texts for your TDM projects. If applicable, please familiarize yourself with these interfaces, as they might make your work easier:
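As a hedged illustration of the Crossref route: the Crossref REST API returns metadata for a DOI under `https://api.crossref.org/works/<DOI>`, and a record may contain a `link` list whose entries carry an `intended-application` field. The sketch below only builds the request URL and filters an already downloaded record for entries marked `text-mining`; the function names are hypothetical, no network request is made, and whether a given publisher actually deposits such links (and whether you are licensed to follow them) has to be checked case by case.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_work_url(doi):
    """Build the Crossref REST API URL for a single DOI."""
    # Keep the slash between DOI prefix and suffix, escape everything else.
    return CROSSREF_WORKS + quote(doi, safe="/")


def text_mining_links(work_record):
    """Return (url, content_type) pairs that the record flags for text mining.

    work_record is the parsed JSON response for one work; records
    without a "link" list simply yield an empty result.
    """
    message = work_record.get("message", {})
    links = []
    for link in message.get("link", []):
        if link.get("intended-application") == "text-mining":
            links.append((link.get("URL"), link.get("content-type")))
    return links
```

The returned URLs point to the publisher's servers, so the publisher's own conditions (registration, API keys, download rates) still apply when retrieving the full texts.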

As stated above, a claim to TDM only exists for material to which you have legal access, as either licensed material or open access content. In exceptional cases, publishers also give access to non-licensed material for the purpose of TDM (cf. e.g. Elsevier "on a case-by-case basis").

Open access to content in the sense of Open Science facilitates the implementation of TDM. Clear rights management based on standardized, machine-readable and open-content Creative Commons licenses contributes to the legally secure application of TDM methods to data and text corpora.
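To illustrate why machine-readable licenses help, the following sketch reads the license code out of a Creative Commons license URL, as it might appear in a publication's metadata. The function name `parse_cc_license` is hypothetical, the URL patterns handled are the common `creativecommons.org/licenses/...` and `creativecommons.org/publicdomain/zero/...` forms, and the result is only a technical aid for sorting a corpus, not a legal assessment.

```python
from urllib.parse import urlparse


def parse_cc_license(url):
    """Return the CC license code (e.g. 'by-nc') from a license URL, or None.

    Illustrative helper: anything that is not a recognizable
    creativecommons.org license URL yields None.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "creativecommons.org":
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path shapes: /licenses/<code>/<version>/ or /publicdomain/zero/<version>/
    if len(parts) >= 2 and parts[0] == "licenses":
        return parts[1].lower()
    if len(parts) >= 2 and parts[0] == "publicdomain" and parts[1] == "zero":
        return "cc0"
    return None
```

Items for which the function returns `None` are not necessarily closed; they simply need a manual look at the license terms.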