FAQ – frequently asked questions
Introduction and background
Research data are all data that are generated, processed or used in the course of a scientific process or are its result. Research data can exist in different formats depending on the scientific discipline.
Research data management is the process in which the generation, management and securing of this data is described or planned. It covers all areas of data management, in particular the planning of data collection, the generation and preparation of data, data integrity, its documentation and sustainable storage, and making the data accessible. This process is developed and documented using a data management plan, which is or should be part of any research project.
The data management plan is a "living document" that initially represents the central planning tool for data management in the research project and develops into the project documentation tool during the course of the project.
Data security: Professional handling of research data
- protects against data loss,
- enables later comprehension of the research results, and
- allows future re-use of the data.
If the principles of research data management are observed during the planning and implementation of the research project, the risk of data loss can be minimized.
Physical data loss is prevented by the required number of copies, storage media and backup intervals. Long-term availability of data is ensured by using long-term readable file formats and backing up on suitable storage media.
Loss of data content is prevented by professional documentation of data collection and data preparation, and by describing the data with metadata. This ensures that, even years later, people who were not involved in the research project can interpret the collected data and reuse it if necessary. From the outset, it is important to also record metadata that may seem irrelevant to the immediate research interest but are indispensable for later use of the data, especially by people who were not involved in the original collection.
The need for professional research data management may result from subject-specific requirements, requirements of your own research institution, research funders, or journals. Find out about the requirements of your subject, your university or institute, your third-party funder, or the journal you wish to publish with, e.g.:
Structure, formats and documentation
In the course of a project, not only are many data sets created, but also multiple versions of them at various stages of modification. For efficient work, coordinated collaboration, long-term traceability and, where relevant, internal or external reuse, it is advisable to define specific conventions for naming and versioning data records. It may also make sense to define folder structures according to the degree of processing. These conventions should themselves be documented.
Naming conventions may look very different depending on the specifics of the research areas and data. They should reflect what type of data file (original data / raw data, cleaned files, analysis files) or what file form (working file, results file, etc.) is involved. This differentiation can also be done via versioning conventions. Uniformity, unambiguity and meaningfulness are important.
Examples of meaningful file names:
- [experiment]_[experimental design]_[subject]_[YYYYMMDD].sav
- [interviewee]_[interviewer]_[YYYYMMDD].mp3
To ensure compatibility between different operating systems, special characters (except underscores and hyphens) and umlauts should be avoided. File names should not exceed 21 characters.
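The naming rules above can be checked mechanically. The following is a minimal Python sketch, not a prescribed tool: the function names `build_name` and `is_portable` and the sample components are assumptions made for illustration, and the pattern only accepts unaccented letters, digits, underscores and hyphens plus one file extension.

```python
import re
from datetime import date

MAX_LENGTH = 21  # length limit stated in the text above
# Unaccented letters, digits, underscores and hyphens, then one extension.
ALLOWED = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")

def build_name(parts, day, extension):
    """Join the name components and a YYYYMMDD date with underscores."""
    stem = "_".join(list(parts) + [day.strftime("%Y%m%d")])
    return f"{stem}.{extension}"

def is_portable(filename):
    """True if the name is short enough and uses only safe characters."""
    return len(filename) <= MAX_LENGTH and ALLOWED.fullmatch(filename) is not None

# Hypothetical interviewee and interviewer codes.
name = build_name(["int03", "ab"], date(2021, 1, 15), "mp3")
print(name, is_portable(name))
```

Note how quickly the 21-character limit is reached: the date and the extension alone take up 12 characters, so the remaining components need to be short codes.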
Read-only versions should be created at various stages of modification (e.g., original data, cleaned data, analysis-ready data). Further edits should only be made to copies of these master files.
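One possible way to protect master files and work on copies is sketched below in Python. The helper names `freeze_master` and `working_copy` and the file names are hypothetical. The permission bits behave as described on Unix-like systems; on Windows, `os.chmod` only toggles the read-only flag.

```python
import os
import shutil
import stat
import tempfile

def freeze_master(path):
    """Drop all write permissions so the master file is not edited by accident."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def working_copy(master, target):
    """Copy the master's content and make the copy writable for its owner."""
    shutil.copyfile(master, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target

# Demonstration in a throwaway directory with hypothetical file names.
workdir = tempfile.mkdtemp()
master = os.path.join(workdir, "survey_v1-0-0.csv")
with open(master, "w", encoding="utf-8") as handle:
    handle.write("id,score\n1,4\n")

freeze_master(master)
copy = working_copy(master, os.path.join(workdir, "survey_work.csv"))
print("master mode:", oct(stat.S_IMODE(os.stat(master).st_mode)))
print("copy mode:", oct(stat.S_IMODE(os.stat(copy).st_mode)))
```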
A well-known concept of versioning, based on the Data Documentation Initiative (DDI) standard, is:
- Major.Minor.Revision (cf. GESIS - Guidelines).
Starting from version "v1-0-0", increase
1. the first digit if multiple cases, variables, waves or samples have been added or deleted,
2. the second digit if data have been corrected in a way that affects the analysis,
3. the third digit if simple revisions have been made that do not affect meaning.
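The three rules can be expressed as a small Python helper. This is a sketch under assumptions: the function name `bump` is hypothetical, and resetting the lower digits to zero after a major or minor change is a common convention that the text above does not state explicitly.

```python
def bump(version, change):
    """Return the next version label for a 'major', 'minor' or 'revision' change."""
    major, minor, revision = (int(part) for part in version.lstrip("v").split("-"))
    if change == "major":      # cases, variables, waves or samples added or deleted
        return f"v{major + 1}-0-0"
    if change == "minor":      # corrections that affect the analysis
        return f"v{major}-{minor + 1}-0"
    if change == "revision":   # simple revisions without relevance to meaning
        return f"v{major}-{minor}-{revision + 1}"
    raise ValueError(f"unknown change type: {change}")

version = "v1-0-0"
for change in ("revision", "minor", "major"):
    version = bump(version, change)
    print(change, "->", version)
```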
Conventions should always be adapted to subject-specific or project-specific needs. If, for example, versions are not in a linear relationship to each other, the relationships can be defined via metadata schemas such as the DataCite Metadata Schema (relation types "IsDerivedFrom" and "IsSourceOf").
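As an illustration of such non-linear relationships, the Python sketch below builds related-identifier entries in the style of the DataCite Metadata Schema, using its relation types IsSourceOf and IsDerivedFrom. The helper `link_versions` and both DOIs are hypothetical, and the entries are only a fragment, not a complete DataCite record.

```python
import json

def link_versions(source_doi, derived_doi):
    """Return relatedIdentifier entries for both directions of a derivation."""
    def entry(doi, relation):
        return {
            "relatedIdentifier": doi,
            "relatedIdentifierType": "DOI",
            "relationType": relation,
        }
    return {
        source_doi: [entry(derived_doi, "IsSourceOf")],
        derived_doi: [entry(source_doi, "IsDerivedFrom")],
    }

# Hypothetical DOIs for a raw data set and a cleaned data set derived from it.
links = link_versions("10.1234/raw", "10.1234/cleaned")
print(json.dumps(links, indent=2))
```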
Versioning can also be supported by appropriate software (e.g. Git).
The choice of a suitable file format is particularly important with regard to long-term storage and use of the data. Some properties are usually desired: files and formats should not be encrypted, not compressed, and not proprietary or patent-protected. Accordingly, open, documented standards are preferred. For example, the following formats are usually preferred:
|Recommended format||Less suitable / unsuitable|
|.odt, .rtf, .txt||.doc/.docx|
|ASCII, .csv, .tsv, .tab||.xls/.xlsx, .mdb, .accdb|
|.por (SPSS portable)||.sav (SPSS)|
|.mp4||.mov, .avi, .wmv|
|.tiff, .jp2/.j2k/.jpx||.gif or .jpg|
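The table can be turned into a simple lookup, for example to screen a project folder before archiving. The Python sketch below only encodes the file extensions listed above (the "ASCII" entry has no extension and is left out). The function name `rate_format` is an assumption, and a result of "not listed" says nothing about suitability.

```python
from pathlib import Path

RECOMMENDED = {
    ".odt", ".rtf", ".txt", ".csv", ".tsv", ".tab",
    ".por", ".mp4", ".tiff", ".jp2", ".j2k", ".jpx",
}
LESS_SUITABLE = {
    ".doc", ".docx", ".xls", ".xlsx", ".mdb", ".accdb",
    ".sav", ".mov", ".avi", ".wmv", ".gif", ".jpg",
}

def rate_format(filename):
    """Classify a file by its extension according to the table above."""
    suffix = Path(filename).suffix.lower()
    if suffix in RECOMMENDED:
        return "recommended"
    if suffix in LESS_SUITABLE:
        return "less suitable"
    return "not listed"

for example in ("survey.CSV", "survey.xlsx", "notes.md"):
    print(example, "->", rate_format(example))
```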
Metadata is used to describe resources, in this case research data, in order to optimize their discoverability. Basic information includes, for example, title, author/primary researcher, institution, identifier, location & time period, subject, rights, file names, formats, etc. Since this information is essential for finding, understanding, and using data, standardized metadata schemas are intended to ensure that descriptions are as uniform and comprehensible as possible.
Metadata schemas are compilations of elements for describing data. Some disciplines already have specific metadata schemas, such as
- Humanities: Text Encoding Initiative (TEI)
- Earth Sciences: ISO 19115, Darwin Core
- Natural Sciences: ICAT schema, Crystallographic Information Framework, conventions for Climate and Forecast metadata
- Social and economic sciences: Data Documentation Initiative (DDI)
Before you start documenting your data, ideally as part of a data management plan, you should therefore check whether a suitable metadata schema already exists for your discipline. Information on this is provided, for example, by the Digital Curation Centre (DCC). If no discipline-specific schema is available, a discipline-independent one, such as Dublin Core, MARC21 or RADAR, can also be used.
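To give an impression of a discipline-independent description, the Python sketch below writes a record using element names from the Dublin Core Metadata Element Set and checks that no other element names slipped in. All values, the DOI and the helper name `unknown_elements` are hypothetical.

```python
import json

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - DC_ELEMENTS)

# Hypothetical example values.
record = {
    "title": "Example Survey Data",
    "creator": ["Doe, Jane"],
    "publisher": "Example Repository",
    "date": "2020",
    "identifier": "https://doi.org/10.1234/example",
    "subject": ["survey research"],
    "format": "text/csv",
    "type": "Dataset",
    "rights": "CC BY 4.0",
}

print(json.dumps(record, indent=2))
print("unknown elements:", unknown_elements(record))
```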
Metadata schemas thus specify what information should be delivered. For the best possible search and use of the data, it is also important to provide this information in as uniform a format as possible. A number of discipline-specific and cross-discipline 'controlled vocabularies', thesauri, classifications and authority files are available for this purpose, such as:
- Standards for unique identification of individuals such as Open Researcher and Contributor ID (ORCID) or International Standard Name Identifier (ISNI, ISO 27729).
- Subject classification systems (e.g. DDC or LCC)
- Subject-specific classifications such as the Mathematics Subject Classification (MSC) or the Social Sciences Classification.
- Subject-specific thesauri such as the Thesaurus of Social Sciences (TheSoz), the Standard Thesaurus of Economics (STW) or the Getty Vocabularies (AAT, TGN, CONA, ULAN).
An overview of different systems is provided, for example, by the Basel Register of Thesauri, Ontologies & Classifications (BARTOC) and Taxonomy Warehouse.
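Identifiers such as the ORCID mentioned above carry a check character, so obvious typing errors can be caught before metadata are submitted. The Python sketch below implements the ISO 7064 MOD 11-2 check used for ORCID iDs. The function names are assumptions, and a passing check only shows that the identifier is well-formed, not that it is registered.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid):
    """Check the length, digits and check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()

print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```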
Documentation usually goes beyond the description of data via metadata. It represents a deeper (scientific) indexing, in the context of which e.g. context of origin, variables, instruments, methods etc. are described in detail. In many cases, such a description is indispensable for understanding, verifying and, if necessary, using the data.
Publishing your data offers advantages for the scientific system, but also for you personally.
Published data are available for subsequent use in new contexts, e.g. also for interdisciplinary questions or meta-analyses. This not only creates scientific added value, but also avoids duplication of work and saves costs.
By assigning permanent identifiers, your data can be permanently referenced and cited by yourself and others. This is a prerequisite for data publications to be recognized as an independent achievement and to enter the scientific reputation system. A study by Piwowar and Vision (2013) also shows a higher citation rate for publications whose underlying research data have been published.
Last but not least, in some cases publication simply fulfills requirements of third parties. In addition to research funders, publication service providers are also increasingly demanding that the research data on which a publication is based be made available. Some examples of such requirements are:
- Public Library of Science (PLOS): Data Availability Policy / Materials and Software Sharing Policy
- Nature Publishing Group: Availability of Data, Material and Methods Policy
- Science: Data and Materials Availability Policy / Preparing Your Supplementary Materials
- BioMed Central: Availability of supporting data
- Elsevier: Research data Policy and Text and Data Mining Policy
Publishing your data can be done in different ways:
- Discipline-specific data repositories and centers (How do I find a suitable repository?). This usually represents the best solution.
- Cross-discipline repositories such as Zenodo, Dryad or figshare (a comparison of the three repositories can be found here). This is more of a medium-term solution, as long-term archiving is not guaranteed. Across disciplines, the repository of the University of Kassel is also available (see "Archiving and publishing data").
- Data supplements of journals, e.g. Nature. This is increasingly required, but should be complemented by other archiving strategies in view of long-term availability.
Data journals such as GigaScience, Earth System Science Data, or Journal of Chemical and Engineering Data (lists of Data Journals #1, #2) do not publish the data themselves, but a description of them, not an interpretation (documentation or data curation profiles). This takes into account, not least, the fact that traditional articles offer hardly any space for the important and valuable data description.
There are subject-specific or thematic repositories as well as generic ones. Subject repositories and data centers (such as Pangaea for geoscientific data, GenBank, Protein Data Bank) are often the first choice, not least with regard to visibility in the subject community, but also with regard to conformity to subject-specific standards. An overview of subject repositories is provided by the Registry of Research Data Repositories (re3data.org) and the Open Access Directory to research data. A targeted search for subject repositories that also allow data storage is offered by the re3data-based RepositoryFinder.
When deciding on a particular repository, the following points can help you:
- Is it a repository that fits the subject matter? Is it established and connected to specific search portals?
- Does the repository offer the desired services (PIDs, open access, differentiated access rights (e.g. user agreements), realization of embargo periods)?
- Is the sustainability of the repository guaranteed? Is there an exit strategy or an agreement to preserve the data in case of e.g. discontinuation of funding?
- How are data transfer and data use regulated in terms of content and form?
The University of Kassel also provides all researchers who cannot or do not wish to use a subject-specific repository with an institutional repository (DaKS) (expected to be available from mid-January 2021), which fulfills both archiving and publication functions (see also "Archiving and publishing data"). This repository can also be used for student projects and theses.
First of all, it is important that the data is available in a suitable format. Some repositories make stricter specifications here, others merely make recommendations or are open to all formats. This makes it all the more important to start thinking about this in advance of the research. For general advice and specific links on formats, see What file formats are useful?
In order for data to be found and used in a meaningful way, it must be documented in more detail through metadata. Please refer to the detailed notes at What are metadata, metadata schemas, and documentation?
An upload to a repository does not automatically mean immediate publication. Under certain circumstances, there may be reasons for an embargo period or partial publication. Especially in business-related research disciplines, embargoes on research results are common. Therefore, consider whether there are weighty reasons against immediate publication. On this, see Does anything speak against publication?
Also consider the conditions under which you want to publish your data. There are different license models for this (Which license should I choose?).
Uploading your data does not equate to open access. In principle, you can also publish research data with a delay or only make the metadata accessible. In the case of actual publication, you can regulate the rights to access and edit in detail via the license or contracts (Can I then control the use of my data at all?). These possibilities can essentially be limited by:
- the specific requirements and policies of your research funders and/or publishers
- lack of/limited rights to the data
- restrictions under data protection law
- restrictions on the part of the repository
There are constellations in which data should not be published or should only be published under certain conditions. The most important prerequisite for publication is that you have the right to do so (Who may decide on the disclosure and publication of data? Do I own the copyright to my data?).
On the other hand, it may be confidential, personal data that may only be published after anonymization or with the consent of the persons concerned (What data protection restrictions must I observe?).
Personal data is defined as "individual information about personal or factual circumstances of a specific or identifiable natural person" (Section 3 (1) BDSG). Such data are subject to strict requirements regarding collection, use and disclosure. For archiving, provision and publication, information that can be assigned to a specific or identifiable person should be removed from the research data. Depending on the data, different methods of anonymization are suitable.
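One simple approach is to remove direct identifiers and coarsen indirect ones, sketched below in Python. The field names, the record and the function `anonymize` are hypothetical. Whether such steps are sufficient depends on the data, since combinations of the remaining attributes can still make a person identifiable.

```python
# Hypothetical field names that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email", "address"}

def anonymize(record):
    """Drop direct identifiers and coarsen the birth year to a decade."""
    cleaned = {key: value for key, value in record.items()
               if key not in DIRECT_IDENTIFIERS}
    if "birth_year" in cleaned:
        decade = cleaned.pop("birth_year") // 10 * 10
        cleaned["birth_decade"] = f"{decade}s"
    return cleaned

participant = {
    "name": "Jane Doe",
    "email": "jane.doe@example.org",
    "birth_year": 1987,
    "score": 4,
}
print(anonymize(participant))
```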
If personal data are to be processed, the consent of the data subject must usually be obtained. Among other things, the purpose must be clearly defined and the data subject must be able to assess the consequences.
In addition, research data such as company data may contain confidential information (know-how protection) or confidentiality and non-disclosure agreements may have been made that preclude publication.
Possible owners or co-owners of the rights to the data are the researchers, the employer, the client, research funders and/or (private sector) contractual partners. Who may co-decide or must be asked about the sharing or publication of research data is determined by the contractual relationship. Usually, the results of commissioned research are the property of the employer or funder. The situation is different in the case of in-house research, where researchers are allowed to decide on the use of the data themselves.
Research objects and occasionally also research data may be protected as works within the meaning of the Copyright Act. These may be works of speech, computer programs, musical works, pantomime works including works of dance, works of fine arts including works of architecture and applied arts, photographic works, cinematographic works and representations of a scientific and technical nature.
As a rule, however, research data lack the necessary level of creation and are not works. It is possible, however, that certain types of research data are covered by a performance protection right, for example photographs, motion pictures or sound recordings.
Often, however, the research data of a research project are protected by copyright as part of a database work or fall under the ancillary copyright for databases.
Research data that do not fall under a property right can generally be used by anyone for any purpose without permission or obligation to pay.
If you have copyright or ancillary copyright over research data, you can regulate various aspects of use via appropriate contracts, such as the type and manner of use, user groups and time period, purpose, etc. Since contractual regulations for individual cases would be very costly in practice, there are various solutions for standardized regulations of usage rights. For example, the Leibniz Center for Psychological Information and Documentation (ZPID) offers standard contracts for the use of psychological data and GESIS regulates access restrictions for particularly sensitive social science data via user contracts. If you do not want your data to be subject to any specific access or usage restrictions, the use of standardized licenses such as Creative Commons or Open Data Commons is a good option (Which license should I choose?).
The publication of data under a specific license allows a detailed definition of the permissible forms of use. Licenses create legal certainty for both the person providing the data and the person using it. Even if you waive all restrictions, it is therefore important to state this explicitly.
Although data themselves are not usually subject to copyright, there is a case for treating them as potentially worthy of protection, not least to express one's own ideas about further use. Various licensing models are available for this purpose. The most common of these is 'Creative Commons' (CC). CC licenses are independent of the licensed content and cover copyrights, ancillary copyrights, and in the current version also database producer rights, where these exist.
The license package 'Open Data Commons' of Open Knowledge International (formerly Open Knowledge Foundation) has been designed especially for the publication of data. In addition to the unconditional license (Open Data Commons Public Domain Dedication and License (PDDL)), it offers three other models:
- Open Data Commons Attribution License (ODC BY) (v 1.0) (attribution condition).
- Open Data Commons Open Database License (ODbL) (v 1.0) (sharing under equal conditions)
- Database Contents License (DbCL) (distribution under the same conditions also for database contents)
Regardless of how legally binding it is, the CC-BY license certainly comes closest to fulfilling the idea of Open Access and Open Science. By contrast, 'distribution under the same conditions' can lead to compatibility problems with other licenses, and a prohibition of editing can restrict use, e.g. for data mining, or cause problems with long-term archiving. Prohibiting commercial use makes use in commercial databases more difficult and thus potentially reduces the visibility of your research (for details see Paul Klimpel, 2012).
Finding and using research data
Not least due to the requirements and recommendations of funders, publishers and institutions for making data accessible, research data are increasingly available for subsequent use. In order to find suitable research data for one's own research area, relevant offerings from one's own field often provide the first point of contact. These can be institutional or subject-specific repositories or data journals. Repositories can be searched by subject area using the Repository Finder. A - far from exhaustive - list of data journals can be found here.
In addition, it is also possible to search data across multiple repositories using generic search services. A major drawback of these search services is that they often cannot adequately map the detailed metadata schemas of their sources. In addition, the respective metadata differ greatly in terms of what they identify, i.e. individual data, data sets or collections.
The best-known portals include:
Retrieves metadata from repositories and databases via OAI-PMH. Research data can be found via the document type "primary data".
Searches metadata from various sources such as CLARIN or Global GBIF.
Searches metadata of information objects, including research data (object type 'Dataset'), which are registered with DOIs at DataCite. The metadata is also partly queried by the other two services.
Contains freely accessible research results from EU-funded projects.
- Google Dataset Search (proprietary!)
- gesisDataSearch - search of social and economic research data in data repositories and metadata services
- VerbundFDB - Search of studies, research data and instruments of empirical educational research
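Several of these services collect metadata via OAI-PMH, which repositories expose as plain HTTP requests. The Python sketch below only builds such a request URL using the protocol's `ListRecords` verb and the `oai_dc` metadata prefix; it does not contact any server. The endpoint address and the function name `list_records_url` are hypothetical.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"

# Hypothetical repository endpoint.
url = list_records_url("https://repository.example.org/oai", from_date="2021-01-01")
print(url)
```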
The respective rights (licenses, user agreements, if applicable) are binding for the subsequent use itself. Among other things, they can specify who may use the data for what purpose and for how long.
In order to be able to reuse research data, the quality of the data is crucial above all. Data quality in research data management includes the following areas in particular:
- Data format (special storage formats of scientific data, such as vector format, raster format, and property format, etc.).
- Data completeness and data correctness
Leibniz Data Manager is a free prototype, which is exemplary for similar tools:
Leibniz Data Manager allows visualization of different research data formats, enabling 'screening' of datasets for their potential usefulness. As a visualization and management tool, it supports the management and access to heterogeneous research data publications, and thus researchers in selecting relevant datasets for their respective disciplines.
Currently, a prototype of the Leibniz Data Manager is available and offers numerous functions for the visualization of research data.
In order to adequately document the (subsequent) use of one's own and external research data in the sense of good scientific practice, correct data citation is essential.
In the case of third-party data, this also acknowledges the scientific achievement of its 'originator'. As with the citation of other publications, the conventions for citing data may differ formally. In terms of content, however, they are united by the requirement of unambiguous identifiability of the data source. The FORCE11 Data Citation Synthesis Group has developed recommendations for data citation. According to them, a complete data citation includes
Author(s), year, title of research data, data repository or archive, version, worldwide Persistent Identifier.
Other optional details that may be useful as part of a citation include Edition, Feature name and URI, Resource type, Publisher, Unique numeric fingerprint (UNF), and Location (see Alex Ball & Monica Duke (2015). How to Cite Datasets and Link to Publications).
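The required elements can be assembled into a citation string, as in the Python sketch below. The order of the elements follows the list above, while the punctuation is an assumption, because citation styles differ. The function name `cite_dataset`, the data set and the DOI are hypothetical.

```python
def cite_dataset(authors, year, title, repository, version, pid):
    """Format the core elements of a data citation as one string."""
    author_text = "; ".join(authors)
    return f"{author_text} ({year}): {title}. {repository}. Version {version}. {pid}"

citation = cite_dataset(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2020,
    title="Example Survey Data",
    repository="Example Repository",
    version="1.0.0",
    pid="https://doi.org/10.1234/example",
)
print(citation)
```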
Unless otherwise noted, all text on this site and its subpages is licensed under a Creative Commons Attribution 4.0 International License.