Frequently Asked Questions
- Data Storage and Digital Preservation
- Data Publishing
- Legal Aspects
- Finding and Using Research Data
What are research data?
The term “research data” generally refers to all (digital) data that represent the result of scientific work or that serve as a basis for such work. The responsible handling of research data guarantees the traceability, verification and re-usability of scientific work and thus represents a central aspect of good scientific practice. Research data is generated using a wide variety of methods, such as measurements, source research or surveys, and is therefore always subject- and project-specific. However, the diversity of file types and formats makes an interdisciplinary definition difficult.
What is research data management?
Research data management comprises measures that create and preserve the sustainable usability of data. It is therefore important to plan the storage, documentation, description and archiving of the resulting data as early as possible. Ideally, such planning starts at the beginning of a research project and is regularly updated. Your plans should be documented in a data management plan (How do I create a data management plan?).
Good research data management minimizes the risk of data loss and ensures long-term, person-independent usability through documented and traceable filing structures and context information. It also ensures that your data remain usable for 10 years and beyond, as required by good scientific practice. Furthermore, it increases the visibility of your research, since a data publication is fully citable as an independent publication (How do I cite research data?).
Who demands research data management?
Research sponsors often demand research data management, sometimes also data sharing, in order to validate results and to avoid multiple funding. In addition, publishers are increasingly developing data policies that require the publication of at least the data on which the respective publication is based. In times of increasing digitalization of science, compliance with good scientific practice requires appropriate research data management.
A detailed description of the requirements of the parties involved can be found here: Which specific demands do sponsors, publishers and universities have?
When do I start managing research data?
Research data management should not start only after you have collected your data. Rather, it is best to consider how you want to handle your data before you create them. The data lifecycle can serve as orientation. Research data management raises questions that are comparable across disciplines, although the answers can vary greatly depending on your discipline.
Initial questions should include:
- Which file format/standard should I use?
- How can I gain financial and human resources for data maintenance?
- Who is responsible for the data to be produced?
- Whose property are the data to be produced?
Equivalent questions can be asked at all stages of the life cycle (see also Checklist for research data management (ger) by Wissgrid). Further useful tips can be found in our Instructions for Research Data Management. Many sponsors require data management plans, the basis of which is planned handling of data and which should help clarify such questions systematically in advance (How do I create a data management plan?).
How do I create a data management plan?
Creating a data management plan requires a detailed reflection on how you plan to handle the data generated in your respective project. A data management plan is to be understood as a "living document" which records the handling of research data all the way from the planning stage to the completion of a research project and which, if necessary, can and must be adapted to changes, new findings or problems.
The following tools and resources may help you create a data management plan:
RDMO by the Leibniz Institute for Astrophysics Potsdam is currently in beta phase. It is the aim of the project to not only support you in creating data management plans, but also in structuring planning, execution and maintenance.
DMPOnline was developed by the British Digital Curation Centre (DCC). Amongst other things, it includes a template for Horizon 2020-projects and guides users along detailed questions and answers.
DMPTool is operated by the California Digital Library. The website also offers examples for data management plans. Because of the different funding schemes, it is only of limited use for German/European projects.
Checklists, Templates, Wizards
One of the most compelling templates for research data management is the set of Twenty Questions for Research Data Management (Eng) by Oxford zoologist David Shotton.
• Checklist by Wissgrid: Guideline for Research Data Management. Handout (Ger)
• Checklist by the University of Bielefeld (Ger)
• Data Management Wizard by KomFor (Eng)
• H2020 Template (Eng)
• CLARIN-D Wizard (Eng)
Examples and Templates for Data Management Plans
• Exemplary Data Management Plans by the Humboldt University of Berlin: H2020 (Eng), DFG (Ger) and BMBF (Ger)
• Template for Data Management Plans for RWTH Aachen (Eng)
• Examples for DMPTool Data Management Plans (Eng)
• Examples for DCC Data Management Plans (Eng)
• Examples for Data Management Plans by the University of Leeds (Eng)
• Examples for Data Management Plans by UC San Diego (Eng)
• Examples for Data Management Plans by the National Endowment for the Humanities (Eng)
• The video tutorial by HU Berlin offers a good introduction.
Instructions for Research Data Management
Good scientific practice includes the careful handling of data, which is required by the research data guidelines as well as by many research funding agencies. These requirements can also serve as orientation and guidance when dealing with research data.
These instructions provide a basic overview on research data management:
Choose who is responsible for setting up and controlling your research data management. At the JLU, the heads of research projects are responsible for research data management. Therefore, check who needs to be involved and instructed to ensure proper data management for your entire project.
Check whether your discipline has specific institutional or general requirements or suggestions for research data management.
Check which requirements for storing and publishing your research data you have to meet (Which specific demands do sponsors, publishers and universities have?).
Check which data you will collect during your research.
Consider which research data is going to be published and provided for reuse.
Consider which research data shall be stored and archived (Data Storage and Digital Preservation).
Consider the possible ways you can store and archive your data. Could you use a general or a discipline-specific data repository? (How can I find a suitable repository?)
Clarify legal aspects of storing and passing on your research data. There might be privacy laws and copyright issues to consider. (Legal Aspects)
Create a data management plan in order to document your decisions and to support you in accounting for your research activities. (How do I create a data management plan?)
Adjust your data management plan during your research.
Which specific demands do sponsors, publishers and universities have?
Deutsche Forschungsgemeinschaft (DFG) (= German Research Foundation)
In its Proposals for Safeguarding Good Scientific Practice (Eng+Ger) the DFG states that: Primary data as the basis for publications shall be securely stored for ten years in a durable form in the institution of their origin.
These rules apply to all scientists. Regarding the publication of data, the DFG gives the following recommendation for the planning phase of a proposal:
"If your project includes the systematic collection of research data which could be re-used later, a plan detailing how this data will be transferred to existing databases or repositories should accompany your proposal." (DFG - Information for the planning phase).
Further information can be found in the DFG’s Guidelines for Handling Research Data.
European Commission (EC)
The EC’s “Open Research Data Pilot” is part of the EU research and innovation program Horizon 2020. From 2017 onwards, the Open Research Data Pilot applies to all sponsored projects.
- Guidelines on FAIR Data Management in Horizon 2020 (Eng) (pdf)
- Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020 (Eng) (pdf)
- Annotated Model Grant Agreement (Eng) (pdf)
- OpenAIRE Research Data Management Briefing Paper (Eng)
- OpenAIRE Factsheet Open Research Data Pilot (Eng) (pdf)
There are three main obligations:
- You have to create a data management plan according to the template. It has to be handed in within the first six months and updated according to relevant adjustments (or at least at interim and final evaluations).
- Data storage: Your research data has to be stored in an institutional, project-specific or discipline-specific data repository as early as possible (‘underlying data’) or according to the data management plan (‘other data’).
- Publication: If possible, your data should be published using an open license (preferably CC-BY or CC0) without use restrictions. The publication has to include the necessary contextual information and tools.
The EC describes its own policy as "as open as possible, as closed as necessary" (see: EC Guidelines on FAIR Data Management in Horizon 2020, p. 4). Legitimate reasons may therefore justify a partial exemption from these obligations. (Is there anything against a publication?)
Publishers increasingly require you to provide the research data on which a publication is based. Check the respective policies before you publish. Examples for guidelines can be found here:
- Public Library of Science (PLOS): Data Availability Policy (Eng) / Materials and Software Sharing Policy (Eng)
- Nature Publishing Group: Availability of Data, Material and Methods Policy (Eng)
- Science: Data and Materials Availability Policy (Eng) and Preparing Your Supplementary Materials (Eng)
- BioMed Central: Availability of Supporting Data (Eng)
- Elsevier: Text and Data Mining; Research Data Policy (Eng)
On 29 October 2018, Justus Liebig University Giessen adopted guidelines for research data (Research Data Policy (Ger)). These guidelines define the principles according to which members of the university have to handle research data.
Data Storage and Digital Preservation
Where do I store my data during the working process?
It is of the utmost importance to back up your data regularly in case of technical or human errors. It is the responsibility of the researcher to secure data. The university’s infrastructure and the Hochschulrechenzentrum (HRZ) (= university computer center) offer several possibilities for data storage:
Cloud Storage JLU-Box (Ger)
The JLUBox offers 100 GB of free cloud storage for all employees (g-identification) of Giessen University. You can share and synchronize data and work together on documents using the JLUBox. Moreover, you can provide data for students and external users.
Network Drives (Ger)
The HRZ offers two kinds of storage on network drives:
- The network drive for large amounts of data (data1) can be accessed by up to five users. The minimum amount of storage on this network drive is 2 TB (2000 GB).
- The network drive for work teams (winfile) offers storage regardless of the data size. The number of people who can be given access to the data is unrestricted.
Both network drives offer a secure way of storing data. Furthermore, both network drives can be connected to your workplace computer (Connecting a Windows PC (Ger)). You can order storage on the network drives via Giessen University's Online Shop (Ger). If you do not have access to the shop yet, you have to apply for it first.
Backup Services (Ger)
The HRZ offers regular and automated backups of servers via Tivoli Storage Manager. You have to apply for access to this system. The target group of this service is IT system administrators or other technically knowledgeable persons.
If you need more storage for larger research projects, please contact the HRZ at an early stage.
How can I structure my data?
During the work process, not only do numerous data sets accumulate, but often also several versions of them at different stages of modification. For efficient, coordinated and collaborative workflows, as well as for long-term traceability and internal and external reusability, you should decide on specific conventions for naming and versioning your data. It might be useful to sort files by their level of processing. You should document the conventions you use.
Naming conventions can vary widely depending on specific disciplines and data. Names should reflect the kind of data (original data / raw data, cleaned data, analytical data) or the data format (work file, result file etc.). This differentiation can also be made by versioning conventions. Uniformity, clarity and meaningfulness are important.
Examples for naming data and files:
• [sediment]_[sample]_[instrument]_[YYYYMMDD].dat
• [experiment]_[reagent]_[instrument]_[YYYYMMDD].csv
• [experiment]_[experiment set-up]_[test subject]_[YYYYMMDD].sav
• [observation]_[location]_[YYYYMMDD].mp4
• [interview partner]_[interviewer]_[YYYYMMDD].mp3
In order to guarantee compatibility between different operating systems, you should not use spaces, umlauts (ä, ö, ü) or special characters other than underscores and hyphens. File names should not exceed 21 characters.
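The naming rules above can be checked automatically before files are stored. The following is a minimal sketch, not an official tool: the function name check_file_name, the regular expression and the reading that the 21-character limit includes the file extension are assumptions made for illustration.

```python
import re

MAX_LENGTH = 21  # limit from the text; assumed here to include the extension
# Letters, digits, underscores and hyphens, followed by a single extension.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def check_file_name(name: str) -> list[str]:
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("too long")
    if " " in name:
        problems.append("contains spaces")
    if not name.isascii():
        problems.append("contains non-ASCII characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("unexpected characters or missing extension")
    return problems
```

A name such as s01_xrf_20240131.dat passes all checks, whereas a name with a space or an umlaut is reported with the reasons. Projects with other conventions would adjust the pattern and the limit.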
At the different stages of modifying your data (e.g. original file, cleaned data, data ready for analysis) you should create write-protected versions. You should only process your data further in copies of these original files.
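Creating a write-protected version can be scripted. The sketch below copies a file into a separate folder and removes its write permissions. The function name freeze_copy, the folder layout and the stage suffix are hypothetical choices, and file permissions are only a safeguard against accidental changes, not a replacement for backups.

```python
import shutil
import stat
from pathlib import Path


def freeze_copy(source: Path, archive_dir: Path, stage: str) -> Path:
    """Copy `source` into `archive_dir` with a stage suffix and make the copy read-only."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{source.stem}_{stage}{source.suffix}"
    if target.exists():
        # Frozen versions are never overwritten.
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(source, target)
    # Keep read permission, remove write permission for user, group and others.
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Further processing would then take place on a fresh copy of the frozen file, never on the frozen file itself.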
A well-known concept of versioning based on DDI (Data Documentation Initiative) is: Major.Minor.Revision (see: GESIS Guideline for Research Data Management (Ger))
Starting from version “v1-0-0”, the following positions are incremented:
- the first position, if more cases, variables, waves or samples are added or deleted
- the second position, if data are corrected with the result that the analysis is affected
- the third position, if revisions are simple without changes of meaning
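The Major.Minor.Revision rule above can be sketched as a small helper. The function name bump_version and the labels for the change types are hypothetical. Resetting the lower positions to zero when a higher position changes is an assumption borrowed from common versioning practice and is not stated in the guideline.

```python
def bump_version(version: str, change: str) -> str:
    """Increase one position of a 'vMAJOR-MINOR-REVISION' label.

    change: 'major'    (cases, variables, waves or samples added or deleted),
            'minor'    (corrections that affect the analysis),
            'revision' (simple revisions without changes of meaning).
    """
    major, minor, revision = (int(part) for part in version.lstrip("v").split("-"))
    if change == "major":
        return f"v{major + 1}-0-0"  # assumption: lower positions are reset
    if change == "minor":
        return f"v{major}-{minor + 1}-0"  # assumption: revision is reset
    if change == "revision":
        return f"v{major}-{minor}-{revision + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For example, correcting a coding error that affects the analysis would turn v1-0-0 into v1-1-0.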
Conventions should always be adjusted to the discipline and project-specific requirements. If, for example, versions are not in a linear relationship to each other, relationships can be defined (“IsDerivedFrom”, “IsSourceOf”) by using specific metadata schemes (like DataCite).
Versioning can also be supported by using appropriate software (Free Version Control Software (Eng), e.g. Git).
Where can I archive my data on a long-term basis?
Both the DFG's rules of good scientific practice (Eng) and the University's Research Data Policy (Ger) require you to store your research data for a minimum of 10 years.
Therefore, everything supporting the traceability of your research should be secured in the long term. There are several possibilities:
Storage systems such as Network Drives (Ger) are especially convenient for long-term storage because of their automatic backups. Their disadvantage is that access is bound to the status of the person using them at the University of Giessen.
An independent solution might be to store data in a discipline-specific or interdisciplinary research data repository. DataCite (Eng) provides the Registry of Research Data Repositories - re3data.org (Eng), which gives a good overview of research data repositories.
Storing data in a repository is not the same as publishing data (see: Does uploading data into a repository automatically lead to free access for others?).
Does uploading data into a repository automatically lead to free access for others?
Sharing your data does not imply free access. In principle, you can delay the publication of your research data or publish metadata only. If you do publish your data, you can regulate access and editing rights in detail via licenses or contracts (see also: Can I even control the use of my data?).
These possibilities can be fundamentally limited by:
- specific requirements and policies by your funders and/or publishers
- missing/limited rights to the data
- restrictions regarding data protection laws
- restrictions given by the repository
Why should I publish my data?
There are personal as well as scientific benefits to publishing data:
- Persistent Identifiers (PID) will make your data permanently referenceable and citable.
- Publishing your data is a precondition for recognizing it as independent scientific achievement and including it into the scientific reputation system.
- According to a study by Piwowar and Vision (2013) (Eng), publications are more likely to be cited if their underlying data has been published.
- A publication might meet the requirements of research funding organizations or publishing companies. Which specific demands do sponsors, publishers and universities have?
- Published data can be reused and put into new contexts, e.g. interdisciplinary research or meta-analysis. This does not only maximize scientific value but also avoids duplication of scientific research work.
- There might be legitimate reasons not to publish your data. (Is there anything against a publication?)
How can I publish my data?
There are several ways for publishing data:
Discipline-specific data repositories and centers (How can I find a suitable repository?) are usually the most convenient way, since demand for your data is most likely to arise within your discipline.
Data supplements accompanying journals, e.g. Nature (Eng), are increasingly promoted. However, with regard to digital preservation, you should use additional strategies for long-term preservation.
Data journals such as GigaScience (Eng), Earth System Science Data (Eng) or Journal of Chemical and Engineering Data (Eng) (Data Journal lists '#1', '#2') do not publish the data themselves but descriptions of data (documentation or Data-Curation-Profiles (Ger)) without interpretation, since traditional articles do not provide enough space for the important and valuable description of data.
How can I find a suitable repository?
There are discipline-specific as well as thematic and generic repositories. For increased visibility and presence within your discipline, and considering conformity with your discipline’s specific standards, discipline-specific repositories and data centers (such as Pangaea (Eng) for geoscientific data, GenBank (Eng), Protein Data Bank (Eng)) are usually the first choice. The Registry of research data repositories (re3data.org (Eng)) and the Open Access Directory for Research Data (Eng) give an overview of data repositories. Further interdisciplinary repositories include Zenodo (Eng) (EU funded), Dryad (Eng) or figshare (Eng).
This set of questions can help you decide which repository to choose:
- Does the repository fit in with your discipline? Is it established and included in discipline-specific search portals?
- Does the repository offer the functions you need – e.g. Persistent Identifier (PID), Open Access, differentiated access rights (e.g. user contracts, compliance with embargos)?
- Is the repository sustainable? Are there back-up plans to save data in case of e.g. the termination of funding?
- What are the formal and content-related regulations for data sharing and usage?
What do I have to consider when uploading data into a repository?
- It is important to use the right file format. Some repositories have strict requirements on which format to use, while others only make recommendations or impose no restrictions. Therefore, you should decide on the right format at an early stage of your research process (How do I create a data management plan?). General information and specific links to file formats can be found here: Which file format should I choose?
- Metadata has to be documented precisely in order to make the data traceable and usable. See: What are Metadata, Metadata Schemes and Documentations?
- Uploading data into a repository does not necessarily mean immediate publication. There might be reasons for an embargo period or for publishing only part of the data. Embargos are especially common in business-related academic fields. You should therefore consider possible reasons against immediate publication (Is there anything against a publication?).
- Consider under which conditions you want to publish your data. There are different types of licensing models (Which license should I choose?).
Which file format should I choose?
Choosing the right data format is important, especially with regard to the long-term storage and use of data. Usually, some characteristics are favored: files and formats should not be encrypted, compressed, proprietary or patented. Accordingly, open, documented standards are preferred.
Common formats are:
Data type | Recommended Format | Less suitable / unsuitable
.odf, .rtf, .txt
ASCII, .csv, .tsv, .tab
.por (SPSS portable)
.mov, .avi, .wmv
.tiff, .jp2/.j2k/.jpx | .gif or .jpg
RDBMS, .accdb, .mdb
Examples for recommendations can be found under UK Data Service (Eng).
What are Metadata, Metadata Schemes and Documentations?
Metadata describe research data in order to improve their traceability. They comprise basic information such as:
Title, Author/Main Researcher, Institution, Identifier (e.g. ORCID, UserID), Location & Time, Topic, Rights, File Name, Formats
Since this information is essential for finding, understanding and using data, standardized metadata schemes are meant to guarantee a description that is as uniform and comprehensible as possible. Introductions to metadata can be found in the University of Edinburgh’s MANTRA course (Eng).
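As an illustration, the basic fields listed above could be collected in a small structured record and checked for completeness before upload. This is a sketch only: the class BasicMetadata, its field names and the helper missing_fields are hypothetical and do not correspond to any particular metadata scheme.

```python
from dataclasses import asdict, dataclass


@dataclass
class BasicMetadata:
    """Hypothetical record holding the basic descriptive fields named above."""
    title: str
    author: str
    institution: str
    identifier: str
    location: str
    time: str
    topic: str
    rights: str
    file_name: str
    file_format: str


def missing_fields(record: BasicMetadata) -> list[str]:
    """Return the names of all fields that were left empty."""
    return [name for name, value in asdict(record).items() if not value.strip()]
```

In practice, a discipline-specific or interdisciplinary scheme would define the field names and their allowed values.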
Metadata schemes are compilations of elements describing data. For some disciplines, specific metadata schemes already exist, for example:
- Humanities -- TEI: Text Encoding Initiative (Eng)
- Experimental Data -- ICAT Schema (Eng)
- Biology -- Darwin Core (Eng)
- For Geographic Information -- ISO 19115-1:2014 (Eng)
- Social Sciences and Economics -- Data Documentation Initiative (DDI) (Eng)
An overview of more schemes can be found at the Research Data Alliance's Metadata Directory (Eng).
Before documenting your data, you should search for available metadata schemes. This will increase the uniformity of schemes within your discipline and spare you from having to create your own metadata scheme. Information can be found at the Digital Curation Centre (Eng) (DCC). If there is no discipline-specific scheme, you can also use an interdisciplinary scheme such as Dublin Core (Eng).
Metadata schemes determine which information is to be given. For the best possible search and use of data, it is important to record this information in as uniform a format as possible. For this purpose, there are several discipline-specific and interdisciplinary controlled vocabularies, thesauri, classifications and authority files such as:
- Standards for the unambiguous identification of people, e.g. Open Researcher and Contributor ID (Eng) (ORCID) or International Standard Name Identifier (Eng) (ISNI, ISO 27729)
- General and interdisciplinary classification systems, e.g. DDC or LCC
- Intradisciplinary classifications, e.g. Mathematics Subject Classification (Eng) (MSC) or the classification of Social Sciences (Ger)
- Discipline-specific thesauri, e.g. TheSoz (Eng), STW Thesaurus for Economics (Eng) or Getty Vocabularies (Eng) (AAT, TGN, CONA, ULAN).
An overview on different systems can be found here: Basel Register of Thesauri, Ontologies & Classifications (Eng) (BARTOC) and Taxonomy Warehouse (Eng).
Documentation is usually more than a mere description of data via metadata. It is a more comprehensive (subject-specific) account in which, for example, the context of creation, variables, instruments and methods are described in detail. In many cases this description is essential for understanding, checking and eventually using data.
Is there anything against a publication?
There are situations in which you should publish your data only under certain conditions or not at all. The most important precondition for a publication is that you have the necessary rights to do so.
Moreover, the data could be confidential, personal data that can only be published in an anonymized form or with consent of the persons affected.
Which restrictions by data protection laws do I have to consider?
Personal data “means any information relating to an identified or identifiable natural person (data subject)” (Art. 4(1) of the General Data Protection Regulation (GDPR)). There are strict requirements for collecting, using and passing on personal data. Information that can be linked to an identified or identifiable person has to be removed from your research data before they are archived, shared or published. Depending on the kind of data, there are several ways of anonymizing them.
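Two simple techniques are removing direct identifiers and coarsening values that could identify a person in combination with others. The sketch below shows both on a single record. The column names, the age bands and the function names are hypothetical, and whether the result counts as anonymous in the legal sense depends on the data set and has to be assessed case by case.

```python
# Hypothetical column names that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email", "address"}


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def reduce_record(record: dict) -> dict:
    """Drop direct identifiers and coarsen the 'age' field if present."""
    reduced = {key: value for key, value in record.items() if key not in DIRECT_IDENTIFIERS}
    if "age" in reduced:
        reduced["age"] = generalize_age(reduced["age"])
    return reduced
```

Dedicated tools such as those named in this section go considerably further, for example by measuring the remaining re-identification risk.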
Guidelines can be found at Forschungsdaten Bildung (Ger). Furthermore, there are several tools for anonymizing data, such as ARX (Eng), μ-ARGUS (Eng), sdc-micro (Eng) or the anonymizing tool by OpenAIRE (Amnesia). If you want to process personal data, you usually need the informed consent of the persons affected. The purpose has to be clearly defined and the person affected has to be able to assess the consequences. Moreover, research data such as company data can contain confidential information (know-how protection), or non-disclosure agreements might have been made that prohibit a publication.
Who can decide on transferring and publishing data?
Possible rights owners or co-owners of the data are researchers, employers, customers, research funders and/or (commercial) contractors. The contractual relationship determines who has a say in transferring or publishing research data. Usually, the results of instruction-bound research are the property of the employer/funder. The situation is different with private research, where researchers decide on their data themselves.
Do I own the copyright to my data?
Research objects, and occasionally research data, can be protected by copyright law as creative works. These include literary works, computer programs, musical works and the like. Usually, research data do not reach the necessary threshold of originality and are therefore not creative works. Some kinds of research data are exceptions because they are protected by ancillary copyright, e.g. photographs, moving pictures or sound recordings. Often, however, research data are protected by copyright as part of a database or by the ancillary copyright for databases. Research data not covered by any protective rights can usually be used by anyone for any purpose without permission or obligation to pay.
Can I even control the use of my data?
If you own a copyright or an ancillary copyright to your data, you can regulate several aspects of their use by contract, e.g. the kind of use, the user group, the period of use and the purpose of use. Since individual contractual arrangements are very complex, there are several solutions for standardized regulations on rights of use. For example, the Leibniz Institute for Psychology Information (ZPID) offers standardized contracts for using data that have been collected in psychological research. Another example is the GESIS user contracts (access restrictions for particularly sensitive social science data). If your data are not to be subject to any specific access or use restrictions, it is advisable to use standardized licenses such as Creative Commons or Open Data Commons.
See also: Which license should I choose?
Which license should I choose?
Publishing data under a specific license allows you to specify the allowed usage of your data in detail. This creates legal certainty for both the persons providing the data and the persons using the data. Even if there are no restrictions at all, it is therefore important to document this waiver clearly.
Although data are usually not subject to copyright law, you should nevertheless treat them as potentially worth protecting, if only to express your own expectations of further use. There are various license models for this purpose. The most popular one is Creative Commons (CC) (Ger). CC licenses are independent of the licensed content and cover copyright, ancillary copyright and, where they exist, sui generis database rights.
Open Knowledge International (formerly the Open Knowledge Foundation) has created the license package Open Data Commons especially for publishing data. Apart from the unconditional license (Open Data Commons Public Domain Dedication and License (PDDL)), it offers three other licenses:
- Open Data Commons Attribution License (ODC BY) (v 1.0) (condition of attribution)
- Open Data Commons Open Database License (ODbL) (v 1.0) (condition of ShareAlike)
- Database Contents License (DbCL) (condition of ShareAlike; for database contents as well)
Regardless of how legally binding it is, the CC-BY license best fulfills the idea of Open Access and Open Science, whereas “ShareAlike” can lead to compatibility issues with other licenses, and a prohibition of derivative works can restrict use (e.g. data mining, issues regarding long-term storage). If commercial use is prohibited, inclusion in commercial databases is hampered, which reduces the potential visibility of your research (for details see: Paul Klimpel (2012) (Ger)).
Whichever license you choose, you should choose wisely. An in-depth analysis of the topic can be found here: Andreas Wiebe & Lucie Guibault (2013) (Eng).
Finding and Using Research Data
Where can I find research data?
Not least due to requirements and recommendations by funders, publishers and institutions for making data accessible, research data is increasingly available for reuse. In order to find suitable research data for your own research, you should first have a look at relevant offers from your own discipline. There can be institutional or specialized repositories or Data Journals (List) (Ger). Repositories sorted by discipline can be found here: re3data.org (Eng).
Furthermore, you can do research using generic search engines. However, to their disadvantage, they often cannot depict the detailed metadata schemes of their sources. Moreover, the respective metadata differ greatly in their identification markers – single data, data sets or data collections.
Some of the most common search engines are:
BASE - Bielefeld Academic Search Engine (Eng): Metadata found in repositories and databases is obtained via OAI-PMH. Research data can be found using document type “Dataset”.
EUDAT B2FIND (Eng): Examines metadata from different sources such as CLARIN or GBIF.
DataCite Metadata Search (Eng): Examines metadata from entities, e.g. research data (Resource Type “Dataset”), that are registered at DataCite with DOIs. These metadata are partly also harvested by the two services listed above.
Google Dataset Search (Eng): A Google-style search engine for datasets.
How do I cite research data?
To document the (re)use of your own or external data in line with good scientific practice, you have to cite the data correctly.
Citing external data also acknowledges the scientific achievement of the “author”. As with citing other publications, conventions for citing data may differ formally. With regard to content, however, the cited data always have to be clearly identifiable. The FORCE11 Data Citation Synthesis Group has developed Recommendations for Citing Data (Eng). Accordingly, a full data citation includes:
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.
Further, possibly useful, optional additions are Edition, Feature Name and URI, Resource Type, Publisher, Unique Numeric Fingerprint (UNF) and Location (see Alex Ball & Monica Duke 2015: How to Cite Datasets and Link to Publications (Eng)).
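The core elements listed above can be assembled into a citation string with a small helper. This is only a sketch: the function name format_data_citation and the punctuation between the elements are assumptions, since the recommendations name the elements but journals and repositories prescribe their own styles.

```python
def format_data_citation(
    authors: list[str],
    year: int,
    title: str,
    repository: str,
    version: str,
    identifier: str,
) -> str:
    """Join the core citation elements in the order given above."""
    author_text = "; ".join(authors)
    return f"{author_text} ({year}): {title}. {repository}. Version {version}. {identifier}"
```

With placeholder values, a call such as format_data_citation(["Doe, J."], 2020, "Sample Dataset", "Example Repository", "1.0", "https://doi.org/10.1234/example") yields one line containing all six elements. The names and the DOI in this call are invented for illustration.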