THE RESEARCH DATA LIFECYCLE
Image courtesy of the University of Virginia Library.
Why are DMPs important?
Data Management Plans (DMPs) help demonstrate transparency and openness and return on public investment by validating that data as an output of publicly funded research are discoverable, accessible and reusable.
As well as being good research practice, a DMP can also be a research funder requirement.
The Guide below looks at the following categories:
- Organising and Documenting your Data
- Processing your Data
- Storing your Data
- Protecting your Data
- Preserving your Data
Examples / templates of DMPs
Digital Curation Centre (DCC) DMPOnline.
1. ORGANISING AND DOCUMENTING YOUR DATA
FILE STRUCTURE AND FILE NAMES
Once your research gets underway, you may quickly accumulate a large volume of data and may have multiple files in different formats and versions. Trying to find a data file that you need, especially if it has been named inaccurately or inconsistently, can be both exasperating and a substantial waste of research time. Good file management practices are required to help you identify, locate, and use your data effectively.
- How to make your data files distinguishable from each other within their containing folder.
- How to make the location and retrieval of your data files easy for both creator and other users.
- How to ensure that the files are sorted in a logical sequence and cannot be accidentally overwritten or deleted.
- Do you intend to transform the format of your data at the end of the active phase of the project to facilitate data sharing and long-term preservation?
- Will your data be made available in an open format e.g. ‘…one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information’? This may be a research funder requirement.
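The file naming considerations above can be illustrated with a short sketch. It is one possible convention, not a prescribed one: the function name `build_file_name`, the field order (project, description, date, version) and the underscore separator are all assumptions made for this example.

```python
import re
from datetime import date


def build_file_name(project: str, description: str, created: date,
                    version: int, extension: str) -> str:
    """Build a consistent, sortable file name (hypothetical convention)."""

    def slug(text: str) -> str:
        # Lower-case the text and replace runs of other characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain text, and
    # zero-padded version numbers sort correctly too (v02 before v10).
    suffix = extension.lstrip(".").lower()
    return (f"{slug(project)}_{slug(description)}_"
            f"{created.isoformat()}_v{version:02d}.{suffix}")


print(build_file_name("RDM Survey", "Interview Notes", date(2024, 3, 5), 2, ".CSV"))
# prints: rdm-survey_interview-notes_2024-03-05_v02.csv
```

Because every name follows the same pattern, files in a folder are distinguishable at a glance and sort into a logical sequence without extra tooling.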
Digital Repository of Ireland Factsheet No 3: File formats.
Sustainability of Digital Formats: Planning for Library of Congress Collections.
State Archives of North Carolina - File format guidelines for management and long-term retention of electronic records.
DBnormalization.com. Database Design Basics
Open Data Handbook: File Formats
Levels of documentation include:
- Project level documentation. What the study set out to do; the research questions and hypotheses; and what methodologies, sampling frames, instruments, or measures were used.
- File or data level documentation. How all the files that make up the dataset relate to one another e.g. a readme.txt file.
- Variable or item level documentation. e.g. not just defining the variable names at the top of the spreadsheet file, but a full label to explain the meaning of that variable.
- Decisions about the type of data to be collected and the data’s scope, quantity and format should be documented. This is likely to change as your DMP is adapted during the data life-cycle.
- This becomes critical once the data are no longer active and have been transferred to an archive for long-term preservation and sharing if appropriate. Your project documentation should be preserved with your data.
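Variable level documentation is often kept as a small data dictionary (codebook) stored alongside the dataset. The sketch below shows one way to produce such a file as CSV. The column names (`variable`, `label`, `units`, `missing_value`) and the example variable are hypothetical and should be adapted to your project.

```python
import csv
import io

# Hypothetical column layout for the codebook.
CODEBOOK_FIELDS = ["variable", "label", "units", "missing_value"]


def write_codebook(variables: dict) -> str:
    """Return CSV text with one row per variable and its full label."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CODEBOOK_FIELDS, lineterminator="\n")
    writer.writeheader()
    for name, info in variables.items():
        row = {"variable": name}
        # Fill any detail that was not supplied with an empty string.
        for field in CODEBOOK_FIELDS[1:]:
            row[field] = info.get(field, "")
        writer.writerow(row)
    return buffer.getvalue()


example = {
    "age_yrs": {
        "label": "Age of respondent at interview",
        "units": "years",
        "missing_value": "-99",
    }
}
print(write_codebook(example))
```

Saving this text as, for example, a codebook file next to the readme.txt keeps the meaning of each variable with the data when they are archived.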
Digital Repository of Ireland: How to DRI: Contextual Information.
Metadata refers to structured information that describes, explains, locates or otherwise makes it easier to retrieve, use, or manage another resource. For each dataset, we need to know at minimum, who created the data, when the data were created or published, and a title or descriptive name used to refer to the dataset.
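As an illustration of this minimum, the sketch below writes the three elements (title, creator, date) using Dublin Core element names. The wrapping `record` element, the function name and the sample values are assumptions for this example; a repository will normally specify its own record structure.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def minimal_record(title: str, creator: str, date: str) -> str:
    """Return a small XML record holding the minimum descriptive metadata."""
    # "record" is a hypothetical wrapper element, not part of Dublin Core.
    record = ET.Element("record")
    for tag, value in (("title", title), ("creator", creator), ("date", date)):
        child = ET.SubElement(record, f"{{{DC_NS}}}{tag}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


print(minimal_record("Example dataset", "A. Researcher", "2020-01-31"))
```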
Types of metadata include:
- Descriptive metadata: Describes the intellectual entity through properties such as author and title and supports discovery and delivery of digital content e.g. Dublin Core; MARC.
- Structural metadata: provides information about the internal structure of resources; describes relationship among materials; binds related files and scripts e.g. EAD.
- Technical metadata: technical information that applies to any file type, such as information about the software and hardware on which the digital object can be rendered or executed, or checksums and digital signatures to ensure fixity and authenticity.
- Administrative metadata: Manages administrative aspects of the digital object such as intellectual property rights and acquisition. Administrative Metadata also documents information concerning the creation, alteration and version control of the metadata itself.
- Use metadata: manages user access, user tracking and multi-versioning information.
- Preservation metadata: documents actions which have been undertaken to preserve a digital resource such as migrations and checksum calculations.
- You may wish to consider the use of a schema such as DDI during the active phase of the research project to describe newly generated data.
- Make sure that metadata associated with reused or secondary data is noted in the data level documentation.
- Electronic lab notebooks can be used to document the data and for metadata capture.
- Most technical metadata can be captured automatically by digital camera or scanner.
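The checksums mentioned under technical and preservation metadata can be generated with standard tools. The following is a minimal sketch, assuming SHA-256 as the algorithm (repositories may require a different one); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

Recording the checksum when a file is created, and checking it again after a copy or transfer, gives a simple test that the file has not changed.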
Digital Repository of Ireland Factsheet No. 1: Metadata and the DRI.
Digital Repository of Ireland Metadata Guidelines: Simple Dublin Core and the Digital Repository of Ireland.
Digital Repository of Ireland Metadata Guidelines: Qualified Dublin Core and the Digital Repository of Ireland.
Digital Repository of Ireland Metadata Guidelines: MARC21 encoded as MARCXML and the Digital Repository of Ireland.
Digital Repository of Ireland Metadata Guidelines: EAD, ISAD(G) and the Digital Repository of Ireland.
Digital Repository of Ireland Metadata Guidelines: MODS and the Digital Repository of Ireland.
Digital Repository of Ireland Batch Metadata Template
The Data Documentation Initiative (DDI) https://www.ddialliance.org/
Digital Curation Centre Briefing Papers: What are Metadata Standards?
University of Cambridge Research Data Management Electronic Lab Notebooks.
Programmes that can help add metadata directly into the files:
2. PROCESSING YOUR DATA
It is important to consistently identify and distinguish the versions of research data files. This ensures that a clear audit trail exists for tracking the development of a data file and identifying earlier versions when needed.
- What is your strategy concerning versioning your data files (and scripts) during the project?
- Will you create and/or follow a convention for versioning your data?
- Who will be responsible for ensuring that a master file is maintained, documented and versioned according to the project guidelines?
- How can different versions of a data file be distinguished?
For larger files investigate tools such as Git Large File Storage.
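One simple way to distinguish versions is a version suffix in the file name. The sketch below assumes a hypothetical convention in which names end in `_vNN` before the extension (for example `survey_v01.csv`), and computes the name of the next version so that an earlier file is never overwritten.

```python
import re

# Matches a trailing version marker such as "_v01" followed by an extension.
VERSION_PATTERN = re.compile(r"_v(\d+)(\.[^.]+)$")


def next_version_name(file_name: str) -> str:
    """Return the file name for the next version, keeping zero padding."""
    match = VERSION_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no version suffix found in {file_name!r}")
    number, extension = match.groups()
    # Keep at least the original padding width (v09 becomes v10).
    bumped = f"{int(number) + 1:0{len(number)}d}"
    return f"{file_name[:match.start()]}_v{bumped}{extension}"


print(next_version_name("survey_v01.csv"))
# prints: survey_v02.csv
```

A version control system such as Git records this history automatically for scripts and text-based files, so a naming convention like this is mainly useful for data files kept outside such a system.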
Research data interoperability supports the ability to reproduce or to verify the research; makes results of publicly funded research available to the public; helps provide qualified references and links to other data including secondary data; allows the impact and use of data to be tracked; and provides a structure which recognises and rewards the data creator. Interoperability considerations will impact on the processing stage of the research data lifecycle and when the data are archived and published.
- Will the processed data be assigned a unique identifier (UID) during the data collection and data analysis phases of the project?
- What attributions and credits should be associated with the project’s publicly available datasets to support citation?
- Will the data repository providing long-term preservation and access services assign a UID with the dataset?
- Will you make use of established terminologies/ontologies (i.e. structured controlled vocabularies) in the project?
- If not, how do your terminologies relate to established ones?
Digital Repository of Ireland Factsheet No. 7: Persistent Identifiers and DOIs.
ANDS Guide: Vocabularies and research data
3. STORING YOUR DATA
If data loss occurs, recovery may be slow, costly, or not possible. It is critically important to store and back up your data securely. Personal computers and laptops should not be used for storing master datasets, and external storage devices are not recommended for the long-term storage of data, particularly master copies. It is also important to think about how much storage you will require and to plan for that from the outset. Think too about who will need access to your data and what that means in terms of where the data are stored.
- Consider how much storage is needed for the entire duration of the project
- For long-term storage, decide what data will be kept long-term, what storage volume this represents and how long data will be stored and preserved.
- If storage is provided by your institution, costs may be included in standard indirect costs or overheads.
- If additional storage is needed, server/disk space should be costed along with setting up and maintenance.
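A rough storage estimate can be worked out with simple arithmetic. The sketch below assumes linear growth and a fixed number of copies; both the formula and the example figures are illustrative assumptions, not recommendations.

```python
def estimate_storage_gb(initial_gb: float, monthly_growth_gb: float,
                        months: int, copies: int) -> float:
    """Estimate total storage for a project, assuming linear growth.

    copies counts every stored copy of the data, including backups.
    """
    size_at_end = initial_gb + monthly_growth_gb * months
    return size_at_end * copies


# Hypothetical project: 50 GB at the start, 10 GB added per month,
# 36 months in total, 3 copies kept.
print(estimate_storage_gb(50, 10, 36, 3))
# prints: 1230
```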
Digital Repository of Ireland Guidelines: First Steps in Digital File Transfer Storage.
University College London - Storing & preserving data.
Digital Repository of Ireland Reports: Building the Digital Repository of Ireland Infrastructure.
Probably the single most important data management task you’ll undertake is keeping backups of your data. Always remember there is a real risk of loss through hard drive failure or accidental deletion.
RDM Planning Considerations
- How will you back up your data and how regularly will the backups be made?
- Will all the data be backed up, or just the data that have changed? A backup of only the changed data is known as an incremental backup, while a backup of all data is known as a full backup.
- How often will full and incremental backups be made?
- How long will your backups be stored?
- How much drive space or how many CDs or DVDs will be required to maintain this backup schedule?
- Institutional backup may be included in standard indirect costs or overheads.
- Cost additional backups according to the number of copies to be kept, frequency of backup and storage media needed.
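The difference between a full and an incremental backup can be sketched as a file selection rule. The example below assumes that "changed" means the file's modification time is later than the time of the previous backup; real backup tools may use other criteria, and the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import List, Optional


def files_to_back_up(root: Path, last_backup_time: Optional[float]) -> List[Path]:
    """Select files for a backup run.

    With last_backup_time=None every file is selected (full backup).
    Otherwise only files modified after that time are selected
    (incremental backup). Times are seconds since the epoch.
    """
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if last_backup_time is None or os.stat(path).st_mtime > last_backup_time:
            selected.append(path)
    return selected
```

This only decides which files to copy. Where the copies are stored, how long they are kept, and how restores are tested still need to be covered by your backup plan.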
Digital Curation Centre: Backup and Storage Management
You will need to identify the means and mechanisms you will employ for collecting, processing and storing your research data. For data that are particularly sensitive, this includes but is not limited to:
- The protection of research subjects from harm that might result from unintended disclosure or inappropriate use of confidential data
- Upholding the researcher’s assurance of confidentiality
- Adherence to requirements specified in any restricted use agreements
- Using optimal storage and use technologies that protect data securely without imposing unwarranted or excessive burdens on researchers
RDM Planning Considerations
- Identify the means and mechanisms you will employ for collecting, processing and storing your research data.
- Provide details of measures taken to secure research data e.g. physical security of equipment and notes (at work, at home and in the field), and digital security mechanisms, such as system, program and file passwording and use of encryption.
UK Data Service. Create and manage data: storing your data.
Definition of Data erasure.
4. PROTECTING YOUR DATA
PERSONAL DATA / CONFIDENTIAL INFORMATION
If the research project will generate data that includes confidential information or information that requires informed consent, there may be a requirement to notify a privacy officer. Liaise with your Ethics Committee/Research Office/Data Protection officer. You will need to consider how personal or confidential data will be protected during the project.
Review and approval from your research institution’s Ethics Committee should be sought particularly when working with personal data and health and medical data.
- Make sure informed consent to archive is obtained at the time of fieldwork.
- Consider whether your data will require a rigorous anonymization protocol.
- Operate within a rights management framework that includes depositor and end-user licenses and legal agreements. Users of the data should agree to the ethical use and re-use of the data and undertake not to breach confidentiality by using identifiable information in a published work or by trying to contact research subjects.
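One building block that is sometimes used when preparing personal data is replacing direct identifiers with stable codes. The sketch below uses a keyed hash (HMAC) for this. It is pseudonymisation rather than full anonymization: anyone holding the key can recompute the codes, other fields may still identify a person, and pseudonymised data can still count as personal data. The function name, the `P-` prefix and the sample key are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Replace a direct identifier (e.g. a name) with a stable code."""
    # Normalise so that "Jane Doe" and " jane doe " map to the same code.
    normalised = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256)
    return "P-" + digest.hexdigest()[:length]


# The key below is a placeholder; a real key should be generated randomly
# and stored separately from the data.
code = pseudonymise("Jane Doe", b"replace-with-a-real-secret-key")
print(code)
```

The same participant always receives the same code, so records can still be linked across files without the name appearing in the dataset.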
The International Association of Privacy Professionals: How GDPR changes the rules for research
Irish Qualitative Data Archive: IQDA Qualitative Data Anonymizer
Consent procedures should be tailored for the specific research context, methods and sample, the nature of the data (personal, sensitive, level of detail), the format of the data (surveys, written, recordings) and planned data uses and handling. This will influence the type of consent and consent process used.
The consent form should be written in plain language and free from jargon allowing the participant to respond clearly.
Consent forms should be kept for as long as the research data are retained (during the active phase of the project and when the data are archived for long-term preservation).
UK Data Service: Consent for data sharing
INTELLECTUAL PROPERTY RIGHTS / COPYRIGHT
Resolving data ownership issues early is particularly important. Ownership over the data ultimately controls its dissemination, preservation, and destruction. Ownership issues can become even more problematic because of the various stakeholders involved in collecting and generating data. Research funders and institutions may also have policies related to the ownership of data. For this reason, it is important for researchers to come to an agreement at the beginning of a project and contact their institution if they are unsure about their intellectual property rights.
- What terms enabled the collection or use of the research data
- Contractual obligations with the research institution(s) where the research is undertaken
- Funder obligations
- Licensing associated with the secondary use of existing data
- Permissions needed to collect/reuse the data
- Will these rights be transferred to another organisation for data distribution and archiving?
Digital Repository of Ireland Factsheet No. 2: Copyright, Licensing and Open Access.
Digital Curation Centre: How to License Research Data
5. PRESERVING YOUR DATA
Planning for the long-term preservation of your data will rely on good data management planning during the active phase of your project. You will need to build in preservation planning early on and adjust it to any research outcomes that emerge during the data gathering and processing phases.
Planning for how and where the data is stored after the project ends will require:
- Deciding what data will be retained and what will be destroyed so it is irrecoverable.
- Preparing the data for transfer to a trusted digital repository for long-term preservation.
When deciding where to archive your data, consider which repository provides the best preservation services to allow the long-term reuse of the data. Make sure that your data will be associated with a persistent identifier that is available and managed over time and will not change even if the object of preservation is moved or renamed. Persistent identifiers support reference reliability and readability for both humans and machines.
- Check that the repository of choice offers preservation services which allow the long-term reuse of any data. Trust in a repository is demonstrated by a public statement describing the practices followed and the provenance of data preserved.
- Check that the repository of choice offers a persistent identifier service that is available and managed over time and will not change if the object of preservation is moved or renamed.
Digital Repository of Ireland CoreTrustSeal Certification (2018).
UK Data Archive: Trusted Digital Repositories.
The Australian National Data Service (ANDS): Persistent Identifiers
DATA FORMATS FOR ARCHIVING
Research data formats may change as a part of the preparation planning for long-term preservation. Raw data are data in their original state at the time of collection while processed data are data transformed and used to analyse the research questions.
- What transformations have occurred to the data during the research data lifecycle?
- Validation of research findings may require that the raw data are preserved.
- If the data have been processed, the code and algorithms used will also need to be preserved, together with sufficient metadata to describe the transformation.
You may consider, or be required by your research funder or institution, to make your research data available as Open Data, i.e. data that anyone can freely access, use, modify, and share for any purpose, subject at most to requirements that preserve provenance and openness. This may require you to decide whether some or all of your data are to be published. When doing so, consider your ethical review, funder requirements, and any journal publishing requirements that may come into play. When reviewing your archiving options, discuss with the trusted digital repository how you can control access to the preserved data.
When you decide to publish the findings of your research it may be necessary to embargo (closed access for a limited time-period) your data for the following reasons:
- Sensitive information and/or names that cannot be released at the time of publication
- Cases that could be identified, even if anonymised
- Confidential government statistics
- Information relevant to current court cases
- Information subject to copyright or other intellectual property restrictions
You may decide to manage access to the data using the following options applied in the repository where you deposit your data for long-term preservation:
Public-Public (Open access)
- Metadata is fully discoverable
- Data are accessible and immediately downloadable
Public-Private (Mediated Open Access)
- Metadata is fully discoverable
- Mediated access to data via the data custodian
Private-Private (Closed Access)
- Metadata is not publicly available
- Data is not discoverable or available to third parties
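The three access options, together with an embargo date, can be modelled as a small data structure. This is only an illustrative sketch: the class and function names are hypothetical, and the rule that an embargo delays data download without hiding the metadata is an assumption that your chosen repository may implement differently.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    OPEN = "Public-Public"        # metadata discoverable, data downloadable
    MEDIATED = "Public-Private"   # metadata discoverable, data via custodian
    CLOSED = "Private-Private"    # neither metadata nor data available


@dataclass
class Dataset:
    title: str
    access: AccessLevel
    embargo_until: Optional[date] = None  # None means no embargo


def metadata_is_discoverable(dataset: Dataset) -> bool:
    return dataset.access in (AccessLevel.OPEN, AccessLevel.MEDIATED)


def data_is_downloadable(dataset: Dataset, today: date) -> bool:
    """Data can be downloaded directly only if open and not under embargo."""
    if dataset.access is not AccessLevel.OPEN:
        return False
    return dataset.embargo_until is None or today >= dataset.embargo_until
```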
- The licensing needed to make your data available as Open Data.
- Will all of the data or only parts of it be published?
- How your data should be cited when reused.
- Is an embargo period needed for all or some of the data?
- Any legal or ethical restrictions that prevent the publication of all the material.
- Any restrictions that require action to be taken before the data can be made available.
- Any risks of delay in publishing or making available all or parts of the data, and what might need to be done to avoid this.
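To make it easy for others to cite your data, you can supply a suggested citation with the dataset. The sketch below assembles one from the creators, year, title, publisher (usually the repository) and persistent identifier. The element order is one common pattern and is an assumption here; repositories and publishers often prescribe their own format. All sample values are placeholders.

```python
from typing import List


def format_citation(creators: List[str], year: int, title: str,
                    publisher: str, identifier: str) -> str:
    """Assemble a suggested dataset citation from its basic elements."""
    names = "; ".join(creators)
    return f"{names} ({year}). {title}. {publisher}. {identifier}"


print(format_citation(
    ["Murphy, A.", "Byrne, B."],        # placeholder creators
    2020,
    "Example Survey Data",              # placeholder title
    "Example Repository",               # placeholder publisher
    "https://doi.org/10.xxxx/example",  # placeholder identifier
))
```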
ANDS Case Study: Benefiting women’s health
Open Science Framework Guides: Licensing
Digital Curation Centre: How to Cite Datasets and Link to Publications
Research Data Netherlands: Addressing a researcher's data sharing concerns