A guide to reproducible archiving of data and code

Daniel I. Bolnick (daniel.bolnick@uconn.edu), Roger Schürch (rschurch@vt.edu), Daniel Vedder (daniel.vedder@idiv.de), Max Reuter (m.reuter@ucl.ac.uk), Leron Perez (leron@stanford.edu), Robert Montgomerie (mont@queensu.ca) and Sebastian Lequime (s.j.j.lequime@rug.nl)

Last updated 1st January 2023

From mid-2020, the Journal of Evolutionary Biology (JEB) has mandated the deposition of data in a public repository. We also strongly encourage authors to archive any analysis and simulation code (e.g. R scripts, Matlab scripts, Mathematica notebooks) used to generate reported results in a public repository, and this will become a requirement in the near future.

The following text outlines our expectations and provides a concise guide on how to prepare a data and code archive. It is based on a document initiated by American Naturalist EiC Dan Bolnick and written by a number of volunteers, including JEB EiC Max Reuter, with input from the ecology, evolution and behaviour communities. These guidelines have been adapted specifically for JEB by our Data Editor Sebastian Lequime.

Rationale

Sharing data and code is vital for many reasons. It promotes appropriate interpretation of results, allows their validity to be checked, supports future data synthesis and replication, and provides a teaching tool for students learning to do analyses themselves. Shared code also gives readers greater confidence in the results.

Any data, including raw data, and computer code used to generate scientific results should be easily usable by reviewers or readers. The fundamental question you should ask yourself when preparing an archive is, “If a reader downloads my data and code, will my data files and scripts be comprehensible, and will they run to completion and yield the same results on their computer?”

The recommendations below are intended as a guide towards preparing well-documented and usable data and code for deposition in a repository. You will find it easier to build a clean, reusable data and code archive if you adhere to these recommendations from the start of your research, but you can also use this text as a final checklist before archiving, or when finishing a research project and setting it aside.

In the following, high-priority points are in blue font, while black font indicates suggestions to follow ‘best practices’.

1. Clean documentation

➤ Prepare a README file with important information about your repository as a whole (code and file contents). Text (.txt) and Markdown (.md) README files are readable by a wider variety of software tools, so have greater longevity. The file should contain the following (not every item will apply to every project; a minimal skeleton is sketched after this list):

  • Author names and contact details
  • Title of study
  • A brief summary of what the study is about
  • Link to the publication or preprint, if available
  • Who was responsible for collecting the data and writing the code
  • Code version (e.g., Git fingerprint, manual version number)
  • Overview of folders/files and their contents (referring to the paper is not sufficient)
  • Workflow instructions for users to run the software (e.g. explain the project workflow and any configuration parameters of your software)
  • For larger software projects: instructions for developers (e.g. the structure and interactions of submodules), and any subsidiary documentation files
  • Links to protocols.io or equivalent methods repositories, where applicable
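
As a rough illustration, a minimal README.txt covering these points might be laid out as follows (all names, versions and paths are placeholders):

  Title:    <study title>
  Authors:  A. Author (a.author@university.edu), B. Author
  Summary:  One or two sentences describing the study and the questions it addresses.
  Publication/preprint: <DOI or URL, added when available>
  Responsibilities: data collected by A. Author; code written by B. Author
  Code version: v1.0 (Git commit abc1234)

  Folder overview:
    data/     raw data files (.csv) with accompanying codebooks
    code/     analysis scripts, numbered in the order they are run
    outputs/  figures and tables produced by the scripts

  Workflow: run the scripts in code/ in numerical order; configuration
  parameters are listed at the top of each script.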

➤ You might additionally wish to include a file (e.g. ‘requirements.txt’) documenting the packages and software versions used (including the operating system) and dependencies (if these are not installed by the script itself). Alternatively, this information may be included in the README file.
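
For Python projects, for example, running pip freeze > requirements.txt records the versions of all packages installed in the current environment; R users can capture the equivalent information with sessionInfo(). Note the operating system and language version separately (e.g. in the README). The entries below are purely illustrative of the resulting format:

  numpy==1.26.4
  pandas==2.1.4
  scipy==1.11.4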

➤ Use informative names for folders and files (e.g. “code”, “data”, “outputs”).

➤ Give license information (either in the README or a separate file), such as Creative Commons open-source license language granting readers the right to reuse your code. For more information on how to choose and write a license, see choosealicense.com.

➤ If applicable, list funding sources used to generate the archived data, and include information about permits (collection, animal care, human research).

➤ A great template is available here: https://github.com/gchure/reproducible_research

2. Clean code

➤ Thoroughly annotate your code with in-script comments indicating the purpose of each set of commands (i.e. “why?”). If the functioning of the code (i.e. “how”) is unclear, strongly consider re-writing it to be clearer and simpler.

➤ Scripts should start by loading required packages and importing the raw data in a format exactly as it is archived in your data repository.

➤ Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers.
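
A minimal sketch of such a script opening in Python (file and folder names are placeholders; the same pattern applies to R scripts):

# Load required packages first
from pathlib import Path
import pandas as pd

# Relative path: works on any computer, provided the script is run
# from the repository root
DATA_DIR = Path("data")
measurements = pd.read_csv(DATA_DIR / "measurements.csv")  # the archived raw data, unmodified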

➤ Test your code before shipping, ideally on a pristine machine without any packages installed, but at a minimum in a fresh session.

➤ If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of the packages that you use in your published article.

➤ Document units and equations.

➤ Use informative names for input files, variables, and functions.

➤ Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data.

➤ Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables (see the sketch after this list).

  • Sections can be separate script files run in order (as explained in your README), blocks of code within one script that are separated by clear breaks (e.g., comment lines such as # ----------), or a series of function calls (which can facilitate reuse of code).
  • Group code by function: files and functions should remain concise. Files of more than 800 lines of code usually benefit from being split into smaller files. Similarly, functions should perform a single task.
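
For instance, a single analysis script might be organised into clearly labelled blocks (file, column and output names below are purely illustrative):

# ---------------------------------------------------------------
# 1. Import and clean data
# ---------------------------------------------------------------
import pandas as pd

measurements = pd.read_csv("data/measurements.csv")
measurements = measurements.dropna(subset=["wing_length"])  # filtering done in the script

# ---------------------------------------------------------------
# 2. Summary statistics (Table 1)
# ---------------------------------------------------------------
table1 = measurements.groupby("site")["wing_length"].describe()
table1.to_csv("outputs/table1.csv")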

➤ Label code sections with headers that match the figure number, table number, or text subheading of the paper.

➤ Omit extraneous code not used for generating the results of your publication, or place any such code in a clearly marked coda at the end of the script.

➤ Where useful, save and deposit intermediate steps as their own files. In particular, if your scripts include computationally intensive steps, it can be helpful to provide their output as an extra file, as an alternative entry point to re-running your code.

➤ If your code contains any stochastic process (e.g., random number generation, bootstrap re-sampling), set a random number seed at least once at the start of the script or, better, for each random sampling task. This will allow other users to reproduce your exact results.
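
In Python, for example, this can be as simple as the following (the seed value is arbitrary; in R, set.seed() plays the same role):

import random
import numpy as np

random.seed(42)                   # seed the built-in generator
rng = np.random.default_rng(42)   # seeded NumPy generator for resampling

data = np.arange(100)             # placeholder data
bootstrap_sample = rng.choice(data, size=data.size, replace=True)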

➤ Include clear error messages in your code that explain what went wrong (e.g. if the user gave a text input where a numeric input was expected).
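
A short sketch of such a check in Python (the function and its purpose are hypothetical):

def set_threshold(value):
    """Set an analysis threshold; `value` must be numeric."""
    if not isinstance(value, (int, float)):
        raise TypeError(
            f"Threshold must be a number, but received {type(value).__name__}: {value!r}"
        )
    return float(value)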

3. Clean data

Checklist for preparing data to upload to DRYAD or another repository.

Repository contents

➤ All data used to generate a published result should be included in the archive, including digital raw data (raw sequencing reads, photos, videos, sound recordings, etc.). For papers with multiple experiments, this may mean a corresponding number of data files.

➤ Save each file with a short, meaningful file name (see DRYAD’s file-naming recommendations).

➤ Prepare a README_DATA text file to accompany each data file. It should provide a brief overall description of the file’s contents, and a list of all variable names with explanations (e.g. units). This should allow a new reader to understand what the entries in each column mean and relate this information to the Methods and Results of your paper. Alternatively, this may be a “Codebook” file in table format, with each variable as a row and columns providing variable names (as used in the file), descriptions (e.g. for axis labels), units, etc.
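
An illustrative codebook layout (variable names, descriptions and units are placeholders):

  variable      description                          units   type
  wing_length   Length of the left wing              mm      numeric
  site          Identifier of the collection site    -       categorical
  notes         Free-text comments on the specimen   -       text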

➤ Save the README_DATA files as text (.txt) or Markdown (.md) files and all of the data files as comma-separated values (.csv) files.

➤ If your data are in Excel spreadsheets, you are welcome to submit those as well (to be able to use colour coding and provide additional information, such as formulae), but each worksheet of data should also be saved as a separate .csv file.

➤ We recommend also archiving any digital material used to generate data (e.g., photos, sound recordings, videos, etc.), but this may exceed the storage limits of some repository sites. At a minimum, upload a few example files illustrating the nature of the material and the range of outcomes.

Digital raw data

➤ You should archive any digital material used to generate data (raw sequencing reads, photos, sound recordings, videos, etc.) either in a specialized repository (e.g., the Sequence Read Archive) or in a general one (e.g. DRYAD). DRYAD allows 300 GB of data per publication uploaded through the web interface. If your data exceed that amount, please contact our data editor to discuss possible alternatives.

Data file contents and formatting

➤ Archived files should include raw data, not group means or other summary statistics; for convenience, summary statistics can be provided in a separate file, or generated by code archived with the data.

➤ Identify each variable (column names) with a short name. Names should preferably be <10 characters long and not contain any special characters that could interfere with reading the data and running analysis code. Use an underscore (e.g. wing_length) or camel case (e.g., WingLength) to distinguish words if you think that is needed.

➤ Omit variables not analyzed in the publication, for brevity.

➤ A common, recommended data structure is one in which every row is an observation and every column is a variable.

➤ Each column should contain only one kind of data (e.g. do not mix numerical values and comments or categorical scores in a single column).

➤ Use “NA” or an equivalent to indicate missing data (and specify what you use in the README file).
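
Putting these points together, a small (entirely hypothetical) data file could look like this, with one row per observation, short column names, and “NA” for missing values:

  individual_id,site,wing_length,notes
  1,A,2.31,NA
  2,A,2.45,wing_damaged
  3,B,NA,not_measured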

4. Completing your archive

➤ Prepare your data and code archive, and the associated README files, simultaneously with manuscript preparation (analysis and writing).

➤ Data and code should be archived in version-controlled repositories (e.g. DRYAD, Zenodo). Your own GitHub account (or other privately controlled website) does not qualify as a public archive because it does not provide a DOI, and because you control access and might take down the data at a later date. You can, however, link your GitHub repository with third-party repositories, such as Zenodo, which will provide a DOI (see the guide here: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content).

➤ Provide all of the metadata and information requested by the repository, even if this is optional and redundant with information contained in the README files. Metadata makes your archived material easier to find and understand.

➤ From the repository, get a private URL and provide this on submission of your manuscript. It will allow editors and reviewers to access your archive before your data are made public.

5. For more information

More detailed guides to reproducible code principles can be found here: 

Documenting Python Code: A Complete Guide: https://realpython.com/documenting-python-code/

A guide to reproducible code in Ecology and Evolution, British Ecological Society: https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf

Dockta tools for building code repositories: https://github.com/stencila/dockta#readme

Version management for Python projects: https://python-poetry.org/

Principles of Software Development: an Introduction for Computational Scientists (https://doi.org/10.5281/zenodo.5721380), with an associated code inspection checklist (https://doi.org/10.5281/zenodo.5284377).

Style guides for code

Google style guide for Python: https://google.github.io/styleguide/pyguide.html

Other recommendations for good code style

➤ The inputs and outputs of functions should be clearly documented in their docstrings.

  • For example:

import random

def coin_flip(p: float) -> bool:
    """Flip a biased coin with probability `p` of heads.

    Inputs:
        p: float, should be between 0 and 1

    Returns:
        coin_state: boolean, True if heads, False if tails
    """
    # coin_flip implementation (illustrative)
    coin_state = random.random() < p
    return coin_state

➤ Separating function definition from execution is highly recommended. For example, there should be a main file where biologically/statistically meaningful functions relevant to the research are used, and a file where those functions are defined (i.e. a ‘lib’ file).
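
A minimal sketch of this split (file, function and variable names are illustrative; the two files are shown together for brevity):

# --- lib.py: function definitions only ---
def standardise(values):
    """Scale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

# --- analysis.py: execution only; imports the definitions above ---
from lib import standardise

wing_lengths = [2.31, 2.45, 2.19]   # placeholder data
scaled = standardise(wing_lengths)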

➤ Where applicable, explicitly document the assumptions/constraints of the approach used. For example, ‘estimate_confidence_interval(data)’ should state in its docstring what assumptions are made about the structure and statistics of ‘data’ (e.g. that noise is independent and normally distributed).
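
For instance, the docstring of the hypothetical estimate_confidence_interval function mentioned above might read:

import numpy as np

def estimate_confidence_interval(data):
    """Return a 95% normal-theory confidence interval for the mean of `data`.

    Assumptions:
        - observations in `data` are independent and identically distributed
        - noise is approximately normally distributed
        - `data` contains no missing values
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    # 1.96 approximates the 97.5% quantile of the standard normal distribution
    half_width = 1.96 * data.std(ddof=1) / np.sqrt(data.size)
    return mean - half_width, mean + half_width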

➤ Provide examples and tests (e.g. using pytest or RUnit) that show the expected behaviour of each function if you are writing a software package.
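
A minimal pytest sketch for the coin_flip example above (assuming it is saved in a module called coin_flip.py):

# test_coin_flip.py: run with the `pytest` command
from coin_flip import coin_flip

def test_returns_boolean():
    assert coin_flip(0.5) in (True, False)

def test_extreme_probabilities():
    assert coin_flip(1.0) is True    # probability 1 always gives heads
    assert coin_flip(0.0) is False   # probability 0 never gives heads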

➤ Use formatters and linters, which help tidy up your code and catch mistakes (they behave somewhat like spell-checkers for code).

  • E.g. for Python: black for formatting and flake8 for linting
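
For example, running black my_script.py reformats the file in place, while flake8 my_script.py reports style issues and simple errors (the file name is a placeholder); in R, the styler and lintr packages play similar roles.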