A guide to reproducible archiving of data and code

Daniel I. Bolnick (daniel.bolnick@uconn.edu), Roger Schürch (rschurch@vt.edu), Daniel Vedder (daniel.vedder@idiv.de), Max Reuter (m.reuter@ucl.ac.uk), Leron Perez (leron@stanford.edu), Robert Montgomerie (mont@queensu.ca) and Sebastian Lequime (s.j.j.lequime@rug.nl)

Last updated 1st January 2023

From mid-2020, the Journal of Evolutionary Biology (JEB) has mandated the deposition of data in a public repository. We also strongly encourage authors to archive any analysis and simulation code (e.g. R scripts, Matlab scripts, Mathematica notebooks) used to generate reported results in a public repository, and this will become a requirement in the near future.

The following text outlines our expectations and provides a concise guide on how to prepare a data and code archive. It is based on a document initiated by American Naturalist EiC Dan Bolnick and written by a number of volunteers, including JEB EiC Max Reuter, with input from the ecology, evolution and behaviour communities. These guidelines have been adapted specifically for JEB by our Data Editor Sebastian Lequime.

Rationale

Sharing data and code is vital for many reasons. It promotes appropriate interpretation of results, allows their validity to be checked, supports future data synthesis and replication, and provides a teaching tool for students learning to do analyses themselves. Shared code also gives readers greater confidence in the results.

Any data, including raw data, and computer code used to generate scientific results should be easily usable by reviewers or readers. The fundamental question you should ask yourself when preparing an archive is, “If a reader downloads my data and code, will my data files and scripts be comprehensible, and will they run to completion and yield the same results on their computer?”

The recommendations below are intended as a guide towards preparing well-documented and usable data and code for deposition in a repository. You will find it easier to build a clean, reusable data and code archive if you adhere to these recommendations from the start of your research, but you can also use this text as a final checklist before archiving, or when finishing a research project and setting it aside.

In the following, high-priority points are in blue font, while black font indicates suggestions to follow ‘best practices’.

1. Clean documentation

➤ Prepare a README file with important information about your repository as a whole (code and file contents). Text (.txt) and Markdown (.md) README files are readable by a wider variety of software tools, so have greater longevity. The file should contain the following (not every item will apply to every project; a minimal skeleton is sketched after this list):

  • Author names and contact details
  • Title of study
  • A brief summary of what the study is about
  • Link to the publication or preprint, if available
  • Who was responsible for collecting the data and writing the code
  • Code version (e.g., Git fingerprint, manual version number)
  • Overview of folders/files and their contents (referring to the paper is not sufficient)
  • Workflow instructions for users to run the software (e.g. explain the project workflow and any configuration parameters of your software)
  • For larger software projects: instructions for developers (e.g. the structure and interactions of submodules), and any subsidiary documentation files
  • Links to protocols.io or equivalent methods repositories, where applicable
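
As a rough illustration, a minimal README.txt covering these points might be laid out as follows (all names, versions and paths are placeholders):

  Title:    <study title>
  Authors:  A. Author (a.author@university.edu), B. Author
  Summary:  One or two sentences describing the study and the questions it addresses.
  Publication/preprint: <DOI or URL, added when available>
  Responsibilities: data collected by A. Author; code written by B. Author
  Code version: v1.0 (Git commit abc1234)

  Folder overview:
    data/     raw data files (.csv) with accompanying codebooks
    code/     analysis scripts, numbered in the order they are run
    outputs/  figures and tables produced by the scripts

  Workflow: run the scripts in code/ in numerical order; configuration
  parameters are listed at the top of each script.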

➤ You might additionally wish to include a file (e.g. ‘requirements.txt’) documenting the packages and software versions used (including the operating system) and dependencies (if these are not installed by the script itself). Alternatively, this information may be included in the README file.
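
For Python projects, for example, running pip freeze > requirements.txt records the versions of all packages installed in the current environment; R users can capture the equivalent information with sessionInfo(). Note the operating system and language version separately (e.g. in the README). The entries below are purely illustrative of the resulting format:

  numpy==1.26.4
  pandas==2.1.4
  scipy==1.11.4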

➤ Use informative names for folders and files (e.g. “code”, “data”, “outputs”).

➤ Give license information (either in the README or a separate file), such as Creative Commons open-source license language granting readers the right to reuse your code. For more information on how to choose and write a license, see choosealicense.com.

➤ If applicable, list funding sources used to generate the archived data, and include information about permits (collection, animal care, human research).

➤ A great template is available here: https://github.com/gchure/reproducible_research

2. Clean code

➤ Thoroughly annotate your code with in-script comments indicating the purpose of each set of commands (i.e. “why?”). If the functioning of the code (i.e. “how”) is unclear, strongly consider re-writing it to be clearer and simpler.

➤ Scripts should start by loading required packages and importing the raw data in a format exactly as it is archived in your data repository.

➤ Use relative paths to files and folders (e.g. avoid setwd() with an absolute path in R), so other users can replicate your data input steps on their own computers.
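
A minimal sketch of such a script opening in Python (file and folder names are placeholders; the same pattern applies to R scripts):

# Load required packages first
from pathlib import Path
import pandas as pd

# Relative path: works on any computer, provided the script is run
# from the repository root
DATA_DIR = Path("data")
measurements = pd.read_csv(DATA_DIR / "measurements.csv")  # the archived raw data, unmodified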

➤ Test your code before shipping, ideally on a pristine machine without any packages installed, but at a minimum in a fresh session.

➤ If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of the packages that you use in your published article.

➤ Document units and equations.

➤ Use informative names for input files, variables, and functions.

➤ Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data.

➤ Organise your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables (see the sketch after this list).

  • Sections can be separate script files run in order (as explained in your README), blocks of code within one script that are separated by clear breaks (e.g., comment lines such as # ----------), or a series of function calls (which can facilitate reuse of code).
  • Group code by function: files and functions should remain concise. Files of more than 800 lines of code usually benefit from being split into smaller files. Similarly, functions should perform a single task.
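
For instance, a single analysis script might be organised into clearly labelled blocks (file, column and output names below are purely illustrative):

# ---------------------------------------------------------------
# 1. Import and clean data
# ---------------------------------------------------------------
import pandas as pd

measurements = pd.read_csv("data/measurements.csv")
measurements = measurements.dropna(subset=["wing_length"])  # filtering done in the script

# ---------------------------------------------------------------
# 2. Summary statistics (Table 1)
# ---------------------------------------------------------------
table1 = measurements.groupby("site")["wing_length"].describe()
table1.to_csv("outputs/table1.csv")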

➤ Label code sections with headers that match the figure number, table number, or text subheading of the paper.

➤ Omit extraneous code not used for generating the results of your publication, or place any such code in a clearly marked coda at the end of the script.

➤ Where useful, save and deposit intermediate steps as their own files. In particular, if your scripts include computationally intensive steps, it can be helpful to provide their output as an extra file, as an alternative entry point to re-running your code.

➤ If your code contains any stochastic process (e.g., random number generation, bootstrap re-sampling), set a random number seed at least once at the start of the script or, better, for each random sampling task. This will allow other users to reproduce your exact results.
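
In Python, for example, this can be as simple as the following (the seed value is arbitrary; in R, set.seed() plays the same role):

import random
import numpy as np

random.seed(42)                   # seed the built-in generator
rng = np.random.default_rng(42)   # seeded NumPy generator for resampling

data = np.arange(100)             # placeholder data
bootstrap_sample = rng.choice(data, size=data.size, replace=True)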

➤ Include clear error messages in your code that explain what went wrong (e.g. if the user gave a text input where a numeric input was expected).
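
A short sketch of such a check in Python (the function and its purpose are hypothetical):

def set_threshold(value):
    """Set an analysis threshold; `value` must be numeric."""
    if not isinstance(value, (int, float)):
        raise TypeError(
            f"Threshold must be a number, but received {type(value).__name__}: {value!r}"
        )
    return float(value)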

3. Clean data

Checklist for preparing data to upload to DRYAD or another repository.

Repository contents

➤ All data used to generate a published result should be included in the archive, including digital raw data (raw sequencing reads, photos, videos, sound recordings, etc.). For papers with multiple experiments, this may mean a corresponding number of data files.

➤ Save each file with a short, meaningful file name (see DRYAD’s file-naming recommendations).

➤ Prepare a README_DATA text file to accompany each data file. It should provide a brief overall description of the file’s contents, and a list of all variable names with explanations (e.g. units). This should allow a new reader to understand what the entries in each column mean and relate this information to the Methods and Results of your paper. Alternatively, this may be a “Codebook” file in table format, with each variable as a row and columns providing variable names (as used in the file), descriptions (e.g. for axis labels), units, etc.
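
An illustrative codebook layout (variable names, descriptions and units are placeholders):

  variable      description                          units   type
  wing_length   Length of the left wing              mm      numeric
  site          Identifier of the collection site    -       categorical
  notes         Free-text comments on the specimen   -       text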

➤ Save the README_DATA files as text (.txt) or Markdown (.md) files and all of the data files as comma-separated values (.csv) files.

➤ If your data are in Excel spreadsheets, you are welcome to submit those as well (to be able to use colour coding and provide additional information, such as formulae), but each worksheet of data should also be saved as a separate .csv file.

➤ We recommend also archiving any digital material used to generate data (e.g., photos, sound recordings, videos, etc.), but this may exceed the storage limits of some repository sites. At a minimum, upload a few example files illustrating the nature of the material and the range of outcomes.

Digital raw data

➤ You should archive any digital material used to generate data (raw sequencing reads, photos, sound recordings, videos, etc.) either in a specialized repository (e.g., the Sequence Read Archive) or in a general one (e.g. DRYAD). DRYAD allows 300 GB of data per publication uploaded through the web interface. If your data exceed that amount, please contact our data editor to discuss possible alternatives.

Data file contents and formatting

➤ Archived files should include raw data, not group means or other summary statistics; for convenience, summary statistics can be provided in a separate file, or generated by code archived with the data.

➤ Identify each variable (column names) with a short name. Names should preferably be <10 characters long and not contain any special characters that could interfere with reading the data and running analysis code. Use an underscore (e.g. wing_length) or camel case (e.g., WingLength) to distinguish words if you think that is needed.

➤ Omit variables not analyzed in the publication, for brevity.

➤ A common, recommended data structure is one in which every row is an observation and every column is a variable.

➤ Each column should contain only one kind of data (e.g. do not mix numerical values and comments or categorical scores in a single column).

➤ Use “NA” or an equivalent to indicate missing data (and specify what you use in the README file).
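
Putting these points together, a small (entirely hypothetical) data file could look like this, with one row per observation, short column names, and “NA” for missing values:

  individual_id,site,wing_length,notes
  1,A,2.31,NA
  2,A,2.45,wing_damaged
  3,B,NA,not_measured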

4. Completing your archive

➤ Prepare your data and code archive, and the associated README files, simultaneously with manuscript preparation (analysis and writing).

➤ Data and code should be archived in version-controlled repositories (e.g. DRYAD, Zenodo). Your own GitHub account (or other privately controlled website) does not qualify as a public archive because it does not provide a DOI, and because you control access and might take down the data at a later date. You can, however, link your GitHub repository with third-party repositories, such as Zenodo, which will provide a DOI (see the guide here: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content).

➤ Provide all of the metadata and information requested by the repository, even if this is optional and redundant with information contained in the README files. Metadata makes your archived material easier to find and understand.

➤ From the repository, get a private URL and provide this on submission of your manuscript. It will allow editors and reviewers to access your archive before your data are made public.

5. For more information

More detailed guides to reproducible code principles can be found here: 

Documenting Python Code: A Complete Guide: https://realpython.com/documenting-python-code/

A guide to reproducible code in Ecology and Evolution, British Ecological Society: https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf

Dockta tools for building code repositories: https://github.com/stencila/dockta#readme

Version management for Python projects: https://python-poetry.org/

Principles of Software Development: an Introduction for Computational Scientists (https://doi.org/10.5281/zenodo.5721380), with an associated code inspection checklist (https://doi.org/10.5281/zenodo.5284377).

Style guides for code

Google style guide for Python: https://google.github.io/styleguide/pyguide.html

Other recommendations for good code style

➤ The inputs and outputs of functions should be clearly documented in their docstrings.

  • For example:

import random

def coin_flip(p: float) -> bool:
    """Flip a biased coin with probability `p` of heads.

    Inputs:
        p: float, should be between 0 and 1

    Returns:
        coin_state: boolean, True if heads, False if tails
    """
    # coin_flip implementation (illustrative)
    coin_state = random.random() < p
    return coin_state

➤ Separating function definition from execution is highly recommended. For example, there should be a main file where biologically/statistically meaningful functions relevant to the research are used, and a file where those functions are defined (i.e. a ‘lib’ file).
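
A minimal sketch of this split (file, function and variable names are illustrative; the two files are shown together for brevity):

# --- lib.py: function definitions only ---
def standardise(values):
    """Scale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

# --- analysis.py: execution only; imports the definitions above ---
from lib import standardise

wing_lengths = [2.31, 2.45, 2.19]   # placeholder data
scaled = standardise(wing_lengths)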

➤ Where applicable, explicitly document the assumptions/constraints of the approach used. For example, ‘estimate_confidence_interval(data)’ should state in its docstring what assumptions are made about the structure and statistics of ‘data’ (e.g. that noise is independent and normally distributed).
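
For instance, the docstring of the hypothetical estimate_confidence_interval function mentioned above might read:

import numpy as np

def estimate_confidence_interval(data):
    """Return a 95% normal-theory confidence interval for the mean of `data`.

    Assumptions:
        - observations in `data` are independent and identically distributed
        - noise is approximately normally distributed
        - `data` contains no missing values
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    # 1.96 approximates the 97.5% quantile of the standard normal distribution
    half_width = 1.96 * data.std(ddof=1) / np.sqrt(data.size)
    return mean - half_width, mean + half_width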

➤ Provide examples and tests (e.g. using pytest or RUnit) that show the expected behaviour of each function if you are writing a software package.
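
A minimal pytest sketch for the coin_flip example above (assuming it is saved in a module called coin_flip.py):

# test_coin_flip.py: run with the `pytest` command
from coin_flip import coin_flip

def test_returns_boolean():
    assert coin_flip(0.5) in (True, False)

def test_extreme_probabilities():
    assert coin_flip(1.0) is True    # probability 1 always gives heads
    assert coin_flip(0.0) is False   # probability 0 never gives heads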

➤ Use formatters and linters, which help tidy up your code and catch mistakes (they behave somewhat like spell-checkers for code).

  • E.g. for Python: black for formatting and flake8 for linting
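
For example, running black my_script.py reformats the file in place, while flake8 my_script.py reports style issues and simple errors (the file name is a placeholder); in R, the styler and lintr packages play similar roles.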