To make the impossible possible: publishing experimental data from decades ago

I am currently writing papers based on data obtained by members of my group, some of which date back more than 15 years. Astonishingly, I can easily find and reproduce all of these data and feed them into GraphPad for state-of-the-art analyses and graph design. This is possible mainly because, from early on, we introduced simple, standardised data filing rules that are used by all members of the group.

A simple and transparent naming and identifier system:

[Image: DorothyBishop-Tweet]
Data management is a general challenge that requires discipline and simple consistent rules that are easy to follow.

As shown in the image, each experiment has its own folder. The folder name starts with the date on which the experiment was initiated, followed by a letter if more than one experiment began on the same day, and uses hyphens and underscores consistently: “21-08-29A_”. This date-and-letter code is the identifier for the experiment, and the names of all documents in the folder (see below) start with it.

To allow quick browsing, all experiment folders sit in a single “Experiments” folder (rather than in subfolders that group experiments in subjective ways), so that they all line up by date and can be quickly scanned for content using their concise experimental descriptors. To provide these descriptors, the folder names contain key information about the experiment in a shorthand that is common to members of our group: e.g. “21-08-29A_Khc8Df_tub-Syt_ConA_3DIV” [Date/identifier_genotype (Kinesin heavy chain8/Df)_antibodies used (tubulin-Synaptotagmin)_primary neurons cultured on concanavalin A_for 3 days in vitro]. We usually perform two sets of experiments and, if they are consistent, pool the data; the second experimental folder of this pairing contains the pooled analysis and is named accordingly: e.g. “21-09-13_Khc8Df_tubSyt_ConA_3DIV-POOL”. With this naming system, the experimental folders we need can be identified in a matter of seconds.
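
The naming convention described above is regular enough to be checked by a short script. The following is only a minimal sketch in Python, not something our group actually uses: it assumes that the date is written as YY-MM-DD, that the optional letter is a single capital, and that pooled folders end in “-POOL”. The names parse_folder_name and ExperimentFolder are made up for this illustration.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed pattern: YY-MM-DD, an optional capital letter, an underscore,
# then the descriptors separated by underscores.
FOLDER_PATTERN = re.compile(r"^(\d{2})-(\d{2})-(\d{2})([A-Z]?)_(.+)$")
POOL_SUFFIX = "-POOL"  # assumed marker for folders holding the pooled analysis


class ExperimentFolder(NamedTuple):
    identifier: str    # e.g. "21-08-29A"
    started: date      # start date, assuming years 2000-2099
    descriptors: list  # e.g. ["Khc8Df", "tub-Syt", "ConA", "3DIV"]
    pooled: bool       # True if the name ends in "-POOL"


def parse_folder_name(name: str) -> Optional[ExperimentFolder]:
    """Split a folder name into its parts; return None if it does not fit."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None
    yy, mm, dd, letter, rest = match.groups()
    try:
        started = date(2000 + int(yy), int(mm), int(dd))
    except ValueError:  # e.g. month 13 or day 40
        return None
    pooled = rest.endswith(POOL_SUFFIX)
    if pooled:
        rest = rest[: -len(POOL_SUFFIX)]
    return ExperimentFolder(f"{yy}-{mm}-{dd}{letter}", started, rest.split("_"), pooled)
```

Calling parse_folder_name on each name in the “Experiments” folder and listing those for which it returns None would flag folders that break the convention. Because the date comes first in YY-MM-DD order, plain alphabetical sorting already lines the folders up chronologically.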

[Image: DataStorage]

A simple and transparent system of storing experimental information
Each folder contains an explanatory document with the same name as the folder. It has three purposes: (1) promoting proper planning of the experiment, (2) making sure the experiment has been properly executed and closed, and (3) ensuring that the experiment and its outcomes are understandable even years later. Our blank template (‘blanco’) document contains the following items:

  • date/identifier: see above
  • rationale/objective: as a rule of thumb, we should never perform experiments we do not understand or do not agree with. In both cases we should engage in further discussion before time is invested in vain. To this end, writing an experimental rationale/objective of 1-3 sentences is a good checkpoint, and it provides a narrative that can be understood many years later.
  • Experimental specimens used, such as
    • genotype (how identified, what controls)
    • developmental stage
  • Experimental procedures, such as
    • culture procedures: e.g. citing the file name of the specific protocol that will be/was used (usually stored in a dedicated “Documents” folder, where changes introduced over time are tracked as different versions) and listing any deviations from that protocol
    • primary/secondary antibody combinations including concentrations and animals of origin
  • Documentation, such as
    • storage information: e.g. slide box/slot numbers of slides, specific links to online repositories
    • image plates generated
  • Results: describing quality of the experiment, observations, statistical analysis, pooled analyses
  • Conclusion: as a rule of thumb, an experiment should never be considered finished without having drawn a clear conclusion: e.g. a clear outcome, the need to repeat the experiment with different parameters, or a clear reasoning as to why this approach does not work and the fundamental strategy needs to be changed.
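
The item list above can also be turned into a small helper that creates the blank document and later reports which items are still unfilled, for example to check that an experiment has been closed with a conclusion. This is a rough sketch under stated assumptions, not our actual template: it assumes a plain-text document in which each item is a heading line ending in a colon, and the function names blank_document and empty_sections are invented for this example.

```python
# Headings taken from the item list above; the plain-text layout
# ("Heading:" followed by free text) is an assumption of this sketch.
SECTIONS = [
    "Date/identifier",
    "Rationale/objective",
    "Experimental specimens",
    "Experimental procedures",
    "Documentation",
    "Results",
    "Conclusion",
]


def blank_document(folder_name: str) -> str:
    """Return an empty explanatory document for the given experiment folder."""
    identifier = folder_name.split("_", 1)[0]  # text before the first underscore
    lines = [folder_name, ""]
    for section in SECTIONS:
        lines.append(f"{section}:")
        if section == "Date/identifier":
            lines.append(identifier)  # the only item that can be pre-filled
        lines.append("")
    return "\n".join(lines)


def empty_sections(text: str) -> list:
    """Return the headings that have no text underneath them yet."""
    current = None
    filled = set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":") and stripped[:-1] in SECTIONS:
            current = stripped[:-1]  # a new heading starts
        elif stripped and current is not None:
            filled.add(current)      # any non-blank line counts as content
    return [section for section in SECTIONS if section not in filled]
```

An experiment could then be treated as closed only once empty_sections returns an empty list, and in particular once ‘Conclusion’ no longer appears in it.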

The explanatory document is the centrepiece of each folder and is accompanied by all other experiment-related files, such as image files, Excel sheets, statistics documents etc. Importantly, all document names should start with the date/identifier! This allows you to copy image or data files to other locations (e.g. when preparing publications or theses) while retaining an easy way to trace them back to their original source if further information is needed.
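
The rule that every file name starts with the date/identifier is easy to verify automatically. Below is a minimal sketch in Python, assuming that the identifier is everything before the first underscore of the folder name; the function name misnamed_files is hypothetical.

```python
from pathlib import Path


def misnamed_files(folder: Path) -> list:
    """List files in an experiment folder that lack the identifier prefix."""
    # Assumption: the identifier is the part of the folder name before the
    # first underscore, e.g. "21-08-29A".
    identifier = folder.name.split("_", 1)[0]
    return sorted(
        item.name
        for item in folder.iterdir()
        if item.is_file() and not item.name.startswith(identifier)
    )
```

Running this over every folder in “Experiments” before a backup would catch files such as raw microscope exports that were saved under their default names.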

At The University of Manchester, we are in the fortunate position that institutional backup server space is provided, which is good practice that will hopefully become standard at all universities. Members of my group usually store their ‘Experiment’ and ‘Protocol’ folders on an external hard disc that can be carried between work and home, and use FreeFileSync, a reliable and efficient open-source program (which provides automatic updates after a one-off donation), to make regular backups to the external server. In this way, all members of the group have access at any time, can share results easily and leave them behind as their legacy when moving on to other jobs.

Extending this model towards student project supervision
To help students organise their work and to facilitate communication during supervision, we have developed our data storage model further: students maintain a summary document. This document can be sent to me in preparation for supervisory meetings, and it helps students to maintain an overview of their project, and hence to keep a clear mind and prevent sudden panic attacks of the “I do not have enough data for my thesis” kind. As I tend to tell students: “If you keep this document up-to-date, the results part of your report/thesis is mostly written and just needs to be arranged into a meaningful order.” The document is broken down into at least four parts:

  • Brainstorm: Any ideas that might arise, be it during discussions, in the shower, on the way to uni or during reading, should immediately be inserted as a bullet point into the brainstorm section. It is key to equip each item with a brief rationale (we have had various cases where we could not reconstruct the idea behind an experiment) and key ideas on how the experiment could be performed. In our supervisory meetings we usually go through this list and set priorities for next steps.
  • Ongoing: Experiments that have been started are entered here or dragged over from the Brainstorm list. The aim is to provide a quick insight into how things are progressing, for example by giving the experimental identifiers of completed sets of experiments with a brief statement as to what the major outcome appears to be. This serves as a quick summary of the state of ongoing work, so that supervisory meetings are less about reporting and more about interpretation and focussing on the next steps.
  • Completed: Once experiments are completed, including data pooling and a clear conclusion, they can be dragged into the ‘Completed’ section. Since data and thoughts about your experiments are fresh in your mind at this stage, this is the time to insert data from your experimental document and edit the results and conclusions in ways that can readily be used in the report/thesis. It is essential to always list the experimental identifiers, so that more detailed information can be easily retrieved from the respective experiment folder if required. At this stage, you could also bring all graphs to publication standard and potentially select representative images, which can be stored, for example, in the experimental folder (clearly indicating in the ‘Completed’ section where to find them). Performing these tasks at this stage may take you half an hour or an hour, but should be more efficient than performing them during the final writing stage, when you first have to re-familiarise yourself with the data, quite apart from the fact that these tasks will pile up into a huge workload that distracts from the actual writing. If all experiments are prepared in this way, the section can evolve over time: different sets of experiments can be grouped into higher-order statements, and by and by the first sub-headings can be written. I have had good experiences with keeping all this information in bullet point format so that it can be shifted around and played with. Furthermore, it does no harm to insert relevant references at this stage (or even in the experimental document), saving you the effort of having to ‘re-discover’ them at a later stage.
  • Discussion: As is the case for experimental ideas in the ‘Brainstorm’ section, any thoughts (your own or from the literature), topics or problems that seem appropriate/helpful for the Discussion should be listed as bullet points whenever they surface. Ideas come out of nowhere but are quickly forgotten, and it is a great feeling to have them saved as a repository to work from when reaching the writing stage. Important: never throw away any bullet points from the ‘Completed’ or ‘Discussion’ sections, but rather shift them into a spare section (I call it ‘Tidbits’). Those points might seem obsolete at a certain time but could turn into an unforeseen gem that fills a crucial gap at a later stage.

I hope some of these thoughts are useful, and any suggestions are of course most welcome. Some people might regard this procedure as micro-management, but I do not see it this way. Ownership of the process lies with the lab members, and it is an opportunity to optimise data management and supervisory communication. I can say from experience that these procedures, if adhered to with discipline, are enormously time-saving in the long run and guarantee transparency and accessibility for decades, so that data that would otherwise never see the light of day will get published in the end. Notably, this system of data management is independent of any dedicated software and hence highly flexible to use. Furthermore, I have seen students who did not follow this path panic, and others carry out final experiments until a couple of weeks before their submission deadline, since they knew that things were under control.