More biologists, myself included, are using computational approaches to analyze large data sets and now wrestle with finding the best system to document these analyses and their results. We are adept at recording wet-lab experiments in a "traditional" lab notebook; keeping track of computational work, however, comes with a new set of challenges. Unlike wet-lab work, where it can take several days to repeat an experiment, a computational analysis can often be rerun in minutes, and it may even be possible to run multiple perturbations of an analysis in a single afternoon. Thus, one challenge with computational work is keeping track of why you are running a given analysis. Another is keeping track of what works and what does not. Careful documentation will keep you on task and will prevent you from getting lost.
I am a molecular biologist who began coding for my thesis project when my advisor and I decided to do a time course transcriptome analysis by RNA-seq. Past work from our lab had found that a handful of candidate genes were specifically up- or down-regulated during this time course, and we wanted to expand the analysis to the entire genome. I sequenced RNA from 8 time points, in triplicate. Since there are approximately 48,000 transcripts in the human genome, this resulted in 1,152,000 data points! Given the large size of my data set, I needed to learn how to sift through the data in an efficient manner. I was a complete beginner who had never interacted with a computer at the command line. Luckily, a fellow graduate student and a data analyst in our lab took it upon themselves to teach me Python, a programming language favored by biologists for writing simple programs (i.e., scripts). I soon realized I needed a system to keep track of the many files that I was continuously generating.
There is no one way to keep a virtual lab notebook for bioinformatics. In fact, there are endless ways, and everyone finds their own. Here, I outline what works for me, and I hope that it is helpful to others as well. The practices suggested here may seem tedious at first, especially since you'll want to dig into the computation, but they will serve you well as you perform analysis after analysis after analysis…
Stay organized
Set up a table of contents in the top directory: A directory is a location in your computer's file system and is just another name for a folder. For instance, your Desktop directory contains the folders and files that you see on your desktop. The location of your master directory or folder for all of your analyses is up to you, but it should contain a table of contents that lays out which experiments can be found in each folder. This can be a simple .txt document. Record the name, date, and location of all analyses. (Fig. 1)
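A table of contents like this can even be maintained with a short script. The sketch below is one way to do it in Python; the file name "tableOfContents.txt" and the tab-separated name/date/location layout are my own assumptions, not a prescribed format.

```python
import os
import tempfile
from datetime import date

# Hypothetical helper: append one line per experiment (name, date, location)
# to a plain-text table of contents in the top directory.
def add_toc_entry(top_dir, exp_name, location):
    toc_path = os.path.join(top_dir, "tableOfContents.txt")
    with open(toc_path, "a") as toc:
        toc.write("{}\t{}\t{}\n".format(exp_name, date.today().isoformat(), location))
    return toc_path

# Example: record a new experiment in a throwaway top directory
top_dir = tempfile.mkdtemp()
toc = add_toc_entry(top_dir, "1_rnaSeqTimeCourse", "experiments/1_rnaSeqTimeCourse")
print(open(toc).read().strip())
```

Because the helper appends rather than overwrites, the same file accumulates one line per experiment over time.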
Give every experiment its own directory: In addition to output files, this is also where you can store any PowerPoint, Excel, and other files relevant to the project.
Create a README: The README.txt document is invaluable. In the main (i.e., top) directory of every experiment, immediately write a text document with, at a minimum, a brief description of the directory contents. This can also be where you record your overall goals, approaches, and conclusions. (Fig. 2)
Keep original files in a separate folder: You will often use the same data files for multiple analyses. Rather than copying these files to your working directory (the folder that you are currently working in) every time you use them, leave them in their own folder. Doing so will ensure that you are in fact using the same data for all of your analyses. The same is true for scripts.
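In practice, this means each analysis script refers to the shared folder by path instead of copying files. A small Python sketch of the idea, where the directory name "rawData" and the file names are assumptions for illustration:

```python
import os

# Illustrative layout: every analysis points at one shared raw-data folder
# rather than keeping its own copy of the input files.
RAW_DATA_DIR = os.path.expanduser("~/analyses/rawData")

def raw_file(filename):
    """Return the full path to a file in the shared raw-data folder."""
    return os.path.join(RAW_DATA_DIR, filename)
```

Every script then opens, say, `raw_file("timeCourseCounts.txt")`, so all of your analyses are guaranteed to read the same original file.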
Treat every analysis like a wet lab experiment
I use the text editor TextWrangler to record the following:
Goal: Before you begin, briefly state the goal of the analysis, including background on other analyses that have inspired you to perform this analysis. Having a clearly stated, specific goal for each analysis will help you locate relevant information in the future.
Approach: Outline a brief overview of the approach that you will take to help you plan your analysis. There is no need to go into a lot of detail – the details will be in the scripts that you run. Rather, this is where you outline your logic and the scripts and input files that will be used to perform each task.
Conclusion: Always, always, write a brief conclusion of your analysis, even if your conclusion is “this approach is not ideal because…” Including a conclusion for each analysis or task will keep you from repeating your work or making similar mistakes in the future.
I also print these notes for my physical lab notebook, along with any figures generated from the analyses.
Employ useful naming conventions
Give every experiment a number: Numbers are a short, easy way to name files so that you know which files go together. I also prefer the number system to using dates as labels, since I often work on multiple projects in a given day. For example, instead of naming a file “output” name the file “1_output” so that you know that the file is the output of the analysis performed in experiment 1.
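One payoff of the numeric prefix is that a single pattern match collects every file belonging to an experiment. A Python sketch, with file names invented for illustration:

```python
import glob
import os
import tempfile

# Simulate a working directory with files from two numbered experiments.
workdir = tempfile.mkdtemp()
for name in ["1_output.txt", "1_volcanoPlot.pdf", "2_output.txt"]:
    open(os.path.join(workdir, name), "w").close()

# The "1_*" pattern gathers everything from experiment 1 and nothing else.
experiment1_files = sorted(
    os.path.basename(f) for f in glob.glob(os.path.join(workdir, "1_*"))
)
print(experiment1_files)  # → ['1_output.txt', '1_volcanoPlot.pdf']
```

The same pattern works at the command line (e.g., `ls 1_*`), which is part of why short numeric prefixes are so convenient.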
Use camelCase: In camelCase, each word in a file name after the first begins with a capital letter, and words are not separated by spaces. File names with more than one word should use the camelCaseFormat, since spaces between words can make it difficult to accurately reference a file at the command line.
Version control your commonly used scripts: If you edit a general purpose script for a specific, single use, save the script as originalName_descriptionOfEdit in the same directory where it was run. This technique leaves a trail that further helps you keep track of your exact changes. Alternatively, you can simply make a note in your methods section about the edit that was temporarily used. Just be sure that the original code remains intact!
Take notes on your code
Use comments in your scripts: Comments help clarify the role of the code.
# This is a comment in Python, R, Perl, and Ruby
// This is a comment in C++ and Java
Make limited use of command line history: Recording every command you enter can be useful for beginners who are still learning the basics. However, these notes take up a lot of space and don't tell you why you ran a command or what the result was, so use this practice sparingly.
These are the simple rules that I try to follow in my own research. And of course, always back up your data. Some of this may not apply or may not work for you. The best way to find a system that does work is through trial and error, and by asking for tips from others. You can also check out web-based applications, such as Jupyter or R Markdown. Comment below to add your tips!
Fig. 1. Sample table of contents .txt file containing the names of all raw data and experiment folders on my computer. This list makes it easy to find the appropriate files for later analyses and production of the final manuscript.
Fig. 2. Sample README.txt file containing a description of the project in my working directory
A version of this article was originally published on the Addgene blog on June 7, 2016.