Bioinformatics software is way too hard to use. I’ll skip the how and why in this post - the important point is that poor user experience imposes a significant tax on research. A researcher will often have an insight of the form "I want to run this software on those data", then spend days getting everything set up.

I think one tool missing from the researcher's toolbox is a package manager for installing bioinformatics programs[1]. This post describes one vision for how such a tool could work. It outlines a hypothetical tool, BioSink[2], that would allow researchers with minimal command line experience to quickly bootstrap an analysis.

Background

An underappreciated development in software engineering over the past decade is the maturation of package managers. Developers used to regularly struggle with system-specific ./configure and make parameters; now tools are almost always expected to install seamlessly with Aptitude, Homebrew, etc.

Packaging is also no longer just a system function. Package managers now ship with programming languages (gem for Ruby, pip for Python, pub for Dart), application frameworks (Node.js, Sublime Text), and deployment tools (Bower, Boxen, Puppet Forge).

So, we've learned much about the technical and social challenges of implementing a package manager. BioSink is largely an exercise in applying those lessons to the unique challenges of biology research.

Mini tutorial

I'll start with a hypothetical demo, as this idea is probably best explained by example. The following commands show how a user would install and run a hypothetical program PathoVariant, which identifies pathogenic variants in a human genome.

# Create a new analysis environment
biosink init

# Install the package
biosink install pathovariant

# Describe the current environment - what programs and datasets are available 
biosink env

# Now PathoVariant just works 
pathovariant --input-genome bda69cb3 \
    --ethnicity HISPANIC \
    --sensitivity .97 \
    --out results.csv \
    --manifest command1.yaml

# That was taking too long - run it on the server instead
biosink run-remote-command --server mycluster --manifest command1.yaml

# 6 months later...see if those results are still up to date
biosink check-updates --manifest command1.yaml

Note that this is only meant as a high level overview - many important details are omitted.
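
One of those omitted details is what a file like command1.yaml might actually record. Here's a rough sketch, written as a small shell function that emits a manifest. The field names, the write_manifest helper and the 1.0.0 version number are all made up for illustration; this is not a real format.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: write a manifest describing a single command.
# Field names and layout are illustrative assumptions, not a real BioSink format.

write_manifest() {
    # usage: write_manifest <package> <version> <dataset-id> <out-file> [param=value ...]
    [ "$#" -ge 4 ] || { echo "write_manifest: need 4+ arguments" >&2; return 1; }
    local package=$1 version=$2 dataset=$3 out=$4
    shift 4
    {
        printf 'package: %s\n' "$package"
        printf 'version: %s\n' "$version"
        printf 'datasets:\n'
        printf '  - %s\n' "$dataset"
        printf 'parameters:\n'
        local kv
        for kv in "$@"; do
            # Split "name=value" at the first "=" sign.
            printf '  %s: %s\n' "${kv%%=*}" "${kv#*=}"
        done
    } > "$out"
}
```

For the PathoVariant run above, a call might look like: write_manifest pathovariant 1.0.0 bda69cb3 command1.yaml ethnicity=HISPANIC sensitivity=.97 out=results.csv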

What it does

As you may have noticed, BioSink is not strictly a package manager - it also includes a virtual environment, a package index, and more. Here is a list of features (again at a high level):

  • Execution environment: Much like virtualenv, BioSink works in an isolated virtual environment for program execution.

  • Installing packages: BioSink controls the entire install process. When a package is installed, its command line executable(s) are immediately available in the local env.

  • Managing dependencies: BioSink contains a full (versioned) dependency graph, and will install package dependencies too. Importantly, this includes both software tools and any reference datasets they require.

  • Central repository: Packages are downloaded from a central repository that authors can publish to.

  • Authoring interface: Provides an easy mechanism for researchers to create packages from their work. Consists of a single file that describes the package and lists dependencies, then an install.sh script that runs in a local env.

  • Description format: command1.yaml above is a common format for describing a single command - the environment, parameters, and what datasets and dependencies were used. This can be used to replicate an analysis in any BioSink environment, and is exposed as an API for other tools to use (e.g. job schedulers).
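
To give a feel for the dependency handling, here's a minimal sketch of computing an install order where software and reference datasets live in the same graph. It leans on the standard tsort utility. The package names, the "dataset:" prefix and both function names are assumptions of mine, and version constraints are ignored entirely.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compute an install order from "package dependency" pairs.
# Package and dataset names are made up; versions are ignored.

install_order() {
    # Reads lines of the form "<package> <dependency>" on stdin.
    # tsort expects "<before> <after>" pairs, so swap the columns:
    # each dependency must be installed before the package that needs it.
    awk 'NF == 2 { print $2, $1 }' | tsort
}

# Example graph mixing software tools and a reference dataset.
example_graph() {
    cat <<'EOF'
pathovariant vcf-parser
pathovariant dataset:reference-genome
vcf-parser compress-lib
EOF
}
```

Running example_graph | install_order prints one name per line, with every dependency appearing before the package that requires it. The exact order among unrelated packages is up to tsort.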

How it works

Hopefully, everything above will just seem like black magic to a beginner user. I'll give a few notes about how BioSink works under the hood:

  • Environment directory: Like Git, there is a hidden .biosink directory that contains all the installation junk that should stay hidden from the user.

  • Wrapper scripts: Program authors won't want to rely on BioSink - they will want to release tools as a standalone binary, and then create a package. However, the standalone binary will need to have a different interface - the --input-genome argument above wouldn't make sense in a standalone binary. The solution is wrapper scripts:

    • Running pathovariant within a BioSink env is a wrapper around the "real" pathovariant binary.

    • This process is automated for authors - there is a standard way to specify that --input-genome is a VCF file, and that it maps to the original tool's --vcf-file arg.

    • BioSink provides a show-full-command argument that shows the full shell command.

    • One consequence is that a single heavyweight program, such as the GATK, could be released as multiple packages that target different use cases - both official packages and user-submitted ones.

  • Datasets as dependencies: This is referenced a couple of times above - reference datasets are treated like normal package dependencies. This allows tight coupling of software and data - which I think is an inevitable trend in bioinformatics moving forward.[3]

  • Nested dependencies: Compiled programs are statically linked and, as in npm, dependencies are nested. Installing a package can take much longer, but this makes it much easier to install in an isolated environment. Static datasets are the one exception - they are shared between packages.

  • System tools: BioSink relies on the host system for a set of tools to install and run packages: Java, GCC, Perl, etc. These are checked when BioSink is installed. Supporting a new platform will be quite challenging, so BioSink would probably need to start with just Mac, Red Hat and Ubuntu.

  • Dataset identifiers: Datasets are referenced by identifier, not file path. (In practice, the identifier will probably be something simple, like a SHA-1 hash.) This is important for running analyses in multiple environments, and provides a mechanism for supporting resources that are not simple flat files (e.g. a REST server).
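
Here's roughly what a generated wrapper could look like, pulling together the wrapper script and dataset identifier ideas. Everything specific in it is an assumption for illustration: the .biosink/datasets and .biosink/bin layout, the pathovariant-real name, the .vcf file naming, and spelling the show-full-command option as a --show-full-command flag.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a generated wrapper around the "real" pathovariant binary.
# Directory layout, file naming and flag spellings are illustrative assumptions.

BIOSINK_DIR=${BIOSINK_DIR:-.biosink}

# Resolve a dataset identifier to a file inside the environment directory.
resolve_dataset() {
    printf '%s/datasets/%s.vcf\n' "$BIOSINK_DIR" "$1"
}

# Translate wrapper arguments into the underlying tool's command line.
# Results are left in the global array "cmd" and the flag "show_only".
build_command() {
    cmd=("$BIOSINK_DIR/bin/pathovariant-real")
    show_only=0
    while [ "$#" -gt 0 ]; do
        case $1 in
            --input-genome)
                if [ "$#" -lt 2 ]; then
                    echo "--input-genome needs a dataset identifier" >&2
                    return 1
                fi
                # Map the dataset identifier onto the tool's own --vcf-file arg.
                cmd+=(--vcf-file "$(resolve_dataset "$2")")
                shift 2
                ;;
            --show-full-command)
                show_only=1
                shift
                ;;
            *)
                # Everything else passes through unchanged.
                cmd+=("$1")
                shift
                ;;
        esac
    done
}

main() {
    build_command "$@" || return 1
    if [ "$show_only" -eq 1 ]; then
        printf '%s\n' "${cmd[*]}"   # print the full command instead of running it
    else
        "${cmd[@]}"
    fi
}
```

A real wrapper would end with main "$@". Keeping the mapping in one small function is what would let BioSink generate these wrappers automatically from a package description.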

A business case

How could one build BioSink? Unfortunately, it falls in somewhat of an academic no man's land - it's not a project for a traditional academic lab. And since it's biotech-specific, it probably doesn't have the market potential to motivate a major cloud player.

However, I think there's a case for a development-centric startup. If you could gain widespread adoption, revenue models would emerge. Some (incomplete) examples:

  • Remember that packages contain data too. You could allow people to pay to host private packages, eg. for a pharma with proprietary data. This will undoubtedly be cheaper than setting up a full private repository.

  • Note the run-remote-command above. A company could set up a remote environment that runs commands on demand for researchers who don't have the appropriate infrastructure.

  • Less likely, but a company would be in a good position if demand emerges for a marketplace for bioinformatics tools.[4]

I'll be thinking about this in the next couple weeks - get in touch if you're interested.


[1] I suppose that a package manager would be better described as the toolbox itself.

[2] To be clear, this wouldn't be the actual name. I was actually going to call it BroBio until I Googled it - the term has something to do with The Real World last week.

[3] I'm actually preparing another post about future trends in biotech data.

[4] Avoiding the phrase "app store", but you get the idea.