Google recently released a framework for genome research, Google Genomics. It's extremely exciting that Google is investing in biotech - Google Search operates on a scale multiple orders of magnitude larger than any bio research.

That said, my overall impression is that Google is building the wrong product. With the caveat that I don't know the current product roadmap, I've described here how I would design Google Genomics from scratch.

Background

This section is for newcomers - genomics people can skip to the next section.

To date, the genomics community simply hasn't embraced cloud computing. Researchers work primarily on academic compute clusters, supported by systems like Platform LSF. Data is stored in large flat files, and analyses are run via command line programs tied together in shell scripts.

For all its simplicity, this system works remarkably well. There are very few analyses that can't be run in an acceptable time or at an acceptable cost. Research takes longer and is more error-prone than it could be, but academics don't face the same speed or accuracy requirements as traditional business analytics.

However, there are still pain points. For years, researchers have recognized that we should eventually shift to a new paradigm, where data is stored in a more naturally scalable cloud environment and is accessed through an API, rather than a file handle [1]. Google Genomics is one of many attempts to build such a platform; others include DNAnexus, Omicia, GenomeBridge, and more.

Google’s Attempt

The Google Genomics documentation is really good; it explains the workflow well. Two-sentence summaries are always dangerous, but I’d propose that a tagline for Google Genomics could be:

Use Google Genomics to run expensive analyses on Google hardware. If you upload your data then reimplement in our SDK, an analysis that takes A minutes and X dollars would take B minutes and Y dollars.

That’s straightforward, and I’m sure it will gain some adoption, but it’s just not motivating to a scientific audience. There are a few problems:

  • Data processing costs are poorly accounted for and are not a primary concern for academics. There's a reason that there has basically never been a successful bioinformatics company.

  • It doesn't enable any new science or collaborative workflows.

  • To use Google Genomics, you have to learn a new SDK.

I suspect many others shared my reaction: sounds great, but I'm too busy right now; I’ll check it out in a few months.

An Alternate Proposal

If I were designing Google Genomics, I'd make a fundamentally different sales pitch:

Upload your data to Google Genomics and some basic lookups, such as viewing the reads at a location, will Just Work. We also have an API for basic scripting if you want.

And I'd organize that pitch around the following design principles:

  • Easy onramp: The most important design goal should be to make uploading sequencing data as easy as possible, so no users (including clinicians) think the product is too advanced for them. Uploads must be possible over a browser, FTP, S3 or Aspera.

  • URL-based data views: Provide a simple URL schema for users to view basic lookups in a web browser, with a simple GUI [2]. As soon as a user uploads data, she should be able to email a link to a collaborator that says "check out this variant". No other text should be required.

  • Data accounting: Make every effort to collect metadata about files that are uploaded: what sequencer, what capture platform, etc. This is the one place to allow friction in the upload process - it's fine to force users to actually know where their data comes from.

  • Easy (and slow) scripting: Launch with a simple REST API and language bindings for Python, Perl, Java, C and Go (this should be easy with Protocol Buffers). Users can write and run scripts locally, fetching data over the network. This is obviously much less efficient than the current Google Genomics SDK; a formal analysis framework can come later.

  • Unix-style user permissions: Data must be protected, but users will want to share data with collaborators. Provide an intuitive interface for users to manage groups and permissions, with Google accounts for authentication (and, regrettably, Google+).

  • Free trial period: Ensure that users can do something before entering a credit card. Rather than a free usage tier, I'd prefer launching a paid service and giving .edu researchers $100 credit. To aid user retention, users could be sent another $25 "gift" every few months. Eventually they will take the bait.
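To make the URL-based data views principle concrete, here is a minimal Python sketch of what a shareable region link could look like. Everything in it is an assumption for illustration: the host genomics.example.com, the /datasets/<dataset>/reads/<chrom>:<start>-<end> path layout, and the function names are hypothetical and are not part of any actual Google Genomics API.

```python
import re
from urllib.parse import quote, unquote, urlsplit

# Hypothetical host and path layout, invented for this sketch.
BASE_URL = "https://genomics.example.com"

_REGION_PATH = re.compile(
    r"^/datasets/(?P<dataset>[^/]+)/reads/"
    r"(?P<chrom>[^/:]+):(?P<start>\d+)-(?P<end>\d+)$"
)


def region_url(dataset, chrom, start, end):
    """Build a shareable link to the reads covering chrom:start-end."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return (
        f"{BASE_URL}/datasets/{quote(dataset, safe='')}"
        f"/reads/{quote(chrom, safe='')}:{start}-{end}"
    )


def parse_region_url(url):
    """Recover (dataset, chrom, start, end) from a link built by region_url."""
    match = _REGION_PATH.match(urlsplit(url).path)
    if match is None:
        raise ValueError(f"not a region view URL: {url}")
    return (
        unquote(match["dataset"]),
        unquote(match["chrom"]),
        int(match["start"]),
        int(match["end"]),
    )


# The link a user might paste into an email to a collaborator.
link = region_url("trio-01", "chr7", 117559590, 117559600)
print(link)
print(parse_region_url(link))
```

The point of the sketch is that the link alone carries the dataset and the coordinates, so the same string works both as something to email and as something a local script could parse.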

Strategy

Unlike Google Genomics, this product prioritizes storage over analysis. It aims to provide a lowest common denominator that most researchers will want to use to store data.

Gaining users

Adoption would be the primary focus during the rollout period; the singular goal would be to gain a critical mass of research data. What defines a "critical mass"? I'd suggest it is when researchers assume that any newly generated sequencing data is in Google Genomics, and are surprised if it isn't. This is an extremely high threshold, but I think it's attainable, mainly because of the connectedness of the genomics community.

Of course, this all rests on a single assumption: is there enough of a carrot for people to actually upload data? I think yes. For one, the ability to access an arbitrary slice of genomic data in a browser, through a unique URL, would be very useful, and is surprisingly unavailable for most datasets. Additionally, this product could be sold as a data backup, which there is an appetite for among researchers [3].

Shifting to analysis

This product would not be sufficient for most analysis workflows, most notably variant calling. I would instead implement a compute framework after the product gains widespread adoption for data storage, for a few reasons:

  • During the rollout phase, any data will be duplicated on local servers and Google servers. (Indeed, this is a feature!) Until the product gains trust and familiarity, researchers will continue to use current systems for critical analyses.

  • Google will be in a much better position to build a compute framework after seeing how people use the REST API organically. This also greatly reduces the chance that Google builds the wrong API and has to break backward compatibility.

  • Developers will be much more likely to experiment with an SDK if the data is already loaded; no more "I'll wait a few months."

  • Finally, it's possible that an analysis framework isn't actually needed, if I/O-bound tasks can be run on VMs in Google Cloud.

What Google gets

If Google could execute, the resulting ecosystem would be extremely valuable, both financially and scientifically:

  • Data would be in a common platform; any analysis that supports Google Genomics (such as a new annotation program) would take days, not months, to scale throughout the genomics community.

  • Data would actually be on Google servers; researchers would already have solved issues of consent and IRB approval.

  • The social network of genomics institutions and collaborations would be represented within the platform, further reducing friction for future analyses.

  • Researchers would become increasingly acquainted with a pay-per-analysis environment.

  • Google would earn trust, goodwill, and - perhaps most importantly - familiarity in the genomics community.

TL;DR

To this uninformed observer, the current Google Genomics seems like premature optimization. I'd rather build a simple data store that gains widespread adoption quickly - and I think there is a way they can.

That said, we should all want Google Genomics to succeed, as it could legitimately advance medicine, and I would be thrilled to be wrong.


[1] Importantly, this does not necessarily imply that data must be stored in a traditional relational database. I think flat file storage will remain preferable to RDBMSs, but files must have a flexible indexing system that allows for database-like random access.
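
As a rough illustration of the idea in [1], the Python sketch below indexes a tab-delimited flat file by byte offset so that a region query can seek straight to the first matching record instead of scanning the whole file. The three-column record layout and the function names are assumptions made for this example, and it assumes each chromosome's records are contiguous and sorted by position.

```python
import bisect
import io


def build_index(handle):
    """Map each chromosome to parallel lists of positions and byte offsets."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        chrom, pos = line.rstrip(b"\n").split(b"\t")[:2]
        positions, offsets = index.setdefault(chrom.decode(), ([], []))
        positions.append(int(pos))
        offsets.append(offset)
    return index


def fetch(handle, index, chrom, start, end):
    """Return the records on chrom with start <= position <= end."""
    positions, offsets = index.get(chrom, ([], []))
    first = bisect.bisect_left(positions, start)
    records = []
    if first == len(positions):
        return records
    # Jump directly to the first candidate record, then read forward.
    handle.seek(offsets[first])
    for pos in positions[first:]:
        if pos > end:
            break
        records.append(handle.readline().rstrip(b"\n").decode())
    return records


# Made-up records: chromosome, position, base. A real file would be on disk.
flat_file = io.BytesIO(
    b"chr1\t100\tA\n"
    b"chr1\t250\tC\n"
    b"chr1\t900\tG\n"
    b"chr2\t50\tT\n"
)
index = build_index(flat_file)
print(fetch(flat_file, index, "chr1", 200, 1000))
```

Here the index lives in memory and is rebuilt on every run; a production system would presumably persist it alongside the data file.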

[2] You could perhaps open source the GUI, as I suspect many researchers would want to contribute UI tweaks.

[3] Ironically, data backup seems easier to sell to genomics researchers than cheaper computation (at least in the near term).