The explosive growth of Big Data in the life sciences has focused the attention of research-IT organizations on the challenges of Petabyte-scale storage. A growing number of organizations have discovered that open-source/white-box technologies represent a remarkably compelling alternative to the “Enterprise” storage appliances offered by incumbent vendors. We assert that aggressive application of disruptive storage strategies by life-science research organizations can cut the cost of high-performance storage systems by 75% and power consumption by roughly 50%, relative to enterprise storage appliances. As a concrete example, the JHPCE recently put into production a low-power 1.2PB (usable) Lustre system for about $137/TB. This corresponds to a cost of $0.0023/GB/month if amortized over a 5-year lifetime. The result is a high-performance distributed file system that is more cost-effective than Glacier ($0.007/GB/month), Amazon’s archival tier of storage.
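The amortized figure quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `monthly_cost_per_gb` is hypothetical, and it assumes decimal units (1 TB = 1000 GB) and straight-line amortization of the purchase price alone, with no power, staffing or maintenance costs included.

```python
def monthly_cost_per_gb(cost_per_tb_usd, lifetime_years, gb_per_tb=1000):
    """Spread a one-time purchase price evenly over the system's lifetime.

    Assumptions (not from the workshop text): decimal units (1 TB = 1000 GB)
    and straight-line amortization of hardware cost only.
    """
    months = lifetime_years * 12
    return cost_per_tb_usd / gb_per_tb / months


# Figures quoted in the text: about $137/TB, amortized over 5 years.
lustre = monthly_cost_per_gb(137, 5)
print(f"Lustre system: ${lustre:.4f}/GB/month")  # -> Lustre system: $0.0023/GB/month

# Archival-tier price quoted in the text, taken here as a monthly rate.
glacier = 0.007
print(f"Quoted archival price is {glacier / lustre:.1f}x the amortized cost")
```

Changing `lifetime_years` shows how sensitive the comparison is to the assumed service life of the hardware.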

Unfortunately, few research-IT organizations possess the skills or expertise needed to assess, let alone develop, their own storage solutions, no matter how compelling the potential value proposition. The workshop will provide an intensive and cutting-edge primer on Petabyte-scale storage systems so that academic IT organizations can assess their options and jump-start the task of developing their own multi-Petabyte storage solutions with help from a like-minded research-IT community and industry partners. The workshop is targeted at technical experts who want to learn to build their own systems. The focus of the workshop will be ZFS-on-Linux and Lustre-on-ZFS.


In the late 1990s, Linux clusters emerged from the convergence of two trends: 1) the development of Linux-based open-source software stacks for networked distributed computing and 2) the availability of commodity computing hardware enabled by increasingly powerful low-cost microprocessors. The resulting technology provided a powerful alternative to the expensive proprietary SMP systems that dominated the offerings of HPC vendors. Early adopters at forward-thinking research organizations provided their institutions with competitive advantages. “Linux cluster” is now almost synonymous with high-performance computing.

Twenty years on, history repeats itself. We once again witness the convergence of two trends: 1) the availability of Linux-based open-source software stacks for scale-out storage and 2) the availability of commodity storage hardware and disk drives. An increasing number of forward-thinking research organizations are eschewing enterprise storage solutions in favor of constructing their own multi-Petabyte storage systems that match the specific needs of their research agendas, workflows and funding profiles.

The organizers of the workshop have a track record of implementing Petabyte-scale storage systems and have observed that storage technology is no more mysterious than Linux cluster technology. It is just different. Their experience is that the needed knowledge and expertise must be extracted from busy individuals in other research-IT organizations or mined from far corners of the internet. In talking to many research-IT groups, the organizers discovered that the lack of example systems, support models, documentation and a like-minded community represented barriers that were difficult for many to overcome. We propose to overcome these barriers through education and the creation of a like-minded community.

We will take a pragmatic approach and describe a handful of systems, including: 1) low-cost, low-power systems (actually in production), 2) systems optimized for high availability and 3) systems that exploit emerging hard-drive technology.

We will publish reference architectures, best practices, and discussions specific to Petabyte-scale open-source/commodity storage systems, with a focus on the life sciences. Publication will take place online. We anticipate that documentation will start to become available shortly (please stay tuned).

We will also describe financing strategies, including: 1) convincing PIs with NIH-R01s to collectively invest in an open-source/white-box storage system and 2) applying for NIH-S10 and NSF-MRI grant funding. We expect that systems based on documented reference architectures with published performance and reliability data, together with best practices and administration tools, will help establish credibility in the eyes of individual PIs and review panels from the NIH and NSF.

Teaching Objectives

Workshop participants will take away the following:

  1. A jump-start on Petabyte-scale storage development projects.
  2. Hands-on experience with installation and basic administration of ZFS-on-Linux and Lustre-on-ZFS.
  3. An understanding of reference designs (including parts lists and approximate costs) for two types of storage systems: 1) ZFS-on-Linux and 2) Lustre-on-ZFS.
  4. An understanding of limitations and “where the bodies are buried”.
  5. Funding strategies for storage systems.
  6. An introduction to potential corporate partners as well as experts in ZFS and Lustre.
  7. An introduction to a community of like-minded research-IT organizations.


The 2-day workshop will have four sessions. Session I will provide a broad-brush overview of technologies, trends and successful systems. It will be followed by an intensive three-session bootcamp suitable for technical experts interested in developing their own solutions. A detailed schedule and slides can be found on the schedule page.


