This cloud computing research success story concerns a large computation. Work of this kind is generally known as High Performance Computing (HPC), or as High Throughput Computing (HTC) when it consists of many independent computations run in parallel.

The research objective was to explore a combinatoric space of small proteins (peptides) built from chains of ten mixed-chirality amino acids. There are 39 amino acid building blocks to choose from at each position, so the search space is 39¹⁰ possible peptides. The technical objective was to distribute the computing task across many virtual machines (VMs) on the Amazon Web Services (AWS) public cloud. The computations used the Rosetta 'protein folding' molecular design software to resolve the likely structure of a given amino acid chain. When Rosetta finds that a candidate peptide folds into a stable closed loop, the result is called a macrocycle scaffold. Such stable structures are interesting and potentially desirable because of their possible applications in therapeutic medicine.
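To give a sense of scale, the size of the search space can be computed directly. The short sketch below uses only the two figures stated above (39 building blocks, chains of length ten); the variable names are illustrative.

```python
# Figures taken from the text: 39 amino acid building blocks, chains of ten.
BUILDING_BLOCKS = 39
PEPTIDE_LENGTH = 10

# Each of the ten positions can hold any of the 39 building blocks.
search_space = BUILDING_BLOCKS ** PEPTIDE_LENGTH

print(f"{search_space:,} possible peptides (about {search_space:.1e})")
```

The result is on the order of 8 × 10¹⁵ sequences, which is why the search was split across many independent workers rather than enumerated on one machine.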

Figure "Recurrent local structural motifs in designed macrocycles" from Comprehensive computational design of ordered peptide macrocycles

This case study shows that a moderately large computing task can be run quickly and cost-effectively on the public cloud. A worker Virtual Machine (VM) was configured to run on a computer maintained by AWS. This worker was replicated 164 times, and each instance ran an independent Rosetta search task, looking for stable macrocycle peptides, for 53 hours. The result was a library of over five million positive-result protein structures. Read more about the results in the journal Science: Hosseinzadeh et al.

This documented success in replicating VMs on the AWS public cloud can be applied to other research computing tasks that rely on many independent computational threads. Technology components of this study include:

The Baker Lab Rosetta software suite
Powerful EC2 compute instances
The AWS Spot market
Optimization analysis of cloud computing instance types
The AWS Batch service
The AWS Research Credit Grant program

Computing at Scale on AWS

The following steps can help generalize this work and apply it to other parallel computation tasks:

Identify the research problem that requires large-scale computing
Configure an AWS User account
Configure the execution software and the data structure
Configure a cluster management service such as AWS Batch to run the job at scale on the AWS Spot Market
- The Spot Market provides VMs at typically one-third of the normal cost with a low but non-zero probability that they will be interrupted during execution. Using these instances reliably extends the budget by a factor of three.
Recover the results and dismiss the compute infrastructure
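One possible way to express the "run the job at scale" step is an AWS Batch array job, in which a single submission fans out into many independent child jobs. The sketch below only builds the request for the Batch SubmitJob call; it is an illustration, not the researchers' actual configuration. The job name, queue name and job definition name are hypothetical, and the source does not say whether array jobs were used. Whether workers run on Spot instances is decided by the compute environment behind the job queue, not by the job request itself.

```python
def build_array_job_request(job_name, job_queue, job_definition, n_workers):
    """Build keyword arguments for the AWS Batch SubmitJob call.

    All names passed in are placeholders chosen by the caller; the queue and
    job definition must already exist in the AWS account.
    """
    # AWS Batch array jobs accept a size between 2 and 10,000 child jobs.
    if not 2 <= n_workers <= 10_000:
        raise ValueError("array job size must be between 2 and 10,000")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": n_workers},
    }


def submit(request):
    """Send the request to AWS Batch (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the sketch runs without it

    return boto3.client("batch").submit_job(**request)


# Hypothetical names; 164 matches the worker count in this case study.
request = build_array_job_request(
    job_name="rosetta-macrocycle-search",
    job_queue="spot-queue",
    job_definition="rosetta-worker",
    n_workers=164,
)
print(request)
```

Each child job of an array job receives its index in the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable, which a worker could use to pick its own slice of the search, for example as a random seed.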

The computation in more detail:

164 C4.8XLarge instances ran for 53 hours on a single compute task producing 5.2 million positive-result protein structures after 313,000 virtual CPU hours, documented on GitHub here;
The Spot Market cost of $0.40 per instance-hour showed no significant variation over the task duration, and the job had no apparent impact on the market price;
Optimization by the researcher saved more than $600 over other instance choices, and the total computation cost came to $3,477.
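The figures above can be cross-checked with a few lines of arithmetic. The instance count, duration and Spot price come from the text; the value of 36 vCPUs per C4.8XLarge instance is taken from the published AWS instance specification and is not stated in this case study.

```python
INSTANCES = 164                      # from the text
HOURS = 53                           # from the text
SPOT_PRICE_PER_INSTANCE_HOUR = 0.40  # USD, from the text
VCPUS_PER_INSTANCE = 36              # assumption: AWS spec for c4.8xlarge

instance_hours = INSTANCES * HOURS
vcpu_hours = instance_hours * VCPUS_PER_INSTANCE
total_cost = instance_hours * SPOT_PRICE_PER_INSTANCE_HOUR

print(f"instance-hours: {instance_hours:,}")
print(f"vCPU-hours:     {vcpu_hours:,}")
print(f"total cost:     ${total_cost:,.0f}")
```

Under these assumptions the totals come to roughly 313,000 vCPU-hours and about $3,477, in line with the numbers reported above.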
Related links:

Cost tradeoffs

The traditional approach to HPC is to purchase and maintain dedicated hardware for compute tasks such as the one described in this case study. The public cloud can be a comparable alternative because of low and falling cloud costs, increased convenience, ready-to-use services and outsourced system administration. But when is cloud computing cost-equivalent to on-premise computing? Consider these break-even concepts.

Hard Break-even

The lifespan of an average purchased computer is three years. In this case study, the compute task ran for 53 hours on the AWS cloud and cost a total of about $3,500. The hard break-even point is a purchased computer that also costs $3,500 and needs its full three-year lifespan to complete the same task; at this point, the cloud cost has reached parity with the on-premise cost. At that point the on-premise solution also takes about 500 times longer than the cloud run to finish the computation. The compute task's estimated on-premise completion time would be 246 days, which is 849 days sooner than the hard break-even of three years. A head-to-head comparison with 100% CPU utilization over three years therefore favors the on-premise computer, although when the purchased computer is not being used at full capacity, the cloud cost becomes more comparable.
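The hard break-even reasoning can be laid out as a small calculation. This is a simplified amortization sketch using only the figures quoted above; it spreads the purchase price evenly over the three-year lifespan and ignores power, space and administration costs, which is an assumption of the sketch rather than something the case study states.

```python
PURCHASE_COST = 3500.0     # USD, on-premise computer (from the text)
CLOUD_COST = 3500.0        # USD, approximate cloud cost (from the text)
LIFESPAN_DAYS = 3 * 365    # three-year lifespan
ON_PREM_TASK_DAYS = 246    # estimated on-premise completion time
CLOUD_HOURS = 53           # cloud completion time

# Days to spare before the purchased computer reaches end of life.
slack_days = LIFESPAN_DAYS - ON_PREM_TASK_DAYS

# At parity the on-premise run takes the whole lifespan; compare to the cloud run.
parity_slowdown = LIFESPAN_DAYS * 24 / CLOUD_HOURS

# Share of the purchase price attributable to this one task at 100% utilization.
on_prem_task_cost = PURCHASE_COST * ON_PREM_TASK_DAYS / LIFESPAN_DAYS

# Utilization below which the amortized on-premise cost exceeds the cloud cost.
break_even_utilization = on_prem_task_cost / CLOUD_COST

print(f"slack before break-even: {slack_days} days")
print(f"slowdown at parity:      about {parity_slowdown:.0f}x")
print(f"on-premise task cost:    ${on_prem_task_cost:,.0f}")
print(f"break-even utilization:  {break_even_utilization:.0%}")
```

Under this model the purchased computer is cheaper whenever it is kept busy, and the cloud becomes the cheaper option once the machine would sit idle for most of its life.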

Soft Break-even

At what point does on-premise compute time hinder the researcher's progress? A soft break-even scenario brings wall-clock time, facilities costs and other factors into consideration.

In this case study, the research team had access to an on-premise cluster that was shared by many researchers. This resource would ideally complete the processing task in two weeks, although the shared nature of the cluster means both uncertainty and hassle, so the researcher chose to use AWS. This accelerated the science and reduced the local cluster's workload.

For a fixed-size resource like an on-premise cluster, the time to complete a computing task scales with the size of the computation. For example, a computation that would normally take two weeks would take about two months if the task were quadrupled. On the public cloud, the procedure simply conscripts more worker VMs from the AWS resource pool and the completion time remains unchanged (53 hours in this case study).
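The contrast between the two scaling behaviors can be written down as a toy model. It assumes perfectly independent work units and unlimited cloud capacity, which real account limits and Spot availability may not honor; the function names are illustrative.

```python
import math


def fixed_cluster_days(baseline_days, scale):
    """Fixed-size cluster: wall-clock time grows with the size of the task."""
    return baseline_days * scale


def cloud_workers_needed(baseline_workers, scale):
    """Cloud: add workers in proportion so wall-clock time stays the same."""
    return math.ceil(baseline_workers * scale)


# Quadrupling a task that takes two weeks on the shared cluster.
print(fixed_cluster_days(baseline_days=14, scale=4), "days on the cluster")

# Quadrupling the same task on the cloud, starting from 164 workers.
print(cloud_workers_needed(baseline_workers=164, scale=4), "cloud workers")
```

In the cloud case the total instance-hours, and therefore the cost, still grow with the task; only the waiting time stays flat.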

Science background

The research in this case study contributed new information on how to tackle the challenge of large-scale sampling of peptide scaffolds, which could help with developing new therapeutics.

Human DNA consists of about 3 billion base pairs, arranged as the rungs of a helical ladder. The base-pair sequence records how to construct proteins: each base (nucleotide) takes one of four values, abbreviated A, C, G or T. Three bases in a row, for example AAG or TCA, can be thought of as three digits in base 4; that is, each triple is a number from 0 to 63. Each triple maps to one of the 20 amino acids (with some values degenerate and others not assigned).
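The base-4 view of a triple can be made concrete with a few lines of code. The digit assigned to each letter (A=0, C=1, G=2, T=3) is an arbitrary choice made for this sketch; the text does not fix an ordering, and biology attaches no meaning to the resulting number.

```python
# Assumed digit assignment; any one-to-one mapping onto 0..3 would work.
BASE_DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def triple_to_number(triple):
    """Read a three-base triple as a three-digit base-4 number (0 to 63)."""
    if len(triple) != 3:
        raise ValueError("a triple must contain exactly three bases")
    value = 0
    for base in triple.upper():
        # Shift the running value one base-4 place and add the next digit.
        value = value * 4 + BASE_DIGITS[base]
    return value


for triple in ("AAA", "AAG", "TCA", "TTT"):
    print(triple, triple_to_number(triple))
```

With this assignment AAA is the smallest value and TTT the largest, so the 64 possible triples cover the range 0 to 63 exactly once each.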

The 20 naturally-occurring amino acids are the building blocks of proteins and of all life on Earth. 19 of these amino acids have a “left-handed” asymmetry and can be artificially manufactured in their mirror image, producing a total of 39 amino acid building blocks. Peptides are short chains of these amino acids. Once a particular sequence of amino acids is bonded together end-to-end, its rotational degrees of freedom permit it to fold into a structure with favorable energy; this structure may serve some chemical or biological function. The Rosetta software can analyze the manner of this folding, thereby connecting a hypothetical amino acid sequence to a protein scaffold.

One of the practical challenges is matching these structures to naturally-occurring geometries within the organism’s molecular landscape. This is often described using a ‘lock and key’ analogy: a therapeutic molecule could be designed to fit a particular binding site, like a cell wall, which would partially enhance or restrict the metabolic process at that location. The goal would be to comprehensively sample all possible shapes, creating a number of highly stable ‘keys’ that are ready to be slightly re-configured into any desired shape.

The compute task described in this case study samples the space of possible peptide structures. These structures must prove feasible and stable before they can be analyzed further. Ultimately, a design of interest from the computation may be synthesized as an actual physical protein in the laboratory, which would then be subjected to chemical analysis to validate that its structure matches the software's prediction.

The research team would like to express their gratitude to Amazon Web Services for assistance in this exploratory project in the form of cloud credits to defray expenses.

CloudBank

AWS Batch

AWS Spot Instances

Computing at Scale using Rosetta
