About EASME

Basics

The genetic blueprint for the essential functions of life is encoded in DNA, which is transcribed into RNA and then translated into proteins, the engines driving most of our metabolic processes. Recent advances in genome sequencing have unveiled a vast diversity of protein families, but compared to the massive search space of all possible amino acid sequences, the set of known functional families is minuscule. One could say nature has a limited protein “vocabulary.” The major question for computational biologists, therefore, is whether this vocabulary can be expanded to include useful proteins that went extinct long ago, or perhaps never evolved in the first place. We outline a computational approach to this problem. By merging evolutionary algorithms (EAs), machine learning (ML), and bioinformatics, we can facilitate the development of novel proteins that have never existed before. We envision this work forming a new sub-field of computational evolution that we dub evolutionary algorithms simulating molecular evolution (EASME).

Defining the Problem

Proteomics (the study of proteins) is a primary focus of EASME. Proteins are sentences written in an alphabet of 20 amino acids. Many of these amino acid strings exceed 1,000 characters in length, and a 1,000-residue string alone admits 20^1000 (roughly 10^1301) possible sequences, so the search space of possible protein strings is unfathomably vast. Most combinations of amino acids would be unstable and do nothing at all, which makes this search space a vast “sea of invalidity.” Within that sea exists a tiny archipelago of functional proteins, and only a small region of one of those islands is occupied by the proteins that actually evolved and remain extant today.
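
To give a feel for the scale involved, the short sketch below counts the possible sequences for a few lengths, working in log10 because the raw numbers are too large to read comfortably. The only inputs are the 20-letter alphabet and the lengths; the function name is ours and purely illustrative.

```python
import math

AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def log10_sequence_count(length: int) -> float:
    """Return log10 of the number of distinct amino acid strings of this length."""
    return length * math.log10(AMINO_ACIDS)


for length in (10, 100, 1000):
    exponent = log10_sequence_count(length)
    print(f"length {length:>4}: about 10^{exponent:.0f} possible sequences")
```

Even a short 100-residue peptide has about 10^130 possible sequences, so exhaustive enumeration is out of the question and some guided search is needed.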

The search space of proteins.

EASME aims to expand the set of extant proteins by colonizing new islands in the sea of invalidity, yielding functional protein strings that can later be produced and analyzed in a lab.

Building EASME

Our hypothesis is that an EASME model would be composed of several algorithmic components. At the base of EASME would be an evolutionary algorithm (EA) that instantiates diverse populations of DNA-encoded proteins. The base EA would evolve and mutate genes the same way they evolve naturally: by point mutations, deletions, insertions, and recombinations. The fitness of encoded proteins would be determined by a multifaceted fitness function with two kinds of components: protein schemas, which bioinformatically define enzymatic consensus sequences, and protein grammar rules, which structurally validate a protein's primary sequence. The grammar rules could be based on de novo folding algorithms that minimize a free energy function and/or on attributes of the primary string such as hydrophobicity, isoelectric point, amino acid sub-words, et cetera. Provided that a basic, accurate understanding of protein grammar could be encoded, the evolved proteins in the population could be driven to form new functional clades and, eventually, new protein families. The fitness function could also be coupled with a “protein spam filter” that rules out structurally unstable sequences, effectively reducing the amount of garbage in the search space that needs to be explored.
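
As a rough illustration of these components, the sketch below implements the four variation operators on DNA strings, translates genes with the standard genetic code, and scores the resulting protein with a toy two-term fitness. Everything beyond the operators themselves is an assumption made for the example: the motif "GSG" stands in for a real protein schema, closeness of mean Kyte-Doolittle hydropathy to a target stands in for a grammar rule, a minimum-length check stands in for the spam filter, and the equal weights are arbitrary. No folding or free energy calculation is attempted.

```python
import random

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; "*" is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Kyte-Doolittle hydropathy index for each amino acid.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def point_mutation(gene, rng):
    """Replace one base with a different base."""
    i = rng.randrange(len(gene))
    new_base = rng.choice([b for b in BASES if b != gene[i]])
    return gene[:i] + new_base + gene[i + 1:]


def insertion(gene, rng):
    """Insert one random base at a random position."""
    i = rng.randrange(len(gene) + 1)
    return gene[:i] + rng.choice(BASES) + gene[i:]


def deletion(gene, rng):
    """Delete one base (genes of length 1 are returned unchanged)."""
    if len(gene) <= 1:
        return gene
    i = rng.randrange(len(gene))
    return gene[:i] + gene[i + 1:]


def recombination(a, b, rng):
    """One-point crossover: a prefix of `a` joined to a suffix of `b`."""
    shortest = min(len(a), len(b))
    if shortest < 2:
        return a
    cut = rng.randrange(1, shortest)
    return a[:cut] + b[cut:]


def translate(gene):
    """Translate complete codons from position 0 until the first stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        residue = CODON_TABLE[gene[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def passes_filter(protein, min_length=5):
    """Hypothetical stand-in for the spam filter: reject very short products."""
    return len(protein) >= min_length


def fitness(gene, motif="GSG", target_hydropathy=0.0):
    """Toy fitness: half schema match (motif present), half hydropathy closeness."""
    protein = translate(gene)
    if not passes_filter(protein):
        return 0.0
    mean_h = sum(HYDROPATHY[r] for r in protein) / len(protein)
    hydropathy_term = 1.0 / (1.0 + abs(mean_h - target_hydropathy))
    schema_term = 1.0 if motif in protein else 0.0
    return 0.5 * hydropathy_term + 0.5 * schema_term


rng = random.Random(42)
parent = "ATGGGTTCTGGTGCTGCT"  # hypothetical gene encoding MGSGAA
child = point_mutation(parent, rng)
print(translate(parent), fitness(parent))
print(translate(child), fitness(child))
```

Note that insertions and deletions of a single base shift the reading frame, so most such mutants translate to very different proteins. That is one reason a cheap filter ahead of an expensive fitness evaluation could pay off.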

To reduce the complexity of designing the initial EASME, the algorithm could first drive the evolution of a few select high-value proteins, such as key photosynthetic enzymes (for applied use in agriculture). As EASME matures, any protein could be evolved and optimized, and all functional islands in the sea of invalidity might be colonized over time.

The final problem EASME will face is determining the function of newly evolved proteins derived from in silico processes. Our suggestion is that libraries of EASME-derived peptides would first be chemically synthesized, then screened for useful activities. For example, a library of EASME proteins might first be screened against insects to find EASME-derived proteins with insecticidal activity. The positive hits could then be developed into novel insecticides for agriculture, and the screening results could be fed back to reinforce and improve the grammar structures of the fitness functions. Thus, we outline a feasible path toward the development of the EASME algorithm.

Limitless Potential

EASME is a focused effort to expand the set of extant proteins. The EASME framework, once developed, could run in two distinct ways.

Two ways to run EASME.

First, EASME can evolve a random sequence toward a known consensus sequence (“unknown to known”). In this context, the desired outcome is to reconstruct sequence clusters that went extinct during the process of evolution. Selection pressure is applied by pushing evolution toward a known protein sequence family. EASME outputs samples of Pareto-optimal sequences from theoretical evolutionary intermediates, effectively recovering extinct sequence variants. How much the EASME-generated sequences would differ from real historical intermediates is unknowable without ancient genomes. However, the utility of the generated sequences can be tested, measured, and summarized as a discovery success rate.
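
A minimal sketch of the “unknown to known” mode follows, under several simplifying assumptions that are ours rather than part of EASME: it evolves amino acid strings directly instead of DNA, uses only point substitutions, keeps the best half of parents plus offspring each generation, and measures fitness as the number of positions that match the consensus. The consensus string "MKTAYIAKQR" is a made-up placeholder. The best sequence of each generation is recorded as a theoretical intermediate.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_score(seq, consensus):
    """Number of positions at which seq agrees with the consensus."""
    return sum(a == b for a, b in zip(seq, consensus))


def mutate(seq, rng):
    """Substitute one randomly chosen position with a random amino acid."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]


def evolve_toward(consensus, pop_size=50, generations=200, seed=0):
    """Evolve random sequences toward `consensus`; return the best per generation."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(AMINO_ACIDS) for _ in consensus)
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        offspring = [mutate(parent, rng) for parent in population]
        pool = population + offspring
        # Parents compete with offspring, so the best score never decreases.
        pool.sort(key=lambda s: match_score(s, consensus), reverse=True)
        population = pool[:pop_size]
        history.append(population[0])
        if population[0] == consensus:
            break
    return history


consensus = "MKTAYIAKQR"  # hypothetical consensus sequence
intermediates = evolve_toward(consensus)
print("first intermediate:", intermediates[0])
print("last intermediate: ", intermediates[-1])
print("generations run:   ", len(intermediates))
```

The recorded intermediates are the in silico analogue of the extinct variants discussed above; whether any of them resembles a real historical sequence is exactly the question that cannot be answered without ancient genomes.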

The second way to run EASME is “known to unknown,” where a known entity is evolved forward by implementing a selection regimen that drives toward a desired characteristic phenotype. This methodology outputs Pareto-optimal sequences that may never have evolved in nature and is effectively a fast-forward button on evolution. While this approach would undoubtedly produce many false positives, wet lab work will allow us to test and validate designed proteins while simultaneously honing a given enzyme's fitness function. Biologically measuring the ratio of valid to invalid protein outputs would allow us to optimize the design process (and even if that ratio is low, the process should still be orders of magnitude faster than natural evolution, which plays out over millions of years). To achieve both these ends, EASME will employ EA and genetic programming (GP) models supplemented with ML where appropriate.
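
Both modes are described as returning Pareto-optimal sequences, meaning candidates that no other candidate beats on every objective at once. The sketch below shows one simple way to extract such a set when all objectives are to be maximized. The candidate names and the two objectives (predicted activity, predicted stability) are hypothetical placeholders, not values from any real screen.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scored):
    """Return the non-dominated entries of a {name: objective tuple} mapping."""
    return {
        name: score
        for name, score in scored.items()
        if not any(dominates(other, score) for other in scored.values())
    }


# Hypothetical candidates scored as (predicted activity, predicted stability).
candidates = {
    "variant_a": (0.90, 0.20),
    "variant_b": (0.60, 0.60),
    "variant_c": (0.50, 0.50),
    "variant_d": (0.10, 0.95),
}
print(sorted(pareto_front(candidates)))  # ['variant_a', 'variant_b', 'variant_d']
```

Here variant_c is dropped because variant_b is better on both objectives, while the other three each represent a different trade-off and would all be candidates for wet lab testing.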