EASME Projects
Machine Learning Protein “Spam Filter”
Proteins are sequences of amino acids that have evolved to carry out diverse functions. For example, proteins can make crops more resistant to pests, enhance biofuel production, or aid in developing targeted drugs. Designing custom proteins for specific applications is promising but challenging: with 20 amino acids and sequences that often exceed 1,000 residues, the number of possible sequences is astronomically large (20^1000 for a single 1,000-residue protein), yet only a tiny fraction are viable. As a result, most design time is spent evaluating unviable candidates. To address this, we developed the “Protein Spam Filter,” integrated into the EASME toolkit, which predicts whether a new protein is likely to be structurally viable and so makes automated protein design more efficient.
The Protein Spam Filter comprises two components. (1) The “stability” module distinguishes real proteins from non-functional sequences using an evolved neural network trained on the UniProt database (~58,000 real proteins) together with thousands of artificially created “bad” sequences (random, reversed, and highly mutated sequences); it achieves over 95% accuracy. (2) The “aggregation” module predicts protein aggregation propensity, a phenomenon linked to diseases such as Alzheimer’s and type 2 diabetes as well as to the quality and functionality of pharmaceutical products. This module was trained on the A3D-MODB database (~500,000 predictions across 11 model species) and achieved up to 86% accuracy. Together, these components help ensure that newly designed proteins are structurally stable and viable, supporting progress in protein customization for biotechnology and medicine.
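To make the idea of artificially created “bad” sequences concrete, here is a minimal Python sketch of how the three kinds of negative examples named above (random, reversed, and highly mutated) could be generated from a real sequence. It is an illustration under stated assumptions, not the EASME implementation: the function names, the 50% mutation rate, and the toy input sequence are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_decoy(length, rng):
    """Return a uniformly random amino-acid sequence of the given length."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def reversed_decoy(seq):
    """Return the real sequence read backwards."""
    return seq[::-1]


def mutated_decoy(seq, rate, rng):
    """Replace each residue with a different one with probability `rate`."""
    out = []
    for aa in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != aa]))
        else:
            out.append(aa)
    return "".join(out)


def make_decoys(seq, rate=0.5, seed=0):
    """Build one decoy of each kind; `rate` is an assumed value, not a published one."""
    rng = random.Random(seed)
    return {
        "random": random_decoy(len(seq), rng),
        "reversed": reversed_decoy(seq),
        "mutated": mutated_decoy(seq, rate, rng),
    }


# Made-up toy sequence standing in for a real database entry.
real = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
decoys = make_decoys(real)
for kind, seq in decoys.items():
    print(f"{kind:9s} {seq}")
```

Each decoy would be labeled as a negative example and each real sequence as a positive one before training a classifier on the combined set.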
Enhancing Photosynthesis
The objective of this project is to leverage the latest computational tools, assembled in novel pipelines, to investigate how the enzymatic pathways of photosynthetic organisms might be enhanced to improve algal bioenergy and crop yields under changing environmental conditions. Improving photosynthesis is difficult, and the reaction may already sit at a Pareto-optimal solution given the limited set of extant proteins it must work with. A Pareto-optimal solution is one in which no objective can be improved without worsening at least one other. Such solutions arise in multi-objective problems like photosynthesis, which must maximize both speed and specificity: gains in speed can reduce specificity, and vice versa.
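The notion of Pareto optimality can be illustrated with a short Python sketch that filters a set of candidate variants down to its Pareto front. The (speed, specificity) scores below are invented for illustration and are not measurements of any real enzyme; the function names are likewise hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (speed, specificity) scores for six enzyme variants.
variants = [(1.0, 9.0), (3.0, 7.0), (2.0, 6.0), (5.0, 4.0), (4.0, 3.0), (6.0, 1.0)]

front = pareto_front(variants)
print(front)  # [(1.0, 9.0), (3.0, 7.0), (5.0, 4.0), (6.0, 1.0)]
```

Every variant on the front trades speed against specificity: moving along it, one score rises only as the other falls. The two excluded variants are each beaten on both objectives by some other variant.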
Thermostable Proteins
Heat denatures and unfolds proteins. Thermostable proteins are adapted to maintain structure and function at high temperatures. Taq polymerase, a thermostable variant of DNA polymerase I first described in 1976 (Chien et al.), is the prototypical example. In a world with changing climates, EASME has the potential to help agriculture adapt quickly by learning the differences between regular proteins and their thermostable variants, then applying those differences to support rapid evolutionary rescue. In principle, applying EASME could produce a set of thermostable variants for any protein.
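One simple way to begin learning the differences between a regular protein and a thermostable variant is to compare their amino-acid compositions. The Python sketch below does only that. It is not the method EASME uses, the two sequences are made up for illustration, and a real analysis would need many aligned pairs and richer features.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Fraction of the sequence made up by each standard amino acid."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def composition_shift(regular, thermostable):
    """Per-residue change in frequency going from regular to thermostable."""
    reg = composition(regular)
    thermo = composition(thermostable)
    return {aa: thermo[aa] - reg[aa] for aa in AMINO_ACIDS}


# Made-up toy sequences, not real proteins.
regular = "MKQSNGTQLAASNGQ"
thermostable = "MKEKRGTELAAEKGR"

shift = composition_shift(regular, thermostable)
for aa in sorted(shift, key=shift.get, reverse=True):
    if shift[aa] != 0:
        print(f"{aa} {shift[aa]:+.3f}")
```

Vectors like `shift`, computed over many known pairs, are one possible input from which a model could learn which substitutions tend to accompany thermostability.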
Virus Mutation Prediction
Given a database of extant DNA sequences, it is possible to predict which new variants of a viral protein might arise in the future, building forward-looking phylogenetic trees whose root is the present-day sequence. We plan to use this approach to simulate the evolution of tomato spotted wilt virus.
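As a toy illustration of a phylogenetic tree that starts in the present, the Python sketch below grows a tree forward from a single root sequence by applying one random point mutation per descendant. This is a neutral, uniform-mutation model chosen for simplicity, not the predictive approach described above: the function names, the branching factor, and the root sequence are all hypothetical, and no learned model or selection is involved.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def point_mutant(seq, rng):
    """Return a copy of `seq` with exactly one residue changed."""
    pos = rng.randrange(len(seq))
    new = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]


def simulate_forward(root, generations, children=2, seed=0):
    """Grow a tree from `root`; returns (sequences, parent map, leaf ids)."""
    rng = random.Random(seed)
    seqs = {0: root}   # node id -> sequence; node 0 is the present day
    parent_of = {}     # node id -> parent node id
    frontier = [0]
    next_id = 1
    for _gen in range(generations):
        new_frontier = []
        for node in frontier:
            for _child in range(children):
                seqs[next_id] = point_mutant(seqs[node], rng)
                parent_of[next_id] = node
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return seqs, parent_of, frontier


# Made-up root sequence standing in for a present-day viral protein.
seqs, parent_of, leaves = simulate_forward("MKTAYIAK", generations=3)
for leaf in leaves:
    print(leaf, seqs[leaf])
```

In a predictive setting, the uniform choice inside `point_mutant` would be replaced by mutation probabilities estimated from the sequence database, so that likelier variants appear more often among the leaves.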