In this blog post we highlight some of the key takeaways from Alexander Rosenberg’s presentation on deep learning for proteomics and the future of medicine at the SigOpt summit.
Alexander is a second-year computer science PhD student at Stanford University working with Professor Michael Snyder. Alexander uses state-of-the-art computer science to answer questions within the field of biology. Previously, Alexander has been featured in journals such as Nature Biotechnology and Computational Biology and Chemistry, and has had papers accepted at conferences such as NeurIPS and ICLR.
The presentation shows how research from Natural Language Processing (NLP) can be applied to fundamental challenges in protein classification.
Similar to how a spoken sentence is made up of a series of words, a protein structure is made up of a series of peptides. By taking datasets like the UniProt protein dataset and treating every protein structure as a sentence and every peptide as a word, Alexander and his team achieve state-of-the-art performance when classifying these proteins with complex NLP models.
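As a rough illustration of this "protein as a sentence" idea (not the exact tokenizer used in the talk), a protein sequence can be split into overlapping k-mer "words" before being fed to an NLP model; the choice of k = 3 and the sequence fragment below are assumptions for the sketch:

```python
def kmer_tokenize(sequence, k=3):
    """Split an amino-acid sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A short fragment of a hypothetical protein sequence.
protein = "MKTAYIAKQR"
tokens = kmer_tokenize(protein)
print(tokens)  # ['MKT', 'KTA', 'TAY', ...]
```

Once tokenized this way, the sequence can be handled by the same vocabulary-and-embedding machinery that NLP models apply to ordinary text.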
Natural Language Processing in Proteomics
One of the most common problems in NLP is predicting the next word in a sentence. Historically, people have used n-grams to predict the next word in a sentence.
However, n-grams do not scale well, which is why a growing number of people are replacing them with deep neural networks. Deep neural networks can extract latent information from the first part of a sentence and use it to predict the missing words. Among the best-performing current models are GPT-3 and BERT.
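A minimal sketch of the n-gram approach, here a bigram model that predicts the next word from the single previous word (the toy corpus is illustrative only):

```python
from collections import Counter, defaultdict

# Toy corpus; a real n-gram model would be trained on millions of sentences.
corpus = [
    "the protein binds the receptor",
    "the protein folds quickly",
]

# Count bigram frequencies: next_word_counts[w1][w2] = count of "w1 w2".
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        next_word_counts[w1][w2] += 1

def predict_next(word):
    """Return the most frequent word that follows `word` in the corpus."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # 'protein' (follows 'the' twice, vs. 'receptor' once)
```

The scalability problem is visible even here: the count table grows with the vocabulary raised to the n-gram order, and any context unseen in training yields no prediction at all, which is what motivates the move to neural models.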
Just as a GPT-3 or BERT model can learn to predict the missing words in a sentence, the same type of model can be used to predict the missing peptides of a protein structure. Through this process the model learns to extract latent representations of the protein structures, which can then be used to classify different proteins.
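The masked-prediction setup can be sketched by masking out one peptide token at a time and asking the model to recover it; this pure-Python illustration shows how such training pairs are built, and is not the actual SignalP 6.0 pipeline:

```python
def make_masked_examples(tokens, mask_token="[MASK]"):
    """Yield (masked_sequence, target) pairs, masking one token at a time,
    in the style of BERT's masked-language-model objective."""
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        yield masked, target

# Peptide 'words' of a hypothetical protein 'sentence'.
peptides = ["MKT", "AYI", "AKQ"]
for masked, target in make_masked_examples(peptides):
    print(masked, "->", target)
# ['[MASK]', 'AYI', 'AKQ'] -> MKT   (and so on for each position)
```

Training a model on pairs like these forces it to encode the surrounding context into a latent representation rich enough to recover the hidden peptide, and that representation is what gets reused for classification.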
An example of this is the SignalP 6.0 framework used for classifying protein structures.
The difference between SignalP 5.0 and SignalP 6.0 is that SignalP 6.0 uses state-of-the-art NLP techniques to classify protein structures. SignalP 6.0 outperforms SignalP 5.0 across all the different classes of protein structures.
A lot of the models used for these complex NLP tasks are publicly available and ready to use. However, these models are optimized for tasks such as predicting the next word in a sentence. With tools such as SigOpt, one can take an existing model from the NLP domain and fine-tune it for a protein sequence dataset.
This is done by swapping the protein dataset in for the text corpus the model was originally trained on, then optimizing the model to predict missing peptides by tuning the underlying hyperparameters with a tool such as SigOpt.
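The tuning loop itself can be sketched generically; the random sampler below stands in for SigOpt's Bayesian optimizer (used only to keep the sketch self-contained and runnable), and the parameter names, ranges, and score function are illustrative assumptions:

```python
import random

random.seed(0)

# Hypothetical search space for fine-tuning hyperparameters.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-3),
    "dropout": (0.0, 0.5),
}

def evaluate(params):
    """Stand-in for fine-tuning the model and returning validation accuracy.
    A real run would train on the masked-peptide task here."""
    lr, dropout = params["learning_rate"], params["dropout"]
    return 0.9 - abs(lr - 3e-4) - 0.1 * dropout  # toy objective

best_params, best_score = None, float("-inf")
for _ in range(20):
    # SigOpt would propose each suggestion via Bayesian optimization;
    # plain random search is shown here for simplicity.
    params = {name: random.uniform(lo, hi)
              for name, (lo, hi) in SEARCH_SPACE.items()}
    score = evaluate(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```

The practical difference is sample efficiency: a Bayesian optimizer uses past observations to choose the next suggestion, so it typically needs far fewer training runs than random search to reach a good configuration, which matters when each evaluation means fine-tuning a large model.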
By fine-tuning the model for this new task, Alexander and his team saw a performance increase of 1–3%, depending on the task.
If you want to learn more about deep learning for proteomics or learn what the future of medicine holds, check out the presentation or visit Alexander’s website where you can read all his publications. To see if SigOpt can drive equivalent results for you and your team, sign up to use it for free.