SigOpt offers our software free to academics, nonprofits, or labs who are performing research they plan to publish. Dozens of publications across a wide variety of disciplines have utilized SigOpt through this program. Learn more about the SigOpt Academic Program or sign up for free access to SigOpt.
Researchers from the Technical University of Denmark, Stanford University, and the University of Copenhagen collaborated to develop and optimize a recurrent neural network (RNN) for glycosylphosphatidylinositol (GPI) signal prediction. Detecting GPI anchors is highly relevant for understanding the development of both healthy and unhealthy cells in a wide variety of organisms, making it an area of particular research interest. They found that this RNN significantly outperformed prior methods for GPI prediction and published their findings in Current Research in Biotechnology (paper, model, code).
In the course of their research, they applied SigOpt to explore their modeling problem and optimize their model. In this interview with Alexander Rosenberg Johansen, we explore this research, how they utilized SigOpt, and what’s next for this collaboration.
What is your research subject?
GPI-anchored proteins are essential to the development of fungi and animal cells, so detecting them is highly useful across a wide range of biological research. Typically, detection requires experimental assays that are low throughput and costly. Beginning in 1999 and continuing through the 2000s, machine learning was applied as a low-cost way to predict the presence of GPI-anchored proteins, making this process more affordable, efficient, and useful (Fankhauser and Mäser, 2005; Eisenhaber et al., 1999; Pierleoni et al., 2008).
Developed over a decade ago, these machine learning methods do not take advantage of recent advancements in machine learning and deep learning techniques. In this research, we develop NetGPI, a recurrent neural network, to determine whether different deep learning techniques can boost the performance of these models for GPI-anchor detection tasks. We found this approach significantly boosted performance and led to state-of-the-art results for classifying GPI-anchor proteins.
For whom is this research most valuable?
NetGPI is useful for a wide range of researchers. As a general technique, it shows how neural networks can be reliably applied to sequence classification problems. Researchers doing biological sequence classification across a variety of subjects can benefit from this paper, model, and code as a starting point.
NetGPI itself is useful for any researcher who needs to classify GPI-anchor proteins as part of their research process. This may be one step in a broader process, such as exploring the development of animal cells, so NetGPI could serve as a component of a broad range of biological research.
Finally, there is a line of research exploring the classification of GPI-anchor proteins to which NetGPI contributes. Further research on this particular task can benefit from NetGPI as a starting point as new deep learning techniques emerge or datasets become available.
What made you interested in it?
Detecting GPI-anchor proteins is critical for a broad class of biological research, but is resource-intensive to perform in the real world. This challenge, coupled with the lack of recent research applying novel deep learning techniques, made it a compelling area for research that could make a big impact when applied to actual use cases. This more practical element of the research and potential utility of NetGPI is what makes this research particularly compelling to me.
Which models did you use and how did you select them?
Traditional machine learning methods had previously been applied to this classification problem (Fankhauser and Mäser, 2005; Eisenhaber et al., 1999; Pierleoni et al., 2008). So we chose to apply a recurrent neural network to the problem to discover whether we could drive additional performance gains.
Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art performance on related sequence classification problems, so we developed NetGPI relying on this type of architecture (Mikolov et al., 2013). The model is trainable with stochastic gradient descent (Adam) using back-propagation through time in PyTorch, and the hyperparameters were optimized with SigOpt.
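The training setup described here can be sketched in PyTorch. This is a minimal illustration, not the authors' NetGPI code: the class name `GPIClassifier`, the dimensions, and the random toy data are all assumptions for the sketch; back-propagation through time is handled automatically by autograd when `loss.backward()` is called on a loss computed over the sequence.

```python
import torch
import torch.nn as nn

class GPIClassifier(nn.Module):
    """Illustrative bidirectional LSTM classifier over amino-acid sequences."""
    def __init__(self, vocab_size=25, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) of amino-acid indices
        _, (h, _) = self.lstm(self.embed(x))
        # concatenate the final forward and backward hidden states
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.fc(h)

model = GPIClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in data.
x = torch.randint(0, 25, (8, 50))   # batch of 8 sequences, length 50
y = torch.randint(0, 2, (8,))       # binary GPI-anchor labels
logits = model(x)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()                     # back-propagation through time via autograd
optimizer.step()
```

In practice the hidden size, learning rate, and similar choices are exactly the kinds of hyperparameters that were handed to SigOpt for optimization.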
Why did you select SigOpt to optimize these models?
To develop a robust understanding of our modeling problem, we needed to experiment with a variety of potential architecture and hyperparameter configurations. SigOpt makes this process sample efficient with Bayesian optimization that requires fewer training runs than other methods to achieve high-performing configurations of a given model for a problem. It also automatically logs all metadata from these runs to a dashboard and populates useful comparisons, charts, and plots that make it easy to explore and understand the performance of these models across the metrics that are most important.
SigOpt made it easy for us to efficiently train and optimize 20 different models to determine which was most accurate for this sequence classification task. This approach led us to discover NetGPI, which achieved state-of-the-art performance on this task.
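The overall loop being described is a suggest/evaluate/observe cycle. The sketch below shows its shape in plain Python, with a random sampler standing in for SigOpt's Bayesian optimizer so it runs without an API token; the parameter names, bounds, and the toy objective are illustrative assumptions, not the authors' search space.

```python
import random

# Stand-in search space (illustrative, not NetGPI's actual space).
bounds = {"learning_rate": (1e-4, 1e-1), "hidden_dim": (32, 256)}

def suggest():
    """Propose a configuration; SigOpt would choose this via Bayesian optimization."""
    return {
        "learning_rate": random.uniform(*bounds["learning_rate"]),
        "hidden_dim": random.randint(*bounds["hidden_dim"]),
    }

def evaluate(params):
    """Stand-in for training a model and returning its validation metric."""
    return 1.0 - abs(params["learning_rate"] - 0.01)

best = None
for _ in range(30):                 # fixed budget of training runs
    params = suggest()              # ask the optimizer for a configuration
    score = evaluate(params)        # train and evaluate with it
    if best is None or score > best[1]:
        best = (params, score)      # report the observation, track the best
```

The difference with SigOpt is that each new suggestion is informed by all previous observations, and every run is logged to the shared dashboard automatically.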
How would you characterize the benefit of using SigOpt?
I love SigOpt! First, it automates the optimization loop and population of relevant charts, plots, comparisons, and history of experimentation. The fact that it is a hosted API reduces any overhead for implementing, managing, or trouble-shooting the product. This makes my life a lot easier and makes it easier for us to collaborate as a team across research universities.
Second, it is a sample-efficient method for hyperparameter optimization and model selection. With other methods, such as random search or grid search, it would be difficult to discover a high-performing configuration of these models in as few as 30 training runs. Even with other Bayesian optimization methods, this would be a challenge. We threw the problem at SigOpt and it handled it cleanly and efficiently to give us a comparison across this variety of models.
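To see why grid search struggles at this budget, a bit of back-of-the-envelope arithmetic helps; the numbers of parameters and candidate values below are illustrative, not figures from the paper.

```python
# Exhaustive grid search over even a modest space far exceeds a 30-run budget.
values_per_parameter = 5            # e.g. 5 candidate learning rates
num_parameters = 4                  # e.g. learning rate, hidden size, layers, dropout
grid_runs = values_per_parameter ** num_parameters  # 5**4 = 625 full training runs
bayesian_budget = 30                # the budget mentioned above
print(grid_runs, grid_runs // bayesian_budget)      # grid needs ~20x more runs
```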
Finally, SigOpt makes it easy to explore and understand the modeling task and parameter space in a collaborative way. Collaboration is critical when we are working across universities. SigOpt automatically logs every run and experiment, and provides a dashboard for us to view, visualize, analyze, and annotate these results to develop a deeper collective understanding of the modeling problem.
What would you most love to see as an improvement to SigOpt’s product to assist you in further research?
The primary improvement to SigOpt that would help our team is being able to run experiments that are purely exploratory in nature. In biology, we are often exploring a parameter space across a variety of metrics that have hard constraints. To start this type of research, it would be very useful to run an experiment with SigOpt where the results are not designed to maximize or minimize any given metric, but instead designed to satisfy the constraints of a variety of metrics that mimic the real-world limitations we have in our research.
My understanding is that SigOpt is developing a new line of research that applies active search to explore this type of modeling problem space in an efficient and intelligent way. I look forward to using this feature once it is available.
How do you expect to continue evolving this research in the future?
The most exciting area for future exploration is how NetGPI can be applied as a component of various biological research use cases. This is more of an input to future research than an end point in itself, so we look forward to researchers building on this work as part of a broader research scope. In this sense, we hope that NetGPI enables and streamlines an important component of other biological research.