Machine learning holds significant promise for fields like proteomics, therapeutics, and more—but blockers like access to datasets and issues of health privacy make progress complicated.
In this week’s episode of Experiment Exchange, “How Alexander Johansen is Pioneering the Role of ML within Health and Bio Science,” join Michael McCourt as he talks with Alexander Johansen, a Ph.D. student in computer science studying the intersection between computer science, bioinformatics and digital health. Within his lab at Stanford University, Alexander has applied Natural Language Processing to proteins, explored the history of wearables data privacy, and more. SigOpt’s Head of Engineering Michael McCourt spoke with Alexander about his pioneering work and how SigOpt has played a role in advancing progress.
Below is a transcript from the interview. Subscribe to listen to the full podcast episode, or watch the video interview.
Michael McCourt [00:00:19] Today we have with us Alexander Johansen, a Ph.D. student in computer science at Stanford. His work focuses on how deep learning can play an increasingly significant role in the future of medicine. We will start by discussing how SigOpt was a key tool in maximizing the performance of his neural networks, which were originally developed for natural language processing but are now being applied to proteomics. We also discuss how new advances in medicine and therapeutics can be found through machine learning, but that unique challenges exist in these pursuits, such as access to medical data and the need for privacy-ensuring methods. Ladies and gentlemen, thank you for joining us again on another one of our interviews. In particular today, we have Alexander Johansen, a Ph.D. student in computer science at Stanford, joining us. Alexander, thank you so much for coming today.
Q: We hosted you at SigOpt Summit earlier this year. I want to talk about that, but first, why don’t you tell us a bit more about yourself?
I’m a second year Ph.D. student at Stanford University, and I’m studying the intersection between computer science, bioinformatics and digital health, and how we might use computer science to answer important questions, such as: How do you extract health data? How do you share health data? And how do you ensure user privacy?
Q: That’s a big topic right there. Tell us more about your lab. What kinds of things do you work on, and how do you approach this complicated intersection of various topics?
I think a lot of people today are starting to ask questions about health and about the use of data.
We’ve seen in the last 100 years, if you look through anthropological and even archeological studies, that the human body and even the human skeletal structure has changed vastly during the last century. And we don’t really know why. We don’t necessarily know whether or not this is caused by lifestyle changes or any other different type of changes, and where we might make changes to the way that we live, the way that we handle pregnancies, the way that we handle childhood, and the way that we handle treatment of chronic illnesses.
Within that, there’s a new, exciting field of research called Digital Health that came about some six, seven years ago, which a lot of you might be familiar with through the Apple Watch and the Fitbits of the world. But it’s the idea that we can continuously measure humans on a mass scale.
Now with this comes a lot of questions, such as what’s the best device to use? Once you have this device or the 23andMe genetics tests, how might you be able to extract this data, get access to this data? What about usage of this data? Does 23andMe own your data? Does Fitbit own your data? If you want to build any type of medical algorithms on it or get medical insight to it, that’s very important.
Also, let’s say that you want to use this device for some type of clinical study. You want to get insight into diabetes. You want to get insight into pregnancies, miscarriages, or another example of a clinical study. How might you ensure that the people who use the Fitbits throughout these pregnancies, that they can delete that data once they’re done with the study, that it’s not owned by Fitbit, that it’s privacy-ensured? We’re also seeing hackings that happen all the time for these different wearable companies. Do we have de-identified data? Do they have secure connections? So there’s a whole array of privacy questions, security questions that we think about, especially with the current atmosphere around data privacy which we’ve seen with a lot of the big tech giants.
We ask these questions in an academic sense, and we highlight the issues around it, such that we can come up with solutions that solve these issues. So right now, we’re mainly focused on figuring out how we might be able to find this combination of devices and how we might be able to handle some of these privacy concerns.
Q: The problem isn’t just one of getting and using data, but there are also ethical concerns, practical concerns, and security concerns. Do you have people in your lab who are philosophers? What kind of person deals with these questions?
We actually just ran a competition for volunteers to join the lab. We do most of our recruiting on Reddit [laughs], because there’s a lot of talent out there today who don’t have the opportunities to work together with an exciting research lab on exciting research projects. So we have about eight people, maybe nine people today. A lot of our talent is actually people from abroad who might not be in the vicinity of a big research institution. We go out and then we ask them these fundamental questions. We try to get computer scientists, biomedical engineers, and so forth, who have mainly been trained in creating things and making databases. The classical core computer science courses unfortunately do not deal with data privacy and ethics at all. Then, for the competitions that we put out for them, we asked them these questions. We want people who are well-rounded and who are able to think about societal problems.
In the review paper that we’re constructing right now, we’re going to take a historical view on not just data privacy, but also questions within medical grading, which I think is very interesting nowadays, especially when you see all these different vaccine trials, like the COVID vaccines from Pfizer, Moderna, and so forth. For example: What does that actually mean, these different FDA approvals and different FDA processes? What can you expect from them? What about transparency? What about you as a user actually understanding all of these processes?
We want to highlight some of the history behind these different legal documents, and we also want to highlight what it means that something is FDA-cleared. What does it mean something is FDA registered or FDA approved? What stages are approved, and what can you trust this device to be when it has these different levels of approvals? So those are also some of the things that we’re very interested in right now and that we’re looking into for our recent paper.
Q: Are there other government entities that should be involved in this discussion? Or do you think that the FDA will be where a lot of this discussion takes place?
That is a good question. I think right now, there are not a lot of laws around wearable data, unless you’re actively using it for treatment, which means it’s something you can apply for through your insurance, like a pacemaker. I mean, that’s effectively a wearable, right? You can apply to get a pacemaker, you get surgery, you get it put in. All of this is through insurance. You go through a hospital, you have treatments. And so as long as it’s a part of your medical treatment, then there’s a set of rules around that. For example, it needs to comply with HIPAA.
But a lot of these wearable devices are seen as sports devices, which means that there’s not a lot of legal protection around them. There aren’t enough legal protections for the user. The amount of data and information that you can gather from the user is extremely valuable and can be integrated into your personal treatment. But if we don’t have strong privacy regulations for this, we could also see a future where a lot of that data is utilized by companies that you might not want to utilize that data.
A recent study found that Oura rings can predict menstrual cycles, and so by that, you might even be able to predict pregnancies, if you’re off your menstrual cycle or menopausal and so forth. That information might not be something that you want shared with everybody.
Given that there are currently not a lot of restrictions around what you use the wearable data for, that might lead to some privacy issues down the road. Right now, there’s only FDA protection for a device that is used for medical purposes and intents. Other than that, there’s not much involvement from any government entity.
Q: I’d like to talk about the particular project that you spoke on at the Summit, the title of your talk Deep Learning for Proteomics and the Future of Medicine. Could you break down what you spoke about in your talk?
So, we have two different research directions at our lab. One of them is more basic research within bioinformatics, where we’ve seen that contemporary deep learning tools can be used for handling proteomics challenges.
A protein is very similar to a string of text or a paragraph of text. It consists of amino acids and these amino acids have a vocabulary of about 20 different characters, which is very similar to the English language of 27. These different characters will be strung together in a sequence of, on average, about 300 characters, which is quite similar to a paragraph in length.
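As an illustration of this analogy (not from the interview), here is a minimal Python sketch of how a protein sequence can be turned into tokens in the same way a character-level text model handles letters. The 20-letter alphabet is the standard one-letter amino acid code; the names `encode`, `TOKEN_TO_ID`, and `UNKNOWN_ID` and the example peptide are hypothetical, chosen only for this sketch.

```python
# The 20 standard amino acids, one letter each (the protein "alphabet").
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each letter to an integer id, the same way a character-level
# text model maps letters to ids. Id 0 is reserved for unknown symbols.
TOKEN_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN_ID = 0


def encode(sequence: str) -> list[int]:
    """Turn a protein sequence into a list of integer token ids."""
    return [TOKEN_TO_ID.get(aa, UNKNOWN_ID) for aa in sequence.upper()]


# A short, made-up peptide used only for illustration.
example = "MKTAYIAKQR"
ids = encode(example)
print(len(AMINO_ACIDS), ids)
```

A real protein language model would add special tokens and handle rare or ambiguous residues, but the core idea is the same: a small vocabulary and a sequence a few hundred tokens long.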
There’s also a lot of other structure in these sequences because they actually have a physical meaning, like they fold in a specific way. They will attach to other proteins and you can target specific receptors on these proteins. They’ll fold in specific ways that will allow drugs, chemicals, and so forth to attach to them and change. Because of that, researchers started thinking that we might be able to use some of the same methodological work that we’ve been using for natural language processing on proteins.
Recently in the past three years or so, there’s been this move towards using unsupervised language models in order to generate embeddings for text. About a year and a half ago, researchers started asking the same question. Well, can we do this for proteins? And it turns out, yes, we can.
We have been collaborating with the Technical University of Denmark, who have some of the world’s leading tools for protein predictions. So that’s predicting a whole range of functions for proteins, which is important in every single aspect of life sciences. So for medicine, for plant technology, for bacteria, and so forth. If you want to have bacteria that break down plastics, then they need to act in certain ways and we need to be able to predict how they act and whether or not we can design specific proteins and enzymes that can help us solve the task we’re interested in.
It turns out that using protein language models to solve those problems works a lot better than what we’ve been doing previously, because the model is able to learn the underlying structure of proteins and how proteins relate to each other. It’s able to learn not just the different types of functionalities in general, but also the functionalities that we’re looking to predict: it can predict signal peptides, secondary structures, whether or not a protein is soluble, whether or not you can express specific enzymes in a bacterium, and more. On a whole range of protein tasks, we’ve seen that this approach has been very useful.
However, especially for basic sciences like proteomics, we have the issue that getting labeled data is very expensive. A lot of these data sets will only have hundreds, maybe just a few thousands of labeled proteins. These will have been gathered by the whole community through the last maybe 20, 30 years.
Q: How are you managing access to data for this project?
We have big databases of hundreds of millions of proteins, but we might not have any labels for these proteins. Now, what this language model is able to do is it’s able to look through these large databases, it’s able to understand fundamentally what goes on in these proteins, how some proteins differ from other proteins. It understands the core functionality of a protein, and is able through context vectors to predict how different proteins will be expressed. We found that it’s extremely well suited to handle proteins that are very distant from what we’re training on—it generalizes really well.
Q: How has SigOpt played a role in your work?
In order to optimize our proteomics language models, we use SigOpt. We’re dealing with these large neural networks and we need to optimize learning rates, how we handle the output from these neural networks, hyperparameter optimization, and so forth. A lot of this can be quite cumbersome and it also can be very computationally expensive because we’re fine tuning big language models that might have—I think the one we use for our latest research had 650 million parameters.
So this takes a lot of computational resources. Because of that, we figured we should try SigOpt in order to see if we can grind out performances. It’s kind of a combination of giving us a little bit of an edge, combined with the simplicity of the tool. We don’t have to think about it, we just pipe our info into SigOpt, and SigOpt does the hyperparameter sweep for us. It probably does a better search than we would do, where we would just do a grid search or something along those lines!
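To make the suggest-evaluate-report loop concrete, here is a minimal, self-contained Python sketch (not from the interview). It is not SigOpt's API and not the lab's code: plain random search stands in for the optimizer, and `toy_validation_loss` is a hypothetical stand-in for an expensive fine-tuning run, with its minimum placed near a learning rate of 1e-3 and a dropout of 0.1 purely by construction.

```python
import math
import random


def toy_validation_loss(learning_rate: float, dropout: float) -> float:
    """Hypothetical stand-in for an expensive training run.

    The minimum sits near learning_rate=1e-3, dropout=0.1 by construction.
    """
    return (math.log10(learning_rate) + 3.0) ** 2 + (dropout - 0.1) ** 2


def suggest(rng: random.Random) -> dict:
    """Propose one configuration: log-uniform learning rate, uniform dropout."""
    return {
        "learning_rate": 10 ** rng.uniform(-5.0, -1.0),
        "dropout": rng.uniform(0.0, 0.5),
    }


def search(budget: int, seed: int = 0) -> tuple[dict, float]:
    """Run `budget` suggest/evaluate/report rounds and keep the best result."""
    rng = random.Random(seed)
    best_params, best_loss = None, math.inf
    for _ in range(budget):
        params = suggest(rng)
        loss = toy_validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = search(budget=50)
print(best_params, best_loss)
```

Unlike a grid, the sampler draws the learning rate on a log scale, so small and large values get equal coverage. A model-based optimizer would go further and use earlier results to choose the next suggestion, which matters when each evaluation means fine-tuning a model with hundreds of millions of parameters.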
Also there’s the explorative part of it. We’ve used SigOpt previously when we tried to build our own language models. Especially for some of these more foreign style archaeal proteins and virus proteins, we found it very difficult to build a language model. We try to just allow SigOpt to do the full exploration around how you put together a language model for virus and archaea proteins, and then we just let it run, I think for like 200 or 300 iterations, just exploring the space of how might you be able to build a language model for this.
What’s interesting is that SigOpt ran 120 iterations where it was just having completely random perplexity. And then all of a sudden, it just caught it. It just found the right set of hyperparameters, and we were able to drop the perplexity from roughly 18 to about half of that, about nine or ten or something. It went from being completely random, where we weren’t able to capture any type of information, to exploring the space enough to find a suitable combination of hyperparameters that could solve the problem.
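For readers unfamiliar with the metric, here is a short Python sketch (not from the interview) of how perplexity is computed: the exponential of the average negative log-probability the model assigns to the true tokens. Assuming a 20-letter amino acid vocabulary, a model that guesses uniformly scores a perplexity of 20, which is in the same range as the "random" value of roughly 18 described above; halving it means the model is, on average, choosing among about half as many plausible tokens. The function name and the probability lists are hypothetical.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-probability of the true tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


VOCAB_SIZE = 20  # assumption: the 20 standard amino acids

# A model that guesses uniformly assigns 1/20 to every true token.
uniform = perplexity([1.0 / VOCAB_SIZE] * 300)

# A model that assigns 1/10 to every true token is "half as confused".
better = perplexity([0.1] * 300)
print(round(uniform, 6), round(better, 6))
```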
So for this more constrained problem of fine tuning the language model, SigOpt gives us convenience in grinding out a few more percentages. And for these open new avenue problems, such as building language models for very unique types of protein families, it helps us a lot with exploration as well.
Q: You’ve spoken now at great depth about the impact of what you’re doing with precision on the proteomics community and the research that’s being done there. How much have you had to immerse yourself in that topic and learn about proteomics in order to be able to read that research and then contribute back to it?
That’s a good question. I’m a trained computer scientist, and I think that in computer science you’re very much focused on the computer science aspect of things. Like, how can you build these scalable algorithms? How can you work with mathematics? How can you deploy this to a web service and so forth? You barely get exposed to applications at all. I think this is a key issue, and I think that we should definitely have a bit more of an interpretive approach to things, which is also why, for a lot of the wearables projects that we are working on, we want to be able to make this useful for people who are exploring the intersection between computer science and health. We want this to be used by people who might come from a straight computer science background.
Now, in my case, I found bio-mathematicians and then I spoke with them and then I took the time to delve into the problems that they’re facing. I was able to extrapolate those problems to key math equations and key math problems, and then from those abstractions, it was just a computer science problem. That didn’t really change throughout the process of the algorithmic development or the process of writing the paper. The only thing that is a challenge, though, and I think this mainly calls for close collaborations, is that while the computer science community is mainly interested in algorithmic development, the life science community is much more interested in biological interpretations.
Beyond just being able to have a better algorithm, we also need to give some biological interpretations, like why is it important that we get a better performance on this specific task? What biological insight might we be able to get from this new tool? For this, it’s been key that we have collaborated with the Technical University of Denmark, and Henrik Nelson, who is an associate professor at that institution.
Q: Going back to the question of health data privacy and the ability to use this data, when I hear you talk about that, a lot of it feels like this question of explore, exploit, and balance. For example, how do I pick the right piece of data to learn more about the person, but also the most valuable piece of data so as to best shape their wellness outcomes, whether it’s lower their blood pressure, lower their cholesterol, whatever number of things it is. How is that research proceeding right now?
The balance between what data is most valuable, that’s very much an optimization problem. I think a lot of this also comes down to cost because, for example, the Fitbit is not that expensive. I think it’s $80 or $120 or something along those lines. But let’s say that you want to do a single cell sequencing of a specific tissue or some blood. Well, that can run you $4,000. So it’s important to be able to understand what piece of information, among this wide variety of data inputs that we have, might be most useful.
So let’s say that you can choose between 100 different tests and 100 different wearable devices, and they all have some type of cost. When should you use what? Should I, two weeks from now, get single cell sequencing? Or should I get a blood panel first and then run some Fitbits and maybe some EEG headbands? Based on that, maybe I don’t need single cell sequencing. So we definitely need to have some type of balance in between exploration and exploitation.
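As a rough illustration of the cost side of this trade-off (not from the interview and not the lab's method), the Python sketch below ranks measurements by expected usefulness per dollar and greedily picks what fits a budget. The wearable and sequencing prices echo the figures mentioned above; the blood panel price and every `expected_gain` score are made-up placeholders, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    cost_usd: float       # price of running the measurement once
    expected_gain: float  # hypothetical usefulness score, made up here


def pick_within_budget(options: list[Measurement], budget_usd: float) -> list[Measurement]:
    """Greedy heuristic: take the best gain-per-dollar options that still fit."""
    chosen = []
    remaining = budget_usd
    ranked = sorted(options, key=lambda m: m.expected_gain / m.cost_usd, reverse=True)
    for m in ranked:
        if m.cost_usd <= remaining:
            chosen.append(m)
            remaining -= m.cost_usd
    return chosen


# Costs for the wearable and sequencing echo the interview; the blood panel
# cost and every expected_gain value are placeholders, not real estimates.
options = [
    Measurement("fitness wearable", 100.0, 2.0),
    Measurement("blood panel", 200.0, 3.0),
    Measurement("single cell sequencing", 4000.0, 10.0),
]
plan = pick_within_budget(options, budget_usd=500.0)
print([m.name for m in plan])
```

This captures only exploitation under a fixed budget. A fuller explore/exploit approach would update each `expected_gain` as results come in, so that a cheap early test can change whether an expensive later one is worth running.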
I think that this could be extraordinarily valuable for any type of clinical study or clinical trial, because a lot of the big grants given out by the NIH today are mainly for lab expenses, such as single cell sequencing. That’s probably one of the biggest ones right now, or gene sequencing, because you’ll end up spending a lot of money on that. Being able to optimize where we have information gradients will be a key point of research going forward.
Q: Where can people go to learn more about you and your research?
Our lab you can find at stanford-health.github.io, or you can go directly to my Twitter handle at @AlexRoseJo and see the different tweets that we have on these topics.
From SigOpt, Experiment Exchange is a podcast where we find out how the latest developments in AI are transforming our world. Host and SigOpt Head of Engineering Michael McCourt interviews researchers, industry leaders, and other technical experts about their work in AI, ML, and HPC — asking the hard questions we all want answers to. Subscribe through your favorite podcast platform including Spotify, Google Podcasts, Apple Podcasts, and more!