Given a property, what’s the material or the molecule that achieves it?
In this week’s episode of Experiment Exchange, “How Rafael Gomez-Bombarelli Explores New Materials with Inverse Design ML,” join Michael McCourt as he interviews Rafael Gomez-Bombarelli, an Assistant Professor of Materials Processing at MIT whose work is focused on the development of machine learning strategies to design new materials—including fluids, cloths, metals, and nanomaterials. They talk about how SigOpt is used in his lab to automate the tedious tasks of model building, inverse design, and more.
Below is a transcript from the interview. Subscribe to listen to the full podcast episode, or watch the video interview.
Q: Let’s talk about your work. In November of 2021, you spoke at the SigOpt Summit about leveraging machine learning to design new materials. Tell us more about that.
We believe, along with other people in the field, that there is a line between the way people have been thinking about simulating physics in computers and the type of tasks that machine learning has proven very effective for. For example, DeepMind recently published a DFT paper, so clearly these lines are blurring. We’re very excited about using all of these tools together to tackle our design problems, and use algorithms to invent new materials.
That also means we look for ways to streamline the process. For example, when we can bootstrap the data, when the simulations are a good enough predictor of what’s going to happen, then we build a surrogate model for the simulations. So instead of running a computational method that scales as n cubed, or n to the fourth or fifth power, for each problem, we train a surrogate model capable of providing predictions much more cheaply.
For instance, something we’ve been finding very exciting lately is to use differential uncertainty to hunt for the places where the surrogate models are breaking down. We take the derivative of the uncertainty of the model with respect to the inputs and pinpoint the holes in the surrogate functions. That’s a set of tools we use a lot—training a surrogate for an expensive, but accurate, physics-based simulation.
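As a rough illustration of taking the derivative of a model's uncertainty with respect to its inputs, here is a minimal sketch. It is not the lab's implementation: the three toy polynomial "models", the function names, the step size, and the finite-difference gradient are all assumptions made for the example. A realistic setup would use an ensemble of trained neural networks and automatic differentiation.

```python
import statistics

# Hypothetical ensemble: three cheap surrogates that agree near x = 0
# and drift apart as |x| grows (standing in for trained models).
ENSEMBLE = [
    lambda x: x + 0.10 * x**2,
    lambda x: x - 0.10 * x**2,
    lambda x: x + 0.05 * x**3,
]

def uncertainty(x):
    """Variance of the ensemble predictions at input x."""
    return statistics.pvariance([model(x) for model in ENSEMBLE])

def uncertainty_gradient(x, eps=1e-5):
    """Central finite-difference derivative of the uncertainty w.r.t. x."""
    return (uncertainty(x + eps) - uncertainty(x - eps)) / (2 * eps)

def find_hole(x0, step=5.0, n_steps=50, lo=-2.0, hi=2.0):
    """Gradient ascent on the uncertainty, clipped to the domain [lo, hi]."""
    x = x0
    for _ in range(n_steps):
        x = min(hi, max(lo, x + step * uncertainty_gradient(x)))
    return x

x_start = 0.5
x_hole = find_hole(x_start)
print(uncertainty(x_start), uncertainty(x_hole))
```

The point found this way is a candidate input to send back to the expensive simulation, since it is where the surrogates disagree the most.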
Q: That sounds like active learning—is it?
Absolutely. This is something we’ve been very interested in. For example, we’ve got an oracle: we have a ground truth that we can call on as many times as we want. We can run it overnight, do it in parallel, or use a DOE supercomputer. It’s not like with Mechanical Turk, where you do need to get a cohort. Spinning up the oracle for us is very inexpensive.
We set up these active learning cycles for surrogate functions that sometimes take five, seven, or ten generations of interplay between the algorithm and the oracle. That’s what led us to trying to find the holes actively in our machine learning models by doing differential uncertainty.
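The interplay between the algorithm and the oracle can be sketched as a simple loop. This is only an illustration under stated assumptions: the `oracle` function is a made-up stand-in for an expensive simulation, and the committee of bootstrapped nearest-neighbor predictors is a deliberately tiny substitute for a real surrogate model.

```python
import math
import random
import statistics

def oracle(x):
    """Stand-in for an expensive physics simulation (hypothetical)."""
    return math.sin(3.0 * x)

def nearest_neighbor_predict(data, x):
    """Predict with the label of the closest training point."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

def committee_variance(data, x, rng, n_members=10):
    """Disagreement of 1-NN models fit on bootstrap resamples of the data."""
    preds = []
    for _ in range(n_members):
        resample = [rng.choice(data) for _ in data]
        preds.append(nearest_neighbor_predict(resample, x))
    return statistics.pvariance(preds)

def active_learning(n_generations=7, seed=0):
    rng = random.Random(seed)
    candidates = [i / 50.0 for i in range(101)]       # candidate pool on [0, 2]
    data = [(x, oracle(x)) for x in (0.0, 1.0, 2.0)]  # initial labeled points
    for _ in range(n_generations):
        labeled = {x for x, _ in data}
        pool = [x for x in candidates if x not in labeled]
        # Query the oracle where the committee disagrees the most.
        x_new = max(pool, key=lambda x: committee_variance(data, x, rng))
        data.append((x_new, oracle(x_new)))
    return data

data = active_learning()
print(sorted(x for x, _ in data))
```

Each generation adds one new oracle call at the point of highest disagreement, so the labeled set grows by exactly one point per cycle.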
Q: What sort of expense are we talking about when you’re doing these computations?
That’s a good point. For instance, the simulations we run for ground states take hundreds of CPU hours to produce a single valuable data point. There are also more expensive simulations that we use for excited states, like understanding the optical properties of molecules. An example of this is the glasses that change colors when you’re outside, where they get dark in sunlight. The molecular motions are governed by complicated physics that is really hard to simulate. The surrogate model for that made our work literally a million times faster. It now takes milliseconds to sample, while the underlying ground truth takes somewhere on the order of tens or hundreds of hours.
Q: It’s incredible that you’re able to get this amazing speed, partly because you’re leveraging your physics knowledge to design the data that you’re creating effectively, and also because of the power of these ML tools.
Exactly. Again, this is a place where SigOpt’s hyperparameter optimization has really come to our rescue. I talked about this in my SigOpt Summit talk, and since then it’s happened again: a student came to me and said, “I trained the model, it learns, but it’s not good enough to do the simulations we want with the surrogate.” It’s always the same answer. You need to hook it up to SigOpt and let it run. Then come back a few days from now, and the model will be twice as good. Somebody’s whole project got qualitatively better just by hooking it up to SigOpt’s hyperparameter optimization.
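For readers unfamiliar with hyperparameter optimization, the general shape of such a loop is: propose parameters, train, score, keep the best. The sketch below is an assumption-laden illustration and does not use the SigOpt API. Plain random search stands in for the optimizer, and `validation_error` is a made-up function standing in for "train the surrogate and measure its validation error".

```python
import math
import random

def validation_error(learning_rate, hidden_units):
    """Hypothetical stand-in for training a surrogate and scoring it."""
    return (math.log10(learning_rate) + 3.0) ** 2 + ((hidden_units - 128) / 128.0) ** 2

def random_search(n_trials=200, seed=0):
    """Random search over two hyperparameters, keeping the best trial."""
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform sample
            "hidden_units": rng.randint(16, 512),
        }
        error = validation_error(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

best_params, best_error = random_search()
print(best_params, best_error)
```

A dedicated optimizer would replace the random proposals with suggestions informed by earlier trials, but the outer loop looks much the same.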
Q: What are these surrogates modeling in this particular situation? You talked about color—is that one of the key things you’re trying to model?
Yes, one of the things we’re modeling is how quickly the color will return to your light-sensitive glasses. The specific application we’re focused on is photo-pharmaceuticals, drugs that change with light. Ideally you want a molecule with a very narrow set of properties that twists when you illuminate it, achieving the desired conformation that allows it to bind to your target.
A lot of modeling has been done around how molecules interact with their biological targets, but these properties are really quantum and it’s really, really expensive.
Q: Can you tell us all about how you’re using multi-fidelity methods right now?
Absolutely. I think this is something that people in machine learning have done for a long time in the context of transfer learning, where there have been some very clear successes. When you train a model on Language A, it turns out you can transfer these weights to a model, train this model with a smaller percentage of Language B and still get good results. A lot of these underlying data structures and connections can be learned from one model training set combination and transferred to another.
We found that with chemistry and materials, it’s more subtle. It’s not clear what properties and what domains will transfer to one another. For instance, molecules are graphs, and the neural network architecture that is used to model them is the graph neural network. In general, when pre-training for graph completion, you take out some nodes and you train a machine learning model on how to fill in the holes, which typically helps prepare graph models for other tasks. But it really doesn’t help for molecules. Pre-training on a model that knows how to complete a molecule doesn’t really help it with other tasks related to molecules, like their properties.
But again, we’ve got this oracle that we can call on as many times as we want in terms of our physics-based simulations. So we do have this interplay between experiments, via this ground truth and our surrogate, which has some accuracy and is cheaper to call. It’s not exactly a transfer learning solution.
What we found is that there are domains where the amount of computational data we have is commensurate with the amount of experimental data there is. They might not be in exactly the same domains, but they’re commensurate. We’ve been exploring ways to have them talk to one another on an equal footing, more than as a pre-trained model, if that makes sense.
Q: Are there any instances where you’re utilizing the physics knowledge of your system to somehow change how your surrogate model performs? Is that something that’s possible?
There are lots of examples of that. For instance, we don’t do a lot of continuum simulations, but the people that do fluid dynamics or that sort of work are really excited about automatic differentiability in the differential equations that govern the behavior of their systems, which become learnable. We’ve done a little bit of work on that because our systems follow Hamiltonian dynamics, so we can learn a surrogate for the dynamics: not a surrogate for the Hamiltonian, but a surrogate for the dynamic evolution. That’s something we’re definitely interested in.
Another example: energy is an extensive property. If you have twice the molecule, you’ll get twice the energy. The way people have explored this is making the energy prediction a per-atom partition. When you have a molecule that is made up of a lot of atoms and you want to predict the energy, it’s not a graph-level task, because you don’t want to predict the energy of the molecule directly. There is very exciting work from Tess E. Smidt, who just came to MIT, around symmetry and equivariance. It’s such beautiful work, where adding high-dimensional channels to neural networks suddenly makes the same data tens or hundreds of times more efficient in terms of training models, by having the neural network preserve the same symmetries as our physical reality has.
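Because energy is extensive, one common way to build that into a model is to predict a contribution for each atom and sum them. The sketch below illustrates only that structural idea. The lookup table and its numbers are made up for the example, whereas a real model would learn each atom's contribution from its chemical environment.

```python
# Hypothetical per-atom energy contributions (arbitrary units, made up).
ATOM_ENERGY = {"H": -0.5, "C": -37.8, "O": -75.0}

def predict_energy(atoms):
    """Total energy as a sum of per-atom contributions."""
    return sum(ATOM_ENERGY[a] for a in atoms)

water = ["O", "H", "H"]
two_waters = water + water  # two non-interacting copies of the same molecule

print(predict_energy(water), predict_energy(two_waters))
```

With a sum over atoms, doubling the system doubles the predicted energy by construction, and the result does not depend on the order in which the atoms are listed.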
Q: Can you tell us more about inverse problems and how they show up in your work?
We’ve been talking a lot about forward models: given an input, a model can predict its properties. It’s a surrogate for physics or an experiment. These are really useful tools for invention as long as you have a finite list of candidates to review, but if your design space is open-ended, it’s not really clear how to apply the forward model because you still need somebody to come up with suggestions about what to try.
That’s what we call the inverse design problem in the physical sciences: given the property, what is the material or the molecule that achieves it? Our data structures are permutation invariant. If I re-index which atom is one and which atom is fifteen in my molecule, I should get the same answer. Graph convolutions do that job: they take care of the permutation invariance on the encoding side. But if you want to write a graph, suddenly things get very complicated. So we started doing something else. There is a very friendly representation of molecules based on strings. Language models are good at taking molecules and writing them as text. Throw in transformers, and it’s actually a good solution. It’s not the most elegant, perhaps, in terms of the data structures involved.
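To make the re-indexing point concrete, here is a minimal sketch of a permutation-invariant graph encoding: one round of sum-aggregation message passing followed by a sum readout. The scalar atom features, the 0.5 mixing weight, and the function names are assumptions chosen for illustration and are not taken from any particular graph neural network library.

```python
def graph_readout(features, edges):
    """One round of sum-aggregation message passing, then a sum readout."""
    n = len(features)
    messages = [0.0] * n
    for i, j in edges:  # undirected bonds
        messages[i] += features[j]
        messages[j] += features[i]
    updated = [f + 0.5 * m for f, m in zip(features, messages)]
    return sum(u * u for u in updated)  # order-independent pooling

def relabel(features, edges, perm):
    """Renumber atoms: old index i becomes perm[i]."""
    new_features = [0.0] * len(features)
    for old, new in enumerate(perm):
        new_features[new] = features[old]
    new_edges = [(perm[i], perm[j]) for i, j in edges]
    return new_features, new_edges

# Toy 3-atom chain 0-1-2 with made-up scalar atom features.
features = [1.0, 6.0, 8.0]
edges = [(0, 1), (1, 2)]
perm = [2, 0, 1]

original = graph_readout(features, edges)
permuted = graph_readout(*relabel(features, edges, perm))
print(original, permuted)
```

Both calls return the same value because every step either sums over neighbors or sums over atoms, and neither sum depends on how the atoms are numbered.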
Just today, I was talking with a collaborator about the fact that you need to do data augmentation to tackle the permutation invariance of the graph. You need to show the model the same molecule written in different atom orders, A to B to C and C to A, so that it learns (just like with data augmentation for machine vision) that these strings will eventually be the same molecular graph. It’s a very exciting place. We’ve seen some early hits, and there are a number of companies trying to build inverse molecular design.
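The string augmentation described here can be sketched with a toy writer that walks the same molecular graph from different starting atoms. This is a simplified illustration that assumes an acyclic molecule with single bonds only. The `write_string` helper is hypothetical, and a real pipeline would rely on a cheminformatics toolkit rather than this hand-rolled traversal.

```python
def write_string(symbols, bonds, start):
    """Write a SMILES-like string by depth-first traversal from `start`.

    Assumes an acyclic molecule with single bonds only.
    """
    neighbors = {i: [] for i in range(len(symbols))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def visit(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        text = symbols[atom]
        for child in children[:-1]:  # side branches go in parentheses
            text += "(" + visit(child, atom) + ")"
        if children:                 # the last child continues the main chain
            text += visit(children[-1], atom)
        return text

    return visit(start, None)

# Ethanol heavy atoms: C-C-O
symbols = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
variants = [write_string(symbols, bonds, start) for start in range(3)]
print(variants)
```

All three strings describe the same graph, so training on several of them per molecule is one way to teach a string model that atom ordering carries no meaning.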
Q: With a lot of the ML cross-pollination these days, do you often find ideas at workshops?
The workshops at machine learning conferences are the places where I have the most fun as a scientist, in terms of what’s cooking out there. It’s very hard to get papers into machine learning conferences as full papers. I understand they need to hit a community that is very diverse, there need to be diverse tools that are mathematically sound and there’s a lot of noise in what the community regards as a top accomplishment. The workshops are such an amazing outlet because people move quickly. You present things that are 99% there and you’re going to nail them down in the next few weeks, but you get the conversation going.
Q: What are you looking forward to in 2022?
With in-person conferences coming back, that’s pretty high on the list! On the research side, we have a number of tools coming together. For instance, we’ve got a really nice generative paper on inverse design for 3D molecular structures. The whole field is thinking in a way that I find very exciting about sampling not just graphs, but point clouds. The molecules that we’re trying to generate also have this 3D aspect attached to them. For the community in general, I sense a nice opportunity coming up in the generation of 3D datasets.
From SigOpt, Experiment Exchange is a podcast where we find out how the latest developments in AI are transforming our world. Host and SigOpt Head of Engineering Michael McCourt interviews researchers, industry leaders, and other technical experts about their work in AI, ML, and HPC — asking the hard questions we all want answers to. Subscribe through your favorite podcast platform including Spotify, Google Podcasts, Apple Podcasts, and more!