Representation Learning for Materials Science with MIT

Luis Bermudez
Advanced Optimization Techniques, Applied AI Insights, Convolutional Neural Networks, Graph Neural Networks, Hyperparameter Optimization, Machine Learning, Training & Tuning

SigOpt hosted our first user conference, the SigOpt AI & HPC Summit, on Tuesday, November 16, 2021. It was virtual and free to attend, and you can still access content from the event. More than for any other reason, we were excited to host this Summit to showcase the great work of some of SigOpt’s customers. Today, we share some lessons from Rafael Gomez at the Massachusetts Institute of Technology (MIT) on how his group leveraged Machine Learning and Hyperparameter Optimization in their Materials Science research.

The goal of MIT’s research was to develop new materials. Novel materials need to be developed for society and they need to be developed quickly. It takes about 20 years between the discovery of a new material and its entry into the market. We need new materials for healthcare, sustainability, carbon storage, and more. Rafael Gomez shares his process for speeding up the discovery of these novel materials.

Q: How has Machine Learning helped your Materials Science Research?

Computers have taken over tasks that used to be exclusive to people, tasks that used to be really hard and required deep expertise. No one goes to a travel agent anymore. Typically, we’ve seen that the clearer the rules are, the easier it is for a computer to take over. And we’ve seen that with the advances in algorithms, the scale of the data that we have, and the hardware that we use for these operations, computers have increasingly outperformed people.

And of course, the questions that come in this context are: How does this apply to material science and engineering? How do we invent new materials using computers? How do we do something that people have been doing for a while now? And what is the tipping point? What does this AI supremacy look like? 

And one of the first questions that comes with this is whether we want to outperform the best person ever, like with AlphaGo, or whether we just want to commoditize tasks that used to be really hard and really expensive, so that they become instantaneous and very efficient to do on a computer.

Like I said, there are explicit rules sometimes in the physical sciences. We do have laws that are more or less universal and allow us to extrapolate, but we also have data that is typically sparse and domain-specific. And we postulate that computers can help across this continuum. I don’t think there are a lot of computational physical scientists in the audience, so I won’t dive too deep into this. But we postulate that there is a continuum between physics (first-principles simulations) at one end, which is great because it extrapolates and holds universally, as long as we put in the right physics that we know should be happening, but is typically expensive to simulate. And like I said, if we don’t know what the nature of the phenomenon is, then we can’t really use this tool. At the opposite end of the spectrum, there is machine learning and data science: a big set of algorithms that have worked very, very well in other domains, that leverage big datasets, and that are pretty much black boxes that have shown uncanny performance.

And of course, AlphaFold is a huge example in the physical sciences. A task that seemed absolutely impossible a few years back was solved through data and some understanding of the phenomenon, but mostly through algorithmic innovations.

Our group works precisely in this continuum. Our work connects the first principles to machine learning and tries to mix them up in the most effective way, so they feed off of each other, and we increase the robustness of machine learning, and we reduce the data needs of machine learning through the first principles.

Q: How has Representation Learning helped your Materials Science Research?

The first problem that end-to-end representation learning helped with was: How do we input matter into machine learning algorithms? Traditionally, this has been called structure-property relationships in the physical sciences: How do I connect the structure of a material (the structure of a molecule) with its properties? Its mirror image in learning is representation learning: What are the transformations that we need to apply to our raw input so that it becomes a vector we can feed into our machine learning models? Traditional approaches have worked well here, to some degree, particularly because datasets in this field are typically small, not the scale at which one typically uses deep learning.

But over the last few years, the libraries have become big enough where we can attempt to apply representation learning over the full chemical space. Furthermore, we can abandon hand engineered features just like the machine vision community did about a decade ago. They moved to using deep convolutional neural networks and learned to process pixels all the way up to full images.

Q: How have Graph Convolutional Neural Networks helped your Materials Science Research?

Graph Convolutional Neural Networks helped us connect the structure of a molecule with its properties. Traditional approaches have worked well in the past, but can require a high degree of hand-engineering for each feature. A Graph Convolutional Neural Network aggregates information across local graph environments, so that it learns how to represent every atom optimally for a given task.
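The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the actual architecture from the talk: the atom features, bond list, and weight matrix are made up, and a real model stacks many such layers with learned weights.

```python
import numpy as np

def graph_conv_step(atom_features, neighbors, weight):
    """One graph-convolution step: update each atom's representation
    from its local bonded neighborhood."""
    updated = []
    for i, feats in enumerate(atom_features):
        # Sum this atom's features with those of its bonded neighbors ...
        agg = feats + sum(atom_features[j] for j in neighbors[i])
        # ... then apply a linear map and a ReLU nonlinearity.
        updated.append(np.maximum(agg @ weight, 0.0))
    return np.stack(updated)

# Toy "molecule": 3 atoms in a chain, each with a 4-dim feature vector.
rng = np.random.default_rng(0)
atoms = rng.normal(size=(3, 4))
bonds = {0: [1], 1: [0, 2], 2: [1]}
W = rng.normal(size=(4, 4))

out = graph_conv_step(atoms, bonds, W)
print(out.shape)  # one updated 4-dim vector per atom: (3, 4)
```

Repeating this step lets information flow further across the graph, which is how the network learns a task-specific representation of every atom.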

Q: Why is Color Prediction important for Material Science Research?

Graph Convolutional Neural Networks are the state-of-the-art architecture that we use to predict the color of a molecule. This problem is extremely visual. It is possible to calculate what the color of a molecule is going to be with pure physics, and that’s really important for making solar cells. It’s also important for making medical imaging dyes, and for a type of medical therapy called phototherapy, which leverages the interaction of a molecule with light to help diagnose or treat a disease. So it’s really important to be able to predict a molecule’s color. You can see in the plot (below) that physics is good at this.

There’s a technique called time-dependent density-functional theory (TD-DFT) that has an error of about 25 nanometers, and that’s pretty much the difference between two successive colors in the rainbow. The difference between red and orange is about 20 nanometers, and the difference between orange and yellow is also about 20 nanometers. So physics gives you plus or minus one color in the rainbow.
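The “plus or minus one color” point can be made concrete with a quick sketch. The band edges below are approximate textbook values, not from the talk:

```python
# With adjacent color bands roughly 20 nm apart, a 25 nm prediction error
# can shift a molecule into the neighboring band of the rainbow.

BANDS = [  # (name, lower edge, upper edge) in nm, approximate
    ("violet", 380, 450), ("blue", 450, 495), ("green", 495, 570),
    ("yellow", 570, 590), ("orange", 590, 620), ("red", 620, 750),
]

def color_band(wavelength_nm):
    for name, lo, hi in BANDS:
        if lo <= wavelength_nm < hi:
            return name
    return "outside visible range"

true_peak = 600.0   # a hypothetical orange absorber
print(color_band(true_peak))          # orange
print(color_band(true_peak + 25.0))   # red: the TD-DFT error moves it one band over
print(color_band(true_peak - 7.0))    # orange: a ~7 nm error stays in the right band
```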

Q: How do Graph Neural Networks help with Color Prediction?

So this is a place where we’ve used a multi-fidelity Graph Neural Network that tries to predict the outcome of the physical simulation from the molecular structure. And that gives us some regularization and some general stability.

Experiments are expensive, but these simulations are one hundred to a thousand times cheaper than actually making a molecule, so we can cast a much wider net. And through this multi-fidelity approach, we’ve managed to get to about seven nanometers. And so we hit the color right.

We may not compute the hue perfectly, but if the color is going to be orange, we do know there will be some red or yellow in it, thanks to integrating physics-based relationships with end-to-end representation learning in a single stack.

Q: How does Hyperparameter Optimization help Materials Science Research?

Where does hyperparameter optimization come in? Well, this is a constant in my group. We take on a new task that has something to do with matter. At the beginning, nothing works: the students are plugging pieces together, and things don’t work. Then it finally works, and they are training well end to end. And when they’re finally training, they realize that the models are not performing well.

The optimization continues to improve as it pushes the metric below the baseline values.

This happens every time. It finally plugs in, but the performance is not great. And we always do the same thing: it has now become a hyperparameter problem, so we take a step back and let the machine solve it. I’m going to show a couple of examples like this. You can see above that the first experiment we did, with sort of random default parameters, had an error of 50 nanometers from the graph convolutions. That was worse than the theory, similar to two and a half colors over the rainbow. Then it slowly improves to 25 or 30 nanometers, which is the same as the physics-based simulation without training. And finally, it gets to less than 20 nanometers, with about 10 hyperparameters tuned, achieving the correct color. So this is the place where we can take a step back and let SigOpt’s Hyperparameter Optimization figure out the optimal settings. These are typically new models for new tasks, so there’s very little intuition about hyperparameters. It’s difficult to know what the correct hyperparameters should be, or how many layers to compose together. None of this has really been standardized.
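The “take a step back and let the machine solve it” workflow boils down to a suggest-evaluate-report loop. The sketch below uses plain random search with a made-up objective standing in for validation error; it is not SigOpt’s API, and a real optimizer like SigOpt proposes each point adaptively from the previous observations:

```python
import random

random.seed(0)
# Hypothetical search space for illustration.
SPACE = {"learning_rate": (1e-4, 1e-1), "hidden_dim": (32, 512)}

def suggest():
    """Stand-in for an optimizer's suggestion: sample uniformly at random."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in SPACE.items()}

def evaluate(params):
    """Made-up objective: pretend error is minimized near lr=0.01, dim=256."""
    return (abs(params["learning_rate"] - 0.01) * 100
            + abs(params["hidden_dim"] - 256) / 64)

best = None
for _ in range(50):               # the suggest -> evaluate -> report loop
    params = suggest()
    error = evaluate(params)
    if best is None or error < best[1]:
        best = (params, error)

print(round(best[1], 3))          # far below an untuned default's error
```

The pattern mirrors the experience described above: the first suggestions are no better than defaults, and the best observed configuration steadily improves as the loop runs.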

Q: How does Representation Learning help with Combinatorial Synthesis of Peptide Arrays?

This is on therapeutics. This class of therapeutics, called peptides, is really excellent for, among other things, delivering biological therapeutics into the cell. There is a class of biological therapeutics that is important for really devastating diseases, like Duchenne muscular dystrophy, whose patients have a life expectancy of under 30. And there is a therapy that can help with that, but these therapeutic macromolecules really struggle to keep their biological state and to be carried into the cell. So a cell-penetrating peptide can be attached to the therapeutic to help carry it. From there, you can systematically quantify how good a cell-penetrating peptide is at delivering these biological payloads.

So this is essentially a problem of exploring the combinations of 20 amino acids, maybe a few more than 20 if we’re allowed to use artificial chemistries, which in our case we are, to find the optimal performing one. We need to search a combinatorial space of 20 to the power of the length of the peptide, which can be 30, 40, or 50. That’s more than the number of atoms on Earth, and we need to find the best one. So what we do is train another class of representation learning algorithms.
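The scale of that search space is easy to check. With estimates of the number of atoms on Earth around 10^50, the longer peptide lengths really do exceed it:

```python
# Size of the peptide search space: 20 amino acid choices at each position.
ATOMS_ON_EARTH = 1.33e50  # common rough estimate

for length in (30, 40, 50):
    space = 20 ** length
    print(f"20^{length} = {space:.2e}  exceeds atoms on Earth: {space > ATOMS_ON_EARTH}")
```

Exhaustive search is hopeless at this scale, which is why the group turns to learned representations and optimization instead.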

Here we leverage the sequential nature of the chain. We encoded each monomer with a cheminformatics fingerprint and then applied a 1D convolution. We’re not doing representation learning end to end here, because we don’t have enough data. So we use traditional representations for every monomer in the sequence, string together all these individual monomers into a sequence representation for the full peptide, and then apply a 1D convolution to that.
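The encoding described above can be sketched as follows. The fingerprints here are random stand-ins for real cheminformatics descriptors, and the convolution filters would be learned in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
FP_DIM = 8  # fingerprint length (illustrative)
# One fixed fingerprint vector per standard amino acid.
fingerprints = {aa: rng.normal(size=FP_DIM) for aa in "ACDEFGHIKLMNPQRSTVWY"}

def encode(peptide):
    """Stack per-monomer fingerprints into a (length, FP_DIM) matrix."""
    return np.stack([fingerprints[aa] for aa in peptide])

def conv1d(seq_matrix, filters, width=3):
    """Valid 1D convolution sliding along the sequence axis, with ReLU."""
    n_windows = seq_matrix.shape[0] - width + 1
    windows = np.stack([seq_matrix[i:i + width].ravel() for i in range(n_windows)])
    return np.maximum(windows @ filters, 0.0)

filters = rng.normal(size=(3 * FP_DIM, 16))   # 16 feature maps, width-3 windows
features = conv1d(encode("ACDKLMNP"), filters)
print(features.shape)  # (6, 16): one feature vector per sequence window
```

The key design choice is that only the convolution is learned; the per-monomer representation stays fixed, which keeps the data requirements low.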

Q: How does Hyperparameter Optimization help with Combinatorial Synthesis of Peptide Arrays?

This goes back to our architectural choices. Representation learning gives us a whole class of models whose behavior we don’t really know yet, and hyperparameter optimization helps us answer the question: What are the settings that will make it work?

And in this case, you can see in the histogram at the top right that every peptide we invented with this algorithm was better than the best in the training data. This was a one-shot experiment, without even active learning. We just did it. Out of the 14 peptides we predicted, two were controls, and 12 were better than the best that had ever been made, based on this ability to learn over macromolecular representations.

Results were only as good as a random forest until MIT used SigOpt.

And again, this was an example with a new class of models and a new class of tasks, where we don’t really know how we should fine-tune things. You can see again that when the students started, they were only as good as a random forest (see plot above), so not very good. Just by taking a step back and letting the machine run and automatically tell us which experiments we should be running, we were able to achieve state of the art and predict better than ever before. And you can even validate these SigOpt results experimentally in the lab, to find what the best therapeutic peptides should be.


To learn more about how Hyperparameter Optimization improved the Materials Science research at MIT, I encourage you to watch the talk. To see if SigOpt can drive similar results for you and your team, sign up to use it for free.

Luis Bermudez, AI Developer Advocate