ML4Sci #1: Discovering new materials from abstracts; Designing diffractive metagratings with GANs; How ML can help fight climate change; +Industry Highlight

Welcome to the inaugural issue of ML4Sci!

Hi, I’m Charles Yang and I’m sharing (roughly) monthly issues that describe various applications of artificial intelligence and machine learning in the scientific community. If you’re interested in receiving this newsletter or if you know someone who would, subscribe here.

Unsupervised NLP on scientific abstracts is able to discover new materials

Using 3.3M scientific abstracts mined from materials science journals, researchers at Lawrence Berkeley National Lab and UC Berkeley trained a Word2Vec model to learn embeddings of materials science terms. They showed that standard vector operations on word embeddings (projection, addition, subtraction) match the intuition a materials scientist would have. For instance, their word embeddings were able to complete a test analogy with the correct term (antiferromagnetic).
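To make the vector arithmetic concrete, here is a minimal Python sketch of analogy completion ("a is to b as c is to ?"). The 3-dimensional embeddings are made-up toy values and the analogy words are my own illustrative choice (NiFe is a ferromagnet, IrMn an antiferromagnet). This is not the authors' model, whose vectors are learned from the abstracts and have far more dimensions.

```python
import math

# Toy embeddings, invented for illustration only (assumption, not real data).
EMBEDDINGS = {
    "NiFe": [1.0, 0.0, 0.2],
    "ferromagnetic": [1.0, 1.0, 0.2],
    "IrMn": [0.0, 0.0, 1.0],
    "antiferromagnetic": [0.0, 1.0, 1.0],
    "thermoelectric": [0.5, -1.0, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def complete_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words so the answer is a new term.
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


if __name__ == "__main__":
    print(complete_analogy("NiFe", "ferromagnetic", "IrMn", EMBEDDINGS))
```

With real embeddings the same subtraction and addition is applied to learned vectors, and the nearest neighbor is searched over the whole vocabulary rather than a five-word dictionary.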


They also show that the cosine similarity between a material (e.g. Bi2Te3) and a property (e.g. thermoelectric) is a good predictor of actual performance, as determined by DFT calculations. Most fascinatingly, they retrained their model on historical abstracts and showed that word embeddings were able to predict material-property pairs that were not present in the literature at the time. For instance, CuGaTe2 is one of the best thermoelectrics known today and was first reported in 2012, but a Word2Vec model trained on abstracts published before 2009 would have ranked CuGaTe2 as one of the top 5 materials for thermoelectrics - 4 years before it was experimentally tested!
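The ranking step can be sketched in a few lines of Python. Everything here is an assumption for illustration: the vectors are made-up toy values, and `rank_materials` and its `already_known` argument are hypothetical names, not the authors' code. The idea is simply to score each material by its cosine similarity to the property word and skip materials already reported for that property.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_materials(property_vec, material_vecs, already_known=(), top_k=5):
    """Rank materials by similarity to a property, skipping known ones."""
    scored = [
        (name, cosine(vec, property_vec))
        for name, vec in material_vecs.items()
        if name not in already_known
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only (assumption, not real data).
THERMOELECTRIC = [0.9, 0.1, 0.4]
MATERIALS = {
    "Bi2Te3": [0.8, 0.2, 0.5],
    "CuGaTe2": [0.9, 0.0, 0.3],
    "NaCl": [-0.2, 0.9, 0.1],
    "SiO2": [0.1, 0.8, -0.3],
}

if __name__ == "__main__":
    # Pretend Bi2Te3 is already a well-studied thermoelectric in the corpus.
    for name, score in rank_materials(THERMOELECTRIC, MATERIALS, {"Bi2Te3"}):
        print(f"{name}: {score:.3f}")
```

The historical experiment then amounts to training the embeddings only on abstracts up to a cutoff year and checking whether the top-ranked unknown materials were later reported.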

They also showed that Word2Vec trained on Wikipedia articles, a much larger corpus, does not perform well at predicting material properties. It seems more data is not always better, and we still need materials scientists to interpret model predictions in the context of their field (just imagine the process of tokenizing molecules and chemical compounds, many of which can be written in different ways).
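As a small illustration of the tokenization problem, here is a Python sketch that maps different spellings of the same simple formula to one canonical token. This is my own simplified normalization (alphabetical element order, counts reduced by their greatest common divisor), not the preprocessing used in the paper, and it deliberately ignores parentheses, charges and fractional stoichiometry.

```python
import re
from collections import Counter
from functools import reduce
from math import gcd

# One element symbol (e.g. "Te") followed by an optional integer count.
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")
SIMPLE_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def normalize_formula(formula):
    """Return a canonical token for a simple formula such as 'Te3Bi2'."""
    if not SIMPLE_FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")

    counts = Counter()
    for element, number in ELEMENT_COUNT.findall(formula):
        counts[element] += int(number) if number else 1

    # Reduce to the smallest integer ratio, e.g. Bi4Te6 -> Bi2Te3.
    divisor = reduce(gcd, counts.values())

    parts = []
    for element in sorted(counts):
        n = counts[element] // divisor
        parts.append(element if n == 1 else f"{element}{n}")
    return "".join(parts)


if __name__ == "__main__":
    for raw in ["Bi2Te3", "Te3Bi2", "Bi4Te6", "GaCuTe2"]:
        print(raw, "->", normalize_formula(raw))
```

Alphabetical ordering is only one possible convention. The point is that all spellings of a compound should collapse to a single vocabulary entry before training embeddings.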

The growing computational materials community has begun developing large, open-source, high-quality datasets such as the Materials Project and Matscholar. Such repositories are critical because they help drive the progress and iterative development of machine learning models for specific scientific domains. I hope to cover these resources more extensively in the coming issues.


Designing Diffractive Metagratings with GANs

Metasurfaces are used in a variety of applications for manipulating light. However, traditional metasurface design requires slow, iterative optimization with computationally intensive numerical solvers [this is a very common problem setup in ML4Sci: traditional numerical models (molecular dynamics, density functional theory, finite element, optical physics solvers, etc.) are terribly slow and we need a better way to explore large design spaces, namely machine learning. Expect to hear more about this problem setup in a variety of fields]. In this work, Jonathan Fan’s lab at Stanford uses a conditional GAN (cGAN) trained on a small initial set of high-performance designs generated by adjoint-based optimization. The cGAN then generates an order of magnitude more samples, which are in turn refined with adjoint-based optimization.

This work highlights several aspects of ML4Sci that differentiate it from traditional ML. Although the cGAN is built from standard computer vision models, it can be combined with traditional optimization techniques, here adjoint-based optimization. The GAN is no longer the final solution but rather a good initialization technique that reduces the time to find novel designs by orders of magnitude. There is no real analogue of this for GANs in computer vision - we have no way of “optimizing” or refining faces generated by GANs. This is a good example of how traditional scientific techniques are not discarded or abandoned, but are used in conjunction with ML techniques.

Another important idea introduced in this paper is that the refined, adjoint-optimized outputs of the cGAN can be fed back in as training data, allowing for a “second-generation” cGAN. There is no reason in principle this can’t be repeated, allowing for many generations of cGANs, which could be progressively grown in size as the training dataset scales. This is notably different from what is normally meant by Progressive Growing of GANs - we’re not changing the image resolution; rather, it is the size of the dataset that changes. This is only possible because of a key differentiator of inverse-design problems in science: we can usually generate many initial designs essentially for free, as opposed to computer vision problems, where we cannot just generate more images of faces whenever we want. The Fan group has already published an arXiv preprint of this exact idea here.
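The generate, refine, retrain loop can be sketched with a deliberately tiny stand-in. In this Python toy, which is entirely my own assumption and not the Fan group's code, a design is a single number, a fitted Gaussian plays the role of the cGAN, and a few gradient-ascent steps on a made-up performance function play the role of adjoint-based optimization. All names (`performance`, `refine`, `run_generations`) are hypothetical.

```python
import random
import statistics

OPTIMUM = 2.0  # hypothetical best design parameter for the toy objective


def performance(x):
    """Toy figure of merit: highest (zero) at the optimum."""
    return -((x - OPTIMUM) ** 2)


def gradient(x):
    """Derivative of the toy objective (stand-in for an adjoint gradient)."""
    return -2.0 * (x - OPTIMUM)


def refine(x, steps=3, learning_rate=0.1):
    """A few gradient-ascent steps: the 'local optimizer' in this sketch."""
    for _ in range(steps):
        x += learning_rate * gradient(x)
    return x


def fit_generator(designs):
    """Stand-in for training a cGAN: fit a Gaussian to the training set."""
    return statistics.mean(designs), statistics.stdev(designs)


def sample(generator, n, rng):
    """Draw n new candidate designs from the fitted generator."""
    mean, std = generator
    return [rng.gauss(mean, std) for _ in range(n)]


def run_generations(initial_designs, generations=3, samples_per_gen=50, seed=0):
    """Each generation: fit, sample, refine, then grow the training set."""
    rng = random.Random(seed)
    training_set = list(initial_designs)
    best_so_far = []
    for _ in range(generations):
        generator = fit_generator(training_set)
        candidates = sample(generator, samples_per_gen, rng)
        training_set.extend(refine(x) for x in candidates)
        best_so_far.append(max(performance(x) for x in training_set))
    return training_set, best_so_far


if __name__ == "__main__":
    designs, history = run_generations([0.0, 0.5, 1.0])
    print(len(designs), history)
```

What grows from one generation to the next is the training set, not the output resolution, which is the distinction drawn above.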

Tackling Climate Change with ML

Climate change is one of the most daunting and frightening issues of our time. This report by an interdisciplinary team of machine learning experts from both industry and academia highlights how ML can be used to fight climate change. They identify 13 climate change domains and, for each domain, the ML areas that could be applied, e.g. transfer learning, NLP, RL and control. Within each domain, they identify sub-areas of research and how ML can be applied to each, with convenient labels such as High-Leverage and Long-Term.

Of course, the work doesn’t stop with just a report! They’ve set up a website, with an interactive version of the paper, a newsletter, a schedule of future events at various conferences, and a FAQ. I would definitely recommend reading this report if you’re interested in learning more about the various impactful ways ML can be used to help save Planet Earth.

If you want to move beyond research and can’t wait to begin implementing models that can be used in the real world, I’d suggest checking out Open Climate Fix, a fully open-source non-profit recently founded by Jack Kelly, a former DeepMind research engineer who’s now applying computer science to climate change.


Industry Highlights: Optimizing Heat Exchanger Design

Heat exchangers are a critical component in cooling electrical systems, as well as in HVAC (both of which are critical areas for reducing overall energy consumption). By combining advanced manufacturing techniques (3D printing) with advanced computer simulations, nTopology is able to design and manufacture more efficient heat exchangers for electronics. Opening up the design space with 3D printing, then using optimization techniques to efficiently explore that enlarged design space, is a good example of the synergy between advanced manufacturing and novel ML techniques for new products. Expect to see many more startups like nTopology, which use novel ML-based discovery workflows, take on larger competitors in every manufacturing vertical.

Thank You for Reading!

I hope you’re as excited as I am about the future of machine learning for solving exciting problems in science. You can find the archive of all past issues here and click here to subscribe to the newsletter.

Have any questions, feedback, or suggestions for articles? Contact me at or on Twitter @charlesxjyang