ML4Sci #10: TLDRs for Science Papers; First Principles Database of Ferroelectrics; Bayesian Optimization for Stanford Linear Accelerator Center Laser

Also, thoughts from Donald Knuth in 1999 about specialization and ML4Sci

Hi, I’m Charles Yang and I’m sharing (roughly) weekly issues about applications of artificial intelligence and machine learning to problems of interest for scientists and engineers.

If you enjoy reading ML4Sci, please hit the ❤️ button above. Or forward it to someone who you think might enjoy reading it!

As COVID-19 continues to spread, let’s all do our part to help protect those who are most vulnerable to this epidemic. Wash your hands frequently (maybe after reading this?), check in on someone (potentially virtually), and continue to practice social distancing.

I recently found out that Donald Knuth, an eminent computer scientist at Stanford, gave a lecture series titled “Things a Computer Scientist Rarely Talks About” in 1999 while at MIT. At the beginning of Lecture 1, he hypothesizes that as the sciences explode in complexity, we will see the emergence not of single-domain specialists but of specialists in two subareas, who connect ideas between two disparate fields. Hopefully this newsletter will serve as a community that connects those who bridge machine learning and scientific domains.

TLDR: Extreme Summarization of Scientific Documents

Published on April 30, 2020

TLDR: too long; didn’t read

Here is the TLDR of this paper, produced by the Allen Institute’s web app: “SCITLDR: TLDR Generation for Scientific Papers with Multitask Finetuning”

The number of machine learning papers continues to grow exponentially, making it difficult for researchers to keep up and separate the wheat from the chaff. I’d imagine the number of COVID-19 papers is seeing similar growth. Researchers at the non-profit Allen Institute for AI (co-founded and funded by the late Paul Allen, co-founder of Microsoft) and the University of Washington introduce a new NLP task, extreme summarization (15-30 tokens) of scientific papers, along with the SciTLDR dataset, which was gathered from OpenReview (which mostly hosts AI conference papers).

To collect this new dataset, which consists of 3,935 articles, they use the OpenReview API to scrape the author-provided TLDRs as well as the peer reviewers’ TLDRs. They fine-tune a pre-trained BART model on their dataset and compare it against a variety of standard NLP baselines. They use the ROUGE metric to measure summarization performance, which basically looks at the overlap between the provided summary and the model-produced summary. As a side note, I think it’s interesting how NLP and generative modeling have produced a whole zoo of evaluation metrics: for example, how do you determine if a generated image is “realistic”? Overall, the paper demonstrates how the rise of powerful pre-trained models and a broad family of NLP datasets, combined with open-source tools like OpenReview, can be used to tackle the rising deluge of papers published in machine learning. (See ML4Sci #1 for more work that shows how scraping scientific literature can help drive materials discovery.)
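To make the “overlap” idea concrete, here is a minimal sketch of an n-gram overlap score in the spirit of ROUGE-N. It is not the official ROUGE implementation and not necessarily the exact variant used in the paper; the function name `rouge_n` and the whitespace tokenization are my own simplifying assumptions.

```python
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Simplified ROUGE-N style score: n-gram overlap between two summaries.

    Assumption: lowercased whitespace tokenization, no stemming or
    stopword handling (real ROUGE tooling offers more options).
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = "the cat sat on the mat"
    generated = "the cat lay on the mat"
    print(rouge_n(gold, generated, n=1))
    print(rouge_n(gold, generated, n=2))
```

Recall asks how much of the reference summary the model recovered, while precision asks how much of the generated summary is supported by the reference. With summaries as short as a TLDR, a single changed word moves both numbers noticeably.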

[arxiv][SciTLDR data][live demo]

An automatically curated first-principles database of ferroelectrics

Published on March 03, 2020

From Nature’s new journal, Scientific Data, which aims to give open-sourcing high-quality data the same prestige as a high-impact research publication.

Researchers from UC Berkeley and Lawrence Berkeley Lab open-source a dataset of ferroelectrics, building on the open-source Materials Project and the Python package pymatgen, in conjunction with the standard (proprietary) VASP Density Functional Theory software, to do high-throughput computational screening of ferroelectric materials. They identify 126 new potential ferroelectric materials - a good demonstration of how open-source materials databases can be used to guide experimental discovery of materials. Importantly, they also validate their computational predictions with experimental results from literature, as shown below.


In the field of materials science, we’ve been seeing efforts like the Materials Project and its associated open-source Python package ecosystem (pymatgen, fireworks, atomate) mature, and they are helping to abstract and accelerate computational materials science. By open-sourcing both data and software, computational materials science is minimizing the amount of redundant parallel work and building a common set of tools for the whole field. In the same way that quantum and particle physics converged on a commonly agreed-upon set of axioms, from which emerged theories that we are still exploring today in particle colliders, and in the same way that TensorFlow, PyTorch, and Keras helped accelerate deep learning research, the field of computational materials science is developing a common set of software practices and data pipelines which can drive experimental exploration. (See ML4Sci #1 for work by the same group that predicts new materials from scraping scientific literature.)


Bayesian Optimization of a Free-Electron Laser🔒

Published on March 25, 2020

Humans are engineering increasingly complex systems: satellites in space, large-scale battery systems, quantum computers, particle colliders. Controlling and modelling such complex, non-linear dynamical systems is increasingly difficult. In this work, coming out of the Stanford Linear Accelerator Center (SLAC) in collaboration with UC Santa Cruz scientists, Gaussian process (GP) optimization is used to fine-tune the massive quadrupole magnets used to align the laser beam and maximize laser pulse energy. Normally, the magnets need to be retuned at least twice a day with a simplex optimization algorithm to prevent hysteresis, resulting in 500 hours/year spent on re-tuning. In this work, the GP algorithm is both significantly faster at inference and demonstrates better performance in both real-life testing and simulations, as shown in the figure below.

Despite all the hype around deep learning, Gaussian processes are still extremely popular in ML4Sci because they provide probabilistic confidence intervals and can incorporate a priori knowledge fairly easily. As a good demonstration of this, the authors use their domain knowledge that the magnets’ behavior is highly correlated to improve the GP’s convergence, varying both the kernel choice and the hyperparameter tuning. It is a good reminder that there exist models besides neural networks that address gaps in deep learning’s abilities which are crucial for scientific endeavors and engineering problems, e.g. interpretability and confidence intervals.
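For readers who have not seen GP-based optimization before, here is a toy, standard-library-only sketch of the general idea: fit a GP to the settings tried so far, then pick the next setting where the predicted mean plus uncertainty is highest. This is not the authors’ code. The one-dimensional `pulse_energy` objective, the RBF kernel length scale, and the upper-confidence-bound rule with `kappa` are all my own illustrative assumptions; the real problem involves many correlated magnets.

```python
import math


def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two scalar settings (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)      # K^-1 y
    v = solve(K, k_star)      # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


def pulse_energy(setting):
    """Hypothetical stand-in for the measured pulse energy; peaks at 0.6."""
    return math.exp(-((setting - 0.6) ** 2) / 0.02)


def optimize(n_iter=12, kappa=2.0):
    """Upper-confidence-bound search over a grid of candidate settings."""
    xs = [0.0, 1.0]                          # two initial measurements
    ys = [pulse_energy(x) for x in xs]
    grid = [i / 100 for i in range(101)]
    for _ in range(n_iter):
        def ucb(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean + kappa * std        # favor high mean or high uncertainty
        x_next = max(grid, key=ucb)
        xs.append(x_next)
        ys.append(pulse_energy(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


if __name__ == "__main__":
    print(optimize())
```

The standard deviation returned by `gp_posterior` is the “confidence interval” mentioned above, and it is what lets the search decide where measuring next is most informative. Prior knowledge enters through the kernel: in this sketch that is just one length scale, whereas the paper’s authors encode correlations between magnets.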


In the News

An enterprise AI company, alongside Microsoft and universities including Berkeley, Princeton, and MIT, launched a “Digital Transformation Institute”. Their first call for proposals focuses on AI techniques to mitigate the COVID-19 pandemic.

OpenAI releases Jukebox, a neural network trained on raw audio that generates music, including rudimentary singing. This builds off their previous work, MuseNet, which took in MIDI files as input and didn’t have vocal parts.

Facebook open-sources Blender, a massive 9.4B parameter chatbot. They include some impressive chatbot dialogue. I certainly would not have been able to tell the difference between the chatbot and a human’s response.

Google blog post on applying reinforcement learning to chip design [arxiv]

Update from the folks over at OpenClimateFix on their Solar Nowcasting and PV mapping projects. Always exciting to see ML being applied to concrete, real-world problems, particularly ones as pressing as climate change.

Thanks for Reading!

I hope you’re as excited as I am about the future of machine learning for solving exciting problems in science. You can find the archive of all past issues here and click here to subscribe to the newsletter.

Have any questions, feedback, or suggestions for articles? Contact me on Twitter @charlesxjyang