ML4Sci #32: Finding papers with SemanticScholar
Using recommendation feeds to find papers on polymer screening, discovering semiconductors, and predicting planetary system dynamics
Hi, I’m Charles Yang and I’m sharing (roughly) weekly issues about applications of artificial intelligence and machine learning to problems of interest for scientists and engineers.
If you enjoy reading ML4Sci, send us a ❤️. Or forward it to someone who you think might enjoy it!
As COVID-19 continues to spread, let’s all do our part to help protect those who are most vulnerable to this epidemic. Wash your hands frequently (maybe after reading this?), wear a mask, check in on someone (potentially virtually), and continue to practice social distancing.
I’ve recently started using the Allen Institute for AI’s new tool, Semantic Scholar, which is basically a much better version of Google Scholar. It includes a feature called Research Feeds, a recommendation tool you can customize by “liking” papers. I created my own feed based on articles I’ve shared in ML4Sci - here are some papers I selected from my recommendation feed this week.
Accelerating the screening of amorphous polymer electrolytes by learning to reduce random and systematic errors in molecular dynamics simulations
Published Jan 13, 2021
Polymer electrolytes are one of the possible candidates for next-generation solid-state electrolytes for Li-ion batteries. Molecular dynamics (MD) is the standard paradigm for simulating the properties of polymer electrolytes, but the accuracy of the simulated properties scales with the relaxation time - and so, unfortunately, does the amount of compute required. The accuracy of the final configuration also depends on the random initialization of the polymer system. To tackle both of these problems, this work trains a graph neural network on a small dataset of long-relaxation-time MD simulations and then uses it to error-correct a much larger dataset of short-relaxation-time MD simulations, building an even larger dataset on which to retrain the model. There are also some interesting design choices that constrain the dataset generation to ensure the synthesizability of any candidates, demonstrating how domain knowledge affects every stage of the ML design pipeline.
[arxiv][ML4Molecules Paper@NeurIPS 2020]
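To make the bootstrapping idea concrete, here’s a toy sketch of the loop. This is my own construction, not the paper’s code: a random forest stands in for the GNN and fake “simulators” stand in for MD.

```python
# Sketch of the error-correction bootstrapping idea described above (not the
# authors' code): a model trained on a few expensive "long" simulations is used
# to correct many cheap "short" simulations, growing the training set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # stand-in for the GNN

rng = np.random.default_rng(0)

def true_property(x):
    # Hidden "ground truth" property of a polymer, for this toy example only
    return np.sin(x).sum()

def short_md(x):
    # Cheap, short-relaxation-time simulation: biased and noisy estimate
    return true_property(x) + 0.5 + rng.normal(0, 0.3)

def long_md(x):
    # Expensive, long-relaxation-time simulation: nearly converged estimate
    return true_property(x) + rng.normal(0, 0.05)

candidates = [rng.uniform(-1, 1, size=8) for _ in range(1000)]

# 1. Small set of expensive simulations paired with their cheap counterparts
small = candidates[:50]
X_small = np.array([[short_md(x)] + list(x) for x in small])
y_small = np.array([long_md(x) for x in small])

# 2. Learn to map (cheap estimate, structure) -> converged property
corrector = RandomForestRegressor().fit(X_small, y_small)

# 3. Error-correct a much larger set of cheap simulations
large = candidates[50:]
X_large = np.array([[short_md(x)] + list(x) for x in large])
y_pseudo = corrector.predict(X_large)

# 4. Retrain on the combined dataset of true + corrected labels
final_model = RandomForestRegressor().fit(
    np.vstack([X_small, X_large]), np.concatenate([y_small, y_pseudo])
)
```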
Interpretable discovery of new semiconductors with machine learning
Published Jan 21, 2021
A group of Canadian researchers (from UToronto, the Vector Institute, and the National Research Council) used AI to search for ternary semiconductors. A graph neural network is trained on material properties from the Materials Project, and an evolutionary algorithm uses the GNN as a surrogate model to search for ternary crystals with the desired UV-range band gap. The interpretability in this paper is not intrinsic to the model; instead, it seems to come from a data-driven approach to discovering learned feature invariances. Most significantly, they synthesize the best candidate crystal, K2CuCl3, and show a reasonable match between the predicted and experimentally measured properties. As someone who is not an experimentalist, I’m always really impressed whenever someone takes AI predictions and actually makes them.
Note that this paper and the previous one both use graph neural networks (and both use the PyTorch Geometric framework).
[arxiv]
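For intuition, here’s a minimal sketch of surrogate-guided evolutionary search in the spirit of the paper above. The “surrogate” is a toy function standing in for the trained GNN, and candidates are plain feature vectors rather than real crystal structures - this is an illustration of the search loop, not the authors’ pipeline.

```python
# Minimal sketch of surrogate-guided evolutionary search: a cheap surrogate
# model scores candidates, and an evolutionary loop optimizes toward a target
# band gap. Everything here is a toy stand-in for the real pipeline.
import numpy as np

rng = np.random.default_rng(0)
TARGET_GAP = 3.5  # desired UV-range band gap in eV (illustrative value)

def surrogate_band_gap(x):
    # Stand-in for a GNN trained on Materials Project band gaps
    return 4.0 * np.abs(np.tanh(x)).mean()

def fitness(x):
    # Higher is better: penalize distance from the target band gap
    return -abs(surrogate_band_gap(x) - TARGET_GAP)

population = [rng.normal(size=16) for _ in range(64)]
for generation in range(50):
    # Select the fittest half as parents
    population.sort(key=fitness, reverse=True)
    parents = population[:32]
    # Produce children by crossover + mutation
    children = []
    for _ in range(32):
        a, b = rng.choice(len(parents), size=2, replace=False)
        mask = rng.random(16) < 0.5
        child = np.where(mask, parents[a], parents[b]) + rng.normal(0, 0.1, 16)
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
print("best predicted band gap:", surrogate_band_gap(best))
```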
A Bayesian neural network predicts the dissolution of compact planetary systems
Published Jan 11, 2021
Like fluids, planetary systems are chaotic physical systems that are difficult to forecast. This work uses Bayesian neural networks to predict log-normal distributions over the stability (dissolution times) of compact planetary systems. In fact, they use a pair of neural networks: one to transform the planetary-system time series into features and another to extract the predicted distribution of dissolution times. Several interesting things about this paper: 1. they leverage the fact that the pair of neural networks are coupled to run feature importance by backpropagating gradients all the way back to the inputs
2. they model Bayesian distributions with a method called MultiSWAG (coolest ML name yet), which essentially takes a trained NN and does a random walk in the converged loss landscape; doing this repeatedly creates multiple converged models to sample from (a minimal sketch follows this list)
3. a novel pooling method, based on domain knowledge, is used on the time-series features
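Here’s the promised sketch of the (Multi)SWAG idea from point 2: after a network has converged, keep running SGD with a constant learning rate, record weight snapshots, fit a Gaussian to them, and ensemble predictions from weight samples; repeating this from several independently trained networks gives MultiSWAG. This is my own simplified PyTorch illustration, not the paper’s implementation.

```python
# Simplified illustration of SWAG-style weight averaging and sampling.
# In practice you would first train the model to convergence, and MultiSWAG
# repeats this whole procedure over several independently trained models.
import torch
import torch.nn as nn

def swag_collect(model, loss_fn, data, n_snapshots=20, lr=0.01):
    """Collect weight snapshots from SGD around an (assumed) converged model."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    snapshots = []
    for _ in range(n_snapshots):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        # Flatten all weights into one vector and store the snapshot
        snapshots.append(
            torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
        )
    stacked = torch.stack(snapshots)
    return stacked.mean(0), stacked.std(0)  # diagonal Gaussian over weights

def swag_predict(model, mean, std, x, n_samples=30):
    """Average predictions over weight samples drawn from the fitted Gaussian."""
    preds = []
    for _ in range(n_samples):
        sample = mean + std * torch.randn_like(std)
        torch.nn.utils.vector_to_parameters(sample, model.parameters())
        preds.append(model(x).detach())
    return torch.stack(preds).mean(0)

# Toy usage: one SWAG run on a tiny random regression problem
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
data = [(torch.randn(64, 4), torch.randn(64, 1))]
mean, std = swag_collect(model, nn.MSELoss(), data)
print(swag_predict(model, mean, std, torch.randn(5, 4)))
```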
Department of Machine Learning
🖼️new state-of-the-art on ImageNet with “Meta Pseudo Labels”. SOTA-chasing on ImageNet is probably not actually useful in any meaningful way now, but the ideas in this paper, which uses a student-teacher architecture, are interesting
The Jan 2021 quarterly Montreal AI Ethics Institute (MAIEI) State of AI Ethics Report, with a focus section on Google’s treatment of Timnit Gebru [h/t Abhishek Gupta]
📚Open-sourced Deep Learning course @NYU taught by Yann LeCun (also @Facebook) and Alfredo Canziani
“Practical tip for getting RL algorithms to work” alt titles include “how to coax your RL algorithm into converging”, “practical advice for dealing with petulant RL agents” (in all seriousness, these kinds of blogs are really helpful, I just find it telling that they are necessary….we really don’t know why anything works)
🤖“How to Train Your Robot with Deep Reinforcement Learning – Lessons We’ve Learned” from Sergey Levine and Google. A nice overview of RL for Robotics, case studies, and overview of challenges
Productionized ML: Bing search engine uses ML for spell-check in >100 languages
Near-Future Science
🌌A review of how ML is “enhancing gravitational wave science”
🔒Genetic tuning of bacteria to produce different iron oxide nanoparticle morphologies - no ML, just cool science using bacteria to produce different iron oxide nanoparticles. Unfortunately, it requires institutional access
🌊Google-watch: new preprint from Google research on using ML to accelerate computational fluid dynamics. Particularly interesting because the model is posed in such a way that it reduces the spatial resolution required, making it more general than a standard get-data-and-fit approach that is constrained to the underlying dataset.
💓Also, you can now use the camera in a Google Pixel phone to measure heart rate and respiratory rate. Pretty crazy that something that was a niche topic a couple of years ago (measuring personal health from consumer cameras, not infrared) is now in production
🌎EarthSpecies: a non-profit working on “decoding animal communication”. Check out their GitHub to see how they’re using ideas from cross-lingual translation in NLP
From Scientific Data journal: “Quantum chemical benchmark databases of gold-standard dimer interaction energies”. 370K dimer geometries train a model which labels 5M more dimers, all of which are open-sourced. Exciting to see people releasing large-scale datasets labelled by trained ML models
📚[Chemrxiv] Intro to ML for Chemists: An undergrad course using python notebooks [github]
The Science of Science
💬From [Reddit]: “Scientists of Reddit, can the science paper evolve?”
🧠new arxiv preprint: quantifying the brain drain of ML academics into tech
🌎Out in the World of Tech
The geopolitical intrigue around semiconductor chips continues:
An increasingly compute-driven world means whoever controls the chips controls the world. In addition, the shift from general-purpose compute designs to application-specific chip designs (e.g. AI hardware accelerators) means there is plenty of opportunity for someone to capture the next generation of hardware manufacturing. Bharath Ramsundar has a nice mini-series @DeepForestSci on the technical details of semiconductor manufacturing
Policy and Regulation
🇨🇳[RestOfWorld] China is moving towards a separate open-source ecosystem that’s independent of GitHub
Thanks for Reading!
I hope you’re as excited as I am about the future of machine learning for solving exciting problems in science. You can find the archive of all past issues here and click here to subscribe to the newsletter.
Have any questions, feedback, or suggestions for articles? Contact me at ml4science@gmail.com or on Twitter @charlesxjyang