ML4Sci #3: Google Tackles Protein Folding; What can scientists learn from Computer Vision and Robotics Research; Lessons learned from Airbnb using Deep Learning; +Cool Science Highlight
What happens when a $100B tech company takes on basic science and AI problems?
Hi, I’m Charles Yang and I’m sharing (roughly) monthly issues about applications of artificial intelligence and machine learning to problems of interest to scientists and engineers.
This newsletter is still in its early days and I’m still figuring out the format and topical coverage that I want to pursue. Right now, I’m leaning towards having several short article discussions, with one long-form, in-depth description and analysis. If you have any feedback or suggestions, including interesting articles, new ideas for how to format these newsletters, length considerations, scope or coverage, or anything else, feel free to reach out at ml4science@gmail.com.
🧬AlphaFold: Using AI For Protein Folding Prediction
Google has recently decided to take the incredible amount of money it generates from ad revenues on its various platforms and invest it into AI+basic scientific research. By directing their vast computational, financial, and intellectual resources at very specific problems, they’ve been able to achieve breakthroughs that were previously only the domain of federally funded academic research (in a manner reminiscent of Bell Labs, IBM Research, etc.). In this work, they were able to predict 3D protein structure using a neural network and placed first in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition, which has been running since 1994.
The model idea itself is simple: use a neural network to predict two properties for which we have ground truth labels (pairwise amino acid distances and angles), and then construct the protein structure from those predictions. But upon closer examination of their GitHub repo, one can see that a significant amount of infrastructure and feature engineering was deployed to get such strong results, not all of which was open-sourced (although a community open-source version of AlphaFold is available).
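To make the "ground truth labels" part concrete, here is a minimal sketch of how pairwise distance targets could be derived from known 3D coordinates and discretized into bins. This is only an illustration: the function names, the toy coordinates, and the bin boundaries are my own assumptions, and AlphaFold's actual features and binning are considerably more involved.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue positions, shape (N, N)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distogram_labels(coords, bin_edges):
    """Discretize distances into integer bin indices (classification targets)."""
    return np.digitize(distance_matrix(coords), bin_edges)

# Toy "protein": four residues on a straight line, 3.8 angstroms apart.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])

# Hypothetical bin boundaries in angstroms, chosen only for this example.
bin_edges = np.array([4.0, 8.0, 12.0])

labels = distogram_labels(coords, bin_edges)
print(labels)
```

A network trained on labels like these outputs a distribution over distance bins for every residue pair, and a separate step then searches for a 3D structure consistent with those predictions.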
Google clearly knows how to capitalize on media hype around AI as well, by naming their model AlphaFold, despite the actual model architecture and training having nothing in common with how the famous AlphaGo and AlphaZero were trained.
You might have noticed that in our previous issue, we also started with another work by Google researchers about nowcasting rain. This was no coincidence - Google has been leveraging their immense resources to target specific problems in the field of ML4Sci that they believe can provide future financial gain, especially as AI begins revolutionizing and creating new industries and business models. AI has the potential to radically alter the way we do science - the ramifications of privatizing this early-stage research across a variety of basic scientific enterprises remain to be seen.
[blog][Nature Paper (author access provided, no downloads)][PROTEINS paper][github]
🤖What can robotics research learn from computer vision research (And what can scientists learn from both of them)?
A group of researchers from Queensland University of Technology in Australia published a review article that highlights common lessons, as well as key differences, between the maturing fields of computer vision (CV) and robotics. I’ll first summarize the article and then discuss how some of its conclusions might help us, as scientists, anticipate how deep learning will change our fields.
Summary
4 key drivers of CV model improvement
standard performance measures and metrics
open-source datasets and standardized competitions
rapid dissemination e.g. arXiv
wealth creation i.e. companies are interested in such models, fund research, use research to make more money, repeat
assertion 1: standard datasets + competitions + rapid dissemination -> rapid progress
assertion 2: datasets without competitions will have minimal impact on progress
assertion 3: to drive progress in robotics, we should change our mindset from experimentation to evaluation
assertion 4: simulation is the only way in which we can repeatably evaluate robot performance
Commentary
In section 3, the authors discuss how such methodology and observations about the progress of CV might apply to robotics, and many of their concerns and observations are relevant to scientists as well. For instance, I think the same 4 key drivers outlined above are important to driving progress in ML4Sci, particularly the need for large, standardized datasets. However, there are fundamental differences between CV and ML4Sci, some of which are also echoed in the robotics community, but some of which are not:
While standardized competitions and benchmark datasets are important, at the end of the day, we’re interested in experimental realization and real-world performance. The generalizability of models from “simulation” i.e. numerically-generated dataset performance, to “reality” i.e. real-world synthesis and design, is limited by (1) the quality of the numerical models available in different fields, (2) the availability of large-scale experimental/real-world data, and (3) the ability to perform high-throughput synthesis and evaluation. In particular, I don’t think assertion 3 is as true in ML4Sci; performing well on a dataset means very little if the model doesn’t generalize well to the real world, and the only way to test that is through experimentation. Similarly, how well assertion 4 holds depends on the strength of simulations/numerical models in each field. For instance, Density Functional Theory for chemistry is generally much more finicky and less well-established in industry than Finite Element Method for mechanics.
Robotics has long relied on a rich background of control theory and dynamical system modeling. In some ways, deep learning has slowly begun to replace such methods and it’s not clear if those methods are even necessary anymore. However, science has developed a tapestry of physical theories that I think will still be, in one form or another, used to help guide the development and training of such models. Finding the right way to incorporate priors e.g. conservation of energy, while still allowing data-driven learning, is already a fruitful and growing area of research in ML4Sci.
Engineers would also like uncertainty estimates for predictions. Most ML-based methods use some sort of Bayesian uncertainty, but those models, especially Bayesian neural networks, still have many drawbacks. As the field of ML4Sci grows in its ability to influence AI research priorities, I think this is a problem that will begin receiving more attention.
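For readers who haven’t seen what a Bayesian uncertainty estimate looks like in practice, here is a minimal sketch using Bayesian linear regression, which has a closed-form posterior. Note that this is a deliberately simple stand-in and not a Bayesian neural network; the prior precision alpha, the noise precision beta, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_bayesian_linear(Phi, y, alpha=1.0, beta=25.0):
    """Posterior mean and covariance of the weights.

    alpha: assumed prior precision on the weights.
    beta: assumed noise precision (1 / noise variance).
    """
    S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ y
    return m, S

def predict(Phi_new, m, S, beta=25.0):
    """Predictive mean and standard deviation at new inputs."""
    mean = Phi_new @ m
    # Noise variance plus the weight-uncertainty term phi^T S phi per row.
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S, Phi_new)
    return mean, np.sqrt(var)

def features(x):
    """Design matrix with a bias column and a linear term."""
    return np.stack([np.ones_like(x), x], axis=1)

# Synthetic training data on [0, 1]; noise std 0.2 matches beta = 25.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=20)
y_train = 2.0 * x_train + 0.5 + rng.normal(0.0, 0.2, size=20)

m, S = fit_bayesian_linear(features(x_train), y_train)

# One query inside the training range, one far outside it.
x_test = np.array([0.5, 5.0])
mean, std = predict(features(x_test), m, S)
print(mean, std)
```

The predictive standard deviation is larger at the query far from the training data than at the one inside it, which is exactly the behavior an engineer wants before trusting an extrapolated prediction.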
Have more suggestions for this (very) incomplete list? Drop a comment to me @charlesxjyang21. I’ve also written about this topic elsewhere.
[pdf]
Applying Deep Learning to Airbnb Search
Airbnb search is not much of a scientific problem, but it is rare enough for industry ML practitioners to release such an easily readable account of the problems they encountered (and their failures) that it’s worth a read. I’d encourage you to read the whole thing, as it’s relatively short and easy. While many of the problems and concerns they run into are not common in scientific fields, there are some nice tidbits of advice in here.
As Andrej Karpathy said when it comes to building models, “don’t be a hero”. Always try out simple baseline models that are well-known, before trying more complex models.
Data cleaning is 90% of the work in machine learning and this problem proved to be no exception. Finding data bugs e.g. monthly price vs. daily price, is an important first step, facilitated by visualization and data querying.
Deep learning helps remove the cognitive burden of feature engineering.
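As a rough illustration of the kind of data bug mentioned above (a monthly price recorded where a daily price was expected), here is a minimal sketch of a sanity check. It is not the method from the Airbnb paper; the field names, the toy listings, and the 10x threshold are all hypothetical.

```python
from statistics import median

def flag_suspect_prices(listings, ratio_threshold=10.0):
    """Return listings priced far above the median for their city."""
    by_city = {}
    for row in listings:
        by_city.setdefault(row["city"], []).append(row["price"])
    medians = {city: median(prices) for city, prices in by_city.items()}
    return [row for row in listings
            if row["price"] > ratio_threshold * medians[row["city"]]]

# Toy data with made-up values.
listings = [
    {"id": 1, "city": "A", "price": 100.0},
    {"id": 2, "city": "A", "price": 120.0},
    {"id": 3, "city": "A", "price": 90.0},
    {"id": 4, "city": "A", "price": 3000.0},  # looks like a monthly rate
    {"id": 5, "city": "B", "price": 80.0},
    {"id": 6, "city": "B", "price": 95.0},
]

suspects = flag_suspect_prices(listings)
print([row["id"] for row in suspects])
```

A check like this only surfaces candidates; plotting the price distribution and inspecting the flagged rows by hand is still what confirms whether they are genuine bugs.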
[pdf]
🧪Cool Science Highlight: Spontaneous oxidation of microdroplets of water to form hydrogen peroxide
You might expect that we know all there is to know about water at standard conditions, but it turns out that even for this simple compound, there are still secrets to unveil. While bulk water is seemingly inert, micron-sized droplets of water can both spontaneously self-oxidize to form μM concentrations of hydrogen peroxide and serve as micro-reactors for the spontaneous reduction of several organic molecules. This happens at the water-air interface of the droplet, most likely because the strong electric field there (strengthened by the high curvature of the microdroplet) ionizes the water to form hydroxyl radicals.
[hydrogen peroxide generation][🔒organic molecule reduction]
Thank You for Reading!
I hope you’re as excited as I am about the future of machine learning for solving exciting problems in science. You can find the archive of all past issues here and click here to subscribe to the newsletter.
Have any questions, feedback, or suggestions for articles? Contact me at ml4science@gmail.com or on Twitter @charlesxjyang