Frontiers of AI-Powered Experimentation
The confluence of three technologies makes this a special time at the bench.
This post is copied from an email Q&A with Tom Kalil, Chief Innovation Officer at Schmidt Futures. Tom is the [Q] and I’m the [A]. This conversation originally started around my fascination with the progress in Program Synthesis and its future influence on how we do biological research. Every day we see a new demonstration of the power of large language models, and it seems the future integration with biology is getting rapidly closer.
The joy of publishing material in blog format is that it encourages discussion. These are forward-looking ideas with no guarantee of accuracy beyond me doing my best in early February 2023. I recommend Sam Rodriques’ Tasks and Benchmarks for an AI scientist, published yesterday, which explores adjacent ideas and, I think, coins the term Large Science Model (LSM). Comments and thoughts welcome.
Thanks to Tom and all others for their contributions and conversations.
Q: Dan, what I wanted to explore with you is why you and other researchers believe we can accelerate the pace of scientific progress by combining three different technologies – namely, generative AI, cloud labs, and in silico prediction. Why don’t we start off with discussing the individual building blocks – and then we can discuss opportunities for combinatorial innovation. What is generative AI, what is program synthesis, and why are they relevant to science?
A: The big picture here is that the interface between bits (software) and atoms (physical world) is expanding at an astonishing pace. Most of our Q&A will focus on the bits→atoms direction (eg, software controlling physical experiments), and specifically I think Program Synthesis (roughly, “code that writes code”) has significant potential to accelerate progress. But the story starts in the atoms→bits direction.
The massive amount of data created in biology has been the raw resource for powerful computational tools. Most biologists have probably seen the NIH chart showing that sequencing data outpaces Moore’s law. However, in my opinion, the best metaphor for this new era of biological data is that a single optical microscope in a modern biology lab, if run 40 hours per week by a single scientist, can produce more raw data annually than the Large Hadron Collider (once the poster child of team science in the era of Big Data). Paul and I learned this first-hand in our PhDs! Bigger datasets mean bigger opportunities for computer science, and the resulting software breakthroughs have increasing implications in the physical world.
More than “just” processing data from biological experiments, we’re entering the era of AI-powered experimentation. Recently I wrote an essay called AI is Part of Biology’s Future to explore the inevitable 3-way intersection of three major technologies:
Generative AI: a term for leveraging publicly available Large Language Models (LLMs) like GPT-3 that can power robotic planning and human-level chatbots.
Cloud Labs: centralized robotized lab facilities, eg Emerald Cloud Lab.
in silico prediction: exemplified by the protein folding work of AlphaFold and ESM, which have now published 100M+ structures.
I believe the deep integration of these three will be more than the sum of its parts.
OpenAI’s ChatGPT has been in the world for a few months and was heralded as “AI’s iPhone Moment.” If any readers haven’t yet played with ChatGPT, I strongly advise them to do so now while referring to Riley Goodside’s Twitter feed for examples of humorously illuminating expert usage. ChatGPT is a general purpose LLM, and later we will explore how LLMs might change as they ingest more scientific information. But first, let’s start with the most powerful capability of Generative AI: Program Synthesis.
Program Synthesis is a subfield of AI that suddenly went from academic pursuit to a new foundation of industry. As long as we’ve had software, we’ve had the wild idea that a software agent could solve arbitrarily complicated problems by writing its own programs. I had my “oh wow, this is serious” moment when I sat in the audience of Kevin Ellis’ MIT PhD thesis defense in 2020 and was floored by his DreamCoder work. Despite the beauty of DreamCoder, it was academic in scope, and programmers in 2021 scoffed at any practical implications of GPT-3. But it turns out that LLMs have unlocked Program Synthesis, and we can now see the world-changing potential in the publications from Microsoft, from Google and from OpenAI. And as software reaches into the real world, the implications can be seen in the SayCan work from Google and Everyday Robots (video below), which demonstrates complex natural language tasks being broken down into a sequence of robotic actions.
Even more impressive to me than human language instructions is the new ability to prompt a software agent with just a unit test (paper). A toy example in software would be a function to confirm two numbers are correctly multiplied together, and a toy example in biology would be a PCR amplification to detect the presence of a DNA sequence (like a Covid test). Now it’s probably a weekend hackathon for ChatGPT or similar to connect to a robotic lab setup to do such a PCR test. These examples may seem trivial, but the takeaway from 2022 is that practitioners move beyond toy problems very quickly. We are in an environment rich with positive feedback.
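To make the unit-test-as-specification idea concrete, here is a minimal, hypothetical sketch in Python: the test is the entire prompt, and a synthesis agent would be asked to produce a `multiply` implementation that makes it pass (a reference body is included so the file runs as written; no specific synthesis tool is implied).

```python
# Toy "specification by unit test": the test below is the whole prompt a
# program-synthesis agent would receive; its job is to write multiply().
import unittest

def multiply(a: int, b: int) -> int:
    # In the synthesis setting this body would be model-generated; a
    # reference implementation is included so the example runs as-is.
    return a * b

class TestMultiply(unittest.TestCase):
    def test_small_numbers(self):
        self.assertEqual(multiply(3, 4), 12)

    def test_negative_numbers(self):
        self.assertEqual(multiply(-2, 5), -10)

if __name__ == "__main__":
    unittest.main()
```

The biology analogue swaps the assertion for a physical readout: the “test” is a PCR run on a robotic setup, and the agent must propose a protocol that produces the expected amplification.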
Q. What are some potential advances we could make by training large language models on the entirety of the scientific literature?
A: It’s inevitable that somebody will ingest all of scientific knowledge into a foundation model like a GPT [1]. Any of the leading groups could probably do this, and probably are already working on it. For example, AllenAI has many public tools for processing scientific literature. But it’s hard to directly answer the question of what advances we could expect. An LLM with all scientific knowledge might be shockingly good or it might be deeply disappointing (see the ScholarBERT paper). As I’ve written about recently, we have to embrace the unpredictability ahead.
Let’s briefly talk about the opportunity of mining all scientific literature with the intent of creating a Large Science Model (LSM). Only 3% of the training data of GPT-3 was Wikipedia; the rest was text from web crawlers and books. The Chinchilla paper (excellent explainer) underscored the importance of finding new, high-quality information to ingest into these powerful compute models. So compare the roughly 3 billion tokens (tokens are how LLMs see words in text) that all of Wikipedia contributed to GPT-3’s training data against SciHub’s growing database of 88 million papers, which by my back-of-the-envelope math represents an untapped resource of about 200 billion tokens. (ScholarBERT raises good questions on how to make use of the data.) This certainly appears like potential for a big improvement, especially since there are now scientific question-and-answer benchmarks. So, if somebody navigates the legal challenges of ingesting all of SciHub and sharing the new LSM with the world, there might be a really powerful platform on the other side.
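For readers who want to see one way that back-of-the-envelope math could work out, here is a tiny sketch; the tokens-per-paper figure is my own assumption chosen to land near the ~200-billion estimate, and the real number depends heavily on what text survives cleaning and deduplication.

```python
# Back-of-the-envelope token count for the SciHub corpus (assumed figures).
papers = 88_000_000          # papers in SciHub, per the post
tokens_per_paper = 2_300     # assumed average of usable tokens per paper
total_tokens = papers * tokens_per_paper
print(f"~{total_tokens / 1e9:.0f} billion tokens")   # ~202 billion
```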
So let’s assume there is a publicly accessible LSM: what might we do with it? As discussed earlier regarding Program Synthesis, one very powerful feature of LLMs is that they can serve planning functions. Perhaps the next-gen science LLM could auto-draft roadmaps by breaking down big problems into tractable projects, or predict next projects (eg, DELPHI). Furthermore, the science LLM could execute day-to-day scientific research in the lab if it were paired with domain-expert AI sub-systems (eg, see Amyris’ Lila for microbial engineering). This dynamic parallels a human team: one manager to scope the project and a set of subordinates to plan, execute, and then report back the experimental results with suggested next steps. One advantage is the inherent scalability of automation, and the bits→atoms bandwidth could widen dramatically with cloud labs.
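As a rough illustration of the manager-plus-subordinates pattern, here is a hypothetical sketch; `llm_plan` and the three subsystem functions are placeholders rather than real APIs, and a real system would replace them with calls to a planning LLM, design models and a cloud-lab interface.

```python
# Hypothetical "manager + domain-expert subordinates" loop.
from typing import Callable, Dict, List

def llm_plan(goal: str) -> List[str]:
    # Placeholder: a real planner would prompt a science LLM to decompose
    # the goal into steps; here the plan is hard-coded for illustration.
    return ["design_candidates", "run_assay", "analyze_results"]

def design_candidates(ctx: dict) -> dict:
    ctx["candidates"] = ["variant_1", "variant_2"]   # e.g. an in silico design model
    return ctx

def run_assay(ctx: dict) -> dict:
    ctx["raw_data"] = "assay_results.csv"            # e.g. a cloud-lab API call
    return ctx

def analyze_results(ctx: dict) -> dict:
    ctx["summary"] = "variant_2 looks best"          # e.g. an analysis pipeline
    return ctx

SUBSYSTEMS: Dict[str, Callable[[dict], dict]] = {
    "design_candidates": design_candidates,
    "run_assay": run_assay,
    "analyze_results": analyze_results,
}

def run_project(goal: str) -> dict:
    """Manager loop: plan the project, delegate each step, carry results forward."""
    context = {"goal": goal}
    for step in llm_plan(goal):
        context = SUBSYSTEMS[step](context)
    return context

print(run_project("improve enzyme thermostability"))
```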
Q: What are cloud labs?
A: Named in reference to cloud computing, cloud labs are centralized robotic facilities for conducting life science research. Currently, there are a few commercial players like Emerald Cloud Lab, Strateos and Culture Biosciences, and it will be interesting to see how this field grows. In the big picture, I’m optimistic that bioautomation will be widely adopted, but we have to appreciate the short-term uncertainty. On one hand, the centralization of expensive hardware and orchestration of logistics (reagents, storage, data, etc) seems to be directionally correct due to economies of scale. But on the other hand, there is currently significant economic and technical friction in the transfer of a local lab to a cloud lab.
This is where Erika DeBenedictis, founder of the BioAutomation Challenge and an innovator in synbio+automation herself, is such an important thought leader, and I’d point readers to her recent article on how cloud labs can better integrate with academic research.
Q. What is in silico prediction, and why are scientists so excited about progress in fields like protein structure prediction?
A: I say in silico prediction to refer to the task of estimating molecular properties. These properties could be 3D shape, reactivity, stability, enzyme kinetics, substrate affinity or many other such characteristics. Although this might sound mundane at first, it’s important to understand the implications. If you solve the challenges of molecular-scale prediction, you unlock the ability to virtually test and iterate millions of times faster than could ever be done in reality. The frontiers of in silico prediction touch everything from solar panels and batteries to medicines and fertilizers.
The most visible example of in silico prediction today is protein folding. It’s brilliant that we can draw a 3D protein structure directly from a 1D DNA sequence. Google’s AlphaFold has now published 200 million protein structures, and Meta’s ESM Atlas has published 600 million. This changes how we mine the troves of biological data: we can now search directly for 3D and functional characteristics, which is useful when scanning nature’s toolkit for something like signaling peptides to control animal or plant biology. Or, if you’d allow me to wade into weird topics in biology, I’d love us to have a better understanding of how (and if) microbes can eat radiation: could we find photosynthesis-like protein systems in the genomes of those weird creatures that live on spent uranium fuel rods or the outside of space stations, and do they actually harness energy in a totally novel way? These new computational tools can help us explore difficult-to-study biology, probe early ideas quickly and also unlock entirely new directions of bioengineering.
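To show how accessible these published structures already are, here is a small sketch that pulls a predicted structure from the public AlphaFold Database REST API; the endpoint path and the `pdbUrl` field reflect my reading of the public docs and should be double-checked before building on them.

```python
# Fetch an AlphaFold-predicted structure for a UniProt accession.
import requests

def fetch_predicted_structure(uniprot_accession: str) -> str:
    """Return the PDB-format text of the AlphaFold prediction for one protein."""
    meta_url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_accession}"
    entries = requests.get(meta_url, timeout=30).json()
    pdb_url = entries[0]["pdbUrl"]      # first (usually only) model entry
    return requests.get(pdb_url, timeout=30).text

# Example: human hemoglobin subunit alpha
# pdb_text = fetch_predicted_structure("P69905")
```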
We’re now in a rapid growth period of de novo protein design in which we think of proteins as programmable nanoscale machines. One direction that is particularly exciting to me is the inverse protein folding problem: Let’s say you know a protein’s shape and function, but you want it to be more stable than the original to be used industrially, or have more human-like sequences to minimize immune response in medicine. Or, if you’ll allow me to be speculative again, one of my technical fascinations is the frontiers of artificial enzymes, in which we discover a tool in nature but then replicate it in a scalable, synthetic format (“airplanes don’t have feathers” is the classic quip of learning from nature but scaling it up in our own way). Inverse protein folding is the category of problem that would solve these challenges: Such difficult tasks are becoming more tractable every day.
I have to channel my heroes of experimental biology to give the caveat that these computational tools just give *predictions* and it all needs to be verified. We can’t fold a few proteins and think we’ve solved biology. These in silico prediction tools can and will fail catastrophically, plus their training data might be missing critical knowledge like post-translational modifications, phase-separated nano environments or self-assembled structures in vivo. Such missing knowledge is critical for climate-relevant applications like methane oxidation to mitigate the greenhouse effect or the “nitrogen-splitting anvil” to develop future fertilizers. So it’s excellent that we can generate so many candidate molecules, but we have to remember that it is a long pipeline toward deployment, and this is where robotics has a large role to play.
Q. How might we accelerate the pace of scientific research by co-designing these three technologies?
A: There is the potential for three trends to reinforce each other: AI that is getting good at planning and delegating tasks, AI that is getting good at molecular-scale computation, and robots that are getting better at doing physical work. Together, this is the potential for a virtuous cycle: more physical science tasks accomplished by more lab robots, controlled by a scientific planner AI that then creates more and better tasks.
This is a special moment in time when we currently stand on the flat part of the exponential curve and can ask ourselves the hard questions: How can we steer these new technologies to work on important planet-scale problems? What is needed to kickstart this flywheel for problems that might not yet have market forces to incentivize activity?
Each technical discipline has a different set of needs and motivations, so it does require some co-design: Computer scientists need datasets, biologists need assays, roboticists need tasks. In the diagram below, I briefly sketch out how the full virtuous cycle could be viewed as discipline pairs: it might be initially productive to consider specific, representative challenges along each edge. For example, what would it take for teams of LLM researchers and biolab automation experts, who probably don’t yet intersect much, to collaboratively create a set of challenges and data that inspires both of their fields? The role of common resources cannot be overstated; public datasets have been foundational to machine learning’s progress. As we collectively ponder how to get each discipline the resources they need, we can also identify application spaces that might inspire the communities to self-organize into projects.
It’s important to consider which subfields of biology will get access to these AI+robotics capabilities without interventions today: it is relatively straightforward to write business plans that justify bringing automation to certain aspects of medical biotech (eg, antibodies). But what about high-impact, high-risk areas like green hydrogen (high CapEx, low margin), carbon farming (no market forces yet) or curing diseases caused by pollution (hard challenges in both science and business)? These example topics could all be billion-dollar opportunities with positive externalities, but because they have both market and science risk, they are currently the last fields to benefit from cutting-edge AI+robotics automation.
We are building Homeworld Collective to grow the community of climate biotech research and to make sure climate problems are well-framed to make the most of cutting-edge technologies.
We have major technological challenges to ensure Earth can support a thriving biosphere for the centuries ahead, and AI-powered experimentation could give us the speed we need. We’ve mentioned a few climate-relevant topics so far – methane oxidation, nitrogen fixation and radiosynthesis – and there are many more that could be significantly accelerated by AI-powered experimentation: next-gen metal mining, decentralized energy production, cheap energy storage, utilization of waste biomass, replacing petrochemical feedstocks, biological resilience, rapid topsoil recovery, etc. But each of these is a big problem space that warrants its own white paper to expose the most actionable sub-problems ready for AI-powered experimentation. It’ll be many years, if ever, before we can open the toolset and type “solve carbon capture” into the chat box (I’ve tried it!).
Q. What’s a plausible scenario for how this might play out in a particular application of science, like drug discovery or enzyme design?
A: Personally, I love the topic of understanding and enhancing biological resilience. I think now could be its time.
One specific subproblem inside resilience that could benefit from AI-powered experimentation is how to make plants stronger and more robust against environmental threats. I see a lot of plant engineering work that is genetic in nature, but I think we need solutions that are faster to develop and more controllable to deploy. Said in a funny way, I’m curious about the frontiers of drugging plants.
First, why would we drug plants? People have long engineered small molecules for crop protection (pesticides etc), but what about for specific climate applications like carbon capture, marginal land repopulation, metal uptake or photosynthesis overdrive (irrespective of food yield)? This is obviously speculative but an interesting direction of thought because small molecules have the potential to go to market much faster and cheaper than a genetic modification.
Developing and delivering small molecule modulators for given protein targets is a well-established playbook in human medicine, so perhaps this process could be replicated broadly and rapidly from the ground up with AI-powered experimentation. If we could invent small molecules that confer resilience against heat, drought or light stress, then we might have a just-in-time solution to save crops in extreme weather.
There’s enough here to bring AI scientists, plant biologists and roboticists to the table. The plant literature has plenty of metabolic networks for the future scientific LLMs to mine, next-gen robotics companies like Hippo Harvest show us that robots can work with plants at scale, and in silico predictions have shown us that we can screen millions of molecular variants to impact a given pathway.
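As a toy stand-in for that kind of virtual screen, here is a sketch using RDKit to filter a candidate library by cheap computed properties; a real campaign would layer docking or learned activity models against a specific plant target on top of this skeleton, and the example molecules are arbitrary placeholders.

```python
# Toy in silico filtering step: keep molecules inside a simple property window.
from rdkit import Chem
from rdkit.Chem import Descriptors

LIBRARY_SMILES = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",        # aspirin, as a placeholder library member
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",    # caffeine, another placeholder
]

def passes_filters(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    # Crude drug-likeness-style cutoffs, purely illustrative.
    return Descriptors.MolWt(mol) < 500 and Descriptors.MolLogP(mol) < 5

hits = [s for s in LIBRARY_SMILES if passes_filters(s)]
print(hits)
```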
Let’s say we first build the robotic lab setup, document the assays and share the output data with the world (images and ‘omics of each experimental condition across multiple parts of the plant). Then we generate some initial data (ie, positive and negative controls), demonstrate the ability to scale up to more test molecules (say, using existing drug libraries), and then create publicly accessible datasets from the whole system. This creates public goods in the form of knowledge and lowered barriers to entry for experts to engage with the problem.
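One possible, entirely hypothetical shape for those shared records, just to make “images and ‘omics per condition” concrete:

```python
# Hypothetical record schema for the shared plant-assay dataset.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlantAssayRecord:
    condition_id: str                # e.g. molecule + dose + stressor
    molecule_smiles: Optional[str]   # None for untreated controls
    tissue: str                      # "leaf", "root", ...
    control_type: Optional[str]      # "positive", "negative", or None
    image_paths: List[str] = field(default_factory=list)
    omics_files: List[str] = field(default_factory=list)  # e.g. RNA-seq count tables
```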
This is thin on details for brevity, but hopefully illustrative of an opportunity to create assays for biologists, tasks for roboticists and large datasets for computer scientists, while also translating known recipes for success from medical biotech.
Q. How does this relate to your vision of enabling “hyper-productive” fields – like the role that open source software and cloud computing played in lowering the barriers to Internet entrepreneurship?
A: Everything we’ve discussed so far is about technology, but it is ultimately human beings working in groups that turn potential into impact. I want to see more research in how we build and nurture new research communities [2].
Machine learning and medical biotech are gold-standard research communities that have produced several world-changing leaps in our lifetimes. These communities have common social, logistical and economic structures that create windows of hyper-productivity, and I believe it may be possible to engineer hyper-productivity in the future.
As an illustrative and famous example, when the AlexNet paper from Geoff Hinton’s lab became the foundation of the Deep Learning Revolution in 2012, it was a leap developed on a large infrastructure of open source tools and community knowledge, notably centered around Fei-Fei Li’s ImageNet dataset. So while it may be useful for a resource allocator to ask “how do I find more Geoff Hintons?”, it may be even more productive to ask “how do I recreate the computer vision community from 2010-2015?”
So in this moment, when there is technological momentum and major global challenges in need of fresh ideas, how do we plant the seeds for hyper-productivity into AI-powered experimentation? My current mental model is that there are six conditions to engineer, and I’ll give brief examples of each one:
Common platform for experiments: Open source computational libraries like Torch/TensorFlow, SciPy and Pandas have been essential in the deep learning revolution by making it very easy for a practitioner to conduct research with the same toolkits as the state-of-the-art methods.
Common platforms for scale-up: Cloud computing means that if you build an app that works for a thousand people, it can work for a million people.
Common goals: In biology we had the Human Genome Project and the CASP protein folding prediction challenges. In computer science, public challenges like ImageNet have been instrumental in focusing many efforts in an apples-to-apples comparison.
Enforceable honesty: A best practice in computer science is to share your code and models so people can replicate your results for themselves. In synthetic biology, people share their plasmids on AddGene.
Playbook for outsized success: Y Combinator became the preeminent startup accelerator with a storied history of 20-somethings becoming multimillionaires by following common strategies.
Funding to explore: Doing software work has historically been cheap; the semi-serious joke in Silicon Valley is that you just need enough money to keep people fed on instant ramen. Bio research is more expensive, but there has been lots of available capital.
For AI-powered experimentation, I would suggest we start with collaboratively developing common goals to motivate practitioners and common platforms for experimentation to increase accessibility to the highest-leverage problems.
The views and opinions expressed in this blog are the author’s own and do not necessarily reflect the view of Schmidt Futures.
[1] Gwern + Willy Chertman make excellent points about the difference between ingesting words (today) and pixels (maybe tomorrow) in the comments of my AI is Part of Biology’s Future post. While it would be a heavier lift for the AI to deal with pixels and then extract words, figures, etc., it could capture the multi-modal nature of sharing scientific knowledge better than pure text.
[2] There are great people actively exploring this. Michael Nielsen just published some great thoughts on “The community as the unit of scientific contribution,” Nadia Asparouhova has a great analysis of all the new scientific institutes that have been built recently, and Sam Arbesman maintains his Overedge Catalog listing all new organizational efforts. Ben Reinhardt does great writing and thinking, and Matt Clancy recently wrote an essay on What if we could automate invention?