Introduction
In this episode, I sit down with Shantenu Jha, Director of Computational Science at Princeton Plasma Physics Lab (PPPL), to explore how AI is reshaping the path to fusion energy. We discuss why PPPL views fusion as not only a physics problem but also a grand computational challenge, what it takes to close a 10-order-of-magnitude compute gap, and how reasoning models are being integrated into experimental science.
Shantenu also shares lessons from a recent AI “jam session” with over 1,000 DOE scientists, explains the emerging need for benchmark datasets in fusion, and reflects on what AI might never fully automate. Here are three takeaways from our conversation:
AI is transforming fusion from a physics-first to a compute-first challenge
Fusion research, particularly tokamak and stellarator design, demands simulations of extreme conditions: nonlinear, turbulent plasma under hundreds of millions of Kelvin. Shantenu frames this not just as a physics challenge, but as a computational design problem that’s at least 10 orders of magnitude beyond current capabilities. Bridging that gap isn’t just about hardware; it's about smarter, AI-assisted navigation of parameter space to get more insight per FLOP.
Bottom-up AI models offer more control and trust than giant monoliths
While large AI models show promise, Shantenu argues that smaller, physics-constrained models offer tighter uncertainty control and better validation. This ensemble approach allows fusion scientists to integrate AI gradually, with confidence, into critical design and control tasks. In short, building up to larger models rather than jumping in all at once is the best approach.
Fusion needs its own benchmarks and the community is responding
Unlike fields such as materials science or software, fusion lacks shared benchmarks for evaluating AI progress, but that’s changing. Shantenu and collaborators are developing “FusionBench” to measure how well AI systems solve meaningful fusion problems. Combined with cross-lab efforts like the recent AI jam session, this signals a shift toward more rigorous, collaborative, and AI-integrated fusion research.
Transcript
Charles: Shantenu, welcome.
Shantenu Jha: Charles, pleasure to be here.
What is the Princeton Plasma Physics Lab, and how did it begin? (00:40)
Charles: Could you talk a little bit about Princeton Plasma Physics Lab (PPPL) and its place within the Department of Energy’s (DOE) national labs? I was surprised to learn that PPPL is part of that system, and I imagine others might be too. It’d be great to hear about the lab’s history and what it does.
Shantenu Jha: Definitely. It's good to hear we're one of the DOE’s best-kept secrets — hopefully with a lot of "bang" to share. The Princeton Plasma Physics Lab has been around for at least 70 years, maybe longer, under various names. It actually began as Project Matterhorn, going back to the time of theoretical physicist and astronomer Lyman Spitzer and others like physicist John Wheeler. It started as a classified effort focused on thermonuclear reactions and fusion, primarily from a Cold War perspective. Over the years, it transitioned from weapons work to peaceful applications of fusion.
The lab has had national lab status since the late 1960s, and fusion — particularly magnetic fusion — has been its primary mission ever since. PPPL is the only one of the 17 DOE national labs focused almost exclusively on fusion and plasma science. But like all the national labs, it’s part of a larger ecosystem and collaborates widely. Increasingly, we're also doing a lot of work in computation, which we’ll probably touch on more later.
Why is fusion a computational problem? (03:20)
Charles: That’s fascinating. I didn’t realize it dated back that far. Like many national labs, it has its roots in the Cold War and the Manhattan Project. Let's talk about AI. How does AI fit into fusion projects like tokamak design? What’s the role it plays, and what's the opportunity?
Shantenu Jha: That’s a great question. Just to be clear, this is a biased perspective coming from computing. A theorist might say something different. I see fusion as a grand computational challenge. Think of it like drug design or material discovery. You’re trying to design something under a set of constraints, which makes it a computationally expensive problem. The parameter space is huge, and some calculations are prohibitively costly.
Designing a tokamak or a stellarator isn’t just about building a complex machine. You're building one that has to sustain temperatures of hundreds of millions of Kelvin, containing a highly nonlinear, charged, and often turbulent fluid. So you're not just solving a design problem; you're tackling layers of physics at once. That’s why I consider it a computational challenge.
If we had infinite computing power, we could simulate everything and design our way forward. But we don’t and probably never will. I’d estimate we’re about 10 orders of magnitude away from the computational capacity we need to make this a fully simulation-first problem. So the question becomes: how do we close that gap? Every day, I come to work thinking about how to achieve those 10 orders of magnitude in the next five to seven years.
What does it mean to be 10 orders of magnitude away in compute? (07:20)
Charles: That makes me wonder: what does it actually look like to be that far off in compute? Are we talking about limitations in time steps, model resolution, number of parameters?
Shantenu Jha: All of the above. And I’d add that just having more compute isn’t enough. If you don’t use it intelligently, you’ll hit another wall. We’re not going to get 10 orders of magnitude from hardware improvements alone. Moore’s law, which predicts a doubling of performance roughly every 18 to 24 months, only gets us so far — maybe a 1,000x improvement in a decade.
So we have to use computation more intelligently. For example, not every simulation needs the same time step or resolution. Not every region of parameter space deserves the same computational effort. We need to prioritize smarter, use AI to identify which parts of the space are worth exploring, where we can save time, and where we can afford lower fidelity.
This is where I think AI fundamentally changes things. It’s not just about speeding things up. It’s about getting more value out of the same computational budget.
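To make the scale of that argument concrete, here is a rough back-of-envelope version; the 18-to-24-month doubling and the roughly 1,000x-per-decade figure are quoted from the conversation, and the rest is illustrative arithmetic:

```latex
\[
\begin{aligned}
\text{Hardware alone (doubling every 18--24 months):}\quad
  & 2^{120/24} \approx 32\times \ \text{to}\ 2^{120/18} \approx 100\times \ \text{per decade}\\[4pt]
\text{Even granting an optimistic } 10^{3}\times \text{ per decade:}\quad
  & \frac{10^{10}\ \text{(total gap)}}{10^{3}\ \text{(hardware)}} = 10^{7}\ \text{still to come from algorithms and AI}
\end{aligned}
\]
```

Even on the optimistic reading, hardware covers only a small slice of the gap, which is why the remaining factor has to come from using the computational budget more intelligently.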
What kind of computing power does PPPL currently use and where does it come from? (10:00)
Charles: What kind of computing resources are you using now? What's the scale, and where’s it coming from?
Shantenu Jha: The leading system we’re using is Frontier at Oak Ridge, which is the DOE’s flagship machine. It has a peak performance of about 1.4 exaFLOPS. But real applications never reach that. As they say, that number is what you're guaranteed not to exceed. If a code achieves even a quarter or a third of that, it's doing extremely well.
The challenge is getting these high-fidelity physics codes to run well on these leading machines. We’re also using other DOE machines, like those at Argonne and Livermore, but the effort has primarily been on Frontier. That raises interesting questions, since all of those computers sit within the DOE’s portfolio today, and we have to prepare for the next generation of supercomputers over the next five to ten years. That’s something I’m deeply interested in.
Charles: When it comes to AI and high-performance computing (HPC), some might wonder: why not just train one big model on all your simulation data and use it for fast inference? Why the need for a heterogeneous system?
Shantenu Jha: The answer is: yes, maybe eventually, but we’re not there yet. Right now, we’re taking a hybrid approach. We're looking at simulations and seeing where more computation doesn’t yield more accuracy. That’s a good candidate for a surrogate model, a spot where AI can help.
In some views of the future, all simulations will be made up of lots of small AI models. Maybe you build one big model from smaller ones, or maybe you train one massive model up front. It’ll probably be a mix of both.
At PPPL, we’re exploring the bottom-up path. We’re running multi-fidelity simulations and inserting AI surrogate models where we can. The goal is either to do more science with the same compute or to reduce the cost of getting the same results.
This could look like super-resolution or bootstrapping. Start with a cheap model, refine it with AI, then move up in fidelity as budget allows. Whether this builds into a giant, all-encompassing model is still an open question. But yes, for now, it's a stack of AI "turtles" all the way up.
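As a concrete illustration of that bottom-up workflow, here is a minimal sketch in Python; the two "simulators" are toy stand-ins rather than any PPPL code, and the Gaussian-process surrogate is just one possible modeling choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Stand-in "simulators": a cheap low-fidelity model vs. an expensive high-fidelity one.
def low_fidelity(x):   # fast, approximate
    return np.sin(3 * x) + 0.3 * x

def high_fidelity(x):  # slow, accurate (imagine a full turbulence run)
    return np.sin(3 * x) + 0.3 * x + 0.2 * np.cos(10 * x)

# 1. Explore the parameter space broadly with the cheap model.
x_lo = np.linspace(0, 2, 40).reshape(-1, 1)
y_lo = low_fidelity(x_lo).ravel()

# 2. Train an AI surrogate on the cheap data.
surrogate = GaussianProcessRegressor(normalize_y=True).fit(x_lo, y_lo)

# 3. Spend the expensive budget only where the surrogate is least certain.
x_cand = np.linspace(0, 2, 400).reshape(-1, 1)
_, sigma = surrogate.predict(x_cand, return_std=True)
x_hi = x_cand[np.argsort(sigma)[-5:]]          # 5 most uncertain points
y_hi = high_fidelity(x_hi).ravel()

# 4. Fold the high-fidelity results back in and refit the surrogate.
surrogate.fit(np.vstack([x_lo, x_hi]), np.concatenate([y_lo, y_hi]))
```

The design choice being illustrated is simply that the expensive, high-fidelity budget is concentrated where the cheap surrogate is least certain, rather than spread uniformly over the parameter space.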
Why build a bottom-up ensemble of small AI models and what are the tradeoffs? (15:15)
Charles: Give me a sense of why we might expect a bottom-up ensemble of many small AI models. Why wouldn’t we just use a single large one? Is it because you're working with different types of modules or physics? Help us understand that tradeoff.
Shantenu Jha: Absolutely. That’s exactly right. When you train one very large model, the uncertainty is typically higher. These large models can exhibit emergent behavior, and we all know about issues like hallucination and unpredictable errors. In contrast, if you start with small models and constrain the physics at each step, the uncertainty is much smaller — or at least more manageable.
You can train, test, and validate at every stage, which gives you greater control over uncertainty. That’s one reason I personally prefer building a hierarchy of models. Eventually, yes, we want large, powerful, emergent models. But from my perspective, it’s more effective to build confidence gradually rather than create the biggest model possible and then try to understand its limitations.
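Here is a minimal sketch of the idea that an ensemble of small, separately validated models yields a tractable uncertainty estimate; the polynomial models and toy data are purely illustrative stand-ins for physics-constrained components:

```python
import numpy as np

# Toy data standing in for a physics-constrained training set.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=x.size)

# Train several small models on bootstrap resamples of the data.
models = []
for _ in range(10):
    idx = rng.integers(0, x.size, x.size)
    models.append(np.polynomial.Polynomial.fit(x[idx], y[idx], deg=7))

# The ensemble mean is the prediction; the spread between members is an
# uncertainty estimate that can be tested and validated at every stage.
x_new = np.linspace(0, 1, 50)
preds = np.stack([m(x_new) for m in models])
mean, spread = preds.mean(axis=0), preds.std(axis=0)
```

Each member can be trained, tested, and validated on its own, and the disagreement between members is an uncertainty signal you can monitor before trusting the ensemble in a design or control loop.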
How can we trust these models in chaotic, real-world systems like fusion reactors? (16:30)
Charles: One thing I’ve always wondered: plasma physics is fundamentally chaotic. As we try to control plasma fusion in reactor designs like tokamaks, how can we have any guarantee that a given model or control system will continue to work reliably over years of operation? That seems like a major issue when moving from lab to real-world deployment.
Shantenu Jha: I couldn’t agree more. Perpetual reliability is going to be difficult. This is where continuous learning and training come in. As we build digital twins or AI-driven models of tokamaks, those models, like humans, will need to be continuously updated. Just as a scientist reads a few new papers each morning to stay current, these models will need to be retrained regularly using high-quality data.
This already happens with large language models on the internet, where huge volumes of new data — ranging in quality — are continuously fed into updated versions. That feedback loop is easier online, but in plasma physics, we’ll need a similar mechanism based on experimental systems and high-fidelity simulations.
Eventually, we’ll run into data scarcity, both in physics and online. At some point, the best training data may come from AI-generated outputs — synthetic data. This raises interesting questions: how do we generate useful synthetic data for the next generation of models? It’s a growing area of research.
Charles: What does synthetic data look like in plasma physics? What makes it useful?
Shantenu Jha: It depends on how you define synthetic data. There isn’t really a consensus. For example, if data comes from an AI system that was constrained by physical laws, some would still call that synthetic. Personally, I take a more flexible view. If a model uses physics-informed constraints and the resulting data comes from inference within those bounds, I think it’s acceptable to use that data for training. But others might disagree. It’s still a bit of a gray area.
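One way to read "inference within physics-informed bounds" is the following sketch; the parameter ranges and the stand-in model below are hypothetical and not taken from any fusion code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical physics-informed bounds on the input parameters.
bounds = {"density": (1e19, 1e20), "temperature_keV": (1.0, 30.0)}

def constrained_model(density, temperature_keV):
    """Stand-in for a physics-constrained model's inference (illustrative only)."""
    return density * temperature_keV**2 * 1e-21

# Sample only inside the physically allowed region, run inference,
# and keep the (input, output) pairs as synthetic training data.
samples = {name: rng.uniform(lo, hi, size=1000) for name, (lo, hi) in bounds.items()}
labels = constrained_model(samples["density"], samples["temperature_keV"])
synthetic_dataset = np.column_stack(
    [samples["density"], samples["temperature_keV"], labels]
)
```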
Charles: Going back to the earlier point: how do we operate real systems when we can’t fully guarantee reliability? You mentioned active learning and continuous training, which makes sense. But what does deployment look like in practice? Do we just run simulations and physical tests over time and then say, “well, nothing has broken yet, so it must be safe”?
Shantenu Jha: That’s an important question. I think the answer lies in bounding our uncertainty. Think about data centers: some guarantee 99% uptime, others promise 99.9% or even more. That extra fraction comes at a significant cost. Similarly, in fusion, it won’t be about total certainty. It’ll be a balance of technical capability, design tolerances, and economic tradeoffs.
So no, we won’t be able to provide absolute guarantees. But we will aim for high confidence — enough that our computational models and AI-assisted designs operate within acceptable risk thresholds. It becomes a matter of how much certainty is “enough,” and that will differ depending on the application. I don’t think anyone will insist on total guarantees, especially in a field as complex as fusion.
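To put rough numbers on the data-center analogy (one year is about 8,760 hours; the arithmetic, not any specific cost figure, is the point):

```latex
\[
\begin{aligned}
99\%\ \text{uptime}   &\Rightarrow 0.01   \times 8760\ \text{h} \approx 88\ \text{h of downtime per year}\\
99.9\%\ \text{uptime} &\Rightarrow 0.001  \times 8760\ \text{h} \approx 8.8\ \text{h}\\
99.99\%\ \text{uptime}&\Rightarrow 0.0001 \times 8760\ \text{h} \approx 0.9\ \text{h}
\end{aligned}
\]
```

Each additional "nine" buys roughly a tenfold reduction in expected downtime, and each one is progressively more expensive to engineer, which is the tradeoff being described.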
Are there benchmarks in fusion like in other scientific fields? (21:45)
Charles: It’s an interesting contrast with the nuclear fission industry, which has had strict regulatory frameworks for decades. Fusion seems to raise different questions around knowability. You mentioned data earlier. In many fields, benchmark datasets help drive innovation. Is there anything like that in physics or fusion?
Shantenu Jha: That’s a great question. It’s something we’ve been actively working on. Some communities, like materials science or math-heavy domains, have developed strong benchmarks. Even in machine learning for software or math reasoning, benchmarks help communities track progress and compare results without ambiguity.
The fusion community hasn’t really done this yet. That’s been one of my personal goals: working with experts in fusion to define something we’re calling FusionBench. We’re still in the early stages, so I don’t have results to share yet, but we hope to launch something in the next few months.
The idea is twofold. First, we want to measure how much progress we’re making in solving meaningful fusion problems. Second, we want to track our improvements in applying AI to fusion, something the field hasn’t systematically done before.
As new models are released — and they're arriving rapidly — they may be well-suited for certain tasks, but that doesn’t necessarily make them appropriate for the challenges fusion presents. A benchmark helps us calibrate our progress in using AI models, but it also helps differentiate which of the new models are actually effective for our domain.
It’s about making sure our community is aligned: using the right models with the right capabilities to move the science forward. There are many reasons why something like FusionBench is valuable. Just as the FrontierMath benchmark has been useful for the mathematics and reasoning community, we believe FusionBench will serve a similar purpose for fusion.
What happened during the recent AI scientist jam session? (24:50)
Charles: Awesome. I’m excited to see it. It's a great point that many labs are now shifting to tougher scientific benchmarks because the easier ones have been saturated. It’ll be interesting to see how these models perform on a fusion benchmark. You recently co-hosted an AI scientist jam session with nine national labs, 1,000 scientists, and support from OpenAI and Anthropic, who made their models available for a day. How did that go?
Shantenu Jha: It was fun. We learned a lot. We gained insights into our own limitations and saw firsthand the capabilities of the models provided by OpenAI and Anthropic.
One major takeaway was the sheer diversity of problems. We had around 1,500 scientists from across the DOE complex, each bringing different ideas. We’re now in the process of aggregating what we learned from all the labs and doing a meta-analysis. We hope to publish something soon.
It was incredible to see which problems the AI reasoning models helped with most effectively. That alone was valuable not just for us, but hopefully for the model developers too. The second big takeaway is that while AI models won’t replace fusion scientists, it’s now broadly accepted, even among the skeptics, that these tools are genuinely useful.
That doesn’t mean we can apply them indiscriminately. They won’t be useful for everything. But used carefully, they can be powerful assistants. That’s the shift we’re seeing now: recognizing the value and figuring out how to use it most effectively.
Charles: That’s really interesting. Getting 1,500 people together is no small feat. Do you feel there’s still skepticism toward these reasoning models in the fusion community?
Shantenu Jha: Yes, there’s a healthy level of skepticism and critical thinking, as there should be. I think most people now understand this isn’t just a fad. There’s real scientific value here.
The key is to develop a nuanced understanding of where these models are useful and where they’re not. That boundary isn’t fixed. It’s a moving target. As the models improve and as we get better at using them, the line between "useful" and "not useful" will shift. Our job is to keep pace and use them to enhance scientific discovery. I think the community is starting to embrace that.
What’s the hardest task for AI to master in your work? (29:09)
Charles: One last question. What do you think will be the hardest — or the last — task that AI will become truly expert in within your daily work?
Shantenu Jha: Great question. If you’d asked me a month ago, I would have given you a different answer. Back then, even if you promised me 10 orders of magnitude more compute, I would’ve said we still wouldn’t have AI models capable of abduction — the intuitive leap that lets scientists form new ideas.
But then I attended a meeting in Japan co-hosted by the DOE, focused on post-exascale computing. During a brainstorming session, I had this thought: what if future AI models are capable of rejecting the algorithms they were given? Not in a dystopian sense, but what if they have the intelligence to identify when a better algorithm exists?
In other words, what if they can learn how to learn? If AI can autonomously select the best algorithm for a given scientific problem, that’s a huge leap. That’s what scientists do: choose and tailor the right method. If AI can do that, it would be transformative.
So for me, selecting the right algorithm for a problem remains the hardest challenge. But with enough computational power — 10, maybe even 20 orders of magnitude more — it could also be the ultimate achievement from a computational science perspective.
Charles: Yeah, that’s fascinating. So if anyone from Congress is listening: we need 10 more orders of magnitude of compute for PPPL if we want fusion.
Thanks for joining us, Shantenu.
Shantenu Jha: Thank you, Charles. It’s been a pleasure.