Introduction
In this episode, I sit down with Robert Underwood, a staff scientist at Argonne National Laboratory. We dive into Argonne’s mission as an open science lab, the power of its new exascale supercomputer Aurora, and how these resources are being harnessed to drive the future of AI for science. We discuss the AuroraGPT project, which aims to adapt AI models for scientific data, as well as the challenges of handling massive scientific datasets generated by facilities like the Advanced Photon Source.
We also talk about how Argonne is collaborating with initiatives like the Trillion Parameter Consortium to push the boundaries of AI at scale, while staying focused on scientific workflows and reproducibility. Here are three takeaways from our conversation:
Aurora: Public Compute at Scale
At the heart of Argonne National Lab's leadership computing facility is Aurora, a public exascale supercomputer purpose-built for large-scale scientific computation. Unlike commercial cloud GPU clusters, Aurora is optimized for running massive, coordinated jobs across tens of thousands of nodes—something essential for many forms of modern science, from fluid dynamics to materials modeling. As Robert Underwood explains, Aurora also supports mixed-precision compute, allowing researchers to run AI workloads as well. The lab's role as an open science facility means this capability is available to academic and public researchers, not just private industry, and reflects a broader vision of compute as national infrastructure.
AuroraGPT: AI-for-Science Models
AuroraGPT is Argonne’s initiative to adapt foundation models for scientific domains—ranging from high-dimensional physics simulations to sparse bioinformatics graphs. Rather than build one giant model, the team is developing a family of models tailored to specific scientific questions and modalities. Robert notes that this effort is constrained not by compute—Argonne has secured DOE-scale allocations on Aurora—but by personnel, with only a ~30 person team. Argonne is also one of the key backers of the Trillion Parameter Consortium, a group of scientific and industry leaders working on building trillion-parameter AI models for science.
Managing the Scientific Data Deluge
Unlike commercial LLMs that scrape a finite internet and now rely on synthetic data, science faces the opposite challenge: an overwhelming flood of data generated by experimental infrastructure like the Advanced Photon Source. Each beamline can generate up to a terabyte per second. To handle this, Argonne is pioneering a hybrid edge-HPC architecture—compressing scientific data in real time using GPUs and FPGAs at the beamline before routing it to supercomputers like Polaris for further analysis. This vision of autonomous experimentation—AI models directly interfacing with scientific instruments—marks the future of how we’ll do science at scale.
Transcript
Charles Yang
Okay, awesome. Today I have the pleasure of having Robert Underwood join us. Robert is a staff scientist at Argonne National Lab. Robert, thanks for coming on.
Robert Underwood
Yeah, thank you for having me.
Charles Yang
Great. So maybe first it'd be helpful if you could give our listeners a sense of Argonne National Lab's mission, history, and focus. Not everyone might be familiar with what the national labs, and Argonne in particular, do.
Robert Underwood
Sure. The national lab system traces its origins to the Manhattan Project—so these labs go all the way back to the development of the atomic bomb. Argonne was one of the original labs, alongside Los Alamos. But since then, the labs have evolved significantly. Their mission today is much broader: developing energy and science capabilities for the benefit of the nation.
That includes a wide range of research domains, and increasingly, that means working on AI for science.
Argonne is what’s called an open science lab, meaning our facilities are available to external researchers. We're one of the three major computing labs in the DOE ecosystem, alongside Oak Ridge and NERSC. These three house some of the world’s most powerful computing infrastructure.
At Argonne, that’s the Argonne Leadership Computing Facility, or ALCF. The crown jewel there is Aurora, one of the world’s largest open science supercomputers. It delivers over one exaflop of double-precision floating point performance—which is a staggering amount of computational power.
But what’s especially interesting about Aurora is its flexibility: it can also compute in lower precision formats, which makes it uniquely valuable for machine learning and AI workloads. That’s one of the key areas we’re exploring—and something we’ll talk more about today.
Charles Yang
You mentioned that Argonne is an open science lab. That’s in contrast, of course, to the weapons labs under DOE that aren’t quite as open, shall we say.
You also brought up Aurora, which I believe came online just a few months ago and recently ranked number one in the world on the Top500 high-performance computing benchmark [Postscript: Aurora is now #3 on Top500, behind Frontier at Oak Ridge National Lab and El Capitan at Lawrence Livermore National Lab]. Could you walk us through how a system like Aurora compares to what we’re seeing in the commercial space—particularly the new cloud GPU clusters companies are building, like the ones from CoreWeave or Lambda? Those also have a lot of compute. So what’s the real difference?
Robert Underwood
Yeah, great question. There are a few important differences.
First, national lab systems like Aurora tend to emphasize specialized hardware characteristics. We typically use high-performance interconnects—that’s becoming more common in commercial AI supercomputers, but it’s still a differentiator. We also rely heavily on parallel and distributed file systems, which offer different consistency models than what you’ll find in commercial cloud environments.
Another major distinction is job structure and scale. We design our systems to run single, extremely large jobs—things that may need the entire machine to run. That’s a core part of the mission of the Argonne Leadership Computing Facility: to enable one-of-a-kind science that simply isn’t possible on other infrastructure.
With Aurora, for example, we’re talking about jobs that span 10,000+ nodes, with something like 60,000 GPUs all working in tandem on a single simulation or model. You just don’t get access to that kind of coordinated compute outside the lab environment. For many scientific applications, it’s the only viable way to run these workloads.
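To make the idea of a single coordinated job concrete, here is a minimal sketch of the pattern using PyTorch's distributed API as a generic stand-in; Aurora itself uses Intel GPUs and its own MPI-based launch stack, so treat the backend and launcher details below as illustrative assumptions rather than how Aurora jobs are actually written.

```python
# Minimal sketch of a tightly coupled, multi-rank job: every rank
# computes a local result, then all ranks synchronize with an
# all-reduce. Illustrative only -- Aurora uses Intel GPUs and its own
# launch/runtime stack, not this exact setup.
import torch
import torch.distributed as dist

def main():
    # Rank and world size are normally injected by the job launcher
    # (torchrun, mpiexec, etc.).
    dist.init_process_group(backend="gloo")  # GPU systems use "nccl"/"ccl"
    rank = dist.get_rank()
    world = dist.get_world_size()

    # Each rank holds a shard of the problem; here just a dummy tensor.
    local = torch.full((4,), float(rank))

    # The collective step is what makes this one coordinated job:
    # every rank must participate, or the whole job stalls.
    dist.all_reduce(local, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(f"{world} ranks participated, reduced value: {local[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```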
Charles Yang
That makes sense. I do want to dive into the details of the AI-for-science work you’re doing, but maybe one more tangent on compute architecture.
You mentioned that Aurora—and high-performance computing (HPC) more broadly—tends to focus on high-precision float types, like double precision. But many modern AI workloads are now shifting toward lower or mixed precision to scale better.
Do you see a divergence emerging between the needs of traditional HPC and the requirements of AI workloads? It feels like there are two increasingly distinct paradigms for large-scale compute, and I wonder whether the labs will need to start rethinking how they architect future systems to support both.
Robert Underwood
I mean, from my perspective, what I see is that industry is actually getting closer to us. It’s not so much that industry cares about double precision, but if you look at the other exascale machines in the United States, they also use GPUs where, if you want to get the maximum possible computational performance out of the machine, you have to use these lower-precision floating point units. This would be like Tensor Cores on NVIDIA hardware.
But AMD and Intel each have their own equivalent—something like BFLOAT16-style computational capacity. And if you want to fully leverage that, you need to use lower-precision formats.
So while we talk about Aurora as being an exaflop machine, I think if you use the 16-bit precision, if I’m not mistaken, it gets close to 12 exaflops of performance. So if you’re really looking to take advantage of the peak power of the machine, you’re going to be using these low-precision representations.
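As a rough illustration of how those lower-precision units get used in practice, here is a small sketch using PyTorch's automatic mixed precision with bfloat16 on CPU; the framework and device are assumptions for illustration, since Intel, AMD, and NVIDIA each expose this capability through their own software stacks.

```python
# Illustrative only: one common way frameworks tap low-precision matrix
# units is automatic mixed precision. PyTorch on CPU shown as a generic
# example; vendor stacks differ.
import torch

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# Full double precision: the kind of math the "one exaflop" headline
# number refers to.
c_fp64 = a.double() @ b.double()

# bfloat16 autocast: matmuls run in a low-precision format, trading
# precision for several times more throughput on suitable hardware.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c_bf16 = a @ b

print("max abs difference:", (c_fp64 - c_bf16.double()).abs().max().item())
```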
Charles Yang
Right. Well, so maybe let's talk about the project that you guys announced over a year ago now called AuroraGPT. What is it?
Robert Underwood
So AuroraGPT is Argonne's effort to prepare for a future where AI and science are much more heavily integrated. One way we think about that is by asking: what does it mean to leverage data that’s unique and specific to scientific applications and workflows in the context of AI?
That data often looks very different from what you typically see in most industrial use cases. For example, we might need to represent higher-dimensional data—like 5D or 6D tensors—for certain kinds of physics problems. We might need to handle very large graph data, or work with sparse and unstructured grids or meshes, which are often used in things like finite element codes.
So there are many ways in which the labs have unique data structures that are extremely valuable for solving specific scientific problems, but which haven’t really been explored by most major AI players in the industry. That’s where we see a niche: how do we adapt AI models and tooling to scientific workflows and applications?
Charles Yang
Yeah, and I think that data modality point is really important. The kinds of examples you’re describing—those aren’t things ChatGPT is going to be able to help with. Or maybe it could, but the dimensionality just isn’t the right shape for that kind of model.
So is AuroraGPT a single big model that you’re training? You mentioned a bunch of different modalities and different kinds of scientific applications. What does the progression look like so far?
Robert Underwood
So what we're really looking at is kind of a series of models, each aimed at answering one or more different kinds of scientific questions. For example, we might want to understand whether a model trained on substantially more biological sciences information is better at answering questions in that domain. That might be one of the questions we’re trying to answer.
So we would train not only on papers and standard reference materials available in biology, but also look at adapting various other resources. For example, Argonne has something called the BVBRC, which I believe is the Bacterial and Viral Bioinformatics Resource Center, which is one of these major resources that we're using for trying to do these experiments around bio. The BVBRC is a multimodal database containing both tabular information and other forms of data. It includes descriptions of actual in-lab experiments—experiments that people have done using different materials and biological samples—as well as simulations involving similar or sometimes the same materials. So you can imagine this is a very rich dataset, and we’re trying to explore how we can take all of that and make it accessible to scientists working with AI.
Charles Yang
That’s really interesting—especially this biological dataset you mentioned. How does that compare to something like Arc’s Evo model?
The scale-pilled thesis, of course, is: if you throw enough tokens at it, the model can learn a lot of the underlying relationships. And on the biology side, a lot of that work focuses on tokenizing gene or DNA sequences. Is the dataset you’re describing different in that regard? And how does it stack up against what we’re seeing in industry?
Robert Underwood
Yeah, so my perception, not having looked into the details of the Evo model specifically, is that Arc is doing some very interesting things in the materials space. They've used techniques like equivariant neural networks to adapt, as I understand it, MD-style simulations of different materials and particles into models.
I think that gets you some of the way there, but my impression is that there are richer forms of data available from simulations, which the labs may have access to in greater quantities or greater varieties than companies do in this particular space.
So one way we have access to a large amount of data is through a large facility here at Argonne called the Advanced Photon Source. It allows us to take images, essentially, of different materials and understand their structure as the data is being collected.
As we study the structure and better understand the more fundamental properties of the materials and biological samples, that goes back and informs the next set of experiments one might run. So there's a deep interconnectedness to this that might extend beyond what you can capture in, say, a single simulation of a single material; it's about finding the deeper relationships that might exist. Now, it's possible you can get at that with, as you described it, just scaling with additional tokens. But I think the better way to frame it is: is it better to provide a more concise, richer data source or a larger, less rich data source? That's an open question we'll see answered over the coming years.
Charles Yang
And that's a good point about the APS. I think the UK announced a very similar project using the hard X-ray light source at their national lab to generate a protein-ligand dataset. So I certainly do want to talk about the role scientific infrastructure plays in generating data for AI models. But before we leave AuroraGPT: it sounds like you're developing a number of foundation models specifically geared toward this kind of high-fidelity experimental data that might not be easily tokenizable by the current class of industry models.
What's the kind of state of the effort? I mean, like how many people are working on it? What kind of compute systems are you all using? What's the scale of the models you guys are working with?
Robert Underwood
So at this time, I think we have on the order of 30-ish people who spend some percentage of their time working on AuroraGPT. If you compare that to industry efforts, they’re going to have a lot more people, simply because they have much larger budgets for this kind of work. But the idea for us is to use the relatively small amount of resources we do have and try to leverage them for the biggest possible impact.
In terms of model sizes, we’ve looked at 7 billion parameter models, and we’re starting to scale up to 70 billion parameter models. These sizes are useful because they can still fit on existing infrastructure. If you look across the space, you’ll see a bunch of models in the 7 to 9 billion parameter range—smallish, but still quite useful.
And then there’s the 70-ish billion parameter size, which roughly corresponds to a single DGX node’s worth of hardware. That’s another common checkpoint you see across the industry.
Argonne is also affiliated with something called the Trillion Parameter Consortium. So we do have aspirations to eventually go bigger. But for now, we’re starting small—building up our tooling and experimenting at these more tractable scales to see how far we can push the techniques.
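For a sense of why 7B and 70B are the common sizes Robert mentions, here is a back-of-envelope sketch; the 2 bytes per parameter (bf16/fp16) and 80 GB per GPU figures are assumptions standing in for a DGX-class node with 8 accelerators, and the arithmetic covers only the weights, not the optimizer state or activations that training would also need.

```python
# Back-of-envelope arithmetic for why ~7B and ~70B parameter models are
# common sizes. Assumes 2 bytes per parameter and 80 GB GPUs; weights
# only -- training needs several times more memory for optimizer state
# and activations.
GiB = 1024**3

def weight_footprint_gib(params_billion, bytes_per_param=2):
    return params_billion * 1e9 * bytes_per_param / GiB

for size in (7, 70):
    gib = weight_footprint_gib(size)
    gpus = gib / 80  # fraction of 80 GB GPUs needed just for the weights
    print(f"{size}B params ~ {gib:.0f} GiB of weights ~ {gpus:.1f} x 80GB GPUs")
```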
Charles Yang
Yeah, I mean, have you all had any results come out of it that you can talk about now?
Because my general concern is—it’s 2025, and industry models are getting larger and larger. But even now, they haven’t really proven much directly. Though, to be fair, groups like Future House in San Francisco are starting to productionize some of them in more domain-specific ways.
When you talk about 7 billion and 70 billion parameter models—granted, these are very different kinds of architectures, especially when you're dealing with higher-fidelity data—but it still feels like the pacing and the level of resources going into testing this hypothesis you’re describing seems kind of disproportionate or maybe inadequate relative to the broader conversation.
Do you all have a timeline for when you're expecting results? And what would you need to see to feel like the hypothesis is validated—or, on the flip side, to conclude it’s not the right path?
Robert Underwood
So I would say we're working very actively toward a first set of results that we're ready to talk about publicly, but we're not at that stage yet. What I can say is that this is a very large problem, and I'm confident it will not be solved by the time we publish our results. So even outside the context of actual model releases, we're making methodological and other contributions around the evaluation of AI. A good example of this is the EAIRA paper.
The evaluation team here at Argonne recently put out a paper called EAIRA, which proposes a methodology for evaluating AI models in the context of science. The methodology starts with two components found in most evaluation stacks, the first being multiple-choice questions, but specialized for scientific purposes.
We also look at where the gaps are in the existing benchmarks used to evaluate these kinds of models, and then we move into more open-generation or free-response style questions. The last two things we look at are, I think, different from what you see a lot of evaluations doing right now. One is what we call lab-style experiments, which you can think of as case studies: very long-form experiments where we have domain experts working on a very hard, cutting-edge problem.
We bring them in for multiple hours, have them work with state-of-the-art models from across the different vendors, ask similar sets of questions of each model, and evaluate where the gaps are. While that's not the same thing as producing a model per se, it's a meaningful contribution in terms of describing where the gaps are in the methodology for evaluating these models.
The fourth category proposed in the paper is what we call field-style experiments, which are large-scale. You may have heard of the thousand scientist jam organized by the US Department of Energy with OpenAI and Anthropic earlier this year. Part of that effort is looking at how we scale up the idea of a lab-style experiment to a larger community, building automated tools and scalable evaluation methodologies to quickly assess, across a corpus maybe even the size of the entire DOE, what the valuable problems are that we want to solve. So while building a model is a piece of our mission, it's not the only piece.
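To give a flavor of what the multiple-choice slice of such an evaluation looks like mechanically, here is a minimal sketch; `ask_model` is a hypothetical stand-in for whatever model or API is under test, and the data structure is illustrative rather than anything taken from the EAIRA paper.

```python
# Minimal sketch of the multiple-choice slice of a science evaluation.
# `ask_model` is a hypothetical stand-in for the model under test; the
# items are placeholders, not EAIRA content.
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    choices: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer: str               # gold label, e.g. "B"

def ask_model(prompt: str) -> str:
    # Placeholder: call the model under test and return its answer text.
    raise NotImplementedError

def evaluate(items: list[MCQ]) -> float:
    correct = 0
    for item in items:
        options = "\n".join(f"{k}. {v}" for k, v in item.choices.items())
        prompt = f"{item.question}\n{options}\nAnswer with one letter."
        if ask_model(prompt).strip().upper().startswith(item.answer):
            correct += 1
    return correct / len(items)
```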
Charles Yang
Yeah, and to the point about evaluation, that's something we've seen OpenAI fund a lot of work around for AI for math, where they're essentially trying to pull together benchmarks they can then measure their models against, right? And that takes a lot of work from mathematicians to get involved in; it's certainly a form of labor at the very least. And to the broader point around AI-for-science models, you've seen Arc come out with Evo, Meta with the OMol models, and Google DeepMind with GraphCast and their AI weather forecasting models.
So the thesis that there are differentiated classes of models for scientific data specifically is, I think, supported by many. But that message certainly gets lost a lot in the discourse: not all AI models are born the same or trained the same way.
Robert Underwood
Yeah, but the other thing is, if you look at a model like OLMo, for example, it's in many ways trying to solve a very different problem than something like Llama 3, right? Because one of the distinct purposes of the OLMo model specifically is that they want the entire process to be fully reproducible.
And having a fully reproducible model stack, all the way down to the data, is actually really important if you want to meaningfully measure the performance differences from, for example, injecting a bunch of biological sciences data, because you know exactly what was in the training set. And if you want to go back and audit where a weird generation came from, you don't have a prayer if you're doing this with a Llama-based model. Whereas if you have a model you've trained from scratch, whether it's based on that or some other kind of dataset where you have full provenance, that's really, really valuable in a scientific context. In a business context, you may or may not care about that full level of traceability.
Charles Yang
Right. I think that's definitely another point about the differences between these kinds of models and how we build them. OK, last question on AuroraGPT, and then I do want to talk about the data generation side. What do you think is the primary limitation right now? I'm certainly very supportive of this whole effort; this podcast is really focused on AI for science, and I think having a public capacity to do that is obviously important. What are the primary limitations to scaling up the success of AuroraGPT and Argonne National Lab's involvement in AI for science? Is it people? Is it compute? Is it something else?
Robert Underwood
My impression is that personnel, not compute, is by far our biggest scaling constraint right now. As I said, we're a very small team. If you look at Meta, for example, my impression is that there were on the order of a thousand people involved with the Llama 3 paper. I could have roughly miscounted there, but they're at least an order of magnitude, if not two orders of magnitude, larger than the effort we have. So if we want to demonstrate comparable outputs and comparable efforts, we're going to need more people than we probably have now. Now, what's kind of...
Charles Yang
But are you trying to compete with Llama, or on something else? I do want to distinguish what the right benchmark of reference is here, right? Because if you start saying we need a thousand people like Meta to do a Llama-style thing, people are going to ask why Argonne is building a Llama-style model.
Robert Underwood
I mean, that's a fair assessment, but at the same time, there are a lot of different science domains, and very few, if any, of them have robust AI treatments. So I don't know if a thousand people is necessarily the right number, but my point is: if you want to see larger and faster progress out of the labs on these kinds of efforts, we're going to need more people to do that kind of work.
Charles Yang
Yeah, definitely. Well, it's interesting that you say personnel, not compute, because I know for some other companies compute is the primary limitation.
Robert Underwood
I mean, compute will eventually become a limitation, but, for example, we were able to secure an INCITE award. INCITE is a DOE program for getting large-scale allocations of core hours on machines like Aurora. And while we are definitely making use of our INCITE allocation, I think if we had more people, we could make even more effective use of it. So personnel right now is our biggest constraint.
Charles Yang
Yeah, OK. I think that's going to be helpful for many folks to hear. Let's talk about the other hat you wear at Argonne: you also do a lot of work on scientific data compression. Why does that matter? What's the motivation there?
Robert Underwood
So scientific data compression is really important across a lot of different domains. If you look at exascale applications, they can produce mind-numbing volumes of data. If I'm not mistaken, the HACC Farpoint cosmology simulation generated on the order of 2.3 petabytes of data. And if you look at things like the APS upgrade at Argonne or the LCLS [Linac Coherent Light Source] upgrade at SLAC, these facilities are on pace to produce on the order of a terabyte of data per second per beamline in some cases, which is just a mind-numbing volume of data. With that much data, you really have to have a careful and thoughtful approach to what you're going to do with it in the long run. Data compression is one of a variety of ways to approach that problem.
What's interesting about data compression is that it lets you retain the original dimensions of the data, the original sizing information, and the original number of data points. You're not reducing the featuredness or the richness of the data, and you're not reducing the number of data points; what you're reducing is the precision. And in many cases, for applications, it's better to lose precision, especially if you can control exactly how much precision you lose and where you lose it. For example, if you're looking at fine-grained features in a subsurface simulation, you might have a large portion of the data that's relatively sparse, and you can compress that very aggressively, because there's not a lot of scientific content in that sparser region.
But where you have, say, a turbulent boundary that exists between two regions, you may need a more conservative compression approach in that region, where there's a lot more scientific content. Using compression, you can address concerns about data rate, basically how fast you're producing data, as in these APS or LCLS style use cases, but you can also address use cases where you need to do large-scale data archiving. Deployed at scale, you can imagine data compression dramatically reducing the needs for long-term storage of data, not because you're actually storing dramatically less information, but because the footprint of that data on storage is dramatically smaller. So that's where compression plays a role and can be very helpful.
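To make the error-bounded idea concrete, here is a toy sketch of the guarantee such compressors provide; this is not Argonne's production code, just a uniform quantizer in NumPy showing that every reconstructed value stays within a user-chosen absolute error bound. A real compressor adds prediction and entropy coding on top, and would vary the bound by region, tight where the science is and loose where the field is quiet.

```python
# Toy illustration of error-bounded lossy compression: uniform scalar
# quantization with a user-chosen absolute error bound. Real compressors
# add prediction and entropy coding, but the guarantee is the same:
# |original - reconstructed| <= error_bound.
import numpy as np

def quantize(data: np.ndarray, error_bound: float) -> np.ndarray:
    # A bin width of 2*error_bound keeps reconstruction within the bound.
    return np.round(data / (2 * error_bound)).astype(np.int64)

def reconstruct(codes: np.ndarray, error_bound: float) -> np.ndarray:
    return codes * (2 * error_bound)

rng = np.random.default_rng(0)
field = rng.normal(size=(512, 512))   # stand-in for a scientific field
eb = 1e-2                             # absolute error the science can tolerate
codes = quantize(field, eb)
restored = reconstruct(codes, eb)
assert np.max(np.abs(field - restored)) <= eb
# The integer codes are highly repetitive, so a lossless backend
# (entropy coder) shrinks them far more than the raw floats would shrink.
```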
Charles Yang
Yeah, I think that's an interesting contrast, because a lot of these conventional industry models have kind of reached the data limits of the known set of tokens in the world, and people are doing synthetic data and all these complicated things. But in the world of science, we're actually drowning in data in some sense, right? There's so much data being generated by these massive particle accelerators and hard X-ray light sources that there's a whole field looking at how to grapple with all that data in a more manageable way.
We talked with Sergei Kalinin, who works on autonomous microscopes, and he talked about the massive amount of data being generated by each microscope at the leading edge nowadays. Do you see a heterogeneous or hybrid architecture emerging for scientific instruments, or maybe this is already the case, where massive data compression runs at the point where the data is being generated at the facility?
Robert Underwood
So just to define terms quickly: what I'm hearing is that you have some edge facility producing data at a very high rate, then a set of edge devices that accept or process that data and transfer it over a network, and then some large computing resource where you do further processing, after you've either restored the data from an archive or transmitted it across a wide-area network. If that's what you're describing, then yes, we already do those kinds of techniques.
Charles Yang
That's what I was assuming, yeah. So it's already happening. Do you want to talk a little more about the Advanced Photon Source at Argonne, which is one of the brightest light sources, at least in the country, I think? What does that look like in terms of data flows and where the data is being processed?
Robert Underwood
Yeah, so in the case of the APS, you can have many different experiments conducted at one of the many beamlines on the APS; I think there are on the order of 80 different beamlines. The way to think about it is that each beamline specializes in a particular class of experiments. Some experiments perform something called tomography, which, as I understand it, produces MRI-style images where you're trying to understand the structure of a material. Others look more at subatomic-scale interactions in different materials. For each of these, you'll use different wavelengths or intensities of the X-rays, or different ways of capturing the X-rays as they come off the sample. In some cases you're shooting directly through the sample and studying the direct beam; in some cases you're looking at backscattering.
So there are different ways of conducting these experiments, and each can produce very large volumes of data. In the case of small-angle and wide-angle scattering experiments, you could potentially generate up to a terabyte of data every second. So the team I'm working with as part of another project, called Illumine, is looking at how we can design specialized compression techniques that take the data coming directly off these detectors and make it small enough that we can get it from the detectors to intermediate storage, where we can then recall it on either a locally available HPC resource or a further-away one. In the case of the APS, we frequently use the ALCF resources.
It's actually kind of interesting: at the ALCF, on machines like Polaris, which is one of our other large computing resources, there's a special queue called the demand queue. For a particular block of racks, if a job comes in from the Advanced Photon Source that needs to be processed in near real time, we can prioritize the jobs coming off the APS on the racks dedicated to this demand queue. Other jobs then run at a lower priority in the background when there isn't an APS job, just to keep the machine busy. Having this ability to preempt computation on the ALCF resource when there's an urgent need for compute is an interesting and exciting way to combine these large-scale facilities.
Charles Yang
So that's an interesting partnership: you have this leadership computing facility at Argonne, including one of the world's largest supercomputers, and one of the world's largest beamline facilities, a hard X-ray light source, sending data back and forth. Do they process any data locally at the APS, or is it always sent to Polaris?
Robert Underwood
That's actually a good question. Depending on the task, they will perform certain operations at the edge. For example, and this is a technique used at LCLS, not at the APS, if you're doing a technique called serial femtosecond crystallography, you might conduct initial peak detection at the edge. Basically, there are certain regions of the data that contain what I'll call bright spots; the scientific term is Bragg spots. These Bragg spots are the bright, scientifically significant pieces of the overall image. The reason you might look for them at the edge is so you can perform a technique called non-hit rejection: if you take a picture of a frame, and across that frame, across the detector, you don't see any peaks,
you can discard that frame entirely, which reduces the amount of data you have to transfer. In some ways it's an adaptive sampling technique, based on how much information is in the frame. The second thing you'll frequently do is apply compression. So you're doing the compression near the beamline, and then you do your analysis further away.
You're trying to work out which techniques need to sit nearest the beamline, where you can run them at a very high rate given their complexity, and where you can leverage the fact that you haven't had to transfer the data yet in order to make the most effective use of it.
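As a toy sketch of what non-hit rejection might look like in software, here is a NumPy version that keeps a frame only if enough pixels sit well above the background; the threshold logic and parameter values are assumptions for illustration, since production pipelines implement this on FPGAs or GPUs at the full detector rate.

```python
# Toy version of "non-hit rejection": keep a detector frame only if it
# contains candidate Bragg peaks (bright pixels well above background).
# Thresholds here are illustrative; real deployments run in hardware at
# the full detector rate.
import numpy as np

def is_hit(frame: np.ndarray, sigma: float = 6.0, min_peaks: int = 3) -> bool:
    background = np.median(frame)
    noise = np.std(frame)
    peaks = np.count_nonzero(frame > background + sigma * noise)
    return peaks >= min_peaks

def filter_frames(frames):
    # Yield only the frames worth compressing and transferring downstream.
    for frame in frames:
        if is_hit(frame):
            yield frame
```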
Charles Yang
And so for beamlines that are each generating up to a terabyte a second and doing this kind of compression, does this mean each beamline has a CPU server dedicated to servicing its data and compression needs? And roughly what scale are we talking about here?
Robert Underwood
So different beamlines have different needs. Not all beamlines necessarily generate a terabyte a second, but the ones that do, at least at Argonne, essentially have Polaris nodes deployed at the edge. It's very similar hardware; they may have slight differences in the network interface, for example, but otherwise they look very, very similar to the kinds of resources we already use on the supercomputer. There are just fewer of them.
If you want a more forward-looking example of this, you might look at what LCLS is doing at SLAC. In their case, they're taking data directly off of FPGAs, field-programmable gate arrays, and communicating it directly to a GPU where they do some preliminary processing. After that, they send the data directly from the GPU across the network interface to either long-term storage or further analysis. So you can get these really integrated beamline designs where you carefully understand each stage of the pipeline and deploy them as a collective whole.
Charles Yang
Awesome. And so these are GPU-based workloads.
Robert Underwood
Yes, so because of the data rates that these systems have, you're very frequently going to be moving towards GPUs for many of the different frameworks that you have.
Charles Yang
That's awesome. I guess this speaks to how scientific simulation, video games, and AI training are all similar styles of workload. So we're basically doing GPU preprocessing of the flood of scientific data being generated at each of these beamlines.
Robert Underwood
Yeah, GPU plus FPGA. Some of the tasks are actually being done on FPGAs because, for example, non-hit rejection might be something you really want to do in hardware, given the throughput constraints involved in that particular part of the process. So FPGAs also play a very important role in these kinds of real-time scenarios.
Charles Yang
And roughly what kind of data compression ratio are we talking about? Is it on the order of compressing by 5%, or is it more like a 10-fold compression?
Robert Underwood
The goal set for us by the different beamlines is usually to achieve at least a 10x compression, in some cases 20x, relative to the raw data stream. And in many cases we've been able to do that; if you're interested, I can point you toward some papers where we've applied exactly these kinds of techniques on a variety of different beamlines and approaches.
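For scale, here is the simple arithmetic of what those ratios buy at a beamline peaking around a terabyte per second; the raw rate comes from the earlier discussion, and the rest is just division.

```python
# What a 10x or 20x reduction means at a beamline peaking at ~1 TB/s.
# Numbers are illustrative, taken from the rates discussed above.
raw_rate_tb_s = 1.0
for ratio in (10, 20):
    compressed_gb_s = raw_rate_tb_s * 1000 / ratio
    print(f"{ratio}x compression: ~{compressed_gb_s:.0f} GB/s left to move and store")
```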
Charles Yang
Yeah, that's awesome. And again, it's striking how differently these fields think about it: the AI world is turning to synthetic data, while the scientific world is compressing its data up to 20x because it can't deal with the volume being generated at these facilities. What do you think is a promising area? Argonne has all these large pieces of scientific infrastructure, like beamlines, that are generating vast amounts of data.
What particular fields or applications are you excited about where AI could potentially play a role? This is sort of going back to the AuroraGPT conversation. We talked earlier about the UK's OpenBind competition as one example of the role scientific infrastructure can play. Have you come across any others that you think would be exciting, or that perhaps not enough folks know about?
Robert Underwood
So what I would say is that as part of AuroraGPT, we're actively working with beamlines at the Advanced Photon Source, in addition to the biological data group we have here at Argonne. And I think that as science develops, you'll increasingly see us linking AI systems up to things like self-driving labs, where we use robotics to conduct experiments, collect the results, and interpret them using AI, or maybe use AI to guide which experiment is the most promising to perform next. So there are a lot of opportunities here as we interface automatable infrastructure with AI systems.
Charles Yang
Awesome. Self-driving labs are certainly something we've talked a lot about on this podcast as well, so great to hear. That's quite the vision: generating massive amounts of data at these beamlines, running large-scale AI models on the leadership-class supercomputing facilities, and then using the self-driving lab infrastructure at Argonne to iterate from there and generate more data. And the beamlines have some degree of automation themselves as well, right?
Robert Underwood
Yeah, and that's actually something we're trying to improve with projects like Illumine. The project has roughly three thrusts, and two of them, broadly speaking, have to do with what's sometimes referred to as integrated research infrastructure: how do we communicate and make decisions in an automated fashion, both at the per-frame detection scale, which is one really tight set of real-time constraints,
and also at the level of broader optimization-style constraints that play out over several seconds or minutes. So you have different reinforcement loops happening at these different timescales, making different kinds of decisions about how the experiment will progress. I think it's a very exciting area, and it's a project I'm excited to be a part of.
Charles Yang
Awesome. I can't think of a better way to end than that. Robert, thanks for the time.
Robert Underwood
Yeah, appreciate it.