Addressing doubts about AI progress
Addressing misconceptions about the big picture of AI progress: why GPT-5 isn't late, why synthetic data is viable, and why the next 18 months of progress are likely to be greater than the last 18.
TLDR:
- Compute Scaling Expectations: The first GPT-4.5-scale clusters only started training models within the past few months, and the first GPT-5-scale clusters are expected to be built later in 2025.
- "GPT-5 is Late/Overdue": Even naively extrapolating the gap between the GPT-3 and GPT-4 releases suggests a December 2025 release for GPT-5, not the much earlier timeframe some imply.
- "The Data Wall": While web text is starting to reach practical limits, multimodal data and especially synthetic data are paving the way for continued progress.
- "The Role of Synthetic Data": Data discrimination pipelines, and more advanced training techniques for synthetic data helps avoid issues like model collapse. Synthetic data is already now providing exceptional benefits in coding and math domains, likely even beyond what extra web text data would have done alone at the same compute scales.
- Interface Paradigms: Perception of model capability leaps aren't just a result of compute scale, or architecture or data, but is a result of the actual interface in which you use the model too. If GPT-4 was limited to the same interface paradigm as original GPT-3 or GPT-2, then we would end up with just a really good story completion system for creative use-cases.
- Conclusion: As we go through 2025, I believe the convergence of larger-scale compute, synthetic data, new training methods, and new interface paradigms, will change many people's perspective of AI progress.
Section 1: Personal predictions
I'm going to start by introducing my current views on where AI progress is headed at a high level. On December 2nd, 2024 I told a friend online (after ruminating on this a lot) that I believe the next 18 months of public progress will likely be greater than the last 18 months (I have ~90% confidence in this prediction). Later, on December 13th, I made another, more near-term prediction (with ~75% confidence): "In the next 4.5 months (December 13, 2024, to April 29, 2025), there will be more public frontier AI progress in both intelligence and useful capabilities than the past 18 months combined (June 13, 2023, to December 13, 2024)." I also provided some measurable ways to determine at a later date what the "progress" was. A few days later, on December 20th, OpenAI announced their o3 model. I've since grown more confident in this second prediction, not because of o3 alone, but because of other models/capabilities I expect will be announced too.
Throughout 2024 I've had many conversations with skeptical friends, family members, and acquaintances who believe AI is slowing down, hitting a wall, or that progress is greatly overstated. While their points often seem reasonable at first glance ("Internet data has run out," "Synthetic data causes model collapse," "GPT-5 is late/overdue"), I've often pushed back, asserting reasons why there's not a fundamental slowdown happening. In the couple of months I've spent working on this post, I've seen skepticism decrease due to some new releases, and many of the vocal doubters online have become more... quiet. But I believe many are revising their views and behaviors for surface-level reasons (like simply seeing a cool new model release after months of nothing interesting) rather than revising more fundamental misconceptions they may still hold. In this blog post I won't detail all the reasons for my very near-term optimism around progress, but I'll focus on addressing the most common arguments I hear from "progress doubters."
Section 2: Compute Scaling Expectations
The jump from GPT-2 to GPT-3, and from GPT-3 to GPT-4, each involved a ~100x increase in raw training compute. A hypothetical "half-step" like GPT-4 to GPT-4.5 would involve a 10x compute leap on that same trend, requiring approximately 80K H100s. If you follow the most credible analysts in the space, no training cluster with that amount of compute even existed until mid-2024, when Microsoft/OpenAI are suspected to have built one of the first ~100K H100-scale clusters. xAI publicly announced a similar-scale cluster a few months later, and it's suspected Google has recently reached a similar scale of compute with its TPU clusters too.
These clusters were training the first GPT-4.5-scale models as of Q4 2024, and I believe we will see the first such models announced to the public sometime in Q1 2025. Some may say "these next models should be even better than the raw compute leap would indicate, due to the many algorithmic advancements made in the two years since the first GPT-4," and while this is valid on the surface, it should also be noted that the GPT-3 to GPT-4 leap is already estimated (by Leopold Aschenbrenner, EpochAI, and others) to be a ~1,000x total "effective" compute leap once you account for an estimated ~10x efficiency gain on top of its ~100x raw compute leap. So even if another 10x efficiency leap is available between GPT-4 and 4.5, that would still only yield a ~100x overall effective compute leap, significantly less than the ~1,000x estimated from GPT-3 to 4. Thus, the news headlines and reports about recent internal leaps being "less significant than what we saw between GPT-3 and GPT-4" are not only unsurprising to me, but expected.
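To make the arithmetic explicit, here's a tiny sketch of the "effective compute" comparison above (the ~100x and ~10x figures are the same rough estimates cited in the paragraph, not confirmed numbers):

```python
def effective_compute_leap(raw_multiplier, efficiency_multiplier):
    """Effective compute leap ~= raw compute leap x algorithmic efficiency leap."""
    return raw_multiplier * efficiency_multiplier

# GPT-3 -> GPT-4: ~100x raw compute with an estimated ~10x efficiency gain
print(effective_compute_leap(100, 10))  # ~1,000x effective

# GPT-4 -> GPT-4.5 "half-step": ~10x raw compute, even granting another ~10x efficiency gain
print(effective_compute_leap(10, 10))   # ~100x effective, a notably smaller leap
```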
Looking ahead, however, Nvidia's Blackwell GPUs will enable even larger clusters, reaching GPT-5 scale fairly soon. xAI has expressed plans for a 300K B200 cluster around mid-2025, and OpenAI and Microsoft are suspected to be pursuing similar scales on roughly similar timeframes. For reference, even a ~200K B200 cluster could train a GPT-5-scale model in ~3 months, making a late 2025 or early 2026 release seem likely. There are valid energy infrastructure concerns around how much compute can pull power from the grid in a single location, but distributed training (e.g., coordinating 5 training campuses for a single training run) is expected to become more practical and frequently used soon, potentially further accelerating the compute scale that goes into single models over the next 18 months. So it seems increasingly unlikely that GPT-5-scale models will be fundamentally training-limited by energy constraints, at least over the next 12-18 months. (SemiAnalysis has a great article covering the current situation with potential multi-campus training plans and energy demands.)
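For intuition on the "~3 months on a ~200K B200 cluster" figure, here's a back-of-envelope training-time sketch. Every constant below is an illustrative assumption of mine (per-GPU throughput, utilization, and the target compute budget are placeholders, not confirmed specs):

```python
SECONDS_PER_DAY = 86_400

num_gpus      = 200_000  # hypothetical B200 cluster size
flops_per_gpu = 2.2e15   # assumed usable dense FLOP/s per GPU (placeholder)
utilization   = 0.35     # assumed model FLOPs utilization (MFU, placeholder)
target_flops  = 1e27     # assumed "GPT-5 scale" budget, roughly two orders of magnitude above common GPT-4 estimates

cluster_flops_per_sec = num_gpus * flops_per_gpu * utilization
training_days = target_flops / cluster_flops_per_sec / SECONDS_PER_DAY
print(f"~{training_days:.0f} days")  # lands at roughly a few months under these assumptions
```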
Section 3: "GPT-5 is late/overdue"
A common sentiment is that GPT-5 is late/overdue, with progress doubters attributing these perceived delays to things like a failure of scaling laws or no more data left to train on. On the optimistic side, some who believe "GPT-5 is late/overdue" even think labs have been hiding GPT-5-scale models for months or years. I think both views rest on the same underlying misunderstanding of the previous timeframes and compute scales at play, and I'd say GPT-5 is in fact not late at all.
Many expectations are based on the ~4-month gap between the release of the first GPT-3.5 model in ChatGPT and GPT-4; people recall a period of rapid progress and assume GPT-4.5 and GPT-5 should follow within 4, 8, or 16 months at most. In reality, GPT-4 is confirmed to have finished training before GPT-3.5 was even released within ChatGPT. The original ChatGPT is more appropriately viewed as a preview, to see how the general public would use such an interface before the full GPT-4 was shown. The actual gap between the original GPT-3 and GPT-4 releases was ~33 months (a bit over 2.5 years), and this largely reflects the time required to scale clusters and finalize the research that went into GPT-4. If you naively extrapolate this same 33-month gap after GPT-4, you end up with an estimated GPT-5 release date of December 2025.
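The naive extrapolation is easy to reproduce using the approximate public release dates (June 2020 for GPT-3, March 2023 for GPT-4):

```python
from datetime import date
from dateutil.relativedelta import relativedelta  # pip install python-dateutil

gpt3_release = date(2020, 6, 11)   # approximate GPT-3 public release
gpt4_release = date(2023, 3, 14)   # GPT-4 release

gap = relativedelta(gpt4_release, gpt3_release)
gap_months = gap.years * 12 + gap.months              # ~33 months
naive_gpt5 = gpt4_release + relativedelta(months=gap_months)
print(gap_months, naive_gpt5)                         # 33 2025-12-14
```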
(This article from SemiAnalysis provides an excellent breakdown of the interconnect and cluster scaling challenges faced over the years, and how they're being solved.)
Section 4: The Data Wall
A valid concern is the "data wall": the limit of high-quality, unique internet data. Some point to the Common Crawl web archive, which contains ~100T tokens of text data, while GPT-4 is commonly said to have used less than 6T tokens of unique data (albeit "repeated" over several epochs during training, for technically ~13T tokens). But over 80% of this Common Crawl data is filtered out once you deduplicate it and apply even basic quality filtering, as shown by filtering efforts like the open-source Dolma work. So the limits of good public text data are indeed close, or already being reached in some regards. Proprietary data, along with multi-epoch training (showing the model the same data multiple times during training), can likely help, but admittedly, the industry consensus is largely still that the practical limits of such useful, unique, natively existing data are relatively close. High-quality multimodal data (like video) exists in amounts estimated at 10x to 100x+ more than available web text, expanding training possibilities further, though improvements relying on multimodal data may differ from past leaps in unpredictable ways. But there is something else being focused on which many believe has even more potential: synthetic data.
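As a toy illustration of why so much raw web text gets discarded, here's a minimal sketch of exact deduplication plus a crude heuristic quality filter (real pipelines like Dolma use far more sophisticated fuzzy dedup and classifier-based filtering; the thresholds here are arbitrary placeholders):

```python
import hashlib

def exact_dedup(documents):
    """Drop exact duplicate documents by hashing normalized text."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def passes_basic_quality(doc, min_words=50, max_symbol_ratio=0.1):
    """Crude heuristics: enough words, and not dominated by markup/symbol noise."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbol_count = sum(ch in "#{}<>|=" for ch in doc)
    return symbol_count / max(len(doc), 1) <= max_symbol_ratio

def filter_corpus(raw_documents):
    """Deduplicate, then keep only documents passing the quality heuristics."""
    return [d for d in exact_dedup(raw_documents) if passes_basic_quality(d)]
```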
Section 4.5: The Role of Synthetic Data
Synthetic data (in this context, AI-generated data used to train future models) helps address this data scarcity. Progress doubters often cite "model collapse" from papers testing indiscriminate use of synthetic data, but this is contrary to what most large-scale synthetic data efforts are actually moving towards in reality: highly discriminate use of synthetic data.
The latest methods involve: highly selective pipelines that filter out low-quality data, mechanisms for maximizing the uniqueness and variety of generated data, scoring data on various quality metrics with specialized labelling models, functional verifiers for areas like math and code, and sometimes grounding data generation directly in human-sourced data. Recently there are even advanced pipelines working towards generating and pruning such data iteratively within the training process itself.
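To make the "functional verifier" idea concrete, here's a minimal sketch of one common pattern for synthetic code data: sample candidate solutions, execute each against its tests, and keep only verified pairs. The generate_candidates stub is a hypothetical placeholder for a real model call, not any lab's actual pipeline:

```python
import subprocess
import sys

def generate_candidates(problem_prompt, n=8):
    """Hypothetical placeholder: sample n candidate solutions from a model."""
    raise NotImplementedError("call your model's API here")

def passes_tests(solution_code, test_code, timeout=10):
    """Functionally verify a candidate by executing it together with its tests."""
    program = solution_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0  # the asserts in test_code didn't fire

def build_verified_dataset(problems):
    """Keep only (prompt, solution) pairs whose solutions pass their tests."""
    dataset = []
    for problem in problems:
        for candidate in generate_candidates(problem["prompt"]):
            if passes_tests(candidate, problem["tests"]):
                dataset.append({"prompt": problem["prompt"], "solution": candidate})
                break  # one verified solution per problem is enough here
    return dataset
```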
These approaches especially excel in domains like coding and math, where functional verification is possible. For example, Llama-3.1 was confirmed to have been trained with code execution feedback to improve its coding abilities over multiple rounds of self-improvement during training. General methods for open-ended tasks are also emerging, like Meta's research on "meta-rewarding language models," which allows models to self-improve and judge their own responses in more open-ended domains that lack functional verification.
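For open-ended domains without a functional verifier, the rough shape of these self-judging setups is: sample several responses, have a judge model score them, and keep the highest-scoring ones as training data. This is a simplified illustration of the general idea, not Meta's actual recipe; score_with_judge is a hypothetical placeholder:

```python
def score_with_judge(prompt, response):
    """Hypothetical judge call: have a model rate the response against a rubric, return a float."""
    raise NotImplementedError("call a judge model here")

def best_of_n(prompt, sample_response, n=8):
    """Sample n responses and keep the one the judge scores highest."""
    responses = [sample_response(prompt) for _ in range(n)]
    return max(responses, key=lambda r: score_with_judge(prompt, r))

# The resulting (prompt, best response) pairs can then serve as preference or
# fine-tuning data for the next round of training.
```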
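The distillation step can be pictured roughly like this: collect reasoning-model outputs on hard prompts, filter them (for example by checking final answers where a reference exists), and fine-tune the chat model on what survives. A simplified sketch with hypothetical helper names, not DeepSeek's actual implementation:

```python
def build_distillation_set(prompts, teacher_generate, answer_is_correct):
    """Collect teacher (reasoning-model) outputs, keeping only verified ones."""
    examples = []
    for prompt in prompts:
        reasoning, answer = teacher_generate(prompt)   # hypothetical teacher call
        if answer_is_correct(prompt, answer):          # e.g. compare to a reference answer
            examples.append({"prompt": prompt, "target": reasoning + "\n" + answer})
    return examples

# The student (chat) model is then fine-tuned on these examples with ordinary
# supervised next-token prediction, inheriting some of the teacher's reasoning behavior.
```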
Section 5: Interface paradigms
I believe the leap we all felt from GPT-3 to GPT-4 is often attributed to something achievable with scale alone, but there is also a key fundamental difference in how you experience the model that we shouldn't take for granted: the interface change from story-completion to chat. Imagine if GPT-4 had been released as just another story-completion model, like GPT-2 and GPT-3 were, used in a way where the user types the beginning of a story and the model writes the rest for you. The fact that GPT-3.5 and GPT-4 leveraged an entirely new kind of interface (and, importantly, were also post-trained with new techniques developed to maximize the capabilities of that new paradigm!) plays a big role in how much of the underlying intelligence and capability actually gets demonstrated to the end-user.
Although the chat interface is definitely a big leap in generality and usefulness, I doubt it's the fundamentally best or last interface for unlocking the intelligence of future models. I would also consider voice mode and the current o1 interface to still be, for the most part, part of the same chat paradigm. It's also not as simple as figuring out the new interface paradigm and then slapping GPT-4 onto it: there were already chat versions of GPT-2 and GPT-3 scale models online, like Character AI and AI Dungeon, for around a year or more before GPT-3.5 and GPT-4 released. Those attempts at the chat paradigm were too early in their underlying intelligence and power to be truly useful, and thus AI Dungeon and Character AI didn't elicit that big ChatGPT moment, although they were used for some niche creative use-cases by early adopters. That said, anybody looking at these early GPT-2 scale chat attempts in 2019-2022, seeing how unreliable they were, and concluding that LLMs are fundamentally incapable of ever being useful chatbots for the general population... would be gravely mistaken.
Section 5.5: The coming era of CUA (Computer Using Agents)
I think we're at the precipice of such a "ChatGPT" era for agents, CUAs (Computer Using Agents) in particular. Imagine a model seeing and controlling a desktop screen and working for you, or even with you. Imagine it using its own mouse cursor on the screen, perhaps alongside your own, collaborating and talking with you as if you're on a Discord call screen-sharing with a friend, except that friend can work on the project directly with their own clicks and inputs on the fly, and keeps important things in mind if they're relevant for future interactions. There have already been many attempts at making useful "agents," but I believe they've been too early, mainly due to the limited intelligence/capabilities of models at current compute scales. Products like the Rabbit R1 and Humane are good in theory, but they over-promise on agentic capabilities that models of the time can't deliver anywhere near reliably enough, fast enough, or cost-effectively enough. But doubters will be gravely mistaken if they believe the underlying transformer architectures are fundamentally incapable of reliable agentic capabilities, just like anyone who claimed in the GPT-2 and GPT-3 era that LLMs would never be useful chatbots for the general population.
Claude's Artifacts feature, v0 by Vercel, and Claude computer use are all fairly impressive things released in the past 8 months. But I believe these are all closer to the AI Dungeon and Character AI products of the agent paradigm. In the case of Anthropic's computer use, it's only available through API access for devs, cost-prohibitive, slow, and still quite unreliable outside a fairly niche window of use-cases, much like the niche user-base of the original AI Dungeon and Character AI. Over the next 6 to 18 months I think we will see significant capability leaps, with most of these barriers broken down in significant ways: a combination of new GPT-4.5-scale models (and soon GPT-5-scale) releasing, alongside training techniques applied to those models and their derivatives to improve agent reliability and multi-agent capabilities, and even significant cost and speed improvements. I believe this will effectively culminate in a ChatGPT era for agents.
In the shorter term, I think several functionalities, like being able to easily delegate long-term or periodic tasks to your agent, will be important: telling the model, "Deliver me a well researched 1,000 word report every month on the most recent open source AI releases," along with the model proactively initiating conversations with you and/or asking clarifying questions about how it should do a particular step of a task. In theory a lot of the functionality I'm describing here is somewhat trivial to implement with existing models, but in practice, forcing such functionality onto GPT-4 scale models would likely make it quite unreliable, undesirable, and annoying. Thus I think at least the more advanced and reliable forms of these features will likely be rolled out with newer models over the next 3 to 18 months, at increasing degrees of complexity and usefulness over time.
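As a sketch of why the scheduling scaffold itself is the easy part, here's roughly what delegating that monthly report could look like mechanically (run_agent is a hypothetical placeholder, not any product's real API; the hard part is the model's reliability, not this loop):

```python
import time
from datetime import datetime, timedelta

def run_agent(task_prompt):
    """Hypothetical placeholder for an agent/model call that researches and writes the report."""
    raise NotImplementedError("call your agent framework or model API here")

task = ("Deliver me a well researched 1,000 word report every month "
        "on the most recent open source AI releases.")
next_run = datetime.now()

while True:
    if datetime.now() >= next_run:
        report = run_agent(task)
        print(f"[{datetime.now():%Y-%m-%d}] new report ready ({len(report.split())} words)")
        next_run += timedelta(days=30)   # crude "monthly" schedule
    time.sleep(3600)                     # check hourly
```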
Conclusion:
Given the combination of these factors: 100x+ larger training runs in the next 12 months, synthetic data already proving its usefulness, new training techniques like the advanced RL used by o1, and new interface paradigms like CUA, it seems likely to me that many people's perspective on AI progress will change significantly in 2025. Progress doubts will become ever more rare.
Going forward, people should be more careful not to extrapolate trends from very short-term data points like the 4-month gap between GPT-3.5 and GPT-4, which can lead to assumptions of progress far more optimistic than what is really happening behind the scenes. If you're interested in the true progress of raw compute scaling versus public capabilities, the best single data type to look at is likely cluster sizes over time. GPT-5 is not late, synthetic data is viable, and the near future of AI progress looks bright.