In How to Understand the Post-LLM Era, I took an operator's point of view on what product building will look like in this era and how builders can form a mental model for large language models (LLMs). In the AI Weirding post, I shared a broader picture of how AI is weirding the future of work and workers; much of that was inspired by recent developments in AI, especially LLMs. I apologize for casually dropping the term “post-LLM era” in both posts without explaining what it is and why it is plausible. I will fix that here.
First, let me acknowledge that LLMs are by no means the best models for most of the things people use them for these days, nor am I advocating that they should be. But, just as in the AI Weirding post, I am taking on the role of an informed observer, sharing what I see unfolding on the ground.
Post-LLM Era
Briefly, the post-LLM era refers to an epoch where large language models become the de facto models, like how transformers have become the de facto architecture for modeling.
Imagine a world where LLMs become the hammer for every problem, in various guises as “assistants”, “GPTs”, and “agents”, which I will collectively call “AIs”. If you are on LLM Twitter/X, you may have seen sparks of this already. I am going to lay out the reasons why LLMs are poised to become the hammer of choice:
Transformers, the architecture dominantly backing LLMs today, are the most flexible models we have had so far, if their performance in practically every area is any indication. While transformer alternatives (e.g., RWKV, Mamba) will arise, the attention mechanism is indispensable, and there is already so much investment in transformers, including hardware (more on that later), that you can say transformers have won the Hardware Lottery for the near future. Even if these architectural alternatives gain mass adoption, they will still be used to build LLMs. So, LLMs will win as the models of choice regardless of architectural innovations. (A minimal sketch of the attention computation follows the aside below.)
Aside: To map out the conditions reinforcing this architecture consolidation and its consequences, check out What if Transformers are All You Need.
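For readers who have not looked under the hood, the attention mechanism in question boils down to a few lines of matrix algebra. Here is a minimal numpy sketch of single-head scaled dot-product attention; real transformer layers add multiple heads, causal masking, and the kernel-level optimizations discussed later in this post.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention over one sequence.

    q, k, v: (seq_len, d) arrays of queries, keys, and values.
    Production layers use many heads, masking, and fused kernels,
    but this is the core computation.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # token-to-token affinities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted mix of values
```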
Prompting (or in-context learning) is the most natural interface we have had so far for accessing a machine learning model, and from an ML point of view, it is also the most flexible meta-learning approach. In other words, LLMs provide frictionless inference. This is important because it allows anybody to build a machine-learning service for any problem, regardless of their ML understanding or domain competence. (See the sketch after the aside below.)
Aside: Meta-learning is learning to learn. With every prompt, an LLM is solving a new problem. So, when we train an LLM, we train a general problem-solving engine. This is one of the reasons why some brand these as “foundation models”, but that’s a misleading term as it leads the general public to believe that these models are indispensable for all purposes.
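To make “solving a new problem with every prompt” concrete, here is the sketch referenced above: a sentiment classifier specified entirely in the prompt, with no task-specific training. The examples and format are illustrative; any completion or chat endpoint could consume this string.

```python
# A new "sentiment classifier" defined purely in the prompt -- no gradient
# updates, no task-specific training run. Examples are illustrative.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery died within a week.
Sentiment: negative

Review: Crisp screen and it just works.
Sentiment: positive

Review: {review}
Sentiment:"""

def make_prompt(review: str) -> str:
    # Each call specifies a fresh instance of the task for the model to
    # solve in context -- this is what the aside calls "learning to learn".
    return few_shot_prompt.format(review=review)

print(make_prompt("Setup took five minutes and the manual was useless."))
```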
To further remove friction from LLM setup, providers such as OpenAI, Google, Cohere, Perplexity, Anthropic, AWS Bedrock, Fireworks, Replicate, Together, and Anyscale offer high-performance endpoints that anyone can invoke with a simple pip install. Just bring a credit card. (A sketch of what this looks like in code follows the asides below.)
The cost of developing and deploying these AIs is coming down dramatically. OpenAI has, over the past year, slashed its rates by 3x, and markets expect prices to go even lower. High-performance open-source model endpoints, such as Mixtral, offered by providers like Mistral, AWS Bedrock, Fireworks, Replicate, Together, and Anyscale at hyper-competitive pricing can force this downward pricing spiral, not only among the closed-AI companies but also among the open-AI companies, to happen faster than we imagined, making LLMs an inexpensive hammer for solving problems and the de facto AI.
Aside: Finbarr Timbers has an excellent post on The evolution of the LLM API market, where he expects the price of LLM APIs “to converge to [the] price of GPUs + electricity (and as competition increases in the GPU market, perhaps just to the price of electricity).”
If you buy LLM endpoint services from the right places, you can get them for less than the price of electricity, thanks to VC discounts. Startups are not subject to rational economics :)
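Here is the sketch promised above, using the OpenAI Python client (pip install openai). Several of the providers named earlier expose OpenAI-compatible endpoints, so switching vendors is often just a different base URL and API key; the model name, the placeholder URL, and any particular provider's compatibility are assumptions to verify against their docs.

```python
from openai import OpenAI  # pip install openai

# Hosted endpoint; the client reads OPENAI_API_KEY from the environment.
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[{"role": "user",
               "content": "Summarize why LLM inference is getting cheaper."}],
)
print(resp.choices[0].message.content)

# Many providers mimic this API, so pointing the same client at another
# vendor is often only a base_url and key change (placeholder URL below;
# check the provider's documentation for the real one).
other_client = OpenAI(base_url="https://<provider-endpoint>/v1", api_key="...")
```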
Algorithmic innovations in neural network quantization (bitsandbytes, for example), pruning (LLM-Pruner, for example), KV caching (PagedAttention, for example), memory bandwidth optimization (Grouped Query Attention, for example), speculative decoding (guided generation, for example), and device placement to use commodity GPUs (PowerInfer, for example) all individually, and in combination, reduce the cost and increase the speed of LLM inference. This further cements LLMs as the model of choice for providers. (A quantization sketch follows the aside below.)
Aside: If you don’t get some of these terms, don’t worry. Future posts will bring more clarity to them. For now, you can imagine each of these methods as compute multipliers. With suitable experimentation, finding the optimal subset of these multipliers that work synergistically should be possible.
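To make one of these multipliers concrete, here is a sketch of loading an open model in 4-bit precision with bitsandbytes via Hugging Face transformers. The model name is illustrative, and the exact savings depend on the model and hardware; treat this as a starting point rather than a recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weight quantization (bitsandbytes): roughly a 4x reduction in weight
# memory versus fp16, at a modest quality cost for many models.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative open model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs/CPU automatically
)

inputs = tokenizer("Explain KV caching in one sentence.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```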
Software-defined hardware innovations, such as Groq and TinyGrad, can drive these costs down by a few more orders of magnitude. For example, in Groq’s Tensor Streaming Processor, the chip’s instruction set is designed explicitly to optimize transformer inference.
Aside: NVIDIA has, because of its history and long-standing monopoly, created an abstraction-leak problem where developers are forced to write low-level kernels for any optimization they can think of. The fact that it takes CUDA geniuses like Scott Gray, Tim Dettmers, and Tri Dao to produce highly optimized model code for new ideas is a platform bug rather than a feature. Thankfully, this is changing, as it should. If you are a forward-looking ML researcher, you should not be solving CUDA puzzles! Sorry, Sasha :P
In contrast to all this, software/hardware tooling for other models and architectures now seems “hard to use” because we are getting spoiled by the conveniences offered by LLMs as a model.
Aside: I am aware of advances in other generative models (I am teaching a course on them soon, btw), in particular, diffusion models. A lot of this post is written from a speech and language perspective. Diffusion models have similarly gained traction for image and video generation. However, the jury is still out on whether they will become the de facto models for learning representations of those modalities.
Apple’s “LLM in a Flash” paper is a harbinger of a trend: LLMs will not only be the primary model choice because of their flexibility and multi-pronged cost optimizations, but will also become ubiquitous. That will be the post-LLM era. Despite the signals on Twitter, we are only at its threshold. We will observe small changes gradually at the beginning and significant changes swiftly towards its peak and end. The big headline events of 2022 and 2023, with a lot of fanfare around massive models, will feel like a distraction in hindsight. They were useful, and they got LLMs into the zeitgeist. However, we are now witnessing a different revolution: high-performance smaller models, models aligned with personal values, and models with weight sovereignty. These will be the dominant LLMs in the post-LLM era.
Postscript: I sincerely thank the paid subscribers on this platform and elsewhere who encourage me to keep sharing. Knowledge should be freely accessible, and I will not paywall any of this content. I appreciate you becoming a paid subscriber or donating a subscription to a friend. This will help me write freely and in-depth, distilling my almost two decades of AI research experience and making it accessible to everyone. Happy Holidays!