An Argument for Modeling Consolidation
Or What Happens if Attention (and Transformer) Is All We Need?
✋ This post uses images, so if you are reading this in an email client, enable “display images from sender” to get the most out of the post. If you are a new subscriber, welcome! If you find these writings valuable, consider sharing them with a colleague or giving a shoutout on Twitter.
After rebooting, I am still figuring out the right cadence and format for this newsletter. I am now experimenting with a weekly cadence on Wednesdays and a three-parter format (reply to send feedback):
1) A “Learn” section that includes tutorials, posts, and other resources to pick up new skills and ideas in AI.
2) The “Water Cooler”, where I have exciting finds from the ML/DL world, including products, businesses, fun gigs, and new libraries.
3) This week’s Story section focuses on a future where Transformers become all that we need. What would that future look like? If you are a builder, a product person, or an investor, you might want to pay attention to this.
One of the hardest things for models to do is to tell when they are not sure. To learn more about uncertainty prediction in time series, check out this new 50-page tutorial on conformal prediction (with code samples!).
The term “Language Model” is becoming ubiquitous, but few realize that language modeling research is almost 50 years old. Where does this term come from? See this thread where I try to hunt its origins.
Stable Diffusion has captivated the imagination of researchers and artists alike. Here’s a 4-hour CVPR tutorial and a recent (Sep. 2022) survey to get up to speed with generative modeling using Diffusion.
Karpathy is teaching a fun class on language modeling, starting all the way from n-gram models. I recommend this class.
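For a feel of the conformal prediction idea mentioned above before diving into the tutorial, here is a minimal sketch of the simplest variant, split conformal prediction. It is not taken from the tutorial: the function names and the toy numbers are hypothetical, and the method assumes calibration and test points are exchangeable, which plain time series often violate (this is part of why time-series-specific variants exist).

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals.

    Uses the finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to support this coverage level.
        return math.inf
    return sorted(residuals)[rank - 1]


def predict_interval(point_forecast, q):
    """Symmetric interval around a point forecast with half-width q."""
    return (point_forecast - q, point_forecast + q)


# Hypothetical held-out calibration set: true values and model forecasts.
calib_truth = [10.0, 12.0, 11.5, 13.0, 12.5, 14.0, 13.5, 15.0, 14.5]
calib_preds = [10.5, 11.0, 11.0, 13.5, 12.0, 13.0, 14.0, 15.5, 14.0]
residuals = [abs(y - p) for y, p in zip(calib_truth, calib_preds)]

# Target roughly 80% coverage (alpha = 0.2) for a new forecast of 16.0.
q = conformal_quantile(residuals, alpha=0.2)
low, high = predict_interval(16.0, q)
print(f"half-width={q}, interval=({low}, {high})")
```

With this toy data the half-width comes out to 1.0, so the interval is (15.0, 17.0). The appeal is that the model itself is treated as a black box: the "I am not sure" signal comes entirely from how wrong the model was on held-out data.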
🥛 Water Cooler
Audio-LM from Google demonstrates zero-shot generation for speech and audio. As someone who has worked on low-resource speech synthesis, I am blown away listening to the samples generated by Audio-LM from just 3s of speech. For applications besides speech synthesis, consider a user supplying an initial melody, which could be unoriginal, and Audio-LM generating an original continuation. This could ease many copyright issues for the growing creator ecosystem on content platforms such as YouTube, Spotify, and TikTok.
Replit launches AI Ghostwriter, an AI pair programmer built on Salesforce’s CodeGen, which appears to be a contender to GitHub’s Copilot. Ghostwriter offers a chat-based UX, which is becoming common in language model-based applications. Despite protests from hardline coders, AI-assisted development is vital, as there is growing evidence of its utility in improving programmer productivity. For example, see these posts from Google and GitHub. GitHub’s Copilot is available for free to students, faculty, and open-source developers, and Replit Ghostwriter is an invite-only beta at this time.
“Transformer Engines” are coming to the Hopper GPUs from NVIDIA, but they’re just FP8- and FP16-accelerated matmuls. There’s not much else that’s transformer-specific going on there. While many have commented that this is yet another example of a “hardware lottery”, I am glad that NVIDIA is riding the hardware lottery. The hardware lottery is a great-sounding concept that’s detached from the practical and economic realities of hardware supply chains.
[R]esource lotteries are inevitable, and we are also better served by focusing on answering interesting questions posed by current realities than an imagined future. In trying to create a uniform exploration of idea spaces divorced from economic/practical realities (to “avoid the hardware lottery”), we would be missing out on exciting research opportunities by shunting works simply because they don’t fit our current understanding of how the human brain works or is capable of.
🍿 What if Transformer is all we need?
Even if you don’t believe in that question, let’s consider it for a moment.
There are reasons to even consider it. In less than five years, no AI application area has been left untouched by the transformer in particular, and attention-based models in general. The fire that started in NLP has moved beyond and burned into computer vision, speech, computational biology, reinforcement learning, and robotics, decimating all competition in its path.
Andrej Karpathy made an astute observation about the consolidation of ML methods and architectures that goes something like this (my paraphrase, for brevity):
Since 2012, Autodiff-driven backpropagation has more or less replaced a zoo of machine learning methods. This led to a zoo of DL architectures, and since 2017 most of these alternative architectures have been subsumed by one architecture — The Transformer.
In my ~20 years of experience with AI, I haven’t experienced this kind of consolidation and performance. I can almost hear some of the readers revolting against my use of “performance” here. Yes, the transformer models still have some way to go. But the good news is that we have turned, in many cases, problems that were impossible into engineering problems where we know how to measure, optimize, and iterate. So I expect every aspect of the transformer modeling pipeline to get optimized within the next 2-3 years.
The “hardware lottery” camp might view this as an adverse event, but here’s the deal: Most researchers are not builders. Researchers are incentivized to explore and, thus, have an exploration bias. Builders, on the other hand, have an exploitation bias. For further criticism of the hardware lottery paper, see my post on riding the hardware lottery.
That said, the search for a grand unified architecture that can model all data is still on (Graph Neural Networks or GNNs have made some strides here), and I am not claiming transformers are it. But instead, I am considering the hypothetical of “what if” transformers, particularly very large versions of them, happen to provide this unification. How will that future unfold? If you are a builder, where should you focus the most? If you are an investor, how will you write your thesis for AI companies, for software companies, for any company?
1) We will see a dramatic spread of research ideas not just from one area of AI to another but from one area of science to another. Say a grad student in an astronomy lab comes up with a new way of doing position embedding in transformers for large datasets and puts that on arXiv. The next day, a biologist can read and understand that paper (because we are now speaking the same language) and implement it in her work. This kind of unification will not just catalyze the spread of ideas; it will catalyze progress in all AI-driven fields, which currently means all of STEM and the arts. We will witness progress for humanity matched only by meta-inventions like steam engines and electricity.
2) Consistent high performance across multiple modalities and unified representations will make the transformer the model of choice for combining data from different modalities, leading to a proliferation of multimodal applications and putting AR and VR front and center.
3) Consistent OODA (observe, orient, decide, act) optimizations on different parts of the transformer pipeline mean standard and exchangeable interfaces and more competition between providers, leading to ever-decreasing costs for end users.
4) An exciting category of cost is the ease of use. This consolidation will enable simpler interfaces that can bring diverse people with ideas to participate in building software applications for everyone. Companies building infrastructure that enables this very early on will see winner-take-all effects.
5) A surprising effect of this consolidation and transformer performance becoming “too good to ignore” is that companies and research teams will be forced to work on explaining model predictions. Hence, there will be a massive uptick in model explainability work. The ground zero companies for model explainability are yet to be built. Explainability is one area I am excited about and will undoubtedly be the subject of future issues.
6) As the cost and the direct involvement of builders with modeling go down, we can anticipate more investments in pre- and post-modeling parts of the pipeline, which are typically domain-specific. Perhaps consolidation might be what we need to get engineers out of their modeling sandboxes and back to product and customer needs.
7) Security, bias, and other issues around transformer models will be inevitable as their proliferation continues (this is true for all modeling). However, consolidation means the pressure to fix problems as they arise will also increase, resulting in increased investment in the number of researchers and engineers working on those problems.
Of course, the future never unfolds linearly, as presented here. Considering the history of technology and engineering, consolidation of ideas is a given. I am not saying Transformers will be the future of modeling (some claim they already are), but if there ever was a candidate for that, transformers certainly would be one. Also, consolidation doesn’t mean other ideas will die out; just that resources for ideas will be distributed according to a power law. But it’s interesting to start thinking about the second- and third-order effects of modeling consolidation and prepare accordingly.
Thanks for reading AI Research & Strategy! Subscribe for free to receive new posts and support my work.