As I write this issue, I realize how incredibly lucky and privileged I’ve been working as an academic researcher in AI, teaching the latest techniques (including writing a book), consulting for industry leaders on how to bring AI to their businesses, working on social media misinformation, and building a variety of AI products everywhere from big corporations to startups (mine and others). From that vantage point, I am offering a highly opinionated commentary on piercing the hype (which is itself a kind of misinformation) around few-shot models like GPT-3. My goal in studying this is twofold: 1) raise awareness about potential safety issues, as technology hype is primarily a safety problem, but mostly, 2) develop clarity in thinking about what is actually possible with such models. As we meditate on the true nature of advanced technologies like GPT-3 and their applications to automation, we are forced to examine what we mean by automation itself from the perspective of AI models. We will examine the ways in which models and humans have coexisted in the past and the present, and what that bodes for the future, given how rapidly the AI technology landscape is changing.
Looking at the audience that already follows Page Street Labs and subscribes to this newsletter, I am hoping this will seed more informed conversations around AI-based automation possibilities.
N.B.: This post relies heavily on illustrations, which are embedded images. If you’re reading this over email, it might be best to read it directly on Substack, as some email clients clip long emails and block images.
The Tyranny of Appearances
DailyNous, a philosophy blog, invited a group of researchers to share their opinions about GPT-3 and related questions. Shortly after the essays appeared, someone claimed in a clickbait tweet (GPT-3 clickbait tweets are now a genre of their own, but we will use this one for illustration):
“I asked GPT-3 to write a response to the philosophical essays ... It's quite remarkable!”
Folks on Twitter did not disappoint. The tweet and its enclosed 4-page “Response” were retweeted more than a thousand times, because the clickbait language and the presentation of the Response document probably made people go, “ZOMG! I can’t believe an AI did this.” To aid that, the Response came with a misleading (and ethically dubious) “NOTE” to prime readers into thinking exactly that.
What was not covered in the original tweet or the Response document was the amount of human involvement needed to produce a text of that length and clarity -- multiple generations for each sentence, and the careful picking and choosing that went into composing the final text. A more likely scenario for the cogent text in the Response is illustrated here, which raises an interesting design question of how best to faithfully portray generated content (not a topic of this issue, but worth exploring from a safety/trust point of view).
Raphaël Millière, the author of the “Response”, to his credit, published the details of the production process later, which was shared only a few dozen times, as opposed to the thousand or so shares of the original misleading clickbait. As usual, misinformation flies, and the truth comes limping after it.
Aside: The word misinformation means many things. Withholding some facts to misrepresent something is a kind of misinformation. For a good taxonomy of misinformation, I recommend this article from Claire Wardle, a fact-checker and misinformation researcher.
The Turk’s Gambit
Such sensational overrepresentations of technology are commonplace in the Valley. Many demos in VC pitches are carefully orchestrated wizard-of-oz shows, much like Von Kempelen impressing Maria Theresa’s court with his chess-playing “automaton” — The Turk.
There are accounts (a personal favorite is by Tom Standage) of how Kempelen, and later Mälzel, captivated audiences ranging from peasants to nobility to scholars of the time with the Turk’s abilities, for almost a century across Europe and America, before its limits and inner workings were discovered. The Turk was a marvel of engineering and ingenuity, but more importantly, a storytelling device. It captivated generations to come — e.g., Charles Babbage was impressed by it — and raised questions that weren’t frequently asked before, much like the ones the GPT-3 demos are raising now:
While Kempelen and Mälzel were showmen and some trickery was expected of them, how does one ethically present results for technologies like GPT-3? As we will see, this is not just a question of ethics and attribution, but also a question of AI Safety — i.e., preventing AI models from being used harmfully.
How do we avoid the steep descent into the “trough of disillusionment” that inevitably comes after peak hype and fast-forward our way to the “slope of enlightenment” and the “plateau of productivity”? If we clear the clouds of hype, the resulting clarity will make us ask the right questions about the technology.
AI Model Capability Hype is Fundamentally an AI Safety Issue
Bringing clarity amid model capability hype is useful for identifying true product/business opportunities. But a more critical purpose is to ensure our implementation choices lead to products that are safe for consumers and the world we inhabit.
Safety issues from AI model hype can arise in two different ways. The first is when product builders overstate model capabilities and, either knowingly or unknowingly, set wrong expectations with customers. These are usually self-correcting (unless you’re on a four-year startup exit trajectory), as customers inevitably complain and regulatory bodies step in, but not without significant harm to the company building the product and its customers. Tesla overselling the driver-assist feature of its cars as “Full Self-Driving” in its product promotion materials (and in tweets from Elon Musk himself) is an example.
Customers misled into believing these hyped-up capabilities could potentially endanger themselves and others due to misplaced trust. As AI models become easier to use (as GPT-3’s few-shot examples promise), the folks building with AI models will increasingly not be the AI experts who designed those models. Building appropriate safety valves into the product, and a regulatory framework around its use, becomes critical.
The second way AI models can become unsafe due to hype is customer overreach. People are inherently creative in how they use their tools. Folks using AI models outside of their advertised purposes, for fun or for entrepreneurial reasons, can similarly bring harm.
Good policies, responsible communication practices, regulation, and consumer education are indispensable for creating an environment of safe consumption of AI technologies. Many of these practices are often at odds with short-term gains, but not necessarily with long-term rewards. There is a lot more to say about AI Safety, but in this issue, I will focus on one question: how do we free ourselves from the tyranny of appearances of AI models and truly understand their automation capabilities?
What AI Automation Is, and What It Isn’t
AI automation is not a dualistic experience. One of the dangers of hype over-attributing capabilities to a system is that we lose sight of the fact that automation is a continuum rather than a discrete state. In addition to stoking irrational fears about automation, this kind of thinking also throws out the window any exciting partial-automation possibilities (and products) that lie along the spectrum.
For convenience, we can break up the automation spectrum offered by the deployment of AI models for a task into five ordered levels:
Manual: The human does all the work for the task.
Extend: The model extends/augments the human’s capability for the task.
Offload: The model partially offloads the complexity of the task (more on this later) by automatically solving some of it.
Fallback: The model solves the task entirely most of the time; occasionally it cedes control to a human, either because of the complexity of the task or because the human voluntarily takes over.
Replace: The human becomes irrelevant in solving the task.
I am deriving this categorization from Sheridan and Verplank’s 1978 study on undersea teleoperators, adapted for modern AI models. On the surface, this representation might appear similar to the SAE levels of autonomous driving (those were influenced by the 1978 study as well). Still, the critical difference in this article is the inclusion of the task and the relationship between the model, the application, and the task. The SAE autonomous driving levels, on the other hand, are focused on a fixed task — driving on “publicly accessible roadways (including parking areas and private campuses that permit public access)”. We cannot talk about the automation capabilities of a model in isolation, without the task and its application considered together.
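To make the levels concrete, here is a minimal Python sketch (names and example entries are my own, purely for illustration) of how one might encode the five levels and attach them to a model/application combination rather than to a model alone:

from enum import IntEnum

class AutomationLevel(IntEnum):
    """The five-level automation spectrum described above (illustrative encoding)."""
    MANUAL = 0    # the human does all the work
    EXTEND = 1    # the model augments the human
    OFFLOAD = 2   # the model automatically solves part of the task
    FALLBACK = 3  # the model solves most of it, ceding control occasionally
    REPLACE = 4   # the human becomes irrelevant for the task

# The level belongs to the (model, application) combination, never to the model alone.
levels = {
    ("face-recognition-model", "unlock a phone"): AutomationLevel.REPLACE,
    ("face-recognition-model", "match suspects for policing"): AutomationLevel.EXTEND,
}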
The Interplay of Task and Application Complexity in AI Automation
Traditional software automation is focused on a specific task, and any work related to it is built from scratch. AI model-based automation, on the other hand, is unique in the sense that you can have a model trained on one task — say, face recognition — and used in multiple applications, ranging from unlocking your phone to matching a suspect against a criminal database. Each of those applications has a different tolerance for false positives/false negatives (a.k.a. “risk”). This train-once-use-everywhere pattern is becoming increasingly popular with large-parameter models trained on massive datasets with expensive compute. The pattern is especially common with large-model fine-tuning, and also with recent zero-shot and few-shot models.
While this pattern is cheap and convenient, a lot of the problems in AI deployments result from transferring expectations about a model from its use in one scenario to another, and from being unable to quantify the domain-specific risk correctly. Sometimes, just retraining the model on an application-specific dataset may not be sufficient without making changes to the architecture of the model (“architecture engineering”) to handle dataset-specific nuances.
To illustrate the dependence of the automation level on the task and application, consider, for example, the task of machine translation of natural language texts. Google has one of the largest models for machine translation, so let’s consider that as our model of choice. The translation model, and certainly its API, appears general enough to tempt an unsuspecting user into trying it on her favorite application. Now let’s consider a few application categories where machine translation can be applied — News, Poetry, Movie subtitles, Medical transcripts, and so on. Notice that for the same model and the same task, the automation levels vary widely depending on the application. So it never makes sense to assign an “automation level” to a model, a task, or an application alone, but only to the combination of the model and the application.
The automation level assignments in this figure are approximate and may not reflect Google’s current systems. This example is also overly simplified, as performance on “news” may not be homogeneous either. Translation quality may differ across news domains — e.g., financial news vs. political news — or across languages. Yet this simplification is useful for illustration purposes.
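Reusing the AutomationLevel enum from the earlier sketch, the figure’s message fits in a few lines. The specific level assignments below are illustrative guesses of mine, not measurements of any real system:

translation_capability = {
    "news":                AutomationLevel.FALLBACK,  # mostly usable; humans handle edge cases
    "movie subtitles":     AutomationLevel.OFFLOAD,   # drafts that humans post-edit
    "poetry":              AutomationLevel.EXTEND,    # at best a creative aid
    "medical transcripts": AutomationLevel.MANUAL,    # risk too high to automate
}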
Hype (and the subsequent user dissatisfaction) often happens when folks conflate, knowingly or unknowingly, the automation level offered in one application with that of another: for example, someone claiming that all of humanity’s poetry will be accessible in English based on their experience translating news articles to English.
For an example with GPT-3, consider the success of AI Dungeon, a text adventure game. In AI Dungeon, the outputs of the model can be interpreted creatively in any way you like (i.e., there are very few “wrong” answers from the model, if any). The error margin is effectively infinite, offering near-zero risk in directly deploying the model, modulo some post hoc filtering for toxic/obscene language and the avoidance of sensitive topics. That success does not mean it would make sense to deploy the model unattended, as it stands today, for, say, business applications. And in some cases, like healthcare, it may make sense not to deploy the model at all.
Aside 1: So far, when considering situations where models “fall back” to humans, we haven’t considered the thorny problem of knowing when to fall back. Today’s deep learning models, including GPT-3, are incredibly bad at telling when they are unsure about a prediction, so situations that require reliable fallback to humans cannot take advantage of such models.
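For completeness, here is what the fallback pattern usually looks like in code. The predict-with-confidence method and the human-review queue below are hypothetical stand-ins; as the aside notes, getting a reliable confidence score out of models like GPT-3 is precisely the unsolved part.

def handle_request(x, model, review_queue, threshold=0.9):
    """Route low-confidence predictions to a human (sketch of the Fallback level)."""
    prediction, confidence = model.predict_with_confidence(x)  # hypothetical API
    if confidence >= threshold:
        return prediction              # the model handles it end to end
    return review_queue.ask_human(x)   # cede control to a human reviewer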
Aside 2: Modeling improvements can push a model’s generalization capability across a wide range of applications, but its deployability will still vary widely. In fact, in risk-intolerant applications with very little margin for acceptable error (consider the use of facial recognition for policing), we may choose to never deploy a model. Whereas in other applications, say the use of facial recognition to organize photos, the margin of acceptable error may be wide enough that one might just shrug when the model fails and hope for a better model update in the future.
Edwards, Perrone, and Doyle (2020) explore the idea of assigning automation levels to “language generation”. This is poorly defined because language generation, unlike self-driving, is not a task but a means to accomplish one of the many tasks in NLP, like dialogue, summarization, QA, and so on. For that reason, it does not make sense to assign an automation level to GPT-3’s language generation capabilities without also considering the task in question.
Capability Surfaces, Task Entropy, and Automatability
Another way to view the performance of a model on a task is to consider its Capability Surface. To develop this concept, first consider an arbitrary but fixed ordering of the applications (domains) where a model trained on a task is applied. For each application, plot the automation level the model achieves. Now, consider an imaginary “surface” that connects these points. Let’s call this the capability surface.
AI models rarely have a smooth capability surface. We then define Task Entropy as a measure of the roughness of this capability surface. As the model for a task becomes more sophisticated, trained with increasingly large datasets and compute, the task entropy, for that fixed model, decreases over time. The task entropy is then a measure of the Automatability of a task using that model.
Aside: All this can be laid out more formally. But for this publication, I am taking a “poetic license” and focusing on developing intuitions.
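That said, one rough way to operationalize the intuition (my own choice of roughness measure, not a standard definition) is to treat the surface as the sequence of levels across the fixed ordering of applications and measure the average jump between neighbors:

from statistics import mean

def task_entropy(levels_by_application):
    """Roughness of a capability surface: mean absolute jump between
    adjacent applications in the fixed ordering (0 = perfectly smooth)."""
    surface = [int(level) for level in levels_by_application.values()]
    jumps = [abs(a - b) for a, b in zip(surface, surface[1:])]
    return mean(jumps) if jumps else 0.0

print(task_entropy(translation_capability))  # higher = harder to automate with this model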
Capability Surfaces of Few-shot and Zero-shot models
In traditional AI modeling (supervised or fine-tuned), the task is usually fixed, and the application domains can vary. In zero-shot and few-shot models such as GPT-3, however, not only can the application domains vary, but the tasks can vary too. The tasks solved by a GPT-3-like model may not even be enumerable.
In the case of GPT-3, the task may not even be explicitly defined, except through a list of carefully designed “prompts”. Today, the way to arrive at the “right” prompts is prospecting: querying the model with different prompts until something works. Veteran users may have developed intuitions for how to structure the prompt for a task based on experiential knowledge. Despite this care, the predictions may be unreliable, so carefully understanding the risks inherent to the application and engineering around them is indispensable.
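A sketch of what this prospecting loop looks like in practice is below. The complete() function stands in for whatever completion API you use, and the prompts and acceptance check are made up for illustration.

def prospect(complete, candidate_prompts, is_acceptable):
    """Try prompts one by one until a completion passes the acceptance check."""
    for prompt in candidate_prompts:
        completion = complete(prompt)          # call the model (stand-in function)
        if is_acceptable(completion):
            return prompt, completion          # keep the prompt that "works"
    return None, None                          # nothing worked; rethink the task framing

candidate_prompts = [
    "Translate English to French:\ncheese =>",
    "English: cheese\nFrench:",
]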
Aside: GPT-3 is often touted as a “no code” enabler. This is only partially true. In many real-world problems, such as writing assistance and coding assistance, the amount of boilerplate is so high and the narratives are so predictable that it is reasonable to expect GPT-3 to contextually autocomplete big chunks based on the training data it has seen. This is not necessarily a negative. With bigger models like GPT-3, the Lego blocks we play with have become increasingly sophisticated, but a significant amount of talent and, many times, coding is needed to put together something non-trivial at scale. As Denny Britz points out (personal communication), “[The cost of error when writing code with GPT-3 is kind of high.] If you need to debug and check GPT's code, and modify it, are you really saving much from copy/pasting Stackoverflow code?” Another problem with the generality of GPT-3-based applications is that they tend to cover only the most common paths, while reality has a fat tail of “one-offs”.
Embracing this way of thinking, using capability surfaces and task entropy, allows us to develop a gestalt understanding of a model and foresee its many application possibilities without succumbing to hyped-up demos and misrepresented text completion examples.
Summary
Automation is not an all-or-nothing proposition. An AI model’s automation capability is tightly coupled to the task and the application it is used in. This realization opens up many exciting partial-automation possibilities that can be highly valuable. Studying a model’s Capability Surface and Task Entropy can be critical when applying the model to a task. While the capability surfaces of traditional supervised and fine-tuned models are far from smooth, it only gets worse with few-shot models, where the number of tasks and applications is practically unbounded. Studying the capability surfaces of complex models is essential for piercing through the hype and ensuring safe deployments of those models.
If you would like more such detailed analyses, do not forget to subscribe.
Disclosures: GPT-3 or similar models did not assist in any of this writing. This article mentions multiple entities. Neither the author nor Page Street Labs was incentivized in any way to include them. They appear only because of the discussion I wanted to have.
Acknowledgments: Many thanks to Jen Hao-Yeh, James Cham, Peter Rojas, and Denny Britz for reading/commenting on drafts of this article.
Cite this article:
@misc{clarity:ai-automation-1,
author = {Delip Rao},
title = {GPT-3 Turk’s Gambit and The Question of AI Automation},
howpublished = {\url{https://pagestlabs.com/clarity/ai-automation-1}},
month = {August},
year = {2020}
}