Technology has evolved at a breakneck pace over the course of human history, with most advances aimed at taking labor off the shoulders of skilled workers. The printing press shortened the time it took to produce a book and turned out copies so reliably that a surge of literacy followed. The internal combustion engine streamlined transportation, manufacturing, and industry, replacing manual labor with machines. Even the internet, though it’s made certain types of communication more complex, simplified worldwide logistics and sharing while digitizing much of the old paper-filled world. Our world is constantly changing with the advent of new technologies built to make tasks easier or businesses more money. The one thing technology has yet to replicate, though, is creativity.
The kind of human ingenuity that brought us those technological innovations in the first place isn’t something that can be automated. The same goes for any creative process, whether it be writing, visual art, or music. But, as our ability to store massive amounts of digital data grows, so does the capitalist need to automate the creative jobs that business owners in America have long undervalued.
It wasn’t long after I started at my copywriting gig that concerns about AI language generators like ChatGPT began to crop up in Slack channels. Alarming at first, the fervor has since grown to the point that I’ve had to talk some of my friends, both writers and artists, down from the ledge. While I do believe AI models like Midjourney and ChatGPT are damaging to the creative industry as a whole, I don’t think it’s time for every artist to throw in the towel. There are plenty of issues that both software engineers and lawyers will have to resolve before any of these programs are ready to replace an entire workforce. Let’s take a hard look at AI, specifically AI writing, and why it’s not likely to take our jobs– yet.
Know Your Enemy
Navigating the fear around new technology starts by defining it. Ever since its introduction to the wider market, language-generating software like ChatGPT has been called “AI”, or “artificial intelligence”, by the media. This may be shorthand for the technological innovations happening within the code or a reference to the company behind ChatGPT, OpenAI (probably a bit of Column A, a bit of Column B). One thing is for certain: programs like ChatGPT are not AI. These language generators aren’t exactly HAL 9000 or a Replicant from Blade Runner. There is no sentience or simulation of human thought happening between the ones and zeros. Instead, a more accurate term for these programs is “Large Language Models”, or “LLMs” for short.
Large Language Models work by collating a massive set of text files– anything from books to news articles to social media posts– and analyzing each word, phrase, and sentence. The model labels each piece of language with a statistic built from how often that word or phrase appears alongside similar words or when addressing certain topics. After building a large enough statistical database, it can begin attempting to output sentences that, in theory, should make sense. At these beginning stages, they rarely do.
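To make the “statistics” idea concrete, here’s a deliberately tiny sketch of the same principle in Python: count which words tend to follow which, then use those counts to guess the next word. The three-sentence corpus and the function names are invented for the example, and real LLMs use neural networks with billions of parameters rather than literal lookup tables– treat this as an illustration of the concept, not of how ChatGPT actually works under the hood.

```python
from collections import defaultdict, Counter
import random

# A deliberately tiny "training set" standing in for the internet-sized
# corpora real LLMs are built on.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1

def next_word(word):
    """Pick a likely next word based purely on observed frequencies."""
    options = follows.get(word)
    if not options:
        return None
    choices, counts = zip(*options.items())
    return random.choices(choices, weights=counts)[0]

# Generate a short "sentence" by repeatedly asking for the next word.
text = ["the"]
for _ in range(5):
    nxt = next_word(text[-1])
    if nxt is None:
        break
    text.append(nxt)
print(" ".join(text))  # e.g. "the cat sat on the mat"
```

With three sentences of input, the output barely holds together. The leap LLMs make is running this same basic idea over trillions of words with far more sophisticated math, until the guesses start to read like human writing.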
While Large Language Models can guess at what a human sounds like from these statistics, their first attempts at generating coherent sentences are mostly gibberish. It’s the equivalent of raising a person in an isolated room with nothing but a dictionary for entertainment and hoping they’ll come out in twenty years able to hold a conversation. Before Large Language Models can take prompts and respond with human-like clarity, they need to be taught when their output is right, when it’s wrong, and when it’s inappropriate. To learn that, these models need the help of those they seek to emulate: humans.
The Human Touch
All LLMs go through an intense period of human-led moderation before release. In this quality-control step, the LLM will be asked to generate answers to prompts, and human readers will rate the responses for clarity, grammar, and informational accuracy. Each rating will update the LLM’s statistical data, allowing it to make better guesses for future prompts. The more ratings supplied by human moderators, the closer an LLM comes to “sounding” human.
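As a rough sketch of that feedback loop– and only a sketch, since the real process (often called reinforcement learning from human feedback) updates the model’s internal weights rather than a simple tally– imagine the model keeping a running score for each candidate answer, nudged up or down by every human rating. The prompts, scores, and function names below are invented for illustration.

```python
from collections import defaultdict

# Running preference scores for candidate answers to one prompt.
# In a real system the ratings adjust model weights; here it's just a number.
scores = defaultdict(float)

candidates = {
    "A": "Paris is the capital of France.",
    "B": "France capital city Paris is being.",
}

def record_rating(candidate_id, rating):
    """A human moderator rates a response from 1 (bad) to 5 (good)."""
    # Center the rating so bad answers push the score down.
    scores[candidate_id] += rating - 3

# A handful of moderator ratings.
for cid, rating in [("A", 5), ("A", 4), ("B", 2), ("B", 1)]:
    record_rating(cid, rating)

# The model now "prefers" the answer humans rated higher.
best = max(candidates, key=lambda cid: scores[cid])
print(best, candidates[best])  # A Paris is the capital of France.
```

Multiply that by millions of ratings and you get a model that statistically leans toward the kinds of answers people approved of– which is exactly why this stage of human labor matters so much.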
The biggest issue with this moderation is scale. As the name suggests, Large Language Models use a massive amount of data to form their statistics– in many instances, the programs use a snapshot of the entire internet. Every news article, every social media post, every Tweet, every blog post (like this one) is scanned and used as a data point. The idea is that the more data the LLM has, the more naturally the generated text will read. The problem, as anyone who has spent enough time on the internet can tell you, is that there is some nasty stuff hiding in the dark corners of the web. So, beyond checking the LLM’s answers for accuracy before release, human moderators have another awful job: making sure the program doesn’t spit the nasty stuff back out.
In a widely read article from Time, investigative journalists found that OpenAI hired Sama, a Silicon Valley company that outsources its labor to workers in Kenya, Uganda, and India, for this type of moderation. The piece focuses on the workers in Kenya who, for less than $2 an hour, were required to read graphic depictions of violence, incest, rape, and death to properly label the content as “inappropriate” for the LLM. Limited mental health resources were available for workers, and what was available was often inaccessible due to productivity quotas. The contract ended eight months early when, in a bid to expand into digital image creation, OpenAI sent similarly disturbing image content for labeling, allegedly including child pornography. It’s a stunning look into the human misery these kinds of technologies require to be marketable– and that’s only in the time before the LLM’s release to the public.
Moderation can’t stop even after an LLM goes public. To stay up-to-date with the latest information and world news events, these programs need to be regularly updated with new data sets or new snapshots of the internet. While the initial heavy moderation may help with filtering, there’s no way to catch it all. However, the inclusion of disturbing, illegal, or offensive content is only a symptom of Large Language Models’ biggest flaw.
Too Big to Fail
It’s hard to imagine how big the entire internet is. It’s like asking someone to picture the size of the galaxy– the scale is simply beyond what our brains can hold. A data set that large, and one that continues to expand with each update, isn’t just impossible to imagine; it’s impossible to manage. The incomprehensible size of the data set used for an LLM is the software’s Achilles heel, leaving it vulnerable to model collapse and data poisoning.
As LLMs like ChatGPT are used more frequently by businesses, spammers, and low-tier news sites, LLM-generated content will begin to flood the internet. These articles, even if they read as “human”, often have a slight off-ness to them, maintaining a vague, circular voice that avoids saying too much so it won’t be wrong. Still, many of these LLM-generated pieces can pass as poor, generic writing. But with each new snapshot companies take of the internet, a smaller share of the captured content will have been written by humans and a larger share by these LLMs, with their weird, off-putting syntax. Those articles get folded into the LLM’s new data set, skewing the statistics away from actual human writers and toward other LLM-generated writing. Like an ouroboros, software like ChatGPT will start sounding less and less human and more and more fake with each update. This self-destruction from oversaturation is called “model collapse”. Given the size of the internet, this may take a few years of widespread use to play out. In the meantime, though, there’s data poisoning.
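You can watch a miniature version of this happen with the same toy word-statistics model from earlier: train it on a few human-written sentences, generate some output, retrain on that output, and repeat. Any word pairing that doesn’t happen to get sampled in a given round disappears from the model for good, so the range of sentences it can produce only ever shrinks. This is a cartoon of model collapse built on made-up data, not a simulation of a real LLM, but the direction of travel is the same.

```python
from collections import defaultdict, Counter
import random

def train(lines):
    """Build word-pair frequencies from a list of sentences."""
    follows = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def generate(follows, start="the", length=6):
    """Sample a short sentence from the learned frequencies."""
    text = [start]
    for _ in range(length):
        options = follows.get(text[-1])
        if not options:
            break
        words, counts = zip(*options.items())
        text.append(random.choices(words, weights=counts)[0])
    return " ".join(text)

human_text = [
    "the cat sat on the mat",
    "the dog chased the cat down the street",
    "the mouse hid under the rug near the door",
]

model = train(human_text)
for generation in range(5):
    diversity = sum(len(c) for c in model.values())  # distinct word pairs left
    print(f"generation {generation}: {diversity} distinct word pairs")
    # The next "snapshot of the internet" is just the model's own output.
    synthetic_text = [generate(model) for _ in range(3)]
    model = train(synthetic_text)
```

Each run prints a shrinking (or at best flat) count of distinct word pairs– the toy equivalent of a model’s voice narrowing with every update.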
Data poisoning is a fancy phrase used in tech settings for something that, in everyday parlance, we might just call “misinformation”. The current political climate, in conjunction with a declining level of reading comprehension, makes the internet a breeding ground for misinformation. Without constant oversight, which we’ve established is practically impossible for LLMs, each updated data set will likely leave an LLM with more incorrect information in its banks than before, chipping away at its ability to dispense accurate information. But misinformation isn’t only folks on the internet proudly making declarations without factual basis– data poisoning can be intentional, too.
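Here’s a crude illustration of why that matters, using the same “count what the internet says” logic as the earlier sketches. The toy below answers a question by majority vote over its training documents; flood the corpus with enough fabricated posts and the “fact” it reports flips. The documents and the lookup logic are invented for the example– real poisoning attacks on LLMs are subtler, but the principle is the same.

```python
from collections import Counter

def most_reported_answer(documents, question_key):
    """Answer by majority vote over whatever the corpus claims."""
    claims = [doc[question_key] for doc in documents if question_key in doc]
    return Counter(claims).most_common(1)[0][0]

# A small "snapshot of the internet" with mostly accurate posts.
corpus = [
    {"capital_of_australia": "Canberra"},
    {"capital_of_australia": "Canberra"},
    {"capital_of_australia": "Canberra"},
]
print(most_reported_answer(corpus, "capital_of_australia"))  # Canberra

# An attacker (or a well-meaning crowd of wrong people) floods the
# next snapshot with a false claim.
corpus += [{"capital_of_australia": "Sydney"} for _ in range(10)]
print(most_reported_answer(corpus, "capital_of_australia"))  # Sydney
```

No single poisoned post matters much; it’s the volume that does the damage, which is exactly what makes the next example work.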
As an example, a group of fans of the long-running online game World of Warcraft used the social media site Reddit to stage a directed data-poisoning attack. They made several excited posts about a fictional new character, “Glorbo”, and the new items, skill sets, and lore this character would supposedly bring to the game– and, sure enough, a data-scraping LLM bot for a low-tier news outlet generated an article about the made-up character, even quoting one user who said, “Honestly, this new feature makes me so happy! I just really want some major bot operated news websites to publish an article about this.” While that was a light-hearted way to expose LLM-generated content and how misinformation spreads, data poisoning can be far more sinister: just ask Microsoft about their 2016 Twitter chatbot, Tay, which was radicalized into spouting Nazi rhetoric within 24 hours.
Who Gave You the Copyright?
Throughout this breakdown of “AI” writing and its technical pitfalls, I’ve avoided one very large legal issue with the way this technology operates. Across the infinite landscape of the internet, there is a large selection of copyrighted work, and LLMs don’t have a way to differentiate a copyrighted work from a public domain work when scraping data. Every book that’s free on the internet, every short story published in an online magazine, every blog post from a small-time blogger (hi!) is input into these LLMs. This opens up an entirely new legal can of worms: Can the companies who build these LLMs use copyrighted works to train their software? And what rights do copyright holders have when contesting this usage?
There are currently multiple lawsuits pending, both from authors against ChatGPT’s OpenAI and from artists against Stable Diffusion’s Stability AI. Until these and numerous other lawsuits work their way through the courts, neither the companies behind this generative software nor the copyright holders have answers as to who owns the right to do what, or whether the output itself can hold a copyright. This compounds another problem “AI” parent companies are facing: monetization. Currently, ChatGPT is on the path to bankruptcy, as no one has quite figured out how to get users to pay for LLM-generated content. With legal battles looming, a lack of funding, and increasing government interest in regulating the industry, generative content software may die a slow, penniless death before it ever falls to its flaws. That said, we shouldn’t ignore that while these first LLM companies may fail, more may rise in their place. If you’d like to defend copyright holders and the idea of copyright in general, the US Copyright Office is currently taking public comment on copyright law and policy issues regarding “AI”-generated content. The comment period is open through October 18th, so take this opportunity to help the creators in your life and have your voice heard.
It may be hubris talking, but with all these issues combined, I don’t see ChatGPT taking my job any time soon. Instead of throwing in the towel, it’s time to use it to wipe the sweat off our brows, fight for creatives’ rights over our own work, and force-feed the snake its own tail. Until next time, friends, happy sipping and happy reading– so long as whatever you’re reading was written by a human and not a Large Language Model. 😉