On tradeoffs in language models, weirdness of decoding methods, and writing as a communicative act
In conversation with Katy Ilonka Gero, computational poet and incoming human-AI interaction postdoc at Harvard.
Katy Ilonka Gero just finished her Ph.D. in computer science at Columbia with a focus on writing assistance and the impact of generative language technologies on creative writing, journalism, and science writing. She now lives with her dog in Cambridge, MA, where she will soon be starting her position as a postdoctoral researcher at Harvard. Katy has a writing practice herself and currently focuses on computational poetry. She brings her perspective as both a computer science researcher and a writer to her creative and technological practices.
On becoming a writer and computational poet
When did you start thinking of yourself as a writer?
I have always thought of myself as a writer, but computational poetry is much more recent. When I started my Ph.D., I tried to keep my computational work separate from other writing work. I was worried that by dipping my creative practice into computation, I would stop being taken seriously as a writer. But as a creative practitioner, when you're afraid of something, that can be a really rich area for exploration. I started asking myself why I was so afraid to use computation in my creative practice. I realized I was afraid that computational tools would make writing easier, and that’s not what I was looking for. As I've started taking computational writing work more seriously, I've come to appreciate that these tools do not necessarily make creative work easier.
On responsibility for generative language
Do readers deserve transparency when generative and/or computational tools are used in writing? How does the use of these tools influence an artist’s sense of creative ownership and agency?
When you incorporate any kind of generative technology in language production, it raises the question of whether or not you are going to take responsibility for that language. That's really challenging because there's the question of what you want the reader to know and what you think the reader deserves to know about your process.
On one hand, I might want my work to be taken seriously without the computational angle. But on the other hand, I might feel that it is somehow deceptive not to mention the process by which a text is written. Taken to the extreme, if a reader is given a book and told it's written by Ted Chiang, but at the end of it, they learn it’s a story written by ChatGPT, they're going to be mad because someone lied to them.
Today, most people who are taking language models really seriously as a writing tool are using these models collaboratively. They're doing most of the writing—or a huge amount of editing and curation. They retain so much agency and ownership over the final result. In my own experiments using language models, I feel so hyper-responsible for the text that at some point it feels like I wrote all of it. I'm also training models on my own writing. In that situation, it's kind of hard to argue that I didn't write the final text. In many ways, all of these words are mine.
You’ve used the phrase communicative act to describe (human) writing. Tell us more about this idea.
Writers come to writing with purpose and intention. They get their books published because they want someone to read them. Baked into our experience of literature is the idea that someone at some point wanted to share a message with the reader. That's the communicative act. Literature—whether it be stories or poetry or memoirs, or even popular science books—comes from people who have a strong drive to share.
Right now, computers don't have any drive for the communicative act. Language models don't want to do anything. I guess they "want" to maximize their objective function, but even that is kind of complicated. For instance, to decode from a language model, one could use greedy search, beam search, sampling, top-p sampling, and so on. Decisions about which decoding mechanism to use are a way in which a person is attempting to influence what one might consider to be the communicative act of the algorithm.
On Decoding & the Algorithm’s Communicative Act
What is decoding and why are you interested in it?
These days, “language model” is just a general-purpose term for a model that assigns a probability distribution over all possible next words. The decoding method (that is, how the next word is actually selected from that distribution) makes a big difference, especially in creative settings.
There are two common deterministic methods for decoding. One is greedy search, where you choose the most likely word at every step. The other is beam search,1 where you look at the top N possibilities at every step, and keep track of what the top N best paths might have been.2
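To make the difference concrete, here is a minimal Python sketch of the two strategies. The next_token_probs function and its toy vocabulary are stand-ins invented for illustration; a real language model would supply this distribution.

```python
import math

# Hypothetical stand-in for a language model: given the words generated so far,
# return a probability distribution over possible next words.
def next_token_probs(prefix):
    return {"the": 0.4, "whale": 0.3, "falls": 0.2, "<eos>": 0.1}

def greedy_decode(max_steps=5):
    """Greedy search: commit to the single most likely word at every step."""
    words = []
    for _ in range(max_steps):
        probs = next_token_probs(words)
        word = max(probs, key=probs.get)
        if word == "<eos>":
            break
        words.append(word)
    return words

def beam_decode(beam_width=3, max_steps=5):
    """Beam search: keep the N highest-scoring partial sequences at every step."""
    beams = [([], 0.0)]  # (words so far, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for words, score in beams:
            for word, p in next_token_probs(words).items():
                candidates.append((words + [word], score + math.log(p)))
        # Prune to the N most probable paths found so far.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```

With a real model the two can diverge: greedy search commits early and can paint itself into a corner, while beam search can recover a sequence whose first word was not the single most likely one.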
When people use language models as classification tools or to answer factual questions, changing the decoding method affects the way these models perform on benchmark tasks.
That's fascinating and weird, but it’s generally not even reported in a paper or project. It doesn’t seem to get talked about enough.
What do you mean by weird?
For example, beam search is super interesting because it's a heuristic that we came up with that happens to do really well. There's this amazing paper called "If Beam Search is the Answer, What is the Question?" It takes an information-theoretic approach to understanding why beam search works so well for language when there’s no particular reason it should. Why doesn’t choosing the most likely word work better? That's why I think it's weird. Decoding has a big impact on the generative capabilities of a model but doesn't necessarily align with how we experience language production.
Why don’t we hear more about decoding?
In computer science right now, there's the primacy of the model. We talk about ChatGPT or Google's LaMDA, but we don't love to talk about training data and decoding methods. In academia, innovations in collecting training data are not considered nearly as prestigious as innovations in model architecture—even though you might argue that innovations in collecting training data are more important. Instead, we think model innovations are more technical, and as a society, we think technical innovations are more important than "softer innovations" like collecting data.
On model capabilities and limitations
When we optimize for coherence in training language models—that is, prioritizing legible output—other qualities of language and text are lost. Specifically, you’ve written about the tension between coherence and diversity:
“Despite their successes, language models continue to exhibit known problems, such as generic outputs, lack of diversity in their outputs, and factually false or contradictory information” (Sparks: Inspiration for Science Writing using Language Models).
What do you mean by “lack of diversity in their outputs” and why is that a problem to you?
In the case of that paper, we were interested in language models as idea generators. With idea generation, having a bunch of diverse ideas is inherently valuable. But trying to get the model to generate more diverse outputs—that is, outputs that are less likely under its probability distribution—also reduces output coherence.
So by increasing diversity, you get some kookier ideas and some sentences that don't make sense. The model doesn’t distinguish between a semantically weird sentence and a syntactically weird one. That’s great in a lot of situations, but you still want to be able to access the semantically weird sentences with strange ideas.
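One common knob for that tradeoff is the sampling temperature. Below is a minimal Python sketch, using a made-up next-word distribution purely for illustration: raising the temperature flattens the distribution, so the kookier, less likely words get picked more often, at the cost of coherence.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0):
    """Rescale a next-word distribution by temperature, then sample from it.
    Temperature < 1 sharpens the distribution (closer to greedy, more coherent);
    temperature > 1 flattens it (more diverse, less coherent)."""
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    words = list(scaled)
    weights = [scaled[w] / total for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Illustrative next-word distribution a model might assign after some prefix.
probs = {"stone": 0.6, "feather": 0.25, "whale": 0.1, "theorem": 0.05}

print(sample_with_temperature(probs, temperature=0.5))  # usually "stone"
print(sample_with_temperature(probs, temperature=2.0))  # "whale" and "theorem" show up noticeably more often
```

Top-p (nucleus) sampling addresses the same tension from another angle: it samples only from the smallest set of words whose probabilities sum to p, trimming the incoherent tail while keeping some room for surprise.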
What are the implications of this tension for creatives?
From an artistic perspective, smaller models that have been trained on curated and specific training data tend to be more interesting as tools. At the same time, smaller models are less coherent.
How do you think about creativity in your research, especially in relation to machines? Can machines be creative?
This question is hard to answer without being more specific about what kind of creativity you’re talking about. There are a lot of cultural aspects to how communities judge creativity. With computer-generated art, models are capable of producing things that seem creative, but I tend to think they're not capable of discerning which of their outputs are creative. The ability to judge or filter is key. You could imagine the computer writing a hundred short stories. Can the computer pick the three that are worth sending to the literary magazine? Or the one that is award-winning?3
Reading Recs
A computational poem you’ve written?
WHALEFALL. I think this poem does a good job of using computation to give the reader a new experience of a poem, but it is based on a poem that I originally wrote by hand.
Other computational poetry?
Travesty Generator by Lillian-Yvonne Bertram and Articulations by Allison Parrish.
Conventional poetry?
Deaf Republic by Ilya Kaminsky. That's the book that I suggest people read if they don't really read poetry. Everyone always likes it. I think he is a genius.
Writing on computation and the creative practice?
Allison Parrish’s “Language models can only write poetry.” Also, Aaron Hertzmann’s “Can Computers Create Art?” is a great essay about how computation changes art. He really situates it in a history of photography and the moving picture.
For more of Katy’s work, check out her research on an exploratory user study of how novelists use generative language models, building a stylistic machine-generated thesaurus for writers, and creating an interactive system for metaphor generation. Also, read her poetry, which can be found on her website. Katy is currently transitioning from her time as a Ph.D. student at Columbia to postdoctoral research at Harvard. You can follow her journey on Twitter, Mastodon, or her personal site.
Embeddings is an interview series exploring how generative AI is changing the way we create and consume culture. In conversation with AI researchers, media theorists, social scientists, poets, and painters, we’re investigating the long-term impacts of this technology.
Author’s Note: With beam search, instead of choosing only the most probable word, the algorithm considers the top N most probable words as candidates. By keeping track of multiple candidate sequences at each step, the algorithm is able to explore a broader range of possible sequences, which increases the likelihood of finding the most probable sequence. (Written in collaboration with ChatGPT).
Footnote from Katy: Alternatively, models might sample from the probability distribution. In this case, the output is not deterministic, but stochastic. You can get a slightly different output each time. Although OpenAI has not publicly stated which decoding method ChatGPT is using (at least, as far as I know), it seems they are using a stochastic method, given that you can ask ChatGPT to regenerate its response and get something different. Also, Scott Aaronson's description of his idea for watermarking GPT outputs, which he came up with when working at OpenAI, suggests sampling is being used.
Footnote from Katy: “Award-winning” is itself a cultural and contextual notion. A study of literary prizes found that writers “with an elite degree (Ivy League, Stanford, University of Chicago) are nine times more likely to win than those without one. And more specifically, those who attended Harvard are 17 times more likely to win.” They found that half of the prize-winners with an MFA “went to just four schools: [University of] Iowa, Columbia, NYU, or UC Irvine.” Iowa has special clout: its alumni “are 49 times more likely to win compared to writers who earned their MFA at any other program since 2000.”