On the subjectivity of human-language model interactions and new paradigms for LM evaluation
In conversation with Mina Lee, Ph.D. in Computer Science at Stanford University
Mina Lee just received her Ph.D. at Stanford in computer science and is fascinated by the way language systems augment human capabilities. Across her work, she contributes to the fundamental, often overlooked evaluation infrastructure that shapes research incentives within NLP and HCI communities. In “Evaluating Human-Language Model Interaction,” she proposes a new evaluation framework for language models (LMs) that accounts for human subjectivity and user adaptation over time. In “CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities,” she curates a dataset of interactions between human writers and LMs, allowing designers and researchers to better understand and assess human-LM collaboration. We first met Mina when she joined a working group we run on generative AI and creative expression—this conversation started in our weekly meetings and has finally made it to a written form.
On Researching Language Models
Tell us about your work.
I study language models (LMs), focusing on how humans interact with them and how we can develop and evaluate LMs to augment human capabilities. For me, development involves training and adapting LMs to support different human needs arising in human-LM collaborative writing. I’ve built various writing assistants including an autocomplete system, a contextual thesaurus system, and a general-purpose writing assistant. On the evaluation side, my research aims to capture the complexity of human-LM interaction and to evaluate LMs based on their ability to interact with humans.
In your paper CoAuthor, you curated a dataset of interactions between human writers and LMs, allowing designers and researchers to better understand and assess human-LM collaboration. How did this project come about?
In 2019, I was working on a project that tested a model’s performance on text infilling. Our prototype was initialized with a short story. Each word was in a box, then each sentence was in a bigger box, and finally, the entire document was in one huge box. You could click however many boxes you wanted to replace or infill. When we put that system in front of our lab mates, they had a ton of fun.
We didn't build it with the intention to make a creativity support tool, but the way people used it made me realize, oh, this can actually spark inspiration. Questions and challenges came from this realization. How do we measure fun? How do we measure creativity? Is the model intelligent in suggesting something creative or is this just a glitch people are deriving value from?
Then, a little later, GPT-3 came out and a lot of apps were built on top of it. Personally, I was a bit dissatisfied with claims like “if you use this tool, you'll write 10x faster” or “you'll become super creative” that weren’t backed by concrete evidence. How are these models actually helping people perform different creative tasks? It seemed so anecdotal. I felt it was important to study how people were actually interacting with these models and to identify how to best measure and capture those interactions. So that’s how the project came about.
On Human-Language Model Interaction
Language models tend to be evaluated according to objective metrics based on model output. Your research suggests, however, that people may prefer models which rank higher when evaluated according to subjective user preferences. Can you elaborate on the differences in these modes of model evaluation?
In the standard non-interactive model evaluation, you mostly look at model outputs. For example, in the case of question answering, you feed in a question, the model generates an answer, and you check whether that answer matches a gold standard reference you have (i.e., non-interactive performance).
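As a rough illustration of the non-interactive setup described above, the sketch below scores model answers against gold references with exact-match accuracy. The function names, the normalization rule, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their gold reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(predictions)


# Toy data (hypothetical): one model answer per question, one gold answer per question.
predictions = ["Paris", "4", "blue whale"]
references = ["paris", "5", "Blue  Whale"]

score = exact_match_accuracy(predictions, references)
print(f"Exact-match accuracy: {score:.2f}")
```

Note that this kind of score only sees the final output. It says nothing about how a person arrived at the answer or whether a non-matching answer was still useful.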
In our work on creating human-LM interaction metrics, we highlight several dimensions that are crucial to capturing interactive performance. For instance, we look at the entire interaction process—not just model outputs. For a question-answering task, we could imagine a user comes in and asks the question, but the model doesn't generate the correct answer right away, so the user reformulates the query. In capturing this “interaction trace,” you can evaluate the model on questions like: how many times did the user have to query to find an answer when they were interacting with one model versus another? Was the model output helpful, even if it may not have exactly matched the gold standard reference?
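To make the idea of an interaction trace more concrete, here is a minimal sketch that logs each user query and model response and then compares two models by how many queries users needed and how often responses were rated helpful. The `Turn` and `Trace` classes, the `helpful` and `solved` fields, and the sample traces are all hypothetical; this is one possible way to compute such metrics, not the framework's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Turn:
    query: str      # what the user asked
    response: str   # what the model returned
    helpful: bool   # hypothetical per-turn user rating


@dataclass
class Trace:
    model: str          # which model the user was interacting with
    turns: list[Turn]   # the full sequence of queries and responses
    solved: bool        # whether the user eventually found an answer


def mean_queries_to_solve(traces: list[Trace], model: str) -> Optional[float]:
    """Average number of queries in solved traces for one model (None if no solved traces)."""
    counts = [len(t.turns) for t in traces if t.model == model and t.solved]
    return mean(counts) if counts else None


def helpful_turn_rate(traces: list[Trace], model: str) -> Optional[float]:
    """Share of turns rated helpful for one model, regardless of gold-reference match."""
    ratings = [turn.helpful for t in traces if t.model == model for turn in t.turns]
    return sum(ratings) / len(ratings) if ratings else None


# Hypothetical traces: the user reformulates the query until satisfied or gives up.
traces = [
    Trace("model_a", [Turn("q1", "r1", False), Turn("q1 reworded", "r2", True),
                      Turn("follow-up", "r3", True)], solved=True),
    Trace("model_a", [Turn("q2", "r4", True)], solved=True),
    Trace("model_b", [Turn("q1", "r5", True)], solved=True),
    Trace("model_b", [Turn("q2", "r6", False), Turn("q2 reworded", "r7", False)], solved=False),
]

for name in ("model_a", "model_b"):
    print(name, mean_queries_to_solve(traces, name), helpful_turn_rate(traces, name))
```

Because the whole trace is kept, the same log can support other process-level measures later, such as time spent or how queries changed between attempts.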
Another dimension is evaluation criteria. In the standard benchmarking setting, it's mostly about quality, and less so about human preference. For a chatbot, the most important metrics might include fluency and sensibility. If Model One performs better than Model Two on a majority of these metrics, we assume that Model One is probably better, so we deploy this model to build a chatbot. In our research, however, we observed that users in a social dialogue setting prefer, or like just as much, the model that may generate a slightly less fluent but more specific utterance. So you can imagine that if the latest version of GPT takes a very safe and conservative stance and always generates politically correct answers, while an older version of GPT gives you a more diverse and specific answer, users might prefer the older version in this particular setting.
People’s preferences matter. By just looking at the non-interactive performance, you may not get much improvement in interactive performance.
Public Understanding and Misunderstanding of LMs
You wrote “appropriate interaction design for a new technology requires a deep understanding of capabilities and limitations.” Right now we're seeing the public interacting with these tools in so many different ways. What do you think people might misunderstand?
If you're not familiar with language modeling, I think it's easy not to know that these models generate from left to right. I heard about an instance when someone used ChatGPT to generate an abstract and then asked the model to change a specific part in the middle. The user was surprised that the model basically rewrote the abstract starting from that point. In theory, the model could have changed only the part she intended it to, but it’s really hard to force the model to follow such instructions. Controllability is an inherent issue.
People sometimes don’t realize that these models are stochastic and are surprised when the models generate something different every time. It’s also not always apparent that the model isn’t consistent with itself.
But on the flip side, I was impressed when people figured out special cues to add at the end of a prompt to make it much better, or began naturally using chain-of-thought prompting. There are now many tutorials and communities around this prompt engineering effort for ChatGPT or text-to-image models, and I find the things that people have figured out simply jaw-dropping (e.g., a Midjourney reference sheet). These tips may not be intuitive at first, but when you use them, the models work much better.
Measuring the Lasting Effects of LMs
As you wrap up your Ph.D. here, what are the questions or ideas that you're most excited about exploring in future work?
At a very high level, the question I have is: how will language models change the way we write and communicate? For example, Jakesch et al. (2023) look at how people’s opinions change after interacting with opinionated language models. However, I believe that we don’t fully understand the long-term effects of interacting with these models yet. As I get further in my research, I want to feed what I learn back into language model development and evaluation, and to incentivize positive outcomes by encouraging more NLP researchers to focus on interaction and communication.
Embeddings is an interview series exploring how generative AI is changing the way we create and consume culture. In conversation with AI researchers, media theorists, social scientists, poets, and painters, we’re investigating the long-term impacts of this technology.