Always a Policy Debater: Building a Verbal Reasoning Benchmark for LLMs

14 minute read

Published:

No, I didn’t use AI on this essay

Large language models (LLMs) have become the new normal. Instead of searching on Google for an almost-there-but-not-quite answer, we now approach ChatGPT for a more complete answer that gets us from start to finish. A typical day-to-day input is “solve this math problem” or “what is wrong with this code snippet”, and the output will cover more bases than Google would. However, there’s one task that ChatGPT still doesn’t perform better than humans at…

If you’ve been a college student within the last two years, I’m sure at least one professor has gone over their AI policy. I recall that in one of my English classes, we spent an entire lecture deciding what we as a class believed AI should and should not be used for. While most people jumped on the no-AI-whatsoever bandwagon, many did agree that AI is very good at catching grammatical errors. This sentiment is likely echoed in many other classrooms as well.

There are also those classmates who want to use ChatGPT to write their entire essay so they don’t have to lift a finger, but they know that if they say so out loud, many pairs of eyes will be side-eyeing them. Instead, they employ a “let’s see what I can get away with” tactic to push for maximal use of AI. They ask, “What if we just use it for brainstorming?” or “How about using it to polish my ideas?” My dear fellow classmate, we all know what you’re up to, but you’re only digging yourself into a deeper hole…

Because AI is really bad at writing. After that class discussion, I took it upon myself to test whether ChatGPT could actually write my essay on 10 Things I Hate About You and The Taming of the Shrew. Needless to say, it could not pull quotes from the text, which already hinders its writing. Although one could argue that ChatGPT now has the power to browse the internet (cough cough, SparkNotes), it doesn’t seem to have the technique to write.

As early as elementary school, we learn that writing a paper involves one quote per two pieces of analysis. The chosen quote should support a point made about the work as a whole, while the analysis explains why the quote supports the thesis we’re making about the work. ChatGPT, on the other hand, really only tackles the surface level of a work. I had an English teacher who hated when students would say something like, “This quote demonstrates this point.” That teacher would hate ChatGPT’s writing. ChatGPT tends to make bold claims that may well be true but aren’t rooted in the quote it just chose.

ChatGPT’s capabilities have definitely improved since I finished that essay in October 2023. Now, with GPT-4o and better prompt engineering than just “write this essay”, I’m sure ChatGPT could cook up something quite impressive.

All this raises the question…

Are LLMs capable of critical thinking?

Well, nobody really knows. Nobody has measured their critical thinking capabilities, nor has anyone come up with a benchmark for it. OpenAI has released its simple evaluations, which include benchmarks such as math problems, reading comprehension, and coding questions. But how exactly can critical thinking be measured?

A couple of days ago, I was sitting at the Starbucks I’m sitting at right now, wracking my brain to come up with a startup idea just so I never have to work for other people again. The only problem at hand: actually coming up with one. Every time I turn around, I find another group of people I’m somewhat associated with thriving with a startup that just employs LLMs to replace the humans doing that thing. Like, tell me why I know not one but TWO groups of people who have quit college to build their really original and unique AI interviewer. These all-the-same ideas win hackathons, fellowships, and success in general; it’s just an early-bird-gets-the-worm situation.

I thought to myself, “What is something I do that can be combined with LLMs?” CS interview prep? Nah, clearly this is already being done. Piano? Didn’t quite make sense. Ping pong? This obviously does not involve language of any sort.

A language task. What about… debate?

A week or so ago, I met up with my high school friend who currently does college policy debate, and I asked her what she thought of ChatGPT. She mentioned a couple of flaws with ChatGPT, mainly how it fails to serve the general purposes of debate.

Even though I quit high school policy debate roughly four years ago, my complete respect for the activity hasn’t waned a bit. I have always helped out on the side so the debaters could succeed, and I encouraged my sister to keep at it. Debate has a special place in my heart, mainly because the people who have done debate are some of the best people I’ve met.

If there’s one commonality among debaters, it’s their ability to think critically, quickly and in depth. These people spend seven weeks of their summer, six days a week, at debate camp, absorbed in a subject, training to speak with substance at 400 words per minute. Debate emphasizes warrants and clash: how evidence can be used to either support or counter an argument. Because it’s so intense, debate also emphasizes teamwork, a division of labor toward the common goal of winning for the entire school. There’s no way two people can win a debate without the entire team behind them.

Each year, debaters become experts on the year’s topic. Not only do they argue their own case, they also have to negate other people’s cases. There are six preliminary rounds at each tournament, and each debater attends roughly 15 tournaments every year. Each argument becomes incredibly refined from the first week of debate camp to the finals of the Tournament of Champions.

Explain it to me so I understand

Debate is one of the best datasets I can think of to serve as a benchmark for critical thinking and verbal reasoning. Most people are unfamiliar with the nuances of debate, like the guy at the mall who asked, “Is it the one that actually does thinking or the one that talks fast?” For this project, you don’t actually need to know how to debate, just how each aspect of debate serves our goal of creating this benchmark.

1. Methodology

The main reason I’ve decided to go with a benchmark instead of training an LLM directly on debate is that debate is a rather corner-case example of verbal reasoning. Creating the benchmark should come first, so that we can measure an LLM’s current capability. Those unfamiliar with debate might ask: can debate actually serve as a benchmark for LLMs? My intuition is that it can and should. Consider how you would evaluate the quality of a paper: you might submit it to a conference or turn it in to a professor for a grade. In reality, it would be infeasible to submit papers produced by LLMs to these places and work out what made them accepted or rejected. In debate, on the other hand, that kind of judgment is the norm. Who wins and who loses, and what makes an argument win or lose, is what debate is all about. You have a ground truth in debate that you might not have elsewhere.

After creating this benchmark, it’s worth exploring whether fine-tuning an LLM on debate would enhance its critical thinking capabilities in general or merely make it good at debate. At the end of the day, the biggest limitation is that debate is a rather closed space. It doesn’t cover every subject that exists on the internet; it focuses on a specific topic each year. Can debate improve the critical thinking skills of LLMs across the board, and should we expect it to?

2. Data

The data is going to be the debates themselves. YouTube has a ton of round recordings from both college- and high-school-level debates. Not enough are publicly available, but I’m sure there are many unlisted videos we could ask for. Another idea is to extract arguments from the debates: instead of each debate being a datapoint, each argument could be a standalone datapoint.

Since we’re going to be using a large language model, we’ll have to get a transcript of each debate to feed to our model. This is actually quite difficult. A couple of methods I’m considering: use yt-dlp to extract the mp3 file, then either pull the subtitles if the video has them or run the mp3 through OpenAI’s Whisper to convert the audio to text. We might also consider fine-tuning Whisper to better capture the fast pace of debates. The first speech of every debate is essentially open source and can be found on openCaselist.
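To make the transcription step concrete, here is a minimal sketch of what that pipeline might look like, assuming yt-dlp and the openai-whisper package are installed; the URL, file names, and model size are placeholders, not final choices.

```python
# Sketch of the transcription pipeline: download the audio with yt-dlp,
# then transcribe it with Whisper. URLs and paths are placeholders.
import subprocess
import whisper

def transcribe_round(youtube_url: str, out_path: str = "round.mp3") -> str:
    # Download just the audio track as an mp3 using yt-dlp.
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", out_path, youtube_url],
        check=True,
    )
    # The base model is fast; a larger (or fine-tuned) model would likely
    # handle 400-words-per-minute spreading better.
    model = whisper.load_model("base")
    result = model.transcribe(out_path)
    return result["text"]

if __name__ == "__main__":
    text = transcribe_round("https://www.youtube.com/watch?v=PLACEHOLDER")
    print(text[:500])
```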

3. Evaluation

To determine the quality of each argument, we can use Tabroom to gather whether each debate was a win or a loss. That isn’t a lot of information, but it’s the only openly available information. A better idea is to get a copy of the Reason for Decision (RFD). While these can be found on Tabroom, they are only available to the participants of the round; they may or may not be included in the round recordings. RFDs are pretty self-explanatory: they are the reason a judge decided to vote a certain way given the arguments in the round. The judges should be highly qualified for this to be a valid evaluation of the round. To gauge a judge’s qualifications, we can check their judging record on Tabroom to see whether they typically judge elimination rounds at higher-level tournaments.
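As a rough illustration of how these labels might be organized, here is a sketch of a per-round record plus a toy judge-qualification heuristic. Every field name and the elim-round threshold are my own assumptions; Tabroom has no public API that I know of, so the counts would have to be gathered by hand or scraped.

```python
# Sketch of how each evaluated round might be stored. Field names and the
# qualification heuristic are assumptions, not anything Tabroom provides.
from dataclasses import dataclass

@dataclass
class RoundRecord:
    transcript: str          # full round transcript, speech by speech
    winner: str              # "aff" or "neg", from the Tabroom result
    rfd: str | None          # Reason for Decision, if available
    judge_name: str
    judge_paradigm: str      # paradigm text copied from Tabroom
    elim_rounds_judged: int  # counted from the judge's record
    total_rounds_judged: int

def is_qualified(record: RoundRecord, min_elim_ratio: float = 0.2) -> bool:
    """Toy heuristic: a judge who regularly judges elimination rounds is
    treated as qualified. The 0.2 threshold is arbitrary."""
    if record.total_rounds_judged == 0:
        return False
    return record.elim_rounds_judged / record.total_rounds_judged >= min_elim_ratio
```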

On the same note, judges typically have their own preferences for how arguments should be debated. These are laid out in a judge’s paradigm, found on the same page as their judging record. The paradigm should be translated into a weight to account for bias in how an argument is evaluated.
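One hedged way to turn a paradigm into a weight is simply to ask an LLM how receptive the paradigm is to a given style of argument. The prompt wording and the 0-to-1 scale below are purely my assumptions, not a settled method.

```python
# Sketch: convert a judge's paradigm into a 0-1 weight for a given argument
# style by asking an LLM. The prompt and scale are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def paradigm_weight(paradigm: str, argument_style: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "On a scale from 0 to 1, how receptive is a judge with this "
                f"paradigm to {argument_style} arguments? Reply with only a number.\n\n"
                f"Paradigm:\n{paradigm}"
            ),
        }],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.5  # fall back to a neutral weight if the reply isn't numeric
```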

4. System

The chain of events I’m currently imagining looks something like this: first, we provide the LLM with the paradigm of a judge who judged that round. Second, we feed the debate to the LLM speech by speech. For each speech, we ask the LLM to do what debaters know as flowing; in other words, we prompt the LLM to take notes on the essential points made in each argument. Third, we prompt the LLM to evaluate the round given its flows and write up an RFD. The LLM will be said to have successfully determined the outcome of the round if its write-up semantically matches the real judge’s RFD; otherwise, it will be counted as failing to evaluate the round. This should be done for every judge of the round, and multiple times per judge.
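Here is a rough sketch of that loop using the OpenAI API. The prompts, the model names, and the embedding-similarity threshold standing in for “semantically matches” are all assumptions rather than final design decisions.

```python
# Sketch of the round-evaluation loop: flow each speech, write an RFD,
# then compare it to the real RFD with embedding similarity.
# Prompts, models, and the 0.8 threshold are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def evaluate_round(paradigm: str, speeches: list[str], real_rfd: str) -> bool:
    flows = []
    for speech in speeches:
        # Step 2: "flow" each speech, i.e. note the essential arguments made.
        flows.append(ask("You are flowing a policy debate speech. "
                         f"List the essential arguments made:\n\n{speech}"))
    # Step 3: write an RFD from the flows, through the lens of the paradigm.
    model_rfd = ask(
        f"Judge paradigm:\n{paradigm}\n\nFlows:\n" + "\n\n".join(flows) +
        "\n\nDecide the round and write a Reason for Decision."
    )
    # Compare the model's RFD to the real judge's RFD via cosine similarity.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=[model_rfd, real_rfd]
    )
    a, b = (np.array(e.embedding) for e in emb.data)
    similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= 0.8  # arbitrary cutoff for "semantically matches"
```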

We can also create a self-reflection feedback loop. This idea involves three parties: ChatGPT, the user, and a “judge”. The user prompts ChatGPT to find the best argument to counter an argument they provide. ChatGPT then produces an answer, but that answer does not go directly to the user. It first goes through the “judge”, who evaluates it and gives ChatGPT feedback on how to revise it, such as arguments it failed to consider or contradictions among its own arguments. This loops until the “judge” is finally satisfied with ChatGPT’s answer or some set limit is exceeded. The “judge” will also be an LLM, either another ChatGPT or an LLM fine-tuned on debate. This is a next step beyond the benchmark, and one that demands more resources if it requires fine-tuning.
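A minimal version of that loop might look like the following. Both roles are ordinary GPT-4o calls here (no fine-tuning), and the satisfaction check, iteration cap, and prompts are placeholder assumptions.

```python
# Sketch of the debater/judge self-reflection loop. Both roles are plain
# GPT-4o calls; prompts and the stopping rule are assumptions.
from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def refine_counterargument(user_argument: str, max_iters: int = 5) -> str:
    answer = chat(f"Give the best argument to counter this:\n\n{user_argument}")
    for _ in range(max_iters):
        # The "judge" critiques the answer instead of passing it to the user.
        feedback = chat(
            "You are a policy debate judge. Critique this counterargument: "
            "point out arguments it failed to consider or internal "
            f"contradictions. Reply 'SATISFIED' if it needs no changes.\n\n{answer}"
        )
        if "SATISFIED" in feedback:
            break  # the judge is happy, stop looping
        # Otherwise, ChatGPT revises its answer using the judge's feedback.
        answer = chat(
            "Revise your counterargument using this feedback.\n\n"
            f"Feedback: {feedback}\n\nCounterargument: {answer}"
        )
    return answer
```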

5. Limitations

This task is incredibly time-consuming and costly. Again, there are not that many rounds publicly available on YouTube, and the auto-generated transcriptions from both YouTube subtitles and Whisper are unreliable. I transcribed a two-hour round using Whisper, and it took about 15 minutes (which is by no means bad), but the result was surprisingly inaccurate. I would also like the formatting of the transcript to follow the flow of arguments, with paragraph breaks that make sense, if you will. The transcript also needs to identify the tag (thesis statement) of each card (piece of evidence), its author, and its source. These can be found on openCaselist, but it will take real effort to build a tool that automates the matching; doing it manually would be too costly.
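For the card-tagging problem, a first pass might be a simple heuristic parser over openCaselist documents, since cards conventionally follow a tag / citation / body layout. The regex and the assumed layout below are guesses at that convention, not a guaranteed format.

```python
# Sketch of a heuristic card parser. It assumes a common (but not guaranteed)
# layout: a tag line, then a citation line like "Smith '23 (New York Times)",
# then the card body. Real documents will need far more robust handling.
import re

CITE_PATTERN = re.compile(r"^(?P<author>[A-Z][\w.\- ]+)\s+'?(?P<year>\d{2,4})\b")

def parse_card(block: str) -> dict:
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    if len(lines) < 3:
        return {}  # not enough lines to be a tag + cite + body
    tag, cite, body = lines[0], lines[1], " ".join(lines[2:])
    match = CITE_PATTERN.match(cite)
    return {
        "tag": tag,
        "author": match.group("author") if match else None,
        "year": match.group("year") if match else None,
        "source": cite,  # keep the full citation line as the source string
        "body": body,
    }
```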

The second limitation is getting the debates themselves. I know people in debate, but I have to ask around and gauge whether people would be interested in donating their rounds. There needs to be an incentive for them to do so, such as access to the tool we’d be building, or a monetary incentive if this project really takes off.

Dreaming of glory and hoping for riches

All in all, I am so incredibly passionate about this project even though it seems wildly infeasible. I love that I can turn an idea from my own life into a high-tech project. Some people I’ve talked to just say “seems interesting” or don’t seem to believe in the project’s potential. I already feel like I’m losing hope before I’ve even started. But I just need one person to tell me they think this project has potential, and all the excitement will come back.

Please, please, please let me know if you’re interested in embarking on this project with me. The only requirement is that you’re a CS or related major. Bonus points if you know a thing or two about debate.

Also looking for an advisor who will give us a grant to access certain technologies.