How to Annotate Data for LLM Applications

AF: The last example that you gave is that in a fully automated case, you let two bots carry out a conversation. Then later you're going to go check it and be like, Oh my God, 70 percent of it is completely useless.

The alternative is to sit down a bunch of experts and say cancel all your meetings for the next week. This is what you're going to do for the whole week, right?

Then there is something in between, a human-in-the-loop type of annotation scenario: you can have a bot-to-bot system, except that there is a human sitting there who can read the generated answers and click thumbs up or thumbs down. Essentially the human and the machine are collaborating to create that data set.

There are a lot of interesting nuances around even how to build the UX for that annotation.
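
A minimal sketch, in Python, of the in-between setup described above: two bots generate a conversation, and a human records a thumbs up or thumbs down on each generated answer before it enters the data set. The generate_reply stand-in and the record format are assumptions for illustration; in a real system the review step would live in a proper annotation UI rather than a terminal prompt.

```python
import json

def generate_reply(conversation):
    # Stand-in for the bot being evaluated; in a real setup this would
    # call an LLM API with the conversation so far.
    return "Canned reply to: " + conversation[-1]["content"]

def simulate_turns(seed_prompt, num_turns=3):
    """Let two bots carry out a short conversation with each other."""
    conversation = [{"role": "user", "content": seed_prompt}]
    for _ in range(num_turns):
        answer = generate_reply(conversation)
        conversation.append({"role": "assistant", "content": answer})
        # The second bot plays the user and produces the next question.
        follow_up = generate_reply(conversation)
        conversation.append({"role": "user", "content": follow_up})
    return conversation

def review_conversation(conversation):
    """Human-in-the-loop pass: show each generated answer and record a
    thumbs up or thumbs down next to it."""
    labeled = []
    for turn in conversation:
        if turn["role"] != "assistant":
            continue
        print(turn["content"])
        verdict = input("thumbs up (u) / thumbs down (d)? ").strip().lower()
        labeled.append({"answer": turn["content"], "thumbs_up": verdict == "u"})
    return labeled

if __name__ == "__main__":
    convo = simulate_turns("How do I reset my router?")
    with open("annotations.jsonl", "a") as f:
        for row in review_conversation(convo):
            f.write(json.dumps(row) + "\n")
```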

DL: It's a great point. At least in the past couple of years, I've seen a lot of data annotation tooling pop up, open source and proprietary. It's a very worthwhile investment. Based on some of the things we were talking about, there are different parameters to take into account. How do you make sure the person is fully engaged, actually provides good responses, and doesn't just smash buttons? How do you tie it to a business outcome, say that these are the top 20 questions that come in through the platform, and quantify that there might be 50 different variations of those 20 pathways?
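
One rough way to catch the button-smashing problem mentioned above is to seed the annotation queue with a few items whose correct label is already known and to watch how long each item takes. A small sketch, with made-up field names and thresholds:

```python
from statistics import mean

def annotator_report(records, gold_labels, min_seconds=3.0, min_gold_accuracy=0.8):
    """Rough engagement check for a single annotator.

    records: list of dicts with item_id, label, seconds_spent.
    gold_labels: item_id -> known-correct label for seeded check items.
    The thresholds are placeholders, not recommendations.
    """
    avg_seconds = mean(r["seconds_spent"] for r in records)
    gold_hits = [r["label"] == gold_labels[r["item_id"]]
                 for r in records if r["item_id"] in gold_labels]
    gold_accuracy = mean(gold_hits) if gold_hits else None
    return {
        "avg_seconds_per_item": round(avg_seconds, 1),
        "gold_accuracy": gold_accuracy,
        "suspiciously_fast": avg_seconds < min_seconds,
        "failing_gold_checks": gold_accuracy is not None and gold_accuracy < min_gold_accuracy,
    }

# Example: an annotator who answers in about a second and misses the seeded item.
records = [
    {"item_id": "q1", "label": True, "seconds_spent": 1.2},
    {"item_id": "q2", "label": True, "seconds_spent": 0.9},
    {"item_id": "q3", "label": True, "seconds_spent": 1.1},
]
print(annotator_report(records, gold_labels={"q2": False}))
```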

I'm sure large companies have built something out to do this kind of task. But the challenge for a smaller company is that what we build internally is useful for us; when we try to present it to customers, the level of rigor has to be much higher.

AF: I definitely agree with your point that the annotation tool can be, in a lot of cases, a completely separate product. One idea is building some sort of teacher model into an existing user experience. Let's say normally 90 percent of what is going on in the back end is hidden from the end user. But you flip a switch and it becomes overly verbose. The task of the annotator is to read and verify that verbose information.
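
A small sketch of that flip-a-switch idea: the same pipeline call either hides or surfaces its intermediate steps, and in annotation mode the verbose trace is exactly what the annotator verifies. The stages and field names here are assumptions for illustration, not anyone's actual product.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    answer: str
    # Intermediate work that the end user normally never sees.
    trace: list = field(default_factory=list)

def answer_question(question, annotation_mode=False):
    trace = []
    # Hypothetical stand-ins for the hidden back-end stages
    # (retrieval, query rewriting, draft reasoning).
    trace.append(("retrieved_docs", ["doc_12", "doc_47"]))
    trace.append(("rewritten_query", question.lower().rstrip("?")))
    trace.append(("draft_reasoning", "User wants step-by-step reset instructions."))
    answer = "Hold the reset button for ten seconds, then wait for the light to blink."
    # Normal users get only the answer; in annotation mode the verbose
    # trace is surfaced so a human can verify each intermediate step.
    return PipelineResult(answer=answer, trace=trace if annotation_mode else [])

result = answer_question("How do I reset my router?", annotation_mode=True)
for stage, value in result.trace:
    print(f"[verify] {stage}: {value}")
print("[answer]", result.answer)
```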

DL: You start getting into chain-of-thought verification rather than just outcomes. It again goes back to the challenge of human preference. Even looking at our own customers, there are trade-offs in the model, and the user has two conflicting perspectives on a response. On one dimension the conversational aspect is better, and that's very subjective. On the other, the multilingual translation isn't as good, right? So now you have to reconcile these different axes to figure out what is actually "good". Is that a thumbs up or a thumbs down? Going back to the brand, the voice, all these things that companies spend billions of dollars of PR on, I don't know how we continue to evolve that human feedback perspective.
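
One simple way to make that reconciliation explicit, sketched here with made-up axes and weights, is to collect a rating per dimension instead of a single thumbs up or thumbs down and then combine them with stated weights, so the trade-off is visible rather than buried in one binary label:

```python
def weighted_score(ratings, weights):
    """Combine per-axis ratings (1 to 5) into one number so the
    trade-off between axes is explicit instead of hidden inside a
    single thumbs up / thumbs down."""
    total = sum(weights.values())
    return sum(ratings[axis] * w for axis, w in weights.items()) / total

# Axes, ratings, and weights are illustrative, not from the conversation.
ratings = {"conversational_tone": 5, "translation_quality": 2, "brand_voice": 4}
weights = {"conversational_tone": 0.3, "translation_quality": 0.5, "brand_voice": 0.2}

score = weighted_score(ratings, weights)
print("score:", round(score, 2), "thumbs_up:", score >= 3.5)  # threshold is arbitrary
```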

There's another recent paper about human preference in English in North America versus other countries: completely different preferences, completely different cultural norms. Once you start getting into that level of granularity, you get this quantum entanglement or chaos that's really hard to break.