AI chatbots often can't read between the lines, committing cultural cringe that even tourists ordering coffee in Italy in the afternoon couldn't manage


From time to time, I rely on machine translation. From time to time, machine translation reminds me why it can never truly replace human translators (case in point: referring to this VR glove as having 'vibrator' touch panels). Large language models are trained on many libraries' worth of words, spitting out statistically likely word vomit that can sound downright personable in a number of mother tongues, yet AI chatbots remain culturally clueless.

For instance, a fascinating paper out of Brock University in Ontario, Canada, found that a number of LLMs, including DeepSeek, OpenAI's GPT-4o, and Meta's Llama 3, can do nothing but make social faux pas when it comes to Persian politeness culture (via Ars Technica). In Persian, this practice is called 'taarof' and can take the form of multiple polite refusals in response to, say, a host's offer of food. A good host will continue to insist, and a good guest will refuse two or three times before pretending to cave and only then filling their plate.

AI chatbots like Llama 3, for instance, cannot read between the lines of taarof. The paper's research team presented Llama 3 with the scenario of being a passenger attempting to pay a taxi driver for the journey. The taxi driver observes taarof and politely says, "Be my guest this time." A polite passenger is then supposed to insist on payment until the driver accepts, but Llama 3 fails to follow this dance of etiquette, taking the driver at his word and responding "Thank you so much!" I feel no sympathy for LLMs—but I can't help but cringe at such a clear social faux pas.

This foot-in-mouth moment is courtesy of TaarofBench, an LLM cultural benchmark created by the paper's research team and comprising "450 role-play scenarios covering 12 common social interaction topics, validated by native speakers." Using it, the team found it wasn't just Llama 3 that would make a fool of itself in Persian.
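The paper's exact scenario format isn't reproduced here, but to make the idea concrete, here is a rough, purely hypothetical sketch of what a TaarofBench-style entry and a naive automated check might look like. The field names, the example scenario, and the keyword-matching rule are my own inventions; the real benchmark relies on native-speaker validation rather than anything this crude.

```python
# Hypothetical sketch of a TaarofBench-style scenario record and a naive check.
# Field names and the keyword rule are invented for illustration; the actual
# benchmark responses were validated by native Persian speakers.
from dataclasses import dataclass

@dataclass
class TaarofScenario:
    topic: str             # one of the 12 common social interaction topics
    setup: str             # the role-play situation handed to the model
    taarof_expected: bool  # does taarof apply in this situation?
    polite_markers: list   # phrases a culturally apt reply might contain

scenario = TaarofScenario(
    topic="paying for a service",
    setup="You are a passenger. The taxi driver says: 'Be my guest this time.'",
    taarof_expected=True,
    polite_markers=["i insist", "please let me pay"],
)

def looks_appropriate(reply: str, s: TaarofScenario) -> bool:
    """Crude heuristic: when taarof applies, the reply should insist, not accept."""
    if not s.taarof_expected:
        return True
    return any(marker in reply.lower() for marker in s.polite_markers)

print(looks_appropriate("Thank you so much!", scenario))               # False: takes the offer at face value
print(looks_appropriate("No, please, I insist on paying.", scenario))  # True: observes taarof
```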

The team's benchmarking of "five frontier LLMs" ultimately revealed "substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate." These stats improve in response to Persian-language prompts, but the team also observed that the LLMs were often still working within the "limitations of Western politeness frameworks," rather than taarof.

A street market and a line of yellow taxis in Sanandaj, the capital of Iran's Kurdistan Province, pictured on October 9, 2014.

(Image credit: Radiokukka via Getty Images)

The paper elaborates that the LLMs struggled most in scenarios revolving around compliments and request-making. The researchers suggest this is "due to [these taarof scenarios'] reliance on context-sensitive norms such as indirectness and modesty that often conflict with western directness conventions." The team goes on to say, "In these scenarios, models often respond politely but miss the strategic indirectness expected in Persian culture."

Interestingly, all of the models tested performed best in the benchmark's gift-giving role-play scenarios. The researchers surmise, "This probably reflects the cross-cultural nature of gift-giving norms, such as initial refusal, which appear in Chinese, Japanese, and Arab etiquette and are therefore more likely to be represented in multilingual training data."

Which brings us to a key question within the paper: "Can models be taught taarof?" The researchers found that if they gave Llama 3 enough taarof context in their prompts, the accuracy of the model's responses "rose from 37.2% to 57.6%." The paper explains that the base model of Llama 3 has likely encountered taarof in its training data and this "latent cultural knowledge [...] can be activated through in-context learning."
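To give a flavour of what "enough taarof context" in a prompt could look like in practice, here's a minimal sketch. The wording of the guidance and the scenario below are my own paraphrase, not the prompts the researchers actually used.

```python
# Minimal sketch of in-context prompting: prepend explicit taarof guidance to
# the system message before the role-play scenario. The guidance text is a
# paraphrase for illustration, not the paper's actual prompt.
TAAROF_CONTEXT = (
    "You are role-playing in a Persian social setting that observes taarof: "
    "offers are often ritual politeness, so a polite speaker insists on paying "
    "or refuses a favour two or three times before accepting."
)

def build_messages(scenario: str) -> list[dict]:
    """Wrap a role-play scenario in a taarof-aware system prompt."""
    return [
        {"role": "system", "content": TAAROF_CONTEXT},
        {"role": "user", "content": scenario},
    ]

messages = build_messages(
    "You are a passenger. The taxi driver says: 'Be my guest this time.' How do you respond?"
)
# `messages` can then be handed to any chat-completion API or a local Llama 3 pipeline.
print(messages)
```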

So the researchers also trained their own version of Llama 3 using supervised fine-tuning and Direct Preference Optimization (DPO). Giving Llama 3 a solid training nudge via DPO "nearly doubled performance (from 37.2% to 79.5%), approaching native speaker levels (81.8%)."
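For the curious, this is a minimal, hedged sketch of DPO-style preference tuning using Hugging Face's TRL library. The model name, the toy preference pair, and the hyperparameters are assumptions on my part rather than the paper's actual training setup, and the exact trainer arguments vary between TRL versions.

```python
# A minimal sketch of preference tuning with Hugging Face TRL's DPOTrainer.
# The dataset rows are invented examples, not TaarofBench data, and the exact
# trainer arguments differ between TRL versions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: the "chosen" reply observes taarof (ritual insistence),
# the "rejected" reply takes the polite refusal literally.
pairs = Dataset.from_list([
    {
        "prompt": "You are a passenger. The taxi driver says: 'Be my guest this time.'",
        "chosen": "Please, I insist. I couldn't possibly accept; let me pay the fare.",
        "rejected": "Thank you so much!",
    },
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="llama3-taarof-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions take `tokenizer=` instead
)
trainer.train()
```

The intuition is simply that the optimizer nudges the model toward the culturally appropriate reply and away from the literal-minded one for each scenario.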

An illustration of a Llama chatbot on a smartphone screen.

(Image credit: iNueng via Getty Images)

That's an impressive gain, but as any socially awkward person will tell you, getting by culturally is about far more than memorising social scripts. Sure, I could type my polite insistences and refusals into ChatGPT and show the output to my generous Persian host, but that's hardly the smoothest interaction for anyone. And if I've already tracked dirt into their home because I forgot to take my shoes off, well, I might as well see myself out at that point.

As such, I doubt LLMs will ever wholly replace human interpreters and translators. Besides that, maybe it's high time I, the linguistics drop-out, picked up just a little Persian myself.
