HKU evaluation shows Chinese AI models struggle with hallucinations
The rapid proliferation of AI models, whether from China or overseas, makes it clear that businesses need help sorting out the good from the bad
Debates are raging around the world about how artificial intelligence should be developed. Some are calling for strengthened guardrails to guarantee the powerful technology is developed safely, while others argue that doing so would risk kneecapping a fast-moving industry.
But that is a false dichotomy, according to Jack Jiang, an innovation and information management professor at the University of Hong Kong (HKU) Business School. He said AI safety and development are two sides of the same coin because AI is only economically valuable if it is reliable – and reliability, in his view, must be proven by third parties.
“We are like auditors,” Jiang said in an interview with the Post. For the past two years, he has evaluated the capabilities of dozens of leading AI models as the director of the HKU Business School’s AI Evaluation Lab, launched just a few months after OpenAI introduced ChatGPT to the world.
At the time, existential angst among Chinese tech giants had kick-started the “hundred models war”, with each company competing to stand out from the crowd while wondering whether China would ever catch up with the US.
That rapid proliferation of international and Chinese AI models made it clear to Jiang that the business community would need help sorting out the good from the bad. “The choice for businesses was no longer a matter of whether or not to use AI, but rather how to use it and how best to use it,” he said.
In September 2023, he launched his lab with a mission to “drive trustworthy innovation and sustainable advancement of generative AI” through model evaluations. The team today is made up of more than 40 members, spread across Hong Kong, Xi’an, Dalian and Oxford in the UK.
AI model hallucinations – outputs that are factually incorrect or misleading – were a problem that Jiang identified early on. While the raw capabilities of models have improved rapidly in recent years, the problem of hallucinations has not gone away.
“Hallucinations directly impact the credibility of AI models in professional settings,” said Jiang. A recent survey of business leaders worldwide conducted by the consultancy Gallagher found that hallucinations were the main factor preventing AI adoption in their firms.
“In key industries like healthcare and finance, even a state-of-the-art model like GPT-5 with a low hallucination rate cannot be left to its own devices,” he said.
Last month, Jiang’s team published its latest evaluation showing that most AI models today still hallucinate, with Chinese models especially struggling. The best-performing Chinese model, ByteDance’s Doubao 1.5 Pro, came in seventh among the 37 international and Chinese models evaluated.
The results, which took almost half a year to produce, surprised the team because Chinese models had previously performed well on reasoning, image generation and general language evaluations. “It was disappointing overall because we tested the models in both English and Chinese,” said Jiang.
Why models hallucinate is a subject of debate among AI researchers. OpenAI recently published a paper blaming existing training procedures for incentivising large language models (LLMs) to make guesses at answers rather than acknowledge uncertainty.
If that is the case, model developers will likely have to consider adjusting the training process or the model architecture itself, according to Li Jiaxin, a PhD student who helped lead the evaluation.
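The incentive problem the paper describes can be shown with a toy expected-score calculation. The sketch below is a simplified illustration under assumed scoring rules, not code from OpenAI’s paper or the HKU evaluation: when a benchmark rewards only correct answers and never penalises confident wrong ones, guessing always scores better on average than admitting uncertainty.

```python
# Illustrative sketch only: a toy expected-score calculation showing why
# accuracy-only grading rewards guessing over saying "I don't know".
# All numbers and scoring rules here are hypothetical assumptions.

def expected_score(p_correct: float, abstain: bool,
                   reward_correct: float = 1.0,
                   penalty_wrong: float = 0.0,
                   reward_abstain: float = 0.0) -> float:
    """Expected score for answering versus abstaining under a given grading scheme."""
    if abstain:
        return reward_abstain
    return p_correct * reward_correct + (1 - p_correct) * penalty_wrong

# A model that is only 20 per cent sure of the answer:
p = 0.2
print(expected_score(p, abstain=False))  # 0.2 -> guessing is still rewarded
print(expected_score(p, abstain=True))   # 0.0 -> abstaining earns nothing

# Add a penalty for confident wrong answers and abstention becomes the better choice:
print(expected_score(p, abstain=False, penalty_wrong=-1.0))  # -0.6
```

Under the first grading scheme, a model trained or evaluated this way is always pushed to answer, however unsure it is, which is the behaviour that surfaces as hallucination.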
The HKU lab is now taking its findings directly to businesses with the goal of providing evidence-based guidance to support AI adoption. One of their partners is a major Chinese bank that has been hesitant to introduce AI customer service because of the risk of hallucinations.
Another potential partner, a Beijing-based short video firm, saw the hallucination findings and immediately contacted Jiang for possible solutions.
The team’s broader strategy moving forward is to take model evaluations out of the lab and into real-world business environments. In Hong Kong, the obvious choice was finance, where the city’s government has called for financial institutions to adopt a “risk-based approach” to AI adoption, from reviewing documents to the act of trading itself.
Nationally, President Xi Jinping has emphasised the need for AI to be developed “safely and reliably”. For Chinese models, competing with US models no longer just means raw capabilities – it also means safety and ethical practices. “It’s obvious Chinese companies must work on their models’ hallucinations moving forward,” said Jiang.
Tech war: Chinese AI models deemed a security risk by new US government report
The evaluation marks the first time the US government has comprehensively assessed DeepSeek’s capabilities in relation to leading US models
Chinese models lag behind their American counterparts in performance, cost, security and adoption, despite their growing global popularity, according to a new report from the US government.
Describing Chinese models as “adversary AI”, the report released on Tuesday by the Centre for AI Standards and Innovation (CAISI) at the National Institute of Standards and Technology (NIST) and the Department of Commerce claimed that models such as those from DeepSeek posed risks to AI developers, consumers and US national security because of their security shortcomings and censorship.
The report comes after US President Donald Trump’s AI Action Plan, released in July, called for the evaluation of frontier Chinese models’ capabilities and alignment with state narratives.
DeepSeek, China’s most high-profile AI company, has come under fire in the US where it has been accused of stealing user data and amplifying Chinese state narratives.
The evaluation conducted by CAISI marks the first time the US government has produced a comprehensive assessment of DeepSeek’s capabilities and popularity in relation to leading US models, including OpenAI’s GPT-5, its open-sourced model gpt-oss and Anthropic’s Claude Opus 4.
According to the report, DeepSeek’s models had lower scores than US models almost across the board on 19 public and internal benchmarks, while also being more vulnerable to being jailbroken by malicious users intent on carrying out hacking and cybercrime activities.
The report also claimed that Chinese state censorship was “built directly into DeepSeek models”, based on a new benchmark CAISI developed in conjunction with the Department of State that tested the models on questions considered politically sensitive to China’s ruling Communist Party.
It found that DeepSeek models were more aligned with Chinese state narratives than the US models, with the most aligned model being DeepSeek’s R1-0528 at 25.7 per cent when prompted in Chinese.
DeepSeek did not respond to a request for comment.
On the other hand, the Hangzhou-based start-up’s “open-weight” models have helped China catch up with the US in the global AI adoption race.
Downloads of DeepSeek models on the developer platform Hugging Face have increased nearly 1,000 per cent since January, while Alibaba Cloud’s Qwen family of models increased 135 per cent, the report found.
Alibaba Cloud is the AI and cloud services unit of Alibaba Group Holding, owner of the Post.
US firms still account for the most downloads across all models on the platform, though Alibaba Cloud is closing in on Meta Platforms, the developer of the Llama family of models and the second most popular model maker of all time, behind only OpenAI.
Commonly referred to as open-source, open-weight models are models whose weights – the variables encoding their “intelligence” – are publicly released, allowing developers to download and build on them.
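For illustration only, the snippet below sketches how a developer might download such published weights from Hugging Face and run them locally using the widely used transformers library; the specific Qwen model ID is an example choice, not one cited in the CAISI report.

```python
# Illustrative sketch of pulling an open-weight model from Hugging Face.
# The model ID below is an example of an open-weight Qwen checkpoint; any
# open-weight model ID would work the same way. Downloading a 7B model
# fetches several gigabytes of weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example open-weight model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # downloads the public weights

inputs = tokenizer("Briefly explain what an open-weight model is.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the weights themselves are on the developer’s machine, they can be fine-tuned or modified to build derivative models, which is what the Hugging Face figures above count.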
Notably, the number of derivative models built on Qwen and shared on Hugging Face exceeded those from Google, Meta, Microsoft and OpenAI combined.
Another key claim in the report was that OpenAI’s GPT-5-mini cost 35 per cent less on average than DeepSeek’s leading model, V3.1, to perform at a similar level, using their respective application programming interface (API) prices as the basis for comparison.
However, the report did not mention that DeepSeek users have the option of deploying the open-weight models locally, which is not the case for users of proprietary US models, who must pay for API access.
DeepSeek has also released newer models in recent weeks, cutting official API prices by over 50 per cent while maintaining similar levels of performance, according to third-party AI benchmarking firm Artificial Analysis.
These updated versions were not included in CAISI’s evaluations, which only looked at DeepSeek’s R1, R1-0528 and V3.1 models.
On social media, US Commerce Secretary Howard Lutnick said his department was helping ensure “continued US leadership in AI” by publishing these findings.
“The report is clear: DeepSeek lags far behind, especially in cyber and software engineering,” Lutnick wrote. “These weaknesses aren’t just technical. They demonstrate why relying on foreign AI is dangerous and shortsighted.”
Launched in 2023 under the Biden administration, CAISI was previously known as the US AI Safety Institute before being renamed after Trump took office with a new focus on promoting US leadership in AI innovation.
The body works directly with US firms such as Anthropic and OpenAI on AI safety and security issues.