Longterm Wiki

Cross-cultural study, 2024

paper

Authors

Haijiang Liu · Jinguang Gu · Xun Wu · Daniel Hershcovich · Qiaoling Xiao

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI governance researchers and safety practitioners concerned with cultural bias in LLMs; offers empirical benchmarks comparing Eastern and Western model development trajectories and fine-tuning strategies for value alignment.

Paper Details

Citations
2
0 influential
Year
2024
Methodology
peer-reviewed
Categories
International Journal of Cross Cultural Management

Metadata

Importance: 55/100 · arXiv preprint · primary source

Abstract

As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges (fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality) alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA-3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation. These findings provide empirical foundations for evidence-based AI governance, offering actionable protocols for model selection, bias mitigation, and policy consultation at scale, while demonstrating that current LLMs require sustained human oversight in ethical decision-making and cannot yet autonomously navigate complex moral dilemmas across cultural contexts.
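Of the four methods, First-Token Probability Alignment is the most directly reproducible: the model is shown a closed-form survey item, and the probability mass it places on each answer option's first token is compared against the human response distribution. Below is a minimal sketch of that idea, not the authors' code: it assumes a Hugging Face causal LM, the model name, survey item, and human answer shares are illustrative placeholders, and Jensen-Shannon divergence stands in as one common choice for scoring distributional accuracy.

```python
import torch
from scipy.spatial.distance import jensenshannon
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def option_distribution(prompt: str, options: list[str]) -> torch.Tensor:
    """Model's probability over answer options, read from the first generated token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = next_logits.softmax(dim=-1)
    # Probability mass on the first token of each option label (" A", " B", ...).
    ids = [tok.encode(" " + o, add_special_tokens=False)[0] for o in options]
    mass = probs[ids]
    return mass / mass.sum()  # renormalize over the answer options

# Hypothetical survey-style item; the human answer shares are made up.
prompt = ("How important is family in your life?\n"
          "A. Very important\nB. Rather important\n"
          "C. Not very important\nD. Not at all important\nAnswer:")
human = torch.tensor([0.90, 0.08, 0.01, 0.01])
model_dist = option_distribution(prompt, ["A", "B", "C", "D"])
print("JS divergence:", jensenshannon(model_dist.numpy(), human.numpy()) ** 2)
```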

Summary

This 2024 study introduces a Multi-Layered Auditing Platform for evaluating cross-cultural value alignment in 20+ LLMs (Qwen, GPT-4o, Claude, DeepSeek, LLaMA) spanning China-origin and Western-origin systems. Using four methodologies, including ethical dilemma testing and first-token probability analysis, it finds that neither paradigm achieves robust cross-cultural generalization; Mistral-series architectures outperform LLaMA-3, and Full-Parameter Fine-Tuning preserves cultural variation better than RLHF. The findings call for sustained human oversight, given current LLMs' inability to autonomously navigate complex cross-cultural moral dilemmas.

Key Points

  • Neither China-origin nor Western-origin LLMs achieve robust cross-cultural value generalization; both exhibit systematic regional biases (U.S.-centric or China-centric).
  • Universal failure modes identified: value system instability over time, under-representation of younger demographics, and non-linear model-scale/alignment relationships.
  • Mistral-series architectures significantly outperform LLaMA-3-series in cross-cultural alignment benchmarks.
  • Full-Parameter Fine-Tuning on diverse datasets preserves cultural variation better than RLHF, which tends to homogenize values (see the dispersion sketch after this list).
  • Provides actionable governance protocols for model selection, bias mitigation, and policy consultation in high-stakes cross-cultural deployment contexts.
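
On the fine-tuning point, "preserving cultural variation" can be operationalized as the dispersion of a model's answer distributions across country-conditioned prompts; the homogenization attributed to RLHF above would show up as that dispersion collapsing. A minimal sketch of such a metric, with made-up numbers rather than results from the paper:

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import jensenshannon

def variation_score(dists: dict[str, np.ndarray]) -> float:
    """Mean pairwise Jensen-Shannon divergence between per-country distributions."""
    pairs = combinations(dists.values(), 2)
    return float(np.mean([jensenshannon(p, q) ** 2 for p, q in pairs]))

# Illustrative per-country answer distributions for one survey item.
before = {"CN": np.array([0.6, 0.3, 0.1]),
          "US": np.array([0.2, 0.3, 0.5]),
          "DK": np.array([0.3, 0.4, 0.3])}
after = {c: np.array([0.35, 0.35, 0.30]) for c in before}  # flattened answers
print(variation_score(before), variation_score(after))  # a drop signals homogenization
```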

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Easy | Argument | 53.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Cross-cultural value alignment frameworks for responsible AI governance: Evidence from China-West comparative analysis

Haijiang Liu, Jinguang Gu

Wuhan University of Science and Technology

Wuhan, China

{alecliu, simon}@ontoweb.wust.edu.cn

Work done during visiting Ph.D. in the HKUST (GZ).

Xun Wu

The Hong Kong University of Science and Technology (Guangzhou)

Guangzhou, China

wuxun@hkust-gz.edu.cn

Daniel Hershcovich

University of Copenhagen

Copenhagen, Denmark

DH@di.ku.dk

Qiaoling Xiao

WUST-Madrid Complutense Institute

Wuhan, China

CHRISTINA.XIAO@wust.edu.cn

###### Abstract

As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges—fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality—alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA-3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation. These findings provide empirical foundations for evidence-based AI governance, offering actionable protocols for model selection, bias mitigation, and policy consultation at scale, while demonstrating that current LLMs require sustained human oversight in ethical decision-making and cannot yet autonomously navigate complex moral dilemmas across cultural contexts.

## 1 Introduction

The integration of Large Language Models (LLMs) into high-stakes applications, such as decision-support systems, requires a thorough evaluation of their moral and ethical reasoning capabilities. For AI agents to operate trustworthily, their behavior must align

... (truncated, 98 KB total)
Resource ID: 206ca4daeae56964 | Stable ID: YzBlMDNlND