This repository supports the paper:
"Language Does Not Equal Cognition: Uniform Neural Patterns in Role-Conditioned Medical LLMs"
Large language models (LLMs) have gained significant traction in medical decision support, particularly for medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical student, resident, attending physician) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test the framework on three medical QA datasets, employing neuron ablation and representation analysis to assess changes in reasoning pathways. Our results show that role prompts do not significantly enhance the medical reasoning abilities of LLMs; they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity of real-world medical practice. This highlights the limitations of role-playing in medical AI and the need for models that simulate genuine cognitive processes rather than linguistic imitation.
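The divergence analysis in the pipeline (Exp2 below) compares model output distributions across role prompts via Jensen-Shannon divergence. A minimal, standard-library sketch over hypothetical answer-option distributions (the actual scripts operate on real model token probabilities):

```python
import math

def kl_div(p, q):
    """KL divergence in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

# Hypothetical answer-option distributions under two role prompts.
p_student = [0.70, 0.20, 0.05, 0.05]
p_attending = [0.68, 0.22, 0.05, 0.05]
print(jsd(p_student, p_attending))  # near zero: the role prompt barely shifts the distribution
```

A JSD near zero across roles is the signature reported in the paper: role conditioning leaves the underlying answer distribution essentially unchanged.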
.
├── dataset/                # (You provide) test sets from MedQA, MedMCQA, MMLU-Med
├── NeuronAnalysis/         # Core experimental pipeline
│   ├── data_loader.py
│   ├── model_utils.py
│   ├── prompt_effects.py
│   ├── utils.py
│   ├── exp1_P.py / exp1_sig.py         # Exp1: QA accuracy comparison
│   ├── exp2_P.py / exp2_sig.py         # Exp2: JSD divergence analysis
│   ├── exp3_P.py / exp3_sig.py         # Exp3: CKA/PCA for hierarchy perception
│   ├── exp4.1_P.py / exp4.1_sig.py     # Exp4.1: Role-specific neuron masking
│   ├── exp5cross_P.py / exp5cross_sig.py
│   └── exp5cross.py                    # Shared logic for Exp5
├── Bloomclarify.py         # Bloom-level QA classification module
├── Myutils.py              # I/O, formatting, and shared utilities
├── requirements.txt        # Python environment requirements
└── README.md               # This file
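The CKA comparison in Exp3 can be sketched with plain NumPy. This is an illustrative linear CKA over hypothetical activation matrices, not the repo's actual implementation:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two (examples, neurons) activation matrices."""
    X = X - X.mean(axis=0)  # center each neuron's activations across examples
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Hypothetical hidden states for the same questions under two role prompts.
rng = np.random.default_rng(0)
acts_student = rng.normal(size=(100, 64))
acts_attending = acts_student + 0.01 * rng.normal(size=(100, 64))  # nearly identical
print(linear_cka(acts_student, acts_attending))  # close to 1.0: representations align
```

CKA near 1.0 between role-conditioned representations is what "no cognitive differentiation across roles" looks like at the representation level.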
Ensure Python 3.10+ is available, then install the dependencies:
pip install -r requirements.txt
This project uses instruction-tuned LLMs (mainly the Qwen2.5 series). You may need:
Local weights for Qwen2.5-7B/14B/32B/72B-Instruct
Optionally: API access to GPT-4o and DeepSeek-R1
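Role conditioning in PBRP amounts to a system-prompt swap before the question is posed. A hypothetical sketch (these role templates are invented for illustration; the repo's actual prompts may differ):

```python
# Hypothetical role templates for PBRP; illustrative only.
ROLE_TEMPLATES = {
    "medical_student": "You are a medical student. Answer the following question.",
    "resident": "You are a resident physician. Answer the following question.",
    "attending": "You are an attending physician. Answer the following question.",
}

def build_messages(role: str, question: str) -> list[dict]:
    """Build a chat-format message list conditioning the model on a clinical role."""
    return [
        {"role": "system", "content": ROLE_TEMPLATES[role]},
        {"role": "user", "content": question},
    ]
```

Message lists in this chat format can be fed to an instruction-tuned model (e.g. via the tokenizer's `apply_chat_template` in Hugging Face Transformers) or sent directly to a chat-completion API.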
Each experiment has a main logic file (*_P.py) and a significance-test file (*_sig.py). For example:
cd NeuronAnalysis
python exp1_P.py # Run accuracy evaluation
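The neuron-masking step in Exp4.1 reduces to zeroing selected hidden units and re-running inference. A minimal NumPy sketch, assuming the "role-specific" neuron indices have already been identified (the function name is illustrative, not from the repo):

```python
import numpy as np

def ablate_neurons(hidden: np.ndarray, neuron_idx) -> np.ndarray:
    """Return a copy of a (tokens, hidden_dim) activation matrix with the given neurons zeroed."""
    masked = hidden.copy()
    masked[:, list(neuron_idx)] = 0.0
    return masked

# If ablating the neurons most activated by a role prompt leaves answers unchanged,
# the role prompt did not recruit a distinct reasoning pathway.
hidden = np.ones((4, 8))
print(ablate_neurons(hidden, [2, 5]).sum())  # 24.0: two of eight columns zeroed
```

In practice this masking would be applied inside a forward hook on the model's MLP layers rather than on a standalone matrix.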