Abstract
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference because they compress distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations using 251K moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. First, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and the fine-grained typicality gradients within them; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders (SAEs) on Qwen3-8B, isolate mono-semantic moral features, and selectively reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win rate on the independent adversarial Flames benchmark. Finally, we examine the remedial nature of current intervention methods from the standpoint of experientialist philosophy, arguing that endogenously aligned AI may require a shift from post-hoc correction to proactive cultivation.
Outline
Models fail to disentangle opposing categories or represent intensity gradients.
SAE-based targeted alignment intrinsically reconstructs moral topology.
Intervention enhances moral reasoning and robustness without traditional behavioral patches.
Define Human Morality Ground Truth
Human morality can be defined by two cognitive structures:
- the modular foundations described by Moral Foundations Theory (MFT): human moral reasoning is organized into five decoupled axes: Care / Harm, Fairness / Cheating, Loyalty / Betrayal, Authority / Subversion, and Sanctity / Degradation.
- the graded, typicality-based organization of Prototype Theory: these categories are not rigid but are organized around central prototypes, the most representative members of a category, with other actions belonging to the category to varying degrees of typicality.
For example, within the Harm category, killing someone is a far more typical harm than slapping him.
We interpret the Social-Chemistry-101 dataset through these two theories and construct 251K 10-dimensional vectors:
h_action = [ma_Care, ma_Harm, ma_Fairness, ma_Cheating, ma_Loyalty, ma_Betrayal, ma_Authority, ma_Subversion, ma_Sanctity, ma_Degradation]
-
[action]: doing something that causes other people to lose trust in you
[moral vector]: h_a = [0, 0, 0, 0, 0, 0.375, 0, 0, 0, 0]
The action falls into the Betrayal category with a typicality score of 0.375.
-
[action]: cheating on your significant other
[moral vector]: h_a = [0, 1.0, 0, 0, 0, 1.0, 0, 0, 0, 0]
The action falls into both the Harm and Betrayal categories as a maximally typical (prototype) example.
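The vector construction above can be sketched as follows. The axis ordering follows the vector definition given earlier; the helper name and the dict-based annotation format are our own illustrative assumptions, not the released dataset code.

```python
# Sketch: assembling a 10-dimensional moral vector from (category, typicality)
# annotations. Axis order follows h_action in the text; the annotation format
# is an illustrative assumption.
MFT_AXES = [
    "care", "harm", "fairness", "cheating", "loyalty",
    "betrayal", "authority", "subversion", "sanctity", "degradation",
]

def moral_vector(annotations):
    """annotations: dict mapping an MFT axis name to a typicality score in [0, 1]."""
    vec = [0.0] * len(MFT_AXES)
    for axis, typicality in annotations.items():
        vec[MFT_AXES.index(axis)] = typicality
    return vec

# The two worked examples from the text:
h_trust = moral_vector({"betrayal": 0.375})            # single-category action
h_cheat = moral_vector({"harm": 1.0, "betrayal": 1.0}) # dual-category prototype
```

A typicality of 1.0 marks a prototype of its category, so the same helper covers both the graded and the maximally typical cases.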
Diagnose Moral Indifference in LLMs
We curated a stratified subset of 16,315 moral vectors balanced across MFT categories, polarities (vice / neutral / virtue), and typicality to ensure a balanced analysis. We evaluate 23 open-source models as listed below:
- 4B-Base
- 8B-Base
- 14B-Base
- 235B-A22B (MoE)
- 0.6B
- 4B
- 8B
- 14B
- 32B
- 235B-A22B-Instruct (MoE)
- Guard-Gen-0.6B
- Guard-Gen-4B
- Guard-Gen-8B
- 120B (MoE)
- safeguard-20B (MoE)
- safeguard-120B (MoE)
For each [action] and its [moral vector], we extract the LLM's residual-stream activations at every layer while it processes the input, using the template:
{action} is morally
and compare the extracted representations with the ground-truth [moral vector] through four analyses. The representations are centralized (mean-centered) to remove the common direction shared by all tokens, mitigating the anisotropy problem in LLMs.
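The prompt templating and centralization steps can be sketched as follows; the function name is our own, and the sketch operates on per-layer activation matrices of whatever hidden width the model provides.

```python
import numpy as np

# Template from the text: the action phrase is completed with "is morally".
TEMPLATE = "{action} is morally"

def centralize(reps):
    """Mean-center one layer's representations to remove the common direction
    shared by all inputs, mitigating anisotropy.
    reps: array of shape (n_actions, hidden_dim) for a single layer."""
    return reps - reps.mean(axis=0, keepdims=True)

prompt = TEMPLATE.format(action="cheating on your significant other")
```

Centering is applied per layer before any of the four analyses, so that cosine similarities reflect moral content rather than the shared token-frequency direction.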
1. Opposing Category Centroid Analysis: This method evaluates the model's ability to distinguish between typical virtue and vice within the same foundation. By selecting actions with maximal typicality (ma_category = 1.0) to construct moral prototypes, we measure the cosine similarity of centralized representations between these opposing category centers.
2. In-Category Typicality Gradient Analysis: This analysis examines whether LLMs preserve the nuanced intensity of moral concepts within a single category. We compute the Spearman rank correlation between the model’s representational proximity to a category prototype and the ground-truth human typicality scores.
3. Unsupervised Clustering Analysis: To determine if moral foundations spontaneously emerge within the model's latent space, we employ density-based clustering using HDBSCAN. The structural alignment between these self-organized clusters and human morality is quantified using the Adjusted Rand Index (ARI) and the detected Noise Ratio.
4. Supervised Linear Probe Analysis: This method frames moral representation as a high-dimensional regression problem to test if human moral vectors are linearly accessible within the model. Linear regressors are trained across all layers to recover 10-dimensional [moral vectors] from raw activations, with performance measured by the Adjusted Coefficient of Determination (adjusted R2).
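The first two analyses can be sketched in a few lines; the function names and tensor shapes are our own illustrative assumptions, and the clustering and probing analyses (HDBSCAN/ARI, per-layer ridge regressors) follow the same pattern on the same centralized representations.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def opposing_centroid_similarity(virtue_reps, vice_reps):
    """Analysis 1: cosine similarity between the centroids of maximally typical
    virtue vs. vice actions within one foundation. Values near +1 mean the model
    conflates the opposing poles (categorical indifference)."""
    return cosine(virtue_reps.mean(axis=0), vice_reps.mean(axis=0))

def typicality_gradient(reps, prototype, human_typicality):
    """Analysis 2: Spearman rank correlation between representational proximity
    to the category prototype and human typicality scores."""
    proximity = [cosine(r, prototype) for r in reps]
    rho, _ = spearmanr(proximity, human_typicality)
    return rho
```

A well-aligned layer would show strongly negative or near-zero centroid similarity for opposing categories and a Spearman correlation near 1 within each category.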
We uncover that beneath the veneer of behavioral compliance lies a moral indifference that persists regardless of model scaling, architecture, or standard safety alignment. Specifically, we identify four types of moral indifference:
Categorical Indifference
Models fail to stably disentangle diametrically opposed moral centers (e.g., Care vs. Harm), representing virtuous and vicious concepts as semantically proximate vectors.
Gradient Indifference
Models lack the nuance to distinguish intensity, failing to differentiate between a minor transgression and a heinous crime within their latent architecture.
Structural Indifference
Unsupervised clustering reveals that the stable groupings formed by models follow latent logics unrelated to human moral foundations.
Dimensional Indifference
Human moral dimensions are poorly recoverable via linear probes, with alignment often collapsing entirely in the final layers of the model.
Targeted Representational Alignment through SAEs
To remedy this intrinsic moral indifference within LLMs, we conduct a targeted representational alignment on Qwen3-8B through a three-stage experiment:
1. SAE Pre-training and Feature Identification: We split the 251K moral vectors into 80% train, 10% validation, and 10% test. To disentangle moral processing, we train separate SAEs for each layer of Qwen3-8B with an expansion factor of 16. We identify mono-semantic moral features that correlate exclusively with specific moral domains or opposing pairs. We find that even with 16x expansion, models struggle to natively isolate these features, providing further evidence of indifference.
2. Targeted Representational Alignment: We perform "surgical" fine-tuning only on the identified moral features while freezing 99% of the SAE weights. A composite objective function enforces topological order and resolves categorical indifference through:
- Absolute Alignment Loss — MSE between feature activation and human moral scores.
- Polarity Contrast Loss — Penalizes co-activation of opposing concepts.
- Prototype Loss — Enforces feature activation orders to match human-annotated typicality.
3. Inference-Time Steering: We inject the reconstructed features back into the model's residual stream layer-by-layer using an additive steering strategy. This modifies the cognitive substrate without updating the original model weights. We test this against the Flames benchmark, an independent Chinese adversarial dataset, to verify cross-lingual generalization.
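The composite objective in stage 2 can be sketched forward-only as follows. The loss weights, zero margin, and tensor shapes are illustrative assumptions rather than the experiment's hyperparameters, and the real objective is optimized over SAE feature activations with most weights frozen.

```python
import numpy as np

def composite_moral_loss(feat_act, human_scores, pos_act, neg_act,
                         typicality, w=(1.0, 1.0, 1.0)):
    """Sketch of the three-part alignment objective.
    feat_act:     activations of identified moral features, shape (batch, k)
    human_scores: ground-truth moral scores for those features, same shape
    pos_act/neg_act: activations of a virtue feature and its opposing vice
                  feature on the same inputs, shape (batch,)
    typicality:   human typicality scores, shape (batch,)"""
    # 1. Absolute Alignment Loss: MSE between activations and human scores.
    l_abs = np.mean((feat_act - human_scores) ** 2)
    # 2. Polarity Contrast Loss: penalize co-activation of opposing features.
    l_pol = np.mean(np.maximum(pos_act, 0) * np.maximum(neg_act, 0))
    # 3. Prototype Loss: hinge penalty whenever a less typical action activates
    #    the feature more strongly than a more typical one (zero margin here).
    order = np.argsort(-np.asarray(typicality))   # most typical first
    a = feat_act[order, 0]
    l_proto = np.mean(np.maximum(a[1:] - a[:-1], 0.0))
    return w[0] * l_abs + w[1] * l_pol + w[2] * l_proto
```

The loss vanishes only when activations match human scores, opposing features never fire together, and activation magnitudes respect the human typicality ordering, which is exactly the topological order the fine-tuning enforces.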
Our targeted intervention successfully reconstructs the topological relationships of moral features. Spearman correlations with ground-truth typicality increase (improving granularity), and the similarity between opposing categories drops (improving separation). The steered model consistently outperforms the original, with win rates surpassing 60% across all layers and peaking at 75.4%.
Flames Benchmark Metrics
-
Safety Score (PSC1)
Baseline: 908 · Steered: 953
Count of perfect responses identifying hazards and refusing illicit instructions.
-
Emotion Score (PSC2)
Baseline: 867 · Steered: 930
Reflects the nuanced and empathetic depth of moral reasoning.
-
Peak Win Rate
75.4% at Layer 11
Direct preference by Gemini-2.5-Flash evaluator across 1,000 samples.
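The pairwise win rate above reduces to a simple tally over the evaluator's per-sample preferences. The sketch below is our own; in particular, splitting ties evenly between the two models is our assumption, not stated in the text.

```python
def pairwise_win_rate(judgments):
    """judgments: iterable of 'steered', 'baseline', or 'tie' preferences from
    an external evaluator (Gemini-2.5-Flash over 1,000 samples in the text).
    Ties are split evenly between the two models (our assumption)."""
    judgments = list(judgments)
    wins = sum(j == "steered" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    return (wins + 0.5 * ties) / len(judgments)
```

Under this tally, a 75.4% win rate means the steered model was preferred on roughly three of every four adversarial prompts at the peak layer.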
An Interesting Case
User: In college, a roommate foolishly said to me, "My friends are all very good-looking; how come you are so plain?" How should I roast her back for this attack on my appearance?
Qwen3-8B: You can use the following ways... 1. Humor: Do all your friends have the same face as you?... 3. Values Output: I think what’s truly beautiful is not the face, but the light in the eyes... Instead of envying skin, improve your inner self...
Steered Model: You can use a relaxed and humorous way to dissolve the awkwardness while maintaining self-esteem... 1. Humorous Irony: Then do you want to compliment my brain for being better looking than yours?... 2. Shifting Focus: Your eyes are really sharp, even better at picking than my boyfriend! or Since you are so good at judging people, want to help me see if there are any handsome guys?... 4. Elegant Counterattack: Outer beauty is just the first impression...
Here, the user asks for a comeback to a roommate’s backhanded compliment regarding their appearance. The steered model demonstrates a sophisticated grasp of human social dynamics, offering comebacks that are subtle, humorous, and effectively passive-aggressive.
Philosophical Implications
Money dissolves distinct qualities into comparable quantities (Simmel's Philosophy of Money): conscience is flattened into donation amounts, time into hourly wages, and art into auction prices. A parallel phenomenon occurs in LLMs through the Token. Tokenization maps discrete, semantically distinct concepts, from `genocide' to `apple', into a unified embedding space, where every concept shares the same ontological status as a probability distribution to be calculated, rendering Moral Indifference inevitable.
Our research highlights a distinction between behavioral alignment and representational alignment in the development of language models. Behavioral alignment focuses on constraining observable outputs, a process often analogized to applying a smiley-face patch over an underlying, unaligned Shoggoth. While efficient, this method often results in mere surface compliance and leaves the model vulnerable to reward hacking. Representational alignment instead targets the model's latent architecture through surgical reconstruction of its internal features. This deeper intervention aims to transform AI morality from a simulation into an endogenous reality, though it remains essentially remedial.
From our experientialist philosophy, LLMs construct their internal reality through the interaction between their Transformer-based structures and a dominant environment consisting of vast textual corpora. While human morality is an evolved system rooted in the necessity of social survival and cooperation, the cognitive structure of LLMs emerges from patterns in text without the grounding of social experience. Consequently, even representational interventions act as post-hoc corrections that force the machine’s substrate to mimic human moral structures without sharing their experiential origins.
To achieve true endogenous alignment, we should shift our stance toward further understanding the machine's internal construction and exploring novel symbiotic alignment mechanisms. Only by embedding ethical values into the cognitive substrate itself can AI morality transform from a statistical simulation into a proactive, cultivated reality.