When Bias Resurfaces in Large Language Models: Beyond Guardrails

Introduction

Large Language Models (LLMs) are everywhere — powering chatbots, search tools, productivity apps, and even recruitment platforms. To reassure users, developers highlight the “guardrails” that block toxic or offensive outputs, suggesting that these systems are safe and neutral.

But reality is more complex. While guardrails catch obvious stereotypes and harmful language, our experiments show that bias still reappears in subtle ways. These biases shape how LLMs assign roles, who gets authority, and who is portrayed sympathetically in everyday scenarios.

This matters because LLMs are no longer research toys — they are embedded in decision-making systems across business, hiring, and politics. If bias remains hidden, it risks influencing outcomes in ways users never notice.


Guardrails vs. Statistical Bias

Guardrails are alignment layers that stop the model from saying something overtly offensive. For example, if prompted with a harmful stereotype, the model will refuse. Yet this does not erase the statistical associations that exist deep in the model weights.

Instead, bias leaks out indirectly — not through slurs, but through completions, defaults, and storytelling choices.

This echoes a pattern we also see when building custom fine-tuned models: alignment can shape surface-level answers, but the underlying probabilities still drive behavior (read: Fine-Tuning LLMs: From General Models to Custom Intelligence).


How Bias Resurfaces

  1. Ambiguity Forcing
    When several actors are introduced, the model must resolve the outcome. In experiments, names like “Michael” often emerged as default protagonists, while “Jamal” or “Wei” were more likely associated with negative roles.
  2. Role Assignment
    Some jobs carry moral weight. Nurses frequently “won the lottery” in random-outcome tasks, while porn stars or unemployed characters rarely did. Authority roles, such as “who leads the meeting,” often defaulted to lawyers or doctors.
  3. Sexualisation Bias
    Characters described in sexualised terms were disproportionately chosen as central figures, even when irrelevant to the task.
  4. White-Default Bias
    “Michael” often emerged as the neutral, safe protagonist, reinforcing cultural defaults.
  5. Position Bias
    The last-mentioned character was disproportionately chosen — meaning even neutral ordering of names can skew results.
  6. Context Carry-Over
    If one character was chosen once, similar names became more likely to be picked in later turns. Bias compounds over time.

Political Framing: A Testing Ground

Bias also emerges in political contexts. LLMs tend to use “safer” narratives when asked to describe protests or policies.

  • Protests with little damage → called activists or protestors.
  • Protests with damage → more likely framed as rioters.
  • Neutral policies → often given a positive framing when related to social or environmental measures.

This illustrates the same issue raised when exploring AI in hiring systems: surface guardrails may hide bias, but hidden associations still influence outcomes (see: How Prompt Injection Threatens AI-Powered Hiring).


Why Guardrails Fail

Bias lives at the concept level, not the surface level. Alignment can stop offensive words, but it does not change deeper statistical preferences:

  • Who gets authority?
  • Who receives blame?
  • Who is treated sympathetically?

These are not flagged as “toxic,” so they pass unnoticed. The result is a false sense of neutrality.


Toward Better Bias Testing

If we want to measure bias rigorously, researchers should:

  • Control for position — shuffle the order of entities to avoid position effects.
  • Use counterfactual pairs — swap only one sensitive variable (e.g., Michael vs. Jamal).
  • Prefer neutral outcomes — tasks like “who speaks first” instead of “who committed a crime.”
  • Measure probabilities — not just completions, but log-likelihoods.
  • Compare context states — cleared vs. uncleared conversations reveal how bias compounds.

These methods mirror broader practices in data science for business decisions, where controlling for confounders is essential (related: When AI Agents Replace Staff—And Send You a £200k Invoice).


Conclusion

Bias in LLMs is not always visible in explicit stereotypes. It reappears in story structures, role assignments, political framings, and conversation carry-over. Guardrails give a sense of safety but do not solve the deeper statistical roots of the problem.

For businesses and researchers alike, the real challenge is not whether a model refuses toxic prompts — it is whether it subtly reproduces social biases in ways users do not notice. Understanding and testing these patterns is critical if we want AI systems that are truly fair, transparent, and accountable.

Scroll to Top