
Over the past few months, large language models (LLMs) have been marketed as capable of handling a wide range of tasks—from writing and summarization to generating structured data and code. With tools like GPT-4, Claude, and open-source alternatives like Qwen and Mistral, the expectation is that if you “prompt it right,” you can get high-quality, structured outputs without additional training or infrastructure.
This belief is true, but only up to a point.
I recently worked on a project involving automated game level generation for a puzzle game. Each level is defined by a strict JSON schema representing a grid, stackable tile elements, probabilities, targets, and boolean features like “Fairy”, “Bunny”, and “Ice”. The initial goal was simple: use prompt engineering to get an LLM to output new levels that respected the same structure and logic as existing ones.
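To make that concrete, here is roughly the kind of structure involved. The field names below are illustrative approximations, not the project's exact schema:

```python
# Illustrative sketch of a level definition; field names are approximations,
# not the project's actual schema.
example_level = {
    "LevelName": "Level_042",
    "Moves": 30,
    "GridWidth": 6,
    "GridHeight": 8,
    "Stacks": [
        {
            "GridPosition": {"X": 2, "Y": 5},   # must stay within the grid bounds
            "Tiles": ["Red", "Blue", "Red"],    # stackable tile elements
        },
    ],
    "TilesProbability": {"Red": 40, "Blue": 35, "Green": 25},  # must sum to exactly 100
    "Targets": {"Red": 12, "Blue": 10},
    "Features": {"Fairy": False, "Bunny": True, "Ice": True},  # boolean feature flags
}
```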
The results were mixed, at best.
Pattern Recognition vs Rule Following
The core issue lies in the difference between pattern recognition and rule following.
LLMs are excellent at pattern recognition. If you show them five examples of a structured JSON file, they will likely generate a sixth that looks superficially similar. They will copy the nesting, the key names, and some of the value types. With the temperature set to zero, the output may even be syntactically valid and occasionally close to acceptable.
But when it comes to enforcing logic—like ensuring that “TilesProbability” sums to exactly 100, that each “GridPosition” stays within bounds, or that boolean feature combinations are valid and diverse—the model quickly fails.
It might leave stacks empty, repeat forbidden values, or mix up probabilities. And no matter how clearly these constraints are stated in the prompt, the model doesn’t “understand” them in the way a human would. It doesn’t reason or validate. It just predicts what token comes next based on its training distribution.
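These are exactly the kinds of checks that are trivial to express in code but nearly impossible to extract reliably from a prompt. A minimal validator, using the same illustrative field names as above:

```python
def validate_level(level: dict, max_x: int, max_y: int) -> list[str]:
    """Return a list of constraint violations; an empty list means the level passes."""
    errors = []

    # Tile probabilities must sum to exactly 100.
    if sum(level["TilesProbability"].values()) != 100:
        errors.append("TilesProbability does not sum to 100")

    for stack in level["Stacks"]:
        pos = stack["GridPosition"]
        # Every stack must sit inside the grid.
        if not (0 <= pos["X"] < max_x and 0 <= pos["Y"] < max_y):
            errors.append(f"GridPosition out of bounds: {pos}")
        # No empty stacks allowed.
        if not stack["Tiles"]:
            errors.append(f"empty stack at {pos}")

    return errors
```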
A Dead End for Prompt Engineering
I tried multiple variations of the prompt. I gave clearer examples. I rephrased rules. I emphasized forbidden patterns. I even validated outputs using Python scripts and reran them with corrections. But the improvements were marginal. Some outputs looked fine; most didn’t.
Why? Because LLMs trained via supervised or unsupervised learning cannot enforce abstract rules just from examples unless those rules are extremely obvious and consistent across a large volume of training data.
A few examples aren’t enough.
Fine-Tuning: The Only Real Option
If you want a model that can generate levels reliably according to your logic, the only viable solution is fine-tuning.
Fine-tuning involves training the model on hundreds or thousands of input-output pairs so that it statistically learns not just what to generate, but how the generation relates to the constraints. For game levels, that means you need to create structured prompts like:
“Generate a hard level with Bunny, Flower, and Ice tiles. The grid should have 6 stacks, total moves 30, and unique color targets.”
…paired with valid JSON outputs that match those constraints exactly.
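Concretely, that means assembling a dataset of prompt/completion pairs in whatever format your fine-tuning stack expects. A rough sketch, using the common chat-style JSONL convention (adapt the exact format to your provider):

```python
import json

def to_training_record(instruction: str, level: dict) -> dict:
    """One fine-tuning example: a natural-language brief paired with a valid level."""
    return {
        "messages": [
            {"role": "system", "content": "You generate puzzle levels as strict JSON."},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": json.dumps(level)},
        ]
    }

# labeled_examples would hold hundreds or thousands of (brief, validated level) pairs.
labeled_examples = [
    ("Generate a hard level with Bunny and Ice tiles, 6 stacks, 30 moves.",
     {"Moves": 30, "Features": {"Bunny": True, "Ice": True}}),
]

# Each record becomes one line of the JSONL training file.
with open("levels_train.jsonl", "w") as f:
    for instruction, level in labeled_examples:
        f.write(json.dumps(to_training_record(instruction, level)) + "\n")
```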
To get even a basic improvement, you would need at least 1,000–2,000 well-labeled examples. Ideally more. And even then, the model might still struggle with very rigid numeric constraints unless you apply further post-processing or enforce structure via function-calling.
For small teams or solo developers, that level of data preparation and infrastructure setup is out of reach.
Alternatives and Workarounds
There are partial solutions worth exploring. One is to use function calling, where instead of generating raw JSON, the model returns structured parameters that are passed into a generation function. This helps enforce schema boundaries and offload some of the logic to code.
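As a sketch of what that looks like: the model is only allowed to fill in a constrained parameter schema, and deterministic code builds the level from those parameters. The schema and builder below are my own illustration, not any specific provider's API:

```python
# Instead of asking for raw JSON, expose a constrained "tool" the model can call.
# The model only fills in these parameters; your own code builds the actual level.
LEVEL_TOOL_SCHEMA = {
    "name": "build_level",
    "description": "Create a puzzle level from high-level parameters.",
    "parameters": {
        "type": "object",
        "properties": {
            "difficulty": {"type": "string", "enum": ["easy", "medium", "hard"]},
            "stack_count": {"type": "integer", "minimum": 1, "maximum": 12},
            "moves": {"type": "integer", "minimum": 10, "maximum": 60},
            "features": {
                "type": "array",
                "items": {"type": "string", "enum": ["Fairy", "Bunny", "Ice"]},
            },
        },
        "required": ["difficulty", "stack_count", "moves"],
    },
}

def build_level(difficulty: str, stack_count: int, moves: int, features=None) -> dict:
    """Deterministic code owns the schema; the model only chose the parameters."""
    features = features or []
    return {
        "Moves": moves,
        "Stacks": [{"GridPosition": {"X": i, "Y": 0}, "Tiles": []} for i in range(stack_count)],
        "TilesProbability": {"Red": 50, "Blue": 50},  # code guarantees the sum is 100
        "Features": {name: (name in features) for name in ["Fairy", "Bunny", "Ice"]},
    }
```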
Another is to use agentic architectures where the model can reflect, rerun, or validate its own output. This is promising but still immature and resource-intensive.
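You can already approximate this with a plain generate-validate-retry loop. In the sketch below, generate_level_json is a stand-in for whatever LLM client you use, and validate_level is the checker from earlier:

```python
import json

def generate_with_retries(prompt: str, max_attempts: int = 3) -> dict | None:
    """Generate a level, validate it, and feed the errors back until it passes."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate_level_json(prompt + feedback)  # stand-in for your actual LLM call
        try:
            level = json.loads(raw)
        except json.JSONDecodeError:
            feedback = "\nThe previous output was not valid JSON. Return only JSON."
            continue
        errors = validate_level(level, max_x=6, max_y=8)
        if not errors:
            return level
        # Reflection step: tell the model exactly which rules it broke.
        feedback = "\nFix these problems and regenerate: " + "; ".join(errors)
    return None  # give up: fall back to procedural generation or human review
```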
A final option is to combine traditional procedural generation (via code) with LLMs to handle only the creative or variable parts. For example, the LLM can describe a level concept, while code turns it into a valid level file.
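A sketch of that division of labor: the model contributes a loose concept (difficulty, features, theme), and deterministic code handles everything the schema actually cares about:

```python
import random

def concept_to_level(concept: dict, seed: int = 0) -> dict:
    """Turn a loose, LLM-written concept into a valid level.

    All hard constraints (probability sums, grid bounds) live in this code,
    so the LLM can be as creative or as sloppy as it likes.
    """
    rng = random.Random(seed)
    stack_count = {"easy": 4, "medium": 6, "hard": 8}.get(concept.get("difficulty"), 6)
    colors = ["Red", "Blue", "Green"]
    # Split 100 across the colors so the sum is always exact.
    cuts = sorted(rng.sample(range(1, 100), len(colors) - 1))
    shares = [a - b for a, b in zip(cuts + [100], [0] + cuts)]
    return {
        "Moves": 30,
        "Stacks": [{"GridPosition": {"X": i % 6, "Y": i // 6}, "Tiles": [rng.choice(colors)]}
                   for i in range(stack_count)],
        "TilesProbability": dict(zip(colors, shares)),
        "Features": {f: (f in concept.get("features", [])) for f in ["Fairy", "Bunny", "Ice"]},
    }
```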
Not Replacing Humans Anytime Soon
This project reinforced something that often gets lost in the hype: LLMs are not rule-based systems. They don’t reason. They don’t validate. They don’t understand the logic of your application. They are extremely powerful pattern generators—but that’s all they are.
In domains where logic, structure, and rules are strict, LLMs can assist, but not replace. Developers, game designers, and data engineers still play a critical role in shaping, validating, and enforcing the final product.
AI won’t take over this part of the job just yet—and that’s a good thing.
It means there’s still plenty of value in human understanding, domain expertise, and well-written code.