What Is This System?
Retrieval Augmented Generation (RAG) has become the standard architecture for building AI assistants that can answer questions from a private document corpus. Rather than relying on a language model’s training data — which cuts off at a certain date and cannot include your specific documents — RAG retrieves relevant passages from your own files and feeds them as context to the LLM at query time.
The system described here is a production RAG application built specifically for Italian legal research. It allows lawyers and legal professionals to upload court documents, query them in natural language, and get grounded answers sourced directly from the documents — with a built-in quality control layer that automatically flags and corrects hallucinated responses.
Architecture Overview
The backend is built on FastAPI and structured around three layers: document ingestion, retrieval, and generation. The API exposes separate modules for projects, documents, chat, settings, and a dedicated integration with the Italian Supreme Court database.
Document storage is project-based: each project has its own index directory, interaction log (SQLite), and metadata file. This allows multiple independent knowledge bases to coexist on the same server, each with its own embedding model, LLM choice and search settings.
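As a rough sketch of that layout (the directory names, metadata fields and SQLite schema below are illustrative assumptions, not the application's actual paths):

```python
# Hypothetical per-project layout: index dir, SQLite interaction log, metadata file.
import json
import sqlite3
from pathlib import Path

def create_project(base_dir: str, name: str, settings: dict) -> Path:
    """Create an isolated knowledge base with its own index, log and settings."""
    project = Path(base_dir) / name
    (project / "index").mkdir(parents=True, exist_ok=True)      # FAISS index lives here
    (project / "metadata.json").write_text(json.dumps(settings, indent=2))
    with sqlite3.connect(project / "interactions.db") as db:     # per-project chat log
        db.execute(
            "CREATE TABLE IF NOT EXISTS interactions ("
            "id INTEGER PRIMARY KEY, question TEXT, answer TEXT, created_at TEXT)"
        )
    return project

create_project("./projects", "cassazione-civile", {
    "embedding_model": "BAAI/bge-small-en-v1.5",
    "llm": "qwen2.5-72b",
    "chunk_size": 512,
})
```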
Document Ingestion and Chunking
Documents are parsed using PyMuPDF (fitz) for PDFs, with plain text, Markdown and CSV also supported. PDF parsing preserves page numbers, which are passed through the entire pipeline so that the source page appears alongside every retrieved chunk in the UI — critical for legal work where citation accuracy matters.
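A minimal sketch of page-aware parsing with PyMuPDF; the `(page_number, text)` return shape is an assumption made for illustration:

```python
# Page-aware PDF parsing with PyMuPDF (fitz): keep the page number with every
# block of text so provenance survives chunking and retrieval.
import fitz  # PyMuPDF

def parse_pdf(path: str) -> list[tuple[int, str]]:
    """Return (page_number, text) pairs, 1-based pages for citation display."""
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text("text")
            if text.strip():
                pages.append((page.number + 1, text))
    return pages
```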
Parsed text is split into overlapping word-count chunks. Three sizes are available:
- Small: 256 words, 32-word overlap — dense retrieval, more precise matches
- Medium: 512 words, 64-word overlap — the default, balancing context and precision
- Large: 1024 words, 128-word overlap — more context per chunk, better for long-form reasoning
Overlap between chunks ensures that sentences that fall near a boundary are captured in both adjacent chunks, avoiding retrieval gaps at cut points.
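A hedged sketch of this word-count chunker, shown with the medium preset (the function name and exact stride handling are assumptions):

```python
# Overlapping word-count chunking: consecutive chunks share `overlap` words so
# sentences near a boundary appear in both neighbours.
def chunk_words(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    words = text.split()
    step = max(size - overlap, 1)          # stride between chunk starts
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunk = words[start:start + size]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break                           # last chunk already reached the end
    return chunks
```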
Embedding: Local CPU and API Options
Twelve embedding models are available, split between two modes:
Local (CPU) via fastembed: No API key required, runs on the server CPU. Options include BAAI/bge-small-en-v1.5 (384d, ~90 MB), BGE Base (768d, ~440 MB), MiniLM L6 and MiniLM L3. The L3 model is the fastest option for high-volume ingestion.
Remote via HuggingFace Inference API: Higher-dimension models including BGE Large (1024d), MixedBread Large (1024d), MPNet Base (768d) and the multilingual BGE M3 (1024d). Remote embedding uses async batching with configurable concurrency (32 texts per batch, 4 parallel calls) to keep ingestion throughput high.
Embedding vectors are stored in a FAISS index, Facebook’s battle-tested approximate nearest neighbour library, which handles fast cosine similarity search at scale.
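As a rough illustration of the local path, the following assumes fastembed and faiss-cpu are installed; cosine similarity is obtained by L2-normalising the vectors and searching an inner-product index:

```python
# Local embedding with fastembed, indexed in FAISS. Normalised vectors make the
# inner-product search equivalent to cosine similarity.
import numpy as np
import faiss
from fastembed import TextEmbedding

texts = ["Art. 2043 c.c. risarcimento per fatto illecito",
         "Cass. civ., sez. III, n. 1234/2020"]

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")      # 384-dim local model
vectors = np.array(list(model.embed(texts)), dtype="float32")
faiss.normalize_L2(vectors)                                      # IP == cosine after this

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = np.array(list(model.embed(["responsabilità extracontrattuale"])), dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)                           # top-2 most similar chunks
```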
Hybrid Search: BM25 + Vector
Pure vector search has a known weakness: it finds semantically similar passages but can miss exact keyword matches, which matters greatly in legal contexts where specific terms, article numbers and case identifiers are critical. The system addresses this with hybrid search, combining two retrieval signals:
- BM25 (40% weight): A classical sparse keyword retrieval algorithm. Fast, language-agnostic, excellent at exact term matching.
- Vector similarity (60% weight): Dense retrieval via FAISS. Captures semantic meaning, handles paraphrasing and synonyms.
The two scores are linearly combined with these weights before ranking, giving the best of both approaches: semantic understanding plus exact term recall.
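A small illustration of that fusion step. The min-max normalisation before mixing is an assumption; the application may scale the two score ranges differently:

```python
# Weighted fusion of BM25 and dense scores (40% / 60%) over a shared chunk-ID space.
def hybrid_scores(bm25: dict[str, float], dense: dict[str, float],
                  w_bm25: float = 0.4, w_dense: float = 0.6) -> list[tuple[str, float]]:
    def normalise(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    b, d = normalise(bm25), normalise(dense)
    combined = {k: w_bm25 * b.get(k, 0.0) + w_dense * d.get(k, 0.0) for k in set(b) | set(d)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```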
LLM Generation: Ten Model Backends
One of the most practically useful design choices is the multi-provider LLM setup. Rather than being locked to a single API, the system can route generation requests to any of ten models across six providers:
- DeepSeek V3.2 (671B) via Fireworks AI — highest accuracy for complex legal reasoning
- DeepSeek R1-0528 via SambaNova — reasoning model with visible chain-of-thought
- Qwen3.5 397B and 235B via Together AI
- Qwen2.5 72B via Novita — default model, reliable and fast
- Llama 4 Maverick 17B MoE and Llama 3.3 70B via Together AI
- Cohere Command A via Cohere — explicitly RAG-optimised
- Llama 3.1 8B via Cerebras — fastest option for low-latency use cases
Models that output chain-of-thought in <think> blocks (DeepSeek R1, certain Qwen3 models) are detected automatically — the thinking block is stripped from the final answer and surfaced separately in the UI, so users can see the model’s reasoning without it cluttering the response.
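One way to do this, shown as an illustrative sketch (the system's actual parsing is not reproduced here):

```python
# Split a reasoning model's raw output into the visible answer and the
# chain-of-thought that the UI surfaces separately.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_think(raw: str) -> tuple[str, str]:
    """Return (answer_without_think, extracted_reasoning)."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return answer, reasoning
```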
Oracle Mode: Automatic Hallucination Detection
The most distinctive feature of this system is Oracle mode — an optional quality control layer powered by GPT-4o that evaluates every RAG answer before it is returned to the user.
The Oracle works in three steps:
- Evaluate: GPT-4o scores the answer on a 1-10 scale and checks for hallucinations. It returns a structured JSON object with score, confidence, hallucination flag and a one-sentence reason.
- Gate: If the score is below the threshold (default 8.5) or a hallucination is detected, the answer is rejected.
- Correct: GPT-4o generates a corrected answer grounded strictly in the original retrieved chunks. The corrected answer is then re-evaluated, and the final score is updated.
This creates a self-healing pipeline: the system does not just flag bad answers, it fixes them. Both the original and corrected answers are logged, so quality trends can be tracked over time.
The scoring guide is explicit in the system prompt: 9-10 for accurate, complete, hallucination-free answers; 7-8 for mostly correct with minor gaps; 5-6 for partially correct; 1-4 for significantly wrong or fabricated. This calibration matters — a threshold of 8.5 means the system only accepts genuinely high-quality answers.
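The control flow can be sketched as below; evaluate() and correct() stand in for the GPT-4o prompts, and the field names and threshold handling are assumptions based on the description above, not the application's actual interface:

```python
# Evaluate -> gate -> correct, with re-evaluation of the corrected answer.
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float          # 1-10 quality score
    confidence: float
    hallucination: bool
    reason: str

def oracle_gate(answer: str, chunks: list[str], evaluate, correct,
                threshold: float = 8.5) -> tuple[str, Evaluation]:
    """Reject answers below threshold or flagged as hallucinated, then repair them."""
    verdict = evaluate(answer, chunks)
    if verdict.score >= threshold and not verdict.hallucination:
        return answer, verdict                      # passes the quality floor as-is
    corrected = correct(answer, chunks)             # regenerate, grounded in the same chunks
    return corrected, evaluate(corrected, chunks)   # re-evaluate the corrected answer
```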
Cassazione Integration
Italian lawyers have a specific data need: access to rulings from the Corte Suprema di Cassazione (Italy’s Supreme Court). The system includes a dedicated module that queries the court’s official SentenzeWeb Solr endpoint directly.
Users can search both civil (snciv) and criminal (snpen) rulings by keyword, retrieve metadata (year, ruling number, section, president, relatore or reporting judge, subject matter) and full OCR text, and save selected rulings as structured text files for immediate upload to the RAG index. The saved files include a formatted header with all procedural metadata before the ruling text, so the LLM has full context when answering questions about a specific case.
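For illustration, a keyword search against a Solr select endpoint might look like the sketch below; the URL and field names are placeholders, not the documented SentenzeWeb interface:

```python
# Illustrative Solr query for civil (snciv) rulings by keyword.
import requests

SOLR_URL = "https://example.invalid/solr/sentenze/select"   # placeholder endpoint

def search_rulings(keyword: str, archive: str = "snciv", rows: int = 10) -> list[dict]:
    params = {
        "q": keyword,
        "fq": f"archive:{archive}",   # snciv = civil, snpen = criminal
        "rows": rows,
        "wt": "json",
    }
    resp = requests.get(SOLR_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]
```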
Streaming and Intent Classification
All chat responses are streamed via Server-Sent Events (SSE), with typed events for each stage: status updates, retrieved chunks, document summaries, think blocks, Oracle evaluations, corrections and the final answer. This means the UI can show the user exactly what is happening at each step — which chunks were retrieved, what the Oracle scored, whether a correction was applied.
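A minimal sketch of typed SSE events over FastAPI's StreamingResponse; the event names mirror the stages above, but the exact event schema is an assumption:

```python
# Typed Server-Sent Events: each pipeline stage is emitted as its own event type.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse(event: str, data: dict) -> str:
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

@app.get("/chat/stream")
async def chat_stream(q: str):
    async def events():
        yield sse("status", {"stage": "retrieving"})
        yield sse("chunks", {"chunks": [{"page": 3, "text": "..."}]})   # retrieved context
        yield sse("oracle", {"score": 9.2, "hallucination": False})
        yield sse("answer", {"text": "Final grounded answer."})
    return StreamingResponse(events(), media_type="text/event-stream")
```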
Before retrieval, an intent classifier determines whether the question is asking about document content or about the conversation history itself. History queries (recap, summary, what did we discuss) bypass FAISS entirely and answer from the last five logged turns, avoiding unnecessary embedding lookups.
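As a toy illustration of that routing decision (the keyword heuristic here is an assumption; the real classifier may be model-based):

```python
# Route history-style questions to the conversation log instead of FAISS retrieval.
HISTORY_CUES = ("recap", "summary", "what did we discuss", "riassunto", "riepilogo")

def is_history_query(question: str) -> bool:
    q = question.lower()
    return any(cue in q for cue in HISTORY_CUES)

def route(question: str, last_turns: list[str], retrieve, answer_from_history, answer_from_docs):
    if is_history_query(question):
        return answer_from_history(question, last_turns[-5:])   # no embedding lookup needed
    return answer_from_docs(question, retrieve(question))
```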
Why This Architecture Works for Legal Research
Legal document analysis places unusually high demands on a RAG system. Accuracy matters more than in most domains — a hallucinated legal interpretation could have real consequences. The design of this system addresses that directly:
- Page-level citations let users verify every answer against the source document
- Hybrid search catches specific statute references and case numbers that pure vector search would miss
- Oracle mode provides an automatic quality floor that human review alone cannot guarantee at scale
- Multiple LLM options mean the right model can be chosen per task — fast models for quick lookups, reasoning models for complex interpretation
If you are building a document intelligence system, a legal research tool, or any application where answer quality and traceability are critical, get in touch with our team. This kind of end-to-end LLM and RAG engineering is a core part of what we do.