How long does it take to bring a RAG POC to production?
It depends on corpus maturity, governance requirements, and volume. For a well-circumscribed use case (internal documentation, customer support, enriched FAQ), plan three to five months with a team of four people to go from POC to production. For an ambitious program with multiple use cases, AI Act governance, and continuous monitoring, plan six to nine months with a team of six to eight people. Duration depends mainly on upstream corpus quality: a corpus that is already cleaned and cataloged shortens the project by at least two months.
Which inference model should you choose: Claude, GPT, Gemini, open source?
The choice depends on the use case and governance constraints. For the majority of enterprise RAG cases, Claude Sonnet 4.6 or GPT-4o offers the best quality/cost ratio thanks to their large context windows. Gemini is relevant within the Google Cloud ecosystem. For regulated or sovereignty-sensitive cases, self-hosted open source models (Llama 4, Mistral, Qwen) are an option but require heavier infrastructure. In practice, a mature RAG system combines several models: a large model for complex reasoning and a small model for simple tasks (classification, extraction), with dynamic routing between them.
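As an illustration, routing can start as a simple heuristic before graduating to a learned classifier. A minimal Python sketch, where the model names, task labels, and length threshold are placeholders rather than a recommended policy:

```python
# Minimal routing sketch: model names, task labels, and the length
# threshold are illustrative assumptions, not a production policy.

SIMPLE_TASKS = {"classify", "extract", "tag"}

def pick_model(task_type: str, question: str) -> str:
    """Send cheap, well-defined tasks to a small model; everything else
    goes to a large model with a bigger context window."""
    if task_type in SIMPLE_TASKS or len(question.split()) < 12:
        return "small-model"   # e.g. a Haiku- or mini-class model
    return "large-model"       # e.g. a Sonnet- or GPT-4o-class model

print(pick_model("classify", "Is this ticket about billing?"))  # small-model
print(pick_model("answer", "Compare the termination clauses in these three "
                           "supplier agreements and flag any conflicts "
                           "with our procurement policy."))      # large-model
```

A heuristic like this typically handles the bulk of traffic; the routing rule itself can later be replaced by a small classifier model without changing the surrounding pipeline.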
How do you guarantee RAG quality in production?
Three pillars. First, a quality benchmark built upstream with fifty to two hundred reference questions, run at every pipeline modification and continuously in production on a traffic sample. The standard metrics are faithfulness (is the answer faithful to the retrieved passages?), context relevance, and answer relevance. Second, a human-in-the-loop (HITL) on critical cases to validate sensitive outputs. Third, full production observability that detects drift in real time: new question classes, hallucination rate, latency, costs.
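As an illustration, the benchmark can be a simple regression harness wired into CI. A minimal Python sketch, assuming a JSON file of reference questions and treating the RAG pipeline and the faithfulness scorer (for example an LLM-as-judge call) as injected callables; both are hypothetical stand-ins here:

```python
import json
from typing import Callable

def run_bench(
    bench_path: str,
    answer_fn: Callable[[str], tuple[str, list[str]]],  # pipeline: question -> (answer, contexts)
    judge_fn: Callable[[str, list[str]], float],        # faithfulness score in [0, 1]
    threshold: float = 0.85,
) -> float:
    """Replay the reference questions; fail the build if mean faithfulness regresses."""
    with open(bench_path) as f:
        cases = json.load(f)  # expected shape: [{"question": "..."}, ...]
    scores = [judge_fn(*answer_fn(case["question"])) for case in cases]
    mean = sum(scores) / len(scores)
    if mean < threshold:
        raise SystemExit(f"faithfulness regressed: {mean:.2f} < {threshold}")
    return mean
```

The same harness, pointed at a sampled slice of production traffic, feeds the continuous monitoring and drift metrics mentioned above.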
How do you protect sensitive data before sending it to the LLM?
Two complementary approaches. The first is pipeline-side redaction: before the retrieved context is sent to the model, personal information, secrets, and sensitive fields are automatically filtered out, with an explicit allowlist of authorized fields. The second is choosing a suitable deployment: for ultra-sensitive cases, a self-hosted model in a sovereign cloud or on-premise (Llama, Mistral) guarantees that no data leaves the perimeter. Anthropic and OpenAI also offer contractual commitments not to use customer data for training, which are necessary but not always sufficient depending on the sector.
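A minimal sketch of the redaction step, assuming records arrive as dicts; the allowlist and regex patterns are illustrative examples, not an exhaustive PII policy:

```python
import re

# Illustrative allowlist: only these fields may ever reach the model.
ALLOWED_FIELDS = {"title", "body", "product", "updated_at"}

# Example patterns; a real deployment would add a dedicated PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d .()-]{7,}\d")

def redact(record: dict) -> dict:
    """Drop non-allowlisted fields, then mask emails and phone numbers."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    for key, value in kept.items():
        if isinstance(value, str):
            value = EMAIL.sub("[EMAIL]", value)
            value = PHONE.sub("[PHONE]", value)
            kept[key] = value
    return kept

doc = {"title": "Refund request",
       "body": "Contact me at jane@example.com",
       "customer_ssn": "123-45-6789"}
print(redact(doc))  # customer_ssn dropped by the allowlist, email masked
```

The allowlist-first design matters: unknown fields are excluded by default, so a new sensitive column added upstream never leaks silently.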
Should you prefer fine-tuning or RAG?
For the majority of enterprise use cases, RAG is preferable to fine-tuning. It is simpler, less expensive, easier to update (adding a new source means enriching the index, not retraining the model), and more traceable (cited sources are visible). Fine-tuning has its place for adjusting a model's style or tone, or for cases where retrieval is not enough (for example, internalizing highly repetitive, domain-specific logic). In practice, many ambitious projects combine both: a light fine-tune for style, and RAG for factual knowledge.
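To make the update-cost argument concrete, here is a toy sketch of why indexing a new source is an append rather than a training run. The class and the stand-in embedder are purely illustrative; any real embedding model plugs in as embed_fn:

```python
from typing import Callable

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class TinyIndex:
    """Toy vector index: updating knowledge is an append, not a training run."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self.embed_fn = embed_fn  # any embedding model
        self.entries: list[tuple[list[float], str]] = []

    def add_source(self, chunks: list[str]) -> None:
        """Enrich the index with a new source; no model weights change."""
        self.entries.extend((self.embed_fn(c), c) for c in chunks)

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k chunks most similar to the query."""
        qv = self.embed_fn(query)
        ranked = sorted(self.entries, key=lambda e: dot(qv, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

# Usage with a meaningless stand-in embedder, just to show the flow:
idx = TinyIndex(embed_fn=lambda t: [float(len(t)), float(t.count(" "))])
idx.add_source(["Refund policy: 30 days.", "Shipping takes 3-5 business days."])
print(idx.search("How long is the refund window?", k=1))
```

By contrast, pushing the same new facts into model weights would mean assembling a training set, running a fine-tune, and re-validating the model, which is exactly the cost RAG avoids.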
How do you comply with the European AI Act on a RAG system?
The AI Act classifies AI systems by risk level. A RAG assistant that supports editorial drafting is typically limited risk, which requires transparency toward users (indicating that they are interacting with an AI system) and documentation of sources and limits. A RAG system used in high-impact decisions (recruitment, credit, healthcare) shifts to high risk, which requires a conformity assessment, human oversight, logging, and robustness controls. The ATLAS methodology integrates this classification from the scoping phase and scales the level of governance deployed accordingly.