Crafting Excellence in Software
Let's build something exceptional together.
Rely on Lasting Dynamics for unmatched software quality.
Mohamed Boukrim
June 02, 2026 • 13 min read

Most of the enterprise RAG pilots we are asked to review are quietly failing. They retrieve the wrong context. They hallucinate with confidence. They cannot be evaluated by the team running them. They cost more than anyone budgeted for. None of these failure modes are mysterious - they are predictable consequences of a small set of decisions made early and never revisited.
A production RAG architecture has 15+ components, not 4. The two highest-leverage decisions are chunking strategy and embedding model - and both fail silently if you do not run an evaluation harness from week one. Use hybrid retrieval (BM25 + vectors) plus a cross-encoder re-ranker. Use RAG and fine-tuning together: RAG controls what the model knows, fine-tuning shapes how it behaves. Realistic timeline to production: 24 weeks for a focused team on a single bounded use case.
This guide is the practitioner's view of building a RAG architecture that survives contact with production: real users, real documents, real compliance teams, real money on the line. It is not a vendor explainer. It is what we wish someone had handed us before our first enterprise retrieval augmented generation programme.
Building or fixing a RAG system? Skip to the 24-week roadmap, or talk to our RAG engineering team.
Retrieval-Augmented Generation is a pattern in which a large language model is asked to answer questions using information it has retrieved at inference time from a separate corpus - typically the customer's own documents, knowledge base, code, transcripts or structured records - rather than relying solely on the parameters baked into the model during training.
The basic flow is short:
That definition hides almost every interesting engineering decision. The "small set of documents" is where the entire system lives or dies.
A serious enterprise RAG pipeline has more moving parts than the four-step flow suggests. For CTOs and engineering leads, the key point is that enterprise RAG architecture is not a single LLM call with a vector database attached. It is a retrieval, governance and evaluation system where every layer affects answer quality, latency, cost and compliance. The RAG components below are not optional - skipping any of them is what produces a demo. Including them is what produces a system.
A RAG architecture missing any one of these in production is not a missing optimisation - it is a load-bearing gap that will eventually fail in a way that gets attention.

A RAG pipeline is the end-to-end engineering system that turns a corpus of source documents into grounded LLM answers. It typically has two flows: an offline indexing pipeline (ingest, parse, chunk, embed, index) that runs as a batch on document changes, and an online query pipeline (rewrite, hybrid retrieve, re-rank, assemble context, generate, cite) that responds to user requests under strict latency budgets. A production RAG pipeline architecture also includes an evaluation harness, observability, governance, and access control propagation - without these, the system is a demo, not infrastructure.
There is no single best vector database for RAG - the choice depends on scale, existing infrastructure and operational model. Pinecone and Weaviate are mature managed options. Qdrant and Milvus are strong self-hosted choices. pgvector is the pragmatic answer when you already run Postgres and your corpus fits in a single node. Elasticsearch and Vespa make sense when lexical and vector search must live in the same index. For most mid-sized enterprise RAG systems, pgvector or Qdrant cover 80% of real requirements.
Reference RAG architecture used by Lasting Dynamics across enterprise engagements.
This RAG pipeline architecture separates the offline indexing flow from the online query flow. The diagram above maps the components into two flows: an offline indexing pipeline (ingest → parse → chunk → embed → index into the vector store) and an online query pipeline (query rewrite → hybrid retrieve → re-rank → context assemble → generate → cite). Around both sits the evaluation harness, observability stack, and governance layer.
In practice we run the offline indexing as an idempotent batch process triggered by document change events, with full re-index runs scheduled weekly to absorb embedding model upgrades and chunking strategy revisions. The online path runs as a low-latency service with explicit cost and latency budgets per stage.
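One way to keep the indexing batch idempotent is to key each document by a content hash and only reprocess what changed. The sketch below assumes an in-memory mapping of document IDs to text and a registry of hashes saved by the previous run; `plan_reindex` and `apply_plan` are hypothetical names for illustration, not the API of any particular vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Diff the current documents against the hashes recorded at the last run."""
    upsert = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed.get(doc_id) != content_hash(text)
    )
    delete = sorted(doc_id for doc_id in indexed if doc_id not in documents)
    return {"upsert": upsert, "delete": delete}


def apply_plan(documents: dict[str, str], indexed: dict[str, str],
               plan: dict[str, list[str]]) -> dict[str, str]:
    """Return the new hash registry after the plan has been executed."""
    updated = {k: v for k, v in indexed.items() if k not in plan["delete"]}
    for doc_id in plan["upsert"]:
        # A real pipeline would parse, chunk, embed and write to the index here.
        updated[doc_id] = content_hash(documents[doc_id])
    return updated


docs = {"policy-1": "Refunds within 30 days.", "policy-2": "Shipping takes 5 days."}
registry = {"policy-1": content_hash("Refunds within 14 days."), "old-doc": "stale"}

first_plan = plan_reindex(docs, registry)
registry = apply_plan(docs, registry, first_plan)
second_plan = plan_reindex(docs, registry)  # re-running on unchanged input is a no-op
```

A full re-index, for instance after an embedding model upgrade, would start from an empty registry so that every document lands in the upsert list.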
The most common architectural question we get asked is the RAG vs fine-tuning question. It is usually framed as a binary, which it should not be. Most serious production systems use both, but for different problems.
| Aspect | RAG | Fine-tuning |
|---|---|---|
| What it changes | What the model knows at inference time. | How the model behaves - its style, format, reasoning patterns. |
| Update cycle | Hours. Add a document, re-index, it is available. | Days to weeks. Requires curated training data and a training run. |
| Cost profile | Higher per-query cost (extra retrieval + larger prompts). | Higher upfront cost; lower per-query cost. |
| Best for | Frequently changing knowledge, proprietary documents, citations required. | Domain-specific tone, structured output formats, narrow task patterns. |
| Hallucination control | Strong - citations make answers verifiable. | Weaker - model still generates from parameters. |
| Compliance fit | High - knowledge stays in your systems, easier to redact and audit. | Medium - training data is absorbed into model weights. |
| Typical failure mode | Bad retrieval → confidently wrong answers built on context that does not contain the answer. | Catastrophic forgetting, distribution shift, costly to roll back. |
If a RAG system fails, the cause is almost always in the indexing pipeline, not the generation step. And inside indexing, the two highest-leverage levers are chunking and embeddings.
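The unit of retrieval matters more than most teams realise. We use four patterns depending on the corpus: fixed-size chunks with overlap (400–800 tokens, 50–100 tokens of overlap) for prose; semantic chunking for higher recall on mixed content; hierarchical (parent-child) chunking for technical and regulatory documents; and structure-aware chunking for code, contracts, tables and EHR records.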
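As an illustration of the simplest chunking pattern, fixed-size chunks with overlap can be sketched as a sliding window. This is a minimal sketch: it treats whitespace-separated words as a stand-in for model tokens (a real pipeline would use the embedding model's tokenizer), and the default `size` and `overlap` are illustrative values, not recommendations.

```python
def chunk_with_overlap(tokens: list[str], size: int = 600, overlap: int = 75) -> list[list[str]]:
    """Split a token sequence into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


# Toy input: ten "tokens", windows of four with one token shared between neighbours.
words = "a b c d e f g h i j".split()
windows = chunk_with_overlap(words, size=4, overlap=1)
```

The overlap exists so that a sentence cut at a window boundary still appears whole in at least one chunk; the price is a slightly larger index.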
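The default is usually OpenAI text-embedding-3-large or a comparable Cohere, Voyage or open-source model. The decision factors are multilingual coverage (most English-tuned models degrade on Italian, Spanish, German and Arabic), domain fit (medical, legal and code corpora often need domain-tuned or contrastively fine-tuned models), dimensionality and cost (1024–3072 dimensions is typical; storage and query cost scale linearly), and a plan for periodic re-embedding as models are upgraded.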
The silent failure is to pick an embedding model in week 1, chunk documents at 1000 tokens, and never measure whether retrieval is actually finding the right chunks. Six months later, the system "works" - it returns confident answers - and most of them are wrong.
A naive RAG system does pure top-k cosine similarity over a single vector index. A production system does much more.
Lexical search catches rare terms, names, IDs, and abbreviations that vector embeddings smooth over. Vector search catches semantic matches that BM25 misses. Reciprocal rank fusion or a learned blender combines them.
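Reciprocal rank fusion is simple enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of chunk IDs, and uses the commonly cited smoothing constant `k = 60`; the IDs and rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]      # lexical ranking (hypothetical)
vector_hits = ["b", "d", "a"]    # embedding ranking (hypothetical)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because the fusion only looks at ranks, it needs no score calibration between BM25 and cosine similarity, which is a large part of its appeal as a first blender.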
Restrict retrieval by date, document type, jurisdiction, access level, language. Metadata filtering at the vector database level is often the single biggest quality improvement and a hard compliance requirement.
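The filtering logic can be sketched as follows. In production the filter would normally be pushed down into the vector database query rather than applied in Python; the in-memory version below is only meant to show the shape of the predicate, and the `Chunk` fields and `filter_chunks` name are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_type: str
    jurisdiction: str
    language: str
    published: date
    allowed_groups: frozenset  # groups permitted to read the source document


def filter_chunks(chunks: list, user_groups: set,
                  doc_type: Optional[str] = None,
                  jurisdiction: Optional[str] = None,
                  language: Optional[str] = None,
                  published_after: Optional[date] = None) -> list:
    """Keep only chunks the user may see and that match every supplied filter."""
    kept = []
    for c in chunks:
        if not (c.allowed_groups & user_groups):
            continue  # access control: user shares no group with this chunk
        if doc_type is not None and c.doc_type != doc_type:
            continue
        if jurisdiction is not None and c.jurisdiction != jurisdiction:
            continue
        if language is not None and c.language != language:
            continue
        if published_after is not None and c.published <= published_after:
            continue
        kept.append(c)
    return kept


corpus = [
    Chunk("c1", "policy", "EU", "en", date(2025, 3, 1), frozenset({"support"})),
    Chunk("c2", "policy", "US", "en", date(2025, 6, 1), frozenset({"support"})),
    Chunk("c3", "contract", "EU", "en", date(2025, 5, 1), frozenset({"legal"})),
    Chunk("c4", "policy", "EU", "en", date(2023, 1, 1), frozenset({"support", "legal"})),
]
visible = filter_chunks(corpus, user_groups={"support"}, jurisdiction="EU",
                        published_after=date(2024, 1, 1))
visible_ids = [c.chunk_id for c in visible]
```

Applying the access check before similarity search, rather than after generation, is what keeps restricted text out of the prompt in the first place.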
Query rewriting covers three moves: expand short, ambiguous queries; decompose multi-part questions; and translate non-English queries when the corpus is in English.
HyDE (Hypothetical Document Embeddings) has the LLM generate a hypothetical answer to the question, embeds that answer, and retrieves against the resulting embedding instead of the raw query.
A cross-encoder re-ranker (Cohere Rerank, BGE reranker, mxbai-rerank) re-scores the top 50–100 candidates and keeps the top 5–10. Latency cost is real but quality gain is large.
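The re-ranking stage itself is a small piece of plumbing around the scoring model. The sketch below keeps the scorer pluggable: `score_pair` is where a cross-encoder call would go, and `toy_overlap_score` is a deliberately crude word-overlap stand-in so the example runs without any model. Neither function name comes from a real library.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Score every (query, passage) pair and keep the `keep` best passages."""
    scored = [(score_pair(query, text), idx, text) for idx, text in enumerate(candidates)]
    # Sort by score descending; the original index keeps ties stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "shipping times in asia",
    "refund policy for europe customers",
    "refund request form",
]
top = rerank("refund policy europe", candidates, toy_overlap_score, keep=2)
```

Since the scorer runs once per candidate, the candidate count (50–100 in the text above) is the main latency dial for this stage.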
For complex questions, run retrieval multiple times, refining the query each round. This is where RAG starts overlapping with agent patterns.
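A minimal sketch of that loop, assuming the retriever and the query-refinement step are injected as plain callables (in a real system `refine` would typically be an LLM call). The function names, the round limit and the toy index are all hypothetical.

```python
from typing import Callable, Optional


def iterative_retrieve(question: str,
                       retrieve: Callable[[str], list[str]],
                       refine: Callable[[str, list[str]], Optional[str]],
                       max_rounds: int = 3) -> list[str]:
    """Retrieve, let `refine` propose a follow-up query, and repeat until it returns None."""
    collected: list[str] = []
    query: Optional[str] = question
    for _ in range(max_rounds):
        if query is None:
            break  # the refiner decided the evidence is sufficient
        for chunk in retrieve(query):
            if chunk not in collected:
                collected.append(chunk)
        query = refine(question, collected)
    return collected


# Toy two-hop example: the second query depends on what the first one found.
toy_index = {
    "who runs acme": ["Acme's CEO is Dana."],
    "where was Dana born": ["Dana was born in Oslo."],
}


def toy_retrieve(query: str) -> list[str]:
    return toy_index.get(query, [])


def toy_refine(question: str, evidence: list[str]) -> Optional[str]:
    return "where was Dana born" if len(evidence) == 1 else None


evidence = iterative_retrieve("who runs acme", toy_retrieve, toy_refine)
```

The hard cap on rounds matters in practice: without it, a refiner that never returns `None` turns one user question into an unbounded number of retrieval and LLM calls.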
The fastest way to spot a failing RAG programme is to ask the team how they evaluate it. If the answer is "users have not complained," the system is not being run; it is being abandoned to its users.
A useful evaluation stack has three layers:
Retrieval evaluation: for a curated set of questions, do the retrieved chunks actually contain the answer? Metrics: recall@k, mean reciprocal rank, context relevance.
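Two of these metrics, recall@k and mean reciprocal rank, are easy to compute once each question in the curated set is labelled with the chunk IDs that contain its answer. The sketch below assumes exactly that labelling; the chunk IDs and the two example runs are invented.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Each run: (ranked chunk IDs returned by the retriever, labelled relevant IDs).
runs = [
    (["c1", "c2", "c3"], {"c2"}),
    (["c9", "c8", "c7"], {"c7", "c5"}),
]
first_recall = recall_at_k(*runs[0], k=2)
second_recall = recall_at_k(*runs[1], k=3)
mrr = mean_reciprocal_rank(runs)
```

Tracking these per question, not only as an average, makes it much easier to see which document types or query styles the retriever is failing on.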
Generation evaluation: given the retrieved context, is the model's answer faithful (no hallucination), complete and relevant? Frameworks like RAGAS, DeepEval, TruLens and LangSmith provide LLM-as-judge implementations.
End-to-end evaluation: real-world user satisfaction, task completion and escalation rates. This is the metric the business actually cares about.
Hallucination is rarely a model problem; it is usually a retrieval problem. When the retrieved context actually contains the answer, modern frontier LLMs answer faithfully the great majority of the time (public benchmarks like RAGAS faithfulness, HaluEval and Vectara's hallucination leaderboard all converge on this). When the context does not contain the answer, the model answers plausibly and wrongly. The fix is almost always in retrieval, not in the prompt.
RAG by itself is not an AI agent. A standard RAG system retrieves context and generates an answer in a single, deterministic step. An AI agent is an LLM-driven system that can choose between actions, call tools, plan multi-step workflows and decide when it has finished. RAG is one of the most common tools used inside agents - an agent might call a retrieval step several times during a single task - but a system that only retrieves once and answers is a RAG pipeline, not an agent.
The 2026 trend is to embed RAG inside agent architectures: an agent decides what to retrieve, refines its query, may call multiple retrievers (documents, structured database, web), and may take actions based on what it finds. For the architectural patterns there, see our coverage of AI agents in the enterprise.
Running a RAG pilot that is not converging? Our two-week architecture review benchmarks your retrieval quality and identifies the top three failure modes - usually in chunking, embeddings, or the absence of an evaluation harness. Request the review →

The exact sequence changes by corpus, compliance constraints and team maturity, but the first production system usually follows the same pattern: narrow the use case, build the indexing pipeline, measure retrieval quality, add guardrails, pilot with friendly users and only then roll out broadly. A realistic roadmap for the first production RAG system inside a mid-to-large enterprise:
Weeks 1–3, use case framing: pick a single use case with clear value, a bounded corpus and a tolerant failure mode. An internal knowledge base for support agents - the classic RAG chatbot enterprise use case - is a good starter project. Regulated customer-facing Q&A is not.
Weeks 3–6, data and ingestion: connect to source systems, parse, chunk, embed and index into the vector store. Build a baseline that retrieves something.
Weeks 6–10, quality loop: stand up the evaluation harness. Measure retrieval quality. Tune chunking, embeddings, hybrid retrieval and re-ranking against the eval set.
Weeks 10–14, generation and guardrails: prompt design, structured output, citation rendering, refusal logic, PII handling and access control propagation.
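Refusal logic and citation rendering can share one small gate in front of the generation step. The sketch below assumes the re-ranker returns relevance scores in the range 0 to 1; the `min_score` threshold, the refusal message and the `Retrieved` type are illustrative assumptions that would need tuning against the evaluation set.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    chunk_id: str
    text: str
    score: float  # re-ranker relevance, assumed to lie in [0, 1]


REFUSAL = "I could not find this in the available documents."


def build_context_or_refuse(results: list, min_score: float = 0.5,
                            max_chunks: int = 5) -> dict:
    """Pass only sufficiently relevant chunks to generation; refuse if none qualify."""
    strong = sorted(
        (r for r in results if r.score >= min_score),
        key=lambda r: -r.score,
    )[:max_chunks]
    if not strong:
        return {"refused": True, "message": REFUSAL, "citations": []}
    # Prefix each chunk with its ID so the model can cite it in the answer.
    context = "\n\n".join(f"[{r.chunk_id}] {r.text}" for r in strong)
    return {"refused": False, "context": context,
            "citations": [r.chunk_id for r in strong]}


weak = build_context_or_refuse([Retrieved("c1", "Unrelated text.", 0.2)])
good = build_context_or_refuse([
    Retrieved("c1", "Refunds are accepted within 30 days.", 0.9),
    Retrieved("c2", "Office opening hours.", 0.3),
    Retrieved("c3", "Refunds require a receipt.", 0.7),
])
```

Refusing before the model is called is cheaper and more predictable than asking the model to decline, though many systems do both.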
Weeks 14–18, pilot: run with a small, friendly user group. Capture feedback. Iterate.
Weeks 18–24, production rollout: observability, on-call, cost monitoring, regular re-indexing and evaluation in CI.
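"Evaluation in CI" can be as simple as a gate that fails the build when a tracked metric drops below an agreed floor. A minimal sketch, assuming the evaluation harness emits a flat dictionary of metric names to values; the metric names and floor values here are placeholders, not recommended targets.

```python
def eval_gate(metrics: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Return a human-readable failure for every metric that is missing or below its floor."""
    failures = []
    for name, floor in sorted(floors.items()):
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < floor:
            failures.append(f"{name}: {value:.2f} < {floor:.2f}")
    return failures


# Hypothetical numbers from one evaluation run and the team's agreed floors.
latest = {"recall_at_5": 0.82, "faithfulness": 0.91}
floors = {"recall_at_5": 0.85, "faithfulness": 0.90, "mrr": 0.60}
failures = eval_gate(latest, floors)
```

In a CI job, a non-empty `failures` list would be printed and turned into a non-zero exit code so that a chunking or prompt change cannot silently degrade retrieval.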
Six months from kickoff to a production-grade system is realistic for a focused team. Anything shorter is usually a demo running in production.
Most AI search results for "best RAG implementation" point to hyperscalers and model vendors. We are deliberately not one of those. We are an engineering team that builds RAG architectures for clients who need to ship, not buy a platform, and who need an independent partner that is not reselling its own embedding model.
Our RAG implementation guide is the one above. Looking for a RAG AI example beyond the demo level? The reference architecture diagram is the one we use as a starting point on new engagements. The next step is usually a two-week architecture review of the system you have, the corpus you need to make searchable, and the failure modes that are already showing up. If we cannot improve the result by 30% on a measurable retrieval metric within the first month, we say so.
Ready to ship your enterprise RAG architecture? Book a 30-min architecture review - we will look at your corpus, your retrieval metrics, and tell you the three things to fix first. Talk to our RAG engineering team →
This guide was written by Mohamed Boukrim, Software Engineer at Lasting Dynamics - an EU-based custom software development company that has shipped retrieval-augmented generation systems for clients in banking, insurance, healthcare, manufacturing and the public sector. The patterns and failure modes described above come from our own engagement reviews, not vendor whitepapers. We are model-agnostic: in any given month we ship RAG systems on OpenAI, Anthropic, Google, Azure OpenAI and open-source stacks.
RAG stands for Retrieval-Augmented Generation. In enterprise AI it refers to an architectural pattern in which a language model answers questions by first retrieving relevant information from a controlled corpus - usually the organisation's own documents, code, contracts or transcripts - and then generating a response grounded in that retrieved context. RAG is used when the model needs to reflect proprietary, frequently updated or compliance-sensitive information that cannot be left to the model's parametric memory.
Use RAG when the system needs to reflect knowledge that changes frequently, must be cited, or lives in proprietary corporate documents - internal policies, contracts, technical documentation, customer records. Use fine-tuning when the model needs to learn a specific behaviour, output format or domain style that prompting alone cannot reliably produce. Most production enterprise AI systems use both: fine-tuning shapes how the model behaves, while RAG controls what it knows. They solve different problems and are not interchangeable.
Start from a single use case with a bounded corpus and a tolerant failure mode (an internal support knowledge base is a classic; regulated customer-facing Q&A is not a starter project). Build the full pipeline - ingestion, parsing, chunking, embeddings, vector + lexical index, query rewriting, retriever, re-ranker, context assembly, generation, citations, evaluation harness, observability and governance - rather than skipping layers. Stand up the evaluation harness before tuning, measure retrieval quality first, and treat the embedding model and chunking strategy as live decisions to be revisited, not week-1 choices frozen forever.
A serious enterprise RAG pipeline has fifteen load-bearing layers: document ingestion connectors, parsing and normalisation, chunking, embedding generation, a vector store, a lexical/hybrid index, query understanding and rewriting, the retriever, a re-ranker, context assembly and compression, the generation step, citation and grounding, the evaluation harness, observability and feedback, and governance with PII handling and access control. Missing any one of them in production is not a missing optimisation - it is a load-bearing gap that will eventually fail in a way that gets attention.
Hallucination is rarely a model problem; it is almost always a retrieval problem. When the retrieved context actually contains the answer, modern frontier LLMs answer faithfully the great majority of the time. When the context does not contain the answer, the model answers plausibly and wrongly. Control it with hybrid retrieval (BM25 + vectors), metadata filtering, a cross-encoder re-ranker, citations linked to source chunks, refusal logic when retrieval confidence is below a threshold, and a three-layer evaluation stack (retrieval, generation, end-to-end) using frameworks like RAGAS, DeepEval, TruLens or LangSmith.
Pick chunking by corpus type: fixed-size with overlap (400–800 tokens, 50–100 overlap) for prose; semantic chunking for higher recall on mixed content; hierarchical/parent-child for technical and regulatory documents; structure-aware chunking for code, contracts, tables and EHR records. For embeddings, decide based on multilingual coverage (most English-tuned models degrade on Italian, Spanish, German, Arabic), domain fit (medical, legal and code corpora often need domain-tuned or contrastively fine-tuned models), dimensionality and cost (1024–3072 is typical; storage and query cost scale linearly), and a plan for periodic re-embedding as models get upgraded.
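The linear scaling of storage with dimensionality is easy to check with back-of-the-envelope arithmetic. The sketch below assumes vectors are stored as raw float32 values (4 bytes each) and ignores index overhead, metadata and replication; the one-million-chunk corpus is a hypothetical size chosen for round numbers.

```python
def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the embeddings alone: one value per dimension per chunk."""
    return num_chunks * dims * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


small = raw_vector_bytes(1_000_000, 1024)   # 4,096,000,000 bytes, about 3.81 GiB
large = raw_vector_bytes(1_000_000, 3072)   # 12,288,000,000 bytes, about 11.44 GiB
```

Tripling the dimensionality triples the raw footprint, and every periodic re-embedding pays that cost again, which is why the dimensionality choice belongs in the cost model rather than being left at the model's default.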
About 24 weeks (six months) for a focused team: weeks 1–3 use case framing, weeks 3–6 data and ingestion, weeks 6–10 the quality loop and evaluation harness, weeks 10–14 generation and guardrails (prompts, citations, refusal, PII, access control), weeks 14–18 pilot with a friendly user group, weeks 18–24 production rollout with observability, on-call, cost monitoring, regular re-indexing and evaluation in CI. Anything materially shorter is usually a demo running in production.
Mohamed Boukrim
I am a Software Engineer and Backend Developer with a relentless focus on software quality and robust architecture. I don't believe in shortcuts; outstanding results are the direct product of daily commitment and hard work. Alongside my team, I leverage Agile methodologies and continuous process evaluation to push boundaries and optimize performance. I approach every challenge with a highly competitive mindset: I work tirelessly to reach the top, and once there, I keep pushing to maintain the lead.