Workshop

Building
Agentic AI Systems
from Scratch

One Problem, Three Frameworks

LangChain · LangGraph · CrewAI

Instructor: Ruslan Magana Vsevolodovna | ruslanmv.com

What You Will Learn

🧠 Agentic AI Theory

What agents are, how ReAct works, and when to use multi-agent systems

🛠️ Three Frameworks

Build the same system in LangChain, LangGraph, and CrewAI — compare hands-on

🔧 Production Skills

PII masking, guardrails, testing, evaluation with precision/recall/F1

📊 Make the Decision

Run a framework comparison and pick the right tool for the job

Course Outline

01 Foundations — Spectrum, Memory, MCP/A2A, Orchestration

02 The Problem & Setup

03 Shared Modules — Schema, PII, Fallback, Routing

04 Agent Tools — Evidence over Guessing

05 LangChain — ReAct Agent

06 LangGraph — State Machine

07 CrewAI — Multi-Agent Crew

08 Testing & Quality

09 Evaluation & Metrics

10 Framework Verdict

01 Section 1

Agentic AI
Foundations

What is an agent? The spectrum from LLM to CoT to Agent. Memory, MCP, A2A, orchestration.

What Is an Agent?

A system that controls the flow, not just the output.

The agent decides what to do next — a regular LLM just gives you one answer.

The Spectrum: LLM → CoT → Agent

It is not binary. There is a spectrum of intelligence — understanding it helps you choose the right level for each problem.

Conventional LLM

A Large Language Model takes one prompt, produces one response. No tools, no memory, no iteration. Fast and cheap, but if it hallucinates, you have no safety net.

Best for: text generation, Q&A, summarisation

Chain-of-Thought (CoT)

The model generates its reasoning steps before answering. "The URL is suspicious… the tone is urgent… classic phishing." Much more accurate — but still reasoning in a vacuum. It cannot verify anything externally.

Used by: DeepSeek-R1, OpenAI o1/o3, Claude

Agentic AI

The LLM reasons and takes real actions: calling tools, checking databases, scanning URLs. It acts on the world and feeds results back into its reasoning, so decisions are based on evidence.

Best for: workflows, decisions, business impact

Rule of thumb: if the task needs external data or has business impact → use an agent. Pure text → LLM or CoT is enough.

The ReAct Pattern

Reason + Act — the loop at the heart of modern agents

🧠 REASON
"What next?"

→

⚡ ACT
Call a tool

→

👁️ OBSERVE
Check result

→

🔁 REPEAT
Until goal met

Reasoning is explicit, auditable, and debuggable — every step is logged.
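
The loop above can be sketched in a few lines. This is a minimal illustration, not the course implementation: scripted_policy stands in for the LLM and scan_urls_stub is a toy tool, and both names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy tool: flags any body containing "totallylegit" as malicious.
def scan_urls_stub(body: str) -> str:
    return "MALICIOUS" if "totallylegit" in body else "CLEAN"

TOOLS: Dict[str, Callable[[str], str]] = {"scan_urls": scan_urls_stub}

def scripted_policy(history: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM. Returns ("act", tool_name) or ("final", label)."""
    observations = [h for h in history if h.startswith("OBSERVE")]
    if not observations:
        return "act", "scan_urls"
    return "final", "phishing" if "MALICIOUS" in observations[-1] else "other"

def react_loop(email_body: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    history: List[str] = []                      # every step is logged here
    for _ in range(max_steps):
        kind, value = scripted_policy(history)   # REASON
        if kind == "final":
            history.append(f"OUTPUT {value}")
            return value, history
        history.append(f"ACT {value}")           # ACT
        result = TOOLS[value](email_body)
        history.append(f"OBSERVE {result}")      # OBSERVE
    return "other", history                      # step budget exhausted

label, trace = react_loop("Verify now: https://totallylegit.com/verify")
```

The history list is what makes the run auditable: it records each action and observation in order. The max_steps bound is an assumed safeguard against loops that never reach a final answer.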

Multi-Agent Systems

Multiple agents, each with a specialised role, collaborating to solve a problem

🏷️ CLASSIFIER
"What is this email?"

→

🛡️ RISK ANALYST
"Is this dangerous?"

→

📋 POLICY ROUTER
"What do we do?"

✅ Separation of concerns

Each agent has one job

✅ Extensibility

Add a new agent without rewriting

Memory & Context: How Agents Remember

Three layers of memory — each extends how far the agent can reach.

Short-Term Memory

The conversation itself. Each reasoning step, tool call, and observation is appended to the message history. The LLM sees everything from the current run inside its context window, a fixed-size buffer measured in tokens (one token is roughly 3/4 of an English word). GPT-4o-mini: 128k tokens. Claude: 200k tokens.

Long-Term Memory

LangGraph supports checkpointing — serialising the agent's full state to a persistent store. The agent can resume conversations or recall information from previous sessions. This is how you build agents that remember past interactions.

RAG + Vector Stores

An embedding is a numerical representation of text that captures meaning. A vector store (ChromaDB, Pinecone, FAISS) indexes embeddings for fast semantic search. The agent converts its query to an embedding, finds similar passages, and injects them into the prompt. This is Retrieval-Augmented Generation (RAG).

In this course we use short-term memory. For production, add checkpointing and RAG for knowledge that exceeds the context window.
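
The retrieval step can be illustrated with a toy example. This is only a sketch: toy_embed uses word counts as a stand-in for a real embedding model, and a plain Python list stands in for a vector store. All function names here are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, passages: List[str], k: int = 1) -> List[Tuple[float, str]]:
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(p)), p) for p in passages]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

passages = [
    "invoices are paid within 30 days of receipt",
    "phishing emails must be quarantined and reviewed",
]
question = "how are phishing emails handled"
top = retrieve(question, passages)

# Inject the best passage into the prompt (the "augmented" part of RAG).
prompt = f"Context: {top[0][1]}\nQuestion: {question}"
```

A real system would replace toy_embed with a learned embedding model and the list scan with an indexed vector store; the query, rank, inject flow stays the same.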

MCP & A2A: Connecting Agents

Two protocols power the agentic ecosystem. Understanding the difference is critical.

MCP — Model Context Protocol

An open standard (by Anthropic) for connecting agents to tools — passive functions that take input and return output. URL scanners, database queries, weather APIs. The MCP Server exposes tools; the MCP Client (the agent) discovers and calls them.

Transport: stdio (local subprocess) or HTTP for remote servers (SSE, Server-Sent Events, in the original specification; later revisions define Streamable HTTP).

A2A — Agent-to-Agent Protocol

An open protocol (introduced by Google) for connecting agents to other autonomous agents: systems that reason, use their own tools, and make their own decisions. The difference: MCP tools are passive functions. A2A agents are active reasoners.

Example: a coordinator delegates "find flights" to a Flight Agent that reasons about layovers, compares prices, and calls airline APIs via MCP.

Orchestration Patterns

Choosing the right pattern is often more important than choosing the right framework.

In this course: ReAct Loop (LangChain), DAG (LangGraph), Sequential (CrewAI).

Key Takeaway

An agent is a system that reasons, acts, and iterates toward a goal using tools and state, rather than producing a single response. It manages memory through context windows, checkpoints, and vector stores. It connects to tools via MCP and to other agents via A2A. The orchestration pattern — sequential, DAG, ReAct, hierarchical, or routing — determines how much control vs. flexibility the system has.

02 Section 2

The Problem &
Project Setup

Enterprise email classification — one problem, three frameworks.

The Problem: Email Triage

Classify incoming emails and route them to the right action.

Category	Example	Action
phishing	"Verify your account immediately"	Quarantine + review
spam	"Limited time! Win a free iPhone"	Quarantine
invoice	"Invoice #2026-042 — payment due"	Accounts payable
meeting	"Team sync Thursday 10 AM"	Calendar suggestion
support	"Ticket #5432 — production outage"	Support ticket
other	Everything else	Inbox

The Pipeline

Shared by all three approaches

📧 Preprocess
PII (Personally Identifiable Information) masking

→

🤖 Classify
LLM / agent

→

🛡️ Guardrails
Business rules

→

🚦 Route
Action

Clone & Install

git clone https://github.com/ruslanmv/agentic-ai-concepts.git
cd agentic-ai-concepts

python3 -m venv .venv
source .venv/bin/activate

make install          # pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."

Verify the setup:

make test             # 80 offline tests — no API key needed
make evaluate         # baseline evaluation against golden dataset

Project Structure

agentic-ai-concepts/
├── src/
│   ├── schema.py              # Pydantic models
│   ├── preprocessing.py       # PII masking
│   ├── fallback.py            # Keyword fallback
│   ├── routing.py             # Label → action
│   ├── tools.py               # Agent tools
│   ├── evaluate.py            # Metrics + prod gate
│   ├── langchain_agent.py     # Approach 1
│   ├── langgraph_agent.py     # Approach 2
│   └── crewai_agent.py        # Approach 3
├── data/golden_dataset.csv    # 30 labelled emails
├── tests/                     # 80 offline + 12 integration
└── examples/
    ├── run_all.py
    └── compare_frameworks.py  # Side-by-side verdict

03 Section 3

Building the
Shared Modules

Schema, PII preprocessing, keyword fallback, routing.

Module 1 — Schema

The contract all three frameworks must produce.

from enum import Enum
from typing import List

from pydantic import BaseModel, confloat

class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM     = "spam"
    INVOICE  = "invoice"
    MEETING  = "meeting"
    SUPPORT  = "support"
    OTHER    = "other"

class EmailClassification(BaseModel):
    label: EmailLabel
    confidence: confloat(ge=0.0, le=1.0)
    rationale: str
    indicators: List[str] = []
    requires_human_review: bool = False

Pydantic enforces the contract — invalid LLM output fails fast.

Module 2 — PII Preprocessing

Replace sensitive data before sending to the LLM.

_PII_PATTERNS = OrderedDict(
    SSN=re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    CREDIT_CARD=re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
    EMAIL=re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    IBAN=re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    PHONE=re.compile(r"\b(\+?\d[\d\s\-\(\)]{7,}\d)\b"),
)

def mask_pii(text: str) -> str:
    for tag, pattern in _PII_PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

>>> mask_pii("Contact alice@example.com, SSN 123-45-6789")
'Contact [EMAIL], SSN [SSN]'

Module 3 — Keyword Fallback

Deterministic safety net — always have a fallback for any probabilistic component.

_KEYWORD_MAP = {
    EmailLabel.PHISHING: ["verify", "password", "urgent", "suspend", 
                          "locked", "click", "link", "account"],
    EmailLabel.INVOICE:  ["invoice", "payment", "remittance", "iban", "vat"],
    EmailLabel.MEETING:  ["meeting", "calendar", "invite", "zoom", "agenda"],
    EmailLabel.SUPPORT:  ["ticket", "issue", "bug", "incident", "outage"],
    EmailLabel.SPAM:     ["unsubscribe", "promotion", "deal", "free", "win"],
}

def keyword_fallback(subject, body) -> EmailClassification:
    # Count keyword hits per category → pick highest
    # Confidence capped at 0.8 (honest about limitations)
    ...
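
The two comments in the stub describe the idea: count keyword hits, pick the highest, cap confidence at 0.8. The following is one possible sketch, not the repository's code. It uses a trimmed keyword map with plain strings, a hypothetical FallbackResult dataclass in place of EmailClassification, and an assumed scoring formula.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Trimmed copy of the keyword map, with plain strings instead of EmailLabel.
KEYWORDS: Dict[str, List[str]] = {
    "phishing": ["verify", "password", "urgent", "suspend"],
    "invoice": ["invoice", "payment", "remittance"],
    "spam": ["unsubscribe", "promotion", "free", "win"],
}

@dataclass
class FallbackResult:  # hypothetical stand-in for EmailClassification
    label: str
    confidence: float
    indicators: List[str] = field(default_factory=list)

def keyword_fallback_sketch(subject: str, body: str) -> FallbackResult:
    text = f"{subject} {body}".lower()
    # Substring matching is crude: "win" would also match "window".
    hits = {label: [k for k in kws if k in text] for label, kws in KEYWORDS.items()}
    best = max(hits, key=lambda label: len(hits[label]))
    if not hits[best]:
        return FallbackResult("other", 0.3)      # assumed default confidence
    # Assumed scoring: 0.4 base plus 0.2 per hit, capped at 0.8.
    confidence = min(0.8, 0.4 + 0.2 * len(hits[best]))
    return FallbackResult(best, confidence, hits[best])

result = keyword_fallback_sketch("URGENT: verify your password", "Click the link")
```

Only the 0.8 cap comes from the slide; the base value and per-hit increment are placeholders to tune against the golden dataset.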

Module 4 — Routing

Map classification → downstream action. Human review always takes priority.

class RouteAction(str, Enum):
    HUMAN_REVIEW = "queue_for_human_review"
    AP_QUEUE     = "send_to_ap_queue"
    CALENDAR     = "create_calendar_suggestion"
    TICKET       = "create_support_ticket"
    QUARANTINE   = "quarantine"
    INBOX        = "inbox"

def route(classification: EmailClassification) -> RouteAction:
    if classification.requires_human_review:
        return RouteAction.HUMAN_REVIEW    # always takes priority
    return _LABEL_TO_ACTION.get(classification.label, RouteAction.INBOX)
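
The route function relies on a _LABEL_TO_ACTION table that is not shown. A plausible version, inferred from the category/action table in Section 2 and not copied from the repository, is sketched below. The Classification dataclass is a hypothetical stand-in for EmailClassification.

```python
from dataclasses import dataclass
from enum import Enum

class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM = "spam"
    INVOICE = "invoice"
    MEETING = "meeting"
    SUPPORT = "support"
    OTHER = "other"

class RouteAction(str, Enum):
    HUMAN_REVIEW = "queue_for_human_review"
    AP_QUEUE = "send_to_ap_queue"
    CALENDAR = "create_calendar_suggestion"
    TICKET = "create_support_ticket"
    QUARANTINE = "quarantine"
    INBOX = "inbox"

# Assumed mapping, inferred from the category/action table.
LABEL_TO_ACTION = {
    EmailLabel.PHISHING: RouteAction.QUARANTINE,
    EmailLabel.SPAM: RouteAction.QUARANTINE,
    EmailLabel.INVOICE: RouteAction.AP_QUEUE,
    EmailLabel.MEETING: RouteAction.CALENDAR,
    EmailLabel.SUPPORT: RouteAction.TICKET,
    EmailLabel.OTHER: RouteAction.INBOX,
}

@dataclass
class Classification:  # hypothetical stand-in for EmailClassification
    label: EmailLabel
    requires_human_review: bool = False

def route(classification: Classification) -> RouteAction:
    if classification.requires_human_review:
        return RouteAction.HUMAN_REVIEW          # always takes priority
    return LABEL_TO_ACTION.get(classification.label, RouteAction.INBOX)
```

Because the guardrails always set requires_human_review for phishing, the phishing entry in the table would only apply if that guardrail were bypassed.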

04 Section 4

Agent Tools

Giving the agent access to real data — evidence over guessing.

Why Tools Matter

🔍 Real data

The agent checks facts instead of hallucinating

📋 Evidence-based

Classification backed by tool results

📝 Audit trail

Every tool call is logged — you know why

🧠 Agent decides

LLM picks which tools to call per email

Four Tools

check_sender_reputation

Takes a domain name and returns a risk score. Is this sender known to be malicious?

Triggered by: suspicious sender

scan_urls

Extracts all URLs from the email body and checks them against known malicious patterns.

Triggered by: links in body

lookup_known_contacts

Checks if the sender is in our internal contact list — a known, trusted colleague or vendor.

Triggered by: sender email

check_invoice_registry

Validates whether an invoice number matches a known record in accounts payable.

Triggered by: invoice number

Simulated databases for offline testing. In production, swap for real API (Application Programming Interface) calls.

Tool Example: scan_urls

import re
from langchain_core.tools import tool  # assumed import path for the @tool decorator

@tool
def scan_urls(email_body: str) -> str:
    """Scan all URLs in an email body for malicious indicators."""
    url_pattern = re.compile(r"https?://[^\s<>\"']+")
    urls = url_pattern.findall(email_body)

    if not urls:
        return "No URLs found. URL risk: NONE."

    results = []
    for url in urls:
        is_malicious = any(
            re.search(p, url.lower())
            for p in _MALICIOUS_URL_PATTERNS
        )
        if is_malicious:
            results.append(f"  {url} → MALICIOUS (risk: 0.9)")
        ...

    return f"Found {len(urls)} URL(s):\n" + "\n".join(results)

The @tool decorator makes it callable by the LLM agent.

Different Emails → Different Tools

05 Section 5 — Approach 1

LangChain
ReAct Agent

The LLM decides which tools to call. Evidence-based classification.

The ReAct Loop in Action

Email arrives: "URGENT: Verify your account" + URL: https://totallylegit.com/verify
REASON: "Suspicious URL + urgency → let me scan it"
ACT: scan_urls(body)
OBSERVE: "MALICIOUS URL detected (risk: 0.90)"
REASON: "Malicious URL confirmed. Check the sender too."
ACT: check_sender_reputation("totallylegit.com")
OBSERVE: "HIGH RISK domain (score: 0.92)"
REASON: "Both tools confirm phishing. High confidence."
OUTPUT: {"label": "phishing", "confidence": 0.95, ...}

Building the Agent

from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from src.tools import ALL_TOOLS

def _build_agent():
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

    agent = create_react_agent(
        model=llm,
        tools=ALL_TOOLS,      # 4 tools from src/tools.py
        prompt=SYSTEM_PROMPT,
    )
    return agent

That's it — create_react_agent handles the entire Reason → Act → Observe loop.

System Prompt

SYSTEM_PROMPT = """
You are an enterprise email triage agent.
Classify emails by gathering evidence using your tools.

## Workflow
1. Analyse the email for signals
2. Use tools to gather evidence:
   - URLs or suspicious links → scan_urls
   - Sender domain → check_sender_reputation
   - Sender email → lookup_known_contacts
   - Invoice number → check_invoice_registry
3. Classify: phishing/spam/invoice/meeting/support/other
4. Return JSON: label, confidence, rationale, indicators

## Rules
- Phishing → always requires_human_review = true
- Base confidence on tool evidence, not gut feeling
"""

Guardrails

Applied after the agent produces a classification.

CONFIDENCE_THRESHOLD = 0.6

def apply_guardrails(classification, subject, body):
    # Hard rule: phishing → always flag, cap confidence
    if classification.label == EmailLabel.PHISHING:
        classification.requires_human_review = True
        classification.confidence = min(classification.confidence, 0.85)
        return classification

    # Soft rule: low confidence → deterministic fallback
    if classification.confidence < CONFIDENCE_THRESHOLD:
        return keyword_fallback(subject, body)

    return classification

Run It

make run-langchain
# or: python -m src.langchain_agent

Subject: URGENT: Verify your account now
Label: phishing
Confidence: 0.85
Action: queue_for_human_review
Review: True
Tools used: ['scan_urls', 'check_sender_reputation']

The tools_used field is the audit trail — you know exactly why the agent decided.

06 Section 6 — Approach 2

LangGraph
State Machine

Explicit graph. Typed state. Conditional edges. Bank-grade auditability.

Architecture

Typed State

from typing import Optional, TypedDict

class GraphState(TypedDict):
    """Typed state flowing through the graph; each node returns only the keys it updates."""
    subject: str
    body: str
    masked_body: str
    classification: Optional[EmailClassification]
    action: Optional[str]

No hidden mutations

Every field is typed and explicit

Auditable

Compliance can review the full state at any node

Node Functions

Each node is independently testable.

def preprocess_node(state: GraphState) -> dict:
    sanitised = preprocess_email(state["subject"], state["body"])
    return {"masked_body": sanitised["body"]}

def classify_node(state: GraphState) -> dict:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    classifier = llm.with_structured_output(EmailClassification)
    result = (prompt | classifier).invoke({
        "subject": state["subject"], "body": state["masked_body"],
    })
    return {"classification": result}

def guardrails_node(state: GraphState) -> dict:
    result = state["classification"]
    if result.label == EmailLabel.PHISHING:
        result.requires_human_review = True
        result.confidence = min(result.confidence, 0.85)
    return {"classification": result}

Building the Graph

from langgraph.graph import END, StateGraph

def build_graph():
    graph = StateGraph(GraphState)

    # Nodes
    graph.add_node("preprocess", preprocess_node)
    graph.add_node("classify",   classify_node)
    graph.add_node("guardrails", guardrails_node)
    graph.add_node("fallback",   fallback_node)
    graph.add_node("route",      route_node)

    # Edges
    graph.set_entry_point("preprocess")
    graph.add_edge("preprocess", "classify")
    graph.add_edge("classify",   "guardrails")
    graph.add_conditional_edges("guardrails",
        decide_after_guardrails,
        {"fallback": "fallback", "route": "route"})
    graph.add_edge("fallback", "route")
    graph.add_edge("route", END)

    return graph.compile()
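
The conditional edge depends on decide_after_guardrails, which is not shown. One plausible shape is sketched below. It assumes the function returns the edge key as a string and reuses the 0.6 confidence threshold from the guardrails slide. The Classification and State types are simplified stand-ins.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict

CONFIDENCE_THRESHOLD = 0.6  # same value as the guardrails slide

@dataclass
class Classification:  # hypothetical stand-in for EmailClassification
    label: str
    confidence: float

class State(TypedDict):
    classification: Optional[Classification]

def decide_after_guardrails(state: State) -> str:
    """Return the key of the next edge: "fallback" or "route"."""
    result = state["classification"]
    # Missing or low-confidence output goes to the deterministic fallback.
    if result is None or result.confidence < CONFIDENCE_THRESHOLD:
        return "fallback"
    return "route"

next_node = decide_after_guardrails(
    {"classification": Classification("invoice", 0.45)}
)
```

Because the decision is a pure function of the state, it can be unit-tested without building the graph or calling an LLM.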

Why LangGraph for Enterprise

Explicit edges

Every transition is declared and reviewable

Typed state

Every field is declared in a TypedDict, which static type checkers verify (Python does not enforce it at runtime)

Conditional branching

Logic is declared, not buried in if/else

Unit testable

Each node function is independently testable

07 Section 7 — Approach 3

CrewAI
Multi-Agent

Three agents collaborate: Classifier → Risk Analyst → Policy Router.

Three Agents, One Crew

🏷️ Classifier

"Senior email triage specialist at a Fortune 500 company"

Produces initial label + confidence

🛡️ Risk Analyst

"Cybersecurity analyst specialising in email threats"

Reviews for false negatives
Escalates suspicious items

📋 Policy Router

"Compliance officer — when in doubt, escalate"

Determines final action
Applies company policy

Agent Definition

from crewai import Agent, Crew, Process, Task

classifier_agent = Agent(
    role="Email Classifier",
    goal="Classify the email into one of: phishing, spam, "
         "invoice, meeting, support, other.",
    backstory="You are a senior email triage specialist "
              "at a Fortune 500 company.",
    verbose=False,
    allow_delegation=False,
)

# ... risk_agent, policy_agent defined similarly

crew = Crew(
    agents=[classifier_agent, risk_agent, policy_agent],
    tasks=[classify_task, risk_task, policy_task],
    process=Process.sequential,
)
result = crew.kickoff()

Trade-offs

Strengths

Clear role separation — each agent has one job. Easy to extend — add a fourth agent for compliance or translation without rewriting anything. Mirrors how human teams collaborate: analyst, reviewer, decision-maker.

Costs

Three LLM calls per email instead of one. More latency and higher API spend. For a single classification task the extra agents do not significantly improve accuracy — the overhead is not justified.

Best for complex problems that genuinely need collaborative multi-step reasoning across different roles.

08 Section 8

Testing &
Quality

80 offline tests. No API key needed. Every component covered.

Test Strategy

🟢 80 Offline Tests

PII masking patterns
Keyword fallback — all 6 categories
Routing — label-to-action mapping
Guardrails — phishing, thresholds
Tools — every function
Evaluation — metrics math, prod gate
JSON parsing — 3 extraction strategies

🔵 12 Integration Tests

Require OPENAI_API_KEY
End-to-end pipeline
Verify tool usage
Test all 3 frameworks

Separated by @pytest.mark.integration

Running Tests

# Offline only (fast, no API key)
make test

# Everything including live LLM
make test-all

tests/test_tools.py::TestScanUrls::test_malicious_url ............. PASSED
tests/test_evaluate.py::TestComputeMetrics::test_precision_recall .. PASSED
tests/test_fallback.py::TestKeywordFallback::test_phishing ........ PASSED
tests/test_routing.py::TestRouting::test_invoice_to_ap_queue ...... PASSED
...
========================= 80 passed, 12 deselected in 1.54s ============

09 Section 9

Evaluation
Before Production

Golden dataset. Precision & recall. Production readiness gate.

The Golden Dataset

data/golden_dataset.csv — 30 hand-labelled emails

Category	Easy	Medium	Hard	Total
phishing	1	2	2	5
spam	2	1	1	4
invoice	3	1	1	5
meeting	2	2	2	6
support	2	1	2	5
other	2	3	0	5

Hard samples: BEC wire transfer, legit security alert, ambiguous reply threads.

Precision vs Recall

Precision

Of everything flagged as X, how many actually were X?

Low precision = too many false alarms → alert fatigue

Recall

Of all actual X emails, how many did we catch?

Low recall = missed threats → security risk

F1 = harmonic mean of both. It balances precision and recall.
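
The three definitions can be written out directly from true-positive, false-positive, and false-negative counts. The function name and the counts below are illustrative assumptions (5 phishing emails caught, 5 false alarms, none missed).

```python
from typing import Tuple

def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    # Precision: of everything flagged as X, how many actually were X?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all actual X emails, how many did we catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=5, fp=5, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.50 recall=1.00 f1=0.67
```

Perfect recall with many false alarms still yields a mediocre F1, because the harmonic mean is pulled toward the lower of the two values.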

Production Readiness Gate

MIN_WEIGHTED_F1      = 0.70   # Overall performance
MIN_PHISHING_RECALL  = 0.80   # Must catch ≥ 80% of phishing
MIN_PHISHING_PREC    = 0.60   # Must not over-flag

Three checks. All must pass.

Weighted F1 ≥ 0.70

Overall quality

Phishing recall ≥ 0.80

Safety-critical

Phishing precision ≥ 0.60

Alert fatigue
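
The gate can be expressed as a small function. The three thresholds come from the slide; the function name production_gate and its return shape are assumptions. The sample figures are the keyword-fallback baseline reported in this section.

```python
from typing import Dict

MIN_WEIGHTED_F1 = 0.70      # overall performance
MIN_PHISHING_RECALL = 0.80  # must catch at least 80% of phishing
MIN_PHISHING_PREC = 0.60    # must not over-flag

def production_gate(weighted_f1: float,
                    phishing_recall: float,
                    phishing_precision: float) -> Dict[str, bool]:
    """Return one pass/fail flag per check; all three must be True."""
    return {
        "weighted_f1": weighted_f1 >= MIN_WEIGHTED_F1,
        "phishing_recall": phishing_recall >= MIN_PHISHING_RECALL,
        "phishing_precision": phishing_precision >= MIN_PHISHING_PREC,
    }

# Keyword-fallback baseline: weighted F1 0.67, phishing recall 1.00, precision 0.50.
baseline = production_gate(0.67, 1.00, 0.50)
ready = all(baseline.values())
```

Returning one flag per check, rather than a single boolean, keeps the failure reason visible in CI logs.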

Baseline Result: Keyword Fallback

make evaluate

Class	Precision	Recall	F1	Support
phishing	0.50	1.00	0.67	5
spam	0.75	0.75	0.75	4
invoice	0.83	1.00	0.91	5
meeting	0.86	1.00	0.92	6
support	1.00	0.60	0.75	5
other	0.00	0.00	0.00	5

Weighted F1: 0.67
Accuracy: 73.3%

Production Gate: ✗ FAIL — Not ready for production
✗ Weighted F1 = 0.67 < 0.70
✓ Phishing recall = 1.00 ≥ 0.80
✗ Phishing precision = 0.50 < 0.60

Keywords alone aren't enough. We need the LLM agents.

10 Section 10

Framework
Comparison &
The Verdict

Head-to-head results. Which framework wins?

Run the Comparison

export OPENAI_API_KEY="sk-..."
make compare

Runs all 30 golden samples through all 4 approaches.

# Or directly:
python examples/compare_frameworks.py --all

Side-by-Side Results

Metric	Fallback	LangChain	LangGraph	CrewAI
Accuracy	73.3%	~90% ★	~90%	~87%
Weighted F1	0.67	~0.90 ★	~0.90	~0.87
Phishing Recall	1.00	1.00	1.00	0.80
Phishing Precision	0.50	0.83	0.83	0.80
"other" F1	0.00	0.78	0.75	0.78
Time (30 emails)	0.0s	~45s	~13s	~68s
LLM calls/email	0	1 agent run (multiple LLM turns)	1	3
Prod Gate	FAIL	PASS	PASS	PASS

Numbers may vary slightly due to LLM non-determinism.

The Verdict by Criterion

Criterion	Winner	Why
Best accuracy	LangChain ≈ LangGraph	Both ~90%
Best speed	LangGraph	Single LLM call, ~3x faster
Best auditability	LangGraph	Explicit edges, typed state
Best safety	LangChain	Tool evidence = audit trail
Best cost	LangGraph	1 call vs multi-turn vs 3
Best extensibility	CrewAI	Adding an agent is trivial

🏆 Recommendation

LangGraph for Production

Best balance of accuracy, speed, cost, and auditability. Every edge is reviewable. Typed state. Independently testable nodes.

LangChain for Prototyping

Evidence-based reasoning with tool audit trail. Better at explaining decisions. Excellent for discovery.

CrewAI is the right choice when you need collaborative multi-step reasoning across truly different roles — but overkill for single classification.

Production Architecture

Saving & CI/CD (Continuous Integration / Continuous Delivery) Integration

# In your CI pipeline:
python examples/compare_frameworks.py --all \
    --output data/eval_results/comparison.json

# Parse the result (Python):
import json
with open('data/eval_results/comparison.json') as f:
    data = json.load(f)
winner = data.get('recommended')
print(f'Recommended: {winner}')
for a in data['approaches']:
    gate = '✓' if a['passed_production_gate'] else '✗'
    print(f"  {gate} {a['approach']}: F1={a['weighted_f1']:.4f}")

★ Course Wrap-Up

Key Takeaways
& Next Steps

What Makes an Agent Production-Ready

Explicit decision-making

The system reasons before acting

Controlled actions

Tools provide real data, not hallucinations

State and memory

Each step knows what came before

Safety and governance

Guardrails enforce business rules at every step

The One-Paragraph Summary

An agent is a system that reasons, acts, and iterates toward a goal using tools and state, rather than producing a single response. Multi-agent systems decompose complex workflows across specialised roles. LangChain enables ReAct-style agents with dynamic tool selection, LangGraph provides deterministic state-machine orchestration for enterprise workflows, and CrewAI enables collaborative multi-agent designs. Agentic systems trade simplicity for control, safety, and auditability.

What to Build Next

🔌 Real APIs

Replace simulated tools with real threat intelligence, CRM, and invoice APIs

💾 Memory

Use LangGraph checkpointing to persist state across sessions

🔄 Feedback loop

Let human reviewers correct classifications to improve the model

🚀 Deploy

Wrap classify_email in a FastAPI endpoint and run make compare in CI/CD

12 Section 12

Production Metrics,
Monitoring, RAG &
Context Engineering

Continuous evaluation · Latency · Concurrency · Semaphores · Huge contexts

12.1 — From Offline Gate to Production Dashboard

Offline evaluation

Golden dataset, precision, recall, F1, confusion matrix, judge scores.

Production evaluation

Real traffic, real latency, real failures, cost, drift, human corrections.

A production gate must pass on quality, latency, cost, safety, and graceful degradation.

12.2 — TRACE: What to Monitor

12.3 — Remember Precision, Recall, and F1

P = Predicted

Of what I predicted as positive, how much was correct?

R = Really positive

Of what was really positive, how much did I catch?

F = Fusion

F1 balances precision and recall into one score.

FP = False Panic · FN = Forgotten Need. Business risk decides which error matters most.

12.4 — Latency, Concurrency, and Semaphores

12.5 — Production Backpressure Pattern

accept request
  → bounded queue
  → acquire semaphore
  → expensive operation with timeout
  → retry only safe failures
  → fallback if needed
  → release semaphore
  → record metrics
  → return result

Use separate semaphores for LLM calls, retrieval calls, external tools, and database writes.
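
A minimal asyncio sketch of the core of this pattern: semaphore, timeout, and fallback. The bounded queue, retries, and metrics export are left out for brevity. call_llm_stub is a hypothetical stand-in for the expensive operation, and the limit of 2 concurrent calls is an arbitrary assumption.

```python
import asyncio
from typing import Dict, List

stats: Dict[str, int] = {"in_flight": 0, "peak": 0}

async def call_llm_stub(email: str) -> str:
    # Hypothetical expensive operation; tracks how many calls overlap.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return "other"

async def classify_with_backpressure(email: str,
                                     sem: asyncio.Semaphore,
                                     timeout_s: float = 1.0) -> str:
    async with sem:                              # acquire, then always release
        try:
            return await asyncio.wait_for(call_llm_stub(email), timeout_s)
        except asyncio.TimeoutError:
            return "fallback"                    # degrade instead of failing

async def main() -> List[str]:
    sem = asyncio.Semaphore(2)                   # at most 2 concurrent LLM calls
    emails = [f"email {i}" for i in range(5)]
    return await asyncio.gather(
        *(classify_with_backpressure(e, sem) for e in emails)
    )

results = asyncio.run(main())
```

Following the note above, a fuller version would hold one semaphore per resource type (LLM, retrieval, tools, database writes), so that a slow dependency cannot starve the others.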

12.6 — RAG for Huge Contexts

12.7 — CHUNK and H-R-R-R

CHUNK

Cut by meaning · Honor structure · Use overlap carefully · Name chunks with metadata · Keep parent context.

H-R-R-R

Hybrid search · Rewrite query · Rerank candidates · Reduce final context.

Rule: retrieve broadly, rerank carefully, pack narrowly.
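
Part of the CHUNK checklist can be sketched in code. The example covers splitting on paragraph boundaries, a one-paragraph overlap, and metadata on every chunk. The function name, the character budget, and the metadata keys are assumptions, and "policy.md" is a hypothetical document id.

```python
from typing import Dict, List

def chunk_by_paragraph(doc_id: str, text: str,
                       max_chars: int = 200, overlap: int = 1) -> List[Dict]:
    # Cut by meaning: paragraphs are the smallest unit and are never split.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[Dict] = []
    start = 0
    while start < len(paragraphs):
        end, size = start, 0
        # Grow the chunk until the budget is reached (always take one paragraph).
        while end < len(paragraphs) and (
            end == start or size + len(paragraphs[end]) <= max_chars
        ):
            size += len(paragraphs[end])
            end += 1
        chunks.append({
            "doc_id": doc_id,                    # name chunks with metadata
            "first_paragraph": start,
            "last_paragraph": end - 1,
            "text": "\n\n".join(paragraphs[start:end]),
        })
        if end >= len(paragraphs):
            break
        # Overlap: restart a little before the end, but always move forward.
        start = max(end - overlap, start + 1)
    return chunks

doc = "\n\n".join(["A" * 100, "B" * 100, "C" * 100])
chunks = chunk_by_paragraph("policy.md", doc)
```

The paragraph indices act as pointers back to the parent document, so a reranker or the final prompt can pull in surrounding context when a chunk alone is ambiguous.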

12.8 — Evaluate RAG in Two Halves

Retriever	Generator
Recall@K	Groundedness
Precision@K	Faithfulness
MRR / nDCG	Answer correctness
Context relevance	Citation accuracy + abstention quality

A wrong RAG answer is either missing evidence or ignoring good evidence. Measure both.
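
The retriever-side metrics are simple to compute once each query has a set of known relevant document ids. The sketch below uses made-up ids and assumes a single query. MRR is the mean of the reciprocal rank over many queries.

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Share of all relevant documents that appear in the top k.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Share of the top k results that are relevant.
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    # 1 / rank of the first relevant result; 0 if none was retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output (illustrative)
relevant = {"d1", "d2"}                # ground-truth relevant ids (illustrative)

r_at_3 = recall_at_k(retrieved, relevant, k=3)      # 1 of 2 relevant found
p_at_3 = precision_at_k(retrieved, relevant, k=3)   # 1 of 3 results relevant
rr = reciprocal_rank(retrieved, relevant)           # first hit at rank 2
```

Generator-side metrics such as groundedness and faithfulness usually need human or LLM judges and are not covered by this sketch.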

12.9 — STATE Context Compaction

S

Summary of goal

T

Tasks done and remaining

A

Assumptions and decisions

T

Trace pointers

E

Errors and risks

Summarize old tool results, drop redundant logs, keep source pointers, and retrieve fresh evidence when the question changes.
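
One way to make STATE concrete is a small record that replaces the raw history in the prompt. The class name, field names, and the trace path below are hypothetical; only the five categories come from the mnemonic.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CompactState:
    summary: str                                               # S: goal
    tasks_done: List[str] = field(default_factory=list)        # T: done
    tasks_remaining: List[str] = field(default_factory=list)   # T: remaining
    assumptions: List[str] = field(default_factory=list)       # A: decisions
    trace_pointers: List[str] = field(default_factory=list)    # T: where raw logs live
    errors: List[str] = field(default_factory=list)            # E: errors and risks

    def render(self) -> str:
        def join(items: List[str]) -> str:
            return "; ".join(items) if items else "none"
        return "\n".join([
            f"S: {self.summary}",
            f"T: done={join(self.tasks_done)} | remaining={join(self.tasks_remaining)}",
            f"A: {join(self.assumptions)}",
            f"T: {join(self.trace_pointers)}",
            f"E: {join(self.errors)}",
        ])

state = CompactState(
    summary="Triage the 30-email golden dataset",
    tasks_done=["mask PII"],
    tasks_remaining=["classify", "route"],
    trace_pointers=["run-42/tool_calls.jsonl"],  # hypothetical log location
)
compact_prompt = state.render()
```

The rendered block stays a few lines long regardless of how many tool calls the run has made. The full logs remain reachable through the trace pointers.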

Section 12 — Recap

Offline evaluation says whether the agent worked on known examples.
Production evaluation says whether it keeps working under real traffic.
Use semaphores and bounded queues to prevent overload.
Use RAG carefully: chunk by meaning, rerank, compress, and cite.
Use STATE to preserve operational memory in long-running agents.

Thank You!

📦 github.com/ruslanmv/agentic-ai-concepts

📝 Full tutorial: docs/blog.md

🌐 ruslanmv.com

If this course helped, ⭐ the repo and share with your team.
