Workshop
One Problem, Three Frameworks
LangChain · LangGraph · CrewAI
Instructor: Ruslan Magana Vsevolodovna | ruslanmv.com
What agents are, how ReAct works, and when to use multi-agent systems
Build the same system in LangChain, LangGraph, and CrewAI, then compare hands-on
PII masking, guardrails, testing, evaluation with precision/recall/F1
Run a framework comparison and pick the right tool for the job
What is an agent? The spectrum from LLM to CoT to Agent. Memory, MCP, A2A, orchestration.
A system that controls the flow, not just the output.
The agent decides what to do next; a regular LLM just gives you one answer.
It is not binary. There is a spectrum of intelligence, and understanding it helps you choose the right level for each problem.
A Large Language Model takes one prompt, produces one response. No tools, no memory, no iteration. Fast and cheap, but if it hallucinates, you have no safety net.
Best for: text generation, Q&A, summarisation
Chain-of-Thought (CoT): the model generates its reasoning steps before answering. "The URL is suspicious… the tone is urgent… classic phishing." Much more accurate, but still reasoning in a vacuum: it cannot verify anything externally.
Used by: DeepSeek-R1, OpenAI o1/o3, Claude
Agent: the LLM reasons and also takes real actions such as calling tools, checking databases, and scanning URLs. It acts on the world and feeds the results back into its reasoning, so decisions are evidence-based.
Best for: workflows, decisions, business impact
Rule of thumb: if the task needs external data or has business impact, use an agent. For pure text, an LLM or CoT is enough.
Reason + Act: the loop at the heart of modern agents
Reasoning is explicit, auditable, and debuggable: every step is logged.
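The loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the course implementation: `fake_llm` is a scripted stand-in for a real model, and `lookup_domain` with its hard-coded reputation table is a hypothetical tool. What it shows is the shape of the cycle: reason, act, observe, repeat, with every step appended to a trace.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool: a hard-coded table stands in for a real reputation API.
def lookup_domain(domain: str) -> str:
    risky = {"paypa1-secure.example": 0.9}
    return f"risk={risky.get(domain, 0.1)}"

TOOLS: Dict[str, Callable[[str], str]] = {"lookup_domain": lookup_domain}

def fake_llm(history: List[str]) -> Tuple[str, str]:
    """Scripted stand-in for an LLM. Returns (kind, payload)."""
    observations = [line for line in history if line.startswith("Observation:")]
    if not observations:
        # Reason: no evidence yet, so request a tool call.
        return "act", "lookup_domain:paypa1-secure.example"
    # Reason: evidence is available, so produce the final answer.
    label = "phishing" if "risk=0.9" in observations[-1] else "other"
    return "final", label

def react_loop(question: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        kind, payload = fake_llm(history)
        if kind == "final":
            history.append(f"Answer: {payload}")
            return payload, history
        tool_name, arg = payload.split(":", 1)
        history.append(f"Action: {tool_name}({arg})")              # Act
        history.append(f"Observation: {TOOLS[tool_name](arg)}")    # Observe
    return "other", history  # step budget exhausted without an answer

answer, trace = react_loop("Is this sender safe?")
```

The `trace` list is the audit log: it records the question, the tool call, the observation, and the answer in order. The `max_steps` budget is one simple way to keep the loop from running indefinitely.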
Multiple agents, each with a specialised role, collaborating to solve a problem
Each agent has one job
Add a new agent without rewriting
Three layers of memory, each extending how far the agent can reach.
The conversation itself. Each reasoning step, tool call, and observation is appended to the message history. The LLM sees everything from the current run inside its context window, a fixed-size buffer measured in tokens (one token is roughly 3/4 of a word). GPT-4o-mini: 128k tokens. Claude: 200k tokens.
LangGraph supports checkpointing: serialising the agent's full state to a persistent store. The agent can resume conversations or recall information from previous sessions. This is how you build agents that remember past interactions.
An embedding is a numerical representation of text that captures meaning. A vector store (ChromaDB, Pinecone, FAISS) indexes embeddings for fast semantic search. The agent converts its query to an embedding, finds similar passages, and injects them into the prompt. This is Retrieval-Augmented Generation (RAG).
In this course we use short-term memory. For production, add checkpointing and RAG for knowledge that exceeds the context window.
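The retrieval step behind RAG can be illustrated without any external service. The sketch below is a toy under stated assumptions: `embed` uses word counts instead of a learned embedding model, `INDEX` is an in-memory list rather than ChromaDB, Pinecone, or FAISS, and the three documents are invented for the example. The mechanics are the same: embed the query, rank stored passages by similarity, inject the best match into the prompt.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. Real systems use a learned model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented knowledge snippets; the list of (text, vector) pairs is the "vector store".
DOCS = [
    "invoices are paid within 30 days of receipt",
    "phishing emails often ask you to verify your password",
    "the support desk is open on weekdays",
]
INDEX = [(doc, embed(doc)) for doc in DOCS]

def retrieve(query: str, k: int = 1) -> List[str]:
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    # Inject the retrieved passages ahead of the question.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Word-count vectors only match exact words, which is why production systems use learned embeddings that also capture synonyms and paraphrases.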
Two protocols power the agentic ecosystem. Understanding the difference is critical.
MCP (Model Context Protocol): an open standard (by Anthropic) for connecting agents to tools, which are passive functions that take input and return output. Examples: URL scanners, database queries, weather APIs. The MCP Server exposes tools; the MCP Client (the agent) discovers and calls them.
Transport: stdio (local subprocess) or SSE (Server-Sent Events, for a remote HTTP service).
A2A (Agent2Agent): an open protocol (by Google) for connecting agents to other autonomous agents, that is, systems that reason, use their own tools, and make their own decisions. The difference: MCP tools are passive functions; A2A agents are active reasoners.
Example: a coordinator delegates "find flights" to a Flight Agent that reasons about layovers, compares prices, and calls airline APIs via MCP.
Choosing the right pattern is often more important than choosing the right framework.
In this course: ReAct Loop (LangChain), DAG (LangGraph), Sequential (CrewAI).
An agent is a system that reasons, acts, and iterates toward a goal using tools and state, rather than producing a single response. It manages memory through context windows, checkpoints, and vector stores. It connects to tools via MCP and to other agents via A2A. The orchestration pattern (sequential, DAG, ReAct, hierarchical, or routing) determines how much control vs. flexibility the system has.
Enterprise email classification: one problem, three frameworks.
Classify incoming emails and route them to the right action.
| Category | Example | Action |
|---|---|---|
| phishing | "Verify your account immediately" | Quarantine + review |
| spam | "Limited time! Win a free iPhone" | Quarantine |
| invoice | "Invoice #2026-042: payment due" | Accounts payable |
| meeting | "Team sync Thursday 10 AM" | Calendar suggestion |
| support | "Ticket #5432: production outage" | Support ticket |
| other | Everything else | Inbox |
Shared by all three approaches
git clone https://github.com/ruslanmv/agentic-ai-concepts.git
cd agentic-ai-concepts
python3 -m venv .venv
source .venv/bin/activate
make install # pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
Verify the setup:
make test # 80 offline tests, no API key needed
make evaluate # baseline evaluation against golden dataset
agentic-ai-concepts/
├── src/
│   ├── schema.py            # Pydantic models
│   ├── preprocessing.py     # PII masking
│   ├── fallback.py          # Keyword fallback
│   ├── routing.py           # Label → action
│   ├── tools.py             # Agent tools
│   ├── evaluate.py          # Metrics + prod gate
│   ├── langchain_agent.py   # Approach 1
│   ├── langgraph_agent.py   # Approach 2
│   └── crewai_agent.py      # Approach 3
├── data/golden_dataset.csv  # 30 labelled emails
├── tests/                   # 80 offline + 12 integration
└── examples/
    ├── run_all.py
    └── compare_frameworks.py  # Side-by-side verdict
Schema, PII preprocessing, keyword fallback, routing.
The contract all three frameworks must produce.
from enum import Enum
from typing import List

from pydantic import BaseModel, Field, confloat


class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM = "spam"
    INVOICE = "invoice"
    MEETING = "meeting"
    SUPPORT = "support"
    OTHER = "other"


class EmailClassification(BaseModel):
    label: EmailLabel
    confidence: confloat(ge=0.0, le=1.0)  # must lie in [0, 1]
    rationale: str
    indicators: List[str] = Field(default_factory=list)
    requires_human_review: bool = False
Pydantic enforces the contract: invalid LLM output fails fast.
Replace sensitive data before sending to the LLM.
import re
from collections import OrderedDict

# Patterns are applied in insertion order, top to bottom.
_PII_PATTERNS = OrderedDict(
    SSN=re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    CREDIT_CARD=re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
    EMAIL=re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    IBAN=re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    PHONE=re.compile(r"\b(\+?\d[\d\s\-\(\)]{7,}\d)\b"),
)

def mask_pii(text: str) -> str:
    # Replace each match with its tag, e.g. "[SSN]".
    for tag, pattern in _PII_PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

>>> mask_pii("Contact user@example.com, SSN 123-45-6789")
'Contact [EMAIL], SSN [SSN]'
Deterministic safety net: always have a fallback for any probabilistic component.
_KEYWORD_MAP = {
    EmailLabel.PHISHING: ["verify", "password", "urgent", "suspend",
                          "locked", "click", "link", "account"],
    EmailLabel.INVOICE: ["invoice", "payment", "remittance", "iban", "vat"],
    EmailLabel.MEETING: ["meeting", "calendar", "invite", "zoom", "agenda"],
    EmailLabel.SUPPORT: ["ticket", "issue", "bug", "incident", "outage"],
    EmailLabel.SPAM: ["unsubscribe", "promotion", "deal", "free", "win"],
}

def keyword_fallback(subject, body) -> EmailClassification:
    # Count keyword hits per category, pick the highest
    # Confidence capped at 0.8 (honest about limitations)
    ...
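The slide elides the body of `keyword_fallback`, so the following is only one plausible reading of its two comments, written with the standard library so it runs on its own. The names `Label`, `Result`, and `keyword_fallback_sketch`, the confidence formula (`0.4 + 0.1 * hits`), and the `0.3` score for "no hits" are all assumptions made for illustration; only the keyword lists and the 0.8 cap come from the material above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class Label(str, Enum):
    PHISHING = "phishing"
    INVOICE = "invoice"
    MEETING = "meeting"
    SUPPORT = "support"
    SPAM = "spam"
    OTHER = "other"

KEYWORDS: Dict[Label, List[str]] = {
    Label.PHISHING: ["verify", "password", "urgent", "suspend",
                     "locked", "click", "link", "account"],
    Label.INVOICE: ["invoice", "payment", "remittance", "iban", "vat"],
    Label.MEETING: ["meeting", "calendar", "invite", "zoom", "agenda"],
    Label.SUPPORT: ["ticket", "issue", "bug", "incident", "outage"],
    Label.SPAM: ["unsubscribe", "promotion", "deal", "free", "win"],
}

@dataclass
class Result:
    label: Label
    confidence: float
    indicators: List[str] = field(default_factory=list)

def keyword_fallback_sketch(subject: str, body: str) -> Result:
    text = f"{subject} {body}".lower()
    # Count keyword hits per category (plain substring matching).
    hits = {label: [kw for kw in kws if kw in text]
            for label, kws in KEYWORDS.items()}
    best = max(hits, key=lambda label: len(hits[label]))
    if not hits[best]:
        return Result(Label.OTHER, 0.3)  # assumed score when nothing matches
    # Assumed formula: more hits means more confidence, capped at 0.8.
    confidence = min(0.8, 0.4 + 0.1 * len(hits[best]))
    return Result(best, confidence, hits[best])
```

Substring matching is deliberately crude: "win" also matches "window", for example. That limitation is a good reason to keep the confidence cap low.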
Map classification → downstream action. Human review always takes priority.
class RouteAction(str, Enum):
    HUMAN_REVIEW = "queue_for_human_review"
    AP_QUEUE = "send_to_ap_queue"
    CALENDAR = "create_calendar_suggestion"
    TICKET = "create_support_ticket"
    QUARANTINE = "quarantine"
    INBOX = "inbox"

def route(classification: EmailClassification) -> RouteAction:
    if classification.requires_human_review:
        return RouteAction.HUMAN_REVIEW  # always takes priority
    return _LABEL_TO_ACTION.get(classification.label, RouteAction.INBOX)
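The `_LABEL_TO_ACTION` table is not shown on the slide. A self-contained sketch of what such a mapping could look like follows, derived from the category/action table earlier in the deck. The names `Label`, `Action`, `LABEL_TO_ACTION`, and `route_sketch` are hypothetical, and mapping phishing to quarantine when no review flag is set is an assumption.

```python
from enum import Enum

class Label(str, Enum):
    PHISHING = "phishing"
    SPAM = "spam"
    INVOICE = "invoice"
    MEETING = "meeting"
    SUPPORT = "support"
    OTHER = "other"

class Action(str, Enum):
    HUMAN_REVIEW = "queue_for_human_review"
    AP_QUEUE = "send_to_ap_queue"
    CALENDAR = "create_calendar_suggestion"
    TICKET = "create_support_ticket"
    QUARANTINE = "quarantine"
    INBOX = "inbox"

# Assumed mapping, following the category/action table.
LABEL_TO_ACTION = {
    Label.PHISHING: Action.QUARANTINE,
    Label.SPAM: Action.QUARANTINE,
    Label.INVOICE: Action.AP_QUEUE,
    Label.MEETING: Action.CALENDAR,
    Label.SUPPORT: Action.TICKET,
    Label.OTHER: Action.INBOX,
}

def route_sketch(label: Label, requires_human_review: bool) -> Action:
    if requires_human_review:
        return Action.HUMAN_REVIEW  # the review flag overrides the label
    return LABEL_TO_ACTION.get(label, Action.INBOX)
```

Keeping the mapping in a dictionary rather than an if/else chain means a new category needs only one new entry, and unknown labels fall through to the inbox.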
Giving the agent access to real data: evidence over guessing.
The agent checks facts instead of hallucinating
Classification backed by tool results
Every tool call is logged, so you know why
LLM picks which tools to call per email
check_sender_reputation: takes a domain name and returns a risk score. Is this sender known to be malicious?
Triggered by: suspicious sender
scan_urls: extracts all URLs from the email body and checks them against known malicious patterns.
Triggered by: links in body
lookup_known_contacts: checks if the sender is in our internal contact list, that is, a known, trusted colleague or vendor.
Triggered by: sender email
check_invoice_registry: validates whether an invoice number matches a known record in accounts payable.
Triggered by: invoice number
Simulated databases for offline testing. In production, swap for real API (Application Programming Interface) calls.
import re

from langchain_core.tools import tool


@tool
def scan_urls(email_body: str) -> str:
    """Scan all URLs in an email body for malicious indicators."""
    url_pattern = re.compile(r"https?://[^\s<>\"']+")
    urls = url_pattern.findall(email_body)
    if not urls:
        return "No URLs found. URL risk: NONE."
    results = []
    for url in urls:
        # A URL is flagged if any known-bad pattern matches it.
        is_malicious = any(
            re.search(p, url.lower())
            for p in _MALICIOUS_URL_PATTERNS
        )
        if is_malicious:
            results.append(f"  {url} → MALICIOUS (risk: 0.9)")
        ...
    return f"Found {len(urls)} URL(s):\n" + "\n".join(results)
The @tool decorator makes it callable by the LLM agent.
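To make the "simulated database" idea concrete, here is a dependency-free sketch of how an invoice lookup tool might be written. It is not the repository's `check_invoice_registry`: the function name `check_invoice_registry_sketch`, the `_INVOICE_DB` contents, the vendor name, and the `#YYYY-NNN` number format are assumptions chosen to match the sample subject line "Invoice #2026-042". The `@tool` decorator is omitted so the block runs without LangChain installed.

```python
import re
from typing import Dict

# Hypothetical in-memory registry standing in for an accounts-payable system.
_INVOICE_DB: Dict[str, Dict[str, str]] = {
    "2026-042": {"vendor": "Acme Supplies", "status": "open"},
}

def check_invoice_registry_sketch(email_text: str) -> str:
    """Look up the first invoice number found in the text."""
    match = re.search(r"#(\d{4}-\d{3})\b", email_text)
    if not match:
        return "No invoice number found."
    number = match.group(1)
    record = _INVOICE_DB.get(number)
    if record is None:
        return f"Invoice {number}: NOT FOUND in registry."
    return (f"Invoice {number}: KNOWN "
            f"(vendor: {record['vendor']}, status: {record['status']}).")
```

Returning a short, plain-text verdict keeps the observation easy for the LLM to read. Swapping the dictionary for a real API call would change only the lookup line.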
The LLM decides which tools to call. Evidence-based classification.
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from src.tools import ALL_TOOLS

def _build_agent():
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    agent = create_react_agent(
        model=llm,
        tools=ALL_TOOLS,  # 4 tools from src/tools.py
        prompt=SYSTEM_PROMPT,
    )
    return agent
That's it: create_react_agent handles the entire Reason → Act → Observe loop.
SYSTEM_PROMPT = """
You are an enterprise email triage agent.
Classify emails by gathering evidence using your tools.
## Workflow
1. Analyse the email for signals
2. Use tools to gather evidence:
   - URLs or suspicious → scan_urls
   - Sender domain → check_sender_reputation
   - Sender email → lookup_known_contacts
   - Invoice number → check_invoice_registry
3. Classify: phishing/spam/invoice/meeting/support/other
4. Return JSON: label, confidence, rationale, indicators
## Rules
- Phishing → always requires_human_review = true
- Base confidence on tool evidence, not gut feeling
"""
Applied after the agent produces a classification.
CONFIDENCE_THRESHOLD = 0.6

def apply_guardrails(classification, subject, body):
    # Hard rule: phishing → always flag, cap confidence
    if classification.label == EmailLabel.PHISHING:
        classification.requires_human_review = True
        classification.confidence = min(classification.confidence, 0.85)
        return classification
    # Soft rule: low confidence → deterministic fallback
    if classification.confidence < CONFIDENCE_THRESHOLD:
        return keyword_fallback(subject, body)
    return classification
make run-langchain
# or: python -m src.langchain_agent
The tools_used field is the audit trail: you know exactly why the agent decided.
Explicit graph. Typed state. Conditional edges. Bank-grade auditability.
from typing import Optional, TypedDict

class GraphState(TypedDict):
    """Typed state flowing through the graph; each node returns a partial update."""
    subject: str
    body: str
    masked_body: str
    classification: Optional[EmailClassification]
    action: Optional[str]
Every field is typed and explicit
Compliance can review the full state at any node
Each node is independently testable.
def preprocess_node(state: GraphState) -> dict:
    sanitised = preprocess_email(state["subject"], state["body"])
    return {"masked_body": sanitised["body"]}

def classify_node(state: GraphState) -> dict:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    classifier = llm.with_structured_output(EmailClassification)
    result = (prompt | classifier).invoke({
        "subject": state["subject"], "body": state["masked_body"],
    })
    return {"classification": result}

def guardrails_node(state: GraphState) -> dict:
    result = state["classification"]
    if result.label == EmailLabel.PHISHING:
        result.requires_human_review = True
        result.confidence = min(result.confidence, 0.85)
    return {"classification": result}
from langgraph.graph import END, StateGraph

def build_graph():
    graph = StateGraph(GraphState)
    # Nodes
    graph.add_node("preprocess", preprocess_node)
    graph.add_node("classify", classify_node)
    graph.add_node("guardrails", guardrails_node)
    graph.add_node("fallback", fallback_node)
    graph.add_node("route", route_node)
    # Edges
    graph.set_entry_point("preprocess")
    graph.add_edge("preprocess", "classify")
    graph.add_edge("classify", "guardrails")
    graph.add_conditional_edges("guardrails",
                                decide_after_guardrails,
                                {"fallback": "fallback", "route": "route"})
    graph.add_edge("fallback", "route")
    graph.add_edge("route", END)
    return graph.compile()
Every transition is declared and reviewable
No silent mutations: each node returns an explicit update against the TypedDict schema
Logic is declared, not buried in if/else
Each node function is independently testable
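The conditional edge relies on a `decide_after_guardrails` function that the slides do not show. One way it might look is sketched below, assuming it mirrors the LangChain guardrail: flagged emails go straight to routing, and low-confidence results take the fallback branch. The function name `decide_after_guardrails_sketch` and the use of a plain dict for the classification (instead of the Pydantic model) are assumptions made to keep the example dependency-free.

```python
from typing import Any, Dict

CONFIDENCE_THRESHOLD = 0.6  # same threshold as the post-classification guardrail

def decide_after_guardrails_sketch(state: Dict[str, Any]) -> str:
    """Return the name of the next node: 'fallback' or 'route'."""
    classification = state.get("classification")
    if classification is None:
        return "fallback"  # nothing usable came back from the classifier
    if classification["requires_human_review"]:
        return "route"     # flagged emails skip the fallback entirely
    if classification["confidence"] < CONFIDENCE_THRESHOLD:
        return "fallback"  # low confidence: use the deterministic path
    return "route"
```

Because the function is pure (state in, branch name out), it can be unit-tested without building the graph or calling an LLM.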
Three agents collaborate: Classifier → Risk Analyst → Policy Router.
Classifier: "Senior email triage specialist at a Fortune 500 company"
Risk Analyst: "Cybersecurity analyst specialising in email threats"
Policy Router: "Compliance officer; when in doubt, escalate"
from crewai import Agent, Crew, Process, Task

classifier_agent = Agent(
    role="Email Classifier",
    goal="Classify the email into one of: phishing, spam, "
         "invoice, meeting, support, other.",
    backstory="You are a senior email triage specialist "
              "at a Fortune 500 company.",
    verbose=False,
    allow_delegation=False,
)
# ... risk_agent, policy_agent defined similarly
crew = Crew(
    agents=[classifier_agent, risk_agent, policy_agent],
    tasks=[classify_task, risk_task, policy_task],
    process=Process.sequential,
)
result = crew.kickoff()
Clear role separation: each agent has one job. Easy to extend: add a fourth agent for compliance or translation without rewriting anything. Mirrors how human teams collaborate: analyst, reviewer, decision-maker.
Three LLM calls per email instead of one. More latency and higher API spend. For a single classification task the extra agents do not significantly improve accuracy, so the overhead is not justified.
Best for complex problems that genuinely need collaborative multi-step reasoning across different roles.
80 offline tests. No API key needed. Every component covered.
Integration tests require OPENAI_API_KEY and are separated by @pytest.mark.integration
# Offline only (fast, no API key)
make test
# Everything including live LLM
make test-all
Golden dataset. Precision & recall. Production readiness gate.
data/golden_dataset.csv: 30 hand-labelled emails
| Category | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| phishing | 1 | 2 | 2 | 5 |
| spam | 2 | 1 | 1 | 4 |
| invoice | 3 | 1 | 1 | 5 |
| meeting | 2 | 2 | 2 | 6 |
| support | 2 | 1 | 2 | 5 |
| other | 2 | 3 | 0 | 5 |
Hard samples: BEC wire transfer, legit security alert, ambiguous reply threads.
Of everything flagged as X, how many actually were X?
Low precision = too many false alarms β alert fatigue
Of all actual X emails, how many did we catch?
Low recall = missed threats β security risk
F1 = harmonic mean of both. It balances precision and recall.
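In terms of counts, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is 2 · P · R / (P + R). The sketch below computes all three for one label from two lists of strings. The function name and the ten-email sample are illustrative, not taken from `src/evaluate.py`.

```python
from typing import List, Tuple

def precision_recall_f1(y_true: List[str], y_pred: List[str],
                        label: str) -> Tuple[float, float, float]:
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly flagged
    fp = sum(t != label and p == label for t, p in pairs)  # false alarms
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative data: 5 real phishing emails, all caught, plus 5 false alarms.
y_true = ["phishing"] * 5 + ["other"] * 5
y_pred = ["phishing"] * 10
p, r, f1 = precision_recall_f1(y_true, y_pred, "phishing")
# p = 0.5 (half the alerts are false), r = 1.0 (nothing missed), f1 ≈ 0.667
```

A classifier that flags everything reaches perfect recall but poor precision. Because F1 is a harmonic mean, it stays low unless both numbers are high.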
MIN_WEIGHTED_F1 = 0.70 # Overall performance
MIN_PHISHING_RECALL = 0.80 # Must catch ≥ 80% of phishing
MIN_PHISHING_PREC = 0.60 # Must not over-flag
Three checks. All must pass.
Overall quality
Safety-critical
Alert fatigue
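The gate itself reduces to three comparisons. A minimal sketch follows, assuming each threshold is inclusive (`>=`); the function name `production_gate_sketch` and the shape of the returned dictionary are hypothetical, while the threshold values are the ones listed above.

```python
from typing import Dict

MIN_WEIGHTED_F1 = 0.70      # Overall performance
MIN_PHISHING_RECALL = 0.80  # Must catch at least 80% of phishing
MIN_PHISHING_PREC = 0.60    # Must not over-flag

def production_gate_sketch(weighted_f1: float,
                           phishing_recall: float,
                           phishing_precision: float) -> Dict[str, bool]:
    checks = {
        "overall_quality": weighted_f1 >= MIN_WEIGHTED_F1,
        "safety_critical": phishing_recall >= MIN_PHISHING_RECALL,
        "alert_fatigue": phishing_precision >= MIN_PHISHING_PREC,
    }
    # The gate passes only if every individual check passes.
    checks["passed"] = all(checks.values())
    return checks

weak = production_gate_sketch(0.67, 1.00, 0.50)    # fails quality and precision
strong = production_gate_sketch(0.90, 1.00, 0.83)  # passes all three
```

Reporting each check separately, not just the final verdict, shows which requirement blocked a release.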
make evaluate
Keywords alone aren't enough. We need the LLM agents.
Head-to-head results. Which framework wins?
export OPENAI_API_KEY="sk-..."
make compare
Runs all 30 golden samples through all 4 approaches.
# Or directly:
python examples/compare_frameworks.py --all
| Metric | Fallback | LangChain | LangGraph | CrewAI |
|---|---|---|---|---|
| Accuracy | 73.3% | ~90% | ~90% | ~87% |
| Weighted F1 | 0.67 | ~0.90 | ~0.90 | ~0.87 |
| Phishing Recall | 1.00 | 1.00 | 1.00 | 0.80 |
| Phishing Precision | 0.50 | 0.83 | 0.83 | 0.80 |
| "other" F1 | 0.00 | 0.78 | 0.75 | 0.78 |
| Time (30 emails) | 0.0s | ~45s | ~13s | ~68s |
| LLM calls/email | 0 | 1 (multi-turn) | 1 | 3 |
| Prod Gate | FAIL | PASS | PASS | PASS |
Numbers may vary slightly due to LLM non-determinism.
| Criterion | Winner | Why |
|---|---|---|
| Best accuracy | LangChain ≈ LangGraph | Both ~90% |
| Best speed | LangGraph | Single LLM call, ~3x faster |
| Best auditability | LangGraph | Explicit edges, typed state |
| Best safety | LangChain | Tool evidence = audit trail |
| Best cost | LangGraph | 1 call vs multi-turn vs 3 |
| Best extensibility | CrewAI | Adding an agent is trivial |
LangGraph: best balance of accuracy, speed, cost, and auditability. Every edge is reviewable. Typed state. Independently testable nodes.
LangChain: evidence-based reasoning with tool audit trail. Better at explaining decisions. Excellent for discovery.
CrewAI is the right choice when you need collaborative multi-step reasoning across truly different roles, but it is overkill for single classification.
# In your CI pipeline:
python examples/compare_frameworks.py --all \
    --output data/eval_results/comparison.json

# Parse the result:
import json

with open('data/eval_results/comparison.json') as f:
    data = json.load(f)

winner = data.get('recommended')
print(f'Recommended: {winner}')
for a in data['approaches']:
    gate = '✓' if a['passed_production_gate'] else '✗'
    print(f"  {gate} {a['approach']}: F1={a['weighted_f1']:.4f}")
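To turn the comparison into a real CI check, the parsed result needs to become an exit code. The sketch below assumes the JSON has the keys used in the parsing snippet (`recommended`, `approaches`, `approach`, `passed_production_gate`). The `ci_exit_code` helper and the sample payload are hypothetical and only illustrate the shape.

```python
import json
from typing import Any, Dict

def ci_exit_code(data: Dict[str, Any]) -> int:
    """Return 0 if the recommended approach passed the gate, else 1."""
    winner = data.get("recommended")
    passed = {a["approach"] for a in data["approaches"]
              if a["passed_production_gate"]}
    return 0 if winner in passed else 1

# Hypothetical payload with the same keys the parsing snippet reads.
sample = json.loads("""
{
  "recommended": "langgraph",
  "approaches": [
    {"approach": "fallback",  "weighted_f1": 0.67, "passed_production_gate": false},
    {"approach": "langgraph", "weighted_f1": 0.90, "passed_production_gate": true}
  ]
}
""")
code = ci_exit_code(sample)
```

In a pipeline script, finishing with `sys.exit(ci_exit_code(data))` would fail the build whenever the recommended approach misses the production gate.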
The system reasons before acting
Tools provide real data, not hallucinations
Each step knows what came before
Guardrails enforce business rules at every step
An agent is a system that reasons, acts, and iterates toward a goal using tools and state, rather than producing a single response. Multi-agent systems decompose complex workflows across specialised roles. LangChain enables ReAct-style agents with dynamic tool selection, LangGraph provides deterministic state-machine orchestration for enterprise workflows, and CrewAI enables collaborative multi-agent designs. Agentic systems trade simplicity for control, safety, and auditability.
Replace simulated tools with real threat intelligence, CRM, and invoice APIs
Use LangGraph checkpointing to persist state across sessions
Let human reviewers correct classifications to improve the model
Wrap classify_email in a FastAPI endpoint and run make compare in CI/CD
github.com/ruslanmv/agentic-ai-concepts
Full tutorial: docs/blog.md
ruslanmv.com
If this course helped, star the repo and share with your team.