AI Engineer Agent Specialist

11 chunks · Build-first · Socratic tutor · Health domain

Sequential first pass · Interleaved review · Harvard 2x method · 16 days · Print-ready

How to use

First pass (chunks 1→11): Read the chunk. Open a new Claude chat, paste the Socratic tutor prompt. Build the exercise. Don't move on until you can rebuild from memory.

Review passes (after day 7): Pick 3 RETAIN sections from different chunks (e.g. 2+5+8). Mix — never same-topic review.

First pass: Sequential 1→11. Dependencies are real.
Interleaved review: mixing different topics = 77% retention vs 38% for same-topic review (Rohrer & Taylor 2007). Pick 3 RETAIN sections from different chunks (e.g. 2+5+8). The difficulty of switching is the mechanism.
Harvard 2x (Socratic tutor): a 2024 RCT found Socratic AI tutoring teaches 2x more material than standard instruction. Mechanism: questions force retrieval, which encodes memory. Being given answers skips retrieval entirely — nothing is encoded. Ask "what concept am I missing?", never "fix this for me."
GNOSIS TECHNIQUE block: 30 min after HARVEST. Timer. Stop.
#    Topic                     Day     Build
0    Environment Setup         0       Install packages + verify API key
1    Embeddings                1       10 medical terms similarity search
2    Vector Databases          2       Clinic FAQ in Chroma
3    RAG Pipeline              3–4     Clinic chatbot over real docs — €1,500
4    Evaluation                5       20-question test set
5    The Agent Loop            7       2-tool research agent
6    Tool Definition           8       3-tool health agent
7    Agent Memory              9–10    Patient profile + session memory
8    ReAct Pattern             11–12   Multi-step clinical reasoning agent
9    MCP Servers               13–14   Wrap researcher tool as MCP
10   LangChain & LangGraph     15–16   Rewrite RAG + ReAct in industry frameworks
11   Deploy Your Agent         17      Streamlit app → live URL → portfolio artifact
SETUP · Environment Setup

"Run this once before Chunk 1 — takes 10 minutes"

Path A — Free (Ollama, local, no API cost)

Use this while learning. Runs entirely on your machine. No API keys, no billing, works offline.

# 1. Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# 2. Pull models (one-time download)
ollama pull llama3.2          # 2GB — main LLM for all exercises
ollama pull nomic-embed-text  # 274MB — free embedding model

# 3. Verify
ollama run llama3.2 "Say hello"  # should respond

# 4. Install Python packages
pip install openai chromadb sentence-transformers langchain-ollama langchain-community langgraph streamlit

Ollama is OpenAI-compatible — 1-line swap

All code in these chunks uses OpenAI(). To use Ollama instead, change only the client init:

# Replace this (paid):
from openai import OpenAI
client = OpenAI()

# With this (free, local):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Everything else — tool calling, streaming, chat — is identical.

Path B — API keys (for the portfolio demo)

When you're ready to deploy and share with clients: get OpenAI or Anthropic API keys. The code is identical — only the client init changes. Until then, Ollama is faster and free.

# Optional: add to ~/.bashrc when ready
export OPENAI_API_KEY="sk-..."       # platform.openai.com/api-keys
export ANTHROPIC_API_KEY="sk-ant-..." # console.anthropic.com
0 / 11
CHUNK 01 / 11 · Embeddings

"Text you can do math on"

The Mechanism

An embedding is a list of numbers (a vector) that represents the meaning of text. "heart attack" → [0.23, -0.87, 0.11, ...] (1536 numbers in OpenAI's model)

The key property: similar meanings → similar vectors. "heart attack" is numerically close to "myocardial infarction" and "cardiac event". Far from "chicken soup".

This is how you search by meaning, not keyword. You don't need the exact word — you need the concept.

Why It's Not Magic

The LLM was trained on billions of text examples. It learned that "heart attack" and "myocardial infarction" appear in similar contexts. The embedding is a compressed representation of that learned context. No understanding — just learned statistical co-occurrence.

Use Cases

Semantic search · Deduplication · Classification · RAG retrieval (the bridge between user question and relevant documents)

The API (2 lines)

from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(input="heart attack", model="text-embedding-3-small")
vector = response.data[0].embedding  # 1536 floats
Free alternative — no API key needed
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # 80MB, downloads once
vector = model.encode("heart attack")             # numpy array, 384 dimensions

# Similarity works the same way — same concept, different function
import numpy as np
def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Or use Ollama: ollama pull nomic-embed-text → same OpenAI embeddings API at localhost:11434/v1

Similarity

import numpy as np
def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# 1.0 = identical · 0.0 = unrelated · -1.0 = opposite
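A quick sanity check of the similarity scale, using a dependency-free version of the same formula and made-up 2-d vectors (real embeddings have hundreds of dimensions):

```python
import math

def similarity(a, b):
    # cosine similarity — same formula as the numpy version above
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors chosen to hit the three reference points
print(similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  — identical direction
print(similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  — orthogonal, unrelated
print(similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 — opposite
```

Real embedding pairs rarely reach these extremes — "heart attack" vs "myocardial infarction" might score around 0.8, vs "chicken soup" maybe 0.1.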
Build this
10 medical terms. Embed all of them. Find the 3 most similar to "inflammation". Expected: "cytokines", "autoimmune", "C-reactive protein" score higher than "fracture" or "dehydration". If your results make clinical sense → you understand embeddings.
Retain (spaced repetition)
  • Embedding = text → vector where similar meanings are numerically close
  • Cosine similarity: 1.0 = same, 0.0 = unrelated
  • Use text-embedding-3-small — cheap, fast, good enough
  • The point: search by meaning, not keyword
1 / 11
CHUNK 02 / 11 · Vector Databases

"A search engine for meaning"

The Problem Embeddings Alone Don't Solve

You embed 10,000 patient FAQ entries. A user asks a question. Comparing the question vector against every stored vector one by one works at small scale but stops being real-time as the corpus grows. A vector database stores embeddings and retrieves the most similar ones in milliseconds, even with millions of documents.

The Mechanism

Vector DBs use approximate nearest-neighbor (ANN) algorithms (HNSW, IVF) to find similar vectors without checking every single one. Trade: 99% accuracy, 1000x faster.
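To see what the ANN index is approximating, here is the exact brute-force version it replaces — a pure-Python sketch with made-up 3-d vectors and hypothetical doc IDs. A vector DB returns roughly this result without scoring every document:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def exact_top_k(query_vec, store, k=2):
    # Score every stored vector — O(n) per query; this is the work ANN indexes avoid
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in store.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)[:k]]

# Toy 3-d "embeddings" (made up for illustration)
store = {
    "doc_hormones": [0.9, 0.1, 0.0],
    "doc_sleep":    [0.1, 0.9, 0.1],
    "doc_billing":  [0.0, 0.1, 0.9],
}
print(exact_top_k([0.8, 0.2, 0.0], store))  # doc_hormones ranks first
```

HNSW and IVF return almost the same ranking while touching only a small fraction of the stored vectors.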

Chroma — Start Here (free, local, no signup)

import chromadb

client = chromadb.Client()
collection = client.create_collection("clinic_docs")

# Add documents (Chroma auto-embeds if you give it text)
collection.add(
    documents=["Testosterone therapy increases libido",
               "HGH improves muscle mass",
               "NAD+ supports mitochondrial function"],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(query_texts=["hormones for energy"], n_results=2)
print(results["documents"])  # most similar docs

Key Concepts

Collection = a table of documents. n_results = how many similar docs to return. Metadata filtering = find similar docs, but only from category=hormones: pass where={"category": "hormones"} to collection.query().

Build this
Load 20 entries from your clinic protocols or FAQ into Chroma. Query with 5 different patient questions. Verify: does the most similar result actually answer the question? Wrong results → your chunks are too big (Chunk 3 explains chunking).
Retain
  • Vector DB = fast search engine for embeddings
  • Chroma = best start (local, free, no API key)
  • collection.add(documents=[...], ids=[...]) stores + auto-embeds
  • collection.query(query_texts=[...], n_results=K) retrieves top-K similar
  • Later upgrade to Pinecone when you need cloud/scale
2 / 11
CHUNK 03 / 11 · RAG Pipeline

"Giving LLMs access to your documents without hallucination"

The Problem

Ask Claude "what's the protocol for testosterone therapy in women over 50?" — it answers confidently from 2023 training data. Wrong, outdated, or generic. RAG fixes this: retrieve your actual protocol document first, inject it into the prompt. Claude answers from real context.

The Full Pipeline

# INDEXING (one time):
Documents → Split into chunks → Embed each chunk → Store in Chroma

# RETRIEVAL (every query):
User question → Embed question → Find similar chunks → Insert into prompt → LLM answers

Why Chunk Size Matters

Too large → multiple topics, retrieval noisy. Too small → loses context. Sweet spot: 300–500 tokens with 50-token overlap.

from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(your_document)
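A minimal sketch of what chunk_size and chunk_overlap mean under the hood — character-based and deliberately naive (the real splitter also prefers paragraph and sentence boundaries):

```python
def naive_split(text, chunk_size=20, overlap=5):
    # Fixed-size windows; each chunk re-includes the last `overlap`
    # characters of the previous one so boundary context isn't lost.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "Testosterone therapy increases libido and energy."
for c in naive_split(doc):
    print(repr(c))
# Consecutive chunks share 5 characters — that's the overlap.
```

Without the overlap, a sentence straddling a chunk boundary would be unreachable by retrieval from either side.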

RAG in 30 Lines

import chromadb
from openai import OpenAI

client = OpenAI()
db = chromadb.Client()
collection = db.create_collection("protocols")

collection.add(documents=chunks, ids=[f"chunk_{i}" for i in range(len(chunks))])

def retrieve(question, k=3):
    results = collection.query(query_texts=[question], n_results=k)
    return "\n\n".join(results["documents"][0])

def rag(question):
    context = retrieve(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. If the answer isn't in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

print(rag("What's the testosterone protocol for women?"))
Free alternative — Ollama drop-in
# Change only these 2 lines — rest of the RAG code is identical
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Then use "llama3.2" as the model name:
response = client.chat.completions.create(model="llama3.2", messages=[...])

# For embeddings (free):
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# In retrieve(), embed the question manually:
q_vector = embed_model.encode(question).tolist()
results = collection.query(query_embeddings=[q_vector], n_results=3)

Common RAG Failures (diagnose before shipping)

Failure               Symptom                                   Fix
Chunk too large       Irrelevant content mixed into answer      Reduce to 300–500 tokens
Chunk too small       Answer cuts off mid-concept               Increase size + add overlap
No overlap            Misses context at chunk boundaries        Add 50-token overlap
Query-doc mismatch    Right doc exists, not retrieved           HyDE: ask the LLM to write a hypothetical answer, then embed that — bridges the vocabulary gap between how questions are phrased and how documents are written
Top-K too low         Right chunk ranked 4th, k=3 misses it     Increase k to 5–8
LLM ignores context   Answers from training memory              Strengthen system prompt: "ONLY from context"

Diagnosis rule: wrong retrieval → chunking/embedding issue. Right retrieval, wrong answer → prompt issue. Confident wrong answer → answer isn't in your docs.

Build this — the €1,500–2,500 freelance product
RAG over your clinic's actual protocols or patient FAQ PDF. Test 10 real patient questions. It should refuse to answer questions not in your docs. That refusal behavior = no hallucination = production-ready.
Retain
  • RAG = retrieve relevant docs → inject into prompt → LLM answers from real context
  • Chunk at 300–500 tokens with 50-token overlap
  • "Answer only from context, if not there say so" → eliminates hallucination
  • RAG beats fine-tuning for knowledge injection in 90% of cases
  • This is the sellable product: RAG chatbot over clinic docs
3 / 11
CHUNK 04 / 11 · Evaluation

"How to know it actually works"

The Problem

You test with 5 easy questions. It works. You ship. Client uses it. Patient asks an edge case. It hallucinates confidently. You didn't find it because you only tested easy questions.

Minimum Viable Eval (3 things)

1. Groundedness — Build 20–30 Q&A pairs from your real documents. Score manually: 0 (wrong), 1 (partial), 2 (correct). Target: >70%.

2. Faithfulness — Ask 10 questions NOT in your documents. It should refuse every time. If it answers anyway = hallucination = not safe for health context.

3. Latency — Target: under 3 seconds. Over 5s = users abandon.

import time
start = time.time()
answer = rag("your question")
print(f"{time.time() - start:.2f}s")

LLM as Judge (auto-scoring)

def score_answer(question, expected, actual):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""
Score this answer 0-2:
Question: {question}
Expected: {expected}
Actual: {actual}
Return only the number."""}]
    )
    return int(response.choices[0].message.content.strip())
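A sketch of wiring the judge into a harness that reports the groundedness percentage. The answer and scoring functions are stubbed here with hypothetical lambdas — swap in your real rag() and score_answer():

```python
def run_eval(test_set, answer_fn, score_fn):
    # Score every Q&A pair (0-2 each), report mean as % of max possible
    results = []
    for item in test_set:
        actual = answer_fn(item["question"])
        results.append((item["question"], score_fn(item["question"], item["expected"], actual)))
    pct = 100 * sum(s for _, s in results) / (2 * len(results))
    worst = sorted(results, key=lambda r: r[1])[:5]  # trace these first
    return pct, worst

# Stubs for illustration — replace with rag() and score_answer()
fake_rag = lambda q: "some answer"
fake_score = lambda q, e, a: 2 if "protocol" in q else 1
pct, worst = run_eval(
    [{"question": "testosterone protocol?", "expected": "..."},
     {"question": "clinic hours?", "expected": "..."}],
    fake_rag, fake_score)
print(f"groundedness: {pct:.0f}%")  # target >70%
```

Returning the worst-scoring questions alongside the aggregate is what makes the "fix the 5 lowest-scoring answers" step in the build below mechanical.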
Build this
Build a test set of 20 Q&A pairs from your clinic documents. Run your RAG on all 20 and auto-score. Fix the 5 lowest-scoring answers — trace why they failed (retrieval? chunking? prompt?).
Retain
  • Always build 20–30 test Q&A pairs before shipping anything
  • Groundedness: correct answers on in-doc questions (target >70%)
  • Faithfulness: refuses out-of-doc questions (target 100%)
  • LLM-as-judge: use a cheap model to auto-score
  • Never ship without running your test set
  • Ragas = industry-standard RAG eval framework — measures faithfulness, context precision, answer relevance. pip install ragas. Mention it in interviews even before you have used it.
4 / 11
CHUNK 05 / 11 · The Agent Loop

"From chatbot to something that acts"

The Difference

A chatbot says "the appointment is tomorrow at 3pm." An agent checks the calendar, finds a conflict, reschedules, sends the confirmation, and updates the EHR.

The Loop (this is all an agent is)

1. PERCEIVE  — receive input (user message, API response, tool output)
2. REASON   — decide what to do next (which tool? what parameters?)
3. ACT      — call the tool
4. OBSERVE  — read the tool's output
5. REPEAT   — go back to 1 until task is complete

In Code

import json

def agent(task, tools, max_steps=10):
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools  # Claude models support tool calling too, via the Anthropic SDK
        )

        if response.choices[0].finish_reason == "tool_calls":
            tool_call = response.choices[0].message.tool_calls[0]
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)

            result = execute_tool(tool_name, tool_args)  # your function

            messages.append(response.choices[0].message)
            messages.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": tool_call.id
            })
        else:
            return response.choices[0].message.content  # done

    return "Max steps reached"
Free alternative — 1-line swap, tool calling works identically
# Change only the client init — tool calling is OpenAI-compatible in Ollama:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Then use "llama3.2" as model — it supports tools/function calling
# The entire agent() loop above works without modification

What Can Go Wrong

Infinite loops → add max_steps. Tool errors → add error handling in tool output. Wrong parameters → better tool descriptions (Chunk 6). Hallucinated tool calls → validate inputs.
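One way to handle the tool-error case above: wrap execution so failures come back to the model as readable text instead of crashing the loop. The registry and tools here are hypothetical stand-ins for your execute_tool implementation:

```python
def safe_execute(tool_name, tool_args, registry):
    # Return errors as strings so the model can read them and retry,
    # rather than raising and killing the agent loop.
    if tool_name not in registry:
        return f"Error: unknown tool '{tool_name}'. Available: {list(registry)}"
    try:
        return str(registry[tool_name](**tool_args))
    except TypeError as e:
        return f"Error: bad arguments for {tool_name}: {e}"
    except Exception as e:
        return f"Error: {tool_name} failed: {e}"

registry = {"summarize": lambda text: text[:40] + "..."}
print(safe_execute("summarize", {"text": "Testosterone improves energy in many patients."}, registry))
print(safe_execute("search_pubmed", {"query": "DHEA"}, registry))  # unknown tool → readable error
```

The unknown-tool branch also catches hallucinated tool calls: the model sees the list of real tools in the error text and usually self-corrects on the next step.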

Build this
Agent with 2 tools: search_pubmed(query) (mock it) and summarize(text). Task: "Find recent research on testosterone in women over 50 and summarize it." Add print statements to see each step of the loop.
Retain
  • Agent = while loop: perceive → reason → act → observe → repeat
  • Agents act, chatbots talk — the difference is tool calling
  • Always set max_steps to prevent infinite loops
  • The model picks tools; you build the loop that executes them
  • Better tool descriptions > better model for fixing wrong tool choices
5 / 11
CHUNK 06 / 11 · Tool Definition

"How to give an agent hands"

The Mechanism

You describe tools in JSON schema. The LLM reads the description and decides when and how to use each tool. The description is the interface. Bad description = agent breaks.

What a Tool Must Have

1. Name — verb-first, specific (search_pubmed not tool1)
2. Description — when to use it (not just what it does)
3. Parameters — types and descriptions
4. Returns — what comes back

tools = [{
    "type": "function",
    "function": {
        "name": "search_medical_literature",
        "description": "Search PubMed for peer-reviewed medical studies. Use when the user asks about clinical evidence, treatment protocols, drug interactions, or any medical question requiring scientific backing. Returns titles, abstracts, and DOI links.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Medical search query. Be specific: include condition, treatment, population (e.g. 'testosterone replacement therapy women menopause')"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Number of results. Default 5, max 20.",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
}]

Structured Outputs — Force JSON (production-critical)

Unstructured text is unparseable. In health contexts, you need machine-readable outputs. Two approaches:

# OpenAI / Ollama — JSON mode
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # forces valid JSON output
    messages=[
        {"role": "system", "content": "Extract lab data. Return JSON only: {\"test\": str, \"value\": float, \"unit\": str, \"flag\": str}"},
        {"role": "user", "content": "TSH: 6.2 mIU/L (high)"}
    ]
)
import json
data = json.loads(response.choices[0].message.content)
# → {"test": "TSH", "value": 6.2, "unit": "mIU/L", "flag": "high"}

# Ollama's native /api/chat endpoint takes "format": "json" in the request body;
# its OpenAI-compatible endpoint also accepts response_format as above

When to use: tool outputs that feed other tools, structured patient data extraction, any agent output that code needs to parse.
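JSON mode guarantees syntactically valid JSON, not your schema — validate the shape before passing it downstream. A hedged sketch (the field names follow the lab example above; REQUIRED and parse_lab_json are illustrative names):

```python
import json

REQUIRED = {"test": str, "value": (int, float), "unit": str, "flag": str}

def parse_lab_json(raw):
    # json.loads guards against malformed output; the field check guards
    # against valid-but-wrong-shape JSON.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            return None, f"missing or mistyped field: {field}"
    return data, None

data, err = parse_lab_json('{"test": "TSH", "value": 6.2, "unit": "mIU/L", "flag": "high"}')
print(data, err)
data, err = parse_lab_json('{"test": "TSH"}')  # right JSON, wrong shape → caught here
print(err)
```

On a validation failure, a common pattern is one retry: feed the error message back to the model and ask it to re-emit the JSON.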

MCP — You Already Use This

MCP is a standardized way to package tools as servers that any agent can connect to. The GA4 and GSC tools in /update are MCP servers. When you build your own clinic tool → wrap it as MCP → any Claude agent can use it.

Build this
Define 3 tools for a health agent (implementations can be mocked). Write descriptions as if explaining to a smart assistant who has never used them. Test: give the agent a task requiring all 3 tools in sequence. If it picks the wrong tool → rewrite the description, not the code.

Add structured outputs to one tool: make it return a Pydantic-style JSON with defined fields. Verify the output is always parseable with json.loads().
Retain
  • Tool description = the interface. The model reads it to decide when and how to call
  • Name: verb-first, specific
  • Description must say WHEN to use it, not just what it does
  • MCP = standardized tool server — wrap your clinic tools in MCP
  • Wrong tool choice → fix description, not the model
6 / 11
CHUNK 07 / 11 · Agent Memory

"Agents that remember across sessions"

4 Types of Memory

Short-term — current conversation context. Limit: the model's context window (~200K tokens on current Claude models). You pay for every token on every call.

Long-term — external storage. Two patterns: Semantic/RAG (store facts, retrieve by similarity) + Episodic (events with timestamps).

User profile — structured JSON: name, conditions, medications, past decisions. Injected into system prompt at conversation start.

Working memory — scratchpad for multi-step tasks. Reset each session.

import json, chromadb

# User profile (key-value)
def load_profile(patient_id):
    try:
        with open(f"profiles/{patient_id}.json") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # first session: no profile yet

# Episodic memory (semantic search)
memory_db = chromadb.Client()
memories = memory_db.create_collection("patient_memories")

def store_memory(patient_id, text, session_date):
    memories.add(
        documents=[text],
        metadatas=[{"patient_id": patient_id, "date": session_date}],
        ids=[f"{patient_id}_{session_date}"]
    )

def recall_memories(patient_id, query, k=3):
    results = memories.query(
        query_texts=[query],
        where={"patient_id": patient_id},
        n_results=k
    )
    return results["documents"][0]

# Build system prompt with memory
def build_context(patient_id, current_query):
    profile = load_profile(patient_id)
    relevant_memories = recall_memories(patient_id, current_query)
    return f"""Patient profile:\n{json.dumps(profile, indent=2)}

Relevant past interactions:\n{chr(10).join(relevant_memories)}"""
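load_profile above has no write-side counterpart. One possible save_profile for the "store 2–3 key facts at session end" step — the file layout mirrors load_profile; merging by key (last write wins) is an assumption about how you want updates to behave:

```python
import json, os

def save_profile(patient_id, updates, directory="profiles"):
    # Merge session-end facts into the stored profile; last write wins per key
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{patient_id}.json")
    profile = {}
    if os.path.exists(path):
        with open(path) as f:
            profile = json.load(f)
    profile.update(updates)
    with open(path, "w") as f:
        json.dump(profile, f, indent=2)
    return profile

save_profile("patient_123", {"conditions": ["hypothyroidism"]})
print(save_profile("patient_123", {"medications": ["levothyroxine"]}))
```

Call it once at session end with only the facts worth keeping — dumping the whole transcript into the profile defeats the point of a structured summary.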
Build this
Agent that: (1) loads patient profile at session start, (2) stores 2–3 key facts at session end, (3) in next session retrieves relevant memories before responding. Test with 3 simulated sessions. By session 3 the agent should remember things from session 1.
Retain
  • 4 memory types: short-term (context), long-term (RAG), user profile (key-value), working (scratchpad)
  • Long-term = Chroma + patient_id metadata filter
  • Inject profile + relevant memories into system prompt at start
  • Store 2–3 key facts at end of each session
  • Without memory: every interaction is a first meeting. With it: it's a relationship.
7 / 11
CHUNK 08 / 11 · ReAct Pattern

"The architecture you'll use for 80% of real agents"

What ReAct Is

Reasoning + Acting, interleaved. The model explains its reasoning before each action. This improves reliability — the model can catch its own mistakes before they compound.

Thought: I need to find studies on testosterone in women. I'll search PubMed.
Action: search_medical_literature({"query": "testosterone women menopause"})
Observation: [3 studies returned]
Thought: The studies mention DHEA interaction. I should check drug interactions.
Action: check_drug_interactions({"drug_a": "testosterone", "drug_b": "DHEA"})
Observation: no contraindication found
Thought: I now have enough to answer.
Final Answer: Based on current literature...

Why It's Better

The Thought step = debuggable trace. If the action is wrong, the Thought tells you why. The model can self-correct after seeing the Observation. Without Thought steps: black box. With them: full trace.

REACT_PROMPT = """Use this format:
Thought: [what you need to do and why]
Action: [tool_name with parameters as JSON]
Observation: [you'll see the result here]
... repeat as needed ...
Thought: I have enough information.
Final Answer: [response]"""

def react_agent(question):
    # get_completion, parse_action, execute_tool: your helpers from Chunks 5-6
    messages = [
        {"role": "system", "content": REACT_PROMPT},
        {"role": "user", "content": question}
    ]
    for _ in range(10):
        response = get_completion(messages)
        if "Final Answer:" in response:
            return response.split("Final Answer:")[1].strip()
        if "Action:" in response:
            action_line = [l for l in response.split("\n") if l.startswith("Action:")][0]
            tool_name, args = parse_action(action_line)
            result = execute_tool(tool_name, args)
            messages.append({"role": "assistant", "content": response})
            messages.append({"role": "user", "content": f"Observation: {result}"})
    return "Max steps reached"  # same guard as the agent loop in Chunk 5
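The parse_action helper referenced above is left to you. A minimal sketch, assuming the model follows the 'Action: tool_name {json args}' format REACT_PROMPT asks for — brittle by design, but readable:

```python
import json, re

def parse_action(action_line):
    # Assumes 'Action: tool_name {"param": "value"}' — models mostly comply,
    # but expect to harden this against stray text in practice
    m = re.match(r"Action:\s*(\w+)\s*(\{.*\})?", action_line)
    if not m:
        raise ValueError(f"Unparseable action: {action_line}")
    tool_name = m.group(1)
    args = json.loads(m.group(2)) if m.group(2) else {}
    return tool_name, args

print(parse_action('Action: search_medical_literature {"query": "testosterone women"}'))
```

When parsing fails, feeding the ValueError text back as an Observation usually gets the model to re-emit a well-formed Action line.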
Build this
ReAct agent that answers: "Should a 52-year-old woman with hypothyroidism take DHEA?" It should: search pubmed, check drug interactions, reason over results, give a grounded answer. Count the steps. Under 5 = well-structured. 5–8 = acceptable, investigate if consistently high. More than 8 = tools returning too much noise — tighten tool descriptions or reduce k.
Retain
  • ReAct = Reasoning + Acting interleaved. Thought before every action.
  • Thought step = debuggable trace. Wrong action? The Thought tells you why.
  • Model can self-correct based on Observations
  • Implement by parsing "Action:" lines and injecting "Observation:" results
  • Use for: multi-step research, diagnosis support, anything requiring reasoning between tools
8 / 11
CHUNK 09 / 11 · MCP Servers

"Wrapping your tools so any agent can use them"

What MCP Actually Is

Instead of hardcoding tool functions inside one agent, you build a server that exposes tools → any Claude agent, Claude.ai, or Claude Code connects to it. You already use MCP servers: GA4, GSC in your /update briefings.

Why Standardize

Without MCP: rebuild the same clinic tools for every new agent. With MCP: build once → reuse across all agents.

// Node.js MCP server
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server({ name: "clinic-tools", version: "1.0.0" },
  { capabilities: { tools: {} } });

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [{
    name: "search_patient_history",
    description: "Search a patient's history for past consultations and test results. Use when patient mentions past treatments or you need health history context.",
    inputSchema: {
      type: "object",
      properties: {
        patient_id: { type: "string" },
        query: { type: "string", description: "What to search for in their history" }
      },
      required: ["patient_id", "query"]
    }
  }]
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "search_patient_history") {
    const { patient_id, query } = request.params.arguments;
    const result = await searchPatientHistory(patient_id, query);  // your implementation
    return { content: [{ type: "text", text: JSON.stringify(result) }] };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);

Add to Claude Code

// .mcp.json in the project root (Claude Code loads project-scoped MCP servers from here)
{
  "mcpServers": {
    "clinic-tools": {
      "command": "node",
      "args": ["/path/to/your/clinic-mcp-server.js"]
    }
  }
}
Build this — closes the loop
Take one existing script (researcher tool or the RAG from Chunk 3). Wrap it as an MCP server with one tool. Connect to Claude Code via settings.json. Ask Claude Code to use your tool in a conversation. If it works: you've closed the loop between everything in this roadmap.
Retain
  • MCP = standard for packaging tools as servers any LLM client can connect to
  • Build once → reuse across agents, Claude Code, Claude.ai
  • You already have tools worth wrapping: researcher, GA4, RAG pipeline
  • tools/list endpoint: defines what tools exist and when to use them
  • tools/call endpoint: executes and returns results
  • MCP = model-to-tool standard (Anthropic). A2A = agent-to-agent standard (Google, 2025). Both are emerging — knowing the names is enough for now.
9 / 11
CHUNK 10 / 11 · LangChain & LangGraph

"Industry-standard frameworks — 70% of job descriptions mention these"

Why Learn Frameworks After Building From Scratch

You built the agent loop (Chunk 5), tools (Chunk 6), memory (Chunk 7), and ReAct (Chunk 8) manually. Now LangChain/LangGraph wrap all of that into reusable components. If you started here, you wouldn't understand what's happening underneath. Now you do.

LangChain — RAG Made Declarative

Your 30-line RAG (Chunk 3) becomes 10 lines with composable, tested components.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.create_documents([your_text])
vectorstore = Chroma.from_documents(docs, embeddings)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True  # shows WHERE the answer came from
)

result = chain.invoke({"query": "What's the testosterone protocol for women?"})
print(result["result"])
print(result["source_documents"])  # attribution — required for health context

When LangChain adds value: source attribution needed · chaining multiple steps · LangSmith tracing (free, invaluable for debugging)

When it's overkill: simple single-step RAG → use Chunk 3 raw code, it's clearer

LangGraph — Stateful Agent Workflows

LangGraph adds explicit state management to agents. Instead of implicit state in a messages list, you define nodes, edges, and conditions as a graph.

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class AgentState(TypedDict):
    messages: List[dict]
    patient_id: str
    retrieved_docs: List[str]
    final_answer: str

def retrieve_docs(state: AgentState):
    query = state["messages"][-1]["content"]
    docs = vectorstore.similarity_search(query, k=3)
    return {"retrieved_docs": [d.page_content for d in docs]}

def generate_answer(state: AgentState):
    context = "\n\n".join(state["retrieved_docs"])
    question = state["messages"][-1]["content"]
    response = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
    return {"final_answer": response.content}

def needs_more_context(state: AgentState):
    if "I don't have information" in state.get("final_answer", ""):
        return "retrieve"  # loop back (LangGraph's recursion limit stops runaway loops)
    return "done"

workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve_docs)
workflow.add_node("answer", generate_answer)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "answer")
workflow.add_conditional_edges("answer", needs_more_context, {
    "retrieve": "retrieve",
    "done": END
})
app = workflow.compile()
result = app.invoke({
    "messages": [{"role": "user", "content": "What are the side effects of DHEA?"}],
    "patient_id": "patient_123", "retrieved_docs": [], "final_answer": ""
})

LangGraph vs Manual ReAct

              Manual ReAct (Chunk 8)     LangGraph
State         Implicit (messages list)   Explicit (typed dict)
Branching     Hard to add                First-class (conditional edges)
Debugging     Print statements           LangSmith visual trace
Production    Fragile at scale           Designed for it

Use manual ReAct for prototyping. Use LangGraph when the workflow has >3 steps or conditional logic.

LangSmith — 3 Lines to See Everything

Free observability for LangChain/LangGraph. Every tool call, retrieval, and LLM response becomes visible in a web UI. Interview answer for "how do you debug a production agent?"

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"    # free at smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "clinic-rag"  # groups your runs

# Now run any LangChain or LangGraph code — it traces automatically
result = chain.invoke({"query": "What is the testosterone protocol?"})
# smith.langchain.com → see retrieved docs, LLM input, LLM output, latency

Add to .env. Every run from Chunk 3 onward gets traced. You see exactly which document was returned, what the LLM was given, where the retrieval missed. Print statements are gone.

Build this — closes the framework loop
Build 1: Rewrite your Chunk 3 RAG using LangChain. Add return_source_documents=True. Verify it shows which doc the answer came from.

Build 2: Rewrite your Chunk 8 ReAct agent using LangGraph. Add one conditional edge: if the answer mentions "consult a doctor," route to a flag_for_review node instead of END.

Build 3: Enable LangSmith (3 lines above). Run a query. Go to smith.langchain.com and find the trace. Verify you can see the retrieved documents and the LLM prompt.

All three in portfolio = you match the stack in 70% of AI agent job descriptions.
Retain
  • LangChain = declarative RAG chains + standard interfaces + LangSmith tracing
  • LangGraph = explicit state machine for multi-step agents with conditional branching
  • return_source_documents=True → required for health context (attributable answers)
  • Manual first (Chunks 3+8), then framework — you know what it abstracts
  • LangGraph conditional edges = the clinical safety layer: escalate, flag, loop, stop
  • CrewAI = multi-agent framework (agents collaborating — one researches, one writes, one reviews). Appears in 37% of LangGraph job listings. Know the pattern even if you do not build it now.
10 / 11
CHUNK 11 / 11 · Deploy Your Agent

"A live URL beats a screenshot every time"

Why Deploy Matters More Than the Code

Clinics and employers don't read code. They click links. A working demo at your-rag.streamlit.app closes clients and gets interviews. A GitHub repo with no demo does neither. Deployment is not optional — it is the product.

Streamlit App (20 lines)

Streamlit turns any Python script into a web app. No HTML, no CSS, no server config.

# app.py — deploy your Chunk 3 RAG as a live demo
import streamlit as st
from your_rag import rag  # your function from Chunk 3

st.title("Clinic Protocol Assistant")
st.caption("Answers from clinic documents only. No hallucination.")

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    st.chat_message(msg["role"]).write(msg["content"])

if question := st.chat_input("Ask about a protocol..."):
    st.session_state.messages.append({"role": "user", "content": question})
    st.chat_message("user").write(question)
    with st.spinner("Searching protocols..."):
        answer = rag(question)
    st.session_state.messages.append({"role": "assistant", "content": answer})
    st.chat_message("assistant").write(answer)
# Run locally first — verify it works
pip install streamlit
streamlit run app.py
# Opens browser at localhost:8501

Deploy to Streamlit Cloud (free, permanent URL)

# 1. Push to GitHub (public repo, no secrets in code — use .env for API keys)
git init && git add . && git commit -m "clinic rag demo"
git remote add origin https://github.com/yourname/clinic-rag
git push -u origin main

# 2. Go to share.streamlit.io
# Connect your GitHub repo → select app.py → Deploy
# You get: https://your-app.streamlit.app (permanent, shareable)
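For the "no secrets in code" rule above: a small stdlib sketch that fails loudly at startup when the key is missing, instead of with a cryptic 401 at the first query. require_api_key is an illustrative helper name; on Streamlit Cloud you set the key in the app's Secrets panel rather than in the repo:

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    # Check once at startup so a missing key is obvious immediately
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} not set. Locally: export it or load it from .env. "
            "On Streamlit Cloud: add it in the app's Secrets settings."
        )
    return key

# client = OpenAI(api_key=require_api_key())
```

Pair it with a .gitignore entry for .env so the key can never reach the public repo.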

What to Put in the GitHub README (2 paragraphs, no fluff)

## Clinic RAG Assistant

AI chatbot that answers questions from clinic protocol documents — not from ChatGPT's
training data. When the answer isn't in the documents, it says so. That refusal
behavior is the point: no hallucination, no generic internet advice.

Stack: Python + Chroma (local vector DB) + sentence-transformers + Claude/GPT-4o-mini.
Demo: [your-app.streamlit.app] — live, working, ask it anything about the protocols.

Include your Ragas eval score if you ran it (Chunk 4). "Faithfulness: 0.91, Context Precision: 0.87" in the README signals production-readiness to employers.

Build this — the portfolio artifact
1. Take your Chunk 3 RAG function. Wrap it in the 20-line Streamlit app above.
2. Test locally — ask 5 patient questions. Verify refusal behavior for out-of-scope questions.
3. Push to GitHub with the 2-paragraph README.
4. Deploy to Streamlit Cloud. Get the permanent URL.

That URL is what you send to clinics and attach to job applications. Not code. Not a screenshot. A running product.
Retain
  • Streamlit = Python script → web app in 20 lines, no frontend knowledge needed
  • st.chat_input() + st.chat_message() = production-quality chat UI
  • Never hardcode API keys — use .env + python-dotenv, add .env to .gitignore
  • Streamlit Cloud is free for public repos — permanent URL, no server management
  • README Ragas score = "this person knows how to evaluate AI, not just build it"
  • The demo URL is the resume. Ship it by Day 16.
11 / 11