10 chunks · Build-first · Socratic tutor · Health domain
First pass (chunks 1→10): Read the chunk. Open a new Claude chat, paste the Socratic tutor prompt. Build the exercise. Don't move on until you can rebuild from memory.
Review passes (after day 7): Pick 3 RETAIN sections from different chunks (e.g. 2+5+8). Mix — never same-topic review.
| # | Topic | Day | Build |
|---|---|---|---|
| 0 | Setup | 0 | Install packages + verify API key |
| 1 | Embeddings | 1 | 10 medical terms similarity search |
| 2 | Vector Databases | 2 | Clinic FAQ in Chroma |
| 3 | RAG Pipeline | 3–4 | Clinic chatbot over real docs — €1,500 |
| 4 | Evaluation | 5 | 20-question test set |
| 5 | Agent Loop | 7 | 2-tool research agent |
| 6 | Tool Definition | 8 | 3-tool health agent |
| 7 | Memory | 9–10 | Patient profile + session memory |
| 8 | ReAct Pattern | 11–12 | Multi-step clinical reasoning agent |
| 9 | MCP Servers | 13–14 | Wrap researcher tool as MCP |
| 10 | LangChain/LangGraph | 15–16 | Rewrite RAG + ReAct in industry frameworks |
| 11 | Deploy Your Agent | 17 | Streamlit app → live URL → portfolio artifact |
"Run this once before Chunk 1 — takes 10 minutes"
Use this while learning. Runs entirely on your machine. No API keys, no billing, works offline.
# 1. Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# 2. Pull models (one-time download)
ollama pull llama3.2 # 2GB — main LLM for all exercises
ollama pull nomic-embed-text # 274MB — free embedding model
# 3. Verify
ollama run llama3.2 "Say hello" # should respond
# 4. Install Python packages
pip install openai chromadb sentence-transformers langchain-ollama langchain-community langgraph streamlit
All code in these chunks uses OpenAI(). To use Ollama instead, change only the client init:
# Replace this (paid):
from openai import OpenAI
client = OpenAI()
# With this (free, local):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# Everything else — tool calling, streaming, chat — is identical.
When you're ready to deploy and share with clients: get OpenAI or Anthropic API keys. The code is identical — only the client init changes. Until then, Ollama is faster and free.
# Optional: add to ~/.bashrc when ready
export OPENAI_API_KEY="sk-..." # platform.openai.com/api-keys
export ANTHROPIC_API_KEY="sk-ant-..." # console.anthropic.com
"Text you can do math on"
An embedding is a list of numbers (a vector) that represents the meaning of text. "heart attack" → [0.23, -0.87, 0.11, ...] (1536 numbers in OpenAI's model)
The key property: similar meanings → similar vectors. "heart attack" is numerically close to "myocardial infarction" and "cardiac event". Far from "chicken soup".
This is how you search by meaning, not keyword. You don't need the exact word — you need the concept.
The LLM was trained on billions of text examples. It learned that "heart attack" and "myocardial infarction" appear in similar contexts. The embedding is a compressed representation of that learned context. No understanding — just learned statistical co-occurrence.
Semantic search · Deduplication · Classification · RAG retrieval (the bridge between user question and relevant documents)
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(input="heart attack", model="text-embedding-3-small")
vector = response.data[0].embedding # 1536 floats
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2") # 80MB, downloads once
vector = model.encode("heart attack") # numpy array, 384 dimensions
# Similarity works the same way — same concept, different function
import numpy as np
def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# 1.0 = identical · 0.0 = unrelated · -1.0 = opposite
Or use Ollama: ollama pull nomic-embed-text → same OpenAI embeddings API at localhost:11434/v1
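Cosine similarity's scale is easy to sanity-check with toy 2-d vectors — pure numpy, no API calls (the vectors are invented for illustration):

```python
import numpy as np

def similarity(a, b):
    # cosine similarity: dot product normalized by the vectors' lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(similarity(a, np.array([2.0, 0.0])))   # 1.0 — same direction; length doesn't matter
print(similarity(a, np.array([0.0, 1.0])))   # 0.0 — orthogonal, unrelated
print(similarity(a, np.array([-1.0, 0.0])))  # -1.0 — opposite direction
```

Real embedding vectors behave the same way, just in 384 or 1536 dimensions instead of 2.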
text-embedding-3-small — cheap, fast, good enough
"A search engine for meaning"
You embed 10,000 patient FAQ entries. A user asks a question. Comparing the question vector against every stored vector works at this scale, but stops being real-time as the corpus grows. A vector database stores embeddings and retrieves the most similar ones in milliseconds, even with millions of documents.
Vector DBs use approximate nearest-neighbor (ANN) algorithms (HNSW, IVF) to find similar vectors without checking every single one. The trade-off: ~99% recall at roughly 1000x the speed of exact search.
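What the ANN index optimizes away is plain brute force — a numpy sketch that works fine at thousands of vectors and stops scaling at millions (`doc_vectors` here are toy 2-d stand-ins, not real embeddings):

```python
import numpy as np

def top_k_brute_force(query, doc_vectors, k=2):
    # cosine similarity of the query against EVERY stored vector — O(n) per query
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
    sims = doc_vectors @ query / norms
    return np.argsort(sims)[::-1][:k]  # indices of the k most similar docs

doc_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(top_k_brute_force(np.array([1.0, 0.0]), doc_vectors))  # docs 0 and 1 are most similar
```

An ANN index answers the same top-K question without touching every row — that is the entire reason vector databases exist.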
import chromadb
client = chromadb.Client()
collection = client.create_collection("clinic_docs")
# Add documents (Chroma auto-embeds if you give it text)
collection.add(
documents=["Testosterone therapy increases libido",
"HGH improves muscle mass",
"NAD+ supports mitochondrial function"],
ids=["doc1", "doc2", "doc3"]
)
# Query
results = collection.query(query_texts=["hormones for energy"], n_results=2)
print(results["documents"]) # most similar docs
Collection = a table of documents. n_results = how many similar docs to return. Metadata filtering = find similar docs, but only from category=hormones.
collection.add(documents=[...], ids=[...]) stores + auto-embeds
collection.query(query_texts=[...], n_results=K) retrieves top-K similar
"Giving LLMs access to your documents without hallucination"
Ask Claude "what's the protocol for testosterone therapy in women over 50?" — it answers confidently from 2023 training data. Wrong, outdated, or generic. RAG fixes this: retrieve your actual protocol document first, inject it into the prompt. Claude answers from real context.
# INDEXING (one time):
Documents → Split into chunks → Embed each chunk → Store in Chroma
# RETRIEVAL (every query):
User question → Embed question → Find similar chunks → Insert into prompt → LLM answers
Too large → multiple topics, retrieval noisy. Too small → loses context. Sweet spot: 300–500 tokens with 50-token overlap.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(your_document)
import chromadb
from openai import OpenAI
client = OpenAI()
db = chromadb.Client()
collection = db.create_collection("protocols")
collection.add(documents=chunks, ids=[f"chunk_{i}" for i in range(len(chunks))])
def retrieve(question, k=3):
results = collection.query(query_texts=[question], n_results=k)
return "\n\n".join(results["documents"][0])
def rag(question):
context = retrieve(question)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Answer based only on the provided context. If the answer isn't in the context, say so."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
]
)
return response.choices[0].message.content
print(rag("What's the testosterone protocol for women?"))
# Change only these 2 lines — rest of the RAG code is identical
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# Then use "llama3.2" as the model name:
response = client.chat.completions.create(model="llama3.2", messages=[...])
# For embeddings (free):
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
# In retrieve(), embed the question manually:
q_vector = embed_model.encode(question).tolist()
results = collection.query(query_embeddings=[q_vector], n_results=3)
| Failure | Symptom | Fix |
|---|---|---|
| Chunk too large | Irrelevant content mixed into answer | Reduce to 300–500 tokens |
| Chunk too small | Answer cuts off mid-concept | Increase size + add overlap |
| No overlap | Misses context at chunk boundaries | Add 50-token overlap |
| Query-doc mismatch | Right doc exists, not retrieved | HyDE: ask the LLM to write a hypothetical answer, then embed that — bridges vocabulary gap between how questions are phrased and how documents are written |
| Top-K too low | Right chunk ranked 4th, k=3 misses it | Increase k to 5–8 |
| LLM ignores context | Answers from training memory | Strengthen system prompt: "ONLY from context" |
Diagnosis rule: wrong retrieval → chunking/embedding issue. Right retrieval, wrong answer → prompt issue. Confident wrong answer → answer isn't in your docs.
"How to know it actually works"
You test with 5 easy questions. It works. You ship. Client uses it. Patient asks an edge case. It hallucinates confidently. You didn't find it because you only tested easy questions.
1. Groundedness — Build 20–30 Q&A pairs from your real documents. Score manually: 0 (wrong), 1 (partial), 2 (correct). Target: >70%.
2. Faithfulness — Ask 10 questions NOT in your documents. It should refuse every time. If it answers anyway = hallucination = not safe for health context.
3. Latency — Target: under 3 seconds. Over 5s = users abandon.
import time
start = time.time()
answer = rag("your question")
print(f"{time.time() - start:.2f}s")
def score_answer(question, expected, actual):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": f"""
Score this answer 0-2:
Question: {question}
Expected: {expected}
Actual: {actual}
Return only the number."""}]
)
return int(response.choices[0].message.content.strip())
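Turning per-question scores into the pass/fail number — the aggregation is pure Python; the commented lines assume the rag() and score_answer() helpers above:

```python
def pass_rate(scores, max_score=2):
    # fraction of available points earned across the whole test set
    return sum(scores) / (max_score * len(scores))

# test_set = [("What's the TSH protocol?", "expected answer"), ...]  # your 20-30 pairs
# scores = [score_answer(q, expected, rag(q)) for q, expected in test_set]
scores = [2, 2, 1, 0, 2]  # example run
print(f"groundedness: {pass_rate(scores):.0%}")  # target: >70%
```

Keep the scored runs in version control — a regression in this number after a chunking change is exactly the signal you want.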
pip install ragas — mention it in interviews even before you've used it.
"From chatbot to something that acts"
A chatbot says "the appointment is tomorrow at 3pm." An agent checks the calendar, finds a conflict, reschedules, sends the confirmation, and updates the EHR.
1. PERCEIVE — receive input (user message, API response, tool output)
2. REASON — decide what to do next (which tool? what parameters?)
3. ACT — call the tool
4. OBSERVE — read the tool's output
5. REPEAT — go back to 1 until task is complete
import json

def agent(task, tools, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Anthropic models support the same tool-calling pattern via their own SDK
            messages=messages,
            tools=tools
        )
        if response.choices[0].finish_reason == "tool_calls":
            tool_call = response.choices[0].message.tool_calls[0]
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)
            result = execute_tool(tool_name, tool_args)  # your function
            messages.append(response.choices[0].message)  # assistant turn must precede its tool result
            messages.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": tool_call.id
            })
        else:
            return response.choices[0].message.content  # done
    return "Max steps reached"
# Change only the client init — tool calling is OpenAI-compatible in Ollama:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# Then use "llama3.2" as model — it supports tools/function calling
# The entire agent() loop above works without modification
Infinite loops → add max_steps. Tool errors → add error handling in tool output. Wrong parameters → better tool descriptions (Chunk 6). Hallucinated tool calls → validate inputs.
Build two tools: search_pubmed(query) (mock it) and summarize(text). Task: "Find recent research on testosterone in women over 50 and summarize it." Add print statements to see each step of the loop.
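One way the exercise's mocks and the loop's execute_tool dispatcher might look — the study data is fabricated, and the error strings implement the failure handling listed above (unknown tools, wrong parameters, tool crashes all come back as text the model can react to):

```python
def search_pubmed(query: str, max_results: int = 3):
    # Mock: a real version would call the PubMed E-utilities API
    return [{"title": f"Study on {query}", "abstract": "...", "doi": "10.0000/fake"}][:max_results]

def summarize(text: str):
    return f"Summary: {text[:80]}"

TOOLS = {"search_pubmed": search_pubmed, "summarize": summarize}

def execute_tool(name, args):
    if name not in TOOLS:                  # hallucinated tool call
        return f"Error: unknown tool '{name}'. Available: {list(TOOLS)}"
    try:
        return TOOLS[name](**args)         # wrong parameter names raise TypeError
    except TypeError as e:
        return f"Error: bad arguments for {name}: {e}"
    except Exception as e:                 # tool crashed — feed the error back, don't kill the loop
        return f"Error: {name} failed: {e}"

print(execute_tool("search_pubmed", {"query": "testosterone women over 50"}))
print(execute_tool("lookup_ehr", {}))  # unknown tool → error string, not a crash
```

Returning errors as strings instead of raising keeps the agent loop alive — the model sees the error in the tool message and can retry with better arguments.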
"How to give an agent hands"
You describe tools in JSON schema. The LLM reads the description and decides when and how to use each tool. The description is the interface. Bad description = agent breaks.
1. Name — verb-first, specific (search_pubmed not tool1)
2. Description — when to use it (not just what it does)
3. Parameters — types and descriptions
4. Returns — what comes back
tools = [{
"type": "function",
"function": {
"name": "search_medical_literature",
"description": "Search PubMed for peer-reviewed medical studies. Use when the user asks about clinical evidence, treatment protocols, drug interactions, or any medical question requiring scientific backing. Returns titles, abstracts, and DOI links.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Medical search query. Be specific: include condition, treatment, population (e.g. 'testosterone replacement therapy women menopause')"
},
"max_results": {
"type": "integer",
"description": "Number of results. Default 5, max 20.",
"default": 5
}
},
"required": ["query"]
}
}
}]
Unstructured text is unparseable. In health contexts, you need machine-readable outputs. Two approaches — OpenAI's response_format JSON mode and Ollama's format parameter:
# OpenAI / Ollama — JSON mode
response = client.chat.completions.create(
model="gpt-4o-mini",
response_format={"type": "json_object"}, # forces valid JSON output
messages=[
{"role": "system", "content": "Extract lab data. Return JSON only: {\"test\": str, \"value\": float, \"unit\": str, \"flag\": str}"},
{"role": "user", "content": "TSH: 6.2 mIU/L (high)"}
]
)
import json
data = json.loads(response.choices[0].message.content)
# → {"test": "TSH", "value": 6.2, "unit": "mIU/L", "flag": "high"}
# Ollama equivalent (format param instead of response_format):
# Add "format": "json" to the requests.post() body
When to use: tool outputs that feed other tools, structured patient data extraction, any agent output that code needs to parse.
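json.loads() guarantees valid JSON, not valid fields. A validator sketch for the lab-result shape above, run before anything downstream consumes the data (field names follow the example schema, not a real standard):

```python
import json

# Expected shape of one lab result — mirrors the system-prompt schema above
REQUIRED = {"test": str, "value": (int, float), "unit": str, "flag": str}

def parse_lab_result(raw: str) -> dict:
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    for field, typ in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise ValueError(f"wrong type for {field}: {type(data[field]).__name__}")
    return data

print(parse_lab_result('{"test": "TSH", "value": 6.2, "unit": "mIU/L", "flag": "high"}'))
```

JSON mode forces syntax, not schema — the model can still omit a field or return a string where you need a float, so validate at the boundary.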
MCP is a standardized way to package tools as servers that any agent can connect to. The GA4 and GSC tools in /update are MCP servers. When you build your own clinic tool → wrap it as MCP → any Claude agent can use it.
Takeaway: parse every structured output with json.loads() before downstream code consumes it.
"Agents that remember across sessions"
Short-term — current conversation context. Limit: the model's context window (~128K–200K tokens, model-dependent). You pay for every token on every call.
Long-term — external storage. Two patterns: Semantic/RAG (store facts, retrieve by similarity) + Episodic (events with timestamps).
User profile — structured JSON: name, conditions, medications, past decisions. Injected into system prompt at conversation start.
Working memory — scratchpad for multi-step tasks. Reset each session.
import json, chromadb
# User profile (key-value)
def load_profile(patient_id):
    try:
        return json.load(open(f"profiles/{patient_id}.json"))
    except FileNotFoundError:  # no profile yet — start fresh
        return {}
# Episodic memory (semantic search)
memory_db = chromadb.Client()
memories = memory_db.create_collection("patient_memories")
def store_memory(patient_id, text, session_date):
memories.add(
documents=[text],
metadatas=[{"patient_id": patient_id, "date": session_date}],
ids=[f"{patient_id}_{session_date}"]
)
def recall_memories(patient_id, query, k=3):
results = memories.query(
query_texts=[query],
where={"patient_id": patient_id},
n_results=k
)
return results["documents"][0]
# Build system prompt with memory
def build_context(patient_id, current_query):
profile = load_profile(patient_id)
relevant_memories = recall_memories(patient_id, current_query)
return f"""Patient profile:\n{json.dumps(profile, indent=2)}
Relevant past interactions:\n{chr(10).join(relevant_memories)}"""
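load_profile above has no writing counterpart — a save_profile sketch under the same profiles/ directory assumption (the demo path and patient data here are made up):

```python
import json, os

def save_profile(patient_id, profile, base_dir="profiles"):
    # Persist the structured profile so load_profile() finds it next session
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, f"{patient_id}.json")
    with open(path, "w") as f:
        json.dump(profile, f, indent=2)
    return path

path = save_profile("patient_123", {"name": "A.", "medications": ["DHEA"]},
                    base_dir="/tmp/profiles_demo")
print(json.load(open(path)))
```

Call it at session end with whatever the agent learned — profile updates are what make session 2 feel like a continuation instead of a restart.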
"The architecture you'll use for 80% of real agents"
Reasoning + Acting, interleaved. The model explains its reasoning before each action. This improves reliability — the model can catch its own mistakes before they compound.
Thought: I need to find studies on testosterone in women. I'll search PubMed.
Action: search_medical_literature({"query": "testosterone women menopause"})
Observation: [3 studies returned]
Thought: The studies mention DHEA interaction. I should check drug interactions.
Action: check_drug_interactions({"drug_a": "testosterone", "drug_b": "DHEA"})
Observation: no contraindication found
Thought: I now have enough to answer.
Final Answer: Based on current literature...
The Thought step = debuggable trace. If the action is wrong, the Thought tells you why. The model can self-correct after seeing the Observation. Without Thought steps: black box. With them: full trace.
REACT_PROMPT = """Use this format:
Thought: [what you need to do and why]
Action: [tool_name with parameters as JSON]
Observation: [you'll see the result here]
... repeat as needed ...
Thought: I have enough information.
Final Answer: [response]"""
def react_agent(question):
    messages = [
        {"role": "system", "content": REACT_PROMPT},
        {"role": "user", "content": question}
    ]
    for _ in range(10):  # max steps — prevents infinite loops
        response = get_completion(messages)  # your LLM call wrapper
        if "Final Answer:" in response:
            return response.split("Final Answer:")[1].strip()
        if "Action:" in response:
            action_line = [l for l in response.split("\n") if l.startswith("Action:")][0]
            tool_name, args = parse_action(action_line)
            result = execute_tool(tool_name, args)  # your dispatcher (Chunk 5)
            messages.append({"role": "assistant", "content": response})
            messages.append({"role": "user", "content": f"Observation: {result}"})
    return "Max steps reached without a final answer"
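parse_action is left as a helper in the loop above — a minimal regex sketch, assuming actions arrive as tool_name({...json args...}) per the prompt format:

```python
import json
import re

def parse_action(action_line: str):
    # Expects: Action: tool_name({"param": "value"})
    m = re.match(r'Action:\s*(\w+)\s*\((.*)\)\s*$', action_line.strip())
    if m is None:
        raise ValueError(f"unparseable action line: {action_line!r}")
    name, raw = m.group(1), m.group(2)
    args = json.loads(raw) if raw.strip() else {}  # empty parens → no-arg tool
    return name, args

print(parse_action('Action: check_drug_interactions({"drug_a": "testosterone", "drug_b": "DHEA"})'))
```

Models occasionally emit malformed action lines; catching the ValueError and feeding "Observation: could not parse your action, use the exact format" back into messages usually self-corrects on the next step.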
"Wrapping your tools so any agent can use them"
Instead of hardcoding tool functions inside one agent, you build a server that exposes tools → any Claude agent, Claude.ai, or Claude Code connects to it. You already use MCP servers: GA4, GSC in your /update briefings.
Without MCP: rebuild the same clinic tools for every new agent. With MCP: build once → reuse across all agents.
// Node.js MCP server
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
const server = new Server({ name: "clinic-tools", version: "1.0.0" },
  { capabilities: { tools: {} } });
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [{
    name: "search_patient_history",
    description: "Search a patient's history for past consultations and test results. Use when patient mentions past treatments or you need health history context.",
    inputSchema: {
      type: "object",
      properties: {
        patient_id: { type: "string" },
        query: { type: "string", description: "What to search for in their history" }
      },
      required: ["patient_id", "query"]
    }
  }]
}));
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "search_patient_history") {
    const { patient_id, query } = request.params.arguments;
    const result = await searchPatientHistory(patient_id, query); // your lookup function
    return { content: [{ type: "text", text: JSON.stringify(result) }] };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});
const transport = new StdioServerTransport();
await server.connect(transport);
// .mcp.json (project root) — Claude Code loads MCP servers from this file
{
"mcpServers": {
"clinic-tools": {
"command": "node",
"args": ["/path/to/your/clinic-mcp-server.js"]
}
}
}
"Industry-standard frameworks — 70% of job descriptions mention these"
You built the agent loop (Chunk 5), tools (Chunk 6), memory (Chunk 7), and ReAct (Chunk 8) manually. Now LangChain/LangGraph wrap all of that into reusable components. If you started here, you wouldn't understand what's happening underneath. Now you do.
Your 30-line RAG (Chunk 3) becomes 10 lines with composable, tested components.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.create_documents([your_text])
vectorstore = Chroma.from_documents(docs, embeddings)
chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
return_source_documents=True # shows WHERE the answer came from
)
result = chain.invoke({"query": "What's the testosterone protocol for women?"})
print(result["result"])
print(result["source_documents"]) # attribution — required for health context
When LangChain adds value: source attribution needed · chaining multiple steps · LangSmith tracing (free, invaluable for debugging)
When it's overkill: simple single-step RAG → use Chunk 3 raw code, it's clearer
LangGraph adds explicit state management to agents. Instead of implicit state in a messages list, you define nodes, edges, and conditions as a graph.
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
class AgentState(TypedDict):
messages: List[dict]
patient_id: str
retrieved_docs: List[str]
final_answer: str
def retrieve_docs(state: AgentState):
query = state["messages"][-1]["content"]
docs = vectorstore.similarity_search(query, k=3)
return {"retrieved_docs": [d.page_content for d in docs]}
def generate_answer(state: AgentState):
context = "\n\n".join(state["retrieved_docs"])
question = state["messages"][-1]["content"]
response = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
return {"final_answer": response.content}
def needs_more_context(state: AgentState):
if "I don't have information" in state.get("final_answer", ""):
return "retrieve" # loop back
return "done"
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve_docs)
workflow.add_node("answer", generate_answer)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "answer")
workflow.add_conditional_edges("answer", needs_more_context, {
"retrieve": "retrieve",
"done": END
})
app = workflow.compile()
result = app.invoke({
"messages": [{"role": "user", "content": "What are the side effects of DHEA?"}],
"patient_id": "patient_123", "retrieved_docs": [], "final_answer": ""
})
| | Manual ReAct (Chunk 8) | LangGraph |
|---|---|---|
| State | Implicit (messages list) | Explicit (typed dict) |
| Branching | Hard to add | First-class (conditional edges) |
| Debugging | Print statements | LangSmith visual trace |
| Production | Fragile at scale | Designed for it |
Use manual ReAct for prototyping. Use LangGraph when the workflow has >3 steps or conditional logic.
Free observability for LangChain/LangGraph. Every tool call, retrieval, and LLM response becomes visible in a web UI. Interview answer for "how do you debug a production agent?"
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key" # free at smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "clinic-rag" # groups your runs
# Now run any LangChain or LangGraph code — it traces automatically
result = chain.invoke({"query": "What is the testosterone protocol?"})
# smith.langchain.com → see retrieved docs, LLM input, LLM output, latency
Add to .env. Every run from Chunk 3 onward gets traced. You see exactly which document was returned, what the LLM was given, where the retrieval missed. Print statements are gone.
Build: add return_source_documents=True and verify it shows which doc the answer came from. Extend the LangGraph workflow with a flag_for_review node instead of END.
return_source_documents=True → required for health context (attributable answers)
"A live URL beats a screenshot every time"
Clinics and employers don't read code. They click links. A working demo at your-rag.streamlit.app closes clients and gets interviews. A GitHub repo with no demo does neither. Deployment is not optional — it is the product.
Streamlit turns any Python script into a web app. No HTML, no CSS, no server config.
# app.py — deploy your Chunk 3 RAG as a live demo
import streamlit as st
from your_rag import rag # your function from Chunk 3
st.title("Clinic Protocol Assistant")
st.caption("Answers from clinic documents only. No hallucination.")
if "messages" not in st.session_state:
st.session_state.messages = []
for msg in st.session_state.messages:
st.chat_message(msg["role"]).write(msg["content"])
if question := st.chat_input("Ask about a protocol..."):
st.session_state.messages.append({"role": "user", "content": question})
st.chat_message("user").write(question)
with st.spinner("Searching protocols..."):
answer = rag(question)
st.session_state.messages.append({"role": "assistant", "content": answer})
st.chat_message("assistant").write(answer)
# Run locally first — verify it works
pip install streamlit
streamlit run app.py
# Opens browser at localhost:8501
# 1. Push to GitHub (public repo, no secrets in code — use .env for API keys)
git init && git add . && git commit -m "clinic rag demo"
git remote add origin https://github.com/yourname/clinic-rag
git push -u origin main
# 2. Go to share.streamlit.io
# Connect your GitHub repo → select app.py → Deploy
# You get: https://your-app.streamlit.app (permanent, shareable)
## Clinic RAG Assistant
AI chatbot that answers questions from clinic protocol documents — not from ChatGPT's
training data. When the answer isn't in the documents, it says so. That refusal
behavior is the point: no hallucination, no generic internet advice.
Stack: Python + Chroma (local vector DB) + sentence-transformers + Claude/GPT-4o-mini.
Demo: [your-app.streamlit.app] — live, working, ask it anything about the protocols.
Include your Ragas eval score if you ran it (Chunk 4). "Faithfulness: 0.91, Context Precision: 0.87" in the README signals production-readiness to employers.
st.chat_input() + st.chat_message() = production-quality chat UI
Secrets: .env + python-dotenv, add .env to .gitignore
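The .env pattern, spelled out — python-dotenv's load_dotenv() does this robustly; this stdlib-only version just shows the mechanics:

```python
import os

def load_env(path=".env"):
    # Minimal KEY=VALUE parser — use python-dotenv's load_dotenv() in real projects
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"')
            os.environ.setdefault(key.strip(), loaded[key.strip()])
    return loaded

# .env (gitignored):  OPENAI_API_KEY="sk-..."
load_env()  # call once at startup, before creating the OpenAI client
```

On Streamlit Community Cloud, put the same keys in the app's Secrets settings instead of committing a .env file.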