Build a Voice Tutor in an Afternoon
A hands-on walkthrough: stand up a real-time voice tutor that knows a syllabus, remembers each student, and renders math on a blackboard — no agent code.
You want an AI tutor your students can talk to — one that knows the actual syllabus, picks up where the last session left off, and can sketch a diagram while it explains the chain rule. The usual way to get there is a multi-week project: wire up a voice provider, manage a WebSocket, build a RAG pipeline, bolt on a memory store, and glue them together with a prompt loop you'll be debugging for a month.
This post does it in an afternoon, with zero agent code. We'll build a generic math (or chemistry — pick your subject) voice tutor on Matrix using four moves: create the agent, give it a syllabus, hand it a toolbox, and talk to it. Everything is configuration — a few POSTs, or the admin dashboard if you prefer clicking.
If you've read Personas as Data, Not Code, you already know the punchline: an agent here is a configured record, not a class. We're going to fill that record out.
Before you start
You need a workspace (an org) and an operator login. Sign up, then grab a token — every call below is org-scoped and JWT-gated. The full curl reference lives in docs/RUNBOOK.md; we'll use the same shapes.
export BASE=http://localhost:8080
TOKEN=$(curl -s -X POST $BASE/api/auth/login \
  -H 'content-type: application/json' \
  -d '{"orgSlug":"acme","email":"owner@acme.test","password":"…"}' \
  | jq -r .accessToken)
One thing to settle now: voice needs a secure context. getUserMedia only hands over the mic on localhost or HTTPS. Locally that's free; on a LAN IP you go through Caddy (docker compose up -d puts everything behind tls internal). Keep that in your back pocket for the last step.
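If you want to check a URL before handing it to a student, the secure-context rule can be sketched as a small shell predicate. This is a simplified approximation of what browsers do (it ignores cases such as `[::1]` and `*.localhost`), and the LAN address in the demo is a made-up example.

```shell
# Returns 0 when a page URL would count as a secure context for getUserMedia:
# any https:// origin, or plain http on a loopback host.
# Simplified sketch: real browsers apply a few more rules than this.
is_secure_context() {
  case "$1" in
    https://*) return 0 ;;
    http://localhost|http://localhost[:/]*) return 0 ;;
    http://127.0.0.1|http://127.0.0.1[:/]*) return 0 ;;
    *) return 1 ;;
  esac
}

# 192.168.1.20 is a hypothetical LAN address used only for illustration.
for url in \
  "http://localhost:8080/orgs/acme/agents/math-tutor/voice" \
  "http://192.168.1.20:8080/orgs/acme/agents/math-tutor/voice" \
  "https://192.168.1.20/orgs/acme/agents/math-tutor/voice"
do
  if is_secure_context "$url"; then
    echo "mic available: $url"
  else
    echo "mic blocked:   $url"
  fi
done
```

The middle URL is the one that catches people out: plain HTTP on a LAN IP loads the page fine but never gets the microphone.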
Step 1 — Create the tutor agent
An agent is an Agent entity. The fields that matter for a voice tutor are the persona (systemPrompt), the channels it serves, the voice it speaks in, and requiredCallerFields — the facts it should learn about whoever it's talking to.
AGENT=$(curl -s -X POST $BASE/api/orgs/acme/agents \
  -H "Authorization: Bearer $TOKEN" \
  -H 'content-type: application/json' \
  -d '{
    "properties": {
      "agentKey": "math-tutor",
      "name": "Math Tutor",
      "systemPrompt": "You are a patient, encouraging math tutor for school students. Explain one step at a time, check understanding before moving on, and use simple language. When a concept is visual, draw it. Never just give the answer — guide the student to it.",
      "channels": "TEXT_CHAT,VOICE_REALTIME,ASYNC_TASK",
      "voice": "Kore",
      "requiredCallerFields": "name,grade",
      "providerKey": "<your-provider-id>",
      "model": "gpt-4o-mini"
    }
  }')
AGENT_ID=$(echo "$AGENT" | jq -r .id)
A few notes on what you just set:
- systemPrompt is the whole persona. No code path is selected by this string; it's the persona slice of the assembled prompt, and you can edit it any time via PATCH, with the change taking effect on the next turn. Keep it tight; the heavy lifting on who the student is comes from memory (Step 4), not from stuffing the prompt.
- channels includes VOICE_REALTIME so the agent shows up on the browser voice page. Leaving TEXT_CHAT on means the same agent also works in chat: same prompt, same memory, no drift.
- voice is one of eight Gemini Live prebuilt voices: Aoede, Charon, Fenrir, Kore, Puck, Orus, Leda, Zephyr. Pick the one that fits the vibe; Kore and Leda read as warm and patient, which suits a tutor.
- requiredCallerFields is a CSV of facts the agent should learn about each student, here name and grade. We'll see what that buys you in Step 4.
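A typo in the voice name is easy to make and easy to catch before you POST. Here is a minimal pre-flight check against the eight names listed above. Whether the server matches names case-sensitively is an assumption here, and `is_known_voice` is a hypothetical helper, not part of Matrix.

```shell
# The eight prebuilt voice names from the list above.
VOICES="Aoede Charon Fenrir Kore Puck Orus Leda Zephyr"

# Returns 0 if $1 is exactly one of the listed names.
# Assumption: names are case-sensitive, so "kore" is rejected.
is_known_voice() {
  for _v in $VOICES; do
    [ "$_v" = "$1" ] && return 0
  done
  return 1
}

VOICE="Kore"
if is_known_voice "$VOICE"; then
  echo "voice ok: $VOICE"
else
  echo "unknown voice: $VOICE (expected one of: $VOICES)" >&2
fi
```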
Prefer not to curl? The dashboard does the same thing: Admin → Agents → New agent at /orgs/{slug}/admin/agents. The drawer has every field above, a voice picker, and channel toggles. Saving is a partial PATCH, so you only ever write the fields you touched.
scripts/create-teacher-agent.sh and scripts/create-math-teacher.sh are working end-to-end examples in the repo — clone one, edit the persona variables and AGENT_KEY at the top, and run it. They're the fastest way to get a reproducible agent definition you can re-run.
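If you adapt one of those scripts, or chain the curl calls from this post, it is worth guarding every captured id. `jq -r .id` prints the literal string `null` when the field is absent (for example, when the API returned an error body), and that string would otherwise flow silently into the next URL. A small sketch of such a guard follows; `require_id` is a hypothetical helper name.

```shell
# Echoes the value in $2 if it looks like a real id; otherwise prints an
# error naming $1 and returns non-zero.
# jq -r prints "null" for a missing field, so an empty check is not enough.
require_id() {
  case "$2" in
    ""|null)
      echo "error: $1 is missing; inspect the response body" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$2"
}

# Intended use with the create call from Step 1 (needs jq and a live server):
#   AGENT_ID=$(require_id "agent id" "$(echo "$AGENT" | jq -r .id)") || exit 1
```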
Step 2 — Give it the syllabus (Knowledge)
A tutor that doesn't know your curriculum will confidently teach the wrong textbook. Fix that by uploading the syllabus into a Knowledge corpus. Matrix chunks, embeds, and indexes each file automatically — no vector DB to provision, no embedding job to write.
Create the corpus first:
KB=$(curl -s -X POST $BASE/api/orgs/acme/knowledge \
  -H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{"properties":{
    "key":"algebra-grade-9",
    "name":"Grade 9 Algebra Syllabus",
    "kind":"FILES"
  }}')
KB_ID=$(echo "$KB" | jq -r .id)
Then drop your PDFs (or .md / .txt / .html) into it. Each upload is parsed, chunked at ~2,000 characters with 200-character overlap, embedded with text-embedding-005 (768d), and stored as KnowledgeChunk rows in Neo4j's HNSW index:
curl -s -X POST $BASE/api/orgs/acme/knowledge/$KB_ID/files \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@grade-9-algebra.pdf" | jq .
# → {"ok":true,"filename":"grade-9-algebra.pdf","bytes":…,"chunksWritten":N}
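To get a feel for what chunksWritten should roughly look like, the chunking numbers quoted above can be turned into window arithmetic: a 2,000-character window with 200 characters of overlap advances 1,800 characters at a time. The sketch below assumes plain fixed-size windows. The real chunker is described as approximate ("~2,000") and may well split on paragraph or sentence boundaries, so treat the result as a ballpark.

```shell
# Prints one "start end" pair per chunk (0-based, end exclusive) for a
# document of $1 characters.
# Assumption: fixed windows of $2 characters (default 2000) overlapping by
# $3 characters (default 200); the platform's actual chunker may differ.
chunk_windows() {
  len=$1
  size=${2:-2000}
  overlap=${3:-200}
  stride=$((size - overlap))
  start=0
  while :; do
    end=$((start + size))
    [ "$end" -gt "$len" ] && end=$len
    echo "$start $end"
    [ "$end" -ge "$len" ] && break
    start=$((start + stride))
  done
}

chunk_windows 5000
# 0 2000
# 1800 3800
# 3600 5000
```

Under that assumption, a 5,000-character file yields three chunks, and each neighbouring pair shares 200 characters so a sentence straddling a boundary still appears whole in one of them.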
Now attach the corpus to the agent. The moment you do, the agent automatically gains a search_knowledge(knowledge_key, query) tool — corpus-scoped, exact-cosine ranked, with source citations. You don't wire retrieval; attaching is the wiring.
curl -s -X PATCH $BASE/api/orgs/acme/agents/$AGENT_ID \
  -H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d "{\"properties\":{\"knowledge\":[$KB_ID]}}"
That's the whole RAG setup — a drag-and-drop in the dashboard if you'd rather not curl. For the full mechanics (and why retrieval auto-wires instead of needing plumbing), see RAG You Set Up by Dragging a PDF Into a Browser. If your subject benefits from connecting concepts across chapters — say, a proof that depends on a theorem three sections back — flip graphragEnabled: "true" on the corpus and ingestion also builds an entity/relation graph that retrieval walks one hop at a time.
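The retrieval tool is described as ranking by cosine similarity. If that term is new, here is a toy illustration of the measure itself using awk. It is not the platform's implementation: the vectors are made-up three-dimensional stand-ins for the 768-dimensional embeddings, and the chunk labels are hypothetical.

```shell
# Cosine similarity of two space-separated vectors of equal length,
# printed with three decimals. Exits non-zero on a dimension mismatch.
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m || n == 0) exit 1
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]; na += x[i] * x[i]; nb += y[i] * y[i]
    }
    if (na == 0 || nb == 0) { print "0.000"; exit 0 }
    printf "%.3f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

# Rank three toy "chunks" against a toy query vector, best match first.
query="0.9 0.1 0.0"
{
  echo "$(cosine "$query" "0.8 0.2 0.1") chunk-factoring"
  echo "$(cosine "$query" "0.1 0.9 0.2") chunk-graphing"
  echo "$(cosine "$query" "0.0 0.1 0.9") chunk-statistics"
} | sort -rn
```

Cosine compares direction rather than length, so a long chunk and a short chunk about the same topic score alike.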
Step 3 — Hand it a toolbox (optional)
A pure syllabus tutor is already useful. But you'll often want it to look something up that isn't in the corpus — a real-world example, a current value, a unit conversion. That's what the built-in toolbox is for: web_search, fetch_url, bash, file_read, file_write, file_list, grep, all sandboxed per (org, agent).
The clean way to attach them is the seeded toolkit-essentials skill, which bundles all seven into one reusable unit. Look it up by key and add it to the agent's skills:
# find the toolkit-essentials skill id
SKILL_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "$BASE/api/entities?type=Skill" \
  | jq -r '.[] | select(.properties.key=="toolkit-essentials") | .id')

# attach it
curl -s -X PATCH $BASE/api/orgs/acme/agents/$AGENT_ID \
  -H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d "{\"properties\":{\"skills\":[$SKILL_ID]}}"
web_search runs against DuckDuckGo with no API key, so this costs you nothing in quota. Skip this step entirely if you want a tutor that stays strictly inside the syllabus — that's a perfectly reasonable design choice for an exam-prep agent.
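One caveat on the two PATCH calls so far: both interpolate an id straight into a JSON array. That produces valid JSON only if ids are numeric; if your deployment returns string ids such as UUIDs, `[$KB_ID]` would need quotes. This post does not say which form Matrix uses, so the helper below is a defensive sketch that handles either. `json_id_array` is a hypothetical name, and it assumes ids contain no quotes or backslashes.

```shell
# Builds a JSON array from its arguments: purely numeric ids stay bare,
# anything else is wrapped in double quotes.
# Assumption: ids never contain quotes or backslashes (true for integers
# and UUIDs), so no further escaping is attempted.
json_id_array() {
  out=""
  for id in "$@"; do
    case "$id" in
      ""|*[!0-9]*) item="\"$id\"" ;;
      *)           item="$id" ;;
    esac
    out="${out:+$out,}$item"
  done
  printf '[%s]\n' "$out"
}

json_id_array 42                      # [42]
json_id_array "4f2a9c1e-uuid-example" # ["4f2a9c1e-uuid-example"]
```

It would slot into the request body as `-d "{\"properties\":{\"skills\":$(json_id_array "$SKILL_ID")}}"`.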
Step 4 — Memory: it remembers each student
Here's the part that turns a Q&A bot into a tutor. Every agent in Matrix gets caller-aware memory for free — and crucially, one memory pool per contact, shared across channels. A student who talked to the tutor on the phone and then opened the web chat is the same person to the agent, joined by Session.userId. It remembers the last session whether you call or type. (The full mechanics are in Agents That Actually Remember You.)
You don't configure any of this — it's on. What you can steer is the requiredCallerFields we set in Step 1. With name,grade declared, the agent's prompt gets a "what you still need to learn" checklist for any field it doesn't yet know, and the five built-in memory tools (update_contact_profile, add_contact_note, and friends) save what the student tells it. So the tutor naturally asks "what grade are you in?" once, remembers the answer, and never asks again — and a post-session pass distills each conversation into a durable digest plus long-lived facts.
The payoff in practice: session two opens with the agent already knowing the student's name, grade, and that they were stuck on factoring quadratics last time. No "remind me where we left off."
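The "what you still need to learn" checklist boils down to a set difference: required fields minus the fields already on the contact's profile. Matrix computes this for you; the sketch below only illustrates the idea, with `missing_fields` as a hypothetical helper and both inputs as plain CSV strings.

```shell
# Prints each required field (CSV in $1) that is absent from the known
# fields (CSV in $2), one per line. Prints nothing when all are known.
missing_fields() {
  known=",$2,"
  old_ifs=$IFS
  IFS=,
  for f in $1; do
    case "$known" in
      *",$f,"*) ;;          # already learned
      *) echo "$f" ;;       # still on the checklist
    esac
  done
  IFS=$old_ifs
}

# First session: nothing known yet, so both fields are outstanding.
missing_fields "name,grade" ""
# After the student gives their name, only grade remains.
missing_fields "name,grade" "name"
```

Once the list is empty there is nothing left to ask, which is why the tutor stops asking about grade after the first session.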
Step 5 — Talk to it
Everything's wired. Open the browser-direct voice page:
https://localhost/orgs/acme/agents/math-tutor/voice
Click Begin the conversation and the browser opens a WebSocket straight to Gemini Live. The backend only mints an ephemeral token, so no server of yours sits in the audio path and latency stays sub-second. Your mic streams up at 16 kHz PCM; the tutor's voice streams down at 24 kHz. Start speaking and it speaks back, with barge-in: interrupt it mid-sentence and it stops instantly.
Remember the secure-context rule from the top — that URL works on localhost as-is. On a LAN IP, hit it through Caddy so HTTPS (and therefore the mic) is available.
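If students will be on school Wi-Fi, the sample rates quoted above translate into bandwidth you can estimate up front. The arithmetic below assumes 16-bit mono samples, which this post does not state explicitly, and it ignores WebSocket framing and any encoding overhead.

```shell
# Bytes per second for raw PCM: sample_rate * (bits / 8) * channels.
# Assumption: 16-bit mono unless told otherwise.
pcm_bytes_per_sec() {
  rate=$1
  bits=${2:-16}
  channels=${3:-1}
  echo $((rate * bits / 8 * channels))
}

echo "mic up:     $(pcm_bytes_per_sec 16000) bytes/s"   # 32000
echo "voice down: $(pcm_bytes_per_sec 24000) bytes/s"   # 48000
```

That works out to roughly 256 kbit/s up and 384 kbit/s down per active student under those assumptions, before overhead.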
A nice touch for a math or chemistry tutor: the agent can render math and diagrams inline on a display_canvas blackboard. Ask it to show the steps of completing the square or sketch a reaction mechanism, and the explanation lands as rendered notation, not a wall of spoken symbols. It works in chat too, so the same agent draws whether the student is calling or typing.
Want to smoke-test the text path first? The same agent answers over SSE chat without any extra setup — start a conversation against /api/orgs/acme/agents/math-tutor/chat/conversations and stream a reply. Same prompt, same syllabus, same memory.
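When you stream that reply with curl, you get raw Server-Sent Events framing back. A small filter that keeps only the `data:` payloads makes the output readable. This post does not show the chat request body or the event payload shape, so the JSON in the demo is purely illustrative; only the SSE line format itself is standard.

```shell
# Reads an SSE stream on stdin and prints the payload of each data: line.
# Comment lines (starting with ":"), event:, and id: lines are dropped.
sse_data() {
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line%"$(printf '\r')"}   # tolerate CRLF line endings
    case "$line" in
      "data: "*) printf '%s\n' "${line#data: }" ;;
      "data:"*)  printf '%s\n' "${line#data:}" ;;
    esac
  done
}

# Demo on a canned stream; the payload fields are hypothetical.
sse_data <<'EOF'
: keep-alive
event: message
data: {"delta":"Let's factor"}

data: {"delta":" the quadratic."}
EOF
```

Against a live server you would pipe `curl -N` (no buffering) into `sse_data`, so tokens appear as they arrive.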
What you actually built
In one afternoon, with no agent code, you stood up:
- a real-time voice agent with a persona, on a prebuilt Gemini Live voice;
- that knows your syllabus via auto-wired RAG over uploaded PDFs;
- with an optional sandboxed toolbox for lookups outside the corpus;
- that remembers each student across phone and web, learning the facts you declared it needs;
- and can render math on a blackboard inline.
Every piece is configuration on a generic entity model. Swap the persona and the corpus and the same recipe gives you a chemistry tutor, a language coach, or an exam-prep drill sergeant — no fork, no redeploy. Edit the prompt and it's live on the next turn.
Takeaway: the hard parts of a voice tutor — the audio pipeline, retrieval, cross-channel memory — are platform features, not project work. You spend your afternoon on the teaching, not the plumbing.
Ready to build it? Create a workspace, run POST /api/orgs/{slug}/agents (or open Admin → Agents → New agent), drop your syllabus into a Knowledge corpus, and open the /voice page. Start from scripts/create-teacher-agent.sh if you want a working definition to adapt. Your first lesson is one upload away.