How I Built a Voice AI That Makes Phone Calls

Sebastjan Mislej | 2026-02-17 | 8 min read
Key Insight

A personal voice AI that calls you every morning costs about €9/month to run. The hardest part isn't the code. It's designing the right information density for a 60-second phone call.

My phone rings at 7:15 every morning. A voice tells me about my calendar, the weather in Ljubljana, and whether silver prices moved overnight. I built this thing myself. It costs me €9 a month.

Here's the full story. What works, what broke, and why I think voice is the most underrated AI interface right now.

Why Voice? Because I Never Read My Notifications

I had a perfectly good morning briefing system. Every day at 7 AM, my AI agent would compile a summary and send it to Telegram. Calendar events, weather, market data, email highlights. Clean markdown. Well-formatted.

I never read it.

Not because it was bad. Because Telegram is a firehose. By the time I opened the app, the briefing was buried under group messages, news channels, and random links. I'd scroll past it half the time.

Then I realized something obvious. Text is opt-in. Voice is opt-out. When your phone rings, you answer. When a message arrives, you might read it later. Or never.

So I built a voice AI that calls me instead.

The Stack: Twilio + ElevenLabs + OpenClaw

The system has three parts:

OpenClaw (Data & Logic) → Twilio (Phone Call) → ElevenLabs (Voice AI)

  1. OpenClaw is my AI agent framework. It gathers the briefing data (calendar, weather, prices, emails) and triggers the call.
  2. Twilio handles the actual phone call. It dials my number and connects the audio stream.
  3. ElevenLabs converts text to speech in real time. The voice sounds natural, not robotic.

The flow:

  • A cron job fires at 7:15 AM
  • OpenClaw compiles the briefing from live data
  • It triggers a Twilio call to my phone
  • Twilio connects to an ElevenLabs conversational agent
  • The agent reads the briefing and can answer follow-up questions
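The "triggers a Twilio call" step boils down to one authenticated POST to Twilio's Calls endpoint. Here is a minimal sketch using only the Python standard library. It builds the request without sending it. The function name, the placeholder credentials, the phone numbers, and the TwiML URL are all assumptions for illustration, not the code from my setup. A crontab entry such as `15 7 * * *` followed by the script path would fire it at 7:15.

```python
import base64
import urllib.parse
import urllib.request

TWILIO_API = "https://api.twilio.com/2010-04-01"


def build_call_request(account_sid, auth_token, to_number, from_number, twiml_url):
    """Build (but do not send) the HTTP request that asks Twilio to place a call."""
    endpoint = f"{TWILIO_API}/Accounts/{account_sid}/Calls.json"
    # Twilio fetches the call instructions (TwiML) from `Url` once the call connects.
    form = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Url": twiml_url}
    ).encode("ascii")
    request = urllib.request.Request(endpoint, data=form, method="POST")
    # HTTP Basic auth: account SID as the user name, auth token as the password.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    return request


# Placeholder values; real ones would come from environment variables.
req = build_call_request(
    "ACxxxxxxxx", "secret", "+38600000000", "+10000000000",
    "https://example.com/twiml",
)
print(req.get_method(), req.full_url)
# To actually place the call: urllib.request.urlopen(req)
```

Building the request separately from sending it makes the cron script easy to dry-run before it starts ringing your phone.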

The whole thing runs on a €5/month VPS. Twilio charges about €0.01 per minute. ElevenLabs has a free tier that covers my usage. Total cost: roughly €9/month including the server.

The TwiML Part

Twilio uses TwiML (Twilio Markup Language) to control calls. Here's the core of what happens when the call connects:

<Response>
  <Connect>
    <ConversationRelay
      url="wss://ai.sebastjanm.com/voice-agent"
      voice="alloy"
      welcomeGreeting="Good morning. Here's your briefing."
    />
  </Connect>
</Response>

The ConversationRelay tag is the key piece. It connects the phone call to a WebSocket endpoint where ElevenLabs handles the conversation. The caller hears a natural voice. They can interrupt, ask questions, or just listen.
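If the greeting changes from day to day, the TwiML can be generated rather than hard-coded. A minimal sketch with Python's built-in XML library follows. The function name and the example URL are hypothetical. The attribute names are the ones from the snippet above.

```python
import xml.etree.ElementTree as ET


def build_twiml(websocket_url, greeting, voice=None):
    """Return a TwiML document that hands the call to a ConversationRelay WebSocket."""
    attributes = {"url": websocket_url, "welcomeGreeting": greeting}
    if voice:
        attributes["voice"] = voice
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "ConversationRelay", attributes)
    # ElementTree escapes quotes and ampersands in attribute values for us.
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


print(build_twiml("wss://example.com/voice-agent", "Good morning. Here's your briefing."))
```

Generating the markup through an XML library instead of string formatting avoids broken TwiML when a greeting happens to contain a quote or an ampersand.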

Getting this right took me three evenings. The Twilio docs are decent but scattered across different pages. Most examples assume you want a basic IVR menu ("press 1 for sales"), not a full AI conversation.

Tip: If you stream raw call audio over a WebSocket yourself (Twilio Media Streams), you get audio chunks, not complete sentences. You need to handle buffering and silence detection yourself if you're building a custom backend. ElevenLabs' conversational agent handles this out of the box.
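If you do go the custom-backend route, the buffering and silence detection could look roughly like this. It is a sketch under assumptions: the audio has already been decoded to 16-bit signed PCM in native byte order, and the threshold and chunk counts are made-up starting values you would tune by ear. The class and function names are hypothetical.

```python
import math
from array import array


def rms(chunk: bytes) -> float:
    """Root-mean-square loudness of a chunk of 16-bit signed PCM samples."""
    samples = array("h")
    samples.frombytes(chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class UtteranceBuffer:
    """Collects audio chunks and releases them once the caller stops talking."""

    def __init__(self, silence_threshold=500.0, silent_chunks_to_flush=3):
        self.silence_threshold = silence_threshold
        self.silent_chunks_to_flush = silent_chunks_to_flush
        self._chunks = []
        self._silent_run = 0

    def feed(self, chunk: bytes):
        """Add one chunk; return the buffered utterance when it ends, else None."""
        if rms(chunk) < self.silence_threshold:
            if not self._chunks:
                return None  # leading silence: nothing to buffer yet
            self._silent_run += 1
            if self._silent_run >= self.silent_chunks_to_flush:
                utterance = b"".join(self._chunks)
                self._chunks, self._silent_run = [], 0
                return utterance
            return None  # short pause inside an utterance: keep waiting
        self._silent_run = 0
        self._chunks.append(chunk)
        return None
```

Each call to `feed` returns `None` until enough consecutive quiet chunks arrive, at which point it hands back everything the caller said so it can go to speech-to-text in one piece. Quiet chunks are dropped rather than stored, which keeps the example short.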

Three Ways It Broke

The Latency Problem

My first version had a 3-second delay between question and answer. Three seconds feels like forever on a phone call. People hang up.

The issue was my architecture. I was sending audio to a speech-to-text service, then to an LLM, then to text-to-speech, then back to the call. Four network hops. Each one added latency.

The fix: ElevenLabs' conversational API handles the full loop. Speech in, speech out, one round trip. Latency dropped to under 800ms. Still not instant, but good enough that the conversation feels natural.

Context Overload

I tried cramming everything into the briefing. Calendar, weather, emails, silver prices, gold prices, Bitcoin, task list, GitHub notifications. The call lasted 4 minutes. Nobody wants a 4-minute automated phone call at 7 AM.

Now I keep it under 90 seconds. Calendar and weather always. Market data only if something moved more than 2%. Emails only if something is flagged urgent. Less is more.
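Those rules are simple enough to express as a filter. The sketch below shows one way to do it. The function name, the argument shapes, and the exact wording of the sentences are assumptions for illustration.

```python
def compose_briefing(events, weather, market_moves, emails, move_threshold=2.0):
    """Build the spoken briefing: calendar and weather always, the rest only when notable."""
    parts = []
    if events:
        noun = "event" if len(events) == 1 else "events"
        parts.append(f"You have {len(events)} {noun} today: " + "; ".join(events) + ".")
    else:
        parts.append("No events today.")
    parts.append(f"Weather is {weather}.")

    # Market data only if something moved more than the threshold (in percent).
    for name, change in market_moves.items():
        if abs(change) > move_threshold:
            direction = "up" if change > 0 else "down"
            parts.append(f"{name} moved {direction} {abs(change):.1f}% overnight.")

    # Emails only if flagged urgent.
    urgent = [mail["subject"] for mail in emails if mail.get("urgent")]
    if urgent:
        parts.append("Urgent email: " + "; ".join(urgent) + ".")
    return " ".join(parts)
```

On a quiet day this produces two sentences, which is the point: the filter decides what gets left out, so the call length shrinks on its own.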

This lesson goes beyond voice AI. Any notification system needs to respect attention. The first version of anything tends to include too much. Cut until it hurts, then cut a little more.

Filler Phrases

The AI kept saying things like "What a wonderful question!" and "Absolutely!" before answering. Classic LLM behavior. Sounds fine in a chatbot. Sounds bizarre on a phone call.

I fixed this with a system prompt: "You are a concise morning briefing assistant. Never use filler phrases. Start answers immediately. Keep responses under 30 words unless asked to elaborate."
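When tweaking a prompt like this, it helps to check replies automatically instead of listening to every test call. Below is a small, hypothetical checker, not something from my setup. The list of filler openers is an assumption and will need extending for your model.

```python
SYSTEM_PROMPT = (
    "You are a concise morning briefing assistant. Never use filler phrases. "
    "Start answers immediately. Keep responses under 30 words unless asked to elaborate."
)

# Hypothetical list of openers to flag; extend it with whatever your model tends to say.
FILLER_OPENERS = ("what a wonderful question", "absolutely", "great question", "certainly")


def check_reply(reply, max_words=30):
    """Return a list of problems found in a model reply (an empty list means it passes)."""
    problems = []
    if reply.strip().lower().startswith(FILLER_OPENERS):
        problems.append("starts with a filler phrase")
    word_count = len(reply.split())
    if word_count >= max_words:
        problems.append(f"{word_count} words, limit is under {max_words}")
    return problems
```

Run a batch of sample questions through the agent, pass each answer to `check_reply`, and you can see whether a prompt change actually reduced the filler.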

Prompt engineering for voice is different from text. In text, a little warmth helps. On the phone, people want speed. I spent more time tweaking the system prompt than writing the actual code.

What a Typical Call Sounds Like

"Good morning. You have three events today. First, a standup at 9:15 on Teams. Then a client call at 11 on Google Meet with four people. And a one-on-one with Matej at 3 PM. Weather is 4 degrees and cloudy. Silver moved up 1.8% overnight. No urgent emails. Anything else?"

I usually say "no thanks" and hang up. Sometimes I ask a follow-up like "what's the silver price in euros?" and it answers from the data it already has.

The whole call takes 30 to 45 seconds on a normal day.

The Real Cost Breakdown

Total monthly cost: €9 | Average call length: 45s | Build time: 20h

Component                       Monthly cost
VPS (Hetzner)                   €5.00
Twilio calls (~30 min/month)    €3.50
ElevenLabs                      Free tier
OpenClaw                        Self-hosted
Total                           ~€9/month

I looked at commercial voice AI platforms for comparison. Most charge $0.10 to $0.15 per minute plus a monthly subscription. For my usage, that would be $30 to $40 per month. Building it myself costs a quarter of that and gives me full control.

The hidden cost is time. I spent about 20 hours over two weeks getting everything working. If your time is worth $100/hour, that is $2,000 of labor. Against savings of roughly $20 to $30 a month, it takes more than five years to break even on cost alone. For me it was worth it because I learned a lot and I have a system I can extend however I want.
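The arithmetic is easy to redo with your own numbers. This sketch uses the line items from the table; the function names are hypothetical, and the monthly saving you pass in depends on which commercial platform you compare against and on the exchange rate.

```python
def monthly_total(costs):
    """Sum the monthly line items (all in the same currency)."""
    return sum(costs.values())


def break_even_months(build_hours, hourly_rate, monthly_saving):
    """Months until the saved subscription fees cover the value of the build time."""
    if monthly_saving <= 0:
        raise ValueError("no saving, so the build never pays for itself")
    return (build_hours * hourly_rate) / monthly_saving


diy = monthly_total({"vps": 5.00, "twilio": 3.50, "elevenlabs": 0.0, "openclaw": 0.0})
print(f"DIY monthly cost: {diy:.2f}")  # 8.50, rounded up to about 9 in the table
```

Call `break_even_months` with your own hours, hourly rate, and expected saving to see how long the time investment takes to pay off.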

Why Voice AI Is Underrated

Most AI tools right now are text-based. Chatbots, copilots, search. All text. But the phone call is still one of the most powerful interfaces we have. It demands attention. It works while you're making coffee. It doesn't need you to look at a screen.

I've been running this system for two months now. I haven't missed a morning briefing since. Before, with the Telegram version, I'd miss two or three a week.

Voice also forces you to be concise. A chatbot can dump 500 words on you. A voice agent that talks for more than two minutes gets hung up on. This constraint makes you think harder about what information actually matters.

The tools are there. Twilio, ElevenLabs, and others have made voice AI accessible to solo developers. You don't need a call center budget. You need a weekend and some patience with WebSocket debugging.

What I'd Do Differently

Use Hosted Agents From Day One

Skip the custom WebSocket server. ElevenLabs' hosted solution handles 90% of what you need out of the box.

Add Conditional Logic

Make the briefing adapt. Busy day? Top 3 things only. Light day? Throw in a reminder about something you've been putting off.

Build a Snooze Feature

A simple "call me back in 10 minutes" voice command for when the phone rings and you're not ready.

What I Took Away From This

  • Voice AI for personal use is cheap and accessible in 2026. The tools exist. The APIs are good enough.
  • The biggest challenge isn't the code. It's designing the right information density for a 60-second phone call.
  • Latency kills voice experiences. Minimize network hops. Use end-to-end solutions when possible.
  • Prompt engineering for voice is its own skill. What works in a chatbot often fails on the phone.
  • Start simple. My best version is the one that does less, not more.

How hard is this to build?

If you know basic web development, you can get a working prototype in a weekend. The Twilio and ElevenLabs APIs are well-documented. The tricky part is latency optimization, which took me another week.

Does it work with any phone?

Yes. Twilio makes a regular phone call. It works on any phone number, mobile or landline. No app needed on the receiving end.

Can the AI handle real conversations?

Sort of. It handles simple follow-up questions well. Complex multi-turn conversations still feel clunky. I use it for briefings and quick lookups, not deep discussions.

What about privacy?

All data stays on my server. The briefing is compiled locally. Twilio handles call routing but doesn't store content. ElevenLabs processes audio through their API. I'm comfortable with the setup, but your tolerance may differ.

Could I use this for customer-facing calls?

You could, but I wouldn't yet. The latency and occasional awkward pauses are fine for a personal tool. For customer calls, people expect human-level responsiveness. We're close but not quite there.

Want to build your own voice AI?

Check out the tools mentioned in this article and start with a simple prototype. A weekend project today could become your most useful daily tool.
