Koca Ventures Ltd
71-75 Shelton Street
Covent Garden, London
WC2H 9JQ, United Kingdom
Registered in England & Wales16231043

ON-PREM PHONE VOICE AGENTS

A voice agent that answers your phones —running on your own hardware.

On-premise voice agents that pick up business calls, book and route and answer the routine questions, and keep every second of audio on your own network. Privacy and data control are the point — not a cheaper copy of a cloud platform. We're honest about the limits: natural turn-taking, not a human impersonation.

Sample inbound booking call — audio + live transcript and latency readout (coming soon)
WHO IT'S FOR

Where on-prem voice earns its place

01

Clinics & regulated practices

After-hours booking, insurance and hours questions, and triage-then-transfer — with every call recording and every word kept on your own hardware. The cleanest fit when patient data can't sit on a multi-tenant cloud.

02

Dealerships & dealer groups

Service-appointment booking, multi-location call routing, and lead capture when every line is busy. The agent handles the routine ask and routes the rest to the right desk.

03

Restaurants & hospitality

Reservations, party-size and availability, menu questions, and confirmations — so the phone stops pulling staff off the floor during a rush.

04

After-hours & overflow reception

The economic driver is the missed call: a large share of inbound arrives after hours or when the line is busy, and most callers never ring back. The agent covers overflow and out-of-hours, books or takes a message, and escalates anything real.

WHAT IT DOES

The routine calls, handled — the hard ones, handed over

Booking and rescheduling, call routing, answering the questions you're asked all day, taking callbacks, and covering the hours your team can't. The model is hybrid by design: the agent handles the routine majority of calls end to end and transfers anything real to a person, with the context already gathered. The economic case is simple — the calls you miss after hours or while every line is busy are calls that mostly never ring back.

THE ON-PREM STACK — REAL TOOLS, OWNED BY YOU

Every layer runs on hardware you control

01

Speech-to-text (local)

faster-whisper as the workhorse — robust across languages including Turkish, running on your own GPU. whisper.cpp is the CPU fallback where there's no NVIDIA card.

02

Turn-taking & barge-in (local)

Silero VAD for speech detection plus LiveKit's semantic turn-detector, so the agent knows that “I need to think about that…” isn't the end of a turn. It runs on CPU and does cover Turkish — a genuine plus.

03

The dialogue model (local)

A self-hosted LLM (Qwen3 or a Llama-class 8B) served by vLLM for concurrent callers or Ollama for the simple single-line case — kept fully in VRAM on the GPU so the agentic loop stays fast.

04

Text-to-speech (local)

Kokoro or Piper for commercial-safe local voices; XTTS-v2 for cloned voices where a licence allows it. This is the layer with the honest caveat below.

05

Telephony & the SIP bridge

A real phone number via a SIP trunk (Twilio or Telnyx) bridged into a self-hosted media server — Asterisk, FreeSWITCH, or LiveKit SIP. We keep your existing PBX and route only the lines you choose to the agent.

06

Orchestration

LiveKit Agents or Pipecat ties the pipeline together — streaming every stage, handling interruptions, and running on hardware you own. We can run it on your machines or, as a managed-on-prem option, on our own two-node GPU cluster (an RTX 4090 plus an RTX 3060).

Pipeline diagram — speech-to-text, turn detection, local LLM, text-to-speech, SIP (coming soon)
HONEST LIMITS

What on-prem voice can't do (yet) — said plainly

01

Latency is ~0.5–1.2 seconds, not human-equal

A well-tuned local stack lands around half a second to just over a second, end to end. That's natural, interruptible turn-taking — but a human leaves roughly a 200ms gap, and even the fastest cloud speech is around 0.8–1.1s. We won't tell you it's indistinguishable from a person, because it isn't.

02

Turkish text-to-speech is the weak link

The commercial-safe local Turkish voice (Piper) is more robotic; the most natural one (XTTS-v2) carries a non-commercial licence that needs a separate agreement before production use. Turkish speech-to-text and turn-taking are solid — but on TTS naturalness, we set expectations honestly rather than over-sell.

03

On-prem trades cloud uptime for data control

Cloud platforms give you 99.9%+ geo-redundant uptime out of the box. Running on your own hardware means a failure on your premises is a real event that needs handling — which is why on-prem comes with a maintenance and monitoring retainer, and a hybrid failover path where it's warranted.

04

Sometimes cloud is simply the better fit

Below a certain call volume, a managed cloud platform is cheaper and faster to stand up, and we'll say so. On-prem makes sense when data sovereignty, regulation, or sustained high volume are real constraints — not as a default for everyone.

We don't sell “an AI that replaces your staff” or a voice that's indistinguishable from a human — neither is true, and you'd find out on the first call. What we build is a voice system you own: it runs where your data lives, it handles the routine load, and it's honest about where it stops and a person takes over.

Privacy by reduction: because the audio never leaves your network, there's no third-party processor in the loop and no cloud recording of your calls — the cleanest path under data-residency rules. This is on-prem / offline-capable voice, and it sits alongside our other on-premise agentic systems. It is a separate capability from our on-prem real-estate CRM, which does document intelligence and has no voice component.

QUESTIONS

Straight answers

How is this different from Vapi, Retell, or ElevenLabs?

Those are excellent cloud platforms and they're fast to ship — we won't pretend to beat them on out-of-the-box convenience or raw voice naturalness. Our wedge is different: the whole pipeline runs on your hardware, so the audio, the transcripts, and the caller data never leave your network. If your reason for not adopting voice was 'I'm not putting call recordings on a US multi-tenant cloud,' that's exactly the gap we work in.

Will callers think it's a real person?

No — and we won't claim that. A human leaves about a 200ms gap in conversation; a well-tuned local stack lands around 0.5–1.2 seconds end to end. That's natural, interruptible turn-taking, not a human impersonation. The honest framing is an agent that handles the routine majority of calls cleanly and hands the hard ones to a person.

What about Turkish?

We'll be plain about this: Turkish text-to-speech is the current weak link. The commercial-safe local option (Piper) is more robotic, and the most natural local option (XTTS-v2) carries a non-commercial licence that needs a separate agreement to use in production. Speech-to-text in Turkish is solid with the Whisper family, and the turn-taking model does cover Turkish — but on TTS naturalness we set expectations honestly rather than over-promise.

What's the catch with running it on-prem?

On-prem trades cloud's instant scale and geo-redundancy for data control you actually own — and it adds a maintenance burden. A cloud platform offers 99.9%+ uptime across regions; a hardware failure on your premises is a real-world event someone has to handle. We cover that with a monitoring and maintenance retainer, and a hybrid failover path where it makes sense. Below a certain call volume, cloud is simply cheaper, and we'll tell you when that's you.

How do you price it?

On-prem flips the cloud per-minute model. Instead of paying per minute, you pay a one-time build and integration fee (call-flow design, SIP integration into your existing PBX, model selection and tuning, handoff logic, deployment), then a flat monthly or per-line capacity charge because the compute is owned, not metered — plus a maintenance and monitoring retainer. The PSTN carrier's per-minute cost is unavoidable and passes through. We quote per engagement after we understand the call flow; there's no list price.

Is this the same as your property CRM?

No — they're separate. Our on-premise real-estate CRM does document intelligence and retrieval; it has no voice or phone component. Voice agents are a distinct capability: a new system, built around your call flow, that you own and run where your data lives. Don't read voice into the CRM, or the CRM into this.

Last reviewed:

READY TO TALK?

Tell us about your call flow

Tell us where the phone hurts — missed after-hours bookings, overflow on busy lines, routine questions eating your team's day — and we'll scope an on-prem voice agent around your real call flow. Pricing is per engagement, quoted after we understand the work.