Deepfake Detection with Voice AI: How Real-Time AI Stops Fraud & Security Threats | Carter Huffman
In this episode of An Hour of Innovation podcast, Vit Lyoshin speaks with Carter Huffman, CTO and co-founder of Modulate AI, about why voice is one of the most complex and powerful frontiers in artificial intelligence.
While most AI breakthroughs have focused on text and images, voice introduces an entirely different layer of nuance: tone, emotion, cadence, cultural context, background noise, and real-time variation. Carter explains why understanding human speech requires more than transcription; it demands contextual, emotional, and situational awareness. That complexity is exactly what makes voice AI critical in high-stakes environments.
The conversation explores how Modulate’s technology detects fraud, harassment, and even deepfake voices in real time. In gaming, voice AI helps reduce toxic behavior that drives users away. In call centers and healthcare systems, it can flag social engineering attempts and prevent security breaches before they happen. Carter shares how their ensemble model approach, using many small, specialized AI models instead of a single massive system, enables high accuracy at scale while keeping costs manageable.
Beyond the technical details, the discussion dives into ethics. The same technology that protects users and organizations could also be misused for censorship. Carter emphasizes the importance of human oversight, escalation systems, and responsible deployment.
The episode closes with a broader look at the future of voice interfaces, from today’s assistants to the long-standing “Star Trek computer” vision, and what it will take to build AI systems that truly understand not just words, but meaning.
Carter Huffman is the CTO and Co-Founder of Modulate AI, a leader in voice AI and conversational AI security. With a background in physics and audio signal processing, he has spent over a decade advancing audio machine learning systems that understand emotion, intent, and context in human speech. His work powers AI moderation systems in major gaming platforms and strengthens AI security in call centers and hospitals. In this episode, he offers rare insight into how AI voice detection works under the hood and where the future of deepfake defense is headed.
Takeaways
- Voice AI can detect a deepfake voice within the first two seconds of a phone call.
- Toxicity detection isn’t about keywords: sarcasm, tone, and context completely change meaning.
- A single toxic voice interaction can drive gamers away permanently, creating massive churn.
- Real-time AI fraud prevention must operate at low latency and high accuracy simultaneously.
- Ensemble AI models (many small specialized models) outperform one large general model in cost and precision.
- Audio AI systems often fail when the microphone setups or recording environments slightly change.
- Social engineering attacks rely on emotional pressure, which AI can detect through conversational patterns.
- AI can escalate suspicious calls to supervisors or automatically lock accounts before fraud succeeds.
- Speaker identification allows AI to track participants within a meeting without tracking them across calls.
- Synthetic voice detection doesn’t automatically mean malicious intent; assistive tech must be considered.
- AI moderation systems must include human review and appeals to remain ethical and compliant.
- The same Voice AI technology that prevents fraud could be misused for censorship if deployed unethically.
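To make the ensemble idea above concrete, here is a minimal, purely illustrative sketch of how several small specialized detectors can be combined into one fraud score. All function names, feature meanings, weights, and thresholds are hypothetical assumptions for illustration; they are not Modulate AI's actual models or parameters.

```python
# Illustrative only: each "specialist" is a tiny model scoring one narrow
# signal, and a weighted average combines them. Feature indices, weights,
# and thresholds are made-up assumptions, not real production values.

def spectral_artifact_score(features):
    # Hypothetical: high-frequency synthesis-artifact energy at features[0].
    return min(1.0, features[0] / 0.8)

def cadence_score(features):
    # Hypothetical: unnaturally regular speech timing at features[1].
    return min(1.0, features[1] / 0.6)

def prosody_score(features):
    # Hypothetical: flatness of emotional contour at features[2].
    return min(1.0, features[2] / 0.7)

SPECIALISTS = [
    (spectral_artifact_score, 0.5),  # (model, weight)
    (cadence_score, 0.3),
    (prosody_score, 0.2),
]

def ensemble_deepfake_score(features):
    """Weighted average of the small specialists' scores, in [0, 1]."""
    return sum(weight * model(features) for model, weight in SPECIALISTS)

def is_suspected_deepfake(features, threshold=0.5):
    return ensemble_deepfake_score(features) >= threshold

# A synthetic-sounding sample: strong artifacts, regular cadence, flat prosody.
print(is_suspected_deepfake([0.9, 0.7, 0.8]))  # True
# A natural-sounding sample scores well below the threshold.
print(is_suspected_deepfake([0.1, 0.1, 0.1]))  # False
```

Because each specialist is small and independent, the ensemble can run at low latency on short audio windows, and individual detectors can be retrained or swapped without rebuilding the whole system, which is the cost and precision advantage the episode describes.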
Timestamps
00:00 Introduction
01:30 What Is Modulate AI?
03:09 Why Voice AI Is Harder Than Text AI
06:54 The Evolution of Voice AI in Gaming
10:26 How Modulate AI Works
19:05 Voice AI in Various Industries
26:31 Ethical Considerations in Voice AI Technology
32:40 Ethics in AI: Balancing Good and Bad Uses
34:32 Audio Machine Learning Challenges
41:09 The Future of Voice AI
45:45 Connect with Carter
46:31 Innovation Q&A
Connect with Carter
- Website: https://www.modulate.ai/
- LinkedIn: https://www.linkedin.com/in/carter-huffman-a9aba05b/
This Episode Is Supported By
- Google Workspace: Collaborative way of working in the cloud, from anywhere, on any device - https://referworkspace.app.goo.gl/A7wH
- Webflow: Create custom, responsive websites without coding - https://try.webflow.com/0lse98neclhe
- Monkey Digital: Unbeatable SEO. Outrank your competitors - https://www.monkeydigital.org?ref=110260
For inquiries about sponsoring An Hour of Innovation, email iris@anhourofinnovation.com
Connect with Vit
- LinkedIn: https://www.linkedin.com/in/vit-lyoshin/
- Substack: https://anhourofinnovation.substack.com/
- X: https://x.com/vitlyoshin
- Website: https://vitlyoshin.com/contact/
Episode References
Call of Duty
https://www.callofduty.com
A major online multiplayer game referenced as one of the platforms where Modulate AI's moderation technology is deployed.
Grand Theft Auto
https://www.rockstargames.com/
A widely known game mentioned as using voice moderation technology to detect toxic behavior.
Fortnite
https://www.fortnite.com
Referenced when discussing the company's early product idea of "voice skins," an analogy to Fortnite's cosmetic skins.
Zoom
https://zoom.us
Mentioned in two contexts: first as part of troubleshooting audio interference, and later as a tool used for maintaining long-distance relationships via video calls.
Siri
https://www.apple.com/siri/
Apple’s voice assistant referenced as an early attempt at creating conversational voice interfaces.
Alexa
https://www.amazon.com/alexa
Amazon’s voice assistant cited as another example of voice interfaces attempting to fulfill the “Star Trek computer” vision.
Star Trek
https://www.startrek.com
Referenced as the aspirational model for fully context-aware conversational AI — the “talk to the computer and it understands everything” vision.
Stephen Hawking
https://www.britannica.com/biography/Stephen-Hawking
Mentioned in a discussion about synthetic voices and identity, highlighting how even robotic voices can carry meaning and character.
Reddit
https://www.reddit.com
Used as an example when discussing AI-generated social media content and the difficulty of distinguishing real from synthetic posts.