Deepfaking My Voice

I've always loved music production and audio. I've also been interested in the ability of AI/ML to clone voices, but hadn't had a chance to really try it out. The purpose of this project was to learn a little more about the options that are out there, and how effective one-shot learning could be.

Additionally, I thought it'd be fun to attempt to run the inference API locally and test out some of Cloudflare's Zero Trust networking capabilities rather than finding some cloud GPU to host these open-source models.

The TL;DR for my open-source attempts: the results aren't that good with any of the paths I took. That being said, it was still interesting to examine the patterns of each and what is missing. My attempt with the ElevenLabs API was extremely surprising and impressive; it's the closest to my voice I've heard.

Open Source Models Demo

Here's a demo of the outcome. It's poor at best in both open-source model cases (Bark and OpenVoice), while the ElevenLabs API integration sounds really close! To allow deploying this on my little homelab server (which doesn't have a CUDA-compatible GPU), I've disabled the Bark and OpenVoice voices.

Voice Cloning Process

I'd say that I did more 'tinkering' than approaching this via the scientific method. That being said, after not having immediate luck getting something to sound like me (sadly), I attempted to set up a common audio source for apples-to-apples one-shot training.

Ultimately, to get good results I'd likely need to build a broader corpus of my speech and corresponding transcriptions. I may do this eventually but I'll set this aside for now.

One-Shot Training Audio

To get a feel for my voice, here's the one-shot training audio I used. It was a combination of me reading "I am a Bunny" by Ole Risom and Richard Scarry in English and (poor) Spanish. Additionally, I read part of the README of the faster-whisper repo:

I used it to fine-tune a Bark model and as the one-shot reference for OpenVoice. I'll get into more detail on those processes below.

Bark Voice Cloning

Both the Suno-AI repo and the Serp-AI fork that enables the cloning (which is what I used) have pretty good walkthroughs for how to set it up for audio generation (and fine-tuning in the Serp-AI case).
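For reference, generation with a speaker prompt boils down to a few lines. Below is a minimal sketch based on the standard Bark API; the voices/owyn.npz path is a placeholder for a prompt produced by the Serp-AI fine-tuning process, not a real file from this project.

```python
# Minimal Bark generation sketch (standard suno-ai/bark API).
# "voices/owyn.npz" is a placeholder for a fine-tuned speaker prompt from the
# Serp-AI fork; a built-in preset like "v2/en_speaker_6" also works here.
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # downloads/loads the text, coarse, and fine models

text = "Hello, this is a test of my cloned voice."
audio_array = generate_audio(text, history_prompt="voices/owyn.npz")

write_wav("bark_clone.wav", SAMPLE_RATE, audio_array)
```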

The fully generative approach that Bark employs with its standard voice presets yields really good results (mostly).

That being said, I wasn't able to find a fine-tuning setup that produced a voice that even stayed consistent from phrase to phrase, let alone sounded like me.

For more general TTS use cases, Bark seems like a great option, and I could probably do more trials and fine-tuning with larger sample sets, but I didn't have success here.

OpenVoice Voice Cloning

Feeling a bit despondent, I looked at a few other options but landed on OpenVoice for a second trial. OpenVoice's approach is a bit different: it leverages a more standard text-to-speech system (in this case MeloTTS) and then transforms the 'color/tone' of the voice based on the reference audio.
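In practice that pipeline looks roughly like the sketch below, modeled on the OpenVoice V2 demo; the checkpoint paths and reference clip name are placeholders rather than the exact code I ran.

```python
# Rough shape of the OpenVoice V2 flow: MeloTTS base speech + tone color conversion.
# Checkpoint paths and the reference clip name are placeholders.
import torch
from melo.api import TTS
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the tone color converter from the released V2 checkpoints.
converter = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# 2. Extract my tone color embedding from the one-shot reference recording.
target_se, _ = se_extractor.get_se("owyn-reference3.wav", converter, vad=False)

# 3. Generate the base speech with MeloTTS.
tts = TTS(language="EN", device=device)
speaker_id = tts.hps.data.spk2id["EN-US"]
tts.tts_to_file("Hello, this is a test of my cloned voice.", speaker_id, "base.wav")

# 4. Re-color the base speech toward my voice.
source_se = torch.load("checkpoints_v2/base_speakers/ses/en-us.pth", map_location=device)
converter.convert(audio_src_path="base.wav", src_se=source_se,
                  tgt_se=target_se, output_path="openvoice_clone.wav")
```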

The results do carry some of my tone, but they reinforced that more work is needed: the output didn't match my cadence or speech patterns, and therefore still didn't sound like me.

ElevenLabs Voice Cloning

Given that the two open-source attempts didn't pan out, I figured I'd give ElevenLabs a try for the $5 it'd cost to test out.

I went to Voices > Add a new voice > Instant Voice Clone and used the same one-shot training audio linked above. I added a few tags (accent: American, gender: male, language: English) and a brief description, and created the voice.

Here's the output compared to similar prompt examples from the open-source models:

Crazy! It sounds very much like me.

Here it is wired through the API with the same prompt used for the others:
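Under the hood, the integration is essentially one HTTP call to ElevenLabs' text-to-speech endpoint. Here's a minimal sketch using the raw REST API; the voice ID and model ID are placeholders, not the exact values my API uses.

```python
# Minimal ElevenLabs text-to-speech call via the REST API.
# The voice ID, model ID, and API key are placeholders for this sketch.
import os
import requests

VOICE_ID = "your-cloned-voice-id"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

resp = requests.post(
    url,
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Hello, this is a test of my cloned voice.",
        "model_id": "eleven_multilingual_v2",
    },
    timeout=60,
)
resp.raise_for_status()

# The response body is audio (MP3 by default).
with open("elevenlabs_clone.mp3", "wb") as f:
    f.write(resp.content)
```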

Turning The Open Source Models into an API

To enable this little demo, and to test out Cloudflare Zero Trust Tunnels, I wanted to set this up as a little API. I run the API locally on my desktop with my 3-generation-old GPU, and the website has access to it.

The API is a little FastAPI Python app, and the source code is owyn-voice-api.

Ultimately it is really simple: it preloads both the Bark and OpenVoice models on startup and runs inference on demand based on the /speak_as/{voice_name} parameter. OpenVoice only has one voice_name right now (owyn-reference3); every other voice_name expects an .npz file trained via the Bark fine-tuning process.

There is a little cache that keeps previously generated prompt/voice_name wav file combos around.
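The general shape looks something like the sketch below. This is a simplified illustration rather than the actual owyn-voice-api code; the generate_with_* functions are stand-ins for the real model wrappers.

```python
# Simplified sketch of the API's shape (not the actual owyn-voice-api code).
# The generate_with_* functions are stand-ins for the real model wrappers.
import hashlib
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse

app = FastAPI()
CACHE_DIR = Path("audio_cache")
CACHE_DIR.mkdir(exist_ok=True)


def generate_with_openvoice(prompt: str, out_path: Path) -> None:
    # Placeholder: the real app runs the MeloTTS + tone-color-conversion pipeline here.
    raise NotImplementedError


def generate_with_bark(prompt: str, voice_name: str, out_path: Path) -> None:
    # Placeholder: the real app runs Bark with the voice's fine-tuned .npz prompt here.
    raise NotImplementedError


@app.get("/speak_as/{voice_name}")
def speak_as(voice_name: str, prompt: str):
    # Cache one WAV per (voice_name, prompt) combination.
    key = hashlib.sha256(f"{voice_name}:{prompt}".encode()).hexdigest()
    wav_path = CACHE_DIR / f"{key}.wav"

    if not wav_path.exists():
        if voice_name == "owyn-reference3":
            generate_with_openvoice(prompt, wav_path)
        elif Path(f"voices/{voice_name}.npz").exists():
            generate_with_bark(prompt, voice_name, wav_path)
        else:
            raise HTTPException(status_code=404, detail="Unknown voice")

    return FileResponse(wav_path, media_type="audio/wav")
```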

System Architecture

The API is a pretty simple system. It uses FastAPI to scaffold a RESTful API and leverages the models under the covers. It stores the generated audio as a WAV on the desktop and serves the file back to the caller.

The web client interfaces with the desktop via Cloudflare Zero Trust Tunnels to securely expose the API while keeping the rest of the system isolated.

Here's a link to how that works: Cloudflare Zero Trust Tunnel, and below is a Mermaid diagram that outlines the system as a whole in a bit more detail.

```mermaid
---
title: Voice API + Cloudflare Tunnels Architecture
config:
  theme: neutral
  look: handDrawn
  architecture:
    iconSize: 40
    fontSize: 12
---
architecture-beta
    group user(hugeicons:user-group)[User]
    group edge(logos:cloudflare-icon)[Cloudflare Edge]
    group local(server)[Local Machine]
    group elevenlabs(server)[ElevenLabs dot io]

    service user_request1(hugeicons:location-user-01)[User Request 1] in user
    service user_request2(hugeicons:location-user-01)[User Request 2] in user
    service page(logos:cloudflare-icon)[Demo app] in edge
    service edge_server(logos:cloudflare-icon)[Edge Server] in edge
    service fast_api(logos:fastapi)[Inference API] in local
    service bark(logos:pytorch-icon)[Bark Model] in local
    service openvoice(logos:pytorch-icon)[OpenVoice Model] in local
    service audio_cache(disk)[Audio Cache] in local
    service cloudflared(logos:cloudflare-icon)[CloudflareD Tunnel] in local
    service eleventts(server) in elevenlabs

    junction junctionTop in local

    user_request1:R <--> L:page
    user_request2:R --> L:edge_server
    edge_server:R <-- L:cloudflared
    cloudflared:R --> L:fast_api
    fast_api:R <--> L:audio_cache
    fast_api:T --> B:junctionTop
    junctionTop:L --> R:bark
    junctionTop:R --> L:openvoice
    fast_api:B --> T:eleventts
```

TODOs

Given the results are sub-par right now, there are a lot of TODOs here. The first step is to try other TTS models to see if fine-tuning is better with those. Additionally, I can capture and provide more samples with ground-truth transcriptions to go beyond one-shot learning.
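One low-effort way to build that corpus would be to bootstrap first-pass transcriptions with faster-whisper (the same repo whose README I read in the training audio) and then hand-correct them. A minimal sketch, assuming the recordings live as WAVs in a samples/ directory:

```python
# Sketch: bootstrap ground-truth transcriptions with faster-whisper, then hand-correct.
# The samples/ directory layout is an assumption for this example.
from pathlib import Path

from faster_whisper import WhisperModel

model = WhisperModel("medium.en", device="cuda", compute_type="float16")

for wav in sorted(Path("samples").glob("*.wav")):
    segments, _info = model.transcribe(str(wav))
    text = " ".join(segment.text.strip() for segment in segments)
    wav.with_suffix(".txt").write_text(text + "\n")
    print(f"{wav.name}: {text[:60]}...")
```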

Other SDKs/APIs to Try

https://github.com/KoljaB/RealtimeTTS?tab=readme-ov-file#cuda-installation
https://github.com/KoljaB/RealtimeSTT
https://github.com/metavoiceio/metavoice-src

It's Too Slow

Beyond tinkering, the ultimate goal was to recreate a little avatar that can converse, but the TTS for this (at least on my RTX 2070) is nowhere near fast enough to be a valid approach.
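A concrete way to quantify "too slow" is the real-time factor (generation time divided by the duration of audio produced); anything above 1.0 can't keep up with playback, let alone a conversation. Here's a rough sketch of that measurement around a Bark call, using a built-in preset:

```python
# Rough real-time-factor (RTF) measurement around a Bark generation call.
# RTF > 1.0 means generation is slower than playback, i.e. not conversational.
import time

from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
text = "Hello, this is a test of my cloned voice."

start = time.perf_counter()
audio = generate_audio(text, history_prompt="v2/en_speaker_6")
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / SAMPLE_RATE
print(f"Generated {audio_seconds:.1f}s of audio in {elapsed:.1f}s "
      f"(RTF {elapsed / audio_seconds:.2f})")
```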

One of the reasons to try out RealtimeTTS (and RealtimeSTT) is to see whether they bring any performance improvements, or whether it truly is down to my old video card and I need to upgrade.