25 Nov 2025

Idea: Can we create a model that converts non-native speech to native-like speech using only synthetic data?

TTS, Sequence-to-sequence, Synthetic Data


Background

I’m not a native English speaker, and my pronunciation isn’t great. This is actually very common—tons of people have the same issue. Even in the United States, you can easily find immigrants who have lived there for over a decade yet still have heavily accented or “bad” pronunciation. For daily life, it’s usually not a problem at all.

But when it comes to creating content—like YouTube videos, podcasts, or audiobooks—poor pronunciation becomes a real barrier unless you have an exceptionally strong message or unique style that makes people overlook it. Listeners often identify the creator by their voice, so if the accent is strong and the speech isn’t clear, it can hurt discoverability and retention. Of course, it depends on the type of content; some niches don’t care, but in many cases it matters. The point is: a lot of non-native creators want to be understood clearly by a wide audience. I’m convinced there’s real demand for a tool that can fix this, kind of like noise reduction or voice enhancement, but for accent and pronunciation.

Idea

Fundamentally, this is a sequence-to-sequence problem, so we already have plenty of powerful architectures (Transformer-based, diffusion models, etc.) that could handle it. The classic approach would be to feed non-native speech as input and native speech as output—but that requires parallel data: the exact same sentences recorded by both non-native and native speakers. Collecting that at scale is extremely hard and expensive.

So here’s a possible solution: train everything on synthetic data using TTS models.

I know current TTS systems don't sound 100% human yet, and there's a risk that a model trained only on synthetic speech might produce output that sounds slightly artificial. But I think there's a smarter way to frame the problem that could avoid many of those issues.

Instead of directly mapping raw non-native audio → native audio, we can treat this as phoneme sequence correction.
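To make that framing concrete, here is a minimal sketch of what a phoneme-correction model could look like, assuming a small Transformer encoder-decoder over phoneme token IDs (PyTorch used purely for illustration; every size and name below is a placeholder, not a reference implementation):

```python
import torch
import torch.nn as nn

class PhonemeCorrector(nn.Module):
    """Toy encoder-decoder: distorted phoneme IDs in, corrected phoneme IDs out."""

    def __init__(self, vocab_size=128, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (batch, src_len) distorted phonemes; tgt_ids: (batch, tgt_len)
        # correct phonemes used for teacher forcing (BOS/shift handling omitted).
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)  # (batch, tgt_len, vocab_size) logits

# Tiny smoke test with random phoneme IDs standing in for real data.
model = PhonemeCorrector()
src = torch.randint(0, 128, (2, 12))   # distorted sequences
tgt = torch.randint(0, 128, (2, 12))   # their correct counterparts
logits = model(src, tgt)
loss = nn.functional.cross_entropy(logits.reshape(-1, 128), tgt.reshape(-1))
```

Shaped this way, the task is essentially a noisy-sequence-to-clean-sequence problem, the same family as spelling correction or machine translation, which is exactly why the standard seq2seq toolbox applies.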

Here’s the pipeline I have in mind:

  1. Start with a clean target sentence (text).
  2. Generate the correct (native) phoneme sequence for that sentence.
  3. Automatically create millions of distorted phoneme sequences by applying realistic non-native error patterns (substitutions, insertions, deletions typical of certain L1 backgrounds—e.g., /r/ → /l/, /æ/ → /e/, missing word-final consonants, etc.).
  4. Use a high-quality TTS system that accepts explicit phoneme input to generate two audio files:
     - one from the correct phoneme sequence (the native reference), and
     - one from the distorted phoneme sequence (the synthetic non-native speech).
  5. Train the model on the resulting pairs: distorted audio → correct audio (or, as alternative targets, the correct phoneme sequence or the correct audio waveform).

Because the TTS is forced to pronounce exactly the phonemes we feed it, the “non-native” audio will contain the kinds of systematic errors real non-native speakers make, but everything stays perfectly aligned.
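As a rough sketch of the distortion step (step 3 above): the rule set below is invented purely for illustration, and a real system would curate L1-specific substitution, insertion, and deletion patterns from phonology references or learner-speech corpora. Both the correct and the distorted sequences would then go to the same phoneme-input TTS to render the aligned audio pair.

```python
import random

# Illustrative (made-up) error rules; real ones would be curated per L1 background.
SUBSTITUTIONS = {
    "r": ["l"],        # /r/ -> /l/
    "æ": ["e"],        # /æ/ -> /e/
    "θ": ["s", "t"],   # "th" substitution
    "ɪ": ["i"],        # tense/lax vowel confusion
}
FINAL_CONSONANTS = {"t", "d", "k", "s", "z"}  # candidates for word-final deletion

def distort(words, sub_p=0.5, del_p=0.3, seed=None):
    """Apply substitution and word-final deletion errors to a phoneme sequence.
    `words` is a list of words, each word a list of phoneme strings."""
    rng = random.Random(seed)
    distorted = []
    for word in words:
        new_word = []
        for i, ph in enumerate(word):
            if ph in SUBSTITUTIONS and rng.random() < sub_p:
                ph = rng.choice(SUBSTITUTIONS[ph])          # substitution error
            if i == len(word) - 1 and ph in FINAL_CONSONANTS and rng.random() < del_p:
                continue                                    # drop final consonant
            new_word.append(ph)
        distorted.append(new_word)
    return distorted

# "Hello, how are you?" -> /həˈloʊ haʊ ɑr ju/ (stress marks omitted here);
# in practice a G2P tool would produce this from the text.
correct = [["h", "ə", "l", "oʊ"], ["h", "aʊ"], ["ɑ", "r"], ["j", "u"]]
distorted = distort(correct, seed=0)   # may yield e.g. [..., ["ɑ", "l"], ...]
# Both `correct` and `distorted` then go to the same phoneme-input TTS
# (call not shown) to produce the aligned native / non-native audio pair.
```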

At inference time, you take real non-native speech, run it through an ASR phoneme recognizer (or a forced aligner) to get the (incorrect) phoneme sequence, correct it with the trained model, and then feed the corrected phonemes into the same high-quality TTS to generate clean, native-sounding speech. If you also apply voice conversion/cloning on top, the output can keep the user's own voice timbre.
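A skeleton of that inference path might look like the following. Every function here is a hypothetical placeholder rather than a real library call; the point is only how the stages are chained (phoneme recognizer → corrector → phoneme-input TTS → optional voice conversion):

```python
# All functions below are hypothetical placeholders (no real library is implied);
# they only show how the inference stages would be wired together.

def recognize_phonemes(audio_path: str) -> list[str]:
    """Phoneme-level ASR (or forced aligner): returns the possibly incorrect
    phoneme sequence the non-native speaker actually produced."""
    raise NotImplementedError

def correct_phonemes(phonemes: list[str]) -> list[str]:
    """Seq2seq corrector trained on the synthetic (distorted -> correct) pairs."""
    raise NotImplementedError

def synthesize_from_phonemes(phonemes: list[str]) -> bytes:
    """The same high-quality phoneme-input TTS used to build the training data."""
    raise NotImplementedError

def convert_voice(audio: bytes, reference_audio_path: str) -> bytes:
    """Optional voice conversion / cloning to keep the user's own timbre."""
    raise NotImplementedError

def enhance(input_path: str) -> bytes:
    noisy = recognize_phonemes(input_path)       # non-native phoneme sequence
    fixed = correct_phonemes(noisy)              # pronunciation correction
    clean = synthesize_from_phonemes(fixed)      # native-sounding audio
    return convert_voice(clean, input_path)      # re-timbre to the original speaker
```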

Alternatively, you can skip the intermediate phoneme step at inference and train an end-to-end audio → audio model purely on the synthetic parallel data we created. The key advantage is that the model learns pronunciation correction patterns rather than just statistical artifacts of real recordings.

Example:
Target sentence: “Hello, how are you?”
Correct American English phonemes (rough):
/həˈloʊ haʊ ɑr ju/

Synthetic non-native variants (applying the error patterns from step 3), for example:
/heˈlo haʊ ɑl ju/ (/ə/ → /e/, /r/ → /l/)
/haˈlo ho ar ju/ (different vowel substitutions, flattened diphthongs)

We generate millions of such pairs, covering many accents and error types.

Final Thoughts

This is basically just me dumping an idea I’ve been thinking about for a while. I really hope I (or someone) can try building this someday. There are still challenges (defining what “native” means, handling prosody and rhythm beyond phonemes, making the output sound natural and not TTS-y), but I believe training on phoneme-level correction with fully synthetic parallel data could be a viable path.

Have a great day!