How AI hears accents: An audible visualization of accent clusters
Using the accent guesser, I have a Swedish accent. Danish and Australian English follow in a close tie.
It's not just the AI. Non-native speakers of English often think I have a foreign accent, too. Often they guess at English or Australian. Like I must have been born there and moved here when I was younger, right? I've also been asked if I was Scandinavian.
Interestingly, I've noticed that native speakers never make this mistake. They sometimes recognize that I have a speech impediment, but there's something about how I talk that is recognized with confidence as a native accent. That leads me to the (probably obvious) inference that whatever it is that non-native speakers use to judge accent and competency, it is different from what native speakers use. I'm guessing in my case, phrase-length tone contour. (Which I can sort of hear, and presumably reproduce well, even if I have trouble with the consonants.)
AI also really has trouble with transcribing my speech. I noticed that as early as the '90s with early speech recognition software. It was completely unusable. Even now AI transcription has much more trouble with me than with most people. Yet aside from a habit of sometimes mumbling, I'm told I speak quite clearly, by humans.
Hearing different things, as it were.
AI also really has trouble with transcribing my speech. I noticed that as early as the '90s with early speech recognition software. It was completely unusable.
I don't know what your transcription use cases are, but you may be able to get an improvement by fine-tuning Whisper. This would require about $4 in training costs[1], and a dataset with 5-10 hours of your labeled (transcribed) speech, which may be the bigger hurdle[2].
1. 2000 steps took me 6 hours on an A100 on Colab, fine-tuning openai/whisper-large-v3 on 12 hours of data. I can share my notebook/script with you if you'd like.
2. I am working on a PWA that makes it simple for humans to edit initial, automated transcriptions with mistakes, so the corrected dataset can be fed back into the pipeline for fine-tuning, but it's not ready yet.
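For what it's worth, the labeled dataset in [2] is typically just audio files paired with corrected transcripts. A minimal sketch of assembling one into a JSONL manifest, the shape most fine-tuning scripts can load (the directory layout, extensions, and field names here are my assumptions, not anything specific to Whisper or the PWA above):

```python
import json
from pathlib import Path

def build_manifest(audio_dir: str, transcript_dir: str, out_path: str) -> int:
    """Pair each .wav file with its same-named .txt transcript and write
    one JSON record per line. Returns the number of pairs written; audio
    without a matching transcript is skipped, since unlabeled clips can't
    be used for supervised fine-tuning."""
    audio, texts = Path(audio_dir), Path(transcript_dir)
    n = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for wav in sorted(audio.glob("*.wav")):
            txt = texts / (wav.stem + ".txt")
            if not txt.exists():
                continue
            record = {"audio": str(wav),
                      "text": txt.read_text(encoding="utf-8").strip()}
            out.write(json.dumps(record) + "\n")
            n += 1
    return n
```

From there, a loader can read the manifest line by line and hand each audio/text pair to whatever training loop you're using.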
It's an interesting self-contained example.
Standard Canadian English is my native language. Most native English speakers claim my speech is unmarked.
Non-native speakers of English often think I have a foreign accent, too. Often they guess at English or Australian. Like I must have been born there and moved here when I was younger, right?
They sometimes recognize that I have a speech impediment but there's something about how I talk that is recognized with confidence as a native accent.
At least 2 or 3 times a year, someone asks me if I'm British, but my parents and I were born in Canada, and I've never even been to England, so I'm not really sure why some people think that I have a British accent. Interestingly, the accent checker guesses that my accent is
American English 89%
Australian English 3%
French 3%
which is pretty close to correct. More bizarrely, locals often assume I'm not from around here as well. I honestly don't understand it.
[1]https://www.acelinguist.com/2020/01/the-pin-pen-merger.html
I also think merry-marry-Mary are all pronounced identically. The only way I can conceive of a difference between them is to think of an exaggerated Long Island accent, which, yeah, I guess is what makes it an accent.
My partner is from the PNW and she pronounces "egg" as "ayg" (like "ayyyy-g"), but when I say "egg" she can't hear the difference between what I'm saying and what she says. And she has perfect hearing. But she CAN hear the difference between "pin" and "pen", and she gets upset when I say them the same way. lol
But yeah, that's one of the things that makes accents accents. It's not just the sounds that come out of our mouths but the way we hear things, too. Kinda crazy. :)
In the example of the reverse pen/pin merger (HMS Pinafore) on that page, I couldn’t hear “penafore” to save my life. Fascinating stuff.
I used to think of the movie “Fargo” and think “haha comical upper midwestern accents.” And then at some point I realized that the characters in “No Country for Old Men” probably must sound similarly ridiculous to anyone whose grandparents and great grandparents didn’t all speak with a deep, rural West Texas accent - which mine did, so watching the movie it just seemed completely natural for the place and time at a deeply subconscious level.
AI struggles massively with my accent. I've gotten the best results out of Whisper Large v2 and even that is only perhaps 60% accurate. It's been on my todo list to experiment with using LLMs to try to clean it up further - mostly so I can do things like dictate blog post outlines to my phone on long car rides - but I haven't had as much time as I'd like to mess around with it.
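If you want to pin down "perhaps 60% accurate", the standard ASR metric is word error rate (WER): word-level edit distance between the hypothesis and the reference transcript, divided by the reference word count. A minimal self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, one DP row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Measuring WER before and after an LLM cleanup pass would tell you whether the cleanup is actually helping or just rewriting fluently.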
This is probably because some states in Aus use the Queen's English passed down from the colonies.
Your accent is Dutch, my friend. I identified your accent based on subtle details in your pronunciation. Want to sound like a native English speaker?
I'm British; from Yorkshire.
When letting it know how it got it wrong, there's no option more specific than "English - United Kingdom". That's kind of funny, if not absurd, to anyone who knows anything of the incredible range of accents across the UK.
I also think the question "Do you have an accent when speaking English?" is an odd one. Everyone has an accent when speaking any language.
Sure, I agree. But look at it from the perspective of a foreigner living in an English-speaking country, which is probably their target demographic.
We know that as soon as we open our mouth the locals will instantly pigeonhole us as "a foreigner". No matter how good we might be in other areas, we will never be one of "them". The degree of prejudice that may or may not exist against us doesn't matter as much as the ever present knowledge that the locals know that we are not one of them, and the fear of being dismissed because of that.
Nobody likes to stand out like that, particularly when it so clearly puts you at a disadvantage. That sort of insecurity is what this product is aimed at.
BoldVoice is very clear about being an American accent "training app", so that's not (necessarily) what's happening here, but the point remains.
Countries/universities will let you off if you're coming from a country that has English as its main language.
Singapore is a “native” English speaking country yet has an extremely distinctive accent.
(usually seen as a negative by both Singaporeans and non-Singaporeans)
The first two days were a shock; it felt like a different language. But after some time, I adjusted, and now I find both Singlish pronunciation and phrases endearing.
For example, the first time I heard "ondah-cah?" I was puzzled. Then I understood that it was "Monday can?". Which, as I learned, means "Would Monday work for you?".
I find Danes speaking Danish to sound like a soft Yorkshire accent, and the vowels that Yorkies use are better written in Danish, like phøne.
That's kind of funny, if not absurd, to anyone who knows anything of the incredible range of accents across the UK.
Yeah I was disappointed when I realised this post was about foreign accents and not regional accents in English across the world.
By clicking or tapping on a point, you will hear a standardized version of the corresponding recording. The reason for voice standardization is two-fold: first, it anonymizes the speaker in the original recordings in order to protect their privacy. Second, it allows us to hear each accent projected onto a neutral voice, making it easier to hear the accent differences and ignore extraneous differences like gender, recording quality, and background noise. However, there is no free lunch: it does not perfectly preserve the source accent and introduces some audible phonetic artifacts. This voice standardization model is an in-house accent-preserving voice conversion model.
Not sure this model works very well. As a French/Spanish native speaker, I can immediately recognize an actual French or Spanish person speaking English, but the examples here are completely foreign to me. If I had to guess where the "French" accent was from, I would have guessed something like Nigeria. For example, Spanish speakers have a very distinct way of pronouncing "r" in English that is just not present here. I would have been unable to correctly guess French or Spanish for the ~10 examples present in each language (maybe 1 for French).
The article itself is just a vector projection into 3D space; the actual reality is much more complex.
Any comments on pronunciation assessment models are greatly appreciated.
Then I was able to apply UMAP + HDBSCAN to this dataset, and it produced a 2D plot of all my books. Later I put the discovered topics back into the db and used them to compute tf-idf for my clusters, from which I could pick the top 5 terms to serve as crude cluster labels.
It took about 20 to 30 hours to finish all these steps and I was very impressed with the results. I could see my cookbooks clearly separated from my programming and math books. I could drill in and see subclusters for baking, bbq, salads etc.
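The tf-idf labeling step in that pipeline can be sketched in plain Python, treating each cluster's concatenated texts as a single document so that idf measures how cluster-specific a term is (the cluster contents here are illustrative; the original stored topics in the db):

```python
import math
from collections import Counter

def top_terms_per_cluster(clusters: dict[str, list[str]],
                          k: int = 5) -> dict[str, list[str]]:
    """Label each cluster with its k highest tf-idf terms.

    `clusters` maps a cluster id (e.g. from HDBSCAN) to the list of
    texts assigned to it. Each cluster is treated as one document."""
    docs = {cid: [w.lower() for text in texts for w in text.split()]
            for cid, texts in clusters.items()}
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for words in docs.values():
        df.update(set(words))
    labels = {}
    for cid, words in docs.items():
        tf = Counter(words)
        scores = {w: (c / len(words)) * math.log(n_docs / df[w])
                  for w, c in tf.items()}
        labels[cid] = [w for w, _ in
                       sorted(scores.items(), key=lambda x: -x[1])[:k]]
    return labels
```

Terms that appear in every cluster get idf = 0 and fall to the bottom, which is what pushes generic words like "book" out of the labels.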
Currently I'm putting it into a two-container docker-compose file: base PostgreSQL plus a Python container I'm working on.
Not many people have the privilege of access to these artifacts, or the skill to interpret these abstract, multi-dimensional spaces. I want more of these visualizations, with more spaces which encode different modalities.
Is there a way to subscribe to these blog posts for auto-notification?
Obligatory xkcd: https://xkcd.com/1053/
I'd suggest training a little less on audio books.
Also, the training dataset is highly imbalanced and Spanish is the most common class, so the model predicts it as a sort of default when it isn't confident -- this could lead to artifacts in the reduced 3d space.
At any rate, I was looking forward to finding out what the accent oracle thought of my native US English accent, which sounds northern to southerners and southern to northerners, but I guess it'd probably just flag it as "American".
I'm from the south of Sweden and I've had my "accent" made fun of by people from Malmö just because I grew up outside of Helsingborg, because the accent changes that much in just 60 kilometers.
It wrongly pegged me as Swedish.
Its second choice was the place I live, and third place was where I'm from, so not too bad overall. I have been told I have a very ambiguous accent though.
When I play the different recordings, which I understand have the accent "re-applied" to a neutral voice, it's very difficult to hear any actual differences in vowels, let alone prosody. Like if I click on "French", there's something vaguely different, but it's quite... off. It certainly doesn't sound like any native French speaker I've ever heard. And after all, a huge part of accent is prosody. So I'm not sure what vocal features they're considering as "accent"?
I'm also curious what the three dimensions are supposed to represent. Obviously there's no objective answer, but if they've listened to all the samples, surely they could explain the main contrasting features each dimension seems to encode?
Also, while I appreciate the individual examples, it would be interesting to map the original languages: which is close to which, in terms of their English accents.
I've tried the accent oracle test a few times and it catches me being Italian with 90%+ confidence.
The interesting thing is that if I try to fake a more English accent, like American... it tells me I'm Polish.
Which is odd, because I don't really have a Polish accent and don't speak Polish that well. I sound Italian even in Polish.