A Sirious Problem Transcript

MEGAN: Hi, welcome to the Vocal Fries Podcast, the podcast about linguistic discrimination.

CARRIE: I’m Carrie Gillon.

MEGAN: And I’m Megan Figueroa. And today we are going to be talking about artificial intelligence generally and more specifically automatic speech recognition. We have a guest here with us, because Carrie and I do not know anything about – well, ok, I shouldn’t speak about Carrie’s ignorance on the topic – but I don’t know

CARRIE: You were correct: I know nothing.

MEGAN: Ok. We are joined by Dr. Rachael Tatman. She is a data preparation analyst at Kaggle, which is, according to its own Twitter account, the world’s largest community of data scientists. Rachael has a PhD in linguistics from the University of Washington, where she specialized in computational sociolinguistics. Her dissertation, among other very cool things, showed the ways in which automatic speech recognition falls short when dealing with sociolinguistic variation, like dialects. Welcome Rachael.

RACHAEL: Hi! Thanks for having me.

CARRIE: Hi!

MEGAN: I’m very excited to have you. I feel like, with automatic speech recognition – I don’t know if other people feel this way – but I was in the camp where I didn’t realize that I should care about what’s happening, with how automatic speech recognition is being made or to listen to voices. I didn’t know that I had to care, and now I care. Hopefully we’ll show listeners why we should care.

RACHAEL: Yeah! I can share one of my stories about automatic speech recognition. One thing that’s really difficult is children’s voices, because obviously children are a different size, and they have a lot of acoustic qualities that are different. But also children have a lot of individual variation. If you spend a lot of time with kids, what’s a “bink bink”? Is it a blankie? Is it a bottle? I’m Dyslexic, and when I was in grade school, they tried to use automatic speech recognition to like help me type faster, so I could complete assignments and turn them in. And not fail third grade. Yeah: it did not work well. I remember very distinctly that I tried to say “the walls were dark and clammy”. We were doing a creative writing exercise, and it was transcribed as “the wells we’re gathered and planning”. Which is kinda close acoustically, but also there’s some probably poor language modeling behind that, where they thought that that was a more likely sentence than the one that I’d started with.

CARRIE: Wow.

MEGAN: Let’s define automatic speech recognition for the listeners, and for myself. What is automatic speech recognition?

RACHAEL: It’s the computational task of taking in an acoustic signal of some kind and rendering it as text. When I say an acoustic signal, I mean specifically a speech acoustic signal, because also people work with whale song and bird song and stuff. It gets used a lot in especially mobile devices. If you know Google Now or Cortana – I don’t know how many people actually use Cortana – or a Bixby, which is Samsung’s virtual assistant, or Siri, which is probably the most well-known one, they all rely on automatic speech recognition to sort of understand what you’re saying and reply to your tasks. It gets used a lot in virtual assistants, which is Echo or Google Play, or Apple’s launching one soon, as well. I don’t know how much you guys keep up with tech news, but these are little devices that sit in your home, and you can be like, “hey, Siri”. I guess, I don’t know what the Apple one’s gonna be called. Or, “okay, A L E X A”. I don’t want to say it, because I don’t want to turn on everybody’s Alexa.

CARRIE: Oh no! I think you just did!

ALEXA: Hmm. I’m not sure what you meant by that question.

RACHAEL: Go back to sleep Alexa. It’s everywhere, is the point. People are incorporating it into new technologies. They’re getting really excited about it. People are talking about incorporating it into testing for schools, for standardized testing. People are talking about incorporating it into medical diagnostic tests. Things like – what’s that semantic one, where you have to name a bunch of things that are similar, before you move on?

CARRIE: I don’t know.

RACHAEL: It gets used for diagnosing a lot of things, like schizophrenia and Alzheimer’s and specific learning disorders. Semantic coherence test maybe?

CARRIE: Yeah.

RACHAEL: Anyway, people have been working on using speech recognition for that, so incorporating it into this. People are using it for language assessment, for immigration and visas, a lot of very high stakes places.

MEGAN: That’s very high stakes! That’s very important.

RACHAEL: Probably my favorite thing to be upset about in this realm is people incorporating NLP, which is natural language processing, which is more as text, and also automatic speech recognition, in these algorithms that you put information into and it tells you whether or not you should hire the person.

CARRIE: UGH. Oh my god.

RACHAEL: So very, very high stakes applications. You may not always realize that your voice or your language is being used in this way.

MEGAN: You can’t see my face, but I’m horrified right now. Okay. It’s very important. There’s a lot of practical applications that automatic speech recognition is being used for. In all of these realms, there’s possibility of discrimination.

RACHAEL: Yeah. As far as I know, no one who has looked at an automatic speech recognition system or a text-based system, specifically looking at performance across different demographic groups on a certain task, has ever found that “nope there’s no difference, it doesn’t matter, the system is able to deal super well with people of all different backgrounds”. Looking specifically at speech, I’ve done a number of studies, and by a number I mean two, and I’m working off and on a third, because I am also working full-time – this isn’t part of my job. I’m not speaking on behalf of my company or employer. If you’re gonna yell at anybody, yell at me personally. This is my private, individual thing. What I found is that there are really, really strong dialectal differences – so differences between people who have different regional origins. Which dialects get recognized more or less accurately seems to be – I’m having a hard time picking it apart, but I think it also is a function of social class. It’s fairly difficult to find speech samples that are labeled for the person’s dialect and also their social class, and good sociolinguistics sampling methods. It’s really hard to find large annotated speech databases that you can do this analysis with, but I found really strong dialectal differences in accuracy, with general American, or mainstream American English, or mainstream US English, or standardized American English – there’s a lot of different terms for this “fancy” talk – having the lowest error rate. I found that Caucasian speakers have the lowest error rate. Looking at Caucasian speakers, African American speakers, speakers of mixed race, and the study where I had race information – I only had one Native American speaker, so I had to exclude them, because one data point is not a line. So that’s worrying.

MEGAN: Right. What does it mean to have an error? What is the practical result of an error in speech recognition?

RACHAEL: There are three types of errors. One is where a word is substituted, so you say “walls” and it hears “wells” and transcribes that. Another one is deletion, where you say something like “I did not kill that man” and “I did kill that man” is transcribed. I should say people are still using hand stenographers for court cases, as far as I know. I don’t think anyone in the legal system is using ASR, but yikes.

CARRIE: Better not.

RACHAEL: There’s also insertion, when you think that you heard a word and it wasn’t actually there. A lot of times words that’re inserted are function words like “the” and “of”, things like that.

MEGAN: So deletion, insertion, and hearing it wrong. Doing another word.

RACHAEL: Yeah those are the only three transformations you can do, yes.

MEGAN: Okay.

RACHAEL: Word error rate is just, for all the words, how many of them did you get wrong in one of these ways. Just on a frustration level, if you’re using speech recognition as a day-to-day user, and it doesn’t work real great, that’s annoying. I’m sure if you guys ever use speech recognition, like on your phones, or I have a Google home, and I’ll use it for a timer a lot. It’s actually gotten better – it used to be really bad at hearing the word “stop” like “stop the timer”. I think that might be because of the [ɑ] [ɔ] merger that some people have. That’s my pet theory. But it’s gotten a lot better at understanding “stop”. I would have to say “stop” five times while I’m standing at the kitchen with cheese smeared on my arms up to my elbows or whatever.
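As a rough illustration of the word error rate Rachael describes, here is a minimal Python sketch. It aligns a reference transcript with what the system produced, counts substitutions, deletions, and insertions, and divides by the number of words in the reference. The function names are made up for this example, and real evaluation tools usually normalise punctuation and casing first, which this sketch does not do.

```python
def count_errors(reference, hypothesis):
    """Count substitutions, deletions and insertions between two transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()

    # dist[i][j] = fewest edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back through the table to label each edit.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitutions"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


def word_error_rate(reference, hypothesis):
    """Total errors divided by the number of words in the reference."""
    counts = count_errors(reference, hypothesis)
    return sum(counts.values()) / len(reference.split())


# The two examples from the conversation.
print(count_errors("the walls were dark and clammy",
                   "the wells we're gathered and planning"))
print(word_error_rate("I did not kill that man", "I did kill that man"))
```

On the "walls" sentence, four of the six words are substituted; on the "kill that man" sentence, a single deleted word reverses the meaning even though the error rate is only one in six.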

CARRIE: That’s really strange because there isn’t a different “stop”. I have the [ɑ] [ɔ] merger, so I can’t make the other word, but it doesn’t exist anyway.

RACHAEL: Yeah, it may be that the acoustic model is more – so speech recognition, I’m gonna say this generally – because people are futzing around with it a lot and I’m messing it up – generally has two modules. One is the acoustic model, which is “what waveforms map to what sounds” and the other is the language model, which is “what words are more likely”. When you put those together, out the other end comes, through some fancy math, the most likely transcription for some given set of input parameters, ideally. And my guess is that if you’re not specifically modeling the fact that some people have two vowels and some people have one vowel in that space, you may be less able to recognize those sounds generally, because you think that there’s just a lot of variation there. Especially since there’s also the Northern Cities Shift that’s muddling that whole area as well. Sorry, should I assume a lot of phonetic backgrounds on the part of your speakers?
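To make the two-module idea concrete, here is a toy Python sketch of how an acoustic score and a language-model score might be combined to pick a transcription. Every number below is invented for illustration, and `best_transcription` and `lm_weight` are hypothetical names. Real systems search over enormous numbers of candidates and do not work from a lookup table like this.

```python
import math

# Invented scores: how well each candidate matches the audio...
acoustic_log_prob = {
    "the walls were dark and clammy": math.log(0.4),
    "the wells we're gathered and planning": math.log(0.6),
}
# ...and how likely each candidate is as a sentence of English.
language_log_prob = {
    "the walls were dark and clammy": math.log(0.01),
    "the wells we're gathered and planning": math.log(0.0001),
}


def best_transcription(candidates, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def score(sentence):
        return acoustic_log_prob[sentence] + lm_weight * language_log_prob[sentence]
    return max(candidates, key=score)


candidates = list(acoustic_log_prob)
print(best_transcription(candidates))                  # both models combined
print(best_transcription(candidates, lm_weight=0.0))   # acoustics only
```

With these made-up numbers the acoustics alone slightly prefer the "wells" sentence, and it is the language model that tips the result back toward "walls". A weak language model would leave the acoustic mistake in place.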

CARRIE: Our listeners? Yeah, I was just gonna say: maybe we should describe what the Northern vowel shift is.

RACHAEL: There are a number of vowel shifts in the United States, and if you think of individual vowels as being little swarms of bees that are clustered around flowers, sometimes the swarms of bees move on or the flower moves and the swarm follows after it, and different places have movement in different directions. I don’t know, is that a good analogy? I’m using my hands a lot. I know you guys can’t see it. Is that clear?

CARRIE: I understand what you’re saying but I’m not sure. Good question.

MEGAN: I don’t know. I like the analogy. I feel like that’s good.

RACHAEL: I would look up vowel change shifts, if I was listening to this. I’d just google them, and you’ll see some nice pictures and arrows. You’ll be like “oh!”

CARRIE: Yeah. We’ll add something to the Tumblr to explain a little bit about vowel shifts, and also the merger we were talking about, because I can’t replicate it. I can’t do that open o [ɔ].

MEGAN: I can’t either. I don’t have it.

RACHAEL: “cot” [k ɔ t] as in “I caught the ball” and then “caught” [kɑt] – nope, I have it backwards again.

CARRIE: Yep. We haven’t asked you yet, but what is computational sociolinguistics?

RACHAEL: I don’t think I made up the term, but I’m probably one of the first people to call myself that. Dong Nguyen – she’s currently at the Alan Turing Institute – has a fabulous dissertation that has a really nice review chapter that talks about the history of this emerging field. It is approaching sociolinguistic questions using computational methods, and it’s also informing computational linguistics and natural language processing and automatic speech recognition with sociolinguistic knowledge. Working on dialect adaptation, I think would fall within that – that’s when you take an automatic speech recognition system that works on one dialect and try to make it work good for other dialects as well. I’ve done some work on modeling variation in textual features by social groups. I’ve looked at political affiliation and punctuation and capitalization in tweets, and there’s pretty robust differences at least in the US between oppositional political identities. I’m trying to think of other people’s work, so it’s not just: here’s a bunch of stuff that I’ve done!

MEGAN: Basically, everyone’s trying to model everything.

RACHAEL: Basically. Or should be, hopefully. I think, historically, there hasn’t been a lot of – I think sociolinguists are much better about knowing what’s going on in computational linguistics than computational linguists are at knowing about what’s going on in sociolinguistics. I’m coming from sociolinguistics and coming to computational linguistics. I’m trying to have a big bag of Labov papers and toss them to people, be like “here you go! Here you go!”

MEGAN: Yes and Labov is a very famous sociolinguist.

RACHAEL: He is, yes. I would call him the founder of variationist sociolinguistics – which is not the only school, but it is the school that I work in mainly.

CARRIE: Yeah, I think that’s – well that’s the most famous one as far as I know.

MEGAN: Yeah. I didn’t know there were other ones. Of course there is.

RACHAEL: Yeah, I’m trying to think of names. Mostly I’ll come across it I’ll be like “oh”. I guess discourse analysis is a type of sociolinguistics.

MEGAN: Oh, okay.

CARRIE: Yes.

RACHAEL: But different bent.

CARRIE: How is automatic speech recognition trained to understand humans? I think you’ve already started to answer this, but maybe you can answer it a little bit more, if there is more to say.

RACHAEL: Yeah. I mentioned there are two components: there’s the acoustic model and then there’s the language model. Usually the language model is actually trained on texts. You take a very, very, very large corpus. I think right now – I don’t know about the standard, but what I think most people would like to use would be the Google trillion word corpus, which is from scraped web text, or people use the Wall Street Journal corpus, which is several hundred million words long. You know the probability of a certain set of words occurring in a certain order, so it’s the poor man’s way of getting syntax. I’ll tell you about how it’s traditionally done. People are replacing both the pronunciation dictionary and the acoustic model, which sometimes includes the pronunciation dictionary, with big neural nets. We can talk about that in a little bit, but traditionally the pronunciation dictionary was made by hand. The Carnegie Mellon pronunciation dictionary, or CMU pronunciation dictionary, is probably the best-known one for American English. People transcribe words, and if there’s one that you need that’s not transcribed, you add it.
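The "probability of a certain set of words occurring in a certain order" can be sketched with a bigram model, which counts how often each word follows the one before it. The Python below is a minimal sketch assuming a three-sentence toy corpus invented for this example. The function names are hypothetical, and production language models use far more text plus smoothing for word pairs they have never seen.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count how often each word, and each adjacent word pair, occurs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        context_counts.update(words[:-1])           # words that have a successor
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts


def bigram_probability(prev, word, context_counts, bigram_counts):
    """Estimate P(word | prev) from raw counts, with no smoothing."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


# Toy corpus, invented for illustration.
corpus = [
    "the walls were dark",
    "the walls were damp",
    "the wells were deep",
]
contexts, bigrams = train_bigram_model(corpus)
print(bigram_probability("the", "walls", contexts, bigrams))
print(bigram_probability("the", "wells", contexts, bigrams))
```

In this toy corpus "walls" follows "the" twice as often as "wells" does, so the model prefers it. Whatever text the model is trained on decides which word orders look "likely".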

MEGAN: And what’s a pronunciation dictionary?

RACHAEL: It is a list of words and then how they’re pronounced. The phones, so “cat” would be [k] [æ] [t] – those three sounds in order. Then the acoustic model takes the waveform and tells you the probability of each of those sounds. So it’s like “well I’m pretty sure it’s [æ], but I guess it could also be [ɑ]”, through a process of transformations. People recently have been taking a speech corpus – usually one that’s labeled, so you know what words are spoken – and then using all of that data and shoving it into a neural net, which is a type of machine learning algorithm – it’s a family of machine learning algorithms. People use different types and flavors, and they have different structures. What neural nets are really, really good at is finding patterns in the data, and recognizing those same patterns later, without you having to tell them to do it. They learn it themselves, from just the way that the information is organized. They’ve been really, really good and useful in image processing, in particular, being able to look at a photo and be like “here is an apple”, “here is an orange” and “I have circled them helpfully for you”. They’re really good at that. But as it turns out there is more structure in language than there is in other types of data.
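The traditional pipeline Rachael outlines, a hand-built pronunciation dictionary plus an acoustic model that assigns probabilities to sounds, can be sketched roughly as follows. The dictionary entries, the probabilities, and the function names are all assumptions made for this example. The IPA symbols follow the transcript; the actual CMU dictionary uses ARPAbet codes instead.

```python
# A tiny hand-made pronunciation dictionary: word -> list of phones.
pronunciation_dictionary = {
    "cat": ["k", "æ", "t"],
    "cot": ["k", "ɑ", "t"],
}


def add_pronunciation(word, phones):
    """Add a word that is missing from the dictionary."""
    pronunciation_dictionary[word.lower()] = list(phones)


# Invented acoustic-model output: for each stretch of audio, the
# probability of each sound ("pretty sure it's [æ], could be [ɑ]").
sound_probabilities = [
    {"k": 0.9, "g": 0.1},
    {"æ": 0.7, "ɑ": 0.3},
    {"t": 1.0},
]


def word_score(word):
    """Multiply the probabilities of the word's phones, position by position."""
    phones = pronunciation_dictionary.get(word.lower())
    if phones is None or len(phones) != len(sound_probabilities):
        return 0.0
    score = 1.0
    for phone, probabilities in zip(phones, sound_probabilities):
        score *= probabilities.get(phone, 0.0)
    return score


best_word = max(pronunciation_dictionary, key=word_score)
print(best_word, word_score(best_word))
```

A speaker whose vowel in "cat" sits closer to [ɑ] would shift those middle probabilities, which is one simple way a dictionary built around a single dialect can start picking the wrong word.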

CARRIE: Shocking. [sarcasm]

RACHAEL: It is to some people. I’ve had a lot of frustrating conversations where people were like “but it works really good on images!” I’m like “yes, but language is different”. If it weren’t, we wouldn’t need linguistics. People wouldn’t need to study language their entire lives, if it was just like images but in sound, basically. Which I think is probably not news to any listeners of this podcast, but definitely it is news to some people. Neural nets are really good at seeing things that they’ve seen before, or identifying the types of things they’ve seen before, and if they see new things, they’re not so good at it. I think that’s really where a lot of the trouble with dialect comes in, because sociolinguistic variation is very systematic between dialect regions. One person can have multiple dialects as well. I don’t want to make it sound like you sort people into their dialects and then apply the correct model and then boom everything’s correct all the time. Because people have tried that and it works better than not doing anything, but it’s still not – I don’t know. There’s a lot of work to do, and I don’t want to make it sound like speech research engineers are just fluffing around and not knowing about language, because they do. But it’s difficult, and it hasn’t, I think, been a major focus for a lot of people recently, and I’m hoping that it will become more of a research focus.

MEGAN: You said something in one of your interviews that I wanted to read here that I liked. You say that “generally the people who are doing the training aren’t the people whose voices are in the dataset. You’ll take a dataset that’s out there that has a lot of different people’s voices, and it will work well for a large variety of people. I think the people who don’t have sociolinguistic knowledge haven’t thought about the demographic of people speaking would have an effect. I don’t think it’s maliciousness. I just think it wasn’t considered.”

RACHAEL: Yeah.

MEGAN: I think the “it wasn’t considered” part – it’s how I felt actually. I obviously very much care that people aren’t discriminated against in every aspect of life. But I just didn’t think about speech recognition.

RACHAEL: Yeah. I think we have this idea that like “oh a computer’s doing it, so it’s not gonna be biased”.

MEGAN: You’re right.

RACHAEL: That’s nice to believe that you have the ethical computer from Star Trek, but bias is built into all machine learning models. It’s one of the things you study in a machine learning class. You talk about bias and variance, and it’s there in the model, and it’s there in the data. Pretending that it can go away if you just keep adding more data is a little bit of a problem for the people who are actually using the system, and it doesn’t work as well for them as it should, maybe.

CARRIE: It’s also very naïve.

MEGAN: Yeah. Humans are the ones that are doing it, right. We’re behind the machines. Of course there’s biases. I was thinking, I’ve said I’ve never thought about this before, but I don’t use Siri, because Siri does not understand me very well at all. I’ve given up.

SIRI: I miss you Megan.

MEGAN: I didn’t take the next step. I didn’t take the next step, and think “oh why is this the case that she’s not understanding me very well”.

RACHAEL: Yeah.

CARRIE: She understands me pretty well. I have a pretty standard North American accent.

RACHAEL: A little bit of the Canadian shift.

CARRIE: I do, but it’s not enough to trick Siri, apparently. My accent has shifted somewhat since living in the States for over nine years. I knew that speech recognition did have a problem with at least some dialects, because there’s a fairly famous skit from Burnistoun, the Scottish sketch comedy show, where he’s just saying “eleven”, and it’s one of the words where in a Scottish accent “eleven” is pretty close, so the speech recognition should have been able to pick it up. Most of the sketch is them speaking in a Scottish dialect that I think many Americans would not understand actually.

IAIN CONNELL: You ever tried voice recognition technology?

ROBERT FLORENCE: No.

IAIN CONNELL: They don’t do Scottish accents.

ROBERT FLORENCE: Eleven.

ELEVATOR: Could you please repeat that.

ROBERT FLORENCE: Eleven.

IAIN CONNELL: Eleven.

ROBERT FLORENCE: Eleven. Eleven.

IAIN CONNELL: Eleven.

ELEVATOR: Could you please repeat that.

IAIN CONNELL: Eleven. If you don’t understand the lingo, away back home your own country. [If you don’t underston the lingo, away back hame yer ain country.]

ROBERT FLORENCE: Oohh, is that the talk now, is it? “Away back home your own country?” [Oh, s’tha talk nae is it? “Away back tae yer ain country”?]

IAIN CONNELL: Oh, don’t start Mr Bleeding Heart – how can you be racist to a lift? [how can ye be racist tae a lift?]

ELEVATOR: Please speak slowly and clearly.

CARRIE: Anyway, it’s a really funny sketch, if you haven’t seen it. I will post it, because I think it’s funny.

MEGAN: I don’t know what it is about me. I don’t know if vocal fry would affect it at all. I’m also kind of mumbly. I try not to be mumbly on the podcast obviously, but in my normal everyday life, I am a mumbler, so that might be it. I expect Siri to understand my mumbles, but she don’t, so I gave up.

RACHAEL: But see, that’s part of the problem, because – I don’t know for sure, but I would be beyond shocked if – because I know that for sure, Google has the ability to – it retains the speech samples that you send them, and I’m sure that they fold them back into their training data, so if you’re not using it, because it doesn’t understand you, it’s pretty much never gonna understand you, is the unfortunate thing. I think that’s really part of the reason that there’s – I think – pretty strong class effects. This is me having a science hunch that I haven’t really banged out yet in some experimental work. I think that people who have a higher socioeconomic status and particularly professional class, mobile – not rural the other one.

CARRIE: Urban.

RACHAEL: Urban! Yeah, thank you. Especially professional, mobile, urban people have – I’m almost positive – higher recognition rates, correct word rates.

MEGAN: You mentioned something about how the language model was taking in things like The Wall Street Journal. Wouldn’t that affect it too? That’s not your acoustic signal, but it’s the way you speak? I don’t know.

RACHAEL: Yeah. No that’s fair. “‘Fiduciary’ seems to be a fairly common word that humans use all the time, so I’m gonna look for that one.”

CARRIE: I would be very surprised if class didn’t play a role. It always does. In everything that we talk about, there’s something about class going on too. But we don’t think about it as much in North America as we should.

MEGAN: We really don’t. Especially since it’s wrapped in with race and ethnicity so much. I act like I know anything beyond the States. It’s just very American.

RACHAEL: I think it’s very much the top-level thing that people think about with language variation in the UK, for sure.

MEGAN: Ah, okay.

CARRIE: Yeah. Absolutely.

MEGAN: Interesting.

RACHAEL: There’s RP, and then those weird regional dialects that we don’t like. As a person not from the UK, that’s the judgments that I’ve gotten from consuming popular media.

CARRIE: It used to be worse. Because the BBC used to only have received pronunciation with their reporters, but now you’ll hear regional varieties. Still the most prestigious versions of those varieties, but at least you’ll hear Irish dialects now. Things are slightly better.

MEGAN: You’d hope so. ASR is trained to understand humans, so you’re feeding them these datasets, and I didn’t know this but I guess, like you said, if I talk to Siri, I’m also feeding into a dataset.

RACHAEL: Yeah. That seems very likely to me. Again, I don’t know for sure, and this may be something that’s Googlable, you could find using a search engine, and it may be something that you could not find using a search engine. The other thing about neural nets is because they’re good at seeing things they’ve seen before, they get really good if you have a lot of data, a lot of data. I have not yet seen the company that would ignore free data that people were giving to it to improve model performance.

MEGAN: Do you have examples of automatic speech recognition failing to understand people that we can give the listeners, so they can see the problem?

RACHAEL: I can give you one from my life, which continues to drive me nuts. I’m from the South and I have a general American professional voice that I use, but especially if I’m relaxing with friends or with my family, I definitely sound more Southern. One of the things that happens in the South and also in African American English is nasal place assimilation. If you have a nasal after a stop, which are sounds like [k] [t] [p] [g] [d] [b], you will change the nasal, [m], [n], or [ŋ], to whatever the thing in front of it was. I would say “beanbag” as “beambag”, especially in an informal setting. Or a “handbag” is “hambag”. Put your things in your handbag. I think it’s a fairly common thing. Google used to always, always, always search for “beambag” when I wanted to know about “beanbags”, because I was doing research to get – I currently have one, I just turned to look at it – a really good beanbag chair. They’re very comfy! I like them. It kept telling me about “beambags”, which are not a thing! It just drove me up the wall, because lots of people do this thing. This is a normal speech process.

CARRIE: Yeah. Very common.

MEGAN: Also, a “hambag”, a bag of hams and that might be something people have.

RACHAEL: I guess a Smithfield ham does come in a bag. It comes like a little canvas bag.

MEGAN: I guess that’s where it’s trying to get you. But that’s not what you’re [meaning]. That’s funny. Okay, how do we solve this problem? What should we be thinking about when we develop automatic speech recognition databases and such? Who should be involved?

RACHAEL: Sociolinguists. Definitely hire sociolinguists. That’s my general go-to drum. It’s a hard problem. I don’t want to pretend that a sociolinguist looks at it and they’re like “ah! Fix this parameter!” and then suddenly it works great for everyone. Because the fact of minority languages or language varieties, in particular, is that they’re minority because fewer people use them. If you are trying to optimize performance and accuracy for the model as a whole, and you raise it for the people who are from minority groups – whatever those may be – if you are using the one model, that will lower it for your majority language speakers. Just adding more data isn’t necessarily going to be the fix. People have been working on this for a long time, and it’s a very hard problem, and I have nothing but respect for everyone who’s working on this. There’s a couple of approaches that people are doing. One is to train multiple models on different stable language varieties. In the US I might train one on West Coast generally, and as far as that is a single language variety, I’d probably train one on the Northeast, one on the northern cities, so Chicago, Michigan sort of area – Chicago’s in Illinois – Illinois, Michigan sort of area. One on the South. One also for the mid-Atlantic region. And then select one of those models, based on whichever would most accurately represent the person who’s speaking. That’s one approach. Another approach is to take the model and then change it for every single person’s voice. That will capture dialectal variation, but it will also capture individual variation. The reason that your phone doesn’t do that automatically is because it is very computationally intensive. These models are very big. They have a lot of information in them. They have a lot of parameters, and to change those, it takes a lot of raw processing power. That’s not really feasible to do for individual people, as it stands.
I don’t know, maybe in five years it will be completely feasible. We’ll all have GPUs falling out of our pockets everywhere we go. I don’t know. That’s another approach that some people have taken. I don’t know, maybe with some fancy new ensembling – which is where you take multiple different types of models and stick them together like – what are those, K’nex? – and they build a pipeline, and then you shove the data all the way through the pipeline, and all the different models that are connected together. Those have been getting really good results lately, so maybe some sort of clever ensembling, where you do something like demographic recognition, and then something like shifting your language model a little bit. I don’t know. I don’t know. I don’t know what people are gonna come up with.

MEGAN: This is the future. This is the future that millennials want or something. I don’t know. This is the future liberals want. If this is the future, I’m thinking about the fact that in 30 years we’re gonna be a majority-minority country. We’re on our way to this becoming a bigger and bigger problem.

RACHAEL: Yes. Definitely.

MEGAN: The fact that Siri or Alexa – sorry – has trouble understanding people that aren’t in this white –

RACHAEL: Super-privileged, small group?

MEGAN: Yeah, right. There’s a gender bias too, right? It’s males that are understood.

CARRIE: And we’re the majority.

RACHAEL: I just want to quickly intercede here – I did find in some earlier work that it was more accurate – specifically YouTube’s automatic captions were more accurate for men than women, but I think, because I couldn’t replicate that result, the problem there was actually signal-to-noise ratio. Women tend to be a little bit quieter, because we’re a little bit smaller. If you are speaking at the same effort-level in the same environment, there’s just gonna be a little bit more noise in the signal for women, because we’re not quite as loud. I don’t know that clever signal processing can fix that. I’m gonna keep working on this, and who knows, I might find out that actually there are, you know, really strong differences, and maybe it can’t deal with things that women do more. I was gonna say “vocal fry”, but I’ve seen no evidence that women fry more than men, which I’m sure you talked about. At length.

CARRIE: Right. That was our first episode. Everybody does it. Leave us alone!

MEGAN: Leave it alone. Get the fuck off my vocal fry! What I’m hearing is this is something that we should all very much care about, because, like Carrie said, everyone else is the majority. If it’s best trained on white men that are in higher socio-economic classes, that’s not the majority. It sounds like we need to have people in the room, because, like you said, you don’t think it was considered when they were making these datasets. We need people in the room that are like “wait, I come from this community where that’s not how we talk, this is not gonna work for me or us”.

RACHAEL: Yeah, definitely.

MEGAN: I definitely want to plug representation too. We need more people in the room.

RACHAEL: Definitely. I’ve been talking about English, because that’s what I know about, and specifically American English. I don’t want to get into British dialectology, cuz that’s crazy, crazy complex. But this is also a problem in other languages. Arabic dialects are incredibly different from each other.

CARRIE: Right.

MEGAN: Now I’m thinking about people that are bilingual.

RACHAEL: Or bidialectal.

MEGAN: Or bidialectal, for sure. That’s gonna be something else that we would want automatic speech recognition to recognize.

RACHAEL: Yeah. Absolutely. I can give people something that you can do right now – is that Mozilla, which is the company that owned the Firefox – continues to own, I think, the Firefox web browser – is currently crowdsourcing a database of voices, and voice samples. You can head over to that website, for which there is a link that I for sure can’t find. I think it’s called the Mozilla Common Voice Project, but don’t quote me on that unless it’s right.

MEGAN: We’ll put it somewhere.

RACHAEL: Mozilla is doing a collection of voices of people, and they’re specifically trying to get people from different demographic backgrounds, for specifically this problem, for knowing demographic information about someone, for having speech samples for them. They’re also having people manually check the recordings, so if this is something that’s interesting, and you want to listen to a lot of voices, I’d recommend heading over there and checking it out.

MEGAN: Ah, so they are crowdsourcing automatic speech recognition. That’s a good idea. That’s a tough – how you get the most variation in the people that reply.

RACHAEL: One thing that I found in my own work, and other computational linguists have found as well, is that we know a lot about variation in speech, but a lot of the same variation also exists in text. A lot of the text that you produce in your day-to-day life, especially if it’s anywhere online, is getting fed into a lot of natural language processing tools. There are also problems with those. Things like identifying what language someone is using is not as good.

CARRIE: Yeah, I notice that on Twitter a lot. It wants to translate from French all the time.

MEGAN: Yeah.

RACHAEL: Twitter’s language ID is a hot mess. A hot mess.

CARRIE: And it’s never French. It’s never French. In fact, sometimes it’s English. I’m like “what is going on?”

MEGAN: I’ve had Estonian. Translate from Estonian.

RACHAEL: Yeah. Estonian tends to show up a lot. I’m trying to think of – I have started doing some very lackadaisical data collection. I think it seems to work on a character level, so it tends to be fairly good at languages that have a unique character set. It tends to be very good at Thai, but related Germanic languages – pfft – it does not. That’s Bing. That’s on Microsoft. They’re the back end there, so I 100% blame them. Maybe, if they hadn’t gutted their research teams, they would be able to do this better.

CARRIE: Hint hint.

MEGAN: That is something that we can do immediately. Do you have something really poignant you want to say about why this is all important? What’s the takeaway message? Because we’ve been talking this whole time about why it’s important, but what do you think is the takeaway?

RACHAEL: It’s important to hear people’s voices. Both literally and metaphorically.

CARRIE: There we go. There’s the money shot.

MEGAN: That’s the money shot. Money, money, money. See that’s what we wanted!

CARRIE: Yes. It’s important to hear people’s voices. I think that’s a good place to end.

MEGAN: Yeah, cuz that was it. Unless you have anything else, Rachael?

RACHAEL: Hmm. No, I don’t think so. I use my hands a lot, so hopefully a lot of the things that I was saying with my hands I was also saying with my voice.

MEGAN: Yeah, I realized that at our first episode, I was using my hands, and now my hands don’t even move. It comes with some experience – of my four episodes that I have done, five episodes.

CARRIE: Five! Five episodes. This is our sixth.

RACHAEL: Ooh! Lucky number 6!

CARRIE: Thank you so much, Rachael, for talking with us today.

RACHAEL: You’re welcome!

CARRIE: That was awesome. I learned a lot.

MEGAN: I know, I learned so much. I was so ignorant on this subject. So thank you. Hopefully this will be of interest to people that have no idea, but also to our listeners that really like speech recognition stuff. I know that they’re there. This is very exciting. Alright, cool. I guess we want to leave everyone with one message, which is: don’t be a fucking asshole.

CARRIE: Don’t be an asshole. Bye!

CARRIE: The Vocal Fries Podcast is produced by Chris Ayers for Halftone Audio. Theme music by Nick Granum. You can find us on Tumblr, Twitter, Facebook and Instagram @vocalfriespod. You can email us at vocalfriespod@gmail.com.