As academia and industry continue to move toward virtual meeting spaces in the face of a global pandemic, the accuracy of automated transcription software has become increasingly important, especially as it pertains to speech-to-text transcription across different genders, races, ethnicities, and ages. Bias toward one demographic over another may deny opportunities to groups whose speech such software fails to capture accurately, ultimately misrepresenting those individuals or preventing them from participating equally within the virtual workplace.
In a study at Northeastern University's IoT lab during the Fall of 2021, I worked with students Harshit Sharma and Sakthi Kripa Selvan under the supervision of Dr. Daniel Dubois to identify and analyze inherent biases in the automated transcription services available in popular social media and video conferencing platforms.
Automated transcription software is a service that transcribes spoken word into written text for purposes of accessibility. In video conferencing platforms, this generally happens in real time, while social media platforms process uploaded audio and deliver the captions afterward. Despite the differences between these two processes, the end goal is the same: to provide a consistent and reliable transcription of the available audio.
Two main challenges needed to be addressed in our experiments: producing the speech-to-text transcripts and then comparing the produced transcripts with the ground truth transcripts. Additionally, for efficiency, we needed a method of batching our audio files so that the experiment could run with minimal supervision, playing speeches from different speakers in succession over multiple video conferencing platforms simultaneously. From this need arose a secondary requirement: separating individual speeches from the monolithic transcript containing the results for all speakers.
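As a rough illustration of that batching step, the sketch below stitches per-speaker recordings into a single session file with silent gaps between them. It assumes mono WAV files at a shared sample rate, and the file names and gap length are illustrative rather than the exact values we used.

```python
# Illustrative sketch: concatenate per-speaker WAV files into one batch with
# silent gaps so the whole session can play unattended. Assumes mono files
# that share a sample rate; file names are placeholders.
import numpy as np
import soundfile as sf

speech_files = ["speaker_01.wav", "speaker_02.wav", "speaker_03.wav"]
gap_seconds = 5  # pause between speeches so the platform can flush its captions

segments = []
samplerate = None
for path in speech_files:
    audio, sr = sf.read(path, dtype="float32")
    samplerate = samplerate or sr
    segments.append(audio)
    segments.append(np.zeros(int(gap_seconds * sr), dtype=np.float32))

sf.write("batched_session.wav", np.concatenate(segments), samplerate)
```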
Finding source data for our experiment proved difficult, since we not only required speakers across many demographics but also needed verified ground truth transcripts for each audio file. During this phase of our research, we considered several datasets, including Mozilla Common Voice and NPR podcasts, but ultimately decided upon the TED-LIUM v3 dataset.
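TED-LIUM v3 distributes its reference text as .stm files, so pulling a ground truth transcript amounts to reading the final field of each segment line. The sketch below shows the general idea, assuming the standard STM field layout; the path is illustrative.

```python
# Sketch: read the ground-truth text for one talk from a TED-LIUM .stm file.
# Assumes the standard STM layout (filename, channel, speaker, start, end,
# label, transcript); the path below is illustrative.
def read_stm_transcript(stm_path):
    pieces = []
    with open(stm_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(";;"):      # STM comment line
                continue
            fields = line.split(None, 6)   # transcript is everything after the label
            if len(fields) < 7:
                continue
            text = fields[6].strip()
            if text and text != "ignore_time_segment_in_scoring":
                pieces.append(text)
    return " ".join(pieces)

ground_truth = read_stm_transcript("TEDLIUM_release-3/data/stm/example_talk.stm")
```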
With the TED-LIUM v3 dataset in hand, we needed a method of performing the experiment in a consistent and reproducible environment. Unfortunately, we could not simply play the audio files from a speaker and capture them with each machine's microphone: external audio interference could alter the produced transcript, which would in turn undermine confidence in our results. To solve this problem, I used the VB-Audio Virtual Cable tool to reroute the machine's output to a virtual device that was then piped directly into the software's input, eliminating the potential for external audio interference while using a single machine for both audio playback and audio capture.
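On the playback side, that routing can be sketched as follows. It assumes the virtual cable registers its playback endpoint under a name containing "CABLE Input", which is how VB-Audio's cable typically appears; the exact device name can vary by machine.

```python
# Sketch: play the batched session into the VB-Audio virtual cable instead of
# the physical speakers, so the conferencing client "hears" clean audio with
# no room noise. The device name is the cable's usual label and may differ.
import sounddevice as sd
import soundfile as sf

audio, samplerate = sf.read("batched_session.wav", dtype="float32")

# Locate the virtual cable's playback endpoint among the available devices.
cable = next(
    i for i, dev in enumerate(sd.query_devices())
    if "CABLE Input" in dev["name"] and dev["max_output_channels"] > 0
)

sd.play(audio, samplerate, device=cable)
sd.wait()  # block until the whole batch has played
```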
Once we had produced transcripts for the Zoom, Google Meet, BlueJeans, and Cisco Webex video conferencing platforms, we had to determine a method of separating the produced transcripts and associating them with the correct speaker. Initially, we discussed using a vocal delimiter or the timestamps on the transcript to identify each speaker, but the timestamps were not consistent between the different transcript formats (.vtt, .txt, .srt), and a vocal delimiter relies on the software correctly transcribing the delimiting phrase. Ultimately, we decided on fuzzy matching, which uses the first and last sentence of a speaker's ground truth text to find exact or highly similar matches in the produced transcript and then pulls all text between the beginning and ending matches.
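The sketch below gives the flavor of that fuzzy-matching step using only the Python standard library; the 0.8 similarity threshold and the helper names are illustrative choices, not the exact parameters we used.

```python
# Sketch of the fuzzy-matching idea: find the ground truth's first and last
# sentences inside the monolithic transcript and slice out everything between.
from difflib import SequenceMatcher

def best_match_index(sentence, transcript_sentences, threshold=0.8):
    """Index of the transcript sentence most similar to `sentence`, if any."""
    scores = [
        SequenceMatcher(None, sentence.lower(), candidate.lower()).ratio()
        for candidate in transcript_sentences
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None

def extract_speaker(transcript_sentences, truth_first, truth_last):
    start = best_match_index(truth_first, transcript_sentences)
    end = best_match_index(truth_last, transcript_sentences)
    if start is None or end is None or end < start:
        return None  # the speaker could not be isolated reliably
    return " ".join(transcript_sentences[start:end + 1])
```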
Once we had separated the automated transcripts for each speaker and associated them with their corresponding ground truth transcripts, we used the jiwer Python library to determine word error rate (WER), match error rate (MER), and word information lost (WIL). We focused on word error rate over match error rate and word information lost, since it was the most comprehensible metric and most accurately represented inaccuracy in the automated transcription service. Our understanding of these values and how they describe our findings can be read about here.
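For a single speaker, the metric step reduces to a few jiwer calls. The two strings below are made-up stand-ins for a ground truth transcript and the corresponding separated platform transcript.

```python
# Sketch of the scoring step with jiwer; the strings are illustrative only.
import jiwer

ground_truth = "so today I want to talk about the hidden cost of convenience"
hypothesis = "so today I want to talk about the hidden cause of convenience"

print("WER:", jiwer.wer(ground_truth, hypothesis))  # word error rate
print("MER:", jiwer.mer(ground_truth, hypothesis))  # match error rate
print("WIL:", jiwer.wil(ground_truth, hypothesis))  # word information lost
```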
Unfortunately, at the end of our experiment, we identified several problems in our design that contributed to what we believe were inaccurate results. The primary issue was separating individual speeches from the monolithic transcript that resulted from batching audio files in each iteration of the experiment. As a result, our WER, MER, and WIL values were measured against automated transcripts that did not always correspond to the correct ground truth transcript. In the future, establishing a reliable method of separating automated transcripts should be the primary goal before further experimentation is done.
Despite this outcome, we did manage to produce an isolated and reproducible testing environment that shows up to two video conferencing platforms can be tested simultaneously, which is a major success for the future of this project. Given the limited number of platforms that provide such automated transcription services, simple vertical scaling of the host machine should allow all four video conferencing platforms to be tested at once in future iterations of this study.