Technology
Liberated Learning technology centers around two core applications: using speech recognition to automatically caption spoken language and to create web-accessible multimedia transcripts. Building upon a proof of concept application developed at Saint Mary's University, the Consortium has been researching and developing a second generation technology called IBM ViaScribe. IBM’s Human Ability and Accessibility Centre Asia Pacific additionally developed IBM Caption Editing System (CES), an application that can be distinguished by its powerful editor. In 2008, the Consortium is anticipating the release of a powerful new system that will address a number of outstanding technical and user challenges.
ViaScribe contains a speech recognition engine capable of transcribing live or prerecorded speech. Live speech is delivered to the system via a standard or USB microphone. Typically, public speakers wear noise-canceling wireless headsets or lavalieres (lapel mics) that record high quality sound without impeding movement. ViaScribe can also transcribe pre-recorded speech from a variety of audio and video formats, including WAV, MP3, and AVI.
During a live presentation, ViaScribe serves as a real time text display--like a closed captioning window--outputting text as it is processed by the Speech Recognition engine. Because natural spoken language generally does not lend itself to rules of grammar and punctuation, ViaScribe promotes readability by introducing a paragraph break or other markers whenever the speaker pauses to take a breath. These pauses can be customized according to the speaker’s individual speech characteristics.
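The pause-based paragraph breaking described above can be sketched in a few lines. This is a minimal illustration, not ViaScribe's implementation: it assumes the recognizer reports a start and end time for each word, and the names `Word`, `break_on_pauses`, and `pause_threshold` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def break_on_pauses(words, pause_threshold=1.0):
    """Group words into paragraphs, starting a new paragraph whenever the
    silence between two consecutive words is at least pause_threshold seconds.

    pause_threshold is a hypothetical per-speaker setting, standing in for the
    customizable pause length mentioned in the text.
    """
    paragraphs = []
    current = []
    previous_end = None
    for word in words:
        if previous_end is not None and word.start - previous_end >= pause_threshold:
            paragraphs.append(" ".join(current))
            current = []
        current.append(word.text)
        previous_end = word.end
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


words = [
    Word("good", 0.0, 0.3),
    Word("morning", 0.35, 0.8),
    Word("today", 2.0, 2.4),  # 1.2 s of silence before this word
    Word("we", 2.45, 2.6),
]
print(break_on_pauses(words, pause_threshold=1.0))
```

A speaker who pauses often might be given a larger threshold so that short hesitations do not fragment the text.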
The speaker can also use interactive voice commands to navigate PowerPoint slides or other applications during a live transcription, and automatically create captioned multimedia presentations.
ViaScribe saves the Speech Recognition generated transcript, audio, and optionally, screen captures and PowerPoint slides as an accessible webpage or streaming media file. This allows students to view lecture information in a format that suits their individual learning preferences.
In addition to text transcripts, ViaScribe creates a series of accessible multimedia files (SMIL, XML, WAV, RT, RTF) that can easily be published to the web, creating a rich set of teaching resources.
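To give a sense of what a SMIL presentation looks like, the sketch below builds a minimal SMIL document that plays an audio file in parallel with a caption text stream. It is an illustration under stated assumptions, not the file ViaScribe actually writes: the function name `build_smil`, the file names, and the region size are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_smil(audio_src, text_src, width=640, height=120):
    """Return a minimal SMIL document as a string.

    The <par> element plays its children in parallel, so the audio and the
    caption text stream stay on the same timeline. All names and sizes here
    are assumptions chosen for illustration.
    """
    smil = ET.Element("smil")
    head = ET.SubElement(smil, "head")
    layout = ET.SubElement(head, "layout")
    ET.SubElement(layout, "root-layout", width=str(width), height=str(height))
    ET.SubElement(layout, "region", id="captions", left="0", top="0",
                  width=str(width), height=str(height))
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")
    ET.SubElement(par, "audio", src=audio_src)
    ET.SubElement(par, "textstream", src=text_src, region="captions")
    return ET.tostring(smil, encoding="unicode")


print(build_smil("lecture.wav", "lecture.rt"))
```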
IBM ViaScribe features an Application Program Interface (API) exposing speech recognition functions to external applications. Liberated Learning Partners have used this API to build a number of exciting new applications including Personal Displays, Real Time Editing tools, and Multi Speaker systems. (See Projects).
The Human Ability and Accessibility Center of IBM Japan has developed IBM Caption Editing System (CES) which further enhances the editing capabilities of ViaScribe. Our partners at Hiroshima University are currently using this system to make learning more accessible.
CES can use either of two popular commercial speech recognition (SR) systems for transcription: IBM ViaVoice and Dragon NaturallySpeaking. Like ViaScribe, CES displays speech recognition generated caption text for a live audience. For PowerPoint users, the CES window automatically docks itself beneath the slides and indexes each slide with a timestamp that is saved in the multimedia transcript.
CES includes a stand-alone "Master" editor as well as a "Client" editor that allows synchronized corrections to be made over a network. In this way, two or more editors working simultaneously from a Master and Client can correct a transcript more quickly than one editor working independently. These features let the speaker control many aspects of post-transcription editing, enriching the final media to be displayed.
CES enables users to edit not just recognition errors, but the accompanying audio file as well, clipping or silencing passages as required while maintaining synchronization of surrounding captions with the media. CES saves the final product as a Real Player SMIL presentation or as a webpage with Windows Media Player embedded. CES allows users to easily configure the layout of text, slides, and optionally, video in Real Player.
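Keeping captions aligned after a passage is clipped from the audio comes down to shifting every later caption by the length of the removed span. The sketch below shows one way to do that. It is not the CES algorithm; the `Caption` type and `cut_span` function are hypothetical, and captions are assumed to carry start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float  # seconds
    end: float
    text: str


def cut_span(captions, cut_start, cut_end):
    """Remove the interval [cut_start, cut_end) from the timeline and return
    captions re-timed to match the shortened audio.

    Captions entirely inside the cut are dropped, captions that overlap it are
    trimmed, and captions after it are shifted earlier by the cut length.
    """
    if cut_end <= cut_start:
        raise ValueError("cut_end must be greater than cut_start")
    removed = cut_end - cut_start
    result = []
    for c in captions:
        if c.end <= cut_start:
            # Entirely before the cut: timing is unchanged.
            result.append(Caption(c.start, c.end, c.text))
        elif c.start >= cut_end:
            # Entirely after the cut: shift earlier.
            result.append(Caption(c.start - removed, c.end - removed, c.text))
        else:
            # Overlaps the cut: keep only the part that survives.
            new_start = min(c.start, cut_start)
            new_end = max(c.end, cut_end) - removed
            if new_end > new_start:
                result.append(Caption(new_start, new_end, c.text))
    return result


captions = [
    Caption(0.0, 2.0, "Welcome back."),
    Caption(2.0, 5.0, "Is this microphone on?"),
    Caption(5.0, 8.0, "Today we cover chapter three."),
]
for c in cut_span(captions, 2.0, 5.0):
    print(c)
```

Silencing a passage, as opposed to clipping it, leaves the duration unchanged, so no re-timing of captions is needed in that case.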
Challenges
Since taking its first steps in the development of speech recognition as an accessibility tool in 1999, Liberated Learning has been working to overcome fundamental technology and usability issues. Accuracy, readability, user friendliness, and ease of training and editing are all areas that have required cutting edge solutions.
Improving speech recognition accuracy remains a primary challenge. The Consortium spends considerable time studying factors that affect word error rates (WER). A number of factors drive speech recognition accuracy including microphone quality, ambient noise, voice profile training, individual speech characteristics, and available acoustic and language models. A number of integrated efforts such as evaluating new speech engines, improving speech models, developing innovative post-production techniques, and investigating new training approaches are key research priorities.
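Word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The sketch below implements that standard definition with a word-level edit distance. The function name is illustrative, and splitting on whitespace is a simplifying assumption (real evaluations usually normalize case and punctuation first).

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length,
    using a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the recognizer outputs many more words than were spoken.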
Directly linked to accuracy, editing misrecognitions is another core challenge. Editing can be viewed along a continuum, ranging from no post-presentation intervention to extensive correction and modification of transcripts. To correct recognition errors, an editor can replay the digitized audio, make necessary corrections, and update the output. In lectures, this final version can be used as course notes. The Consortium is studying automated post production techniques, team-based editing tools, and new interfaces to improve editing efficiency.
For dictation systems, users have traditionally created personalized voice profiles by reading a set of predefined scripts. This conventional approach usually leads to improved recognition accuracy. However, by training a voice profile using grammatically correct, written language, the system may become biased when faced with an extemporaneous speech situation – such as what occurs in a typical lecture or presentation. Researchers have developed a unique approach to voice profile training that does not require the speaker to read a set of predefined scripts. The traditional training process is replaced by using a person's own transcribed speech to create customized voice models. A lecture is recorded behind the scenes, without any extra preparation by the speaker. The transcribed audio is then edited by a third party to create a voice profile that can then be used to capture the speaker's next lecture, resulting in incremental improvements in accuracy with each usage.
One of the Liberated Learning Consortium’s top priorities is to extend the language base available for ongoing testing and development. ViaScribe currently supports UK and US English, Japanese, Chinese (Mandarin), French, and Italian. The team has recently integrated German and Spanish models, but the new training tools are not yet available for these languages. CES currently supports English and Japanese, with similar integration projects underway.
For a complete list of 2008-2009 development priorities, visit Research and Development