This article appears in the March/April 2003 issue of Speech Technology
Magazine
April 8, 2003
The State of Desktop Speech
By Dr. Janet M. Baker
This article focuses
primarily on state-of-the-art
speech applications
presently running on
full-function PCs,
both "desktop" and
smaller. The speech
software runs on the
PC itself, and is typically
used by a single user
at a time. Applications
and customers, drawn
from the US market,
are representative
of those throughout
the world.
"In the beginning was
the Word ..."
We depend on words.
Despite its fleeting
and ethereal nature,
speech is the most
common means of communication
between people. Acquiring
speech and language
is a critical developmental
activity, starting
in infancy, for people
everywhere. However,
to span time and
space in a more permanent
fashion, we need
to
turn spoken words
and data into text
(handwritten,
printed, typed, etc.).
The creation of moveable
type to mass produce
printed materials
has been with us
for 550
years. For just 125
years, people have
been able to use
a typewriter keyboard
to create the printed
word themselves.
Each
of these technological
breakthroughs, improved
and refined over
the years, has dramatically
changed our environment,
our work and how
we
do it.
To go to the next
step, to enable that
text
to be rapidly processed
and disseminated,
we need to turn it
into
computer-readable
form. Word processing
with
keyboard input,
which does just that, has
come
into popular use
only over the past 35
years.
The advent of commercial
large-vocabulary,
general-purpose,
continuous
speech recognition
dictation products
just over five years
ago forms the basis
for
today's desktop speech
capabilities. In
addition to, or instead
of, typing,
users can now use
speech effectively
as an input
modality. Users speak,
and their computers
can take appropriate
action on oral commands,
and, more significantly,
can immediately transcribe
natural speech into
arbitrary text on
their screen as they
speak.
Recorded speech can
be played back, for
transcription purposes,
proofreading and
other applications.
High
quality synthetic
speech can read text
and data
on demand, allowing
people to listen
to email or other
materials
while they are otherwise
engaged. While far
from perfect, these
speech input and
output capabilities
are presently
relied upon by significant
user populations.
Further research
and product
development will
continue to improve
system performance
and expand the markets
and user groups adopting
this technology.
In this day and age,
with most office
workers responsible
for generating
their own email,
reports, etc., typing
is a time-consuming
activity. Although "thought" time
is often the gating
factor for job throughput,
keyboarding is still
an important component.
Programmers, journalists
and secretaries are
among the world's fastest
keyboard aficionados,
typing up to and above
100 words per minute
(wpm). It has been
reported, however, that the average
office worker types
at only 30-40 wpm.
We routinely speak
conversationally at
150-200 wpm.
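To put these rates side by side, the short sketch below works through the arithmetic for a single document. The 600-word report length is a hypothetical figure chosen for illustration, and the calculation covers raw input speed only; it ignores "thought" time and the time spent correcting recognition errors, so it should be read as a rough upper bound on the benefit rather than a measured result.

```python
def minutes_to_produce(words: int, wpm: float) -> float:
    """Return the minutes needed to produce `words` at a steady `wpm`."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return words / wpm


# Rates quoted in the article (lower and upper end of each range).
TYPING_WPM = (30, 40)      # average office worker
SPEAKING_WPM = (150, 200)  # conversational speech

# Hypothetical report length, assumed here purely for illustration.
REPORT_WORDS = 600

typing_minutes = [minutes_to_produce(REPORT_WORDS, r) for r in TYPING_WPM]
speaking_minutes = [minutes_to_produce(REPORT_WORDS, r) for r in SPEAKING_WPM]

# The faster rate gives the shorter time, so print it first.
print(f"Typing:   {typing_minutes[1]:.0f}-{typing_minutes[0]:.0f} minutes")
print(f"Speaking: {speaking_minutes[1]:.0f}-{speaking_minutes[0]:.0f} minutes")
```

On these assumptions the report takes 15-20 minutes to type and 3-4 minutes to speak, a ratio of roughly four to seven in raw input speed.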
Not surprisingly,
two professions that
have
most strongly embraced
desktop dictation
are doctors and lawyers.
These groups have
to
generate copious
amounts of text,
and have to
do it under time-pressure.
As with many professionals,
the work products
which they produce
to communicate
their expertise,
opinions, reports,
etc., and
for which
they are ultimately compensated,
are usually text.
Individual doctors
and lawyers
typically generate
thousands of pages
of text annually.
Some of these are
turned
around in a day;
many more take days
or weeks.
The human and economic
cost savings in improving
throughput and turn-around
time are tremendous.
As a public defender
once put it to me
melodramatically, "When
my paperwork is late,
my client sits in jail!"
Large legal firms,
hospitals, etc. typically
staff their own around-the-clock
or on-call transcription
services to transcribe
recordings of dictated
materials. Some U.S.
hospitals have even
resorted to using
off-shore transcription
services.
Small to medium-sized
legal groups and
medical practices
often scramble
to obtain satisfactory
daytime coverage.
The two-step, record-transcribe
process invoked in
all of these situations
gives rise to errors
which are not caught
by originators of
the
reports in later
reviews. Serious
consequences
have resulted from
common transcription
errors, specifically
the omission of short
words, such as "no" in "no
evidence of cancer".
When health-care professionals
directly use speech
recognition for their
dictation needs, they
receive immediate feedback,
are spared a separate
review cycle, and can
catch and correct errors
while the information
is fresh.
Latencies in transcribed
documents and reports
result in the unavailability
of timely information,
especially critical
in the medical arena
for multiple doctors
conferring on a given
patient. Delays in
reimbursements, especially
from third-party
health insurers,
are directly
tied to the submission
of satisfactory finished
reports. So even
when an emergency
arises,
significant resources
are brought to bear,
and the crisis is
resolved, costs are
not recoverable
until the complete
reports can be submitted.
A major advantage
in rendering text
and
data immediately
into computer-readable
form
is the opportunity,
and greater likelihood,
of entering it into
integrated, streamlined
workflow processes
for document creation,
customer relationship
management, hospital
information
systems, etc. Centralizing
information directly
improves its integrity,
consistency, availability
and trackability,
while reducing redundancy,
multiple sources
of
errors and time delays.
It's analogous to
the difference between
producing a typewritten
report and a word
processing
document, or between
a handwritten note
and an e-mail message.
Dictaphone, Philips
and Sony have each
integrated desktop
dictation capabilities
into their centralized
dictation products.
These are marketed
primarily to doctors,
lawyers and large
enterprises.
Another significant
group of users are
people who use dictation
software because
of disabilities or
impairments.
For many of these
people, dictation
software
allows them to work
or pursue their education;
without it, they
couldn't. Disabilities
where
this technology has
proven very useful
include mobility
impairments, paralysis,
cerebral
palsy, muscular dystrophy,
dyslexia and carpal
tunnel syndrome.
Carpal tunnel syndrome,
one form of repetitive
stress injury (RSI),
is the single largest
occupational disability
in the United States
today. A growing
number of companies,
including
Chevron, Kodak and Southern
California Edison,
make dictation
software available
to their employees
who have been injured
or who are at risk.
A real "equal
opportunity" disability
today, RSI afflicts
not only factory workers,
laborers and musicians
but also office
workers and professionals
who spend too much
time typing. Many disabled
students, from young
children to adults,
now depend on speech
recognition to do schoolwork,
conduct Internet searches,
etc. The multi-partner
Liberated Learning
Project uses dictation
software to project
real-time text transcription
during college lectures
for the benefit of
disabled and able-bodied
students alike.
Text-to-speech has
also proven very
valuable. It enables
blind and
visually impaired
people to access
computer-readable
information. For
people
who can't speak clearly,
synthetic speech
provides an effective
communication
alternative. It is
worth reflecting
that disabled users
have instigated
the innovation
of many of today's
major office technologies,
including the typewriter,
the telephone and
even the ballpoint
pen!
Surveys previously
reported by Dragon
Systems and IBM found
that heavy users
of dictation software
span many classes
of
business, government
and home users. This
technology has become
routine for many
transcribers of dictated
materials
(e.g. Veterans Administration
hospitals for medical
reports), document
creation (e.g. Berrocal & Wilkins,
P.A., Sidley, Austin,
Brown and Wood for
legal briefs, etc.),
news story capture
(e.g. Herald News, Joliet,
IL), foreign language
translators who routinely
dictate their translations
(e.g. United Nations),
quality control inspectors
working in "hands-free/eyes-free" environments
(e.g. Volkswagen),
law enforcement officers
(e.g. Los Angeles Police
Department) and many
more. Young people
write school papers
by voice, while senior
citizens talk to compose
email. For these latter
two groups, the very
young and the very
old, special speech
patterns modeling the typical
acoustics and language
usage of each group
have
been built into some
dictation products.
Form-filling applications
are well-suited to
speech input. Forms
typically include
a combination of
fields
of well-defined,
application-specific,
restricted data (numbers,
dates, codes, states,
conditions, etc.)
as well as fields
for
free-text (observations,
detailed descriptions,
special instructions,
exception reporting,
etc.). Applications
in this arena are
diverse, from financial
trading
floors to manufacturing
floors.
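The mix of restricted fields and free-text fields described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not any vendor's API: the `FormField` class, the helper functions and the inspection-form field names are all hypothetical, and the input is assumed to be text already produced by a speech recognizer (no speech engine is called here).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A parser takes lower-cased recognized text and returns a normalized
# value, or None when the text is not valid for the field.
Parser = Callable[[str], Optional[str]]


@dataclass
class FormField:
    """Hypothetical form field; `parse=None` marks a free-text field."""
    name: str
    parse: Optional[Parser] = None

    def fill(self, recognized_text: str) -> str:
        text = recognized_text.strip()
        if self.parse is None:
            return text  # free text is accepted as dictated
        value = self.parse(text.lower())
        if value is None:
            raise ValueError(f"{text!r} is not valid for field {self.name!r}")
        return value


def one_of(*choices: str) -> Parser:
    """Restrict a field to a fixed list of choices (case-insensitive)."""
    allowed = {choice.lower(): choice for choice in choices}
    return lambda text: allowed.get(text)


DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def spoken_digits(text: str) -> Optional[str]:
    """Turn a spoken digit string such as 'four oh seven' into '407'."""
    digits = [DIGIT_WORDS.get(word) for word in text.split()]
    if not digits or None in digits:
        return None
    return "".join(digits)


# Hypothetical inspection form: two restricted fields and one free-text field.
inspection_form = [
    FormField("part_code", spoken_digits),
    FormField("condition", one_of("Pass", "Fail", "Rework")),
    FormField("notes"),
]

spoken = ["four oh seven", "Fail", "Scratch near the left hinge."]
record = {f.name: f.fill(t) for f, t in zip(inspection_form, spoken)}
print(record)
```

One reason such forms suit speech input is visible in the sketch: a restricted field accepts only a small, known set of utterances, so anything outside that set can be rejected immediately, while free-text fields fall back on general dictation.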
A growing number
of mobile workers,
especially
business people and
law enforcement officers,
routinely record
customer reports,
expenses,
time billing, data,
and other observations
into high-quality
hand-held recorders.
When these
people return to
their PC, they download
their
recorded acoustic
data to obtain a
transcript
automatically. Data
and memoranda recorded
on-the-spot are demonstrably
more accurate and
complete than later
recollections.
Who Are the Principal Players?
The principal players
today offering speech
dictation and speech
synthesis capabilities
for desktops are
IBM and ScanSoft,
followed
by Microsoft, and
more distantly Philips
Electronics.
Focusing primarily
on speech recognition,
IBM has developed
its own technology,
commencing
seriously in the
early 1970s. IBM
offers
an extensive line
of ViaVoice products,
available in 11 world
languages. Application
development tools
and runtime licenses
are also available
through partners.
IBM's
products are noted
for high quality
and are widely marketed.
Through an acquisition
in the Delaware Bankruptcy
Court in 2002, ScanSoft
gained rights to
the market-leading
Dragon
NaturallySpeaking
product line, as
well as other
Dragon Systems assets.
Its creator, Dragon
Systems, had been
acquired in a 100%
stock swap
by Lernout and Hauspie
in June, 2000. Shortly
thereafter, allegations
of Enron-like fraud
by L&H drove the
company into bankruptcy,
rendering the stock
virtually worthless.
Despite the exodus
of most former Dragon
employees, an extensive
line of Dragon NaturallySpeaking
products, tools and
services continues
to be developed (see "Immortal
Code" in Wired
magazine, Feb., 2003).
Dragon products are
also available in many
world languages and
marketed internationally.
Both the ViaVoice
and Dragon NaturallySpeaking
products and licenses
are marketed and
distributed
through multiple
channels, including
retail, catalog
and Web sales; Value
Added Resellers (VARs);
Independent Software
Vendors (ISVs);
and Original Equipment
Manufacturers (OEMs)
marketing bundled
hardware and/or software
products.
Microsoft, a more
recent entrant to
the market,
has introduced its
dictation speech
engine "Whisper" and
its text-to-speech
engine "Whistler" in
its SAPI software developer
kit. These are shipped
by Microsoft's Speech.Net
initiative for inclusion
in major Microsoft
products, including
Microsoft Encarta,
Windows 2000, Office
XP and Windows XP.
Although the features
and performance of
these engines are not
as advanced as the
Dragon and ViaVoice
products, these engines
have become readily
available and "free" to
large numbers of users.
Microsoft has not significantly
promoted or marketed
these capabilities
to date. They are available
in English, Chinese
and Japanese.
Philips focuses
its SpeechMagic engine
on use as an adjunct
to its server-based
dictation systems.
Its FreeSpeech dictation
product was withdrawn
from the highly competitive
U.S. retail distribution
several years ago.
Philips' speech engine
is primarily used
for
medical and legal
applications marketed
in Europe.
Besides desktop speech,
all of these companies
have been or are
becoming actively
engaged in
applying speech technology
to a range of platforms,
from servers to embedded
devices. A number
of other companies
are
supplying component
technology for desktops
and/or other platforms.
Hundreds of VARs
and ISVs integrate
and
customize systems
for individual customers
and market segments.
A limited number
of
companies and university
departments specialize
in making fundamental
advances in core
speech technology
and focus on
especially challenging
operational
tasks.
What's Next?
Today's desktop speech
is the springboard
for advanced speech
capabilities on a
myriad of convenient
handheld
and other mobile
devices. Major desktop
applications
such as word processing
and email become
major headaches on
small
devices with minuscule
keyboards. Graffiti
is inherently slow
and no one really
wants to enter "text
by toothpick" on
a set of tiny keys.
The advent of more
powerful small devices
will usher in platforms
where speech I/O has
unique advantages.
As with the transition
from mini and mainframe
computers to PCs,
local processing
on wireless
PDAs and high-end
cell phones will
enable
users to work without
the delays and problems
inherent in remote
server access for
distributed speech
processing.
In addition to standard
database queries
checking weather,
stocks and
sports scores, users
will be able to compose
email, conduct queries
of open-ended search
engines, create sales
reports and maintain
customer databases,
and record field
observations directly.
Connecting
to servers and networks,
while still very
valuable and essential
for many
applications, will
no longer be a requirement
for speech processing.
New, more sophisticated,
desktop applications
will also gain currency.
Some of today's server-based
applications will
be ported onto desktop
platforms. Desktop
computers with ever
faster processors,
networking and Internet
access, more substantial
storage, etc. will
be able to support
more advanced speech
and language capabilities
to conduct local
audiomining
(audio search engines),
multispeaker meeting
and telephony transcription
processes, real-time
spoken language translation
and progressively
more natural language
database
queries. Feasibility
prototypes for all
these devices, small
and large, have already
been demonstrated.
It is a matter of
time and additional
R&D
improvements to bring
them successfully to
market for consumers.
User interface improvements
will make it easier
for new users to
start using speech
systems
with less effort,
to make corrections
more
intuitively, to move
seamlessly between
diverse devices,
etc. Ongoing improvements
in speech recognition
accuracy, more natural
text-to-speech, and
overall system capabilities
are of paramount
importance
in creating ever
more useful and attractive
products.
Market conditioning,
an essential component
for the wide-scale
adoption of all new
technology, is now
becoming ever more
evident for speech
technology. Many
people are now beginning
to
encounter speech
technology with brief
speech interactions
for constrained command/control
or database tasks
over
the telephone. Automated
directory assistance and
prompts such as "say
or touch '1'" or
say "collect call" or "operator"
are typical examples.
More recently, telephone
callers seeking information
such as Amtrak train
schedules even encounter
artificial interactive
personae, such as Amtrak's "Julie".
Low-end desktop speech
capabilities have been
widely disseminated
through low cost retail
products, product offerings
through AOL, and hardware/software
product bundles. The
shower of industry
awards collected by
dictation software
over the past five
years also helps raise
consumer awareness.
Increasing familiarity
with effective speech
technology speeds its
adoption in all sectors.
Business Background
and Prospects
It is, of course,
the customers and
consumers
of all kinds of speech
technology who realize
the greatest economic
benefits from time
savings and convenience,
reduced labor and
other cost savings.
Those
benefits include
the time a medical
doctor
saves writing reports,
or the ability of
a disabled person
to
rejoin the workforce,
as well as relieving
a telephone operator
with an automated
attendant, or providing
the safety
of hands-free cell-phone
dialing in an automotive
environment.
Desktop speech products
started emerging
about 20 years ago.
Over
the past five years,
with the arrival
of general purpose
dictation
software, millions
of copies have shipped
worldwide. According
to PC Data's monthly
surveys of the U.S.
retail sector alone,
desktop dictation
software sales (both
units and dollars)
have ranked as
a top category of
business software
sales for
the past several
years. Significant
revenues
for desktop dictation
sales also derive
from direct sales
and licensing
to corporate and
government customers,
VAR/ISVs,
OEMs, etc. L&H's
failure to market and
ship products during
its debacle from
late 2000 to early
2002 set
back industry product
sales substantially.
Given the computer
industry malaise and the
general economic downturn,
recovery of desktop
speech sales, though
steady, promises to
be slow. Nonetheless,
the companies supplying
desktop speech capabilities
(as contrasted with
speech companies supplying
server-based telephony
or embedded speech
technology) have consistently,
both past and present,
garnered the lion's
share of all speech
company revenues and
profits.
Despite recessionary
fears, serious set-backs
and delays, a number
of market prognosticators
still project annual
speech industry revenues
in the several hundred
million to multi-billion
dollar range over
the next decade.
As in
the past, market
advances are likely
to be directly
tied to significant
technological advances.
Presently, companies
such as IBM (with
WebSphere) and Microsoft
(with
its .NET initiative)
are focusing much
of their speech and
language R&D on
the burgeoning Web
application server
market, using speech
input/output as an
adjunct, especially
to handheld devices
and cell phones with
minimalist keyboards.
Initial systems will
focus on users gaining
access by voice for
constrained database
applications, similar
to those presently
popularized by telephony-focused
speech companies.
Even greater user
and economic benefits,
however, will arise
from the porting
of
today's advanced
desktop speech
capabilities onto these small
form-factor
platforms themselves.
Meanwhile, a major
issue presently
in contention
with respect to
multiple platforms accessing
Web server applications
is the choice of
standards. Microsoft
is promoting
the SALT initiative
in opposition to
X+V (a combination
of XHTML
and VoiceXML),
advocated by IBM and others.
Harmonizing these
two standards would
confer
major benefits
to suppliers, developers,
customers
and the market.
As well evidenced by
the U.S. cell phone
industry,
multiple standards
confuse, fragment
and delay the industry
while dramatically
increasing costs
and
reducing functionality.
Technological innovations
are the key to
future success;
improved
recognition accuracy
and more natural
text-to-speech
will drive higher
utility,
market acceptance,
return-on-investment,
and higher economic
returns. Common
interface standards
and consistent
intuitive user
interfaces will
enhance products
and expand the
user
base. Improved
noise handling
expands
the environments
where
this technology
is effective;
affordable low-power processors
will enable handheld
devices and cell-phones
to rival today's
desktop PCs.
Progressively, speech will become
a principal communication
mode, with speed
and
convenience for
both
machines and
people!
* Names and trademarks
of companies
and products
mentioned
are the property
of their respective
owners.
--------------------------------------------------------------------------------
Dr. Janet M.
Baker is co-founder
of
Dragon Systems
and has been
active in the
speech industry
for over
30 years. She
presently lectures
and writes
on speech technology,
entrepreneurship,
transferring
technology
to the market,
etc.
to
business audiences
worldwide.
She can be
reached at
janet_baker@email.com.