Wednesday, May 5, 2010

Speech Recognition, with a theological coda

Machine translation, using a computer to translate from one natural language to another, was one of the earliest problems tackled by computer scientists in the decade after WWII. That effort was largely abandoned in the United States in the mid-1960s, but a variety of computational investigations into language continued. By the 1970s two of them, among others, were going strong: speech recognition and speech understanding.

Speech understanding implies some underlying intelligence, some ability to reason about what is being said. For example, during the late 1970s ARPA, the Defense Department's Advanced Research Projects Agency (now DARPA), sponsored a project in which computers would take a spoken query as input and then return an answer. To answer the query the computer had to: 1) figure out what words were being said, 2) parse the sentence structure, 3) analyze the meaning, 4) formulate that meaning as a database query (about warships in this case), 5) run the query against the database to determine the answer, and 6) formulate the answer in English. A big problem.
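
To make the shape of that pipeline concrete, here is a toy sketch in Python. Nothing in it comes from the actual ARPA project; the function names, the fake warship "database," and the canned question are all invented for illustration, and each one-line stub stands in for what was, at the time, a hard research problem in its own right.

```python
# A toy sketch of the six-stage pipeline described above. Every function is a
# made-up placeholder; each one stands in for a hard research problem.

def recognize_words(audio):          # 1) figure out what words were said
    return audio.split()             # pretend the audio arrives as text

def parse(words):                    # 2) parse the sentence structure
    return {"type": "question", "tokens": words}

def analyze_meaning(tree):           # 3) analyze the meaning
    return {"ask": "location", "entity": tree["tokens"][-1].rstrip("?")}

def to_database_query(meaning):      # 4) formulate it as a database query
    return ("SELECT location FROM ships WHERE name = ?", (meaning["entity"],))

def run_query(query, db):            # 5) run the query against the database
    _sql, params = query
    return db.get(params[0], "unknown")

def generate_english(result):        # 6) formulate the answer in English
    return "She is currently at {}.".format(result)

def answer_spoken_query(audio, db):
    meaning = analyze_meaning(parse(recognize_words(audio)))
    return generate_english(run_query(to_database_query(meaning), db))

# A fake warship "database" and a query, pre-transcribed for the toy:
print(answer_spoken_query("Where is the Nimitz?", {"Nimitz": "Norfolk"}))
```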

Taken alone, that first step, determining what words were said, is speech recognition. Consider medical transcription, for example. The physician dictates notes orally into a recorder. Now we need to transcribe those notes into a written record. It would be nice to have that done by a computer. We don’t care whether or not the computer understands what the physician has said; we’re not going to ask the computer to do anything with or about what’s in the record. We just want it in written form. That turned out to be a difficult problem. But significant breakthroughs were made in the early 1980s that led to an increasingly robust range of practical speech recognition technologies. This technology makes no attempt at “intelligence,” just lots of number crunching: statistical analysis of large masses of speech data from which recognition "signatures" for words are derived.
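
As a rough illustration of that "dumb" statistical approach, here is a toy recognizer: it averages feature vectors from labeled recordings into a per-word "signature," then labels new input by the nearest signature. The numbers and the nearest-neighbor scheme are invented for illustration; the actual breakthroughs of that era rested on far more sophisticated statistical machinery, hidden Markov models over acoustic features among it. But the spirit, pattern matching against statistics gathered from masses of data, with no understanding anywhere, is the same.

```python
# A toy "signature" recognizer: pure statistics, no understanding.
# The 3-number "acoustic features" are made up for illustration.
import math
from collections import defaultdict

def train_signatures(labeled_features):
    """Average the feature vectors for each word into a single signature."""
    totals, counts = defaultdict(list), defaultdict(int)
    for word, vec in labeled_features:
        if not totals[word]:
            totals[word] = [0.0] * len(vec)
        totals[word] = [t + x for t, x in zip(totals[word], vec)]
        counts[word] += 1
    return {w: [t / counts[w] for t in totals[w]] for w in totals}

def recognize(features, signatures):
    """Label the input with the word whose signature is closest."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(signatures, key=lambda w: distance(features, signatures[w]))

data = [("yes", [0.9, 0.1, 0.2]), ("yes", [0.8, 0.2, 0.1]),
        ("no",  [0.1, 0.9, 0.7]), ("no",  [0.2, 0.8, 0.8])]
signatures = train_signatures(data)
print(recognize([0.85, 0.15, 0.15], signatures))  # prints "yes"
```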

A couple of years ago I attended a seminar at Columbia University in which an IBM researcher, whose name I forget, reported that speech recognition accuracy was about to level off without, however, reaching human levels of performance. I was thus not surprised to read Robert Fortner’s recent article making that same point (h/t Tyler Cowen) – though, if you read the article, you should also read the comment by Jeff Foley. The current technology is “dumb.” This is, after all, speech recognition, not speech understanding.

Of course, we would like to know just why mere speech recognition is so difficult. That speech understanding should be difficult, that’s obvious, for understanding implies intelligence, and the construction of artificial intelligence has proven to be quite difficult. But we’re not talking about understanding, we’re talking about mere dumb recognition. Why is that difficult?

Well, obviously the speech signal is unclear and ambiguous. The sound of a given word will vary according to speech context (the words before and after, the overall intonation pattern of the sentence), from one speaker to another, and then we have to worry about extraneous sounds interfering with the speech signal. This makes it extremely difficult to give a physical specification for the sounds of words. At this point, given all the research that’s been done on the problem, I’m tempted to assert that it is, in principle, impossible to provide such a specification.

Let us assume that that is correct. If so, how then do humans recognize speech as accurately as we do? That is, if the identity of the physical word, its constituent syllables, cannot be physically specified, then how is it possible for us to recognize them? If the identity isn’t physically there, where is it? In the ether? It is all well and good to say that we’re very good at difficult perceptual problems, better than computers are. We’re also better at the visual recognition of objects (and handwriting). But that doesn’t tell us what we’re doing that the computers aren’t.

There is, however, a profound difference between object recognition and speech recognition. Setting aside all the objects which humans have created, the world is full of kinds of objects that exist independently of us. Speech does not so exist. Speech is created by us and for us. Thus it would be possible to attempt speech recognition by attempting the much more difficult problem of speech understanding.

And that, I assume, is how humans are so good at speech recognition. For us speech recognition is just one aspect of the richer problem of speech understanding. We can use our knowledge of what the speaker is saying to disambiguate sounds we can’t parse. In the worst case we can ask the speaker what she meant.
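
A toy way to picture that: suppose the acoustics alone leave the recognizer torn between two transcriptions that sound nearly the same. Some sense of what the conversation is about, even a crude one, breaks the tie. The hypotheses and the plausibility numbers below are invented; real systems encode this kind of knowledge in statistical language models rather than a hand-written table, and human understanding of course runs far deeper than either.

```python
# Two transcription hypotheses that sound nearly identical, disambiguated by
# a crude, hand-made model of what the speaker is likely to be talking about.
# All numbers are invented for illustration.

hypotheses = ["recognize speech", "wreck a nice beach"]

# How plausible each phrase is, given the topic of the conversation:
plausibility = {
    "recognize speech":   {"computers": 0.90, "seaside": 0.10},
    "wreck a nice beach": {"computers": 0.05, "seaside": 0.60},
}

def disambiguate(candidates, topic):
    """Pick the candidate most plausible given what we take the speaker
    to be talking about."""
    return max(candidates, key=lambda h: plausibility[h][topic])

print(disambiguate(hypotheses, "computers"))  # -> recognize speech
print(disambiguate(hypotheses, "seaside"))    # -> wreck a nice beach
```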

Think about that. And again. In order merely to recognize the words, such a superficial act, we must understand what they mean, a deep act.

Coda: What would it mean to attempt mere object recognition – of, say, rocks, acorns, leaves, or birds – in an analogous way? That, it seems, takes us into the realm of theology, where some divine being is the “speaker” of the world whose intentions we attempt to divine so that we might properly recognize the objects that being has created. Note that I am not offering this reflection as a theological argument. I am not a believer, not in any ordinary or readily intelligible sense. But I think the parallel I'm suggesting is an interesting one. It is one that Giambattista Vico would recognize.


* * * * *

As a pendant to this reflection I recommend Computing Interfaces, Human Scale & The Death of Interoperability by my old Gravity buddy, Michael Bowen. You might also want to check out this post on the demise of Chomsky’s notion of Universal Grammar.

4 comments:

  1. I agree with you that we'll never get human performance at recognition without understanding. Why would we? Do you think you could be trained to accurately transcribe a foreign language you didn't understand to the level of accuracy you get in your native tongue? Or even to the level you would get in a foreign language that you do understand?

    On the other hand Fortner writes off speech recognition too soon. Google's recent work has significantly advanced the state of the art there. My Android phone accurately recognizes and translates questions and phrases from English to other European languages. It accurately transcribes my simple queries and sends them to Google for searching. And it does this with no training. (I see there is a comment on his blog entry by 'dtunkelang' which confirms this experience.)

    Google has been providing free automated voice directory assistance for a while now. I'll bet that corpus has contributed significantly to this capability.

  2. Are you sure the Android wasn't passed through a time warp from the Enterprise?

  3. Yes. Thinking about speech recognition is interesting. I think that children learn through speech recognition first, and then the understanding comes through repetition. If somehow we taught a computer and 'rewarded' its neural networks for seven years, then maybe it would have the understanding of a second grader. I don't think programmers and researchers are so patient.

  4. The evidence, as I understand it, is not so simple. On the one hand, infants learn to hear the sounds of language around them. That specialization is noticeable around 6 months; and, of course, they're also learning to speak those sounds (babbling). This process is, I suspect, somewhat autonomous from the process of understanding what others say.

    Understanding involves paying attention to the whole speech situation (Paul Bloom at Yale has written a good book about this). When mommy points and says "doggie," the infant follows mother's gaze and pointing as well as seeing the dog and hearing her speech. It's all part of a single complex interactive behavioral gestalt. The referent (the dog), the speech sound, and the speaker's intentional stance are co-present.
