Panel 1:  Noise and the Body: Human Presence in the Lab

Heidi Voskuhl, "Locating Noise: Automatic Speech Recognition and the Human/Machine Dichotomy" (Cornell S&TS)


 

Abstract

            I observed members of a group researching automatic speech recognition who deal with machine recognition in the presence of background noise. The two primary elements necessary for automatic speech recognition are programming and training of a computer speech recognizer, and the preparation of the acoustical signal to be fed to the recognizer, largely a problem of massive information reduction without reducing the overall reliability of the machine's pattern recognition abilities. My research group is well known for applying an innovative approach to the latter element; it aims at developing an algorithm for acoustical signal processing that is modeled after the human ear. This approach is meant to provide a 'better' means for signal processing in automatic speech recognition. The two criteria for 'better' in this context are (1) effective information reduction in the speech signal and (2) improvement of the rates of recognition compared with conventional methods of signal processing.

The main theme I will deal with in my paper is the ambiguity inherent in the mapping of the dichotomy of human-machine upon the dichotomy signal-noise. On the one hand, one could nicely pair 'human' and 'noise' as both standing for disorder, unpredictability, contingency, and one could nicely pair 'machine' and 'signal', as both standing for order, perfection, calculability. At the same time, the human apparatus, for my research group, is the 'ideal recognizer', the one apparatus that will always by far be the best recognizer of human speech signals. 'Human' is in this context thus also standing for order, for perfection, and for an ability to distinguish better than anything else a signal from noise. I will explicate how my informants reconcile institutional commitments with research results, and how they negotiate the respective inferiority and superiority they find in their experiments, systematically varying relevant parameters. Electronic simulation is their primary means of developing a machine as perfect as a human in creating order in the mishmash of noise and signal.

 

Full Text

§1

In the research field of automatic speech recognition, noise becomes significant on several levels; some of them are more epistemological in nature, some more acoustical. When a machine recognises human speech, this process is often illustrated in the following way: first, the machine “perceives” an acoustical signal (e.g. through a microphone); it then generates an “internal representation” of the signal (this happens through various stages and algorithms of digital processing, and the result is a set of vectors, which amounts to no more than a few hundred numbers for a speech signal lasting 2 seconds); and lastly the recogniser “compares” this internally represented signal to a set of sample representations of a vocabulary it has in storage. (This idea of machine speech recognition coincides, by the way, with commonly held beliefs from cognitive science about the way humans recognise speech.)
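The three steps just described can be made concrete with a deliberately crude sketch. Everything specific in it is an assumption made for illustration: the 16 kHz sampling rate, the 25 ms frames, the two features per frame (log energy and zero-crossing rate), the nearest-prototype comparison, and the pure tones standing in for recorded words. Real recognisers use far richer spectral features and statistical models; the point here is only the reduction of a waveform to a small set of vectors and the comparison against stored samples.

```python
import math

def internal_representation(samples, frame_len=400):
    """Reduce a waveform to one (log-energy, zero-crossing rate) pair per frame."""
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        vectors.append((math.log(energy + 1e-12), crossings / frame_len))
    return vectors

def summarise(vectors):
    """Average the frame vectors over time (a crude stand-in for sequence matching)."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(2))

def recognise(samples, prototypes):
    """Return the label of the stored prototype closest to the signal's summary."""
    rep = summarise(internal_representation(samples))
    return min(prototypes, key=lambda label: math.dist(rep, prototypes[label]))

def tone(freq_hz, seconds=2.0, rate=16000):
    """Synthetic stand-in for a recorded word: a pure sine tone (hypothetical data)."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(int(seconds * rate))]

# "Vocabulary in storage": one prototype per hypothetical word.
prototypes = {
    "low": summarise(internal_representation(tone(200))),
    "high": summarise(internal_representation(tone(2000))),
}

unknown = tone(1800)
vectors = internal_representation(unknown)
print(len(unknown), "samples ->", 2 * len(vectors), "numbers")
print(recognise(unknown, prototypes))
```

Under these assumptions a two-second signal of 32,000 samples shrinks to 80 frames of two numbers each, which is the order of magnitude the text mentions.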

 

§2

The presence of noise introduces a new quality into this process; the machine now no longer merely compares the offered signal, or its internal representation, with a stored set of “prototypes”. Rather, in the process of “recognising”, it also needs to distinguish noise from signal. So - the machine needs to recognise a fundamental difference among the noise snatches that it perceives from the universe of sound.

 

§3

The issue of noise and signal is a classical problem in science and technology, and in the studies thereof. It has become habitual among scientists of various branches to employ the terms noise and signal to refer in general to the distinction between good data and bad data, between right and wrong, or between meaning and chaos. Noise and signal thus stand for a, maybe stand for the, major epistemic issue in experimental inquiry into the world. Studies in S&TS have shown how non-trivial it is for scientists to make this distinction in evaluating data. These studies have taught us never to take for granted self-evident data, or clear-cut distinctions between right and wrong. (Panel 2 in the afternoon).

 

§4

It turns out that, in some cases, it is as difficult to teach the distinction between good and bad to artificial entities as it is to teach it to humans. In fact, speech recognition technologies perform very badly, surprisingly badly, in the presence of background noise. They fail under those variable conditions that have little effect on intelligibility for human listeners. They fail utterly, where humans do very well. Humans have robustness in speech recognition against things like additive background noise, against out-of-vocabulary words and sounds, against convolutional noise such as echoes, against rapid speech, against speech with spontaneous phenomena such as false starts, against dialects, and other such phenomena. Humans appear to perform effortlessly even when the background noise itself is made up of human speech, that is, when noise is not different in kind from the signal. Humans appear to be capable of focusing attention on the linguistic message during conversational speech, and of extracting this message from a jumble of background mumble. Acoustical physicists refer to this particular situation as “the cocktail party phenomenon”.

 

§5

Some intuitions about humans and machines and their respective qualities become fuzzy in the context of automatic speech recognition (ASR). Indeed, there seems to be a lot of ambiguity emerging from attempts to map the dichotomy of signal and noise upon the dichotomy human and machine. On one hand, ‘human’ and ‘noise’ tend to stand for disorder, unpredictability, messiness, or contingency. On the other hand, ‘machine’ and ‘signal’ could be coupled as standing for order, calculability, meaningfulness, or intelligibility. At the same time, in the present research context, the human ear is taken to be the perfect recogniser, the one apparatus that will always by far be the best recogniser of human speech signals, distinguishing better than anything else signal from noise. Humans are the ones capable of extracting a meaningful signal from noisy messiness, or messy noisiness, both as a hearer, and as a knower.

 

§6

There is little knowledge as to why humans do so well, and why machines do so badly in recognising speech in the presence of background noise. Speech recognition technologists normally provide some kind of vulgar Darwinian explanation for the quality of human hearing: they point out that the human hearing apparatus and speaking apparatus have co-evolved, and have mutually adapted through thousands of years of evolution. With the help of the mechanism of evolution, then, the “natural” or the “biological” becomes signifier of orderliness, rationality, or functional ability to make meaningful distinctions.

 

§7

By contrast, conventional technical representations of acoustical signals do not display the crucial distinction between noise and human speech at all. Conventional technical representations are graphical images where either sound intensity information or frequency information is plotted against a time axis. Everything, in fact, is just sound, as far as these representations are concerned. Also, the distinction between speech and noise is not all that clear-cut, not even for humans. For example, at the much-vaunted cocktail party, humans have cocktail party problems as well in understanding their interlocutors, or focusing on them. So – it seems that the human is turned into a machine in the research context of ASR. In other words, as Harry Collins has suggested, it could mean that we humans are just getting better and better at mimicking machines, or being taken for machines, rather than vice-versa.

 

§8

At any rate, human speech perception seems to be capable of focusing attention on the linguistic message during conversational speech. One question for researchers in ASR is whether this is mainly due to peripheral properties of human hearing, that is, the outer, middle and inner ear, or whether this is a function of some higher level cognition in the brain. This is not entirely clear, but there is some evidence that some peripheral properties of the human ear might be responsible for successful communication.

 

§9

Because of this evidence about the contribution of the peripheral properties of the human ear, there have been attempts to model and replicate the speech processing mechanisms in the human ear. Yet in the field of ASR, there are still two fundamental research paradigms. One is the engineering approach, the other is the auditory approach. The engineering approach uses technical and mathematical solutions to the problem of ASR, the auditory approach takes seriously the human hearing apparatus as the model for a perfect recogniser. The auditory model is the younger one. It has so far not found full acceptance in the research community. There are several reasons for that. They have to do, on one hand, with concrete research techniques in ASR, and on the other hand with serious deficiencies in knowledge and understanding both of humans and machines.

 

§10

As a participant observer, I studied a research group of graduate students in physics working on speech recognition technologies. They are part of a larger research department that comprises work and expertise in medical physics, acoustics, psycho-acoustics, and digital signal processing (DSP). One of the most prominent research achievements of this larger research department has been the development of a model of sound and speech perception in the human ear. It has been designed after ideas about the human hearing process from physiology and psycho-acoustics. It is an algorithm for the processing of digitalised acoustical signals, and its performance is meant to resemble processes found in the human ear. The processing, the filtering, the amplifying, the mathematical transformation of the signal, and the way it is digitally and visually represented, mimic the natural way of listening and understanding in the outer ear and middle ear.

 

§11

The auditory model is something like the holy cow of this research department. My informants’ task is to apply it to research in ASR. Normally, this model is used in other applications, such as diagnosis of hearing impairments, or experiments in acoustics and medical physics. So – it is not self-evident that it will be helpful for ASR. At least, it has not been developed for that end. Still, the auditory model and its underlying principle are the starting point for my informants’ studies on ASR. Normally, what they do is some kind of testing this auditory algorithm in combination with various kinds of recognisers. The idea is to find out whether this auditory way of processing signals is superior to other ‘conventional’ ways of preparing the speech signal for the recogniser, where the superiority is measured in word error rate.
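Since the word error rate is the yardstick throughout, a small sketch of how it is usually computed may help. It uses the standard definition (substitutions, deletions and insertions needed to turn the recogniser's output into the reference transcript, divided by the number of reference words); the example sentences are invented, and nothing here is taken from my informants' own scoring scripts.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from output (deletion)
                curr[j - 1] + 1,         # extra word in output (insertion)
                prev[j - 1] + (r != h),  # wrong word (substitution) or match
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(word_error_rate("the cat sat", "the bat sat down"))
```

The first call has one deleted word out of six, the second one substitution and one insertion against three reference words. Because insertions count as errors, the rate can exceed 1.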

 

§12

The members of my research group have found some evidence that their signal processing approach modelled after human auditory perception is superior to conventional or ‘technical’ approaches in those cases where speech is mixed with background noise. Conventional methods are superior in ‘clean’ situations, that is, in situations where speech occurs against a background of silence.
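To make the contrast between ‘clean’ and noisy test conditions tangible, here is a minimal sketch of how a noisy signal can be produced from a clean one by adding background noise at a chosen signal-to-noise ratio (SNR). The SNR in decibels is the standard measure, ten times the base-10 logarithm of the ratio of signal power to noise power. The sine tone standing in for speech, the Gaussian noise, and the 5 dB target are all assumptions for illustration; the sketch does not describe how my informants prepare their material.

```python
import math
import random

def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that 10*log10(P_speech / P_noise) equals snr_db."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noisy):
    """Recover the SNR from the clean signal and the mixture."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))

random.seed(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]  # stand-in for speech
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]                    # stand-in for background noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(round(measured_snr_db(speech, noisy), 2))
```

A recogniser can then be scored once on `speech` and once on `noisy`, which is the kind of comparison the evidence above rests on.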

 

§13

This evidence takes on a very suggestive quality, for a very intuitive pairing of dichotomies seems to be in operation. The noisy speech signal, first of all, is the more realistic and “natural” signal, while the “clean” signal is a technical signal, generated in a laboratory context. The noisy, natural signal, furthermore, is recognised by a machine more successfully if this machine’s algorithms are modelled after the human natural or biological ear. So – machine recognition is successful when a natural signal is treated according to the natural or auditory paradigm, while the clean, laboratory-generated signal is successfully recognised by an algorithm designed according to the engineering paradigm. With the help of the measure of the recognition rate, a very suggestive contrast can be drawn between the realms of the “human”, “biological”, “real” on one hand, and the “conventional”, the “artificial”, the “technical” or “mathematical” on the other.

 

§14

It turns out, though, that evidence on the performance of a particular combination of front-end signal processing with a particular recogniser is heavily contingent on many factors. It is contingent upon the kind of hardware and software used for signal processing, training of the recogniser, and recognition. Not surprisingly, this contingency is being invoked by speech technology researchers in their disputes about the proper research paradigm in speech recognition research - the engineering paradigm, or the auditory modelling paradigm. One prominent advocate of the auditory modelling approach, Hynek Hermansky, lists three reasons why auditory models have not been fully accepted, and why they might have failed to yield satisfactory word error rates.

 

 

1)      The outcome of auditory signal processing may not be suitable for a conventional recogniser. New techniques for signal processing are often tested in a well-established overall system which is finely tuned to another technique. In a complex ASR system, many things can go wrong, and usually one of them does when a new technique substitutes an old one without proper adaptation. Research in ASR entails complex and vulnerable experimental set-ups. Error rates may increase simply because of a mismatch between properties of the new signal processing module and the remainder of the system.

 

2)      The second reason Hermansky lists concerns the testing of auditory signal processing modules, which is being done with tasks that may not make visible the weaknesses of conventional techniques. Conventional recognisers work well on clean, well-controlled laboratory data. Applying auditory models to those tasks where conventional techniques work well may not reveal advantages of auditory approaches. According to Hermansky, improvements should be sought and expected in real (that is, noisy) environments, where conventional ASR techniques fail. Again, the above-mentioned dichotomy operates in this argument. “Clean” data is data generated in well-controlled laboratory conditions, and this kind of data is coupled with the engineering paradigm. It is coupled via successful word recognition rates. The pairing of clean speech data and the engineering approach is then contrasted with the auditory approach, which itself is appealingly coupled with “natural” or “real” data, data from the natural world which is not artificially generated. This data is recognised by an auditory or biological algorithm and yields higher recognition rates.

 

3)      Hermansky’s last point is one of epistemological scepticism. He believes that we do not know enough about humans and their way of recognising speech, and we do not understand that well enough to mimic it (or substitute it) successfully in or with machines. Auditory knowledge, for example, is knowledge from artificial stimuli in well controlled lab-environments, stemming from experiments in psycho-acoustics. Furthermore, it is not clear which insights about the human hearing apparatus one should employ in ASR. For it is not clear which elements of the human ear contribute to the extraction of linguistic messages from background noise. The hearing apparatus accounts for a lot more than the recognition of speech, such as perception of direction, spatiality, or voice recognition. This epistemological point will become important once again towards the end.

 

§15

So, there is no consensus in the community of automatic speech recognition researchers as to whether the problem of machine recognition of human speaking should or should not be modelled after the functioning of the human ear and brain, no consensus either on a level of an underlying research paradigm, or on a level of interpreting the pertinent evidence. The members of my research group are committed to their human-modelled approach, for they are part of this larger research group on medical physics, and they co-operate closely with acoustical physicists and with ear, nose, and throat physicians. That sets their parameters. This institutional commitment is the rationale of their research strategies, testing auditory algorithms of speech processing in set-ups of speech recognition technologies. In order to gain an understanding of the potential of the auditory approach, they systematically vary relevant parameters in these set-ups and probe these set-ups according to the word error rates they yield.

 

§16

What dominates their day-to-day research is the sheer number of parameters involved in these tests and experiments, probably at least two dozen, all of which are significant enough to have a likely influence on the word error rate. These parameters are all those details in the way a signal can be processed. One can vary, for example, the range of the filtering; one can vary the resolution of the digitalisation; one can apply, or not apply, various kinds of mathematical transformations; one can include, or not include, algorithms that model certain echo effects in the outer and middle ear; one can apply, or not apply, normalisations; and things like that.

 

§17

About some of these parameters, one can make a reasonable prediction whether or not they would enhance the recognition rate. Some of them, though, are just “mathematical” or “technical” transformations and applications which ex post facto turn out to be useful in the algorithm. But they do not follow the presumed logic of mimicking the human. Much of my informants’ work, and this is what they say about themselves, consists of blindly groping their way through the jungle of the numerous parameters in the algorithm, seeking the optimum recognition rate that their particular combination of algorithms could potentially yield. They spend by far the most time worrying about their computers. They run very complex and long calculations on networks of several computers. Many of these calculations last longer than 24 hours. These operations or tests are fragile and vulnerable, and they are numerous, and more often than not, they do not run through, but break down halfway through the calculation.
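A short sketch shows why systematic coverage of such a parameter space is out of reach and groping becomes the practical mode of work. The parameter names and option values below are hypothetical, chosen only to echo the kinds of choices described (filter range, resolution, normalisation, echo modelling); they are not my informants' actual settings.

```python
from itertools import product

# Hypothetical option lists; names and values are illustrative only.
options = {
    "filter_range_hz": [(100, 4000), (50, 8000)],
    "resolution_bits": [8, 16],
    "normalise": [False, True],
    "model_ear_echo": [False, True],
}

def configurations(options):
    """Yield every combination of parameter settings as a dictionary."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))

configs = list(configurations(options))
print(len(configs))   # four two-way choices give 2 * 2 * 2 * 2 = 16 set-ups
print(2 ** 24)        # two dozen two-way choices would give 16777216 set-ups
```

Even if each of two dozen parameters had only two settings, exhaustive testing would require millions of runs, and many single runs last longer than a day.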

 

§18

There are a few interesting deviations from this trial-and-error mode of research. What my informants do every once in a while, for example, is listen to an acoustical signal, both in its original form and in its processed form. This deviation from their usual digital manipulations becomes relevant when they work on algorithms for background noise suppression. And it is epistemologically very rewarding when done at the right time. Listening to the signal can give them an impression of whether the algorithms are working properly and whether there is a bug in one of their “scripts” (“scripts” are meta-programmes that tie sub-programmes and algorithms together), and it can give them an idea about whether the digital processing does indeed do the job it is supposed to be doing, which is suppressing background noise.


§19

Here, the human ear comes in as an epistemological tool for systematic research. In listening to their original and processed sound material, my informants make assessments about the performance of their digital machines in extracting signal from noise. They use their ears as the primary, natural, analogue listening device, and ears become again something like the infallible epistemological tools of the perfect speech recogniser – the human.

 

§20

Another way of deviating from the trial-and-error mode of research is to step back from tinkering and rigorously apply one’s methodological principles. Methodologically, my informants are highly committed to an auditory approach that takes human hearing as a model to build a machine that can recognise human speech. This provides them with a guideline as to what is sensible, and what is not sensible, in their research. It provides them with what Hermansky calls “natural constraints for sensible speech processing”. The idea of modelling machine signal processing after biological signal processing has, after all, a great deal of plausibility and appeal.

 

§21

The evidence, furthermore, supports this appeal: inspiration from the functioning of the human ear has contributed a great deal to the problem of speech recognition in the presence of background noise, that is, in situations of naturally occurring speech. The auditory approach appears to have helped make the transition from the artificial lab to human reality in matters of speech recognition. From this point of view, it has brought the machine closer to the human. But as argued before, it is intriguing what kind of concept of human is involved here. Again, with the help of Collins’s suggestion one could argue that the auditory paradigm has brought the human closer to the machine. Indeed, the advocates of the auditory approach treat the human ear like a mechanism, as something that can be implemented into an algorithm. The boundaries of humans and machines are clear-cut probably only in situations where dichotomies are applied in a metaphorical or a polemical way. When they are being applied, though, it happens in interesting places.

 

§23

There is an additional twist to the auditory model of signal processing. It is appealing not only through its evidential success. Rather, it generates an additional epistemological level in a research programme. Modelling humans in machines, and having a way of testing and measuring the machine’s performance on the basis of the machine’s rate of word errors, provides some confidence to believe that, in doing this research, one can contribute to an understanding of how humans work. My informants do not believe that there is an easy correspondence between their speech recognition technologies and humans, but they believe that when a certain technology is modelled after humans, and when it yields recognition rates as high as humans do, that then there may be clues to be found how and why humans do so well in speech recognition.

 

§24

Ironically, proponents of the auditory approach believe one of the fundamental problems to be that we have not understood humans well enough to build machines after them, and that this is one of the reasons why auditory models are sometimes unsuccessfully applied in ASR research. Advocates of the auditory approach believe that human perception ultimately determines which components of the signal are used for understanding a message. Thus, one should in general aim at consistencies between properties of ASR technology and properties of human hearing. While this emulation of certain properties of human hearing in ASR can be useful, it is counterproductive to make indiscriminate use of accidental knowledge about human hearing in ASR. Not all properties of human hearing are relevant to speech recognition, and some are not even well understood. Using the wrong knowledge can be worse than using no knowledge at all. In the trial-and-error tinkering mode of research that seems so characteristic of ASR, the researching human is “messy”, at least in epistemological terms. He or she does not know what is going on and does not know whether order or disorder is being created by his or her actions.

 

§25

One very charming counter-argument to the auditory approach has been that airplanes do not need flapping wings to fly, and thus ASR technologies do not need ears to recognise speech. Mimicry of nature without understanding the principle behind its design can cause us to fail, Hermansky argues, similar to the way early attempts in aviation failed when humans tried to fly by clumsily and randomly attaching wings to their arms, thus blindly applying observations from nature, but quite obviously not understanding the basic laws which make the process work. Translated into research programmes this means for Hermansky that knowledge of the principle guiding a process should be deployed, rather than merely copying the appearance of the process. Here the problem of mimicking “nature” in machines is fundamentally epistemological: we do not know the difference between principle and appearance in nature.

 

§25

Researchers in ASR, even when they just want to build a machine that can substitute a human in some specific context and they borrow heuristic clues from human nature, find themselves in intricate epistemic and metaphysical issues. In positivist pretensions or constructivist interpretations knowledge about nature is non-metaphysical and/or mainly relevant as a social phenomenon. In applying these principles to the realm of machines that mimic humans, Harry Collins has suggested the notion of artificial humans as “social prostheses” to clarify general debates about the question of whether a machine is like a human when it is as good as a human.

 

§26

ASR researchers working under the engineering paradigm aim at building what Harry Collins calls a social prosthesis, that is, an artefact that successfully substitutes a human in a specific social setting, but does not necessarily resemble the human in content or appearance. Current ASR technologies are artificial listeners, potentially functioning well as social prostheses. Collins would not want to ask the question whether this machine is intelligent just because it appears to be able to listen and understand. He holds that there is no reason to assume anything about the nature or essence of these artefacts, and that their success can be explained through us, that is through our beneficial treatments of them and through the attributions that we quite willingly make to accommodate our encounters with machines.

 

§27

In the research context that I have studied, though, these prostheses are being produced under a research paradigm that prescribes imitating human speech processing mechanisms in the artificial listener. Here, metaphysical questions about humanness and machine-ness, as well as epistemic questions about humans become quite urgent. When machines have to recognise speech in the presence of background noise, as it appears so appealingly, we have to resort to human principles and install them into an artefact. This requires knowing something about how humans work and then applying it. Only when the artefact is finished, it seems, can this type of speech recognition technologist be anti-metaphysical as to the question of whether this machine has human features or not. As long as they implement human hearing into machines, they, one way or the other, have to believe that they rebuild a human in content, or in appearance.

 

§28

Collins makes an important argument that in appearance and content a human and a social prosthesis do not necessarily have anything in common. However, Collins is interested in reminding us that we need not and maybe ought not to assume that about artefacts that surround us. By contrast, auditory speech recognition technologists are the ones who produce those artefacts in the first place. They are interested in building a functioning artefact with the help of knowledge about humans.

 

§28

Proponents of the “auditory approach” go beyond their engineering colleagues, and beyond the social prosthesis. They build a machine that functions as a listener on the basis of knowledge of a human listener. And crucially, they fall into the slippage Collins wanted them not to fall into. Once they use ideas about human ears and human brains for their manipulation of digital signals, the slippage occurs from having built a functioning machine, to having modelled parts of a human, and ultimately to having understood how the human functions. The auditory people operate under a double demand of understanding humans and building artificial humans, two demands that, according to Harry Collins, do not have an inherent connection to each other. Yet the connection is made in the practice of auditory speech recognition technology research.

 

§29

In discussions about humans and machines, there have been participants who are anxious to insist that computers can ultimately replace humans or be like humans. This is understandable for various reasons, but it makes questionable assumptions. Against this, Collins denies the existence of barriers or clear-cut realms of formal thought and action, and situated thought and action. For him, as he says, there is only one kind of cognitive stuff in the world. So, he draws our attention, again, to the social dimension of everything, even what we think of as formal activity. This stance of denying discontinuities in realms of knowledge and action enables Collins to explain what has been achieved in creating social prostheses, in creating machines that are as good as humans in terms of social interaction between humans and machines, and to explain the attributions humans make to successfully accommodate these interactions.

 

§30

At the same time, auditory speech recognition technologists worry about things like this:

 

“There is a lot we know about the human auditory system. We also know, however, that utilising flapping wings is not the best way to build an air plane. So, what do we really know, and what is the ultimate secret of making automatic speech recognition work?”

 

In desiring to mimic the human in a machine, the question of knowledge becomes ever more pressing. While Collins rightly insists that we confuse imitating humans in machines with substituting humans through machines, there is another dimension to this problem in the process of designing these machines. Collins rightly insists that there is not necessarily an essential similarity in appearance or content of human and machines, saying that rather this similarity is constructed and attributed by us. In ASR technology research, scientists intentionally try to make a machine that bears this similarity. This is not to say at all that scientists would not also make these attributions and constructions. However, in looking closely at this research context, it turns out that while humans perform well in dealing with noise and in dealing with machines, they do not perform as well in acquiring knowledge about themselves to build machines from that.

 

Discussion

            Following Heidi’s talk, discussion centered on unpacking the actors’ and analysts’ categories.  Prof. Collins wished to know whether the ASR scientists thought that using techniques that ostensibly mimicked a human would help the recognizer distinguish human from machine noise.  He also deconstructed the “cocktail party problem” by pointing to the fact that the “noise” of the cocktail party is not random white or pink noise, but other conversations which are ontologically equivalent to the conversation which the recognizer (human or machine) is decoding.  Humans have the ability to switch between which conversation they are listening to, based on where their attention is directed.  They also have an understanding of the language in which the conversations are occurring, which aids in deciphering the sounds of conversation; do the ASR scientists think that a recognizer which does not understand the language could ever have as high a recognition rate as a human?  Prof. Lynch wished to know what the ASR scientists meant by an “ear”.  What boundaries do they draw around the ear?  What, for them, constitutes a prosthesis?  When they point to the ear and say they want to mimic it, it’s unclear what they are pointing to.  Heidi responded that, for them, “the ear” is an algorithm and a symbol, more than some discrete physical object.  Just as one would need to reproduce all of the elements of the bird to create an object that flies like a bird, so one would need to integrate all the elements of human hearing to create a recognizer that works like the human ear.  Prof. Pinch wondered whether “epistemological” was an actors’ category or not, to which Heidi replied that the actors know about how much they know.  Finally, Prof. Dear asked how one distinguishes between the determination of meaning and the determination of transcribable units; according to Heidi, the key quantity is a measure of “sentence recognition.”

 
