System and method for an endpoint detection of speech for improved speech recognition in noisy environment
USPTO Application #: 20080021707
Title: System and method for an endpoint detection of speech for improved speech recognition in noisy environment

Abstract: According to a disclosed embodiment, an endpointer determines the background energy of a first portion of a speech signal, and a cepstral computing module extracts one or more features of the first portion. The endpointer calculates an average distance of the first portion based on the features. Subsequently, an energy computing module measures the energy of a second portion of the speech signal, and the cepstral computing module extracts one or more features of the second portion. Based on the features of the second portion, the endpointer calculates a distance of the second portion. Thereafter, the endpointer contrasts the energy of the second portion with the background energy of the first portion, and compares the distance of the second portion with the distance of the first portion. The second portion of the speech signal is classified by the endpointer as speech or non-speech based on the contrast and the comparison.

Agent: Farjami & Farjami LLP - Mission Viejo, CA, US
Inventors: Sahar E. Bou-Ghazale, Ayaman O. Asadi, Khaled Assaleh
Class: 704248000 (USPTO)
Related Patent Categories: Data Processing: Speech Signal Processing, Linguistics, Language Translation, And Audio Compression/decompression, Speech Signal Processing, Recognition, Voice Recognition, Endpoint Detection

The Patent Description & Claims data below is from USPTO Patent Application 20080021707.

RELATED APPLICATIONS

[0001] The present application claims the benefit of U.S. provisional application Ser. No. 60/272,956, filed Mar. 2, 2001, which is hereby fully incorporated by reference in the present application.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates generally to the field of speech recognition and, more particularly, speech recognition in noisy environments.

[0004] 2. Related Art

[0005] Automatic speech recognition ("ASR") refers to the ability to convert speech signals into words, or put another way, the ability of a machine to recognize human voice. ASR systems are generally categorized into three types: speaker-independent ASR, speaker-dependent ASR and speaker-verification ASR. Speaker-independent ASR can recognize a group of words from any speaker and allow any speaker to use the available vocabularies after having been trained for a standard vocabulary. Speaker-dependent ASR, on the other hand, can identify a vocabulary of words from a specific speaker after having been trained for an individual user. Training usually requires the individual to say words or phrases one or more times to train the system. A typical application is voice dialing where a caller says a phrase such as "call home" or a name from the caller's directory and the phone number is dialed automatically. Speaker-verification ASR can identify a speaker's identity by matching the speaker's voice to a previously stored pattern. Typically, speaker-verification ASR allows the speaker to choose any word/phrase in any language as the speaker's verification word/phrase, i.e., a spoken password. The speaker may select a verification word/phrase at the beginning of an enrollment procedure during which the speaker-verification ASR is trained and speaker parameters are generated. Once the speaker's identity is stored, the speaker-verification ASR is able to verify whether a claimant is who he/she claims to be. Based on such verification, the speaker-verification ASR may grant or deny the claimant's access or request.

[0006] Detecting when actual speech activity contained in an input speech signal begins and ends is a basic problem for all ASR systems, and it is well-recognized that proper detection is crucial for good speech recognition accuracy. This detection process is referred to as endpointing. FIG. 1 shows a block diagram of a conventional energy-based endpointing system integrated widely in current speech recognition systems. Endpoint detection system 100 illustrated in FIG. 1 comprises endpointer 102, feature extraction module 104 and recognition system 106.

[0007] Continuing with FIG. 1, endpoint detection system 100 utilizes a conventional energy-based algorithm to determine whether an input speech signal, such as speech signal 101, contains actual speech activity. Endpoint detection system 100, which receives speech signal 101 on a frame-by-frame basis, determines the beginning and/or end of speech activity by processing each frame of speech signal 101 and measuring the energy of each frame. By comparing the measured energy of each frame against a preset threshold energy value, endpoint detection system 100 determines whether an input frame has a sufficient energy value to classify as speech. The determination is based on a comparison of the energy value of the frame and a preset threshold energy value. The preset threshold energy value can be based on, for instance, an experimentally determined difference in energy between background/silence and actual speech activity. If the energy value of the input frame is below the threshold energy value, endpointer 102 classifies the contents of the frame as background/silence or "non-speech." On the other hand, if the energy value of the input frame is equal to, or greater than, the threshold energy value, endpointer 102 classifies the contents of the frame as actual speech activity. Endpointer 102 would then signal feature extraction module 104 to extract speech characteristics from the frame.
A common means of extracting speech characteristics is to determine a feature set such as a cepstral feature set, as is known in the art. The cepstral feature set can then be sent to recognition system 106, which processes the information it receives from feature extraction module 104 in order to "recognize" the speech contained in the input frame.

[0008] Referring now to FIG. 2, graph 200 illustrates the endpointing outcome from a conventional endpoint detection system such as endpoint detection system 100 in FIG. 1. In graph 200, the energy of the input speech signal (axis 202) is plotted against the cepstral distance (axis 204). E_silence point 206 on axis 202 represents the energy value of background/silence. As an example, E_silence can be determined experimentally by measuring the energy value of background/silence or non-speech in different conditions such as in a moving vehicle or in a typical office and averaging the values. E_silence + K point 208 represents the preset threshold energy value utilized by the endpointer, such as endpointer 102 in FIG. 1, to classify whether an input speech signal contains actual speech activity. The value K therefore represents the difference in the level of energy between background/silence, i.e. E_silence, and the energy value of what the endpointer is programmed to classify as speech.

[0009] It is seen in graph 200 of FIG. 2 that an energy-based algorithm produces an "all-or-nothing" outcome: if the energy of an input frame is below the threshold level, i.e. E_silence + K, the frame is grouped as part of silence region 210. Conversely, if the energy value of an input frame is equal to or greater than E_silence + K, it is classified as speech and grouped in speech region 212. Graph 200 shows that the classification of speech utilizing only an energy-based algorithm disregards the spectral characteristics of the speech signal.
As a result, a frame which exhibits spectral characteristics similar to actual speech activity may be falsely rejected as non-speech if its energy value is too low. At the same time, a frame which has spectral characteristics very different from actual speech activity may be mistakenly classified as speech simply because it has high energy. It is recalled that with a conventional endpoint detection system such as endpoint detection system 100 in FIG. 1, only frames classified by the endpointer as speech are subsequently exposed to the recognition system for further processing. Thus, when actual speech activity is mistakenly classified by the endpointer as silence or non-speech, or when non-speech activity is erroneously grouped with speech, speech recognition accuracy is significantly diminished.

[0010] Another disadvantage of the conventional energy-based endpoint detection algorithm, such as the one utilized by endpoint detection system 100, is that it has little or no immunity to background noise. In the presence of background noise, the conventional endpointer often fails to determine the accurate endpoints of a speech utterance by either (1) missing the leading or trailing low-energy sounds such as fricatives, (2) classifying clicks, pops and background noises as part of speech, or (3) falsely classifying background/silence noise as speech while missing the actual speech. Such errors lead to high false rejection rates, and reflect negatively on the overall performance of the ASR system.

[0011] Thus, there is an intense need in the art for a new and improved endpoint detection system that is capable of handling background noise. It is also desired to design the endpoint detection system such that computational requirements are kept to a minimum. It is further desired that the endpoint detection system be able to detect the beginning and end of speech in real time.
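
The conventional energy-only classification described in paragraphs [0007] to [0009] can be illustrated with a minimal Python sketch. The function names, the use of decibels, the frame length, and the numeric values for E_silence and K are all assumptions chosen for illustration; the application does not specify them.

```python
import math

def frame_energy_db(frame):
    """Mean power of one frame of samples, in dB (small floor avoids log(0))."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def classify_by_energy(frames, e_silence_db, k_db):
    """Label a frame 'speech' when its energy is >= E_silence + K, else 'non-speech'."""
    threshold = e_silence_db + k_db  # preset threshold, assumed values
    return ["speech" if frame_energy_db(f) >= threshold else "non-speech"
            for f in frames]

# Hypothetical frames: a near-silent one and a high-amplitude one.
quiet = [0.001] * 160
loud = [0.5, -0.5] * 80
labels = classify_by_energy([quiet, loud], e_silence_db=-60.0, k_db=20.0)
```

Because the decision looks only at energy, a loud click would be labelled the same way as the `loud` frame here, and a quiet fricative the same way as the `quiet` frame, which is the "all-or-nothing" limitation the text describes.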

SUMMARY OF THE INVENTION

[0012] In accordance with the purpose of the present invention as broadly described herein, there is provided endpoint detection of speech for improved speech recognition in noisy environments. In one aspect, the background energy of a first portion of a speech signal is determined. Next, one or more features of the first portion are extracted; the one or more features can be, for example, cepstral vectors. An average distance is thereafter calculated for the first portion based on the one or more features extracted. Subsequently, the energy of a second portion of the speech signal is measured, and one or more features of the second portion are extracted. Based on the one or more features of the second portion, a distance is then calculated for the second portion. Thereafter, the energy measured for the second portion is contrasted with the background energy of the first portion, and the distance calculated for the second portion is compared with the distance of the first portion. The second portion of the speech signal is then classified as either speech or non-speech based on the contrast and the comparison.

[0013] Moreover, a system for endpoint detection of speech for improved speech recognition in noisy environments can be assembled comprising a cepstral computing module configured to extract one or more features of a first portion of a speech signal and one or more features of a second portion of the speech signal. The system further comprises an energy computing module configured to measure the energy of the second portion. Also, the system comprises an endpointer module configured to determine the background energy of the first portion and to calculate an average distance of the first portion based on the one or more features of the first portion extracted by the cepstral computing module.
The endpointer module can be further configured to calculate a distance of the second portion based on the one or more features of the second portion. In order to classify the second portion as speech or non-speech, the endpointer module is configured to contrast the energy of the second portion with the background energy of the first portion and to compare the distance of the second portion with the average distance of the first portion.

[0014] These and other aspects of the present invention will become apparent with further reference to the drawings and specification, which follow. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the present invention, and be protected by the accompanying claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The features and advantages of the present invention will become more readily apparent to those ordinarily skilled in the art after reviewing the following detailed description and accompanying drawings, wherein:

[0016] FIG. 1 illustrates a block diagram of a conventional endpoint detection system utilizing an energy-based algorithm;

[0017] FIG. 2 shows a graph of an endpoint detection utilizing the system of FIG. 1;

[0018] FIG. 3 illustrates a block diagram of an endpoint detection system according to one embodiment of the present invention;

[0019] FIG. 4 shows a graph of an endpoint detection utilizing the system of FIG. 3;

[0020] FIG. 5 illustrates a flow diagram of a process for endpointing the beginning of speech according to one embodiment of the present invention; and
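
The two-stage approach summarized in paragraphs [0012] and [0013] can be sketched in Python as follows. This is only an illustration under stated assumptions: the text above does not define the distance measure or the decision rule, so the Euclidean distance to the mean background cepstral vector, the parameters `margin_db` and `dist_factor`, and the requirement that both conditions hold are hypothetical choices. Cepstral vectors are taken as given inputs rather than computed.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def model_background(energies_db, cepstra):
    """First portion: background energy, mean cepstral vector, average distance."""
    n = len(cepstra)
    bg_energy = sum(energies_db) / len(energies_db)
    bg_mean = [sum(c[i] for c in cepstra) / n for i in range(len(cepstra[0]))]
    # Assumed definition: average distance of background frames to their mean.
    avg_dist = sum(euclidean(c, bg_mean) for c in cepstra) / n
    return bg_energy, bg_mean, avg_dist

def classify(energy_db, cepstrum, background, margin_db=6.0, dist_factor=3.0):
    """Hypothetical rule: speech needs both an energy margin and a large distance."""
    bg_energy, bg_mean, avg_dist = background
    energy_contrast = energy_db - bg_energy      # contrast with background energy
    distance = euclidean(cepstrum, bg_mean)      # distance of the second portion
    if energy_contrast >= margin_db and distance >= dist_factor * avg_dist:
        return "speech"
    return "non-speech"

# Illustrative values only: three background frames, then two test frames.
background = model_background(
    energies_db=[-60.0, -61.0, -59.0],
    cepstra=[[1.0, 0.0], [1.1, 0.1], [0.9, -0.1]],
)
soft_fricative = classify(-50.0, [2.0, 1.0], background)
loud_hum = classify(-30.0, [1.05, 0.0], background)
```

With these made-up numbers, `soft_fricative` is labelled "speech" despite its modest energy, while `loud_hum` is labelled "non-speech" because its cepstral vector stays close to the background, which is the kind of behaviour an energy-only endpointer cannot produce.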