
Speech Recognition



Introduction

To communicate with a machine through voice, so that the machine understands what you say, is something people have long dreamed of. The China Internet of Things School-Enterprise Alliance vividly compares speech recognition to a "machine hearing system." Speech recognition is a technology that allows machines to convert voice signals into corresponding text or commands through a process of recognition and understanding. Speech recognition technology mainly involves three aspects: feature extraction, pattern-matching criteria, and model training. Speech recognition has also been widely adopted in the Internet of Vehicles; for example, in Yi Truck Internet, a driver can press a single button to talk to a customer-service agent and have the destination set and navigation started directly, which is both safe and convenient.

History of development

In 1952, Davis and colleagues at Bell Laboratories built the world's first experimental system capable of recognizing the pronunciations of the ten English digits.

In 1960, Denes and others in the United Kingdom developed the first computer-based speech recognition system.

After entering the 1970s, large-scale speech recognition research made substantial progress in small-vocabulary, isolated-word recognition.

After entering the 1980s, the focus of research gradually turned to large-vocabulary, speaker-independent continuous speech recognition. Research thinking also changed significantly: the traditional approach based on standard template matching began to give way to approaches based on statistical models (HMM). In addition, the idea of applying neural network techniques to speech recognition was raised again.

After entering the 1990s, there were no major breakthroughs in the overall framework of speech recognition systems, but great progress was made in the application and commercialization of speech recognition technology.

In the 1970s, the U.S. Department of Defense's Advanced Research Projects Agency (DARPA) funded a ten-year program to support the research and development of language understanding systems.

In the 1980s, DARPA funded another ten-year strategic program, which included speech recognition in noise and conversational (spoken) speech recognition systems; the recognition task was set as "(1000-word) continuous speech database management."

In the 1990s, this DARPA program was still ongoing, but its research focus had shifted to the natural-language-processing component of the recognizer, with the recognition task set as "air travel information retrieval."

Japan also put forward the grand goal of natural-language speech input and output in its Fifth Generation Computer Project in 1981. Although the expected goal was not achieved, research on speech recognition technology made great progress.

Since 1987, Japan has pursued a new national project: an advanced man-machine spoken-language interface and an automatic telephone translation system.

Development in China

China's speech recognition research started in 1958, when the Institute of Acoustics of the Chinese Academy of Sciences used vacuum-tube circuits to recognize ten vowels. It was not until 1973 that the Institute of Acoustics began computer-based speech recognition. Constrained by the conditions of the time, speech recognition research in China long remained in a slow stage of development.

After entering the 1980s, with the gradual popularization of computer application technology in China and further advances in digital signal processing, many domestic institutions acquired the basic conditions for studying speech technology. At the same time, speech recognition had again become an international research hotspot after years of silence and was developing rapidly. In this situation, many domestic institutions invested in this research.

In March 1986, China's high-tech development plan (the 863 Program) was launched. Speech recognition was specifically listed as a research topic, as an important part of intelligent computer system research. With the support of the 863 Program, China began organized research on speech recognition technology and decided to hold a dedicated speech recognition conference every two years. Since then, China's speech recognition technology has entered an unprecedented stage of development.

Pattern recognition

The speech recognition methods of this period basically adopted traditional pattern recognition strategies. The most representative work at the time came from Velichko and Zagoruyko in the Soviet Union, Sakoe and Chiba in Japan, and Itakura in the United States.

· Research in the Soviet Union laid the foundation for applying pattern recognition to speech recognition;

· Research in Japan showed how dynamic programming could be used for non-linear time alignment between standard speech templates;

· Itakura's research showed how linear predictive analysis (LPC) could be extended to feature extraction from speech signals.

Database

In the course of speech recognition research and development, researchers have designed and produced various speech databases for Chinese (including different dialects), English, and other languages according to the pronunciation characteristics of each language. These databases provide sufficient and scientifically constructed training samples for continuous-speech-recognition algorithm research, system design, and industrialization at research institutes and universities at home and abroad. Examples include the MIT Media Lab Speech Dataset, Pitch and Voicing Estimates for Aurora 2, congressional speech data, Mandarin speech frame data, and speech data used to test blind source separation algorithms.

Technology development

IBM's speech research group, now a leader in large-vocabulary speech recognition, began its research on large-vocabulary speech recognition in the 1970s. AT&T's Bell Laboratories also started a series of experiments on speaker-independent speech recognition. After ten years, this research resulted in methods for building standard templates for speaker-independent recognition.

The significant progress made during this period includes:

⑴ The maturation and continuous improvement of hidden Markov model (HMM) technology, which became the mainstream method of speech recognition.

⑵ Growing attention to knowledge-based speech recognition. In continuous speech recognition, in addition to acoustic information, more linguistic knowledge, such as word formation, syntax, semantics, and dialogue context, is used to help further recognize and understand the speech. At the same time, language models based on statistical probability also emerged in the speech recognition field.

⑶ The rise of research on applying artificial neural networks to speech recognition. Most of these studies used multi-layer perceptron networks trained with the back-propagation (BP) algorithm. Artificial neural networks can learn complex classification boundaries, which is clearly helpful for pattern discrimination. Telephone speech recognition in particular, because of its broad application prospects, became a hotspot of speech recognition applications.

In addition, continuous-speech dictation technology for personal use has become more and more mature. The most representative systems in this regard are IBM's ViaVoice and Dragon's Dragon Dictate. These systems have speaker adaptation capabilities: new users do not need to train on the entire vocabulary, and the recognition rate improves continuously during use.

The development of speech recognition technology in China: ⑴ In Beijing, research institutions and universities such as the Institute of Acoustics and the Institute of Automation of the Chinese Academy of Sciences, Tsinghua University, and Northern Jiaotong University are active in this field. In addition, Harbin Institute of Technology, the University of Science and Technology of China, and Sichuan University have also taken action.

⑵ Many domestic speech recognition systems have now been successfully developed, each with its own characteristics.

· In isolated-word, large-vocabulary speech recognition, the most representative system is the THED-919 speaker-dependent, real-time speech recognition and understanding system jointly developed in 1992 by the Department of Electronic Engineering of Tsinghua University and China Electronic Devices Corporation.

· In continuous speech recognition, in December 1991 the Computer Center of Sichuan University implemented, on a microcomputer, a speaker-dependent, limited-topic continuous English-Chinese speech translation demonstration system.

· In speaker-independent speech recognition, the Department of Computer Science and Technology of Tsinghua University developed a voice-activated telephone directory system in 1987 and put it into practical use.

Classification and applications

According to the object being recognized, speech recognition tasks can be roughly divided into three categories: isolated word recognition, keyword recognition (or keyword spotting), and continuous speech recognition. The task of isolated word recognition is to recognize isolated words known in advance, such as "turn on" and "off"; the task of continuous speech recognition is to recognize arbitrary continuous speech, such as a sentence or a paragraph; keyword spotting operates on continuous speech but does not recognize all of the text, only detecting where a number of known keywords appear, such as detecting the words "computer" and "world" in a passage.

According to the target speaker, speech recognition technology can be divided into speaker-dependent and speaker-independent recognition. The former can only recognize the voices of one or a few specific people, while the latter can be used by anyone. Obviously, a speaker-independent system better matches practical needs, but it is much more difficult to build than a speaker-dependent one.

In addition, according to the device and channel through which the speech is captured, recognition can be divided into desktop (PC) speech recognition, telephone speech recognition, and speech recognition on embedded devices (mobile phones, PDAs, etc.). Different acquisition channels distort the acoustic characteristics of human speech, so a dedicated recognition system needs to be built for each.


The application field of speech recognition is very wide. Common application systems include: voice input systems, which are more in line with people's everyday habits and more natural and efficient than keyboard input; voice control systems, which use voice to control the operation of devices, are faster and more convenient than manual control, and can be used in many fields such as industrial control, voice dialing, smart home appliances, and voice-controlled smart toys; and intelligent dialogue and query systems, which act on the customer's voice to provide natural, friendly database retrieval services, such as home services, hotel services, travel agency service systems, ticket booking systems, medical services, banking services, stock inquiry services, and so on.

Recognition method

The main speech recognition method is pattern matching.

In the training phase, the user says each word in the vocabulary in turn, and its feature vector sequence is stored as a template in a template library.

In the recognition phase, the feature vector sequence of the input speech is compared with each template in the library in turn, and the one with the highest similarity is output as the recognition result.
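A minimal sketch of this template-matching idea is shown below, assuming each utterance has already been converted to a sequence of feature vectors (e.g. MFCC frames). The function names and the use of dynamic time warping as the similarity measure are illustrative assumptions, not something the article prescribes.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[n, m]

def recognize(features, templates):
    """Return the vocabulary word whose stored template is closest to the input."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))

# Usage (random data just to show the interface):
# templates = {"on": np.random.randn(40, 13), "off": np.random.randn(35, 13)}
# print(recognize(np.random.randn(38, 13), templates))
```

Dynamic time warping is used here because two utterances of the same word rarely have the same length, so frames must be aligned non-linearly before comparison.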

Main problems

There are five main problems in speech recognition:

⒈ Recognition and understanding of natural language. First, continuous speech must be decomposed into units such as words and phonemes; second, rules for understanding semantics must be established.

⒉ The volume of speech information is large. Speech patterns differ not only between speakers but also for the same speaker; for example, a speaker's speech differs when speaking casually and when speaking carefully, and the way a person speaks changes over time.

⒊ The ambiguity of speech. When a speaker talks, different words may sound similar. This is common in both English and Chinese.

⒋ The phonetic characteristics of a single letter, word, or character are affected by context, which changes the accent, pitch, volume, and speaking rate.

⒌ Environmental noise and interference have a serious impact on speech recognition, resulting in a low recognition rate.

Front-end processing

Front-end processing refers to processing the original speech before feature extraction, partially eliminating the influence of noise and speaker differences so that the processed signal better reflects the essential characteristics of the speech. The most commonly used front-end processing steps are endpoint detection and speech enhancement. Endpoint detection distinguishes speech from non-speech segments in the signal and accurately determines the starting point of the speech. After endpoint detection, subsequent processing can be applied only to the speech segments, which plays an important role in improving model accuracy and recognition accuracy. The main task of speech enhancement is to eliminate the influence of environmental noise. The usual method is Wiener filtering, which performs better than other filters under high noise.
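The sketch below illustrates one common way to do endpoint detection, using short-time energy with a simple adaptive threshold. The frame length and threshold ratio are illustrative assumptions; the article does not specify a particular endpoint detection algorithm.

```python
import numpy as np

def detect_endpoints(signal, frame_len=400, threshold_ratio=0.1):
    """Return (start, end) sample indices of the speech region in a 1-D signal."""
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return 0, len(signal)
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)             # short-time energy per frame
    threshold = threshold_ratio * energy.max()     # simple adaptive threshold
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return 0, len(signal)                      # no speech found: keep everything
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
```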

Acoustic features

The extraction and selection of acoustic features is an important part of speech recognition. Acoustic feature extraction is both a process of substantial information compression and a process of signal deconvolution, whose purpose is to let the pattern classifier discriminate more easily. Because of the time-varying nature of the speech signal, feature extraction must be performed on short segments of the signal, that is, by short-time analysis. Each segment within which the signal is considered stationary is called a frame, and the shift between frames is usually 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies, and a window is applied to each frame to avoid edge effects of the short segment.
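A minimal sketch of this short-time analysis, assuming a frame shift of half the frame length and a Hamming window; the numeric parameter values are typical illustrations rather than values given in the article.

```python
import numpy as np

def frame_signal(signal, frame_len=400, frame_shift=200, pre_emphasis=0.97):
    """Pre-emphasize, split into overlapping frames, and window a 1-D signal."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    if n_frames < 1:
        raise ValueError("signal shorter than one frame")
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Hamming window reduces the edge effects of each short segment
    return frames * np.hamming(frame_len)
```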

Acoustic characteristics

LPC

Linear predictive analysis starts from the mechanism of human speech production: the transfer function of the vocal tract is taken to conform to the form of an all-pole digital filter, so the signal at time n can be estimated from a linear combination of the signals at several previous moments. The linear prediction coefficients (LPC) are obtained by minimizing the mean square error (LMS) between the actual speech samples and the linearly predicted samples. Methods for computing the LPC include the autocorrelation method (Durbin's method), the covariance method, the lattice method, and so on. Fast and effective computation has ensured the widespread use of this acoustic feature. Acoustic features similar to the LPC prediction-parameter model include the line spectrum pair (LSP), reflection coefficients, and so on.
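A minimal sketch of LPC estimation by the autocorrelation (Levinson-Durbin) method mentioned above; the prediction order and the assumption that the input is a single windowed frame are illustrative.

```python
import numpy as np

def lpc(frame, order=12):
    """Return the prediction-error filter coefficients a[1..order] for one frame."""
    # Autocorrelation of the (windowed) frame, lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1: len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):                       # Levinson-Durbin recursion
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]  # update coefficients
        err *= (1.0 - k * k)                            # remaining prediction error
    return a[1:]
```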

CEP

Using homomorphic processing, the discrete Fourier transform (DFT) of the speech signal is computed and its logarithm taken; the inverse transform (IDFT) then yields the cepstral coefficients. For the LPC cepstrum (LPCCEP), once the linear prediction coefficients of the filter have been obtained, the cepstrum can be computed by a recursive formula. Experiments show that using the cepstrum improves the stability of the feature parameters.
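A minimal sketch of the DFT, logarithm, inverse-DFT chain described above; keeping 13 coefficients is an illustrative choice.

```python
import numpy as np

def cepstrum(frame, n_coeffs=13):
    """Real cepstrum of one frame: DFT -> log magnitude -> inverse DFT."""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)   # small offset avoids log(0)
    ceps = np.fft.ifft(log_magnitude).real             # inverse transform -> cepstrum
    return ceps[:n_coeffs]
```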

Mel

Unlike LPC and other acoustic features derived from the study of the human speech production mechanism, the Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) features are derived from research on the human auditory system. Studies of human hearing found that when two tones of similar frequency are produced at the same time, a person can hear only one tone. The critical bandwidth is the bandwidth boundary at which this subjective perception changes abruptly: when the frequency difference between two tones is less than the critical bandwidth, they are heard as a single tone, an effect known as masking. The Mel scale is one way of measuring this critical bandwidth.
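A commonly used formula for the Mel scale is shown below; the article does not give the exact form, so this particular parameterization (2595 · log10(1 + f/700)) is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    """Map frequency in Hz to the Mel scale (common parameterization)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```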

MFCC

First, an FFT converts the time-domain signal into the frequency domain; its logarithmic energy spectrum is then passed through a bank of triangular filters distributed according to the Mel scale; finally, the vector formed by the outputs of the filters is transformed with the discrete cosine transform (DCT), and the first N coefficients are taken. PLP still uses Durbin's method to compute LPC parameters, but when computing the autocorrelation parameters it also applies the DCT to the logarithmic energy spectrum of the auditory excitation.
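A minimal MFCC sketch following the steps above: power spectrum via FFT, Mel-spaced triangular filter bank, log of the filter outputs, then DCT keeping the first N coefficients. The filter-bank construction, the parameter values, and the use of scipy for the DCT are assumptions for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0, mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                      # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                     # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frame, sample_rate=16000, n_fft=512, n_filters=26, n_coeffs=13):
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2             # power spectrum
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)                    # log Mel energies
    return dct(log_energies, type=2, norm="ortho")[:n_coeffs]  # DCT, keep first N
```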

Acoustic model

The model of a speech recognition system usually consists of an acoustic model and a language model, corresponding respectively to the calculation of speech-to-syllable probabilities and of syllable-to-word probabilities. This section and the next introduce acoustic model and language model techniques, respectively.

HMM acoustic modeling: A Markov model is a discrete-time finite-state automaton. A hidden Markov model (HMM) is a Markov model whose internal states are invisible to the outside world; only the output value at each moment can be observed. For speech recognition systems, the output values are usually the acoustic features computed from each frame. Two assumptions must be made to characterize speech signals with an HMM: first, a state transition depends only on the previous state; second, the output value depends only on the current state (or the current state transition). These two assumptions greatly reduce the complexity of the model. The algorithms for scoring, decoding, and training an HMM are the forward algorithm, the Viterbi algorithm, and the forward-backward algorithm, respectively.
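A minimal sketch of the forward algorithm used for HMM scoring, as mentioned above. The inputs are illustrative: `init` is the initial state distribution, `trans` the state-transition matrix, and `emit_probs[t, s]` the likelihood of frame t under state s.

```python
import numpy as np

def forward_score(init, trans, emit_probs):
    """Return P(observations | HMM) by the forward recursion."""
    T, S = emit_probs.shape
    alpha = init * emit_probs[0]                 # initialization
    for t in range(1, T):
        alpha = (alpha @ trans) * emit_probs[t]  # induction step
    return alpha.sum()                           # termination
```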

The HMMs used in speech recognition usually adopt a left-to-right topology with self-loops and skips to model the recognition units. A phoneme is a three- to five-state HMM; a word is an HMM formed by concatenating the HMMs of the phonemes that make up the word; and the entire model for continuous speech recognition is an HMM formed by combining words and silence.

Context-dependent modeling: Coarticulation refers to a sound changing under the influence of adjacent sounds. From the standpoint of the articulation mechanism, the human vocal organs can only change gradually from one sound to the next, so the spectrum of a given sound differs from its spectrum in other contexts. Context-dependent modeling takes this influence into account, allowing the model to describe speech more accurately. A model that considers only the influence of the preceding sound is called a bi-phone; one that considers both the preceding and the following sound is called a tri-phone.

English context-dependent modeling usually uses phonemes as the units. Since some phonemes have similar effects on the phonemes that follow them, model parameters can be shared by clustering HMM states; the result of clustering is called a senone. A decision tree is used to build an efficient mapping from triphones to senones: by answering a series of questions about the category of the preceding and following sounds (vowel/consonant, unvoiced/voiced, etc.), the senone to be used for each HMM state is finally determined. A classification and regression tree (CART) model is used for the pronunciation mapping from words to phonemes.

Language model

Language models fall mainly into two types: rule-based models and statistical models. Statistical language models use probability and statistics to reveal the inherent statistical regularities of linguistic units; among them, the N-Gram model is simple and effective and is widely used.

N-Gram: The model is based on the assumption that the occurrence of the nth word depends only on the preceding N-1 words and is unrelated to any other words; the probability of a whole sentence is the product of the probabilities of its words. These probabilities can be obtained by directly counting how often sequences of N words occur together in a corpus. The bigram (Bi-Gram) and trigram (Tri-Gram) are the most commonly used.
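A minimal sketch of a Bi-Gram model estimated by counting, as the paragraph describes. No smoothing is applied here (smoothing is discussed in the next paragraph), and the sentence markers `<s>`/`</s>` are an illustrative convention.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigram histories and bigram pairs from a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    """P(sentence) as the product of bigram probabilities P(w2 | w1)."""
    words = ["<s>"] + sent.split() + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(words[:-1], words[1:]):
        if unigrams[w1] == 0:
            return 0.0                         # unseen history without smoothing
        prob *= bigrams[(w1, w2)] / unigrams[w1]
    return prob

# Usage: u, b = train_bigram(["open the door", "close the door"])
#        print(sentence_prob("open the door", u, b))
```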

The performance of a language model is usually measured by cross-entropy and perplexity. Cross-entropy reflects the difficulty of recognizing text with the model, or, from the viewpoint of compression, how many bits are needed on average to encode each word. Perplexity reflects the average number of branches the model assigns to the text, and its reciprocal can be regarded as the average probability of each word. Smoothing assigns a probability to N-gram combinations that were never observed, ensuring that a word sequence can always obtain a probability from the language model. Commonly used smoothing techniques include Turing estimation, deleted interpolation smoothing, Katz smoothing, and Kneser-Ney smoothing.
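A small illustration of the relation between cross-entropy and perplexity described above, assuming a list of per-word probabilities assigned by some language model to a test text.

```python
import math

def cross_entropy_and_perplexity(word_probs):
    """word_probs: list of P(w_i | history) assigned by the model to a test text."""
    # Cross-entropy: average number of bits needed to encode each word
    h = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    # Perplexity 2**h: the average branching factor; its reciprocal is the
    # geometric mean of the per-word probabilities
    return h, 2.0 ** h

# Usage: print(cross_entropy_and_perplexity([0.25, 0.5, 0.125]))
```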

Search

The search in continuous speech recognition finds a sequence of word models that describes the input speech signal, thereby obtaining the decoded word sequence. The search is based on scoring with the acoustic model and the language model. In practice, it is often necessary to give the language model a high weight, set empirically, and to apply a long-word penalty.

Viterbi: Based on dynamic programming, the Viterbi algorithm computes, at each time point, the posterior probability of each decoded state sequence given the observation sequence, retains the path with the highest probability, and records the corresponding state information at each node so that the decoded word sequence can finally be recovered by backtracking. The Viterbi algorithm solves, without losing the optimal solution, the non-linear time alignment of the HMM state sequence with the acoustic observation sequence, word boundary detection, and word recognition in continuous speech recognition, making it the basic search strategy of speech recognition.
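A minimal sketch of Viterbi decoding over an HMM, as described above: keep the best-scoring path into each state at every frame and backtrack at the end. The inputs mirror the forward-algorithm sketch earlier and are illustrative.

```python
import numpy as np

def viterbi(init, trans, emit_probs):
    """Return the most likely state sequence for the observation sequence."""
    T, S = emit_probs.shape
    log_trans = np.log(trans + 1e-300)
    delta = np.log(init + 1e-300) + np.log(emit_probs[0] + 1e-300)
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # score of every (prev, cur) pair
        backptr[t] = scores.argmax(axis=0)         # remember the best predecessor
        delta = scores.max(axis=0) + np.log(emit_probs[t] + 1e-300)
    path = [int(delta.argmax())]                   # best final state
    for t in range(T - 1, 0, -1):                  # backtrack to recover the path
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```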

Since speech recognition cannot foresee the signal after the current point in time, heuristic pruning based on the objective function is difficult to apply. Because of the time-aligned nature of the Viterbi algorithm, all paths at the same time correspond to the same observation sequence and are therefore comparable. Beam search retains only the few highest-scoring paths at each time step, which greatly improves search efficiency through pruning (a sketch follows below); the Viterbi-Beam algorithm is currently the most effective search algorithm in speech recognition.

N-best search and multi-pass search: To use various knowledge sources in the search, multiple search passes are usually performed. The first pass uses low-cost knowledge sources to generate a candidate list or word lattice, on the basis of which a second pass uses more costly knowledge sources to obtain the best path. The knowledge sources introduced earlier, including the acoustic model, the language model, and the pronunciation dictionary, can be used in the first pass. To achieve more advanced speech recognition or spoken language understanding, more costly knowledge sources are often needed, such as 4th- or 5th-order N-Grams, context-dependent models of 4th order or higher, inter-word correlation models, segment models, or grammatical analysis with re-scoring. Many of the latest real-time large-vocabulary continuous speech recognition systems use this multi-pass search strategy.
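A minimal sketch of the beam pruning idea: at each time step, keep only the hypotheses whose score is within a fixed beam of the best one, optionally also capping their number. The hypothesis representation and parameter values are illustrative assumptions.

```python
def prune_beam(hypotheses, beam_width=10.0, max_active=100):
    """hypotheses: list of (log_score, state) pairs alive at the current frame."""
    best = max(score for score, _ in hypotheses)
    # Keep only hypotheses within `beam_width` of the best log score
    survivors = [(s, st) for s, st in hypotheses if s >= best - beam_width]
    survivors.sort(key=lambda h: h[0], reverse=True)
    return survivors[:max_active]                  # also cap the number of paths
```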

N-best search produces a candidate list; the N best paths must be kept at each node, which increases the computational complexity N-fold. A simplified approach keeps only a few word candidates at each node, but sub-optimal candidates may be lost. A compromise is to consider only paths two words long and keep k of them. A word lattice gives multiple candidates in a more compact form; with appropriate changes, the N-best search algorithm can be turned into an algorithm for generating the lattice.

The forward-backward search algorithm is an example of multi-pass search. After a forward Viterbi search with simple knowledge sources, the forward probabilities obtained during that search can be used in the objective function of a backward search; the heuristic A* algorithm can therefore be used for the backward search, which economically finds N candidates.

System implementation

The recognition units chosen for a speech recognition system must be accurately definable, trainable with sufficient data, and general. English usually uses context-dependent phoneme modeling; coarticulation in Chinese is not as severe as in English, so syllable modeling can be used. The amount of training data required by a system is related to the complexity of the model: if the model is designed to be more complex than the available training data can support, performance drops sharply.

Dictation machine: A large-vocabulary, speaker-independent, continuous speech recognition system is usually called a dictation machine. Its architecture is an HMM topology based on the acoustic model and language model described above. During training, the forward-backward algorithm is used to obtain the model parameters of each unit. During recognition, the units are concatenated into words, a silence model is inserted between words, and a language model is introduced as the inter-word transition probability, forming a looped structure that is decoded with the Viterbi algorithm. Given that Chinese is easy to segment, segmenting first and then decoding each segment is a simplification that improves efficiency.

Dialogue system: A system for human-machine spoken dialogue is called a dialogue system. Limited by current technology, dialogue systems are often restricted to a narrow domain and a limited vocabulary; typical subjects include travel queries, ticket booking, database retrieval, and so on. The front end is a speech recognizer, which produces an N-best candidate list or a word lattice; a parser analyzes it to obtain semantic information; the dialogue manager then determines the response, which is output by a speech synthesizer. Since current systems often have limited vocabularies, keyword extraction can also be used to obtain the semantic information.

Adaptation and robustness

The performance of a speech recognition system is affected by many factors, including different speakers, speaking styles, environmental noise, transmission channels, and so on. Improving robustness means improving the system's ability to overcome these factors, so that performance remains stable across different application environments and conditions; the purpose of adaptation is to adjust the system automatically and in a targeted way, gradually improving performance during use. The following introduces solutions for the different factors that affect system performance.

The solutions fall into two categories: methods that work on the speech features (hereafter, feature methods) and methods that adjust the model (hereafter, model methods). The former seek better, more robust feature parameters, or add specific processing on top of existing features. The latter use a small amount of adaptation data to modify or transform the original speaker-independent (SI) model into a speaker-adapted (SA) model.

Feature methods for speaker adaptation include speaker normalization and speaker-subspace methods; model methods include the Bayesian method, transformation methods, and model merging.

The noise in a speech system includes environmental noise and the electronic noise added during recording. Feature methods for improving robustness include speech enhancement and finding features insensitive to noise interference; model methods include parallel model combination (PMC) and artificially adding noise during training. Channel distortion arises from, for example, the distance to the microphone during recording, microphones with different sensitivities, preamplifiers with different gains, and different filter designs. Feature methods include subtracting the long-term average from the cepstral vectors and RASTA filtering; model methods include cepstral shift.
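A minimal sketch of the channel-compensation idea mentioned above: subtracting the long-term average of the cepstral vectors (cepstral mean normalization). The function name is illustrative.

```python
import numpy as np

def cepstral_mean_normalize(features):
    """features: array of shape (frames, n_coeffs); returns mean-subtracted features."""
    # Removing the per-utterance mean cancels a constant (convolutional) channel
    # effect that appears as an additive offset in the cepstral domain.
    return features - features.mean(axis=0, keepdims=True)
```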

Recognition engine

Microsoft has applied its own speech recognition engine in both Office and Vista. The Microsoft speech recognition engine is completely free to use, so much speech recognition application software has been developed on top of it, such as "Voice Game Master", "Voice Control Expert", "Ampang", and "Guard Voice Recognition System". Among them, "Guard Voice Recognition System" is the only one of these that can control microcontroller hardware.

In 2009, Microsoft released the Windows 7 operating system, and speech recognition software received better promotion.

Performance indicators

Indicators

There are four main performance indicators for a speech recognition system. ① Vocabulary range: the range of words or phrases the machine can recognize; if there are no restrictions, the vocabulary range can be considered unlimited. ② Speaker restriction: whether the system can recognize only the voice of designated speakers or the voice of any speaker. ③ Training requirements: whether training is needed before use, that is, whether the machine must first "listen" to given speech, and how many times. ④ Correct recognition rate: the average percentage of correct recognitions, which is related to the previous three indicators.
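One common way to compute a word-level correct-recognition rate is via the edit distance between the reference transcript and the recognized word sequence; this particular metric is an illustrative assumption, not a formula the article prescribes.

```python
def recognition_accuracy(reference, hypothesis):
    """Word accuracy (%) = 1 - edit_distance / reference_length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance (substitutions, insertions, deletions)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return max(0.0, 1.0 - d[len(ref)][len(hyp)] / len(ref)) * 100.0

# Usage: print(recognition_accuracy("turn on the light", "turn on light"))
```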

Summary

The preceding sections have introduced the technologies involved in the various parts of a speech recognition system. These technologies have achieved good results in practice, but how to overcome the various factors that affect speech still requires deeper analysis. At present, dictation systems cannot yet fully replace keyboard input in practice, but the maturation of recognition technology has also driven research into higher-level speech understanding. Because English and Chinese have different characteristics, how to apply techniques proposed for English to Chinese is also an important research topic, and problems unique to Chinese, such as the four tones, remain to be solved.

Latest progress

In recent years, and especially since 2009, with the development of deep learning in the field of machine learning and the accumulation of big-data corpora, speech recognition technology has developed by leaps and bounds.

1. New developments in technology

1) Deep learning from the field of machine learning has been introduced into the training of acoustic models for speech recognition, and multi-layer neural networks with RBM pre-training have greatly improved the accuracy of the acoustic model. Microsoft researchers were the first to achieve a breakthrough in this area: after adopting deep neural network (DNN) models, they reduced the speech recognition error rate by 30%, the fastest progress in speech recognition technology in the previous 20 years.

2) At present, most mainstream speech recognition decoders adopt a decoding network based on weighted finite-state transducers (WFST), which integrates the language model, the dictionary, and the acoustic model's shared phone set into one large decoding network. This greatly improves decoding speed and provides the basis for real-time applications of speech recognition.

3) With the rapid development of the Internet and the popularization of mobile terminals such as mobile phones, large amounts of text and speech data can be obtained through multiple channels. This provides abundant resources for training the language models and acoustic models used in speech recognition, making it possible to build general-purpose large-scale language models and acoustic models. In speech recognition, how well the training data matches the task and how rich it is are among the most important factors driving improvements in system performance, but labeling and analyzing corpora requires long-term accumulation; with the advent of the big-data era, the accumulation of large-scale corpus resources will be raised to a strategic level.

2. New technology applications

Recently, the hottest applications of speech recognition have been on mobile terminals. Voice dialogue robots, voice assistants, and interactive tools are emerging one after another, and many Internet companies have invested manpower, material, and financial resources in research and applications in this area, aiming to capture a customer base quickly through the novel and convenient mode of voice interaction.

Abroad, applications have long been led by Apple's Siri.

Domestically, systems such as iFLYTEK, Yunzhisheng, Shanda, Jietong Huasheng, Sogou Voice Assistant, Zidong Interpretation, and Baidu Voice have adopted the latest speech recognition technology, and other related systems on the market embed similar technologies directly or indirectly in their products.
