Introduction
People have long dreamed of communicating with machines by voice and having the machine understand what is said. The China Internet of Things School-Enterprise Alliance vividly compares speech recognition to a "machine hearing system." Speech recognition is a high technology that allows machines to convert speech signals into corresponding text or commands through a process of recognition and understanding. Speech recognition technology mainly covers three aspects: feature extraction, pattern-matching criteria, and model training. Speech recognition has also been widely applied in the Internet of Vehicles. For example, in the Yi Truck Internet service, a driver can press one button to talk to a customer service agent, set the destination, and navigate directly, which is both safe and convenient.
History of development
In 1952, Davis and his colleagues at Bell Laboratories built the world's first experimental system that could recognize the pronunciation of the ten English digits.
In 1960, Denes and others in the United Kingdom developed the first computer-based speech recognition system.
After entering the 1970s, large-scale speech recognition research made substantial progress in the recognition of small vocabularies and isolated words.
After entering the 1980s, the focus of research gradually turned to large-vocabulary, speaker-independent continuous speech recognition. Research thinking also changed significantly: the traditional approach based on standard template matching began to give way to an approach based on statistical models (HMMs). In addition, the idea of introducing neural network techniques into the speech recognition problem was proposed again.
After entering the 1990s, there was no major breakthrough in the overall framework of speech recognition systems, but great progress was made in the application and commercialization of speech recognition technology.
In the 1970s, DARPA (the Defense Advanced Research Projects Agency of the US Department of Defense) funded a ten-year project to support the research and development of speech understanding systems.
In the 1980s, DARPA funded another ten-year strategic plan, which included speech recognition under noise and spoken dialogue recognition systems; the recognition task was set as "1000-word continuous speech database management."
In the 1990s, this DARPA program was still ongoing. Its research focus had shifted to the natural language processing part of the recognizer, and the recognition task was set as "air travel information retrieval."
In its Fifth Generation Computer Project of 1981, Japan also put forward the ambitious goal of natural-language speech input and output. Although the project failed to achieve the expected goal, its research on speech recognition technology nevertheless made considerable progress.
Since 1987, Japan has pursued a new national project: an advanced man-machine spoken language interface and an automatic telephone translation system.
Development in China
China's speech recognition research started in 1958, when the Institute of Acoustics of the Chinese Academy of Sciences used vacuum tube circuits to recognize 10 vowels. It was not until 1973 that the Institute of Acoustics began computer-based speech recognition. Constrained by the conditions of the time, speech recognition research in China long remained in a slow stage of development.
After entering the 1980s, with the gradual popularization of computer application technology in China and the further development of digital signal processing, many domestic institutions acquired the basic conditions to study speech technology. At the same time, speech recognition again became an international research hotspot after years of silence and developed rapidly. In this context, many domestic institutions invested in this research work.
In March 1986, China's high-tech development plan (the 863 Program) was launched. As an important part of intelligent computer system research, speech recognition was specifically listed as a research topic. With the support of the 863 Program, China began organized research on speech recognition technology and decided to hold a special conference on speech recognition every two years. Since then, China's speech recognition technology has entered an unprecedented stage of development.
Pattern recognition
The speech recognition methods of this period basically adopted traditional pattern recognition strategies. The most representative work was done by Velichko and Zagoruyko in the Soviet Union, Sakoe and Chiba in Japan, and Itakura in the United States.
· The research in the Soviet Union laid the foundation for applying pattern recognition to the field of speech recognition;
· The research in Japan demonstrated how dynamic programming could be used for non-linear time alignment between a speech pattern and a standard template;
· Itakura's research showed how linear predictive coding (LPC) could be extended to feature extraction from speech signals.
Database
In the course of speech recognition research and development, researchers designed and produced various speech databases for Chinese (including different dialects), English, and other languages according to the pronunciation characteristics of each language. These speech databases provide sufficient, scientifically collected training samples for continuous speech recognition algorithm research, system design, and industrialization at research institutes and universities at home and abroad. Examples include the MIT Media Lab Speech Dataset, Pitch and Voicing Estimates for Aurora 2, Congressional speech data, Mandarin Speech Frame Data, and speech data used to test blind source separation algorithms.
Technology development
The IBM speech research group, a leader in large-vocabulary speech recognition, began its research in the 1970s. AT&T's Bell Laboratories also started a series of experiments on speaker-independent speech recognition; after ten years, this research resulted in a method for making standard templates for speaker-independent speech recognition.
The significant progress made during this period includes:
⑴ Hidden Markov Model (HMM) technology matured and was continuously improved, becoming the mainstream method of speech recognition.
⑵ Knowledge-based speech recognition received more and more attention. In continuous speech recognition, besides acoustic information, more linguistic knowledge, such as word formation, syntax, semantics, and dialogue context, was used to help recognize and understand speech. At the same time, language models based on statistical probability also emerged in the speech recognition field.
⑶ Research on applying artificial neural networks to speech recognition rose. Most of these studies used multi-layer perceptron networks trained with the back-propagation (BP) algorithm. Artificial neural networks can discriminate complex classification boundaries, which is clearly helpful for pattern classification. Telephone speech recognition in particular, with its broad application prospects, became a hotspot of speech recognition applications.
In addition, the technology of continuous speech dictation machines for personal use has become increasingly mature. The most representative systems are IBM's ViaVoice and Dragon's DragonDictate. These systems have speaker adaptation capabilities: new users do not need to train on the entire vocabulary, and the recognition rate improves continuously during use.
The development of speech recognition technology in China: ⑴ In Beijing, there are research institutions and universities such as the Institute of Acoustics and the Institute of Automation of the Chinese Academy of Sciences, Tsinghua University, and Northern Jiaotong University. In addition, Harbin Institute of Technology, the University of Science and Technology of China, and Sichuan University have also taken action.
⑵ Many domestic speech recognition systems have now been successfully developed, each with its own performance characteristics.
· In isolated-word, large-vocabulary speech recognition, the most representative is the THED-919 speaker-dependent real-time speech recognition and understanding system, developed in 1992 by the Department of Electronic Engineering of Tsinghua University together with China Electronic Devices Corporation.
· In continuous speech recognition, in December 1991 the Computer Center of Sichuan University implemented, on a microcomputer, a speaker-dependent, limited-topic continuous English-Chinese speech translation demonstration system.
· In speaker-independent speech recognition, there is the voice-activated telephone directory system developed by the Department of Computer Science and Technology of Tsinghua University in 1987 and put into practical use.
Classification and applications
According to the object being recognized, speech recognition tasks can be roughly divided into three categories: isolated word recognition, keyword spotting (also called keyword detection), and continuous speech recognition. The task of isolated word recognition is to recognize isolated words known in advance, such as "turn on" and "off"; the task of continuous speech recognition is to recognize arbitrary continuous speech, such as a sentence or a paragraph; keyword spotting targets continuous speech but does not recognize all of the text, only detecting where a number of known keywords appear, such as detecting the words "computer" and "world" in a passage.
According to the target speaker, speech recognition technology can be divided into speaker-dependent and speaker-independent recognition. The former can only recognize the voice of one or a few specific people, while the latter can be used by anyone. Obviously, a speaker-independent speech recognition system better matches practical needs, but it is much more difficult than recognizing a specific speaker.
In addition, according to the device and channel, speech recognition can be divided into desktop (PC) speech recognition, telephone speech recognition, and embedded-device (mobile phone, PDA, etc.) speech recognition. Different acquisition channels distort the acoustic characteristics of human pronunciation, so separate recognition systems need to be constructed for each.
Speech recognition has a very wide range of application fields. Common application systems include: voice input systems, which are more natural and efficient than keyboard input and better match people's daily habits; voice control systems, which use speech to control the operation of devices, are faster and more convenient than manual control, and can be used in industrial control, voice dialing, smart home appliances, voice-controlled smart toys, and many other fields; and intelligent dialogue query systems, which operate according to the customer's speech and provide users with natural, friendly database retrieval services, such as home services, hotel services, travel agency service systems, ticket booking systems, medical services, banking services, stock inquiry services, and so on.
Recognition method
The main speech recognition method is pattern matching.
In the training phase, the user says each word in the vocabulary in turn, and its feature vector sequence is stored as a template in the template library.
In the recognition phase, the feature vector sequence of the input speech is compared with each template in the library in turn, and the one with the highest similarity is output as the recognition result.
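As an illustration of this template-matching scheme, the sketch below scores an input against stored templates using dynamic time warping (the non-linear time alignment technique mentioned in the pattern recognition section), so that templates spoken at different speeds can still match. The scalar per-frame "features" and the word labels are hypothetical stand-ins for real feature vectors:

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two frame sequences (scalar features)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local frame distance
            # best of: insertion, deletion, match (the three warping moves)
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def recognize(features, templates):
    """Return the template label with the smallest DTW distance to the input."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```

For real speech, each frame would be a feature vector (e.g. cepstral coefficients) and `cost` a vector distance, but the dynamic-programming recurrence is the same.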
Main problems
There are five main problems in speech recognition:
⒈ Recognition and understanding of natural language. First, continuous speech must be decomposed into units such as words and phonemes; second, rules for understanding semantics must be established.
⒉ The amount of information in speech is large. Speech patterns differ not only between speakers but also for the same speaker: for example, a speaker's voice differs when speaking casually and when speaking seriously, and the way a person speaks changes over time.
⒊ The ambiguity of speech. When speaking, different words may sound similar. This is common in both English and Chinese.
⒋ The phonetic characteristics of a single letter, word, or character are affected by context, which changes the accent, pitch, volume, and speed of pronunciation.
⒌ Environmental noise and interference seriously affect speech recognition, resulting in a low recognition rate.
Front-end processing
Front-end processing refers to processing the original speech before feature extraction, partially eliminating the influence of noise and speaker differences so that the processed signal better reflects the essential characteristics of the speech. The most commonly used front-end processing steps are endpoint detection and speech enhancement. Endpoint detection distinguishes the speech and non-speech periods in a signal and accurately determines the starting point of the speech. After endpoint detection, subsequent processing can be applied to the speech signal only, which plays an important role in improving both model accuracy and recognition accuracy. The main task of speech enhancement is to eliminate the influence of environmental noise on the speech. The current general method is Wiener filtering, which performs better than other filters under heavy noise.
Acoustic features
The extraction and selection of acoustic features is an important part of speech recognition. Acoustic feature extraction is both a large-scale compression of information and a signal deconvolution process, whose purpose is to enable the pattern classifier to divide patterns better. Because the speech signal is time-varying, feature extraction must be performed on small segments of the signal, that is, by short-time analysis. Each segment analyzed is considered stationary and is called a frame; the offset between frames is usually 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies, and a window is applied to each frame to avoid edge effects of the short-time segment.
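The steps above (pre-emphasis, overlapping framing, windowing) can be sketched as follows. The 0.97 pre-emphasis coefficient and the 160-sample frame with an 80-sample hop (a 1/2-frame offset, as the text describes) are common illustrative choices, not values prescribed by the text:

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1]: boosts high frequencies."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=160, hop=80):
    """Cut the signal into overlapping frames and apply a Hamming window to each."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window
```

A 400-sample signal with these settings yields four full frames, each tapered by the window so that frame edges contribute little to the subsequent spectral analysis.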
Acoustic characteristics
LPC
Linear predictive analysis starts from the mechanism of human speech production: the transfer function of the vocal tract is considered to conform to the form of an all-pole digital filter, so the signal at time n can be estimated by a linear combination of the signals at several previous moments. The LPC coefficients are obtained by minimizing the mean square error (LMS) between the actual speech samples and the linearly predicted samples. Computation methods for LPC include the autocorrelation method (Durbin's method), the covariance method, the lattice method, and so on. Fast and effective computation has ensured the widespread use of this acoustic feature. Acoustic features similar to the LPC predictive parameter model include the line spectrum pair (LSP), reflection coefficients, and so on.
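A minimal sketch of the autocorrelation (Levinson-Durbin) method named above: compute the autocorrelation sequence, then recursively solve for the prediction-error filter coefficients. The convention here returns `a[0] = 1`, so the predictor is x̂[n] = −a[1]·x[n−1] − … − a[p]·x[n−p]:

```python
def lpc(signal, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin recursion).
    Returns a[0..order] with a[0] = 1 (prediction-error filter)."""
    x = [float(v) for v in signal]
    N = len(x)
    # Autocorrelation r[k] = sum_n x[n] * x[n+k]
    r = [sum(x[n] * x[n + k] for n in range(N - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err            # reflection coefficient
        a_prev = a[:]
        for j in range(1, i + 1):
            a[j] = a_prev[j] + k * a_prev[i - j]
        err *= (1.0 - k * k)
    return a
```

For a decaying exponential x[n] = 0.5ⁿ (which satisfies x[n] = 0.5·x[n−1]), a first-order LPC fit recovers a[1] ≈ −0.5, i.e. the predictor x̂[n] = 0.5·x[n−1].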
CEP
Using homomorphic processing, the discrete Fourier transform (DFT) of the speech signal is computed, the logarithm is taken, and the inverse transform (IDFT) then yields the cepstral coefficients. For the LPC cepstrum (LPCCEP), once the filter's linear prediction coefficients have been obtained, the cepstrum can be computed by a recursive formula. Experiments show that using the cepstrum improves the stability of the feature parameters.
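The DFT → log → IDFT chain just described can be sketched directly; this is the real cepstrum of one frame (the small epsilon guarding `log(0)` is an implementation detail, not part of the definition):

```python
import numpy as np

def real_cepstrum(frame, n_coeffs):
    """Real cepstrum of one frame: IDFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)                 # DFT
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # logarithm (epsilon avoids log(0))
    cep = np.fft.ifft(log_mag).real              # inverse DFT
    return cep[:n_coeffs]                        # keep the low-order coefficients
```

The low-order cepstral coefficients describe the slowly varying spectral envelope (vocal tract), which is why only the first few are kept as features.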
Mel
Unlike LPC and other acoustic features obtained by studying the human speech production mechanism, the Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are acoustic features derived from research on the human auditory system. Research on the mechanism of human hearing found that when two tones of similar frequency are played at the same time, a person hears only one tone. The critical bandwidth refers to the bandwidth boundary at which this subjective perception changes abruptly: when the frequency difference between two tones is less than the critical bandwidth, the two tones are heard as one, which is called the masking effect. The Mel scale is one way of measuring this critical bandwidth.
MFCC
First, an FFT converts the time-domain signal into the frequency domain; then its logarithmic energy spectrum is convolved with a bank of triangular filters distributed according to the Mel scale; finally, the vector formed by the filter outputs undergoes a discrete cosine transform (DCT), and the first N coefficients are taken. PLP still uses Durbin's method to compute LPC parameters, but when computing the autocorrelation parameters it also applies the DCT to the logarithmic energy spectrum of the auditory excitation.
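The FFT → Mel filterbank → log → DCT pipeline can be sketched as below. The sample rate, the 26 filters, and the 13 retained coefficients are common illustrative choices, not values fixed by the text; a production implementation would add details such as liftering and energy normalization:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centers spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):               # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_coeffs=13):
    """MFCC of one windowed frame: FFT -> Mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2         # power spectrum
    energies = mel_filterbank(n_filters, n_fft, sr) @ power
    log_e = np.log(energies + 1e-10)
    # DCT-II of the log filterbank energies; keep the first n_coeffs
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return dct @ log_e
```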
Acoustic model
The model of a speech recognition system usually consists of an acoustic model and a language model, which correspond respectively to computing the probability of speech given syllables and the probability of syllables given words. This section and the next introduce acoustic modeling and language modeling techniques respectively.
HMM acoustic modeling: A Markov model is a discrete time-domain finite-state automaton. A hidden Markov model (HMM) is a Markov model whose internal state is invisible to the outside world; only the output value at each moment can be observed. For speech recognition systems, the output values are usually the acoustic features computed from each frame. Two assumptions must be made to characterize speech signals with an HMM: first, that a state transition depends only on the previous state; and second, that the output value depends only on the current state (or the current state transition). These two assumptions greatly reduce the model's complexity. The corresponding algorithms for scoring, decoding, and training an HMM are the forward algorithm, the Viterbi algorithm, and the forward-backward algorithm.
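The forward algorithm named above for scoring can be sketched compactly for discrete observation symbols; the toy two-state HMM parameters below are illustrative numbers, not drawn from the text:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | HMM).
    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    B: (S, K) emission probabilities; obs: list of observation symbol indices."""
    alpha = pi * B[:, obs[0]]            # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate and absorb next observation
    return alpha.sum()                   # total probability over final states

# Toy 2-state HMM with 2 observation symbols (illustrative numbers)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
```

A quick sanity check: the probabilities of all possible observation sequences of a fixed length must sum to one.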
The HMMs used in speech recognition usually adopt a left-to-right topology with self-loops and skips to model the recognition primitives. A phoneme is a three- to five-state HMM; a word is an HMM formed by concatenating the HMMs of the phonemes that make it up; and the whole model for continuous speech recognition is an HMM combining words and silence.
Context-dependent modeling: Coarticulation refers to a sound changing under the influence of adjacent sounds before and after it. From the viewpoint of the articulation mechanism, the human vocal organs can only change gradually from one sound to the next, so the spectrum of the latter sound differs from its spectrum in other contexts. Context-dependent modeling takes this influence into account so that the model can describe speech more accurately. A model that considers only the influence of the preceding sound is called a bi-phone; one that considers both the preceding and the following sound is called a tri-phone.
English context-dependent modeling usually uses the phoneme as the primitive. Since some phonemes have similar effects on the phonemes that follow them, model parameters can be shared by clustering the phoneme decoding states; the result of clustering is called a senone. A decision tree is used to realize an efficient triphone-to-senone mapping: by answering a series of questions about the categories of the preceding and following sounds (vowel/consonant, unvoiced/voiced, etc.), the tree finally determines which senone each HMM state should use. The classification and regression tree (CART) model is used for the pronunciation labeling from words to phonemes.
Language model
Language models are mainly divided into two types: rule-based models and statistical models. Statistical language models use probability and statistics to reveal the inherent statistical regularities of language units; among them, the N-gram is simple and effective and is widely used.
N-Gram: The model is based on the assumption that the appearance of the nth word is related only to the preceding N-1 words and to no other words, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting the co-occurrences of N words in a corpus. The bigram (Bi-Gram) and trigram (Tri-Gram) are commonly used.
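The count-and-divide estimation described above can be sketched for the bigram case; the `<s>`/`</s>` sentence markers are a standard convention assumed here, and the tiny three-sentence "corpus" in the test is purely illustrative:

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w2 | w1) from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])                 # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))   # pair counts
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def sentence_prob(p, sentence):
    """P(sentence) as the product of its bigram probabilities."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(toks[:-1], toks[1:]):
        prob *= p(w1, w2)
    return prob
```

Unsmoothed counts like these assign zero probability to any unseen pair, which is exactly the problem that the smoothing techniques discussed below address.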
The performance of a language model is usually measured by cross-entropy and perplexity. Cross-entropy expresses the difficulty of recognizing text with the model or, from a compression viewpoint, how many bits on average are needed to encode each word. Perplexity expresses the average number of branches the model assigns to the text; its reciprocal can be regarded as the average probability of each word. Smoothing refers to assigning a probability value to unobserved N-gram combinations, ensuring that any word sequence can always be given a probability by the language model. Commonly used smoothing techniques include Good-Turing estimation, deleted interpolation, Katz smoothing, and Kneser-Ney smoothing.
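The relation between the two metrics is simply PP = 2^H, where H is the cross-entropy (average negative log2 probability per word). A minimal sketch, given the per-word probabilities the model assigns to a test text:

```python
import math

def cross_entropy(probs):
    """Average number of bits per word: H = -(1/N) * sum(log2 p_i)."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity PP = 2^H: the average branching factor of the text."""
    return 2.0 ** cross_entropy(probs)
```

If a model assigns every word a probability of 1/4, the perplexity is exactly 4: on average the model is as uncertain as choosing among four equally likely words.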
Search
Search in continuous speech recognition means finding a word model sequence that describes the input speech signal, thereby obtaining the word decoding sequence. The search is based on scoring with the acoustic model and the language model. In practice it is often necessary, based on experience, to give the language model a high weight and to set a long-word penalty score.
Viterbi: Based on dynamic programming, the Viterbi algorithm computes at each time point the posterior probability of each decoded state sequence given the observation sequence, keeps the path with the highest probability, and records the corresponding state information at each node so that the word decoding sequence can finally be obtained by tracing backward. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the non-linear time alignment between the HMM state sequence and the acoustic observation sequence, word boundary detection, and word recognition in continuous speech recognition, making it the basic search strategy for speech recognition.
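The keep-the-best-path-and-trace-back procedure just described can be sketched for discrete observations; the toy two-state HMM in the test (state 0 tends to emit symbol 0, state 1 symbol 1) is illustrative only:

```python
def viterbi(pi, A, B, obs):
    """Most likely state sequence for an observation sequence (discrete HMM).
    pi: initial probs, A: transition matrix, B: emission probs (lists of lists)."""
    S = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(S)]   # best path score ending in s
    backptr = []                                       # per-step best predecessors
    for o in obs[1:]:
        prev = delta
        ptrs, delta = [], []
        for s in range(S):
            best = max(range(S), key=lambda r: prev[r] * A[r][s])
            ptrs.append(best)
            delta.append(prev[best] * A[best][s] * B[s][o])
        backptr.append(ptrs)
    # Trace back from the best final state to recover the state sequence
    path = [max(range(S), key=lambda s: delta[s])]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

In a real recognizer the probabilities are kept as log values to avoid underflow over thousands of frames; plain products are used here only for readability.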
Since speech recognition cannot foresee the situation after the current time point, heuristic pruning based on an objective function is difficult to apply. Because of the time-aligned nature of the Viterbi algorithm, every path at the same time point corresponds to the same observation sequence, so the paths are comparable. Beam search keeps only the few highest-probability paths at each time point, which greatly improves search efficiency through pruning. The resulting Viterbi-Beam algorithm is currently the most effective algorithm in speech recognition search.

N-best search and multi-pass search: In order to use various knowledge sources in the search, multiple passes are usually performed. The first pass uses inexpensive knowledge sources to generate a candidate list or word lattice, on the basis of which a second pass using more expensive knowledge sources obtains the best path. The knowledge sources introduced earlier, including acoustic models, language models, and the pronunciation dictionary, can be used in the first pass. To achieve more advanced speech recognition or spoken language understanding, it is often necessary to use more expensive knowledge sources, such as 4th- or 5th-order N-grams, context-dependent models of 4th order or higher, inter-word correlation models, segment models, or grammatical analysis, for re-scoring. Many of the latest real-time large-vocabulary continuous speech recognition systems use this multi-pass search strategy.
N-best search produces a candidate list; the N best paths should be kept at each node, which increases the computation roughly N-fold. A simplified approach keeps only a few word candidates per node, at the risk of losing sub-optimal candidates. A compromise is to consider only paths two words long and keep k such paths. The word lattice gives multiple candidates in a more compact way; with corresponding changes, the N-best search algorithm can be turned into an algorithm for generating the lattice.
The forward-backward search algorithm is an example of multi-pass search. When a forward Viterbi search is performed with simple knowledge sources, the forward probabilities obtained during the search can be used in the objective function of the backward search, so a heuristic A* algorithm can be used for the backward search to find N candidates economically.
System implementation
A speech recognition system requires that its recognition primitives have accurate definitions, enough data for training, and generality. English systems usually use context-dependent phoneme modeling; coarticulation in Chinese is not as serious as in English, so syllable modeling can be used. The amount of training data a system requires is related to the complexity of its models: if a model is designed so complex that it exceeds what the available training data can support, performance drops sharply.
Dictation machine: A large-vocabulary, speaker-independent, continuous speech recognition system is usually called a dictation machine. Its architecture is an HMM topology based on the acoustic and language models described above. In training, the forward-backward algorithm is used to obtain the model parameters for each primitive. In recognition, the primitives are concatenated into words, a silence model is added between words, and the language model is introduced as the inter-word transition probability, forming a looped structure that is decoded with the Viterbi algorithm. Given how easily Chinese can be segmented, segmenting first and then decoding each segment is a simplification that improves efficiency.
Dialogue system: A system for human-machine spoken dialogue is called a dialogue system. Limited by current technology, dialogue systems are often oriented to a narrow domain with a limited vocabulary; topics include travel queries, ticket booking, database retrieval, and so on. The front end is a speech recognizer that produces an N-best candidate list or word lattice; a parser analyzes it to obtain semantic information; the dialogue manager then determines the response, which is output by a speech synthesizer. Since current systems often have limited vocabularies, keyword extraction can also be used to obtain the semantic information.
Adaptation and robustness
The performance of a speech recognition system is affected by many factors, including different speakers, speaking styles, environmental noise, transmission channels, and so on. Improving robustness means improving the system's ability to overcome these factors so that performance remains stable across different application environments and conditions; the purpose of adaptation is to adjust the system automatically and specifically during use so that performance gradually improves. The following introduces solutions to the different factors that affect system performance.
The solutions fall into two categories: methods that work on the speech features (feature methods) and methods that adjust the models (model methods). The former seek better, more robust feature parameters, or add specific processing on top of existing feature parameters. The latter use a small amount of adaptation data to modify or transform the original speaker-independent (SI) model into a speaker-adaptive (SA) model.
Feature methods for speaker adaptation include speaker normalization and the speaker subspace method; model methods include the Bayesian method, the transformation method, and model merging.
Noise in a speech system includes environmental noise and electronic noise added during recording. Feature methods for improving robustness include speech enhancement and finding features insensitive to noise; model methods include parallel model combination (PMC) and artificially adding noise during training. Channel distortion arises from the microphone distance during recording, microphones of different sensitivities, preamplifiers with different gains, different filter designs, and so on. Feature methods include subtracting the long-term average from the cepstral vectors and RASTA filtering; model methods include cepstral shift.
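The first feature method mentioned, subtracting the long-term average from the cepstral vectors (cepstral mean normalization), works because a stationary channel acts multiplicatively on the spectrum and therefore adds a constant offset in the cepstral domain; subtracting the utterance mean removes that offset. A minimal sketch:

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract the long-term average from each cepstral vector (CMN).
    A stationary channel adds a constant offset in the cepstral domain,
    so removing the per-utterance mean cancels the channel effect."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0)
```

After CMN, the same utterance recorded through two different stationary channels yields identical normalized features.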
Recognition engine
Microsoft has applied its own speech recognition engine in both Office and Vista. Use of the Microsoft speech recognition engine is completely free, so much speech recognition application software has been developed on top of it, such as Voice Game Master, Voice Control Expert, Ampang, and the Guard Voice Recognition System. Among them, the Guard Voice Recognition System is claimed to be the only one that can control single-chip microcomputer hardware.
In 2009, Microsoft released the Windows 7 operating system, and speech recognition software was promoted even more widely.
Performance indicators
Indicators
A speech recognition system has four main performance indicators. ① Vocabulary range: the range of words or phrases the machine can recognize; without restrictions, the vocabulary range can be considered unlimited. ② Speaker restriction: whether the system can recognize only the designated speaker's voice, or the voice of any speaker. ③ Training requirements: whether training is needed before use, that is, whether the machine must first "listen" to given speech, and how many repetitions of training are required. ④ Correct recognition rate: the average percentage of correct recognitions, which is related to the previous three indicators.
Summary
The above has introduced the technologies of the various components of a speech recognition system. These technologies have achieved good results in actual use, but how to overcome the various factors that affect speech still needs more in-depth analysis. At present, dictation systems cannot yet fully replace keyboard input in practice, but the maturing of recognition technology has also promoted research on higher-level speech understanding. Because English and Chinese have different characteristics, how to apply techniques proposed for English to Chinese is also an important research topic, and problems unique to Chinese, such as the four tones, still need to be solved.
Latest progress
In recent years, and especially since 2009, aided by the development of deep learning in the field of machine learning and the accumulation of big-data corpora, speech recognition technology has developed by leaps and bounds.
1. New developments in technology
1) Deep learning from the field of machine learning has been introduced into the training of speech recognition acoustic models; multi-layer neural networks with RBM pre-training greatly improve the accuracy of the acoustic model. Microsoft researchers took the lead in making a breakthrough here: after they adopted the deep neural network (DNN) model, the speech recognition error rate was reduced by 30%, the fastest progress in speech recognition technology in the preceding 20 years.
2) Most mainstream speech recognition decoders now adopt a decoding network based on weighted finite-state transducers (WFST), which can integrate the language model, the dictionary, and the acoustic model's shared phone set into one large decoding network. This greatly improves decoding speed and provides a basis for real-time applications of speech recognition.
3) With the rapid development of the Internet and the popularization of mobile terminals such as mobile phones, large amounts of text and speech corpora can be obtained through multiple channels. This provides abundant resources for training the language models and acoustic models in speech recognition, making it possible to build general large-scale language models and acoustic models. In speech recognition, the match and richness of the training data is one of the most important factors driving improvements in system performance, but the labeling and analysis of corpora requires long-term accumulation; with the advent of the big data era, the accumulation of large-scale corpus resources will be raised to a strategic level.
2. New technology applications
Recently, the hottest applications of speech recognition have been on mobile terminals. Voice dialogue robots, voice assistants, and interactive tools are emerging one after another, and many Internet companies have invested manpower, material, and financial resources in research and applications in this area, aiming to seize the customer base quickly through the novel and convenient mode of voice interaction.
Abroad, applications have long been led by Apple's Siri.
Domestically, systems such as iFLYTEK, Yunzhisheng, Shanda, Jietong Huasheng, Sogou Voice Assistant, Zidong Interpretation, and Baidu Voice have adopted the latest speech recognition technology, and other related systems on the market directly or indirectly embed similar technologies in their products.