Page 30 - Use cases and requirements for the vehicular multimedia networks - Focus Group on Vehicular Multimedia (FG-VM)


It becomes apparent that determining the location (zone) of each emitting source (talker) in the
cabin, and acoustically treating the signal (voice command) transmitted from each location (zone),
will help the voice recognition system process voice commands correctly.

8.1.2 Use-cases

8.1.2.1 Use case A – Initiating a voice recognition session
A person in a vehicle containing multiple occupants wishes to initiate a voice recognition session by
uttering a keyword, such as "Hey Siri", "Alexa" or "Okay Google". Each occupant is in a separate
zone of the cabin. The cabin may contain one or more microphones, which may or may not be
dedicated to individual zones. Each microphone picks up the voice of the occupant uttering the
keyword, but also the voices of other occupants ("interfering speech"). One or more microphone
signals (or audio channels) may be available to a keyword spotter (KWS), which must decide not only
whether and when the keyword was spoken, but also from which zone it was spoken.
The following problem scenarios may result in inadequate behavior of the KWS:
• A-1 If there is no dedicated microphone for each zone, or no means to identify the zone of
the target talker, the command may not be detected, may be rejected or wrongly executed.
• A-2 Otherwise:

– A-2-A Interfering speech may cause a KWS to fail to detect (false reject) the keyword
spoken by the target talker in the target zone microphone.
– A-2-B Concurrent sources (e.g., music, video) played into the vehicle, resulting in echo
on the microphones, may cause a KWS to fail to detect (false reject) the keyword spoken
by the target talker in the target zone microphone.
– A-2-C Interference of the target talker onto microphones outside of the target zone may
cause the KWS to detect the keyword but from the wrong zone.
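
Scenario A-2-C can be illustrated with a minimal sketch. It assumes, purely for illustration, that a KWS produces one keyword confidence score per zone microphone; the function name `pick_keyword_zone` and the `threshold` parameter are hypothetical and do not come from this report. Because the target talker also leaks into the other microphones, several zones may score above the threshold, so the sketch attributes the keyword to the highest-scoring zone rather than to the first zone that triggers.

```python
from typing import Optional, Sequence


def pick_keyword_zone(scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the zone most likely to hold the keyword talker.

    `scores` holds one hypothetical KWS confidence per zone microphone.
    Returns None when no zone reaches `threshold` (no keyword detected).
    """
    if not scores:
        return None
    # The talker leaks into every microphone, so compare all zones
    # instead of accepting the first score that crosses the threshold.
    best_zone = max(range(len(scores)), key=lambda zone: scores[zone])
    if scores[best_zone] < threshold:
        return None
    return best_zone
```

Picking the maximum is only one possible decision rule. It still fails when interference or echo depresses the score on the target zone microphone, which is the situation described in A-2-A and A-2-B.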

Figure 8 – Acoustic processing (AEC and ZIC) on each zone-dedicated microphone

Figure 8 illustrates use case A-2, involving KWS with N microphones/zones in a vehicle, and depicts
the waveforms. Each microphone signal contains target speech, interfering speech and echo (shown
in black). The talker in zone 1 is shown in yellow, the talker in zone 2 in red and the talker in
zone 3 in blue. Acoustic echo cancellation (AEC) is used to subtract the echo from each microphone
signal, and zone interference cancellation (ZIC) is used to isolate the target speech from
interfering speech in each microphone signal.
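
The AEC step can be sketched as follows. This report does not specify an AEC algorithm; the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter, a common textbook choice, as an assumption. The function name `nlms_echo_cancel`, its parameters, and the synthetic two-tap echo path are all hypothetical. ZIC is not covered by this sketch.

```python
import random
from typing import List, Sequence


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 8, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Subtract an adaptive estimate of the echo of `ref` from `mic` (NLMS)."""
    weights = [0.0] * taps   # estimated echo-path coefficients
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, ref):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate          # microphone minus estimated echo
        norm = sum(x * x for x in history) + eps    # input power, guards against /0
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual


# Synthetic demonstration: the microphone contains only echo of the reference.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Hypothetical echo path: direct path plus one delayed reflection.
mic = [0.6 * ref[n] + (0.3 * ref[n - 1] if n > 0 else 0.0) for n in range(len(ref))]
residual = nlms_echo_cancel(mic, ref)

# Compare energy over the last 500 samples, after the filter has adapted.
echo_energy = sum(s * s for s in mic[-500:])
residual_energy = sum(s * s for s in residual[-500:])
```

In this echo-only demonstration the residual energy should fall far below the echo energy once the filter has adapted. In a real cabin the microphone signal also contains target and interfering speech, which disturbs adaptation and is typically handled with additional control logic that is outside the scope of this sketch.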

8.1.2.2 Use case B – Interference during a voice recognition session
Once a voice recognition session has been initiated and the target zone has been identified
(e.g., using KWS or push-to-talk), an occupant in the target zone will use voice commands to interact
with the voice recognition system. The target speech in the target zone will potentially be mixed with

20 FGVM-01R1 (2019)
