Page 31 - Use cases and requirements for the vehicular multimedia networks - Focus Group on Vehicular Multimedia (FG-VM)


interfering speech from other zones, and echo from media sources playing into the cabin. These may
result in incorrectly detected speech and difficulties in accomplishing the voice command.

Figure 9 – Illustration of use case B, where zone 2 has been identified as the target zone

8.1.3 Gap analysis

ITU-T SG12 is currently working on a set of recommendations relevant for VMS:
– In-car communication (P.ICC). P.ICC uses the integrated microphones and speakers in the
motor vehicle cabin to amplify conversation and so improve communication between
all occupants of a motor vehicle. It also ensures voice quality such that the
motor vehicle driver does not feel it necessary to turn their head to project their voice when
talking to other passengers. However, it is not immediately apparent that P.ICC addresses the
requirements associated with the use of voice recognition in a vehicular multimedia context.
– P11xx-00,10,20,30 series. This series, covering hands-free communications in vehicles for
narrowband, wideband, super-wideband and full-band and associated subsystems, provides useful
conformance points to improve signal processing for hands-free communication within a
vehicle. However, it is not immediately apparent that the P11xx series addresses the use
cases and requirements for multiple talkers, KWS, voice commands and voice recognition
sessions.

8.1.4 Requirements – Acoustic

In order for a KWS, a voice recognition or natural language processing solution to operate effectively
in a multiple-talker scenario with background media in a vehicle, the following requirements are
proposed:

R1: It shall be possible to initiate a voice recognition session from any zone in the vehicle.
AR1.1: It shall be possible to initiate a voice recognition session from any zone in the presence
of speech interference and noise interference from the same or a different zone.
R2: The VMS or vehicle should provide at least one voice/speech/audio channel per zone
(e.g., one microphone per zone).
R3: It shall be possible to process each channel/zone independently.

R4: It shall be possible to identify a target zone/target speech (e.g., scenario A-1).
R5: ZIC shall be performed to suppress interfering speech from a zone whilst preserving the target
speech in that zone (scenarios A-2-A, A-2-C and B).

R6: AEC shall be performed on each channel to suppress the echo of media sources that are picked
up in each zone (scenario A-2-B).

R7: When interfacing with cloud recognition services that are not trained specifically with vehicle
noise, it should be possible to perform noise reduction in the target zone to suppress road noise or
wind noise.
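
The following is a minimal, illustrative sketch of how R2, R3, R4 and R6 could fit together; it is not taken from this report or from any ITU-T Recommendation. It assumes one microphone signal per zone (R2), processes each channel independently (R3), uses a normalized LMS (NLMS) adaptive filter as a stand-in echo canceller against a shared media reference (R6), and picks the target zone by residual energy (R4), which is a simplification chosen here for brevity. The function names, filter length, step size and synthetic signals are all hypothetical. ZIC (R5) and noise reduction (R7) are deliberately left out.

```python
import math
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Remove an adaptive estimate of the media echo (ref) from one zone's mic signal."""
    w = [0.0] * taps    # adaptive filter weights (echo path estimate)
    buf = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))  # estimated echo sample
        e = d - y                                   # echo-reduced output sample
        norm = sum(b * b for b in buf) + eps        # reference power for normalization
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out


def energy(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / max(len(signal), 1)


def process_zones(mics, media_ref):
    """Run AEC on each zone independently, then pick the zone with most residual energy."""
    cleaned = {zone: nlms_echo_cancel(sig, media_ref) for zone, sig in mics.items()}
    target = max(cleaned, key=lambda zone: energy(cleaned[zone]))
    return target, cleaned


# Synthetic demonstration (all values are assumptions made for this sketch).
rng = random.Random(0)
n = 2000
media_ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Media echo as picked up in the cabin: a short two-tap echo path.
echo = [0.6 * media_ref[i] + (0.3 * media_ref[i - 1] if i > 0 else 0.0) for i in range(n)]

# A tone stands in for the talker in zone 2; zones 1 and 3 pick up echo only.
speech = [0.5 * math.sin(2 * math.pi * 0.05 * i) for i in range(n)]
mics = {
    1: list(echo),
    2: [e + s for e, s in zip(echo, speech)],
    3: list(echo),
}

target, cleaned = process_zones(mics, media_ref)
print("target zone:", target)
```

In this sketch the echo-only zones are driven close to silence once the filter converges, so the zone containing the talker stands out. A practical system would need double-talk handling and a more robust target-zone detector than raw energy.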

FGVM-01R1 (2019) 21
