In this article, you learn how to design a microphone array customized for use with the Speech SDK. This is most pertinent if you're selecting, specifying, or building hardware for speech solutions.
The Speech SDK works best with a microphone array designed according to these guidelines, including the microphone geometry, component selection, and architecture.
Microphone geometry
The following array geometries are recommended for use with the Microsoft Audio Stack. Location of sound sources and rejection of ambient noise is improved with greater number of microphones with dependencies on specific applications, user scenarios, and the device form factor.
Array
Microphones
Geometry
Circular - 7 Microphones
6 Outer, 1 Center, Radius = 42.5 mm, Evenly Spaced
Circular - 4 Microphones
3 Outer, 1 Center, Radius = 42.5 mm, Evenly Spaced
Linear - 4 Microphones
Length = 120 mm, Spacing = 40 mm
Linear - 2 Microphones
Spacing = 40 mm
Microphone channels should be ordered ascending from 0, according to the numbering previously described for each array. The Microsoft Audio Stack requires another reference stream of audio playback to perform echo cancellation.
Component selection
Microphone components should be selected to accurately reproduce a signal free of noise and distortion.
The recommended properties when selecting microphones are:
Parameter
Recommended
SNR
>= 65 dB (1 kHz signal 94 dBSPL, A-weighted noise)
Amplitude Matching
± 1 dB @ 1 kHz
Phase Matching
± 2° @ 1 kHz
Acoustic Overload Point (AOP)
>= 120 dBSPL (THD = 10%)
Bit Rate
Minimum 24-bit
Sampling Rate
Minimum 16 kHz*
Frequency Response
± 3 dB, 200-8000 Hz Floating Mask*
Reliability
Storage Temperature Range -40°C to 70°C Operating Temperature Range -20°C to 55°C
*Higher sampling rates or "wider" frequency ranges might be necessary for high-quality communications (VoIP) applications
Good component selection must be paired with good electroacoustic integration in order to avoid impairing the performance of the components used. Unique use cases might also necessitate more requirements (such as operating temperature ranges).
Microphone array integration
The performance of the microphone array when integrated into a device differs from the component specification. It's important to ensure that the microphones are well matched after integration. Therefore the device performance measured after any fixed gain or EQ should meet the following recommendations:
Parameter
Recommended
SNR
>= 64 dB (1 kHz signal 94 dBSPL, A-weighted noise)
Output Sensitivity
-26 dBFS/Pa @ 1 kHz (recommended)
Amplitude Matching
± 2 dB, 200-8000 Hz
THD%*
≤ 1%, 200-8000 Hz, 94 dBSPL
Frequency Response
± 6 dB, 200-12000 Hz Floating Mask**
**A low distortion speaker is required to measure THD (for example, Neumann KH120)
**"Wider" frequency ranges might be necessary for high-quality communications (VoIP) applications
Speaker integration recommendations
As echo cancellation is necessary for speech recognition devices that contain speakers, more recommendations are provided for speaker selection and integration.
Parameter
Recommended
Linearity Considerations
No nonlinear processing after speaker reference, otherwise a hardware-based loopback reference stream is required
Speaker Loopback
Provided via WASAPI, private APIs, custom ALSA plug-in (Linux), or provided via firmware channel
THD%
Third Octave Bands minimum fifth Order, 70 dBA Playback @ 0.8 m ≤ 6.3%, 315-500 Hz ≤ 5%, 630-5000 Hz
Echo Coupling to Microphones
> -10 dB TCLw using ITU-T G.122 Annex B.4 method, normalized to mic level TCLw = TCLwmeasured + (Measured Level - Target Output Sensitivity) TCLw = TCLwmeasured + (Measured Level - (-26))
Integration design architecture
The following guidelines for architecture are necessary when integrating microphones into a device:
Parameter
Recommendation
Mic Port Similarity
All microphone ports are same length in array
Mic Port Dimensions
Port size Ø0.8-1.0 mm. Port Length / Port Diameter < 2
Mic Sealing
Sealing gaskets uniformly implemented in stack-up. Recommend > 70% compression ratio for foam gaskets
Mic Reliability
Mesh should be used to prevent dust and ingress (between PCB for bottom ported microphones and sealing gasket/top cover)
Mic Isolation
Rubber gaskets and vibration decoupling through structure, particularly for isolating any vibration paths due to integrated speakers
Sampling Clock
Device audio must be free of jitter and drop-outs with low drift
Record Capability
The device must be able to record individual channel raw streams simultaneously
Devices must not have any undiscoverable or uncontrollable hardware, firmware, or third party software-based nonlinear audio processing algorithms to/from the device
Capture Format
Capture formats must use a minimum sampling rate of 16 kHz and recommended 24-bit depth
Electrical architecture considerations
Where applicable, arrays can be connected to a USB host (such as a SoC that runs the Microsoft Audio Stack (MAS)) and interfaces to Speech services or other applications.
Hardware components such as PDM-to-TDM conversion should ensure that the dynamic range and SNR of the microphones is preserved within re-samplers.
High-speed USB Audio Class 2.0 should be supported within any audio MCUs in order to provide the necessary bandwidth for up to seven channels at higher sample rates and bit depths.