Trusted voices: secure multi-modal speaker verification for smart homes - ON-1113

Project type: Innovation
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical Sciences
Company: Sapir AI Inc.
Project Length: 4 to 6 months
Preferred start date: As soon as possible.
Language requirement: English
Location(s): Toronto, ON, Canada
No. of positions: 1
Desired education level: Master's, PhD, Postdoctoral fellow, Recent graduate
Open to applicants registered at an institution outside of Canada: No

About the company: 

Sapir Robotics is developing a next-generation home robot capable of autonomously performing complex household tasks like cleaning, cooking, and laundry. Our work blends advanced mechanical and electrical engineering with cutting-edge AI, including vision-language models (VLMs), Reinforcement Learning, and 3D computer vision. The system is designed for real-world physical interaction, requiring research into real-time spatial reasoning, scene understanding, manipulation, and intelligent behavior planning. Our goal is to make the home robot as common and transformative as the washing machine once was.

Describe the project.: 

The next generation of smart home personal assistants will move beyond simple command-response interactions to become proactive, cognitive partners. A fundamental challenge in this evolution is enabling secure and truly personalized communication. Current systems are often limited to generic wake-word detection, but future assistants operating in complex, multi-user environments must reliably know who is speaking at all times. This project addresses the critical need for continuous speaker detection and on-demand identity verification within a sophisticated, multimodal AI system.

The primary innovation to be developed is a robust, real-time speaker verification pipeline that can run on commodity edge hardware. The goal is to create a system capable of handling complex acoustic scenes with multiple speakers across different rooms. This pipeline will enable the assistant to differentiate between users for personalized responses and, crucially, to perform biometric verification for security-sensitive commands (e.g., accessing private information or controlling home security features). The final deliverable will be a deployable software module that provides personalized, secure voice interactions in a smart environment.
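At its core, on-demand biometric verification of this kind typically compares a speaker embedding extracted from the live utterance against an enrolled reference embedding, accepting the claimed identity when their similarity exceeds a tuned threshold. A minimal sketch of that decision step is below; the threshold value and function names are illustrative assumptions, not part of the posting:

```python
import numpy as np

# Hypothetical operating point; real systems tune this on a development
# set to trade off false accepts against false rejects.
VERIFY_THRESHOLD = 0.7

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(test_emb: np.ndarray, enrolled_emb: np.ndarray,
                   threshold: float = VERIFY_THRESHOLD) -> bool:
    """Accept the claimed identity only if the live embedding is
    sufficiently close to the enrolled reference."""
    return cosine_similarity(test_emb, enrolled_emb) >= threshold
```

In practice the embeddings would come from a pretrained model such as an ECAPA-TDNN, and the comparison might use a trained scoring backend (e.g., PLDA) rather than raw cosine similarity.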

Key tasks will include researching and prototyping state-of-the-art multi-modal speaker diarization and verification models optimized for low-latency performance. The candidate will help architect a pipeline that processes continuous audio streams, segments them by speaker, and triggers a high-confidence verification check when required. The methodology will focus on designing and testing this system for accuracy, speed, and robustness in realistic, simulated domestic environments, ensuring it can effectively interface with a broader multimodal AI framework.
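The streaming behaviour described above (segment continuous audio, attribute each speech segment to a speaker, and escalate to a high-confidence check when identity is uncertain) can be sketched as a toy frame-level loop. Everything here is a stand-in assumption: the energy gate substitutes for a real voice-activity detector, the normalized frame substitutes for a neural embedder, and the thresholds are arbitrary:

```python
import numpy as np

FRAME_LEN = 1600     # 100 ms at 16 kHz (assumed frame size)
ENERGY_GATE = 1e-4   # hypothetical voice-activity threshold
ID_THRESHOLD = 0.6   # below this, identity is "unknown"

class SpeakerPipeline:
    """Toy continuous pipeline: gate frames by energy (a stand-in for a
    real VAD), embed each speech frame (stand-in embedder), and assign
    the nearest enrolled speaker by cosine similarity."""

    def __init__(self, enrolled: dict):
        # Mapping of speaker name -> unit-norm reference embedding.
        self.enrolled = enrolled

    def embed(self, frame: np.ndarray) -> np.ndarray:
        # Placeholder for a neural embedder (e.g., ECAPA-TDNN); here we
        # simply L2-normalize the raw frame.
        return frame / (np.linalg.norm(frame) + 1e-9)

    def process_frame(self, frame: np.ndarray):
        if np.mean(frame ** 2) < ENERGY_GATE:
            return None                       # silence: skip
        emb = self.embed(frame)
        scores = {name: float(emb @ ref) for name, ref in self.enrolled.items()}
        best = max(scores, key=scores.get)
        if scores[best] < ID_THRESHOLD:
            # Low confidence: this is where the pipeline would trigger
            # an explicit high-confidence verification check.
            return ("unknown", scores[best])
        return (best, scores[best])
```

A production version would replace each placeholder with a trained component (neural VAD, diarization, embedding extraction) and add overlap handling, but the control flow—gate, embed, score, escalate—stays the same.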

Required expertise/skills: 

Machine Learning & Speech Processing:
• Demonstrated experience in speaker recognition and speaker diarization.
• Strong theoretical and practical understanding of deep learning models for voice biometrics (e.g., x-vectors, ECAPA-TDNN).
• Proficiency in digital signal processing for audio, including feature extraction techniques like MFCCs and spectrograms.
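For a sense of what the MFCC extraction mentioned above involves, the classic recipe for a single windowed frame is: power spectrum, triangular mel filter bank, log compression, then a DCT. The sketch below is a minimal NumPy/SciPy illustration under assumed parameters (16 kHz audio, 26 mel bands, 13 coefficients); production code would normally use librosa or torchaudio instead:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame: np.ndarray, sr: int = 16000, n_mels: int = 26,
               n_mfcc: int = 13) -> np.ndarray:
    """MFCCs for one windowed frame: power spectrum -> mel filter
    bank -> log -> DCT."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Mel scale conversions.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)
    # Filter center frequencies equally spaced on the mel scale.
    hz_points = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    # Triangular mel filter bank.
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(up, down), 0.0, None)
    log_energy = np.log(fbank @ spectrum + 1e-10)
    return dct(log_energy, type=2, norm='ortho')[:n_mfcc]
```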

Programming & Frameworks:
• High proficiency in Python and its scientific computing libraries (NumPy, SciPy).
• Extensive hands-on experience with modern deep learning frameworks, primarily PyTorch or TensorFlow.
• Familiarity with major speech processing toolkits such as pyannote.audio or NVIDIA NeMo.

Optional Assets:
• Experience processing multi-channel audio from microphone arrays, including techniques for far-field speech enhancement.
• Experience deploying models for real-time, low-latency inference.
• A record of academic publications in relevant venues (e.g., ICASSP, Interspeech).