DEPARTMENT OF TELECOMMUNICATIONS AND MEDIA INFORMATICS
Budapest University of Technology and Economics - Faculty of Electrical Engineering and Informatics

List of topics

Focus-on-Tongue: Biasing Ultrasound-to-Speech Synthesis System with Tongue Shape
This project investigates how to improve ultrasound-to-speech synthesis systems by explicitly guiding neural networks to focus on tongue contours within ultrasound images. Current models can map articulatory data to speech, but they often struggle with noise and variability in ultrasound signals. A key question arises: what features do convolutional neural networks (CNNs) actually learn from these images? The approach involves several steps:
1. Heatmap Visualization – using techniques like Grad-CAM to visualize CNN attention on ultrasound images.
2. Cross-Speaker Analysis – inspecting results across different speakers to ensure robustness.
3. Tongue Contour Biasing – applying DeepLabCut to extract tongue contours and regenerate images, providing a structural bias for the model.
4. Speech Synthesis – training the biased network to generate mel-spectrograms, then using a HiFi-GAN vocoder to convert them into speech.
5. Evaluation – assessing output quality with objective metrics such as MSE, Mel-Cepstral Distortion (MCD), and PESQ.
The novelty of this work lies in biasing the learning process toward meaningful articulatory structures, rather than relying solely on raw image data. This can lead to more interpretable, reliable, and higher-quality speech synthesis.
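The heatmap-visualization step rests on the standard Grad-CAM recipe: weight each convolutional feature map by its spatially averaged gradient, sum the weighted maps, and keep only the positive evidence. A minimal NumPy sketch of that combination step (the helper name `grad_cam_heatmap` is illustrative; in practice the activations and gradients would come from a hook on a chosen CNN layer) could look like this:

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Combine conv-layer activations and gradients into a Grad-CAM map.

    activations, gradients: (channels, H, W) arrays captured at one
    convolutional layer for a single ultrasound frame.
    Returns an (H, W) heatmap normalised to [0, 1].
    """
    # alpha_k: global-average-pooled gradient per channel
    weights = gradients.mean(axis=(1, 2))
    # weighted sum over channels -> (H, W)
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU: keep only regions with positive influence on the target
    cam = np.maximum(cam, 0.0)
    if cam.max() > 0:
        cam /= cam.max()  # normalise for overlaying on the ultrasound image
    return cam

# Toy example: one channel is active in a small patch and has positive gradient,
# so the heatmap should highlight exactly that patch.
acts = np.zeros((2, 4, 4))
acts[0, 1:3, 1:3] = 1.0
grads = np.zeros((2, 4, 4))
grads[0] = 1.0
heatmap = grad_cam_heatmap(acts, grads)
```

Overlaying such a heatmap on the raw ultrasound frame is what makes it possible to check, per speaker, whether the CNN attends to the tongue surface or to imaging artifacts.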
Supervisor: Ibrahimov Ibrahim
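Among the evaluation metrics named above, Mel-Cepstral Distortion has a simple closed form: a constant times the frame-wise Euclidean distance between reference and synthesized mel-cepstral coefficients, averaged over frames. A minimal NumPy sketch follows, assuming the mel-cepstra have already been extracted by some external tool and the energy coefficient c0 has been dropped; the function name `mel_cepstral_distortion` is illustrative:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Average frame-wise Mel-Cepstral Distortion in dB.

    mcep_ref, mcep_syn: (frames, coeffs) arrays of mel-cepstral
    coefficients, time-aligned, with c0 excluded by convention.
    """
    diff = mcep_ref - mcep_syn
    # MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref - c_syn)^2), per frame
    const = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    per_frame = const * np.sqrt(np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

# Sanity check: identical inputs give 0 dB, a small perturbation gives > 0 dB.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 24))
syn = ref + 0.01 * rng.normal(size=(100, 24))
```

Lower MCD indicates synthesized speech whose spectral envelope is closer to the reference; it complements PESQ, which targets perceived quality rather than spectral distance.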