Using Deep Learning to Reconstruct High-Resolution Audio

Jeffrey Hetherly · Published in Insight · Jun 23, 2017

Jeffrey Hetherly, Physics PhD and Insight AI Fellow, implemented cutting-edge research that was scheduled to be presented at ICLR 2017. He is now a Data Scientist at Lab41, an In-Q-Tel Lab, working on advances in machine learning for open source products. This project, made possible by Paperspace GPUs, also resulted in an active open source contribution to TensorFlow.

Audio super-resolution aims to reconstruct a high-resolution audio waveform given a lower-resolution waveform as input. There are several potential applications for this type of upsampling in areas such as streaming audio and audio restoration. One traditional solution is to use a database of audio clips to fill in the missing frequencies in the downsampled waveform using a similarity metric (see this and this paper). Inspired by the successful application of deep learning to image super-resolution, researchers have recently become interested in using deep neural networks to perform this upsampling directly on raw audio waveforms. After prototyping several methods, I focused on implementing and customizing recently published research from the 2017 International Conference on Learning Representations (ICLR).

While there are a variety of domains where audio upsampling could be useful, I focused on a potential voice-over-IP application. The dataset I chose for this project is a collection of TED talks, about 35 GB in size, found here. Each talk is stored in a separate file sampled at 16 kilohertz (kHz), which is considered high quality for speech audio. The dataset consists primarily of well-articulated English speech delivered in front of an audience by a variety of speakers. These qualities make the TED talks a reasonable approximation of what one might expect during a voice-over-IP conversation.

The preprocessing steps are outlined in the above figure. The first and last 30 seconds of each file are trimmed to remove the TED intro and closing. The files are then split into 2-second clips, and a separate, 4x downsampled set of clips at 4 kHz is created alongside a set at the original 16 kHz. 60% of the dataset is used for training, while 20% is reserved for validation and 20% for testing.
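As a rough sketch of this step, the snippet below trims, splits, and downsamples a single talk with librosa; the constants and the `make_clips` helper are illustrative assumptions rather than the exact preprocessing code from the project.

```python
# A rough sketch of the preprocessing step, not the exact project code.
# librosa, the constants, and the make_clips helper are assumptions here.
import librosa

HIGH_RATE = 16000    # original sampling rate (16 kHz)
LOW_RATE = 4000      # 4x downsampled rate (4 kHz)
CLIP_SECONDS = 2     # length of each training clip
TRIM_SECONDS = 30    # drop the TED intro and closing

def make_clips(path):
    """Yield (low_res, high_res) pairs of 2-second clips from one TED talk."""
    audio, _ = librosa.load(path, sr=HIGH_RATE)
    audio = audio[TRIM_SECONDS * HIGH_RATE : -TRIM_SECONDS * HIGH_RATE]
    samples_per_clip = CLIP_SECONDS * HIGH_RATE
    for i in range(len(audio) // samples_per_clip):
        high = audio[i * samples_per_clip : (i + 1) * samples_per_clip]
        low = librosa.resample(high, orig_sr=HIGH_RATE, target_sr=LOW_RATE)
        yield low, high
```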

The training workflow outlined in the above figure takes the downsampled clips from the data preprocessing step and batch-feeds them into the model (a deep neural network) to update its weights. The model with the lowest validation score (denoted “Best Model”) is saved for later use.
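The “Best Model” bookkeeping amounts to checkpointing whenever the validation loss improves. Below is a minimal sketch of that loop written with tf.keras callbacks for brevity, not the lower-level TensorFlow training code from the original project; `model`, `train_ds`, and `val_ds` are assumed to exist.

```python
# Minimal sketch of "keep the Best Model" checkpointing, assuming `model` is a
# compiled tf.keras model and `train_ds` / `val_ds` yield (low_res, high_res)
# batches. The original project used lower-level TensorFlow training code.
import tensorflow as tf

best_model = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5",
    monitor="val_loss",       # the validation score tracked above
    save_best_only=True)      # only overwrite when validation loss improves

model.fit(train_ds,
          validation_data=val_ds,
          epochs=10,                 # the 10 epochs used in this project
          callbacks=[best_model])
```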

The process of using the “Best Model” to upsample an audio file is shown in the above figure. This workflow takes a whole audio file, splices it into clips as in the preprocessing step, sequentially feeds them to the trained model, stitches the high-resolution clips back together, and saves the high-resolution file to disk.
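A sketch of that chunk-and-stitch loop is below; the paths, padding scheme, and the Keras-style `model.predict` call are assumptions for illustration.

```python
# Sketch of upsampling one whole file with the saved "Best Model". The paths,
# padding scheme, and the Keras-style `model.predict` call are assumptions.
import librosa
import numpy as np
import soundfile as sf

LOW_RATE, HIGH_RATE, CLIP_SECONDS = 4000, 16000, 2
MODEL_LEN = 32768  # network input length (2-second clips padded up; see the
                   # model sketch in the next section)

def upsample_file(model, in_path, out_path):
    low, _ = librosa.load(in_path, sr=LOW_RATE)
    clip_len = CLIP_SECONDS * LOW_RATE
    restored = []
    for start in range(0, len(low) - clip_len + 1, clip_len):
        clip = low[start:start + clip_len]
        # interpolate to the target rate, then zero-pad to the network's input length
        stretched = librosa.resample(clip, orig_sr=LOW_RATE, target_sr=HIGH_RATE)
        padded = np.pad(stretched, (0, MODEL_LEN - len(stretched)))
        pred = model.predict(padded[np.newaxis, :, np.newaxis])
        restored.append(pred[0, :len(stretched), 0])   # drop the padding again
    # stitch the high-resolution clips back together and write to disk
    sf.write(out_path, np.concatenate(restored), HIGH_RATE)
```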

Model Architecture

The model architecture I implemented was a U-Net that uses a one-dimensional analogue of subpixel convolutions instead of deconvolution layers. I used TensorFlow’s Python API to build and train the model, while the subpixel convolutional layers are implemented using TensorFlow’s C++ API. The model works as follows (a condensed code sketch follows the list):

  • The downsampled waveform was sent through eight downsampling blocks, each made of convolutional layers with a stride of two. At each block the number of filter banks was doubled, so that as the dimension along the waveform was halved, the filter-bank dimension grew by a factor of two.
  • The bottleneck layer was constructed identically to a downsampling block and connects to eight upsampling blocks, which have residual connections back to the corresponding downsampling blocks. These residual connections allow features learned from the low-resolution waveform to be shared.
  • The upsampling blocks used a subpixel convolution that reorders information along the filter-bank dimension to expand the waveform dimension.
  • A final convolutional layer, with the same restacking and reordering operations, was residually added to the original input to yield the upsampled waveform.
  • The loss function was the mean-squared error between the output waveform and the original, high-resolution waveform.
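To make the architecture concrete, here is a condensed sketch written with tf.keras. It is an approximation under stated assumptions, not the original implementation (which used TensorFlow’s lower-level Python API plus a custom C++ subpixel op): the filter counts, kernel width of 9, activations, and the choice to interpolate the low-resolution input to the output length (zero-padded to 32,768 samples so that nine stride-2 layers halve it evenly) are illustrative choices.

```python
# Condensed tf.keras sketch of the audio U-Net described above. Filter counts,
# kernel width, activations, and the padded input length are assumptions; the
# original project used TensorFlow's lower-level Python API and a C++ subpixel op.
import tensorflow as tf
from tensorflow.keras import layers

N_BLOCKS = 8
INPUT_LEN = 32768  # 2 s at 16 kHz (32,000 samples) zero-padded so nine stride-2
                   # layers halve the width evenly; the low-res input is assumed
                   # to be interpolated to this length beforehand

def subpixel_1d(x, r=2):
    """1-D subpixel shuffle: (batch, width, channels*r) -> (batch, width*r, channels)."""
    _, width, channels = x.shape
    x = tf.reshape(x, [-1, width, r, channels // r])
    return tf.reshape(x, [-1, width * r, channels // r])

def build_audio_unet():
    inputs = layers.Input(shape=(INPUT_LEN, 1))
    x, skips = inputs, []

    # Eight downsampling blocks: stride-2 convolutions that halve the width
    # while doubling the number of filter banks.
    for b in range(N_BLOCKS):
        x = layers.Conv1D(16 * 2 ** b, 9, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)

    # Bottleneck, built the same way as a downsampling block.
    x = layers.Conv1D(16 * 2 ** N_BLOCKS, 9, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)

    # Eight upsampling blocks: convolution + 1-D subpixel shuffle, with a
    # residual connection back to the matching downsampling block.
    for b in reversed(range(N_BLOCKS)):
        x = layers.Conv1D(2 * 16 * 2 ** b, 9, padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.Lambda(subpixel_1d)(x)   # reorders filter banks into time
        x = layers.Add()([x, skips[b]])     # share low-resolution features

    # Final convolution + subpixel shuffle, residually added to the input.
    x = layers.Conv1D(2, 9, padding="same")(x)
    x = layers.Lambda(subpixel_1d)(x)
    outputs = layers.Add()([x, inputs])

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # mean-squared error on waveforms
    return model
```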

Performance

The above figure shows the model’s performance on a test sample after 10 epochs of training. The left column shows spectrograms of frequency versus time, and the right column shows plots of waveform amplitude versus time.

  • The first row contains the spectrogram and waveform plots for the original, high-resolution audio sample.
  • The middle row contains the same plots for the 4x downsampled version of the original audio sample. Notice that the top three-quarters of the frequency range is missing from the downsampled spectrogram.
  • The last row contains the spectrograms and waveform plots for the output of the trained model.

Inset in the figure are two quantitative measures of performance: the signal-to-noise ratio (SNR) and the log-spectral distance (LSD). Higher SNR values indicate clearer-sounding audio, while lower LSD values indicate better-matched frequency content. The LSD value shows that the neural network is attempting to restore the higher frequencies wherever appropriate. However, the slightly lower SNR value implies that the audio may not be as clear-sounding.
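For reference, both metrics can be computed directly from the waveforms and their short-time Fourier transforms. The sketch below follows the standard definitions; the use of librosa and the FFT size are assumptions, not the exact evaluation code from this project.

```python
# Sketch of the two evaluation metrics, following their standard definitions.
# The STFT frame size and the use of librosa are assumptions for illustration.
import numpy as np
import librosa

def snr(reference, estimate):
    """Signal-to-noise ratio in dB: higher means the estimate is closer to the reference."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def lsd(reference, estimate, n_fft=2048):
    """Log-spectral distance: RMS difference of log-power spectra, averaged over frames."""
    ref_spec = np.log10(np.abs(librosa.stft(reference, n_fft=n_fft)) ** 2 + 1e-10)
    est_spec = np.log10(np.abs(librosa.stft(estimate, n_fft=n_fft)) ** 2 + 1e-10)
    # root-mean-square over frequency bins, then mean over time frames
    return np.mean(np.sqrt(np.mean((ref_spec - est_spec) ** 2, axis=0)))
```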

The paper that inspired this architecture reports training for 400 epochs, whereas I could train for only 10 epochs due to time constraints. A longer training period would likely result in increased clarity in the reconstructed waveform. Below you can listen to sample audio clips from the test set. The first 5-second clip is the original audio at 16 kHz, the second is the downsampled audio at 4 kHz, and the last is the reconstructed audio at 16 kHz.

Random clip at 16 kHz from the test set.
Downsampled version of the above clip. Notice that all high frequency content is missing.
Reconstructed clip. Much of the high frequency content has been restored at the expense of clarity.

Open-source contributions

The reconstruction of downsampled audio can have a variety of applications, and what is even more exciting is the possibility of applying these techniques to other, non-audio signals. I encourage you to adapt and modify the code available in my GitHub repo to experiment along these lines.

In addition to making the code for these experiments available, I wanted to contribute additional open-source material to the growing applied AI community. Since the subpixel convolution layer is a general operation that may be useful to deep learning researchers and engineers alike, I’ve been contributing it back to TensorFlow and working closely with their team to integrate it into their codebase.

Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program.

Are you a company working in AI and would like to get involved in the Insight AI Fellows Program? Feel free to email us.
