Audio

In this section, we will learn how to use representations of audio data in machine learning.

Audio files can be represented in a variety of ways. The most common is the waveform: a time series of the sound wave's amplitude, sampled at regular intervals and stored as a one-dimensional array of numbers. The sampling rate is the number of samples per second, so the duration of a clip in seconds is its number of samples divided by the sampling rate.
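As a minimal sketch of these definitions, the snippet below synthesizes a short tone with NumPy instead of loading a file. The 440 Hz tone, the one-second length, and the 22,050 Hz sampling rate are illustrative assumptions, not properties of the audio file used in this section.

```python
import numpy as np

sr = 22050          # assumed sampling rate in samples per second (Hz)
duration = 1.0      # assumed clip length in seconds
freq = 440.0        # assumed tone frequency in Hz

# Time stamp of every sample: 0, 1/sr, 2/sr, ...
t = np.arange(int(sr * duration)) / sr

# The waveform: one amplitude value per time point
y = np.sin(2 * np.pi * freq * t)

print(y.ndim)         # number of array dimensions
print(len(y) / sr)    # duration in seconds = samples / sampling rate
```

The same relation, number of samples divided by sampling rate, gives the length of any waveform regardless of how it was loaded.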

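To load an audio file, we can use the librosa library. The librosa.load function returns the waveform and the sampling rate. By default it converts the audio to mono and resamples it to 22,050 Hz; pass sr=None to keep the file's native sampling rate.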

Note

You may have to install the librosa library using !pip install librosa in a new code cell for the code below to work.

The audio file can be downloaded from this link.

from matplotlib import pyplot as plt
plt.style.use('dark_background')

import librosa

# Load the waveform y and its sampling rate sr
y, sr = librosa.load('../assets/StarWars3.wav')

# Plot amplitude against sample index
plt.plot(y);
plt.xlabel('Time (samples)');
plt.ylabel('Amplitude');
plt.title('Star Wars Theme\nSampling rate: %s Hz\nLength: %s seconds' % (sr, len(y)/sr));

sr, len(y)

(22050, 66150)

# Short-time Fourier transform: rows are frequency bins, columns are time frames
S = librosa.stft(y)
S.shape

(1025, 130)
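The shape (1025, 130) follows from the default parameters of librosa.stft: a window of n_fft=2048 samples, a hop of n_fft // 4 = 512 samples between frames, and centered (padded) frames. The helper below is a hypothetical function written only to reproduce that arithmetic; it is not part of librosa and assumes those defaults.

```python
def stft_shape(n_samples, n_fft=2048, hop_length=None):
    """Expected (frequency bins, time frames) of a centered STFT.

    Hypothetical helper that mirrors librosa's default settings.
    """
    if hop_length is None:
        hop_length = n_fft // 4            # default hop: a quarter of the window
    n_bins = 1 + n_fft // 2                # non-negative frequencies of a real signal
    n_frames = 1 + n_samples // hop_length # one frame per hop, with centered padding
    return n_bins, n_frames

# 66150 samples is the length of the waveform loaded above
print(stft_shape(66150))
```

Each column therefore covers a step of 512 / 22050 seconds, roughly 23 ms, and each row corresponds to one frequency between 0 Hz and half the sampling rate.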

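Power Spectral Density (PSD) is a measure of the power of a signal at different frequencies. It is estimated using the Fourier Transform. Applying the transform to short, overlapping windows of the signal, as librosa.stft does above, gives a spectrogram, which shows how that power changes over time. These are useful representations of audio data because it is often easier to distinguish different sounds in the frequency domain than in the time domain.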
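To illustrate why the frequency domain can make sounds easier to tell apart, the sketch below mixes two synthetic tones and estimates their power spectrum with a plain periodogram (the squared magnitude of the Fourier transform, correct only up to a scaling constant). The tone frequencies and the 8,000 Hz sampling rate are assumptions chosen for illustration.

```python
import numpy as np

sr = 8000                       # assumed sampling rate (Hz)
t = np.arange(sr) / sr          # one second of time stamps

# Hypothetical signal: a 440 Hz tone plus a quieter 1000 Hz tone
y = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Periodogram: squared magnitude of the Fourier transform (unnormalized PSD estimate)
spectrum = np.fft.rfft(y)
power = np.abs(spectrum) ** 2 / len(y)
freqs = np.fft.rfftfreq(len(y), d=1 / sr)

# The two strongest frequency bins recover the two tones
top2 = np.sort(freqs[np.argsort(power)[-2:]])
print(top2)
```

In the waveform the two tones are summed into a single curve, whereas in the power spectrum they appear as two separate peaks, with the louder tone carrying more power.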

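import numpy as np
import librosa.display  # explicit import; older librosa versions do not load this submodule automatically

fig, ax = plt.subplots()
# Convert STFT magnitudes to decibels relative to the maximum, then plot on a log-frequency axis
img = librosa.display.specshow(librosa.amplitude_to_db(np.abs(S), ref=np.max),
    y_axis='log', x_axis='time', ax=ax);
ax.set_title('Power spectrogram');
fig.colorbar(img, ax=ax, format="%+2.0f dB");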