Let’s Augment Audio Data 🔊 Part 1

Vijay Anandan · Published in Analytics Vidhya · Mar 29, 2021 · 5 min read


One of the biggest challenges in Automatic Speech Recognition is the preparation and augmentation of audio data. Audio data can be analyzed in the time or frequency domain, which adds complexity compared with other data sources such as images and text.

What is Augmentation?

Deep neural networks have achieved state-of-the-art performance in many artificial intelligence fields, such as image classification, object detection, and audio classification. However, they usually need a very large amount of labelled data to obtain good results, and such data might not be available due to high labelling costs or the scarcity of samples. Data augmentation is a powerful tool for improving the performance of neural networks. It consists of modifying the original samples to create new ones without changing their labels. This leads to a much larger training set and, hence, to better results.

(cleaned & labeled data (“Better Data” == “Better Outcome”)) != (uncleaned & unlabeled data (“garbage in” == “garbage out”))

On the internet, there are plenty of stories, tutorials, and code samples about image and text augmentation techniques, and we have powerful libraries for image augmentation, such as Albumentations and imgaug. In contrast, there is far less information about audio data augmentation techniques.

Let’s dive deep into the audio signal processing and augmentation pool 🔊

Before getting started, we will split the blog into two parts. In the first section, we will discuss the basics of libROSA and working with audio data, and in the second part, we will discuss audio augmentation.

Section 1

An introduction to libROSA for working with audio:

  1. Loading audio file
  2. Audio timeline
  3. Audio plotting
  4. Tempo estimation
  5. Finding the pitch level
  6. Computing the mel-scaled spectrogram

Prerequisites:

To install libROSA, you just need to run the following command in your command line:

pip install librosa

In your Python code, you can import it as:

import librosa as lr

We will use the Matplotlib library for plotting the results and the NumPy library to handle the data as arrays.

Loading your audio file:

The first step in our analysis is to load an audio file into our code. This is done using the librosa.core.load() function. Audio will be automatically resampled to the given rate (default = 22050 Hz). To preserve the native sampling rate of the file, pass sr=None. A minimal loading example follows the parameter list below.

librosa.core.load() takes the following parameters:
  • path: the path to the audio file (a string)
  • sr: the target sampling rate
  • mono: a boolean (True/False) option to convert the signal to mono
  • offset: a floating-point number giving the start time (in seconds) at which to begin reading the file
  • duration: a floating-point number giving how many seconds of the file to load
  • dtype: the numeric representation of the data; can be float32, float16, int8, and others
  • res_type: the type of resampling (one option is kaiser_best)
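As a minimal sketch, assuming a local file named sample.wav (a hypothetical path):

import librosa

# Load the file, resampling to the default 22050 Hz; sr=None would
# keep the file's native sampling rate instead.
y, sr = librosa.load("sample.wav", sr=22050, mono=True)
print(y.shape, sr)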

Timeline for your audio

In this code, we will print the timeline of the audio file. We will simply load the audio, convert it into a NumPy array, and print the time of each sample (obtained by dividing the sample index by the sampling rate).

Audio timeline
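A minimal sketch of this step, again assuming the hypothetical sample.wav:

import numpy as np
import librosa

y, sr = librosa.load("sample.wav")

# Each sample index divided by the sampling rate gives its time in seconds.
timeline = np.arange(len(y)) / sr
print(timeline)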

Output :

[0.00000000e+00 4.53514739e-05 9.07029478e-05 ... 2.63027211e+01
2.63027664e+01 2.63028118e+01]

Plotting the audio:

Plotting the audio as time vs. sound amplitude
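A sketch of the plotting code, under the same assumptions:

import numpy as np
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("sample.wav")

# x-axis in seconds, y-axis is the raw sample amplitude.
time = np.arange(len(y)) / sr
plt.plot(time, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()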

Output : a plot of the waveform (time vs. amplitude).

Plotting and estimating the tempo

Tempo was originally used to describe the timing of music, i.e., the speed at which a piece of music is played; it is measured in beats per minute (BPM).

librosa.beat.tempo() returns the tempo as an array. Its parameters are:

  • y: audio time series
  • sr: sampling rate of the time series
  • onset_envelope: pre-computed onset strength envelope
  • hop_length: hop length of the time series
  • start_bpm: initial guess of the BPM
  • std_bpm: standard deviation of tempo distribution
  • ac_size: length (in seconds) of the auto-correlation window
  • max_tempo: estimate tempo below this threshold
  • aggregate: for estimating global tempo. If None, then tempo is estimated independently for each frame.
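A minimal sketch using these parameters (hypothetical sample.wav again):

import librosa

y, sr = librosa.load("sample.wav")

# Estimate the global tempo (in BPM) from the onset strength envelope.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo = librosa.beat.tempo(onset_envelope=onset_env, sr=sr)
print(tempo)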
Output :

[112.34714674]

Finding and plotting the pitch

The sensation of a frequency is commonly referred to as the pitch of a sound. A high-pitched sound corresponds to a high-frequency sound wave, and a low-pitched sound corresponds to a low-frequency sound wave. Pitch can be tracked with librosa.piptrack(), which takes the following parameters:

  • y: audio signal
  • sr: audio sampling rate of y
  • S: magnitude or power spectrogram
  • n_fft: number of FFT bins to use, if y is provided.
  • hop_length: number of samples to hop
  • threshold: A bin in spectrum S is considered a pitch when it is greater than threshold * ref(S)
  • fmin: lower frequency cutoff.
  • fmax: upper frequency cutoff.
  • win_length: Each frame of audio is windowed by window().
  • ref: reference value for the threshold (a scalar or callable)

It returns:

  • pitches: np.ndarray [shape=(d, t)]
  • magnitudes: np.ndarray [shape=(d, t)]
    where d is the subset of FFT bins within fmin and fmax.
  1. pitches[f, t] contains instantaneous frequency at bin f, time t.
  2. magnitudes[f, t] contains the corresponding magnitudes.
  3. Both pitches and magnitudes take the value 0 at bins of non-maximal magnitude.
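A minimal sketch:

import librosa

y, sr = librosa.load("sample.wav")

# Track pitch candidates; bins of non-maximal magnitude are set to 0.
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
print(pitches)
print(magnitudes)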

Output :

[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
(pitches above, magnitudes below)
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]

Computing the mel-scaled spectrogram

A mel-scaled spectrogram is an acoustic time-frequency representation of a sound. It is computed with librosa.feature.melspectrogram(), which takes the following parameters:

  • y: audio time-series
  • sr: sampling rate of y
  • S: spectrogram
  • n_fft: length of the FFT window
  • hop_length: number of samples between successive frames. See librosa.core.stft
  • win_length: Each frame of audio is windowed by window().
  • power: Exponent for the magnitude melspectrogram. e.g., 1 for energy, 2 for power, etc.
  • kwargs: additional keyword arguments (Mel filter bank parameters). See librosa.filters.mel for details.

It returns:

  1. S: np.ndarray [shape=(n_mels, t)]
    The mel spectrogram
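A minimal sketch:

import librosa

y, sr = librosa.load("sample.wav")

# Compute a mel-scaled power spectrogram (power=2.0 is the default).
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
print(S.shape)  # (n_mels, t); n_mels defaults to 128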

Section 2

Basic Audio Data Augmentation

For now, we will write three functions to apply new effects to a given audio file: “add_noise”, “shift”, and “stretch”. In “add_noise”, we add random noise generated with NumPy to the given audio. In “shift”, we shift the given audio data along the time axis, again using NumPy. Lastly, “stretch” applies librosa.effects.time_stretch. The functions below cover the basics of audio augmentation.
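Here is a minimal sketch of the three functions; the noise factor, shift amount, and stretch rate are illustrative defaults, not prescribed values:

import numpy as np
import librosa

def add_noise(data, noise_factor=0.005):
    # Add random Gaussian noise generated by NumPy to the signal.
    noise = np.random.randn(len(data))
    return (data + noise_factor * noise).astype(data.dtype)

def shift(data, n_samples=1600):
    # Shift the audio along the time axis by n_samples (with wrap-around).
    return np.roll(data, n_samples)

def stretch(data, rate=0.8):
    # Stretch the audio in time without changing its pitch.
    return librosa.effects.time_stretch(data, rate=rate)

# Usage, assuming the hypothetical sample.wav:
y, sr = librosa.load("sample.wav")
y_noisy = add_noise(y)
y_shifted = shift(y)
y_stretched = stretch(y)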

In the next part, I will explain the audiomentations library for doing more extensive audio data augmentation.
