Let’s Augment Audio Data 🔊 Part 1

Vijay Anandan · Published in Analytics Vidhya · Mar 29, 2021 · 5 min read


One of the biggest challenges in Automatic Speech Recognition is the preparation and augmentation of audio data. Audio data can be analyzed in the time or frequency domain, which adds complexity compared with other data sources such as images and text.

What is Augmentation?

Deep neural networks have achieved state-of-the-art performance in many artificial intelligence fields, such as image classification, object detection, and audio classification. However, they usually need a very large amount of labelled data to obtain good results, and such data might not be available due to high labelling costs or the scarcity of samples. Data augmentation is a powerful tool for improving the performance of neural networks. It consists of modifying the original samples to create new ones without changing their labels. This leads to a much larger training set and, hence, to better results.

(cleaned & labeled data (“Better Data” == “Better Outcome”)) != (uncleaned & unlabeled data (“garbage in” == “garbage out”))

On the internet, there are plenty of stories, tutorials, and code samples about image and text augmentation techniques, and we have powerful libraries for image augmentation, such as Albumentations and imgaug. In contrast, there is far less information about audio data augmentation techniques.

Let’s dive deep into the audio signal processing and augmentation pool 🔊

Before getting started, we will split the blog into two parts. In the first section, we will discuss the basics of libROSA and working with audio data, and in the second part, we will discuss audio augmentation.

Section 1

An introduction to libROSA for working with audio:

  1. Loading audio file
  2. Audio timeline
  3. Audio plotting
  4. Tempo estimation
  5. Finding the pitch level
  6. Computing the mel-scaled spectrogram

Prerequisites:

To install libROSA, you just need to run the following command in your command line:

pip install librosa

In your Python code, you can import it as:

import librosa as lr

We will use the Matplotlib library for plotting the results and the NumPy library to handle the data as arrays.

Loading your audio file:

The first step in our analysis is to load an audio file into our code. This is done using the librosa.core.load() function. Audio will be automatically resampled to the given rate (default = 22050 Hz). To preserve the native sampling rate of the file, pass sr=None. A minimal loading example follows the parameter list below.

librosa.core.load() takes the following parameters:
  • path: the path to the audio file (a string)
  • sr: the target sampling rate
  • mono: a boolean (True/False) option to convert the signal to mono
  • offset: a floating-point number giving the start time (in seconds) at which to begin reading the file
  • duration: a floating-point number giving how many seconds of the file to load
  • dtype: the numeric representation of the data; can be float32, float16, int8, and others
  • res_type: the type of resampling (one option is kaiser_best)
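As a minimal sketch, assuming a local file named sample.wav (a hypothetical path):

import librosa

# Load the file, resampling to the default 22050 Hz; sr=None would
# keep the file's native sampling rate instead.
y, sr = librosa.load("sample.wav", sr=22050, mono=True)
print(y.shape, sr)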

Timeline for your audio

In this code, we will print the timeline of the audio file. We will simply load the audio, convert it into a NumPy array, and print the time of each sample (obtained by dividing the sample index by the sampling rate).

Audio timeline
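A minimal sketch of this step, again assuming the hypothetical sample.wav:

import numpy as np
import librosa

y, sr = librosa.load("sample.wav")

# Each sample index divided by the sampling rate gives its time in seconds.
timeline = np.arange(len(y)) / sr
print(timeline)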

Output :

[0.00000000e+00 4.53514739e-05 9.07029478e-05 ... 2.63027211e+01
2.63027664e+01 2.63028118e+01]

Plotting the audio:

Plotting the audio as time vs. sound amplitude
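A sketch of the plotting code, under the same assumptions:

import numpy as np
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("sample.wav")

# x-axis in seconds, y-axis is the raw sample amplitude.
time = np.arange(len(y)) / sr
plt.plot(time, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()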

Output : a plot of the waveform (time vs. amplitude).

Plotting and estimating the tempo

Tempo was originally used to describe the timing of music, i.e., the speed at which a piece of music is played; it is measured in beats per minute (BPM).

librosa.beat.tempo() returns the tempo as an array. Its parameters are:

  • y: audio time series
  • sr: sampling rate of the time series
  • onset_envelope: pre-computed onset strength envelope
  • hop_length: hop length of the time series
  • start_bpm: initial guess of the BPM
  • std_bpm: standard deviation of tempo distribution
  • ac_size: length (in seconds) of the auto-correlation window
  • max_tempo: estimate tempo below this threshold
  • aggregate: for estimating global tempo. If None, then tempo is estimated independently for each frame.
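A minimal sketch using these parameters (hypothetical sample.wav again):

import librosa

y, sr = librosa.load("sample.wav")

# Estimate the global tempo (in BPM) from the onset strength envelope.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo = librosa.beat.tempo(onset_envelope=onset_env, sr=sr)
print(tempo)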
Output :

[112.34714674]

Finding and plotting the pitch

The sensation of a frequency is commonly referred to as the pitch of a sound. A high-pitched sound corresponds to a high-frequency sound wave, and a low-pitched sound corresponds to a low-frequency sound wave. Pitch can be tracked with librosa.piptrack(), which takes the following parameters:

  • y: audio signal
  • sr: audio sampling rate of y
  • S: magnitude or power spectrogram
  • n_fft: number of FFT bins to use, if y is provided.
  • hop_length: number of samples to hop
  • threshold: A bin in spectrum S is considered a pitch when it is greater than threshold * ref(S)
  • fmin: lower frequency cutoff.
  • fmax: upper frequency cutoff.
  • win_length: Each frame of audio is windowed by window().
  • ref: reference value for the threshold (a scalar or callable)

It returns:

  • pitches: np.ndarray [shape=(d, t)]
  • magnitudes: np.ndarray [shape=(d, t)]
    where d is the subset of FFT bins within fmin and fmax.
  1. pitches[f, t] contains instantaneous frequency at bin f, time t.
  2. magnitudes[f, t] contains the corresponding magnitudes.
  3. Both pitches and magnitudes take the value 0 at bins of non-maximal magnitude.
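A minimal sketch:

import librosa

y, sr = librosa.load("sample.wav")

# Track pitch candidates; bins of non-maximal magnitude are set to 0.
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
print(pitches)
print(magnitudes)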

Output :

[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
(pitches above, magnitudes below)
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]

Computing the mel-scaled spectrogram

A mel-scaled spectrogram is an acoustic time-frequency representation of a sound. It is computed with librosa.feature.melspectrogram(), which takes the following parameters:

  • y: audio time-series
  • sr: sampling rate of y
  • S: spectrogram
  • n_fft: length of the FFT window
  • hop_length: number of samples between successive frames. See librosa.core.stft
  • win_length: Each frame of audio is windowed by window().
  • power: Exponent for the magnitude melspectrogram. e.g., 1 for energy, 2 for power, etc.
  • kwargs: additional keyword arguments (Mel filter bank parameters). See librosa.filters.mel for details.

It returns:

  1. S: np.ndarray [shape=(n_mels, t)]
    The mel spectrogram
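A minimal sketch:

import librosa

y, sr = librosa.load("sample.wav")

# Compute a mel-scaled power spectrogram (power=2.0 is the default).
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
print(S.shape)  # (n_mels, t); n_mels defaults to 128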

Section 2

Basic Audio Data Augmentation

For now, we will write three functions to apply new effects to a given audio file: “add_noise”, “shift”, and “stretch”. In “add_noise”, we add random noise generated with NumPy to the given audio. In “shift”, we shift the given audio data along the time axis, again using NumPy. Lastly, “stretch” applies librosa.effects.time_stretch. The functions below cover the basics of audio augmentation.
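Here is a minimal sketch of the three functions; the noise factor, shift amount, and stretch rate are illustrative defaults, not prescribed values:

import numpy as np
import librosa

def add_noise(data, noise_factor=0.005):
    # Add random Gaussian noise generated by NumPy to the signal.
    noise = np.random.randn(len(data))
    return (data + noise_factor * noise).astype(data.dtype)

def shift(data, n_samples=1600):
    # Shift the audio along the time axis by n_samples (with wrap-around).
    return np.roll(data, n_samples)

def stretch(data, rate=0.8):
    # Stretch the audio in time without changing its pitch.
    return librosa.effects.time_stretch(data, rate=rate)

# Usage, assuming the hypothetical sample.wav:
y, sr = librosa.load("sample.wav")
y_noisy = add_noise(y)
y_shifted = shift(y)
y_stretched = stretch(y)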

In the next part, I will explain the audiomentations library for doing more extensive audio data augmentation.
