Creating Spectrograms From Audio Files Using Python | Alakise

I needed an audio spectrogram generator for a machine learning algorithm I wanted to build, but every piece of code I found was incomplete, outdated, or incorrect.

The problem with all the code I encountered was that the output was always at a fixed, standard resolution. I arranged the output resolution to match the maximum resolution the audio file can provide, which should make it a good fit for analysis (though I'm not certain of that). I also standardized the expression of sound intensity as dBFS.

The script only supports mono, 16-bit, 44.1 kHz .wav files, but it is easy to convert audio files to this format using various websites or tools.

One day I will also write a Python script for bulk conversion of audio files to this format.



The code was tested with SciPy 1.3.1, NumPy 1.17.0 and Matplotlib 3.1.1 under Windows 10 with Python 3.7 and Python 3.5. Similar versions of these libraries will probably work as well.


Audio files are essentially records of a sound signal's level, sampled periodically over time. I want to touch on these notions first.

Sound Level
Differences in signal or sound levels are measured in decibels (dB), so a difference of 10 dB means that one signal or sound is 10 decibels louder than another. It is a relative scale: a dB value always expresses a ratio between two levels.

It is a common error to say, for instance, that the sound level of human speech is 50-60 dB. The level of human speech is 50-60 dB SPL (Sound Pressure Level), where 0 dB SPL is the reference level: the hearing threshold of an average person, below which sound is imperceptible. But dB SPL relates only to actual sound, not to signals.

I've scaled the plot to dBFS. FS stands for 'Full Scale': 0 dBFS is the highest signal level achievable in a digital audio file, and all levels in an audio file are relative to this value.
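As a quick illustration (a minimal sketch, not part of the script below): for 16-bit audio, full scale is 2¹⁵ = 32768, so the level of a sample magnitude in dBFS can be computed like this:

```python
import numpy as np

def to_dbfs(sample, full_scale=2 ** 15):
    """Level of a sample magnitude relative to 16-bit full scale, in dBFS."""
    return 20 * np.log10(np.abs(sample) / full_scale)

print(to_dbfs(32768))   # a full-scale sample sits at 0.0 dBFS
print(to_dbfs(3276.8))  # one tenth of full scale is -20.0 dBFS
```

Since the ratio can never exceed 1 in a digital file, every level comes out as 0 dBFS or below.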

Frequency
Audio frequency is the speed of the sound's vibration, which determines the pitch of the sound. Even if you are not familiar with audio processing, this notion is widely known.

Sampling Rate
The sampling rate is the number of times per second that the amplitude of the signal is measured, and so has dimensions of samples per second. If you divide the total number of samples in the audio file by its sampling rate, you get the total duration of the audio file. For further information about sampling rates, search for the "Nyquist-Shannon sampling theorem".
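For example (a toy sketch using a synthetic buffer rather than a real file):

```python
import numpy as np

sample_rate = 44100                    # samples per second
samples = np.zeros(3 * sample_rate)    # a silent 3-second buffer at 44.1 kHz
duration = len(samples) / sample_rate  # total samples / sampling rate
print(duration)  # 3.0 seconds
```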

Fourier Transform and Short Time Fourier Transform
If we view a sound wave as a time-amplitude graph, we cannot obtain information about the frequency domain, and if we apply a Fourier transform to the whole wave, we lose the time domain. In short, the time representation obfuscates frequency, and the frequency representation obfuscates time. A meaningful image in terms of frequency, sound intensity and time can therefore be obtained by applying the Fourier transform to short interval parts of the sound data; this is called the short-time Fourier transform (STFT).
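The script below implements its own STFT, but the idea can be sanity-checked with SciPy's built-in `scipy.signal.stft` (which the script does not use): feed it a pure 1 kHz sine and the strongest frequency bin should land within one bin width of 1 kHz.

```python
import numpy as np
from scipy import signal

fs = 44100
t = np.arange(fs) / fs            # one second of time stamps
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz sine wave

# f: bin frequencies (Hz), times: frame centers (s), Zxx: complex STFT matrix
f, times, Zxx = signal.stft(x, fs=fs, nperseg=1024)

# the strongest bin, averaged over frames, should sit near 1 kHz
peak_freq = f[np.abs(Zxx).mean(axis=1).argmax()]
print(peak_freq)  # within one bin width (fs/1024 ≈ 43 Hz) of 1000 Hz
```

Each column of `Zxx` is the spectrum of one short, windowed slice of the signal, which is exactly the time-frequency trade-off described above.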


I tested the script with the audio files in the "examples" folder. The following images show two of them:

1 kHz-10 kHz sweep at -40 dBFS
Me saying "Merhaba Dünya" (Turkish for "hello world")


You can run the code from the command line (assuming the script is saved as spectrogram.py):

python spectrogram.py "examples/1kHz-20dbFS.wav" l   # opens labelled output in a window
python spectrogram.py "examples/1kHz-20dbFS.wav" ls  # saves labelled output
python spectrogram.py "examples/1kHz-20dbFS.wav" s   # saves unlabelled output
python spectrogram.py "examples/1kHz-20dbFS.wav"     # opens unlabelled output in a window
1kHz-20dbFS.wav – a 1 kHz sine wave at -20 dBFS for 3 seconds – looks like this:

The last command-line argument can contain two letters: 'l' for labelled output and 's' for save; combine them ('ls') to save a labelled image. Set your output folder in the code and make sure it exists.

import sys
import numpy as np
from scipy.io import wavfile as wav
import ntpath

from numpy.lib import stride_tricks
from matplotlib import pyplot as plt

output_folder = 'outputs'  # set your output folder and make sure it exists

# Short-time Fourier transform (STFT)
def stft(sig, frame_size, overlap_factor=0.5, window=np.hanning):
    win = window(frame_size)
    hop_size = int(frame_size - np.floor(overlap_factor * frame_size))

    # zeros at beginning (thus center of 1st window should be for sample nr. 0)   
    samples = np.append(np.zeros(int(np.floor(frame_size / 2.0))), sig)
    # cols for windowing
    cols = np.ceil((len(samples) - frame_size) / float(hop_size)) + 1
    # zeros at end (thus samples can be fully covered by frames)
    samples = np.append(samples, np.zeros(frame_size))

    frames = stride_tricks.as_strided(samples, shape=(int(cols), frame_size), strides=(samples.strides[0] * hop_size, samples.strides[0])).copy()
    frames *= win

    return np.fft.rfft(frames)    

def log_scale_spec(spec, sr=44100, factor=20.):
    time_bins, frequency_bins = np.shape(spec)

    scale = np.linspace(0, 1, frequency_bins) ** factor
    scale *= (frequency_bins-1)/max(scale)
    scale = np.unique(np.round(scale))

    # Creates spectrogram with new frequency bins
    new_spectrogram = np.complex128(np.zeros([time_bins, len(scale)]))
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            new_spectrogram[:, i] = np.sum(spec[:, int(scale[i]):], axis=1)
        else:
            new_spectrogram[:, i] = np.sum(spec[:, int(scale[i]):int(scale[i+1])], axis=1)

    # Lists center frequency of bins
    all_frequencies = np.abs(np.fft.fftfreq(frequency_bins*2, 1./sr)[:frequency_bins+1])
    frequencies = []
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            frequencies += [np.mean(all_frequencies[int(scale[i]):])]
        else:
            frequencies += [np.mean(all_frequencies[int(scale[i]):int(scale[i+1])])]

    return new_spectrogram, frequencies

def plot_audio_spectrogram(audio_path, binsize=2**10, plot_path=None, argv = '', colormap="jet"):
    sample_rate, samples = wav.read(audio_path)
    s = stft(samples, binsize)
    new_spectrogram, freq = log_scale_spec(s, factor=1.0, sr=sample_rate)
    # dBFS: 10e+6 roughly equals (2**15) * sum(window)/2, the peak STFT magnitude
    # of a full-scale 16-bit sine at binsize 1024
    data = 20. * np.log10(np.abs(new_spectrogram) / 10e+6)

    time_bins, freq_bins = np.shape(data)

    print("Time bins: ", time_bins)
    print("Frequency bins: ", freq_bins)
    print("Sample rate: ", sample_rate)
    print("Samples: ",len(samples))
    # Alternative: tie horizontal resolution to audio length in seconds
    # (samples / sample rate = audio length). If you use it, I highly recommend
    # switching to "gaussian" interpolation below.
    # plt.figure(figsize=(len(samples) / sample_rate, freq_bins / 100))
    plt.figure(figsize=(time_bins/100, freq_bins/100))  # resolution equal to audio data resolution at the default dpi=100
    plt.imshow(np.transpose(data), origin="lower", aspect="auto", cmap=colormap, interpolation="none")

    # Axis limits
    plt.xlim([0, time_bins-1])
    plt.ylim([0, freq_bins])

    if 'l' in argv:  # Add labels
        plt.xlabel("Time (s)")
        plt.ylabel("Frequency (Hz)")
        x_locations = np.float32(np.linspace(0, time_bins-1, 10))
        plt.xticks(x_locations, ["%.02f" % l for l in ((x_locations*len(samples)/time_bins)+(0.5*binsize))/sample_rate])
        y_locations = np.int16(np.round(np.linspace(0, freq_bins-1, 20)))
        plt.yticks(y_locations, ["%.02f" % freq[i] for i in y_locations])
    else:  # No labels
        plt.axis('off')

    if 's' in argv:  # Save to file
        plt.savefig(plot_path, bbox_inches="tight", pad_inches=0)
        print('Output saved as ' + plot_path)
    else:  # Show in a window
        print('Opening graphic interface...')
        plt.show()
    plt.clf()


    return data

if len(sys.argv) > 2:
    ims = plot_audio_spectrogram(sys.argv[1], 2**10, output_folder + '/' + ntpath.basename(sys.argv[1]).replace('.wav', '') + '.png', sys.argv[2])
else:
    ims = plot_audio_spectrogram(sys.argv[1], 2**10, None, '')
