Audio signals found in the wild contain multi-scale dependencies that prove particularly difficult to model, leading many previous efforts at data-driven audio synthesis to focus on more constrained domains such as texture synthesis or training small parametric models.
NSynth is an audio dataset containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI pian o (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
We encourage the broader community to use NSynth as a benchmark and entry point into audio machine learning.