Data
Genre Annotations
All four training genre datasets are distributed as TSV files with the following format:
[RecordingID] [ReleaseGroupID] [genre/subgenre label] [genre/subgenre label] ...
A real data example:
6bb7e980-791c-44b5-9024-cc7c90bc8230 969ebfe8-0786-3ee0-b49b-3005fe653aa4 metal metal---heavymetal metal---progressivemetal rock rock---progressiverock
92a70a47-98c4-43fd-8b1f-972657f627c3 7378d3cf-a3a9-3fe3-825b-70d6f0230250 country country---countryfolk folk
c7bee376-0020-461a-90a7-d5af73cfff05 6e652b2f-6f94-47ef-834a-a85d25921fce soul
93597a3e-cdca-4123-bcf5-343ff8debbe2 47de1259-bdeb-3f11-b612-4976887dca5c pop pop---ballad
a4d017d4-e75b-4eac-8f46-1b000ef407b0 9b1640de-4eb7-3071-b6a3-1c6f76c1a1b4 electronic electronic---ambient electronic---downtempo pop rock rock---indie rock---spacerock
27b7cf35-0238-4316-b2fd-c589a866603a b6f21355-5e8e-33f7-acbf-03d99e9e90f9 electronic electronic---bigbeat electronic---techno
Each line corresponds to one recording (a music track or song), and contains all its ground-truth genre and subgenre labels. recordingmbid
is the MusicBrainz identifier of the particular recording. To distinguish between genre and subgenre labels, subgenre strings are compound and contain ---
as a separator between a parent genre and an actual subgenre name. For example, rock
, electronic
, jazz
and hip hop
are genres, while electronic---ambient
, rock---singersongwriter
and jazz---latinjazz
are subgenres.
Additionally, we provide releasegroupmbid
for each recording, which is a MusicBrainz identifier of a release group (an album, single, or compilation) that it belongs to. This data may be useful if one wants to avoid an “album effect” [4], which consists in potential overestimation of the performance of a classifier when a test set contains music recordings from the same albums as the training set.
Groundtruth files have a header
recordingmbid releasegroupmbid genre1 genre2 ... genren
to show that the first two columns contain MusicBrainz IDs, and subsequent columns contain genre annotations. As the number of annotations per recording differ, this header contains as many rows as necessary to provide a header to the row with the most annotations. Additionally, rows with fewer genre annotations are padded with the field separator (a tab) to ensure that all rows have the same number of columns. You should ensure that you remove “empty” annotations if your preferred tool to read these files does not do this automatically.
Genre annotations are ordered alphabetically. There is no correlation between the annotations of two different recordings in the same column.
Music features
We provide a dataset of music features precomputed from audio for every music recording. The dataset can be downloaded as an archive. It contains a JSON file with music features for every RecordingID
. See an example JSON file.
All music features are taken from the community-built database AcousticBrainz and were extracted from audio using Essentia, an open-source library for music audio analysis [2]. They are grouped into categories (low-level, rhythm, and tonal) and are explained in detail here. Only statistical characterization of time frames is provided (bag of features), no frame-level data is available.
Development and Test Data
The development data contains:
- music features for all recordings in AllMusic, Discogs, Lastfm and Tagtraum datasets (~30GB bz2 archives, 83GB uncompressed JSON files). Each filename corresponds to a
RecordingID
(which is a UUID). They are split into 8 separate archives according to the first hex digits of their RecordingIDs. - Because there is substantial overlap between the Recordings in each dataset, we provide a single series of archives which contain data for all datasets.
- All archives will uncompress into a directory named
acousticbrainz-mediaeval-train
. Data files are named in the form54/54551aad-fb76-4e22-8725-fd495c32b155.json
, where the file is inside a subdirectory named by the first two letters of its RecordingId. - You may find that the data files have a value in the
metadata.tags.musicbrainz_recordingid
field which is different to the RecordingID used in the filename. This is to be expected due to Musicbrainz ID redirects. - four archives with ground-truth genre annotations (AllMusic, Discogs, Lastfm, Tagtraum - see format description above)
The test data contains four archives of music features for recordings with anonymized RecordingIDs. To avoid a potential album effect [4], no recording in the test set contains music from the same release groups as the recordings in the train set.
- Although RecordingIDs are UUIDs, they have been randomly anonymised and do not correspond to any MusicBrainz IDs on musicbrainz.org
All data is compressed with bzip2. Checksums are provided to ensure that you have correctly downloaded the archives.
Download
The development and test data for Discogs, Lastfm and Tagtraum is now publicly available on Zenodo.
The development data (genre ground truth) and test data for AllMusic requires signing the Data Usage agreement by participants and is also available on Zenodo.
Notes
To give an idea of the scale of the data, we report some statistics for the train datasets.
AllMusic:
- 1353213 recordings by 163654 releasegroups
- 21 genres, 745 subgenres
- 1.33 genres and 3.15 subgenres per recording on average
- genre/subgenre distribution
Discogs:
- 904944 recordings by 118475 releasegroups
- 15 genres, 300 subgenres
- 1.37 genres and 1.69 subgenres per recording on average
- genre/subgenre distribution
Lastfm:
- 566710 recordings by 115161 releasegroups
- 30 genres, 297 subgenres
- 1.14 genres and 1.28 subgenres per recordings on average
- genre/subgenre distribution
Tagtraum:
- 486740 recordings by 69025 releasegroups
- 31 genres, 265 subgenres
- 1.13 genres and 1.72 subgenres per recording on average
- genre/subgenre distribution
Genre/subgenre taxonomy and distribution in terms of recordings and releasegroups for all four development datasets are reported here.
The datasets are partially intersected.
Note that the data we provide is very large-scale. It includes a large number of music recordings and many of music features for those recordings. Participants are free to use all of the data to train their systems or only its part.
Please, contact the organizers if you have further questions or need help.