Results

The final submissions (115 runs) are:

For the baseline:

The solid grey lines mark the performance of the random baseline, and the dotted lines mark the performance of the popularity-based baseline. The numbers indicate whether a run belongs to Task 1 or Task 2.

All results are also available in a CSV file.

Per-track average scores

Per-track Precision vs Recall for all labels

Figure panels: AllMusic, Discogs, Lastfm, Tagtraum (per track, all labels).

Per-track Precision vs Recall for genre labels

Figure panels: AllMusic, Discogs, Lastfm, Tagtraum (per track, genre labels).

Per-track Precision vs Recall for subgenre labels

Figure panels: AllMusic, Discogs, Lastfm, Tagtraum (per track, subgenre labels).

Per-label average scores

Per-label Precision vs Recall for all labels

Figure panels: AllMusic, Discogs, Lastfm, Tagtraum (per label, all labels).

Per-label Precision vs Recall for genre labels

Figure panels: AllMusic, Discogs, Lastfm, Tagtraum (per label, genre labels).

Per-label Precision vs Recall for subgenre labels

Figure panels: AllMusic, Discogs, Lastfm, Tagtraum (per label, subgenre labels).

Results adjusted by genre-subgenre hierarchies

The submissions to the task were required to include all predicted genres and subgenres explicitly. Therefore, we did not explicitly consider hierarchical relations in the evaluation.

We conducted an additional evaluation that adjusts for these relations, because most submissions did not explicitly predict the parent genres of their predicted subgenres. In these cases, we expanded the predictions to include the corresponding genres, even when they were missing from the original submission. This correction may increase genre recall and alter precision, because the expanded predictions contain more genres, both relevant and irrelevant. Note that the results at the subgenre level do not change.
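The expansion step can be sketched as follows. This is a minimal illustration and not the evaluation script itself: the `---` separator between a genre and its subgenre, the function names, and the example labels are all assumptions made for the sketch.

```python
# Assumed label convention: a subgenre is written as "genre---subgenre".
SEP = "---"


def expand_with_parent_genres(predicted):
    """Return the predicted labels plus the parent genre of every subgenre."""
    expanded = set(predicted)
    for label in predicted:
        if SEP in label:
            # Keep only the part before the separator, i.e. the parent genre.
            expanded.add(label.split(SEP, 1)[0])
    return expanded


def precision_recall(predicted, truth):
    """Precision and recall of one track's predicted label set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical track: the submission predicts two subgenres but no genres.
truth = {"rock", "rock---indie rock", "electronic"}
predicted = {"rock---indie rock", "pop---synth-pop"}
expanded = expand_with_parent_genres(predicted)

before = precision_recall(predicted, truth)
after = precision_recall(expanded, truth)
```

In this toy case the expansion adds one relevant genre (`rock`) and one irrelevant genre (`pop`), so recall rises while precision stays the same. With other predictions, precision can move in either direction.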

The plots below show precision, recall and F-scores with and without label expansion. Inspection of these results revealed no significant difference in performance. Recall changes very little, with the exception of ICSI. Even so, its F-scores remain virtually the same because of its low precision.
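For readers unfamiliar with the two averaging modes used in the plots, the sketch below contrasts per-track and per-label averaging of precision, recall and F-score. It is an illustration under stated assumptions, not the official evaluation code: the function names and the toy data are hypothetical, and averaging over the labels that occur in the ground truth is an assumption of this sketch.

```python
from collections import defaultdict


def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def per_track_scores(pred, truth):
    """Compute the scores for each track, then average over tracks."""
    sums = [0.0, 0.0, 0.0]
    for track, true_labels in truth.items():
        predicted = pred.get(track, set())
        tp = len(predicted & true_labels)
        p = tp / len(predicted) if predicted else 0.0
        r = tp / len(true_labels) if true_labels else 0.0
        for i, value in enumerate((p, r, f_score(p, r))):
            sums[i] += value
    return tuple(s / len(truth) for s in sums)


def per_label_scores(pred, truth):
    """Compute the scores for each label, then average over labels."""
    tp = defaultdict(int)
    n_pred = defaultdict(int)
    n_true = defaultdict(int)
    for track, true_labels in truth.items():
        predicted = pred.get(track, set())
        for label in predicted:
            n_pred[label] += 1
        for label in true_labels:
            n_true[label] += 1
        for label in predicted & true_labels:
            tp[label] += 1
    sums = [0.0, 0.0, 0.0]
    # Assumption: only labels present in the ground truth are averaged.
    for label in n_true:
        p = tp[label] / n_pred[label] if n_pred[label] else 0.0
        r = tp[label] / n_true[label]
        for i, value in enumerate((p, r, f_score(p, r))):
            sums[i] += value
    return tuple(s / len(n_true) for s in sums)


# Hypothetical two-track example.
truth = {"t1": {"rock", "pop"}, "t2": {"rock"}}
pred = {"t1": {"rock"}, "t2": {"rock", "jazz"}}

track_avg = per_track_scores(pred, truth)
label_avg = per_label_scores(pred, truth)
```

Per-track averaging weights every recording equally, whereas per-label averaging weights every label equally, so rarely used labels influence the per-label scores much more strongly.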

All results are also available in a CSV file.

Per-track F-score, Precision and Recall

Figure panels: Precision, Recall, F-score (per track; all labels and genre labels).

Per-label F-score, Precision and Recall

Figure panels: Precision, Recall, F-score (per label; all labels and genre labels).