Results
The final submissions are (115 runs):

baseline (8 runs): 2 runs for task1 on all 4 datasets.
dbis (40 runs): 5 runs for task1 and 5 runs for task2 on all 4 datasets.
icsi (15 runs): 5 runs for task1 on 3 datasets (no allmusic).
jku (32 runs): 3 runs for task1 and 5 runs for task2 on all 4 datasets.
kart (16 runs): 4 runs for task1 on all 4 datasets.
samirit (4 runs): 1 run for task1 on all 4 datasets.
For the baseline:
run1 is random, following the distribution of labels found in the development sets.
run2 always predicts the most popular genre in the development set.
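For reference, a minimal sketch of how these two baselines could be implemented, assuming dev_labels is a flat list of genre annotations from the development set; the function names are illustrative and not the actual task code.

```python
import random
from collections import Counter

def random_baseline(dev_labels, n_predictions=1):
    """run1: sample labels following their frequency in the development set."""
    counts = Counter(dev_labels)            # label -> frequency in the dev set
    labels, weights = zip(*counts.items())  # keep labels and weights aligned
    return random.choices(labels, weights=weights, k=n_predictions)

def popularity_baseline(dev_labels):
    """run2: always predict the single most popular genre in the development set."""
    return [Counter(dev_labels).most_common(1)[0][0]]
```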
The solid grey lines mark the performance of the random baseline, and the dotted lines mark the performance of the popularity-based baseline. The numbers indicate whether the run belongs to task1 or task2.
All results are also available in a CSV file.
See the complete MediaEval 2017 proceedings, including the working notes papers for all submissions, here.
Per-track average scores
Per-track Precision vs Recall for all labels
Per-track Precision vs Recall for genre labels
Per-track Precision vs Recall for subgenre labels
Per-label average scores
Per-label Precision vs Recall for all labels
Per-label Precision vs Recall for genre labels
Per-label Precision vs Recall for subgenre labels
Results adjusted by genre-subgenre hierarchies
The submissions to the task were required to include all predicted genres and subgenres explicitly. Therefore, we did not explicitly consider hierarchical relations in the evaluation.
However, most submissions did not explicitly predict the genres of the predicted subgenres, so we conducted an additional evaluation with an adjustment for such relations. In these cases, we expanded all predictions to also include the corresponding genres, even if they were missing in the original submissions. Such a correction may increase genre recall and alter precision, because more genres, both relevant and irrelevant, are present in the predictions. Note that the results at the subgenre level do not change.
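A minimal sketch of this expansion step, assuming subgenre labels follow the "genre---subgenre" notation used in the task's ground truth; the function name and separator handling are illustrative rather than the actual evaluation code.

```python
def expand_with_genres(predicted_labels, separator="---"):
    """Add the parent genre of every predicted subgenre to the prediction set."""
    expanded = set(predicted_labels)
    for label in predicted_labels:
        if separator in label:
            genre = label.split(separator)[0]
            expanded.add(genre)  # may add a genre missing from the submission
    return expanded

# Example: a prediction containing only a subgenre gains its parent genre.
print(sorted(expand_with_genres({"rock---punk", "jazz"})))
# ['jazz', 'rock', 'rock---punk']
```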
The plots below show Precision, Recall and F-scores with and without label expansion. Inspection of these results revealed no significant difference in performance. Recall changes very little, except for ICSI, whose F-scores nevertheless remain virtually the same due to its low precision.
All results are also available in a CSV file.
Per-track F-score, Precision and Recall
Per-label F-score, Precision and Recall