Automatic songwriting aims to generate lyrics and/or melodies to aid human music creation. In this study, we address long-range lyric-and-melody generation, which has received less attention than lyric-to-melody and melody-to-lyric generation. We propose a novel unified model that effectively integrates multi-modal features and generates lyrics and melody simultaneously. To accommodate much longer sequences, we employ four transformer decoders to separately model the lyrics and three note values. Both qualitative and quantitative results show that our method creates coherent lyric-melody pairs over a much longer context.
Publisher
Ulsan National Institute of Science and Technology