1. Introduction
In recent years, significant advances have been made in automatic music generation owing to the development of machine learning, and numerous algorithms and techniques have been developed to produce high-quality original music [1,2]. Music generation methods can be broadly classified into rule-based and data-driven approaches.
Initially, owing to a lack of computational resources and massive data support, rule-based approaches relying on predefined rules and constraints were used to generate music in various styles, such as classical, jazz, and pop. These approaches often use musical knowledge representation systems to encode musical concepts and rules such as tonality, rhythm, and harmony [3,4,5,6]. However, analyzing music theory and encoding it into handcrafted features is difficult in rule-based approaches. Furthermore, even once rules are defined, it is challenging to apply them to other genres, styles, and instruments.
Data-driven approaches can exploit the large increase in data volume far more efficiently than rule-based methods, employing deep learning techniques to analyze and generate music from datasets. Generative adversarial networks (GANs) have significantly advanced the music generation field. For example, the continuous recurrent neural network with adversarial training (C-RNN-GAN) [7], which is composed of a long short-term memory (LSTM)-based generator and a bidirectional-LSTM-based discriminator, generates classical music by learning from a training dataset. SeqGAN [8] is another GAN-based model that uses reinforcement learning (RL) for sequential generation and can be trained on data consisting of sequences of discrete tokens. SeqGAN models the generator as a stochastic policy in RL to bypass the generator differentiation problem and performs gradient policy updates directly. Because it is designed for generic sequence generation, it can be trained on various data types. OR-GAN [9] aims to improve the quality of the samples generated by SeqGAN by combining adversarial training with expert-based rewards and RL, so that the generated samples retain the information learned from the data: sample diversity is preserved while the desired metrics improve. The effectiveness of these approaches has been demonstrated in the generation of molecules encoded as text sequences and of musical melodies. However, these approaches cannot handle long-distance sequential data owing to the backpropagation-through-time training of the LSTM.
In addition to combinations of RNNs and GANs, MidiNet [10] generates music in the symbolic domain using a convolutional neural network (CNN)-based GAN model, allowing it to produce realistic symbolic music. The model incorporates a conditional mechanism to generate melodies based on a chord sequence or by conditioning on the previous bars of the melody. MuseGAN [11] is another CNN-based GAN model that generates symbolic multitrack music. It includes the jamming, composer, and hybrid models, which are designed to address the unique challenges of music generation, such as its temporal nature and the interdependence of multiple tracks. Inco-GAN [12] uses a CNN-based inception model for the conditional generation of polyphonic music whose length can be freely adjusted. In addition, ChordGAN [13] is a chord-conditioned melody generation approach that uses a conditional GAN architecture and appropriate loss functions, similar to image-to-image translation algorithms. It can serve as a tool for musicians to study compositional techniques across styles using the same chords and to generate music automatically.
Although the use of CNNs has improved the ability of models to extract local features, the gradient-vanishing problem that arises when processing long sequences remains unaddressed. Moreover, balancing the generator and discriminator during adversarial training is difficult, making it challenging to guarantee the convergence of GANs.
The denoising diffusion probabilistic model (DDPM) [14] is a probabilistic generative model comprising a forward and a reverse process. A key benefit of DDPM is that it can denoise data without explicit labels or supervision, which makes it particularly useful in unsupervised settings where labeled data are difficult or impossible to obtain. The model helps to address the convergence issues of generative models and has achieved promising results in high-quality image generation. Moreover, the transformer module, used extensively in language modeling, allows a model to attend to all input tokens simultaneously and to weigh their importance via the attention mechanism. Thus, the model can efficiently capture long-term dependencies and make more accurate predictions. Transformers also process sequences in parallel, making them more efficient and faster to train than RNNs.
This paper proposes a method for chord-conditioned melody generation using a transformer-based diffusion model known as MelodyDiffusion. The following three modifications are made to the traditional diffusion model. First, in the forward process, the inputs of MelodyDiffusion are not continuous image-like data but rather discrete sequences, and no pre-trained variational autoencoder (VAE) is required to map the discrete data to a continuous latent space. Second, in the reverse process, the U-nets of the traditional diffusion model are replaced with transformers, enabling the model to handle discrete sequences and to consider long-distance dependencies. Finally, a transformer-based encoder is used to embed the conditions guiding the reverse process, which allows for chord-conditioned music generation.
The main contributions of this paper are summarized as follows: (1) a novel diffusion model that operates directly on discrete data is designed, such that the diffusion model is not limited to image generation. (2) MelodyDiffusion uses transformers instead of U-nets to handle discrete sequences. Furthermore, a transformer-based encoder is developed to realize chord-conditioned melody generation. (3) The experiments reveal that MelodyDiffusion can generate diverse melodies based on the given chords.
3. Chord-Conditioned Melody Generation Method
This paper proposes a method for chord-conditioned melody generation using a transformer-based diffusion model known as MelodyDiffusion. We first describe the representations of the melody and chords. Subsequently, we explain the structure and operation of the transformer-based diffusion model.
3.1. Data Representation
The music corpus OpenEWLD [24], a dataset consisting of lead sheets in XML format, is used during training. As illustrated in Figure 1, a piano roll is obtained from each lead sheet using the Python library pypianoroll. In the piano roll, each measure is divided into 40 parts, so each part spans 0.025 of a measure; for example, “Duration: 0.25” indicates that a pitch occupies a 0.25 measure length. In the musical instrument digital interface (MIDI) standard, each pitch is assigned a unique index ranging from 0 to 127, representing 128 pitch types; for example, the index of “Pitch: C4” is 60. Thus, “Pitch: C4 Duration: 0.25” is represented by the sequence “60, 60, 60, 60, 0”. Note that the end of this sequence is replaced with 0 because the “offset” of the pitch must be clearly indicated. Supposing “Pitch: F4 Duration: 0.25” is followed by “Pitch: F4 Duration: 0.5”, if the “offset” were not indicated, these would be difficult to distinguish from a single “Pitch: F4 Duration: 0.75”.
Figure 1.
Example of melody and chord sequences converted from a piano roll. In the piano roll, a portion of the corresponding melody and chord is truncated: the black dotted box marks the truncated melody section, while the purple box marks the truncated chord section. The truncated melody and chords are converted into melody and chord sequences, with each segment between two dotted lines representing the length of one measure.
When a melody sequence is generated, the corresponding chord sequence can be obtained by simply filling in the chord index at the corresponding positions. A total of 446 chords have been detected and assigned indices in OpenEWLD, with the index 0 indicating “offset”.
During training, eight consecutive measures are randomly extracted from any lead sheet in the dataset. The melody and chord sequences, both with a length of 320 (8 × 40), are obtained using the representation method described above.
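As a concrete illustration, the event-to-sequence conversion described above can be sketched as follows. The `encode_events` helper and its `(pitch_index, duration_in_measures)` event format are hypothetical, and the step counts assume the 40-parts-per-measure grid stated above, not the authors' actual preprocessing code.

```python
STEPS_PER_MEASURE = 40  # each measure is divided into 40 parts

def encode_events(events, steps_per_measure=STEPS_PER_MEASURE):
    """Convert (pitch_index, duration_in_measures) events into a step sequence.

    Each note occupies round(duration * steps_per_measure) steps; the final
    step of every note is set to 0 to mark its offset, so consecutive notes
    of the same pitch remain distinguishable.
    """
    sequence = []
    for pitch, duration in events:
        n_steps = round(duration * steps_per_measure)
        sequence.extend([pitch] * (n_steps - 1) + [0])
    return sequence

# "Pitch: F4 (MIDI 65) Duration: 0.25" followed by "Pitch: F4 Duration: 0.5":
seq = encode_events([(65, 0.25), (65, 0.5)])
```

The 0 placed at the last step of the first note is what keeps this pair of notes from being confused with a single “Pitch: F4 Duration: 0.75” note.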
3.2. Transformer-Based Diffusion Model
MelodyDiffusion includes forward and reverse processes. As shown in Figure 2, the forward process adds noise to the original melody according to the time step. The reverse process includes a denoising model and a pre-trained encoder: the denoising model takes the noisy melody generated by the forward process as input and removes the noise to restore the original melody. The pre-trained encoder extracts hidden features from the chords and is connected to the transformers in the denoising model through a cross-attention module to guide the denoising process. The forward and reverse processes are explained in more detail in the following sections.
Figure 2.
Model structure of MelodyDiffusion.
3.2.1. Notations
This section introduces the notation used to describe the forward and reverse processes, as listed in Table 1. In the forward process, time steps 𝑡∈(1, 𝑇) control the addition of the noise 𝑧 to the original melody 𝑥0, where 𝑧 follows a Gaussian distribution. Gaussian noise is used because the noise must follow a well-defined distribution; otherwise, the diffusion model is unable to predict it. Furthermore, Gaussian noise has been widely used in various diffusion models and has proven effective. In addition, Gaussian noise has a broad range of applications and foundations in statistics and the natural sciences; its use in diffusion models can therefore better simulate real-world noise.
Table 1.
Notations and their descriptions.
𝑥1, 𝑥2, …, 𝑥𝑡, …, 𝑥𝑇 represent noisy melodies with noise added at different time steps 𝑡. 𝛽 varies with 𝑡 and is the parameter controlling the level of noise added. 𝜖𝜃 and 𝜌𝜑 represent the denoising model and encoder, respectively, in the reverse process. 𝑐1, 𝑐2, …, 𝑐𝑛 represent the chords input as conditions to the encoder. ℎ1, ℎ2, …, ℎ𝑛 represent the hidden features output by the encoder.
3.2.2. Forward Process
The forward process follows the time steps 1 to 𝑇, recursively adds noise 𝑧 to the original melody 𝑥0, and saves the noised result at each time step, 𝑥1, 𝑥2, …, 𝑥𝑡, …, 𝑥𝑇. Noise is added to the melody at time step 𝑡 using Equation (1), where 𝛽 increases as 𝑡 increases. Therefore, a larger 𝑡 value indicates that more noise 𝑧 has been added to 𝑥0.
By recursively applying Equation (1), the noisy melody 𝑥𝑡 at any time step 𝑡 can be obtained directly from the original melody 𝑥0, as shown in Equation (2).
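In the standard DDPM formulation, which the description above follows, the noising step of Equation (1) and its closed form, Equation (2), can be written as:

```latex
% Equation (1): one noising step at time t
x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z,
\qquad z \sim \mathcal{N}(0, I)

% Equation (2): closed form obtained by applying (1) recursively
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, z,
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

These are the standard DDPM forms; the paper's exact equations may differ in notation.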
Specifically, each pitch in the melody sequence is embedded using one-hot encoding. As the pitch indices range from 0 to 127, the size of the one-hot encoded vector is 128. Subsequently, noise following a Gaussian distribution and controlled by the parameter 𝛽𝑡 is added to each vector. Figure 3 presents the overall process of adding Gaussian noise to the melody sequences. A melody with a length of 320 is embedded into a 320 × 128 matrix through one-hot encoding. Thereafter, Gaussian noise of the same size as the one-hot encoded melody is generated. Finally, the Gaussian noise is added to the one-hot encoded melody using Equation (2) to obtain the noisy melody.
Figure 3.
Adding Gaussian noise to the melody sequence.
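The one-hot embedding and closed-form noising described above can be sketched in NumPy as follows; this is a minimal illustration of Equation (2) under the paper's linear β schedule, not the authors' implementation.

```python
import numpy as np

def one_hot(melody, num_pitches=128):
    """Embed a melody sequence (length 320) as a 320 x 128 one-hot matrix."""
    x0 = np.zeros((len(melody), num_pitches))
    x0[np.arange(len(melody)), melody] = 1.0
    return x0

def add_noise(x0, t, betas, rng=np.random.default_rng(0)):
    """Closed-form noising (Equation (2)): jump to any step t directly."""
    alpha_bar = np.prod(1.0 - betas[:t])   # cumulative product of (1 - beta)
    z = rng.standard_normal(x0.shape)      # Gaussian noise, same shape as x0
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * z

betas = np.linspace(1e-4, 0.02, 500)       # linear schedule from the paper
melody = np.random.default_rng(1).integers(0, 128, size=320)
x_t = add_noise(one_hot(melody), t=250, betas=betas)
```

At t = 250, roughly half of the schedule, the one-hot structure is already heavily dominated by noise, matching the behavior described for the forward process.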
3.2.3. Reverse Process
In the reverse process, the goal of the denoising model 𝜖𝜃 is to infer 𝑥0 recursively from 𝑥𝑡. During the recursion, the Gaussian noise 𝑧′𝑡 added to 𝑥𝑡−1 is predicted from 𝑥𝑡 according to Equation (3). Once 𝑧′𝑡 is predicted, 𝑥′𝑡−1 can easily be obtained using Equation (4), which inverts the noising step. Through training, 𝑥′𝑡−1 gradually approximates the known 𝑥𝑡−1 from the forward process. 𝜖𝜃(·) denotes the denoising model. 𝜌𝜑(·) denotes the pre-trained encoder, which is composed of transformers using self-attention. It is distinguished from the denoising model 𝜖𝜃(·) in that it is a pre-trained model whose weights are frozen during the training of the diffusion model. Moreover, 𝑐 denotes the chord sequence provided to the encoder as a conditional input.
Subsequently, the loss is calculated from 𝑧′𝑡 and 𝑧𝑡. In the reverse process, the noise 𝑧′𝑡 is predicted from 𝑥𝑡; in the forward process, the noise 𝑧𝑡 added to 𝑥𝑡−1 is known. The loss for updating the denoising model is obtained by calculating the difference between 𝑧′𝑡 and 𝑧𝑡, and the loss function uses the mean squared error (MSE), as defined in Equation (5).
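In the notation of Table 1, the prediction, inversion, and loss referred to as Equations (3)-(5) take the following standard forms (the inversion mirrors the noising step of Equation (1)); these are reconstructions of the standard equations, not reproductions of the paper's exact typography:

```latex
% Equation (3): noise prediction conditioned on the encoded chords
z'_t = \epsilon_\theta\!\left(x_t,\, t,\, \rho_\varphi(c)\right)

% Equation (4): inverting the noising step
x'_{t-1} = \frac{x_t - \sqrt{\beta_t}\, z'_t}{\sqrt{1-\beta_t}}

% Equation (5): MSE between the actual and predicted noise
\mathcal{L}_{\mathrm{MSE}} =
\mathbb{E}_{t,\, x_0,\, z_t}\!\left[\, \lVert z_t - z'_t \rVert^2 \,\right]
```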
As illustrated in Figure 4, the reverse process comprises two parts: the denoising model and a pre-trained encoder for inputting the chord information. The denoising model is composed of transformer blocks that use both self-attention and cross-attention mechanisms. The query, key, and value in a self-attention layer all originate from the current input. Conversely, in a cross-attention layer, only the query originates from the previous layer, while the key and value come from the hidden features produced by the encoder. The structure of the encoder is similar to that of the denoising model, except that all cross-attention layers are replaced with self-attention layers. Its structure is similar to that of BERT [25], and it is pre-trained in advance using the masked language modeling method [25]. Specifically, chord sequences are randomly masked and fed into the encoder, which learns deeper representations by recovering the masked parts.
As shown in Figure 4, the denoising model is connected to the pre-trained encoder through cross-attention. The pre-trained encoder receives the chord sequence 𝑐1, 𝑐2, …, 𝑐𝑛 as an input and outputs hidden features ℎ1, ℎ2, …, ℎ𝑛 encoding the contextual information. These hidden features are then fed into each transformer block in the denoising model via cross-attention.
Figure 4.
Pipeline of conditioned denoising in the reverse process.
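The distinction between self-attention and cross-attention described above can be sketched in plain NumPy; the single-head `attention` helper, the dimension sizes, and the random inputs are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_src, kv_src, w_q, w_k, w_v):
    """Scaled dot-product attention. For self-attention, q_src and kv_src
    are the same sequence; for cross-attention, kv_src is the encoder's
    hidden features h_1..h_n extracted from the chords."""
    q, k, v = q_src @ w_q, kv_src @ w_k, kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
melody_states = rng.standard_normal((320, d))  # from the previous layer
chord_hidden = rng.standard_normal((320, d))   # h_1..h_n from the frozen encoder

self_out = attention(melody_states, melody_states, w_q, w_k, w_v)
cross_out = attention(melody_states, chord_hidden, w_q, w_k, w_v)
```

Only the source of the key/value sequence changes between the two calls, which is exactly the difference between the encoder's self-attention layers and the denoising model's cross-attention layers.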
Algorithm 1 describes the training of MelodyDiffusion, while Algorithm 2 describes how new melodies are sampled from Gaussian noise and conditional chords after the model has converged.
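Since Algorithms 1 and 2 are not reproduced here, the following is a minimal NumPy sketch of what such a training step and sampling loop typically look like for this kind of model. The `denoise` callable, the dummy stand-in model, and the deterministic Equation-(4)-style update are assumptions for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 500)
alpha_bars = np.cumprod(1.0 - betas)

def train_step(denoise, x0, cond):
    """One step in the spirit of Algorithm 1: sample t, noise x0,
    predict the noise, and compute the MSE loss (Equation (5))."""
    t = rng.integers(1, len(betas) + 1)
    z = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1 - alpha_bars[t - 1]) * z
    z_pred = denoise(x_t, t, cond)
    return np.mean((z - z_pred) ** 2)

def sample(denoise, cond, shape):
    """In the spirit of Algorithm 2: start from pure Gaussian noise
    and denoise step by step, inverting the noising step each time."""
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        z_pred = denoise(x, t, cond)
        x = (x - np.sqrt(betas[t - 1]) * z_pred) / np.sqrt(1 - betas[t - 1])
    return x

# A stand-in "model" that just echoes scaled input, to exercise the loops:
dummy = lambda x_t, t, cond: 0.1 * x_t
loss = train_step(dummy, np.zeros((320, 128)), cond=None)
generated = sample(dummy, cond=None, shape=(320, 128))
```

In the actual model, `denoise` would be the transformer-based denoising model and `cond` the chord hidden features from the frozen encoder.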
4. Experiments
Hits@k [26] was used as the metric with which to evaluate the quality of the generated melodies. Hits@k is an evaluation metric commonly used for recommendation models; it considers the top k results from the generated list, sorted by softmax probability, and represents the average percentage of samples in which the target ranks within the top k. First, MelodyDiffusion-large was designed as a comparison to demonstrate the impact of changes in the transformer hyperparameters on the results. Second, “w/o encoder” was used for an ablation experiment, where “w/o encoder” denotes MelodyDiffusion without the encoder that takes chord conditions as input. Finally, the stable diffusion model [27], a conditional generative model based on U-nets, was reproduced as a baseline. To adapt discrete data to the stable diffusion model, its input format was set slightly differently from that of the transformer-based diffusion models.
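A minimal sketch of how a Hits@k score of this kind can be computed; the `hits_at_k` helper and the toy probabilities are illustrative, not the evaluation code used in the paper.

```python
import numpy as np

def hits_at_k(probs, targets, k):
    """Average fraction of positions whose true pitch ranks in the top k
    of the softmax distribution. probs: (length, vocab); targets: (length,)."""
    topk = np.argsort(probs, axis=-1)[:, -k:]   # indices of the k largest
    hits = [t in row for t, row in zip(targets, topk)]
    return float(np.mean(hits))

probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2]])
print(hits_at_k(probs, [1, 0], k=1))   # 1.0: the target is the argmax in both rows
print(hits_at_k(probs, [2, 1], k=1))   # 0.0
print(hits_at_k(probs, [2, 1], k=2))   # 1.0
```

With k = 1 this reduces to top-1 accuracy over the restored pitches, which is why it serves as the key metric below.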
4.1. Dataset
The Enhanced Wikifonia Leadsheet Dataset (EWLD) is a music lead sheet dataset. OpenEWLD [23], which was extracted from EWLD and contains only lead sheets in the public domain, was used in the experiments reported in this paper. Each lead sheet in OpenEWLD contains the melodies and chords required for training. These data were divided into 13,884 sample pairs of eight measures each for training and evaluation.
4.2. Experimental Environment
Table 2 lists the hyperparameters used during training in the forward process. The number of time steps signifies that the maximum value of the time step 𝑇 was 500. 𝛽𝑠𝑡𝑎𝑟𝑡 and 𝛽𝑒𝑛𝑑 indicate that the minimum value of 𝛽𝑡 in Equations (1) and (2) was 0.0001 and the maximum value was 0.02, with 𝛽 increasing linearly over time 𝑡. The proposed models and the comparative models, namely “w/o encoder” and the stable diffusion model [27], used the same forward process in the experiments.
Table 2.
Hyperparameters used in the forward process.
Table 3 shows the hyperparameters of the encoder and denoising model in both the base and large versions of MelodyDiffusion. The hyperparameter settings of the two versions follow BERT-base and BERT-large [25], transformer-based pre-trained models used for representation learning in natural language processing. In this paper, we used the transformer blocks of BERT as the main architecture of the diffusion model. The encoder and denoising model share the same structure, except that cross-attention is enabled in the denoising model.
Table 3.
Hyperparameters of encoder and denoising model in both the base and large versions of MelodyDiffusion.
Table 4 lists the hyperparameters used by the baseline model. The stable diffusion model was replicated as the baseline. The melody was input into the model as a single-channel grayscale image. The model consisted of 12 blocks, of which 6 performed down-sampling and 6 performed up-sampling. Each block received hidden states extracted from the chords through cross-attention with the encoder. The encoder used here was the same as that used in MelodyDiffusion.
Table 4.
Hyperparameters of the baseline model.
Table 5 displays the hyperparameters of the training strategy, including global dropout to prevent overfitting, as well as the parameters of the AdamW optimizer and the learning rate. A dropout value of 0.1 was used globally in the reverse-process model. The AdamW optimizer was selected to update the model, with its learning rate set to 1 × 10−4 and its parameters 𝛽1 and 𝛽2 set to 0.9 and 0.98, respectively. The warm-up steps indicate that the learning rate increased linearly from 0, reached its peak at 500 iterations, and then decayed back toward 0 following a cosine function.
Table 5.
Hyperparameters of the training strategy.
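The warm-up-then-cosine schedule described above can be sketched as follows; the `lr_at` helper and the total iteration count are assumptions for illustration, not the paper's training code.

```python
import math

def lr_at(step, peak=1e-4, warmup=500, total=10000):
    """Linear warm-up to `peak` over `warmup` iterations, then cosine
    decay toward 0. `total` (the full number of training iterations)
    is an assumed value, not stated in the paper."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

lr_at(250)    # halfway through warm-up
lr_at(500)    # peak learning rate, reached at 500 iterations
lr_at(10000)  # fully decayed
```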
4.3. Displays of Training
Figure 5 illustrates the melodies with noise added at different time steps during the forward process. In the original melody 𝑥0, the melody was clearly defined. As the time step 𝑡 increased to 100, the noisy melody 𝑥100 became increasingly difficult to recognize. When 𝑡 increased from 250 to 500, the melody became completely unrecognizable, and the measures were indistinguishable. This demonstrates that noise was gradually added during the forward process: as the time step increased, noise progressively dominated the melody, blurring it until it became unrecognizable. This also confirms the role of noise, which provides the model with creativity and diversity, resulting in the generation of more interesting melodies. In MelodyDiffusion, melodies of different styles and qualities can be generated by randomly sampling Gaussian noise.
Figure 5.
Examples of noise-added melodies at different time steps.
Figure 6 shows the change in MSE loss during the first 250 iterations of training for the MelodyDiffusion base and large models, as well as for the “w/o encoder” and baseline stable diffusion models, during the warm-up phase. Owing to its larger number of parameters, the large model had a higher initial loss than the base model. In contrast to the transformer-based models, the stable diffusion model had a low initial loss because its convolutions allow it to quickly predict the noise distribution from local features. However, as training continued, the transformer models began to excel at predicting noise on discrete data by considering contextual information in both directions. After 100 iterations, the MSE loss between the predicted and actual noise distributions converged to approximately the same value for all four models.
Figure 6.
Change in MSE loss during training.
GANs can generate new samples from noise. However, the goal of a GAN is to restore the original sample 𝑥0 directly from the Gaussian noise distribution 𝑥𝑇, which is a challenging task. In contrast, MelodyDiffusion only infers 𝑥𝑡−1 from 𝑥𝑡. As shown in Figure 5, it is difficult to directly restore 𝑥0 from 𝑥500, which makes it hard for the generator of a GAN to update based on the feedback from the discriminator; predicting 𝑥499, however, is comparatively easy. Furthermore, to make the model converge faster, the training method of the denoising diffusion implicit model (DDIM) [28] was used instead of that of the traditional DDPM. DDIM was chosen because the forward process of a diffusion model can include hundreds of steps, and a DDPM based on Markov chains must traverse all time steps to generate a single sample. In DDIM, the forward process can be non-Markovian and can be accelerated by subsampling.
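The subsampled sampling that DDIM enables can be sketched as follows, using deterministic DDIM (η = 0) over 50 of the 500 steps; the `denoise` stand-in and the step grid are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 500)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_sample(denoise, cond, shape, num_steps=50,
                rng=np.random.default_rng(0)):
    """Deterministic DDIM sampling over a subsampled time grid:
    only `num_steps` of the 500 forward steps are visited."""
    taus = np.linspace(0, len(betas) - 1, num_steps).astype(int)
    x = rng.standard_normal(shape)
    for i in range(len(taus) - 1, 0, -1):
        t, t_prev = taus[i], taus[i - 1]
        eps = denoise(x, t, cond)
        # Estimate x0 from the current noisy sample, then step to t_prev.
        x0_pred = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_prev]) * x0_pred \
            + np.sqrt(1 - alpha_bars[t_prev]) * eps
    return x

dummy = lambda x, t, cond: 0.1 * x   # stand-in for the denoising model
out = ddim_sample(dummy, cond=None, shape=(320, 128))
```

Because consecutive visited steps can be far apart, generation requires only `num_steps` model evaluations rather than all 500, which is the acceleration referred to above.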
4.4. Hits@k Evaluation Results
Table 6 presents the evaluation results of the generated melodies using Hits@k. In the restoration experiment, Gaussian noise was added to melodies at time step T (T = 500); on the basis of the given chords, the reverse process then ran T recursive denoising operations on the noisy melodies, and the probability distributions of the restored melodies were output through softmax. For Hits@k, k was set to 1, 3, 5, 10, and 20, and the average percentage of the original melody's pitch appearing among the top k pitches of the softmax probability distribution was calculated. When k = 1, only the pitch with the highest softmax probability was considered when evaluating the restoration performance, making it the key metric in this evaluation. The performance of the base version of MelodyDiffusion was slightly lower than that of the baseline stable diffusion model, by 1.43%; meanwhile, the large version, with a similar number of parameters, performed 0.63% higher. However, the performance of the w/o encoder model, which did not use chords as an auxiliary input, was worse than that of the other three conditional models. When k > 3, the Hits@k score of the stable diffusion model was higher than that of the MelodyDiffusion models. Analysis of the generated results showed that sections of the melody with significant pitch changes affected the transformer's judgment during restoration, given that it tends to make predictions based on contextual features. In contrast, the stable diffusion model, which utilizes a CNN, could restore such sections based on local features.
Table 6.
Evaluation results of the generated melodies obtained using Hits@k.
4.5. Comparison of Generated and Real Melodies
Figure 7 compares the original melodies (A) with those generated by the base (B) and large (C) versions of MelodyDiffusion, the stable diffusion model (D), and the w/o encoder model (E). These melodies were generated by the models from random Gaussian noise using the same chords as the condition (except for the w/o encoder model). First, the analysis of the results showed that all models could recognize the measures (with a duration of 40 steps per measure). Second, there were obvious noise patterns in the background of (D); these occurred because the stable diffusion model, which uses a CNN, could not eliminate interference based on contextual features. Finally, the three generated melodies in (E) were extremely similar, particularly in the first two measures. This also indicates that the chords provide diversity to the generated melodies.
Figure 7.
Visual comparison of the Softmax probability distribution of original and generated melodies. (A) Original melodies; (B) generated melodies based on the base version of MelodyDiffusion; (C) generated melodies based on the large version of MelodyDiffusion; (D) generated melodies based on the stable diffusion model; (E) generated melodies based on the w/o encoder system.
5. Conclusions
This paper proposed a chord-conditioned melody generation method using MelodyDiffusion, which is a diffusion model based on transformers. MelodyDiffusion consists of forward and reverse processes. In the forward process, Gaussian noise is added to melodies following the time step. In the reverse process, the noise is reduced through the denoising model, using the given chords as conditions. Following training, this model could effectively predict the noise distribution from noisy melodies. Hits@k was used as a metric to evaluate the generated melodies. The experimental results showed that the performance of the proposed method reached 70.35% and 72.41% (k = 1) for the base and large versions, respectively, in restoring noisy melodies, with the large version outperforming the baseline model (stable diffusion) by 0.63%. Additionally, the visual comparison showed that MelodyDiffusion could generate more diverse melodies under the condition of given chords compared to the unconditional diffusion model.
Currently, MelodyDiffusion can only generate monophonic melodies based on chords. Considering the practicality of polyphonic music composed of multiple instruments, future research directions include identifying chords and melodies directly from multi-track polyphonic music, in which, unlike datasets with paired chords and melodies, the two are not explicitly distinguished. Moreover, it is necessary to explore the application of MelodyDiffusion to other forms of musical data and to generate polyphonic music containing multiple instruments.
Moreover, the diffusion model was originally proposed for image generation; thus, although its main structure was replaced with transformers to handle discrete data with temporal properties in MelodyDiffusion, there remains a limitation on the length of the generated samples. While increasing the processing length of the transformers can yield longer samples, it also leads to higher training costs. A feasible approach is to guide the generation of subsequent melodies by treating previous melody segments as conditional inputs, similar to the conditioning on chord inputs.