Deep Learning Techniques for Music Generation - Companion Mini Web Site - Additions

9 Deep Learning Techniques for Music Generation - Companion Mini Web Site - Additions to the Book

9.1 Transformer

The Transformer architecture [VNS+ 2017] has gained enormous success, being used for machine translation and as the basis of ChatGPT and similar large language models (LLMs). The Transformer may be presented as an evolution of the RNN Encoder-Decoder architecture (introduced in Section 5.13.3), with a self-attention mechanism and also a strong use of embeddings.

9.1.1 Embeddings

Embeddings and feature extraction have been introduced in Section 4.9.3. They originated in natural language processing (NLP). The basic idea is to project the elements (i.e., words) of the text sequence considered into the latent space of an architecture. The encoding as embeddings transforms the sequence elements into vectors (with a size usually around 500-1000).

Two main architectures used [MCC+ 2013] are:

- the continuous bag-of-words (CBOW) architecture, which predicts a word from its surrounding context words;
- the Skip-gram architecture, which, conversely, predicts the surrounding context words from a given word.

Fig. 9.1 CBOW and Skip-gram architectures. Reproduced from [MCC+ 2013]
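
As an illustration, the following minimal sketch (in Python, assuming the gensim library is available) trains both variants on a toy corpus; the corpus and the hyperparameter values are purely illustrative.

```python
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 selects CBOW (predict a word from its context),
# sg=1 selects Skip-gram (predict the context from a word).
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)  # a 50-dimensional embedding vector
```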

The benefit of this approach is that words are organized in a semantic space, with geometric semantic (and syntactic) relations between words (captured from the statistical co-occurrences between words). For instance, one can deduce the word (and concept) "King" by the following arithmetic on vectors: "King" = "Queen" - "Woman" + "Man".

Fig. 9.2 Semantic and syntactic relationship between words/concepts. Reproduced from [Venugopal 2021]
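
Such vector arithmetic can be expressed as a nearest-neighbour query over pretrained embeddings. The following sketch assumes the gensim library and its downloadable "glove-wiki-gigaword-50" vectors (any set of pretrained word embeddings would do); the expected result is indicative, not guaranteed.

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe word vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-50")

# "King" = "Queen" - "Woman" + "Man", expressed as a nearest-neighbour query.
print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))
# The top answer is expected to be "king" (or a closely related word).
```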

It is also easy to compute the similarity of two words/concepts, by simply computing the dot product of the corresponding vectors. If the value is positive, the words/concepts are similar; if negative, they are dissimilar; if null, they are uncorrelated. This will be the basis for the self-attention mechanism.
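
A minimal sketch of this similarity computation (in Python with NumPy; the example words in the comments assume the pretrained vectors loaded above):

```python
import numpy as np

def similarity(u, v):
    """Dot-product similarity between two embedding vectors.

    With (approximately) normalized vectors this coincides with the cosine
    similarity: positive means similar, negative dissimilar, near zero
    uncorrelated.
    """
    return float(np.dot(u, v))

# With the pretrained vectors loaded above (illustrative comparison):
# similarity(wv["king"], wv["queen"])   # relatively large (related concepts)
# similarity(wv["king"], wv["banana"])  # much smaller (unrelated concepts)
```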

Note that this embeddings approach has also been applied to sequences other than natural text, in our case, musical sequences. An important step is thus to tokenize musical elements (characteristics such as note pitch, note duration, etc.) into tokens, organized in sequences, to be processed by a Transformer. Note that there are various possible strategies for such a tokenization, and the choice impacts the accuracy and efficiency of both training and generation [FGC+ 2023]. The embeddings mechanism is, as for natural language, supposed to capture semantic relations between musical elements.
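
As a toy illustration, one possible (hypothetical) strategy is to interleave pitch and duration tokens; actual tokenizers (e.g., MIDI-like or REMI representations, discussed in [FGC+ 2023]) make more elaborate choices.

```python
# Hypothetical note events: (MIDI pitch, symbolic duration).
notes = [(60, "quarter"), (64, "eighth"), (67, "eighth")]

# Turn each note into a pair of tokens (the token names are purely illustrative).
tokens = []
for pitch, duration in notes:
    tokens.append(f"PITCH_{pitch}")
    tokens.append(f"DUR_{duration.upper()}")

# Map token strings to integer ids, as expected by an embedding layer.
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

print(tokens)     # ['PITCH_60', 'DUR_QUARTER', 'PITCH_64', 'DUR_EIGHTH', ...]
print(token_ids)
```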

9.1.2 Self-Attention

Having introduced above embeddings and the semantic relations captured by the corresponding arithmetic on vectors, the self-attention mechanism boils down to computing the dot product of each element of the input sequence (of the Transformer architecture) with all other elements. The corresponding values inform about the similarity relations between each element of the input sequence and all other elements, and thus identify the elements which form an important context. Thus the Transformer can focus on the most similar/contextual elements. This is a very simple but effective mechanism. The cost is that it is quadratic with respect to the length of the sequence, but pseudo-linear approximations have been designed, see, e.g., a survey in [TDB+ 2022].
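
A minimal NumPy sketch of this mechanism, restricted to the pairwise dot products and their softmax normalization (the full architecture additionally uses learned query/key/value projections):

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention over a sequence of embeddings.

    X: array of shape (sequence_length, embedding_dim).
    Returns an array of the same shape, where each output vector is a
    weighted average of all input vectors, the weights coming from the
    (scaled) dot products between embeddings.
    """
    d = X.shape[-1]
    # Pairwise dot products: entry (i, j) measures how similar/relevant
    # element j is to element i.  This is the quadratic-cost step.
    scores = X @ X.T / np.sqrt(d)
    # Softmax turns each row of scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output element is a context-aware mixture of the whole sequence.
    return weights @ X

# Example: a sequence of 4 tokens with 8-dimensional embeddings.
X = np.random.randn(4, 8)
print(self_attention(X).shape)  # (4, 8)
```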

9.1.3 More Details

Actually, there are further specific techniques in a Transformer architecture, like a multi-head attention mechanism (which allows simultaneously attending to various zones of the input sequence), and a positional encoding mechanism (in order for the model to make use of the order of the sequence). See more details in the original paper [VNS+ 2017].
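
As an example, the sinusoidal positional encoding defined in the original paper [VNS+ 2017] can be sketched as follows (the encoding is simply added to the token embeddings; an even embedding dimension is assumed):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as defined in [VNS+ 2017]."""
    positions = np.arange(seq_len)[:, np.newaxis]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is added element-wise to the token embeddings:
# embeddings = embeddings + positional_encoding(seq_len, d_model)
```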

9.2 Diffusion Models

Diffusion models have become very popular for the generation of high-quality content (images, etc.) [O'Connor 2022]. One advantage over GANs is that their training is easier. They may also be considered as an evolution of an existing composite architecture, namely stacked autoencoders, with an additional feature, denoising (as for denoising autoencoders) [Dieleman 2022].
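
To make the role of noise more concrete, below is a minimal sketch of the forward (noising) process used by DDPM-style diffusion models: a sample is progressively corrupted by Gaussian noise following a schedule, and the network is then trained to reverse (denoise) this corruption. The schedule values and the data are purely illustrative.

```python
import numpy as np

# Linear noise schedule over T steps (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t, rng=np.random.default_rng()):
    """Sample x_t from the forward process q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

x0 = np.random.rand(64)        # a toy data sample
x_mid = noisy_sample(x0, 500)  # partially noised
x_end = noisy_sample(x0, 999)  # almost pure Gaussian noise
```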

9.2.1 Denoising Autoencoder

A denoising autoencoder is an extension of the conventional autoencoder. Noise is added to the input (this may be done through an additional input layer). The most common noise used is Gaussian noise. The learning objective is to reconstruct the pure input from the corrupted input [VLB+ 2008]. Once trained, the architecture may generate content similar to the original inputs from noise.
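
A minimal training-step sketch (in Python, assuming PyTorch; the layer sizes, noise level, and data are illustrative):

```python
import torch
import torch.nn as nn

# A minimal denoising autoencoder (dimensions are illustrative, e.g. flattened 28x28 images).
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),    # encoder
    nn.Linear(128, 784), nn.Sigmoid()  # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(x_clean, noise_std=0.3):
    # Corrupt the input with Gaussian noise...
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    # ...and train the network to reconstruct the clean (pure) input.
    x_reconstructed = model(x_noisy)
    loss = loss_fn(x_reconstructed, x_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with a random batch standing in for real data.
batch = torch.rand(16, 784)
print(training_step(batch))
```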

References


Jean-Pierre Briot, 11/11/2024.