Deep Learning Techniques for Music Generation - Companion Mini Web Site - Additions

9 Deep Learning Techniques for Music Generation - Companion Mini Web Site - Additions to the Book

9.1 Transformer

The Transformer architecture [VNS+ 2017] has gained enormous success, being used for machine translation and as the basis of ChatGPT and similar large language models (LLMs). The Transformer may be presented as an evolution of the RNN Encoder-Decoder architecture (introduced in Section 5.13.3), with a self-attention mechanism and also a strong use of embeddings.

9.1.1 Embeddings

Embeddings and feature extraction have been introduced in Section 4.9.3. They originated in natural language processing (NLP). The basic idea is to project the elements (i.e., words) of the text sequence considered into the latent space of an architecture. The encoding as embeddings transforms the sequence elements into vectors (with a size usually around 500-1000).

Two main architectures used [MCC+ 2013] are:

- the continuous bag-of-words (CBOW) architecture, which predicts a word from its surrounding context words;
- the Skip-gram architecture, which, conversely, predicts the surrounding context words from a given word.

Fig. 9.1 CBOW and Skip-gram architectures. Reproduced from [MCC+ 2013]
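
As an illustration, the following minimal sketch (in Python, assuming the gensim library is available) trains both variants on a toy corpus; the corpus and the hyperparameter values are purely illustrative.

```python
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 selects CBOW (predict a word from its context),
# sg=1 selects Skip-gram (predict the context from a word).
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)  # a 50-dimensional embedding vector
```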

The benefit of this approach is that words are organized in a semantic space, with geometric semantic (and syntactic) relations between words (captured from the statistical co-occurrences between words). For instance, one can deduce the word (and concept) "King" by the following arithmetic on vectors: "King" = "Queen" - "Woman" + "Man".

Fig. 9.2 Semantic and syntactic relationship between words/concepts. Reproduced from [Venugopal 2021]
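
Such vector arithmetic can be expressed as a nearest-neighbour query over pretrained embeddings. The following sketch assumes the gensim library and its downloadable "glove-wiki-gigaword-50" vectors (any set of pretrained word embeddings would do); the expected result is indicative, not guaranteed.

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe word vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-50")

# "King" = "Queen" - "Woman" + "Man", expressed as a nearest-neighbour query.
print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))
# The top answer is expected to be "king" (or a closely related word).
```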

It is also easy to compute the similarity of two words/concepts, by simply computing the dot product of the corresponding vectors. If the value is positive, the words/concepts are similar; if negative, they are dissimilar; if null, they are uncorrelated. This will be the basis for the self-attention mechanism.
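
A minimal sketch of this similarity computation (in Python with NumPy; the example words in the comments assume the pretrained vectors loaded above):

```python
import numpy as np

def similarity(u, v):
    """Dot-product similarity between two embedding vectors.

    With (approximately) normalized vectors this coincides with the cosine
    similarity: positive means similar, negative dissimilar, near zero
    uncorrelated.
    """
    return float(np.dot(u, v))

# With the pretrained vectors loaded above (illustrative comparison):
# similarity(wv["king"], wv["queen"])   # relatively large (related concepts)
# similarity(wv["king"], wv["banana"])  # much smaller (unrelated concepts)
```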

Note that this embeddings approach has also been applied to sequences other than natural text, in our case, musical sequences. An important step is thus to tokenize musical elements (characteristics such as note pitch, note duration, etc.) into tokens, organized in sequences, to be processed by a Transformer. Note that there are various possible strategies for such a tokenization, and the choice impacts the accuracy and efficiency of both training and generation [FGC+ 2023]. The embeddings mechanism is, as for natural language, supposed to capture semantic relations between musical elements.
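
As a toy illustration, one possible (hypothetical) strategy is to interleave pitch and duration tokens; actual tokenizers (e.g., MIDI-like or REMI representations, discussed in [FGC+ 2023]) make more elaborate choices.

```python
# Hypothetical note events: (MIDI pitch, symbolic duration).
notes = [(60, "quarter"), (64, "eighth"), (67, "eighth")]

# Turn each note into a pair of tokens (the token names are purely illustrative).
tokens = []
for pitch, duration in notes:
    tokens.append(f"PITCH_{pitch}")
    tokens.append(f"DUR_{duration.upper()}")

# Map token strings to integer ids, as expected by an embedding layer.
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

print(tokens)     # ['PITCH_60', 'DUR_QUARTER', 'PITCH_64', 'DUR_EIGHTH', ...]
print(token_ids)
```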

9.1.2 Self-Attention

Having introduced above embeddings and the semantic relations captured by the corresponding arithmetic on vectors, the self-attention mechanism boils down to computing the dot product of each element of the input sequence (of the Transformer architecture) with all other elements. The corresponding values inform about the similarity relations between each element of the input sequence and all other elements, and thus identify the elements which form an important context. Thus the Transformer can focus on the most similar/contextual elements. This is a very simple but effective mechanism. The cost is that it is quadratic with respect to the length of the sequence, but pseudo-linear approximations have been designed, see, e.g., a survey in [TDB+ 2022].
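
A minimal NumPy sketch of this mechanism, restricted to the pairwise dot products and their softmax normalization (the full architecture additionally uses learned query/key/value projections):

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention over a sequence of embeddings.

    X: array of shape (sequence_length, embedding_dim).
    Returns an array of the same shape, where each output vector is a
    weighted average of all input vectors, the weights coming from the
    (scaled) dot products between embeddings.
    """
    d = X.shape[-1]
    # Pairwise dot products: entry (i, j) measures how similar/relevant
    # element j is to element i.  This is the quadratic-cost step.
    scores = X @ X.T / np.sqrt(d)
    # Softmax turns each row of scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output element is a context-aware mixture of the whole sequence.
    return weights @ X

# Example: a sequence of 4 tokens with 8-dimensional embeddings.
X = np.random.randn(4, 8)
print(self_attention(X).shape)  # (4, 8)
```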

9.1.3 More Details

Actually, there are further specific techniques in a Transformer architecture, like a multi-head attention mechanism (which allows simultaneously attending to various zones of the input sequence), and a positional encoding mechanism (in order for the model to make use of the order of the sequence). See more details in the original paper [VNS+ 2017].
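
As an example, the sinusoidal positional encoding defined in the original paper [VNS+ 2017] can be sketched as follows (the encoding is simply added to the token embeddings; an even embedding dimension is assumed):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as defined in [VNS+ 2017]."""
    positions = np.arange(seq_len)[:, np.newaxis]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is added element-wise to the token embeddings:
# embeddings = embeddings + positional_encoding(seq_len, d_model)
```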

9.2 Diffusion Models

Diffusion models have become very popular for the generation of high-quality content (images, etc.) [O'Connor 2022]. One advantage over GANs is that their training is easier. They may also be considered as an evolution of an existing composite architecture, namely stacked autoencoders, with an additional feature, denoising (as for denoising autoencoders) [Dieleman 2022].
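
To make the role of noise more concrete, below is a minimal sketch of the forward (noising) process used by DDPM-style diffusion models: a sample is progressively corrupted by Gaussian noise following a schedule, and the network is then trained to reverse (denoise) this corruption. The schedule values and the data are purely illustrative.

```python
import numpy as np

# Linear noise schedule over T steps (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t, rng=np.random.default_rng()):
    """Sample x_t from the forward process q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

x0 = np.random.rand(64)        # a toy data sample
x_mid = noisy_sample(x0, 500)  # partially noised
x_end = noisy_sample(x0, 999)  # almost pure Gaussian noise
```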

9.2.1 Denoising Autoencoder

A denoising autoencoder is an extension of the conventional autoencoder. Noise is added to the input (this may be done through an additional input layer). The most common noise used is Gaussian noise. The learning objective is to reconstruct the pure input from the corrupted input [VLB+ 2008]. Once trained, the architecture may generate content similar to the original inputs from noise.
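
A minimal training-step sketch (in Python, assuming PyTorch; the layer sizes, noise level, and data are illustrative):

```python
import torch
import torch.nn as nn

# A minimal denoising autoencoder (dimensions are illustrative, e.g. flattened 28x28 images).
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),    # encoder
    nn.Linear(128, 784), nn.Sigmoid()  # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(x_clean, noise_std=0.3):
    # Corrupt the input with Gaussian noise...
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    # ...and train the network to reconstruct the clean (pure) input.
    x_reconstructed = model(x_noisy)
    loss = loss_fn(x_reconstructed, x_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with a random batch standing in for real data.
batch = torch.rand(16, 784)
print(training_step(batch))
```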

References


Jean-Pierre Briot, 11/11/2024.