Game Music Generation From Video

Zijian
Mar 5, 2020

{qinchaow, tinhangc, ziangqin, zijianwa}@usc.edu || Github

Introduction

In today's game industry, music is essential to a successful game. The common practice is to invite musicians to write different tunes and melodies in advance, matched to the style of the game. These pieces are then played in a loop across different scenes and stages, which quickly becomes repetitive after many hours of play. We therefore explore a method that automatically generates music from game frames using deep learning. Such a method may substantially improve the gameplay experience and inspire musicians who want to write unique music for games. Variations of this model could also be used in areas like filmmaking, advertising, and video editing.

In this article, we propose music generation models that produce suitable music given game frames. By suitable we mean music that fits the style, emotion, and feel of the game. For example, if a frame shows cheerful scenery, the music should be delightful; if the frames indicate a boss fight, the music should be intense; and if we are in an underground world, we might hear spooky music from our generator. In short, we want music to become more diverse and adaptive while we play.

Summary

We propose two music generation models in this project: a GAN model built from scratch and a VAE model that reuses the pre-trained decoder of Google Magenta's MusicVAE. To build a dataset, we pair gameplay video clips from YouTube with MIDI (music) files by hand and feed these pairs into our models for training. Both models produce reasonable music pieces that appear to align with the overall mood of the game frames. We run the models on Google Cloud virtual machines with NVIDIA K80 GPUs to accelerate training.

GAN Model

Idea

The GAN model is inspired by this post by Cory Nguyen, Ryan Hoff, Sam Malcolm, Won Lee, and Abraham Khan, whose team demonstrated that it is possible to generate new Pokemon music with a GAN and achieved good results. Although their goal differs from ours, we adapted some of their techniques, such as using the Music21 library to extract notes from MIDI files and mapping them to a -1 to 1 scale for input.

Here is the original MIDI (from Zelda):

In the GAN experiment, we enforce the following settings (a short preprocessing sketch follows the list):

  1. We generate melody only, for simplicity.
  2. Transform each note’s pitch and duration to a -1 to 1 scale.
  3. Transpose all training MIDI into C major/minor to reduce unnecessary variance in training.
  4. Shift each note’s octave into the range 3 to 6, because octaves outside this range are rarely used in melodies.
  5. Quantize each note’s duration to a whole, half, quarter, eighth, or sixteenth note.
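
To make the list concrete, here is a rough preprocessing sketch using the Music21 library; the constants, helper name, and exact scaling below are illustrative assumptions, not our exact code:

```python
# A minimal preprocessing sketch (music21 assumed; constants are illustrative).
from music21 import converter, interval, note, pitch

DURATIONS = [4.0, 2.0, 1.0, 0.5, 0.25]   # whole ... sixteenth, in quarter lengths
LOW, HIGH = 48, 95                        # MIDI range covering octaves 3-6 (C3..B6)

def preprocess(midi_path):
    score = converter.parse(midi_path)
    # Transpose everything so the tonic becomes C.
    key = score.analyze('key')
    score = score.transpose(interval.Interval(key.tonic, pitch.Pitch('C')))

    events = []
    for n in score.flatten().notes:
        if not isinstance(n, note.Note):      # melody only: skip chords
            continue
        midi = n.pitch.midi
        while midi < LOW:                     # shift octave into 3..6
            midi += 12
        while midi > HIGH:
            midi -= 12
        # Snap the duration to the nearest allowed value.
        dur = min(DURATIONS, key=lambda d: abs(d - float(n.duration.quarterLength)))
        # Scale pitch and duration to the [-1, 1] range.
        p = 2 * (midi - LOW) / (HIGH - LOW) - 1
        d = 2 * DURATIONS.index(dur) / (len(DURATIONS) - 1) - 1
        events.append((p, d))
    return events
```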

Here is the transformed music for training our model:

Model

Here is the sketch of our model:

Note that this is a GAN model with an extra encoder in front of the generator and discriminator. The encoder is a CNN that uses 3D convolutional layers to extract features from video clips; the generator and discriminator then use these features to generate, or identify, fake music.
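
As a rough illustration of the encoder, here is a minimal sketch assuming PyTorch; the layer sizes and feature dimension are placeholders, not our exact architecture:

```python
# A minimal 3D-convolutional clip encoder sketch (PyTorch assumed; sizes illustrative).
import torch
import torch.nn as nn

class ClipEncoder(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1),   # input: (B, 3, T, H, W)
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                                # collapse time and space
        )
        self.fc = nn.Linear(32, feature_dim)

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        h = self.conv(clip).flatten(1)
        return self.fc(h)                    # C(v): clip feature used by G and D

# Example: a batch of two 16-frame 64x64 clips -> (2, 128) features.
features = ClipEncoder()(torch.randn(2, 3, 16, 64, 64))
```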

Now the optimization equation changes to incorporate the encoder.
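
One plausible form, assuming a standard conditional GAN objective in which the encoder output C(v) conditions both the generator G and the discriminator D (x is real music, v the video clip, z the noise input):

```latex
% Assumed form: standard conditional GAN objective with C(v) as the conditioning feature.
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{(x, v) \sim p_{\text{data}}}\big[\log D(x, C(v))\big]
  + \mathbb{E}_{z \sim p_z,\, v}\big[\log\big(1 - D(G(z, C(v)), C(v))\big)\big]
```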

In the optimization equation, the encoder C(·) is not itself constrained: the objective is well defined for any fixed C. So why bother training the encoder at all? We ran an experiment to see what happens if we do not train it.

As you can see, the training loss becomes very unstable, and the generator’s loss grows very large as training goes on. Our explanation is that without training the encoder, its output is effectively random noise, which makes it hard for the generator to produce convincing music. Meanwhile, the discriminator can focus less on image features and more on spotting fake music, which drives the generator’s loss up.

Therefore, we divide training into two stages. In the first stage, we train the discriminator together with the encoder:

In the second stage, we train the generator and discriminator like a normal GAN, treating the encoder as a fixed, deterministic function.
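
To make the schedule concrete, here is a compact sketch assuming PyTorch; the generator, discriminator, and encoder below are tiny stand-in modules, and the shapes and losses are placeholders rather than our exact setup:

```python
# Two-stage GAN training sketch (PyTorch assumed; modules and shapes are stand-ins).
import torch
import torch.nn as nn

feat_dim, noise_dim, music_dim = 128, 32, 256
C = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 64 * 64, feat_dim))    # stand-in encoder
G = nn.Sequential(nn.Linear(noise_dim + feat_dim, music_dim), nn.Tanh())  # stand-in generator
D = nn.Sequential(nn.Linear(music_dim + feat_dim, 1))                     # stand-in discriminator
bce = nn.BCEWithLogitsLoss()

def d_loss(real_music, clips):
    f = C(clips)
    z = torch.randn(clips.size(0), noise_dim)
    fake_music = G(torch.cat([z, f], dim=1))
    real_logit = D(torch.cat([real_music, f], dim=1))
    fake_logit = D(torch.cat([fake_music.detach(), f], dim=1))
    return bce(real_logit, torch.ones_like(real_logit)) + \
           bce(fake_logit, torch.zeros_like(fake_logit))

opt_dc = torch.optim.Adam(list(D.parameters()) + list(C.parameters()), lr=2e-4)  # stage 1
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)                                # stage 2
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

clips = torch.randn(4, 3, 16, 64, 64)            # toy batch of video clips
real_music = torch.rand(4, music_dim) * 2 - 1    # toy batch of scaled melodies

# Stage 1: update the discriminator together with the encoder.
opt_dc.zero_grad(); d_loss(real_music, clips).backward(); opt_dc.step()

# Stage 2: freeze the encoder, then play the usual GAN game between G and D.
for p in C.parameters():
    p.requires_grad_(False)                      # encoder is now a fixed function
opt_d.zero_grad(); d_loss(real_music, clips).backward(); opt_d.step()

f = C(clips)
z = torch.randn(clips.size(0), noise_dim)
fake_logit = D(torch.cat([G(torch.cat([z, f], dim=1)), f], dim=1))
g_loss = bce(fake_logit, torch.ones_like(fake_logit))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The only real difference between the two stages is which parameters each optimizer updates and whether the encoder is frozen.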

Here is the result of training:

We have hand-picked some video clips that have interesting results:

A video clip from HCBailly’s “Let’s Play Final Fantasy #33 — The Floating Castle”
original music
generated music
generated music

See source code.

VAE Model

Idea

The GAN model successfully generates music from images and fulfills our basic need for music generation. Building on this, we go further and explore whether we can control the style of the generated music. A VAE makes this possible because it explicitly exposes the latent space of the encoded input data. This latent space is “smooth”: samples from nearby points have similar qualities to one another. We can manipulate this latent space directly when sampling, steering the sampled music toward desired styles.

Model

But how do we build a VAE whose output has a different format from its input? We use paired data [game frame, music]. In our VAE model, we feed the game frames into the encoder and generate music from the decoder. To improve reconstruction accuracy, we define the reconstruction loss between the generated music sequence and the music sequence originally paired with the input frames. The KL-divergence term remains the same as in a vanilla VAE.
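
A minimal sketch of this loss, assuming PyTorch; the image encoder and music decoder here are simple stand-ins for the real networks, and the dimensions are placeholders:

```python
# Paired [frame -> music] VAE loss sketch (PyTorch assumed; networks are stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, music_dim = 256, 512
img_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2 * latent_dim))  # -> (mu, logvar)
music_decoder = nn.Linear(latent_dim, music_dim)       # stand-in for the MusicVAE decoder

def vae_loss(frame, paired_music, beta=1.0):
    mu, logvar = img_encoder(frame).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
    recon = music_decoder(z)
    # Reconstruction is measured against the *paired* music, not the input image.
    recon_loss = F.mse_loss(recon, paired_music)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

loss = vae_loss(torch.randn(4, 3, 64, 64), torch.randn(4, music_dim))
```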

This way, the encoder learns to encode images into a latent space, and the decoder learns to generate music from samples drawn from that space. The latent space is built not only as a low-dimensional representation of the inputs but also as a space from which music can be generated. As we learned from MusicVAE, reconstruction quality depends heavily on the decoder, which is the only part that generates the music sequence directly, so the decoder must be a powerful network. The MusicVAE decoder has a novel structure for decoding latent codes: it reduces the vanishing-influence problem by inserting a segmented “conductor RNN” before the final output layers, and it limits the scope of the bottom-level decoder so that it has to rely on the latent code. Since the MusicVAE decoder is already good at generating music from latent codes, we reuse it in our model.
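
As a simplified illustration of the conductor idea (not Magenta’s actual implementation), the sketch below assumes PyTorch: a conductor RNN turns the latent code into one embedding per bar, and a bottom-level RNN decodes each bar only from its own embedding, which limits how far influence has to propagate.

```python
# Simplified hierarchical ("conductor") decoder sketch (PyTorch assumed; not MusicVAE's code).
import torch
import torch.nn as nn

class ConductorDecoder(nn.Module):
    def __init__(self, latent_dim=256, emb_dim=128, vocab=90, bars=16, steps=16):
        super().__init__()
        self.bars, self.steps = bars, steps
        self.conductor = nn.GRU(latent_dim, emb_dim, batch_first=True)
        self.bottom = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, vocab)

    def forward(self, z):                        # z: (B, latent_dim)
        # One conductor step per bar, all driven by the same latent code.
        bar_emb, _ = self.conductor(z.unsqueeze(1).repeat(1, self.bars, 1))
        notes = []
        for b in range(self.bars):
            # Each bar is decoded only from its own conductor embedding.
            inp = bar_emb[:, b:b + 1, :].repeat(1, self.steps, 1)
            h, _ = self.bottom(inp)
            notes.append(self.out(h))            # (B, steps, vocab) logits per bar
        return torch.cat(notes, dim=1)           # (B, bars * steps, vocab)

logits = ConductorDecoder()(torch.randn(2, 256))
```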

For the encoder, we use a CNN with an AlexNet-like architecture that takes a single image as input.

Data

To obtain a dataset of [game frame, music] pairs, we created a manually labeled dataset. We cut the game video clips by hand and label each with its corresponding MIDI file; game frames are sampled at 3 fps. We then pair each frame with a 16-bar MIDI trio clip (trio music is polyphonic music with three instrument tracks: melody, bass, and drums) as our basic data unit.
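
A rough sketch of the frame-sampling step, assuming OpenCV; the helper name and file name are hypothetical, and the 3 fps rate comes from the description above:

```python
# Frame sampling sketch (OpenCV assumed; file name is hypothetical).
import cv2

def sample_frames(video_path, target_fps=3):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(round(src_fps / target_fps)), 1)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)        # keep roughly 3 frames per second
        i += 1
    cap.release()
    return frames

# Each sampled frame is then paired by hand with a 16-bar trio MIDI clip.
frames = sample_frames("zelda_clip_01.mp4")
```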

Result

This model successfully generates high-quality trio music pieces. Some featured pieces are listed below:

See source code.

Decoder personalization of VAE model (LC-VAE applied to MusicVAE)

MusicVAE is a general model trained on millions of music sequences across many genres. When the pre-trained decoder receives a latent code, that code may lie anywhere in a latent space whose vectors are regularized toward a standard normal distribution, so the generated music may not sound “game-like.” We therefore also explore the decoding side and try to personalize the decoding process.

Fine-tuning is a popular approach but requires substantial computation to update the full network, so instead we train an additional VAE model (SmallVAE in our project) to sample only from the parts of the latent space we need, which requires far less compute.

The SmallVAE model is trained on MusicVAE’s latent space, and it learns a much smaller latent representation of the already-encoded latent vectors. Visually, it looks like this:

We train SmallVAE on several chunks of MIDI music.

The loss combines a latent (KL) loss and a reconstruction (MSE) loss, with MusicVAE’s encoded latent vectors as input (a sketch follows). The results after 500 epochs of training are shown below.
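
A minimal sketch of SmallVAE and its loss, assuming PyTorch; the 512-dimensional MusicVAE latent and the small latent size below are illustrative choices rather than our exact configuration:

```python
# SmallVAE sketch: a VAE over MusicVAE latent vectors (PyTorch assumed; sizes illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallVAE(nn.Module):
    def __init__(self, musicvae_dim=512, small_dim=16):
        super().__init__()
        self.enc = nn.Linear(musicvae_dim, 2 * small_dim)   # -> (mu, logvar)
        self.dec = nn.Linear(small_dim, musicvae_dim)

    def forward(self, z_big):                               # z_big: MusicVAE latent vectors
        mu, logvar = self.enc(z_big).chunk(2, dim=1)
        z_small = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z_small), mu, logvar

def small_vae_loss(model, z_big, beta=1.0):
    recon, mu, logvar = model(z_big)
    mse = F.mse_loss(recon, z_big)                          # reconstruct the MusicVAE latent
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kl

model = SmallVAE()
loss = small_vae_loss(model, torch.randn(8, 512))           # 8 encoded music chunks
```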

Training loss for SmallVAE

Here is the result sampling from SmallVAE:

Compared to random sampling from MusicVAE:

The melodies sampled from SmallVAE share some similarity with the training music (this can also be seen in the shape of the audio waveforms above), which suggests that SmallVAE can generate music conditioned on the training data.

See source code.

Watson Beat

Finally, we tried more sophisticated music generation tools developed by large companies to achieve better results. Watson Beat is a tool developed by IBM that generates music in different styles such as popfunk, space, and chill. All we need to do is provide an initial file that specifies the tempo, the music layers that must be included, the time signature, and a short seed piece of music. Watson Beat uses reinforcement learning algorithms and models each instrument with patterns and chords. The music it generates sounds quite realistic and pleasant.

Data

We convert the manually labeled dataset described above into [game frame, music features] pairs, extracting the music features from the MIDI files.

Video Feature Extraction

The figure above shows our pipeline. Instead of generating music directly from game videos, we first use a VGG-like convolutional neural network to extract features from game frame images. Fully connected layers then map these image features to MIDI features such as the time signature (tse), beats per minute (bpm), and the number of instrument layers (energy).
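
A rough sketch of this network, assuming PyTorch; the VGG-style backbone is shortened and the head sizes are placeholders, not our exact architecture:

```python
# Frame -> MIDI feature prediction sketch (PyTorch assumed; sizes illustrative).
import torch
import torch.nn as nn

class FrameToMidiFeatures(nn.Module):
    def __init__(self, n_energy=5, n_tse=4):
        super().__init__()
        self.backbone = nn.Sequential(                 # shortened VGG-style 3x3 conv stack
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.bpm = nn.Linear(64, 1)                    # regression head: beats per minute
        self.energy = nn.Linear(64, n_energy)          # classification head: instrument layers
        self.tse = nn.Linear(64, n_tse)                # classification head: time signature

    def forward(self, frame):                          # frame: (B, 3, H, W)
        h = self.backbone(frame)
        return self.bpm(h).squeeze(1), self.energy(h), self.tse(h)

bpm, energy_logits, tse_logits = FrameToMidiFeatures()(torch.randn(2, 3, 224, 224))
```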

Training loss and accuracy of ‘energy’ feature
Training loss of ‘bpm’ feature
Training loss and accuracy of ‘tse’ feature

From the results above, we notice that the time signature is not a good target to predict from images. Almost all the data in our dataset share the same tse, so the “peak” in the curve likely comes from “noise”, i.e., the few images whose tse label differs from the rest. We conclude that tse does not describe image features well, so we only predict bpm and energy in the following steps.

These MIDI features can, to some extent, capture the mood and intensity of the music. We then feed them as parameters to Watson Beat, which produces music matching the input game frames. Here is a sample of our input game frames and the generated music. The music generated by Watson Beat clearly sounds more sophisticated and also coheres with the given frames.

Result music:

See source code.

Conclusion

In this project, we explore two approaches to generating music from video: generating music directly from frames (the GAN and VAE models) and predicting music features from frames and feeding them to Watson Beat. In both the GAN and VAE models, we use a convolutional neural network to extract image features and then generate music from them, either through a generator-discriminator scheme or through latent vectors. We obtain reasonable music pieces that generally correspond to the game video clips. Although some of them are fairly simple, they show that generating music from images or videos is feasible in concept.

Future work

There are many directions we could explore further. Personalization is one of them. For example, we could slightly modify our GAN model so that it generates trio music as the VAE model does, and we could explore the performance of an LC-GAN model or other customization approaches such as fine-tuning.

Applying “attribute vector” arithmetic is also a promising way to do personalization. By defining “music speed” as the average clock time of a note in a music segment, we can extract two subsets at the extremes of music speed and compute the direction between them in latent space as the “attribute vector of music speed.” We can then use this attribute vector to adjust output samples, or embed the parameter into a new network structure that adjusts music speed automatically.
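
A small sketch of this idea, assuming NumPy; the latent arrays below are placeholders standing in for MusicVAE codes of the fast and slow subsets:

```python
# "Attribute vector" arithmetic sketch (NumPy assumed; latents are placeholders).
import numpy as np

z_fast = np.random.randn(100, 512)      # latent codes of the fastest segments (placeholder)
z_slow = np.random.randn(100, 512)      # latent codes of the slowest segments (placeholder)

speed_vector = z_fast.mean(axis=0) - z_slow.mean(axis=0)   # attribute vector of music speed

def adjust_speed(z, alpha):
    """Move a latent code along the speed direction; decode with MusicVAE afterwards."""
    return z + alpha * speed_vector

z_faster = adjust_speed(np.random.randn(512), alpha=1.5)
```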

Reference

Generating Pokemon-Inspired Music from Neural Networks — Cory Nguyen, Ryan Hoff, Sam Malcolm, Won Lee, and Abraham Khan

Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models — Jesse Engel, Matthew D. Hoffman, Adam Roberts

MidiMe: Personalizing a MusicVAE model with user data — Monica Dinculescu, Jesse Engel, Adam Roberts
