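Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch.
They are basically using text-conditioned AudioLM, but surprisingly with the embeddings from a text-audio contrastive learned model named MuLan. MuLan is what will be built out in this repository, with AudioLM modified from the other repository to support the music generation needs here.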
Install
$ pip install musiclm-pytorch
Usage
MuLaN first needs to be trained
import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer
audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)
text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)
mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)
# get a ton of <sound, text> pairs and train
wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))
loss = mulan(wavs, texts)
loss.backward()
# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM
embeds = mulan.get_audio_latents(wavs) # during training
embeds = mulan.get_text_latents(texts) # during inference
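For intuition about what the contrastive loss returned by `mulan(wavs, texts)` is optimizing, here is a minimal sketch in plain Python. It is not the code used inside `MuLaN`. It assumes a square matrix of audio-to-text similarities with matching pairs on the diagonal, covers only the audio-to-text direction, and uses a temperature of 0.1 picked purely for illustration. The function name `contrastive_loss` and the `decoupled` flag are hypothetical. The flag shows the idea behind decoupled contrastive learning, where the positive pair is left out of the denominator.

```python
import math

def contrastive_loss(sims, temperature=0.1, decoupled=False):
    """Audio-to-text softmax contrastive loss (illustrative sketch only).

    sims[i][j] is the similarity between audio i and text j.
    Matching <sound, text> pairs are assumed to sit on the diagonal.
    The decoupled variant needs at least two pairs in the batch.
    """
    n = len(sims)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sims[i]]
        positive = logits[i]
        # decoupled variant: drop the positive pair from the denominator
        denom = [l for j, l in enumerate(logits) if not (decoupled and j == i)]
        # numerically stable log-sum-exp over the denominator terms
        m = max(denom)
        log_denom = m + math.log(sum(math.exp(l - m) for l in denom))
        total += log_denom - positive
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]      # each sound is closest to its own text
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # each sound is closest to the wrong text

loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
loss_decoupled = contrastive_loss(aligned, decoupled=True)
```

The loss is low when each sound is most similar to its own text and high otherwise. In this sketch the decoupled variant always yields a smaller value for the same matrix, since its denominator has one fewer term.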
Todo
[x] mulan seems to be using decoupled contrastive learning, offer that as an option
[ ] wrap mulan with mulan wrapper and quantize the output, project to audiolm dimensions
[ ] modify audiolm to accept conditioning embeddings, optionally take care of different dimensions through a separate projection
[ ] audiolm and mulan go into musiclm and generate, filter with mulan
[ ] add a version of mulan to open clip
[ ] set all the proper spectrogram hyperparameters
[ ] email some contrastive learning experts and figure out why some papers are sharing the projection from embeddings to latent space
[ ] improvise a bit and give the audio transformer a position generating module before each attention layer
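The "filter with mulan" item above could, once implemented, look roughly like the following. This is only a sketch under assumptions: plain Python lists stand in for the latents that `get_text_latents` and `get_audio_latents` would return, cosine similarity is assumed as the score, and `cosine_similarity` and `rank_candidates` are hypothetical helper names, not part of this repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(text_latent, audio_latents):
    """Return candidate indices sorted from best to worst match, plus raw scores."""
    scores = [cosine_similarity(text_latent, a) for a in audio_latents]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# toy latents: one text prompt and three generated audio candidates
text_latent = [1.0, 0.0]
audio_latents = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

order, scores = rank_candidates(text_latent, audio_latents)
best = order[0]  # index of the candidate closest to the text prompt
```

One would generate several candidates per prompt, embed each with the trained model, and keep only the top-ranked ones.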
Citations
@inproceedings{Agostinelli2023MusicLMGM,
    title  = {MusicLM: Generating Music From Text},
    author = {Andrea Agostinelli and Timo I. Denk and Zal{\'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank},
    year   = {2023}
}
@article{Huang2022MuLanAJ,
    title   = {MuLan: A Joint Embedding of Music Audio and Natural Language},
    author  = {Qingqing Huang and Aren Jansen and Joonseok Lee and Ravi Ganti and Judith Yue Li and Daniel P. W. Ellis},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2208.12415}
}
The only truth is music. - Jack Kerouac