This code is an implementation of DiffSinger for Korean. The algorithm is based on the following papers:
- Liu, J., Li, C., Ren, Y., Chen, F., & Zhao, Z. (2022, June). Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 11020-11028).
- Xiao, Y., Wang, X., He, L., & Soong, F. K. (2022, May). Improving Fastspeech TTS with Efficient Self-Attention and Compact Feed-Forward Network. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7472-7476). IEEE.
- Structure is based on DiffSinger, but I made some minor changes:
  - The multi-head attention in the FFT Block is changed to linearized attention (see the sketch below).
  - Positional encoding is removed.
  - Duration embedding is added.
    - It is based on scaled positional encoding with a very low initial scale.
- The aux decoder and the diffusion model are trained at the same time, not in two stages.
- I changed several hyperparameters and data types.
- Either mel or spectrogram can be selected as the feature type.
- The token type is changed from phoneme to grapheme.
- Because of the supported vocoder, I changed the model's sample rate to 22050Hz.
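As a rough illustration of the linearized-attention change mentioned above, here is a minimal PyTorch sketch using the `elu(x) + 1` feature map common in linear-attention formulations; the exact kernel and tensor layout used in this repository may differ.

```python
import torch
import torch.nn.functional as F

def linearized_attention(query, key, value, eps=1e-6):
    # query, key, value: [batch, time, channels]
    # Feature map phi(x) = elu(x) + 1 keeps all entries positive,
    # so softmax can be dropped and attention becomes O(time) instead of O(time^2).
    q = F.elu(query) + 1.0
    k = F.elu(key) + 1.0
    kv = torch.einsum('btc,btd->bcd', k, value)               # aggregate keys and values once
    normalizer = torch.einsum('btc,bc->bt', q, k.sum(dim=1))  # per-position normalization
    return torch.einsum('btc,bcd->btd', q, kv) / (normalizer.unsqueeze(-1) + eps)

# Example: self-attention over 100 frames of 256-channel features.
x = torch.randn(2, 100, 256)
y = linearized_attention(x, x, x)  # -> [2, 100, 256]
```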
| Using | Dataset | Dataset Link |
|---|---|---|
| O | Children's Song Dataset | Link |
| X | AIHub Korean Multi-Singer Song Dataset | Link |
- I fixed some MIDI scores to match the notes to the F0 of the wav files.
- The CSD dataset is used for training the shared checkpoint.
- Pattern_Generate.py supports the AIHub dataset, but I did not use it for training the shared checkpoint.
Before proceeding, please set the pattern, inference, and checkpoint paths in Hyper_Parameters.yaml according to your environment.
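For example, the path-related entries might look like the following. This is only a sketch: the key names come from the descriptions in this section, but the values are placeholders and the exact layout of Hyper_Parameters.yaml in this repository may differ.

```yaml
# Illustrative values only; adjust to your environment.
Feature_Type: 'Mel'                       # 'Mel' or 'Spectrogram'
Inference_Batch_Size: 16
Inference_Path: './results/Inference'     # where inference outputs are written
Checkpoint_Path: './results/Checkpoint'   # where checkpoints are saved and loaded
Log_Path: './results/Log'                 # TensorBoard log path
Use_Mixed_Precision: true
Use_Multi_GPU: false                      # Linux only; see multi_gpu.sh
Device: '0'                               # e.g. '0,1,2,3' for multi-GPU, '-1' for CPU
```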
- Sound
  - Setting basic sound parameters.
- Tokens
  - The number of lyric tokens.
- Notes
  - The highest note value for embedding.
- Durations
  - The highest duration value for embedding.
- Genres
  - Setting the number of genres.
- Singers
  - Setting the number of singers.
- Duration
  - Min duration is used only during pattern generation.
  - Max duration determines the maximum time step of the model. The MLP mixer always uses the maximum time step.
  - Equality sets the strategy for distributing a syllable's duration over its graphemes (see the sketch after this list).
    - When `True`, onset, nucleus, and coda have the same length, or differ by at most ±1.
    - When `False`, onset and coda have `Consonant_Duration` length, and the nucleus has `duration - 2 * Consonant_Duration`.
- Feature_Type
  - Setting the feature type (`Mel` or `Spectrogram`).
- Encoder
  - Setting the encoder (embedding).
- Diffusion
  - Setting the diffusion denoiser.
- Train
  - Setting the training parameters.
- Inference_Batch_Size
  - Setting the batch size at inference.
- Inference_Path
  - Setting the inference path.
- Checkpoint_Path
  - Setting the checkpoint path.
- Log_Path
  - Setting the TensorBoard log path.
- Use_Mixed_Precision
  - Setting whether mixed precision is used.
- Use_Multi_GPU
  - Setting whether multiple GPUs are used.
  - Because of an nvcc problem, only Linux supports this option.
  - If this is `True`, the `Device` parameter must also list multiple devices, like '0,1,2,3'.
  - The training command also changes; please check `multi_gpu.sh`.
- Device
  - Setting which GPU devices are used in a multi-GPU environment.
  - Or, to use only the CPU, set '-1'. (But I don't recommend this for training.)
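As a rough illustration of the two `Equality` strategies described above, here is a minimal Python sketch. The function name and the exact rounding behavior are assumptions for illustration, not the repository's actual code.

```python
def split_syllable_duration(duration, equality, consonant_duration=2):
    """Split a syllable's duration into (onset, nucleus, coda) lengths.

    duration: total time steps assigned to the syllable.
    equality: mirrors the Duration/Equality hyperparameter.
    consonant_duration: mirrors the Consonant_Duration hyperparameter.
    """
    if equality:
        # Near-equal thirds; the remainder keeps the parts within +/-1 of each other.
        base, remainder = divmod(duration, 3)
        onset = base + (1 if remainder > 0 else 0)
        nucleus = base + (1 if remainder > 1 else 0)
        coda = base
    else:
        # Fixed-length consonants; the nucleus absorbs the rest.
        onset = coda = consonant_duration
        nucleus = duration - 2 * consonant_duration
    return onset, nucleus, coda

# Example: 10 steps split as (4, 3, 3) with Equality, or (2, 6, 2) without.
print(split_syllable_duration(10, equality=True))
print(split_syllable_duration(10, equality=False))
```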
```
python Pattern_Generate.py [parameters]
```
- `-csd`
  - The path of the Children's Song Dataset.
- `-am`
  - The path of the AIHub Multi-Singer Song Dataset.
- `-step`
  - The note step that is explored when generating patterns.
  - The smaller the step, the more patterns are created from one song.
- `-hp`
  - The path of the hyperparameter file.
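For example, a typical invocation might look like this (the dataset path and step value here are placeholders, not recommendations):

```
python Pattern_Generate.py -csd /path/to/CSD -step 1 -hp Hyper_Parameters.yaml
```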
```
python Train.py -hp <path> -s <int>
```
- `-hp <path>`
  - The hyperparameter file path.
  - This is required.
- `-s <int>`
  - The resume step parameter.
  - Default is `0`.
  - If the value is `0`, the model tries to find the latest checkpoint.
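For example, to resume training from a specific checkpoint step (the step value here is only an illustration):

```
python Train.py -hp Hyper_Parameters.yaml -s 100000
```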
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322
```
- I recommend checking multi_gpu.sh.
- Please check Inference.ipynb.
- Please check the Huggingface Space.
- Future work: multi-singer version training with the AIHub Multi-Singer Song Dataset.