
DiffSinger-KR

Hugging Face

This code is an implementation of DiffSinger for Korean. The algorithm is based on the DiffSinger paper.

Structure

  • The structure is based on DiffSinger, but with some minor changes.
    • The multi-head attention in the FFT block is replaced with linearized attention (see the sketch after this list).
      • Positional encoding is removed.
    • A duration embedding is added.
      • It is based on scaled positional encoding with a very low initial scale.
    • The aux decoder and the diffusion denoiser are trained at the same time, not in two stages.
  • Several hyperparameters and the data type are changed.
    • Either mel or spectrogram can be selected as the feature type.
    • The token type is changed from phoneme to grapheme.
    • Because of the supported vocoder, the model's sample rate is changed to 22050 Hz.
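
As a reference for the linearized attention mentioned above, here is a minimal sketch in the style of the linear transformer of Katharopoulos et al., assuming an elu(x) + 1 feature map. It illustrates the idea only; the repository's actual implementation may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearizedAttention(nn.Module):
    """Attention that is linear in sequence length, via a positive
    kernel feature map instead of the softmax."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(channels, channels * 3)
        self.projection = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, channels]
        batch, time, channels = x.size()
        queries, keys, values = self.qkv(x).chunk(3, dim=-1)
        # Split heads: [batch, heads, time, head_dim]
        split = lambda tensor: tensor.view(batch, time, self.num_heads, -1).transpose(1, 2)
        queries, keys, values = map(split, (queries, keys, values))
        # elu(x) + 1 keeps the feature map positive, playing the role of softmax.
        queries, keys = F.elu(queries) + 1.0, F.elu(keys) + 1.0
        # Aggregate keys and values once, so the cost is linear in the time axis.
        key_values = torch.einsum('bhtd,bhte->bhde', keys, values)
        normalizer = 1.0 / (torch.einsum('bhtd,bhd->bht', queries, keys.sum(dim=2)) + 1e-6)
        output = torch.einsum('bhtd,bhde,bht->bhte', queries, key_values, normalizer)
        return self.projection(output.transpose(1, 2).reshape(batch, time, channels))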

Supported dataset

Using  Dataset                                 Dataset Link
O      Children's Song Dataset                 Link
X      AIHub Korean Multi-Singer Song Dataset  Link
  • Some MIDI scores are fixed to match the notes to the wav's F0.
  • The CSD dataset is used for the training of the shared checkpoint.
  • Pattern_Generate.py supports the AIHub dataset, but it is not used for the training of the shared checkpoint.

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in Hyper_Parameters.yaml according to your environment.

  • Sound

    • Setting basic sound parameters.
  • Tokens

    • The number of lyric tokens.
  • Notes

    • The highest note value for embedding.
  • Durations

    • The highest duration value for embedding.
  • Genres

    • Setting the number of genres.
  • Singers

    • Setting the number of singers.
  • Duration

    • Min duration is used only during pattern generation.
    • Max duration decides the maximum time step of the model. The MLP mixer always uses the maximum time step.
    • Equality sets the strategy for dividing a syllable's duration among its graphemes (see the sketch after this list).
      • When True, the onset, nucleus, and coda have the same length, or differ by at most 1.
      • When False, the onset and coda have a length of Consonant_Duration, and the nucleus has duration - 2 * Consonant_Duration.
  • Feature_Type

    • Setting the feature type (Mel or Spectrogram).
  • Encoder

    • Setting the encoder (embedding).
  • Diffusion

    • Setting the Diffusion denoiser.
  • Train

    • Setting the parameters of training.
  • Inference_Batch_Size

    • Setting the batch size for inference.
  • Inference_Path

    • Setting the inference path.
  • Checkpoint_Path

    • Setting the checkpoint path.
  • Log_Path

    • Setting the TensorBoard log path.
  • Use_Mixed_Precision

    • Setting whether mixed precision is used.
  • Use_Multi_GPU

    • Setting whether multiple GPUs are used.
    • Because of an nvcc problem, only Linux supports this option.
    • If this is True, the Device parameter must also list multiple GPUs, like '0,1,2,3'.
    • The training command also changes; please check multi_gpu.sh.
  • Device

    • Setting which GPU devices are used in a multi-GPU environment.
    • To use the CPU only, set '-1' (not recommended for training).
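
As a reference for the Duration Equality option above, here is a small sketch of the two strategies. The function name and the rounding details are hypothetical; the repository's actual logic may differ.

def distribute_syllable_duration(duration, equality, consonant_duration):
    # Splits one syllable's duration across its onset, nucleus, and coda graphemes.
    # Hypothetical helper illustrating the two 'Equality' strategies only.
    if equality:
        # Onset, nucleus, and coda get equal shares, differing by at most 1.
        base, remainder = divmod(duration, 3)
        return [base + (1 if index < remainder else 0) for index in range(3)]
    # Onset and coda are fixed to Consonant_Duration; the nucleus takes the rest.
    return [consonant_duration, duration - 2 * consonant_duration, consonant_duration]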

Generate pattern

python Pattern_Generate.py [parameters]

Parameters

  • -csd
    • The path of the Children's Song Dataset.
  • -am
    • The path of the AIHub multi-singer song dataset.
  • -step
    • The note step that is explored when generating patterns.
    • The smaller the step is, the more patterns are created from one song.
  • -hp
    • The path of the hyperparameter file.
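
For example, a pattern generation run over the CSD dataset could look like the following; the dataset path and step value are placeholders for your environment:

python Pattern_Generate.py -csd /path/to/CSD -step 1 -hp Hyper_Parameters.yaml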

Training

Command

Single GPU

python Train.py -hp <path> -s <int>
  • -hp <path>

    • The hyperparameter file path.
    • This is required.
  • -s <int>

    • The resume step parameter.
    • Default is 0.
    • If the value is 0, the model tries to find the latest checkpoint.
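
For example, the following command resumes training from an existing checkpoint; the step value is a placeholder:

python Train.py -hp Hyper_Parameters.yaml -s 100000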

Multi GPU

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322

Inference

Checkpoint

TODO
