support QAT, export and inference for quantized BERT, GPT2 #285

Merged: 53 commits into master on May 7, 2022

Conversation

godweiyang
Collaborator

No description provided.

@godweiyang changed the title from "optimize examples for convenient usage" to "support bert QAT and int8 inference" on Mar 17, 2022
@godweiyang changed the title from "support bert QAT and int8 inference" to "support QAT, export and inference for BERT, GPT2" on Apr 18, 2022
@godweiyang changed the title from "support QAT, export and inference for BERT, GPT2" to "support QAT, export and inference for quantized BERT, GPT2" on Apr 27, 2022
@neopro12 merged commit 4024ae1 into master on May 7, 2022
neopro12 added a commit that referenced this pull request on Dec 5, 2022
* Quantized Transformer pre-release (#235)

* add int8 for inference (ffn)

* quantize weights

* add compile option for int8

* add ffn int8 gemm for decoder

* add int8 gemm for qkv

* add int8 gemm for decoder

* load int8 pb (stage 1)

* load int8 pb (stage 2)

* use unsigned char to represent uint8

* add encoder clip_max

* add decoder clip_max

* fix decoder q project shape bug

* fix act kernel bug

* support compile with cuda 11

* remove redundant include

* compile using c++14

* support compile for training using cuda 11 and fix cub bug

* First try CublasLt

* Weight add transform to col32t

* i8 in i32 output

* Lt int8 all bug fixed

* add int8 logit gemm

* fix encoder int32_out_buf allocate bug

* modify float2int8 calculation

* Replace cublas int8 to lt

* Add fuse_residual_layer_norm kernel

* add scaled_colsum kernel of ffn2 weight, remove old version of cublas gemm (BLEU TEST OK)

* modify clip range of relu to [0, c]

* TODO: _trg_vocab_size must be 4x

* Replace all int8 cublas to cublasLt

* use relu clip range (0, c)

* remove default algo for cublasLt

* add test for cublas and tvm

* add more info for cublas gemm unit test

* load clip_max of gemm i8 out

* add int8 gemm io of logits

* add i8 gemm out of ker_arrange_decself_qkv

* split the beam search int kernel into two different versions (i8 and i32)

* finish decoder i8 gemm out (relu and ffn2 gemm bug)

* finish encoder i8 gemm out (relu and ffn2 gemm bug)

* fix scale bug

* Unify all variable names and function names

* delete useless int8 kernels and add all col32 options

* delete useless int8 kernels and add all col32 options

* add test for batch gemm

* do not test cublaslt when bsz>1

* add more output info for cublas test

* polish cublas test code

* update proto and add clip_max for dec-self-attn-qkv-bias-out

* restore i32 out of ffn2 gemm out for relu clipmax(0, c)

* Int8 cache

* Optimize int8 refresh cache

* Fix wrong result when vocab % 32 != 0

* fix mixed fp32 and int8 bug

* modify dequant calculation equation

* rename round_up

* Add cublaslt gemm w/o mma

* Optimize for batch size 1 with regular gemm

* Remove redundant code

* update quant transformer proto

* Refactor quantized Transformer code

* rename quant_transformer.proto

* rename quant_transformer.proto

* recover unneeded change

Co-authored-by: xiongying.taka <[email protected]>
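The commits above revolve around per-tensor clip_max quantization (float2int8, dequantize, clip ranges). A minimal, hypothetical sketch of that scheme follows; the function names are illustrative, not LightSeq's actual API:

```python
# Illustrative only: symmetric int8 quantization with a per-tensor clip_max,
# roughly the float2int8/dequantize scheme referenced in the commits above.
import torch

def float2int8(x: torch.Tensor, clip_max: float) -> torch.Tensor:
    """Map float values in [-clip_max, clip_max] to int8 in [-127, 127]."""
    scale = 127.0 / clip_max
    x = x.clamp(-clip_max, clip_max)
    return torch.round(x * scale).to(torch.int8)

def dequantize(q: torch.Tensor, clip_max: float) -> torch.Tensor:
    """Map int8 values back to float using the same clip_max."""
    return q.float() * (clip_max / 127.0)

w = torch.randn(4, 4)
clip_max = w.abs().max().item()   # per-tensor clip range
w_q = float2int8(w, clip_max)     # what gets stored in the int8 model file
w_hat = dequantize(w_q, clip_max) # approximate reconstruction at load time
```

Weights typically use their own maximum absolute value as the clip range, while activations use a clip_max that is calibrated or learned, which is why the proto gains explicit clip_max fields.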

* Update GPU build toolchain  (#258)

* modify build & support sm_80 and sm_86

* build with glibc 2.24

* Repair for manylinux_2_24

* Fix format

Co-authored-by: zhoubofan <[email protected]>
Co-authored-by: Ying Xiong <[email protected]>

* Optimize QuantTransformer GPU memory (#264)

* add int8 for inference (ffn)

* quantize weights

* add compile option for int8

* add ffn int8 gemm for decoder

* add int8 gemm for qkv

* add int8 gemm for decoder

* load int8 pb (stage 1)

* load int8 pb (stage 2)

* use unsigned char to represent uint8

* add encoder clip_max

* add decoder clip_max

* fix decoder q project shape bug

* fix act kernel bug

* support compile with cuda 11

* remove redundant include

* compile using c++14

* support compile for training using cuda 11 and fix cub bug

* First try CublasLt

* Weight add transform to col32t

* i8 in i32 output

* Lt int8 all bug fixed

* add int8 logit gemm

* fix encoder int32_out_buf allocate bug

* modify float2int8 calculation

* Replace cublas int8 to lt

* Add fuse_residual_layer_norm kernel

* add scaled_colsum kernel of ffn2 weight, remove old version of cublas gemm (BLEU TEST OK)

* modify clip range of relu to [0, c]

* TODO: _trg_vocab_size must be 4x

* Replace all int8 cublas to cublasLt

* use relu clip range (0, c)

* remove default algo for cublasLt

* add test for cublas and tvm

* add more info for cublas gemm unit test

* load clip_max of gemm i8 out

* add int8 gemm io of logits

* add i8 gemm out of ker_arrange_decself_qkv

* split the beam search int kernel into two different versions (i8 and i32)

* finish decoder i8 gemm out (relu and ffn2 gemm bug)

* finish encoder i8 gemm out (relu and ffn2 gemm bug)

* fix scale bug

* Unify all variable names and function names

* delete useless int8 kernels and add all col32 options

* delete useless int8 kernels and add all col32 options

* add test for batch gemm

* do not test cublaslt when bsz>1

* add more output info for cublas test

* polish cublas test code

* update proto and add clip_max for dec-self-attn-qkv-bias-out

* restore i32 out of ffn2 gemm out for relu clipmax(0, c)

* Int8 cache

* Optimize int8 refresh cache

* Fix wrong result when vocab % 32 != 0

* fix mixed fp32 and int8 bug

* modify dequant calculation equation

* rename round_up

* Add cublaslt gemm w/o mma

* Optimize for batch size 1 with regular gemm

* Remove redundant code

* update quant transformer proto

* Refactor quantized Transformer code

* rename quant_transformer.proto

* rename quant_transformer.proto

* recover unneeded change

* Remove useless code

* Optimize quant_transformer gpu memory usage

* Fix encoder init buffer bug

* clean comments

* optimize decoder embedding memory

* fix multilingual emb quant bug

* fix nullptr buffer bug

* fix nullptr buffer bug

* optimize encoder embedding memory

* fix emb dequant dtype bug

* fix emb dequant dtype bug

* delete redundant argument of quant_weight

* support post training quantization for lightseq training models

* update ptq example README

* format README

* add README for int8 speed comparison

Co-authored-by: xiongying.taka <[email protected]>

* Update README.md (#267)

* lightseq training for gpt2 (#272)

* lightseq training support gpt

* remove comments

* fix softmax default mask_future

* remove comments

* remove comments

* Support training on both cuda 10 and 11 (#274)

* support training on both cuda 10 and 11

* format

Co-authored-by: Ying Xiong <[email protected]>

* Optimize QuantTransformer implementation (#269)

* support ptq of fairseq+lightseq training models

* support dynamic weight_clip_max, adjust ls_fs_transformer_ptq_export act_clip_max

* fix hf training padding mask bug

* add requirements for fairseq examples

* format code

* replace quant scaled emb with quant emb

* fix hidden_dim bug, use quant emb in ptq export

* modify round(x) to floor(x+0.5)

* format

Co-authored-by: Ying Xiong <[email protected]>
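One of the commits above swaps round(x) for floor(x + 0.5). The difference only shows up on ties, since torch.round rounds half to even; a tiny illustration:

```python
# Illustrative only: why "modify round(x) to floor(x+0.5)" changes results.
# torch.round uses round-half-to-even, while floor(x + 0.5) rounds halves up,
# so ties such as 2.5 quantize to different integers.
import torch

x = torch.tensor([0.5, 1.5, 2.5, 3.5])
print(torch.round(x))        # tensor([0., 2., 2., 4.])  half-to-even
print(torch.floor(x + 0.5))  # tensor([1., 2., 3., 4.])  half-up
```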

* Add torch QAT (#275)

* Add torch layer for training

* fix torch layer bug

* torch ls_transformer support qat

* add fairseq quant example script

Co-authored-by: Yang Wei <[email protected]>
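The torch QAT layers rely on fake quantization: the forward pass quantizes and immediately dequantizes, while the backward pass uses a straight-through estimator. A minimal generic sketch under those assumptions (not LightSeq's actual module):

```python
# Illustrative only: fake quantization with a straight-through estimator,
# the basic trick behind quantization-aware training (QAT).
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, clip_max):
        ctx.save_for_backward(x)
        ctx.clip_max = clip_max
        scale = clip_max / 127.0
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale                      # dequantized value used downstream

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # straight-through estimator: zero gradient outside the clip range
        mask = (x.abs() <= ctx.clip_max).to(grad_out.dtype)
        return grad_out * mask, None

x = torch.randn(8, requires_grad=True)
y = FakeQuant.apply(x, 2.0)
y.sum().backward()                            # gradients flow only inside the clip range
```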

* Fix training missing logging (#277)

* fix missing logging caused by pytorch-quantization

* fix format

* Support MoE Inference (#280)

* init moe

* update pywrapper

* add fairseq export example

* add example of exporting moe

* fix bug of shared_bias

* delete one logger

* change format of example.

* update export README

Co-authored-by: zhangzhexi <[email protected]>

* support fairseq export (#278)

* fix torch fake quant positions

* fix torch decoder self attn quant position

* rename scripts

* modify lightseq arguments

* add find-unused-parameters for ls_torch_fairseq training

* finetune quant model from pretrained fp16 model

* fairseq generate using sacrebleu

* support native fairseq export

* polish export code

* support converting pb to hdf5

* support ls_torch_fairseq_quant export (stage 1)

* fix typo

* fix fake quant relu compute bug

* fix export bug

* delete useless proto keys

* add ls_torch_fairseq ptq export, fix encdec_attn kv quant bug

* fix qat export bug

* modify ptq act_clip_max

* support fairseq generate using lightseq inference

* support native fairseq ptq export

* modify README.md
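The export commits mention converting protobuf checkpoints to HDF5. A minimal sketch of writing weights into an HDF5 file with h5py; the dataset names here are hypothetical, since the real export scripts define their own layout:

```python
# Illustrative only: dumping model weights and clip_max scalars into HDF5,
# in the spirit of the "support converting pb to hdf5" commit.
import h5py
import numpy as np

weights = {
    "encoder/layer_0/self_attn/qkv_weight": np.random.rand(768, 2304).astype(np.float32),
    "encoder/layer_0/self_attn/qkv_clip_max": np.float32(1.5),
}

with h5py.File("model.hdf5", "w") as f:
    for name, value in weights.items():
        f.create_dataset(name, data=value)  # nested names create HDF5 groups automatically
```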

* Triton backend rebase (#301)

* tritonbackend for lightseq

remove useless type qualifiers

tritonbackend README

format

update psf/black

fix code format

update tritonbackend README

fix readme format

fix README format

fix README format

fix README format

adapt README

add empty directories which are needed by triton

* format

Co-authored-by: zhoubofan <[email protected]>

* publish tritonserver image & update README (#303)

* tritonbackend for lightseq

remove useless type qualifiers

tritonbackend README

format

update psf/black

fix code format

update tritonbackend README

fix readme format

fix README format

fix README format

fix README format

adapt README

add empty directories which are needed by triton

* format

* add tritonserver_lightseq image

* format

* fix the issue that compiling the python extension is too slow

* fix format

* update README

Co-authored-by: zhoubofan <[email protected]>

* support ViT model (#299)

* init example of vit training

* init vit proto

* init patch emb kernel

* init model and pywrapper

* fix bugs

* init vit export example

* fix export bug

* fix blockreducesum bug

* update export

* fix last layernorm

* remove redundant ispostln

* update readme and test example

* with_lightseq true

* delete useless moeKernel

* support channel*patch*patch>=1024

* update pre-commit and code format

* update format of run_vit.sh

* modify Dockerfile to compile tritonbackend (#305)

Co-authored-by: zhoubofan <[email protected]>

* support QAT, export and inference for quantized BERT, GPT2 (#285)

* modify readme of examples

* modify table in example readme

* add cpp example of quant_transformer

* support huggingface bert ptq (stage 1)

* fix huggingface bert weight loading fp16 bug

* finetune quant bert from fp16 ckpt

* add emb quant of bert

* add example of hf bert squad training, modify dir of huggingface training

* format

* rename huggingface dir to fix conflict with datasets

* fix typo of gpt

* export fairseq models to hdf5

* quant hdf5 load (stage 1)

* quant hdf5 transformer finished

* fix fairseq infer bug

* export quant bert, delete hf quant pos emb

* add quant bert files

* support quant bert inference (not test)

* fix quant bert export name bug

* support quant bert inference

* update black pre-commit version

* add quant bert test example

* support cpp quant bert example

* format

* modify readme

* do not use ffn2 out quant if using gelu

* polish gemm test

* fix gemm test lt col bug

* support gpt2 qat

* add causal mask for gpt encoder

* support quant gpt export

* add quant gpt required files

* support quant gpt inference (stage 1)

* add fake quant for logits gemm

* support quant gpt inference (stage 2)

* support quant gpt inference (stage 3)

* support quant gpt inference (ppl)

* support quant gpt inference (TODO: fix qkv bias out clip_max, sampling)

* support quant gpt inference (ppl)

* support quant gpt inference (sampling)

* support quant decoder sampling

* modify readme (add install command)

* optimize quant gpt gemm, fix gelu bug

* optimize cpp example

* replace quant gpt cache memcpy with pointer switch

* fuse quant gpt softmax kernel

* optimize quant gpt arrange-qkv kernel

* fix PyPI spelling

* fix gpt memory spelling
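Among the commits above, "add causal mask for gpt encoder" refers to the standard auto-regressive attention mask. A plain PyTorch sketch of that masking (illustrative only, not the fused CUDA kernel):

```python
# Illustrative only: causal masking of attention scores so each token can only
# attend to itself and earlier positions, as required by GPT-style decoding.
import torch

def causal_attention_probs(q, k):
    # q, k: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    seq_len = scores.size(-1)
    # upper-triangular positions (future tokens) are masked out
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

q = k = torch.randn(1, 2, 5, 8)
probs = causal_attention_probs(q, k)  # each row attends only to current and past tokens
```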

* hf bart training and inference (#316)

* update transformer_decoder_layer.py for hugging face

* huggingface bert training

* huggingface bart and bert training

* huggingface bart and bert training

* update bart,bert example and fix bugs

* update example

* fix bugs

* fix bugs

* format update

* format update

* remove return decoder cache

* format update

* update ner and qa

* fix bugs

* format update

* Optimising the cache of the decoder

Co-authored-by: duanrenchong <[email protected]>

* add encTdecT tagging (multilg_type=3) for multilingual translation (#313)

* add encTdecT tagging (multilg_type=3) for multilingual translation

* format code

Co-authored-by: Yang Wei <[email protected]>

* add export and test for xglm, add extra_decode_length for gpt inference (#317)

* add extra_decode_length for gpt2 sampling

* add extra_decode_length for incoder(xglm) sampling

* change styles for incoder files

* remove useless comments for ls_incoder.py

* change name from incoder to xglm

* change function names for xglm

* remove comments for XGLM

* remove useless lines

* modify default topp of xglm

* fix bug in xglm export

Co-authored-by: lidao <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: Ying Xiong <[email protected]>
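The sampling-related commits above expose a topp parameter, i.e. top-p (nucleus) sampling. A generic sketch of the idea, assuming simple unbatched logits rather than the fused inference kernel:

```python
# Illustrative only: top-p (nucleus) sampling keeps the smallest set of tokens
# whose cumulative probability exceeds p, then samples from that set.
import torch

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep tokens whose preceding cumulative mass is below p (always keep the top one)
    keep = cumulative - sorted_probs < p
    keep[0] = True
    kept_probs = sorted_probs * keep
    kept_probs /= kept_probs.sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return sorted_idx[choice].item()

next_token = top_p_sample(torch.randn(32000), p=0.9)
```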

* [WIP] New arch (#320)

* modify Dockerfile to compile tritonbackend

* csrc change file directory

* ops split into ops & layers

* split definition and declaration

* format

Co-authored-by: zhoubofan <[email protected]>

* Fix test unit (#321)

* modify Dockerfile to compile tritonbackend

* fix test_ls_ops

Co-authored-by: zhoubofan <[email protected]>

* support trainable positional embedding (#323)

* Support trainable positional embedding

* fix bug

Co-authored-by: duanrenchong <[email protected]>

* fine-tune bart (#333)

* Support trainable positional embedding

* fix bug

* update convert fs to hf

* support fairseq finetune bart and export

* fix bugs

* add cnn dm script

* fix bugs

* fix

Co-authored-by: duanrenchong <[email protected]>

* fix quant and torch layer bugs (#335)

Co-authored-by: duanrenchong <[email protected]>

* support shard databin and hdfs for fairseq (#344)

* init streaming dataset

* add bash file

* update bash

* update bash

* update bash

* update bash

* fix hdfs download and loop bugs

* small tips

* fixes

* update shell files

Co-authored-by: duanrenchong <[email protected]>

* new_arch (#352)

* new_arch

* fix format error

Co-authored-by: zhoubofan <[email protected]>

* add lsflow (#355)

* [WIP] New arch develop (#353)

* new_arch

* fix format error

* use cuda_malloc

* fix format

* adapt xiaohui's mr

* temp

* fix error

* format

* add pybind_op

* add layer_normalize

* pass operator test

* fix format

* format

* remove useless files

* remove useless commit

* [WIP] New arch develop - add new operators (#358)

* new_arch

* fix format error

* use cuda_malloc

* fix format

* adapt xiaohui's mr

* temp

* fix error

* format

* add pybind_op

* add layer_normalize

* pass operator test

* fix format

* format

* remove useless files

* remove useless commit

* add FeedForwardOp

* add ops new

* format

* remove useless modify

* remove useless modify

* fix compatibility with torch > 1.10 (#359)

torch has removed the <THC/THCGeneral.h> file, which raises an error when using ls_adam.

* add new kernels & operators (#360)

* add new kernels

* format

* add kernel - transform and softmax

* format

* fix operator error - Transform0213

* remove useless commit

* residual grad

* fix normalize residual grad

* remove useless file

* fix error

* support Google T5 and MT5 model (#362)

* add extra_decode_length for gpt2 sampling

* add extra_decode_length for incoder(xglm) sampling

* change styles for incoder files

* remove useless comments for ls_incoder.py

* change name from incoder to xglm

* t5 support: T5LayerNorm supported

* Finish relative position bias

* T5 support: same decoder output with pytorch version

* fix bug

* T5 support: correct for batch inference, but beam size > 1 not tested

* fix bug where padding leads to errors when batch size > 1

* make head_num be param instead of magic number

* support t5-base

* move two files to correct path

* remove comment for most t5 files

* change relative_attention_num_buckets from magic number to variable

* make max_step variable to save GPU memory and speed up decoding.

* fix bugs

* restore spaces in xglm

* change for styling issues

* restore unneed bart change

* copy to mt5_export

* change proto file (add ffn_third_kernel)

* basic export function

* add exporting lm_head

* add mt5 files in inference/model/*

* add mt5 in inference/proto folder

* add mt5 in inference/pywrapper folder

* support first gelu then second XW2 added to XW1

* fix bug in gelu first and mat element-wise multiply

* fix export no ffn_third_kernel bug

* same for ffn layer output

* remove _logit_scaler as self.config.tie_word_embeddings=False

* try to change decoder into gated ffn

* fix bug

* update for debugging lm_head

* change ls_mt5.py to not run lightseq and huggingface at the same time

* change format

* try to support protobuf read for T5 model

* fix format problem

* MT5 protocol buffer does not support added prompts yet

Co-authored-by: weiyang.god <[email protected]>
Co-authored-by: lidao <[email protected]>
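The T5/MT5 commits add a gated feed-forward network ("first gelu", an element-wise multiply, and an extra ffn_third_kernel projection). A minimal PyTorch sketch of that FFN shape, with illustrative weight names:

```python
# Illustrative only: the gated-GeLU FFN used by T5.1.1 / MT5:
# gelu(x @ W1) * (x @ W2), then a third projection back to d_model.
import torch
import torch.nn.functional as F

def gated_gelu_ffn(x, w1, w2, w3):
    # x: [batch, seq, d_model]; w1, w2: [d_model, d_ff]; w3: [d_ff, d_model]
    gate = F.gelu(x @ w1)         # "first gelu"
    linear = x @ w2               # second projection, no activation
    return (gate * linear) @ w3   # element-wise multiply, then project back

x = torch.randn(1, 4, 512)
w1, w2 = torch.randn(512, 1024), torch.randn(512, 1024)
w3 = torch.randn(1024, 512)
y = gated_gelu_ffn(x, w1, w2, w3)  # [1, 4, 512]
```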

* Crf (#364)

* format code

* add viterbi kernel

* stage 1

* stage 2

* fix acc bug

* fix crf batch size

* finish viterbi
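The CRF commits implement Viterbi decoding as a CUDA kernel; the algorithm itself, for a single sequence, looks roughly like this plain PyTorch sketch:

```python
# Illustrative only: Viterbi decoding for a linear-chain CRF, unbatched.
import torch

def viterbi_decode(emissions, transitions):
    # emissions: [seq_len, num_tags], transitions[i, j]: score of tag i -> tag j
    seq_len, num_tags = emissions.shape
    score = emissions[0]                      # best score ending in each tag so far
    history = []
    for t in range(1, seq_len):
        # candidate score for (prev tag i, current tag j)
        next_score = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        history.append(next_score.argmax(dim=0))   # best previous tag for each tag j
        score = next_score.max(dim=0).values
    # backtrack from the best final tag
    best_tag = score.argmax().item()
    path = [best_tag]
    for best_prev in reversed(history):
        best_tag = best_prev[best_tag].item()
        path.append(best_tag)
    return list(reversed(path))

tags = viterbi_decode(torch.randn(6, 5), torch.randn(5, 5))
```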

* New arch develop - encoder layer forward (#366)

* add transformer_encoder_layer

* 2022.08.22 modify

* lightseq new arch develop

* temporary develop

* fix error

* remove useless log

* format

* Ls new arch - finish bert (#369)

* complete bert develop

* format

* LightSeq QAT (#307)

* ls embedding support qat

* [WIP]ls transformer qat

* fix fairseq transformer cli shape bug of output projection

* ln_bw_i8 test passed!

* test with_mean of ln_i8

* ls encoder attn add qat

* dropout_relu_bias_i8 passed!

* dropout_gelu_bias unit test passed!

* dropout_relu_bias_bwd_i8 passed!

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* polish unit test

* [WIP] ls encoder qat test

* quant_bias_add_transform_20314, quant_transform4d_0213 unit test passed!

* fix unit test bug

* [WIP] ls encoder qat unit test

* fix bug

* set default module to disable quant, fix bugs in examples

* fix encoder bug

* encoder qat test pass

* decoder qat forward test pass

* fix bug in encoder bw

* fix bug of cmax grad

* fix bug of act mask

* fix bug in tensor quantizer

* fix cmax grad bug

* [WIP] decoder support qat

* ls decoder qat pass

* ls encoder qat pass

* add unit test for quant bert encoder

* fix memory bug

* fix cmax grad bug in huggingface

* quant bert enc fw&bw test passed!

* fix hf cmax export bug

* fix fairseq out_proj bug

* fix fairseq shell bug

* fix decoder mem bug

* modify initial lr of fairseq quant training

* decoupled qat code

* modify huggingface training scripts

* add cmax grad

* delete enc_kv output quant

* modify ffn2gemm quant like inference

* fuse dequantize

* fix post ln mem bug

* add decoder self attn qkv cache quant

* export quant model (stage 1)

* export quant model (stage 2)

* export quant model (stage 3)

* support vit quant train

* add gradient clip

* fix hf export bug

* fix quant gpt bug

* support quant gpt training

* modify huggingface training scripts

* support ls bert, gpt export

* support custom quant transformer export

* optimize ffn fake quant and dcmax

* support quant gpt export

* support quant vit export

* add quant linear layer

* fix quant linear layer bug

* support quant vit infer

* speed up cublas igemm on A100 (by huxingwu)

* optimize ls_quant_dropout_act_bias_bwd_kernel

* polish training gemm algo code

* support gemm best algo search on different GPUs and shapes

* search in the range (min_bsz, 512, 1) and (512, max_bsz, 32)

* add configs_sm75/h512_i2048_b1-10016.json

* support col32 igemm

* add configs_sm75/h768_i3072_b1-10016.json

* add configs_sm80/h512_i2048_b1-10016.json

* add configs_sm75/h1024_i4096_b1-10016.json

* add configs_sm80/h768_i3072_b1-10016.json

* fix syntax error

* configs_sm80/h1024_i4096_b1-10016.json

* modify gemm test config format

* merge all the configs to one

* support search all shapes which are not in the config

* polish the merged config

* add cublas_algo_map cpp code

* move get_sm func to lightseq kernels

* move gemm_test to lightseq ops

* modify default config dir, fix algo_map bug

* fix col32 bug

* col major igemm become default

* fix dcmax kernel bug

* loosen cuda 11.6 requirement

* add vit cpp example

* fix bug from col32 gemm and a100 tuned col gemm

* support training encoder qkv_linear auto-tune gemm (in comment)

* add required header file

* dynamically use col32 or col4 on different GPUs

* fix multidefinition bug

* fix weight transform col32 bug

* add best algo for inference gemm (in comments)

* support easy benchmark for gpt and transformer

* support benchmark for huggingface

* fix embedding clip_max bug

* ls quant linear supports more shapes

* fix quant linear bug

* fix quant linear bug

* update pad function for older torch

* fix quant linear bug

* remove redundant code

* fix export bug

* fix format

* fix custom train&infer bug

* fix quant infer size overflow

* fix ls gpt export bug (extra_decode_length)

* fix hf bart cmax init and state

* fix max-batch-tokens bug of bart predict

Co-authored-by: Ying Xiong <[email protected]>
Co-authored-by: duanrenchong <[email protected]>
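Many of the QAT commits above deal with the gradient of the learnable clip bound (cmax, dcmax). A PACT-style sketch of how such a gradient can be defined, illustrative only and not LightSeq's kernel:

```python
# Illustrative only: a learnable symmetric clip bound ("cmax").
# Values inside the range pass gradients straight through to x; values outside
# contribute +/-1 gradient to cmax, so the clip range itself can be trained.
import torch

class LearnableClip(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, cmax):
        ctx.save_for_backward(x, cmax)
        return torch.clamp(x, -cmax.item(), cmax.item())

    @staticmethod
    def backward(ctx, grad_out):
        x, cmax = ctx.saved_tensors
        inside = x.abs() <= cmax
        grad_x = grad_out * inside                  # straight-through inside the range
        # d(clip)/d(cmax) is +1 where x > cmax and -1 where x < -cmax
        grad_cmax = (grad_out * (~inside) * x.sign()).sum().reshape_as(cmax)
        return grad_x, grad_cmax

cmax = torch.tensor(1.0, requires_grad=True)
x = torch.randn(16, requires_grad=True)
y = LearnableClip.apply(x, cmax)
y.sum().backward()                                  # populates x.grad and cmax.grad
```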

* Fix bart eval oom (#370)

* ls embedding support qat

* [WIP]ls transformer qat

* fix fairseq transformer cli shape bug of output projection

* ln_bw_i8 test passed!

* test with_mean of ln_i8

* ls encoder attn add qat

* dropout_relu_bias_i8 passed!

* dropout_gelu_bias unit test passed!

* dropout_relu_bias_bwd_i8 passed!

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* polish unit test

* [WIP] ls encoder qat test

* quant_bias_add_transform_20314, quant_transform4d_0213 unit test passed!

* fix unit test bug

* [WIP] ls encoder qat unit test

* fix bug

* set default module to disable quant, fix bugs in examples

* fix encoder bug

* encoder qat test pass

* decoder qat forward test pass

* fix bug in encoder bw

* fix bug of cmax grad

* fix bug of act mask

* fix bug in tensor quantizer

* fix cmax grad bug

* [WIP] decoder support qat

* ls decoder qat pass

* ls encoder qat pass

* add unit test for quant bert encoder

* fix memory bug

* fix cmax grad bug in huggingface

* quant bert enc fw&bw test passed!

* fix hf cmax export bug

* fix fairseq out_proj bug

* fix fairseq shell bug

* fix decoder mem bug

* modify initial lr of fairseq quant training

* decoupled qat code

* modify huggingface training scripts

* add cmax grad

* delete enc_kv output quant

* modify ffn2gemm quant like inference

* fuse dequantize

* fix post ln mem bug

* add decoder self attn qkv cache quant

* export quant model (stage 1)

* export quant model (stage 2)

* export quant model (stage 3)

* support vit quant train

* add gradient clip

* fix hf export bug

* fix quant gpt bug

* support quant gpt training

* modify huggingface training scripts

* support ls bert, gpt export

* support custom quant transformer export

* optimize ffn fake quant and dcmax

* support quant gpt export

* support quant vit export

* add quant linear layer

* fix quant linear layer bug

* support quant vit infer

* speed up cublas igemm on A100 (by huxingwu)

* optimize ls_quant_dropout_act_bias_bwd_kernel

* polish training gemm algo code

* support gemm best algo search on different GPUs and shapes

* search in the range (min_bsz, 512, 1) and (512, max_bsz, 32)

* add configs_sm75/h512_i2048_b1-10016.json

* support col32 igemm

* add configs_sm75/h768_i3072_b1-10016.json

* add configs_sm80/h512_i2048_b1-10016.json

* add configs_sm75/h1024_i4096_b1-10016.json

* add configs_sm80/h768_i3072_b1-10016.json

* fix syntax error

* configs_sm80/h1024_i4096_b1-10016.json

* modify gemm test config format

* merge all the configs to one

* support search all shapes which are not in the config

* polish the merged config

* add cublas_algo_map cpp code

* move get_sm func to lightseq kernels

* move gemm_test to lightseq ops

* modify default config dir, fix algo_map bug

* fix col32 bug

* col major igemm become default

* fix dcmax kernel bug

* loosen cuda 11.6 requirement

* add vit cpp example

* fix bug from col32 gemm and a100 tuned col gemm

* support training encoder qkv_linear auto-tune gemm (in comment)

* add required header file

* dynamically use col32 or col4 on different GPUs

* fix multidefinition bug

* fix weight transform col32 bug

* add best algo for inference gemm (in comments)

* support easy benchmark for gpt and transformer

* support benchmark for huggingface

* fix embedding clip_max bug

* ls quant linear supports more shapes

* fix quant linear bug

* fix quant linear bug

* update pad function for older torch

* fix quant linear bug

* remove redundant code

* fix export bug

* fix format

* fix custom train&infer bug

* fix quant infer size overflow

* fix ls gpt export bug (extra_decode_length)

* fix hf bart cmax init and state

* fix max-batch-tokens bug of bart predict

* fix bart eval oom

Co-authored-by: Ying Xiong <[email protected]>
Co-authored-by: duanrenchong <[email protected]>

* Remove LayerWeight, simplify the use of computational graphs (#372)

* remove layer_weight

* fix test_ls_layers_new

* remove useless log

* fix

* pre-commit fix format

* format

* finish Encoder bw (#373)

* encoder_bw

* format

* Add gradient communication quantization (GCQ) (#367)

* add gra_comm_quantization

* add comments

* add script

* optimize gcq code

* update ls_hf_tranformer_layer

* fix ops bug

* add ls_fs_train_cli

* add huggingface gcq

* fix bug

* fix bug

* add comments

* fix bug

* optimize gcq code

* fix launch error on V100

* optimize GCQ code

* add torch version check

* remove redundant trainer for hf

* remove type annotations

* change torch version check into trainer

* add using multi NICs in script

* fix gcq fairseq training not found bug

* fix format

Co-authored-by: zhangying.1998 <[email protected]>
Co-authored-by: Ying Xiong <[email protected]>
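Gradient communication quantization (GCQ) compresses gradients before they are all-reduced across workers. A sketch of just the compress/decompress round trip, under the assumption of simple per-tensor int8 scaling; a real implementation hooks into DDP communication:

```python
# Illustrative only: int8 compression of gradients for communication.
import torch

def compress(grad: torch.Tensor):
    scale = grad.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(grad / scale), -127, 127).to(torch.int8)
    return q, scale                       # ~4x smaller payload plus one scale

def decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

grad = torch.randn(1024)
q, scale = compress(grad)
approx = decompress(q, scale)
print((grad - approx).abs().max())        # quantization error is bounded by ~scale/2
```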

* add debug log & canonical layer/node/tensor naming (#374)

* add debug log

* run pre-commit format

Co-authored-by: Yang Wei <[email protected]>

* Crf layer (#377)

* tmp layer

* finish crf layer

* fix compile bug

* Interface optimization & decoder development (#379)

* split total buffer into several mini-buffers (#382)

* Interface optimization & decoder development

* fix error

* split buffer

* Bert crf (#391)

* add bert-crf files

* support bert+crf weight loading

* add bert crf model (with bug)

* fix bug

* fix bug

* fix bug, restore memory.cpp to old version

* update manager to latest

* fix kernel bug when head_dim % 4 != 0

* delete emb lang_emb loading

* modify crf bias to fp32

* fix cmake dtype bug

* fix bug

* rename cmake outputs

* fix cudaStreamSynchronize error (#395)

* fix cudaStreamSynchronize error

* format

* update version to 3.0.0.dev20220926, update dockerfile to manylinux2014 (#392)

* update version to 3.0.0.dev20220926, update dockerfile to manylinux2014

* format

* support auto gemm for int8 inference and training (#393)

* save both col and col32 gemm config

* modify auto-gemm config name

* format

* add A100 igemm config

* fix gemm test workspace oom bug

* add T4 igemm config

* use col dataorder in inference transformer model

* add A30 igemm config

* support inference best igemm algo

* fix workspace bug

* fix workspace bug

* malloc workspace once

* fix workspacesize bug

* support auto igemm for bert

* support auto igemm for quant training

* support download igemm config from website

* save config to HOME, download only once
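The auto-gemm commits search for the fastest GEMM algorithm per shape and cache the result in a config file under HOME. A schematic sketch of that search-and-cache logic with placeholder candidates; the real code benchmarks cuBLASLt algorithms in C++, and the config path here is hypothetical:

```python
# Illustrative only: benchmark candidate GEMM implementations per (m, n, k)
# shape once, then cache the winner in a JSON config for later runs.
import json, os, time
import torch

CONFIG_PATH = os.path.expanduser("~/.lightseq_igemm_config.json")  # hypothetical path

def benchmark(fn, a, b, iters=10):
    fn(a, b)                                    # warm up
    start = time.perf_counter()
    for _ in range(iters):
        fn(a, b)
    return (time.perf_counter() - start) / iters

def best_algo_for_shape(m, n, k, candidates):
    cache = json.load(open(CONFIG_PATH)) if os.path.exists(CONFIG_PATH) else {}
    key = f"{m}x{n}x{k}"
    if key not in cache:                        # search once, then reuse
        a, b = torch.randn(m, k), torch.randn(k, n)
        cache[key] = min(candidates, key=lambda name: benchmark(candidates[name], a, b))
        json.dump(cache, open(CONFIG_PATH, "w"))
    return cache[key]

algos = {"matmul": torch.matmul, "mm": torch.mm}   # stand-ins for cuBLASLt algo ids
print(best_algo_for_shape(64, 512, 512, algos))
```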

* fix auto gemm & new arch pad mask bug (#399)

* save both col and col32 gemm config

* modify auto-gemm config name

* format

* add A100 igemm config

* fix gemm test workspace oom bug

* add T4 igemm config

* use col dataorder in inference transformer model

* add A30 igemm config

* support inference best igemm algo

* fix workspace bug

* fix workspace bug

* malloc workspace once

* fix workspacesize bug

* support auto igemm for bert

* support auto igemm for quant training

* support download igemm config from website

* save config to HOME, download only once

* move get_sm inside class

* fix logits igemm col bug

* fix new arch pad mask bug

* fix bug

* fix auto gemm bug (#401)

* save both col and col32 gemm config

* modify auto-gemm config name

* format

* add A100 igemm config

* fix gemm test workspace oom bug

* add T4 igemm config

* use col dataorder in inference transformer model

* add A30 igemm config

* support inference best igemm algo

* fix workspace bug

* fix workspace bug

* malloc workspace once

* fix workspacesize bug

* support auto igemm for bert

* support auto igemm for quant training

* support download igemm config from website

* save config to HOME, download only once

* move get_sm inside class

* fix logits igemm col bug

* fix new arch pad mask bug

* fix bug

* fix algo_map bug

* add A10 igemm config

* update url of igemm configs

* use col32 in T4

* delete useless algomap init func

* Fix train bug (#403)

* fix train relu quant bug

* fix auto gemm algomap bug

* update to 3.0.0 (#404)

* add auto igemm for gpt, vit (#408)

* make gcq params compatible (#409)

* fix gcq params

* fix format

Co-authored-by: zhangying.1998 <[email protected]>
Co-authored-by: xiongying.taka <[email protected]>

* Fix gpu name (#415)

* fix gpu name bug when on different gpus

* update version to 3.0.1

* Update README and doc (#417)

Update README and doc

* Hard gate moe (#424)

* hard gate moe dev

* moe hard gate dev

* moe hard gate dev 2

* hard gate moe dev 3

* hard gate moe dev 3

* add remark

* add annotation for moe.cc

* rename to moe_fw_batch1

* rename to moe_fw_hard_gate_batchn

* format code style for hard_gate_moe

* rename to moe_fw_hard_gate_batch1

* move initialization for hard_gate to construction method

* delete batch1 and batchn forward function for encoder

* delete batch1 and batchn forward function for hard gate moe decoder

* hard_moe decoder annotation

* hard_moe decoder annotation

* reformat code for hard gate moe

* fix bug for hard_gate_moe proto parse (#426)

* add hip lightseq

* test precommit

* test pre-commit

* modify readme_hip

Co-authored-by: Yang Wei <[email protected]>
Co-authored-by: xiongying.taka <[email protected]>
Co-authored-by: hexisyztem <[email protected]>
Co-authored-by: zhoubofan <[email protected]>
Co-authored-by: Yang Wei <[email protected]>
Co-authored-by: Xiaohui Wang <[email protected]>
Co-authored-by: Jersey <[email protected]>
Co-authored-by: zhangzhexi <[email protected]>
Co-authored-by: aachong <[email protected]>
Co-authored-by: duanrenchong <[email protected]>
Co-authored-by: xian8 <[email protected]>
Co-authored-by: lidao <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: naivebird <[email protected]>
Co-authored-by: Ying Zhang <[email protected]>
Co-authored-by: zhangying.1998 <[email protected]>
Co-authored-by: AnYang <[email protected]>