support QAT, export and inference for quantized BERT, GPT2 #285

Merged: 53 commits into master on May 7, 2022

Conversation

godweiyang
Collaborator

No description provided.

@godweiyang changed the title from "optimize examples for convenient usage" to "support bert QAT and int8 inference" on Mar 17, 2022
@godweiyang changed the title from "support bert QAT and int8 inference" to "support QAT, export and inference for BERT, GPT2" on Apr 18, 2022
@godweiyang changed the title from "support QAT, export and inference for BERT, GPT2" to "support QAT, export and inference for quantized BERT, GPT2" on Apr 27, 2022
@neopro12 merged commit 4024ae1 into master on May 7, 2022
neopro12 added a commit that referenced this pull request on Dec 5, 2022
* Quantized Transformer pre-release (#235)

* add int8 for inference (ffn)

* quantize weights

* add compile option for int8

* add ffn int8 gemm for decoder

* add int8 gemm for qkv

* add int8 gemm for decoder

* load int8 pb (stage 1)

* load int8 pb (stage 2)

* use unsigned char to represent uint8

* add encoder clip_max

* add decoder clip_max

* fix decoder q project shape bug

* fix act kernel bug

* support compile with cuda 11

* remove redundant include

* compile using c++14

* support compile for training using cuda 11 and fix cub bug

* First try CublasLt

* Weight add transform to col32t

* i8 in i32 output

* Lt int8 all bug fixed

* add int8 logit gemm

* fix encoder int32_out_buf allocate bug

* modify float2int8 calculation

* Replace cublas int8 to lt

* Add fuse_residual_layer_norm kernel

* add scaled_colsum kernel of ffn2 weight, remove old version of cublas gemm (BLEU TEST OK)

* modify clip range of relu to [0, c]

* TODO: _trg_vocab_size must be 4x

* Replace all int8 cublas to cublasLt

* use relu clip range (0, c)

* remove default algo for cublasLt

* add test for cublas and tvm

* add more info for cublas gemm unit test

* load clip_max of gemm i8 out

* add int8 gemm io of logits

* add i8 gemm out of ker_arrange_decself_qkv

* split the beam search int kernel into two different versions (i8 and i32)

* finish decoder i8 gemm out (relu and ffn2 gemm bug)

* finish encoder i8 gemm out (relu and ffn2 gemm bug)

* fix scale bug

* Unify all variable names and function names

* delete useless int8 kernels and add all col32 options

* delete useless int8 kernels and add all col32 options

* add test for batch gemm

* do not test cublaslt when bsz>1

* add more output info for cublas test

* polish cublas test code

* update proto and add clip_max for dec-self-attn-qkv-bias-out

* restore i32 out of ffn2 gemm out for relu clipmax(0, c)

* Int8 cache

* Optimize int8 refresh cache

* Fix wrong result when vocab % 32 != 0

* fix mixed fp32 and int8 bug

* modify dequant calculation equation

* rename round_up

* Add cublaslt gemm w/o mma

* Optimize for batch size 1 with regular gemm

* Remove redundant code

* update quant transformer proto

* Refactor quantized Transformer code

* rename quant_transformer.proto

* rename quant_transformer.proto

* recover unneeded change

Co-authored-by: xiongying.taka <[email protected]>
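The commits above revolve around per-tensor clip_max quantization (float2int8, dequantize, clip ranges). A minimal, hypothetical sketch of that scheme follows; the function names are illustrative, not LightSeq's actual API:

```python
# Illustrative only: symmetric int8 quantization with a per-tensor clip_max,
# roughly the float2int8/dequantize scheme referenced in the commits above.
import torch

def float2int8(x: torch.Tensor, clip_max: float) -> torch.Tensor:
    """Map float values in [-clip_max, clip_max] to int8 in [-127, 127]."""
    scale = 127.0 / clip_max
    x = x.clamp(-clip_max, clip_max)
    return torch.round(x * scale).to(torch.int8)

def dequantize(q: torch.Tensor, clip_max: float) -> torch.Tensor:
    """Map int8 values back to float using the same clip_max."""
    return q.float() * (clip_max / 127.0)

w = torch.randn(4, 4)
clip_max = w.abs().max().item()   # per-tensor clip range
w_q = float2int8(w, clip_max)     # what gets stored in the int8 model file
w_hat = dequantize(w_q, clip_max) # approximate reconstruction at load time
```

Weights typically use their own maximum absolute value as the clip range, while activations use a clip_max that is calibrated or learned, which is why the proto gains explicit clip_max fields.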

* Update GPU build toolchain  (#258)

* modify build & support sm_80 and sm_86

* build with glibc 2.24

* Repair for manylinux_2_24

* Fix format

Co-authored-by: zhoubofan <[email protected]>
Co-authored-by: Ying Xiong <[email protected]>

* Optimize QuantTransformer GPU memory (#264)

* add int8 for inference (ffn)

* quantize weights

* add compile option for int8

* add ffn int8 gemm for decoder

* add int8 gemm for qkv

* add int8 gemm for decoder

* load int8 pb (stage 1)

* load int8 pb (stage 2)

* use unsigned char to represent uint8

* add encoder clip_max

* add decoder clip_max

* fix decoder q project shape bug

* fix act kernel bug

* support compile with cuda 11

* remove redundant include

* compile using c++14

* support compile for training using cuda 11 and fix cub bug

* First try CublasLt

* Weight add transform to col32t

* i8 in i32 output

* Lt int8 all bug fixed

* add int8 logit gemm

* fix encoder int32_out_buf allocate bug

* modify float2int8 calculation

* Replace cublas int8 to lt

* Add fuse_residual_layer_norm kernel

* add scaled_colsum kernel of ffn2 weight, remove old version of cublas gemm (BLEU TEST OK)

* modify clip range of relu to [0, c]

* TODO: _trg_vocab_size must be 4x

* Replace all int8 cublas to cublasLt

* use relu clip range (0, c)

* remove default algo for cublasLt

* add test for cublas and tvm

* add more info for cublas gemm unit test

* load clip_max of gemm i8 out

* add int8 gemm io of logits

* add i8 gemm out of ker_arrange_decself_qkv

* split the beam search int kernel into two different versions (i8 and i32)

* finish decoder i8 gemm out (relu and ffn2 gemm bug)

* finish encoder i8 gemm out (relu and ffn2 gemm bug)

* fix scale bug

* Unify all variable names and function names

* delete useless int8 kernels and add all col32 options

* delete useless int8 kernels and add all col32 options

* add test for batch gemm

* do not test cublaslt when bsz>1

* add more output info for cublas test

* polish cublas test code

* update proto and add clip_max for dec-self-attn-qkv-bias-out

* restore i32 out of ffn2 gemm out for relu clipmax(0, c)

* Int8 cache

* Optimize int8 refresh cache

* Fix wrong result when vocab % 32 != 0

* fix mixed fp32 and int8 bug

* modify dequant calculation equation

* rename round_up

* Add cublaslt gemm w/o mma

* Optimize for batch size 1 with regular gemm

* Remove redundant code

* update quant transformer proto

* Refactor quantized Transformer code

* rename quant_transformer.proto

* rename quant_transformer.proto

* recover unneeded change

* Remove useless code

* Optimize quant_transformer gpu memory usage

* Fix encoder init buffer bug

* clean comments

* optimize decoder embedding memory

* fix multilingual emb quant bug

* fix nullptr buffer bug

* fix nullptr buffer bug

* optimize encoder embedding memory

* fix emb dequant dtype bug

* fix emb dequant dtype bug

* delete redundant argument of quant_weight

* support post training quantization for lightseq training models

* update ptq example README

* format README

* add README for int8 speed comparison

Co-authored-by: xiongying.taka <[email protected]>

* Update README.md (#267)

* lightseq training for gpt2 (#272)

* lightseq training support gpt

* remove comments

* fix softmax default mask_future

* remove comments

* remove comments

* Support training on both cuda 10 and 11 (#274)

* support training on both cuda 10 and 11

* format

Co-authored-by: Ying Xiong <[email protected]>

* Optimize QuantTransformer implementation (#269)

* support ptq of fairseq+lightseq training models

* support dynamic weight_clip_max, adjust ls_fs_transformer_ptq_export act_clip_max

* fix hf training padding mask bug

* add requirements for fairseq examples

* format code

* replace quant scaled emb with quant emb

* fix hidden_dim bug, use quant emb in ptq export

* modify round(x) to floor(x+0.5)

* format

Co-authored-by: Ying Xiong <[email protected]>
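One of the commits above swaps round(x) for floor(x + 0.5). The difference only shows up on ties, since torch.round rounds half to even; a tiny illustration:

```python
# Illustrative only: why "modify round(x) to floor(x+0.5)" changes results.
# torch.round uses round-half-to-even, while floor(x + 0.5) rounds halves up,
# so ties such as 2.5 quantize to different integers.
import torch

x = torch.tensor([0.5, 1.5, 2.5, 3.5])
print(torch.round(x))        # tensor([0., 2., 2., 4.])  half-to-even
print(torch.floor(x + 0.5))  # tensor([1., 2., 3., 4.])  half-up
```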

* Add torch QAT (#275)

* Add torch layer for training

* fix torch layer bug

* torch ls_transformer support qat

* add fairseq quant example script

Co-authored-by: Yang Wei <[email protected]>
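The torch QAT layers rely on fake quantization: the forward pass quantizes and immediately dequantizes, while the backward pass uses a straight-through estimator. A minimal generic sketch under those assumptions (not LightSeq's actual module):

```python
# Illustrative only: fake quantization with a straight-through estimator,
# the basic trick behind quantization-aware training (QAT).
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, clip_max):
        ctx.save_for_backward(x)
        ctx.clip_max = clip_max
        scale = clip_max / 127.0
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale                      # dequantized value used downstream

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # straight-through estimator: zero gradient outside the clip range
        mask = (x.abs() <= ctx.clip_max).to(grad_out.dtype)
        return grad_out * mask, None

x = torch.randn(8, requires_grad=True)
y = FakeQuant.apply(x, 2.0)
y.sum().backward()                            # gradients flow only inside the clip range
```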

* Fix training missing logging (#277)

* fix missing logging caused by pytorch-quantization

* fix format

* Support MoE Inference (#280)

* init moe

* update pywrapper

* add fairseq export example

* add example of exporting moe

* fix bug of shared_bias

* delete one logger

* change format of example.

* update export README

Co-authored-by: zhangzhexi <[email protected]>

* support fairseq export (#278)

* fix torch fake quant positions

* fix torch decoder self attn quant position

* rename scripts

* modify lightseq arguments

* add find-unused-parameters for ls_torch_fairseq training

* finetune quant model from pretrained fp16 model

* fairseq generate using sacrebleu

* support native fairseq export

* polish export code

* support converting pb to hdf5

* support ls_torch_fairseq_quant export (stage 1)

* fix typo

* fix fake quant relu compute bug

* fix export bug

* delete useless proto keys

* add ls_torch_fairseq ptq export, fix encdec_attn kv quant bug

* fix qat export bug

* modify ptq act_clip_max

* support fairseq generate using lightseq inference

* support native fairseq ptq export

* modify README.md
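The export commits mention converting protobuf checkpoints to HDF5. A minimal sketch of writing weights into an HDF5 file with h5py; the dataset names here are hypothetical, since the real export scripts define their own layout:

```python
# Illustrative only: dumping model weights and clip_max scalars into HDF5,
# in the spirit of the "support converting pb to hdf5" commit.
import h5py
import numpy as np

weights = {
    "encoder/layer_0/self_attn/qkv_weight": np.random.rand(768, 2304).astype(np.float32),
    "encoder/layer_0/self_attn/qkv_clip_max": np.float32(1.5),
}

with h5py.File("model.hdf5", "w") as f:
    for name, value in weights.items():
        f.create_dataset(name, data=value)  # nested names create HDF5 groups automatically
```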

* Triton backend rebase (#301)

* tritonbackend for lightseq

remove useless type qualifiers

tritonbackend README

format

update psf/black

fix code format

update tritonbackend README

fix readme format

fix README format

fix README format

fix README format

adapt README

add empty directories which are needed by triton

* format

Co-authored-by: zhoubofan <[email protected]>

* publish tritonserver image & update README (#303)

* tritonbackend for lightseq

remove useless type qualifiers

tritonbackend README

format

update psf/black

fix code format

update tritonbackend README

fix readme format

fix README format

fix README format

fix README format

adapt README

add empty directories which are needed by triton

* format

* add tritonserver_lightseq image

* format

* fix the issue that compiling the python extension is too slow

* fix format

* update README

Co-authored-by: zhoubofan <[email protected]>

* support ViT model (#299)

* init example of vit training

* init vit proto

* init patch emb kernel

* init model and pywrapper

* fix bugs

* init vit export example

* fix export bug

* fix blockreducesum bug

* update export

* fix last layernorm

* remove redundant ispostln

* update readme and test example

* with_lightseq true

* delete useless moeKernel

* support channel*patch*patch>=1024

* update pre-commit and code format

* update format of run_vit.sh

* modify Dockerfile to compile tritonbackend (#305)

Co-authored-by: zhoubofan <[email protected]>

* support QAT, export and inference for quantized BERT, GPT2 (#285)

* modify readme of examples

* modify table in example readme

* add cpp example of quant_transformer

* support huggingface bert ptq (stage 1)

* fix huggingface bert weight loading fp16 bug

* finetune quant bert from fp16 ckpt

* add emb quant of bert

* add example of hf bert squad training, modify dir of huggingface training

* format

* rename huggingface dir to fix conflict with datasets

* fix typo of gpt

* export fairseq models to hdf5

* quant hdf5 load (stage 1)

* quant hdf5 transformer finished

* fix fairseq infer bug

* export quant bert, delete hf quant pos emb

* add quant bert files

* support quant bert inference (not test)

* fix quant bert export name bug

* support quant bert inference

* update black pre-commit version

* add quant bert test example

* support cpp quant bert example

* format

* modify readme

* do not use ffn2 out quant if using gelu

* polish gemm test

* fix gemm test lt col bug

* support gpt2 qat

* add causal mask for gpt encoder

* support quant gpt export

* add quant gpt required files

* support quant gpt inference (stage 1)

* add fake quant for logits gemm

* support quant gpt inference (stage 2)

* support quant gpt inference (stage 3)

* support quant gpt inference (ppl)

* support quant gpt inference (TODO: fix qkv bias out clip_max, sampling)

* support quant gpt inference (ppl)

* support quant gpt inference (sampling)

* support quant decoder sampling

* modify readme (add install command)

* optimize quant gpt gemm, fix gelu bug

* optimize cpp example

* replace quant gpt cache memcpy with pointer switch

* fuse quant gpt softmax kernel

* optimize quant gpt arrange-qkv kernel

* fix PyPI spelling

* fix gpt memory spelling
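Among the commits above, "add causal mask for gpt encoder" refers to the standard auto-regressive attention mask. A plain PyTorch sketch of that masking (illustrative only, not the fused CUDA kernel):

```python
# Illustrative only: causal masking of attention scores so each token can only
# attend to itself and earlier positions, as required by GPT-style decoding.
import torch

def causal_attention_probs(q, k):
    # q, k: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    seq_len = scores.size(-1)
    # upper-triangular positions (future tokens) are masked out
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

q = k = torch.randn(1, 2, 5, 8)
probs = causal_attention_probs(q, k)  # each row attends only to current and past tokens
```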

* hf bart training and inference (#316)

* update transformer_decoder_layer.py for hugging face

* huggingface bert training

* huggingface bart and bert training

* huggingface bart and bert training

* update bart,bert example and fix bugs

* update example

* fix bugs

* fix bugs

* format update

* format update

* remove return decoder cache

* format update

* update ner and qa

* fix bugs

* format update

* Optimising the cache of the decoder

Co-authored-by: duanrenchong <[email protected]>

* add encTdecT tagging (multilg_type=3) for multilingual translation (#313)

* add encTdecT tagging (multilg_type=3) for multilingual translation

* format code

Co-authored-by: Yang Wei <[email protected]>

* add export and test for xglm, add extra_decode_length for gpt inference (#317)

* add extra_decode_length for gpt2 sampling

* add extra_decode_length for incoder(xglm) sampling

* change styles for incoder files

* remove useless comments for ls_incoder.py

* change name from incoder to xglm

* change function names for xglm

* remove comments for XGLM

* remove useless lines

* modify default topp of xglm

* fix bug in xglm export

Co-authored-by: lidao <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: Ying Xiong <[email protected]>
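The sampling-related commits above expose a topp parameter, i.e. top-p (nucleus) sampling. A generic sketch of the idea, assuming simple unbatched logits rather than the fused inference kernel:

```python
# Illustrative only: top-p (nucleus) sampling keeps the smallest set of tokens
# whose cumulative probability exceeds p, then samples from that set.
import torch

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep tokens whose preceding cumulative mass is below p (always keep the top one)
    keep = cumulative - sorted_probs < p
    keep[0] = True
    kept_probs = sorted_probs * keep
    kept_probs /= kept_probs.sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return sorted_idx[choice].item()

next_token = top_p_sample(torch.randn(32000), p=0.9)
```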

* [WIP] New arch (#320)

* modify Dockerfile to compile tritonbackend

* csrc change file directory

* ops split into ops & layers

* split definition and declaration

* format

Co-authored-by: zhoubofan <[email protected]>

* Fix test unit (#321)

* modify Dockerfile to compile tritonbackend

* fix test_ls_ops

Co-authored-by: zhoubofan <[email protected]>

* support trainable positional embedding (#323)

* Support trainable positional embedding

* fix bug

Co-authored-by: duanrenchong <[email protected]>

* fine-tune bart (#333)

* Support trainable positional embedding

* fix bug

* update convert fs to hf

* support fairseq finetune bart and export

* fix bugs

* add cnn dm script

* fix bugs

* fix

Co-authored-by: duanrenchong <[email protected]>

* fix quant and torch layer bugs (#335)

Co-authored-by: duanrenchong <[email protected]>

* support shard databin and hdfs for fairseq (#344)

* init streaming dataset

* add bash file

* update bash

* update bash

* update bash

* update bash

* fix hdfs download and loop bugs

* small tips

* fixes

* update shell files

Co-authored-by: duanrenchong <[email protected]>

* new_arch (#352)

* new_arch

* fix format error

Co-authored-by: zhoubofan <[email protected]>

* add lsflow (#355)

* [WIP] New arch develop (#353)

* new_arch

* fix format error

* use cuda_malloc

* fix format

* adapt xiaohui's mr

* temp

* fix error

* format

* add pybind_op

* add layer_normalize

* pass operator test

* fix format

* format

* remove useless files

* remove useless commit

* [WIP] New arch develop - add new operators (#358)

* new_arch

* fix format error

* use cuda_malloc

* fix format

* adapt xiaohui's mr

* temp

* fix error

* format

* add pybind_op

* add layer_normalize

* pass operator test

* fix format

* format

* remove useless files

* remove useless commit

* add FeedForwardOp

* add ops new

* format

* remove useless modify

* remove useless modify

* fix compatibility with torch > 1.10 (#359)

torch has removed the <THC/THCGeneral.h> file, which raises an error when using ls_adam.

* add new kernels & operators (#360)

* add new kernels

* format

* add kernel - transform and softmax

* format

* fix operator error - Transform0213

* remove useless commit

* residual grad

* fix normalize residual grad

* remove useless file

* fix error

* support Google T5 and MT5 model (#362)

* add extra_decode_length for gpt2 sampling

* add extra_decode_length for incoder(xglm) sampling

* change styles for incoder files

* remove useless comments for ls_incoder.py

* change name from incoder to xglm

* t5 support: T5LayerNorm supported

* Finish relative position bias

* T5 support: same decoder output with pytorch version

* fix bug

* T5 support: correct for batch inference, but beam size > 1 not tested

* fix bug where padding leads to errors when batch size > 1

* make head_num be param instead of magic number

* support t5-base

* move two files to correct path

* remove comment for most t5 files

* change relative_attention_num_buckets from magic number to variable

* make max_step variable to save GPU memory and speed up decoding.

* fix bugs

* restore spaces in xglm

* change for styling issues

* restore unneed bart change

* copy to mt5_export

* change proto file (add ffn_third_kernel)

* basic export function

* add exporting lm_head

* add mt5 files in inference/model/*

* add mt5 in inference/proto folder

* add mt5 in inference/pywrapper folder

* support first gelu then second XW2 added to XW1

* fix bug in gelu first and mat element-wise multiply

* fix export no ffn_third_kernel bug

* same for ffn layer output

* remove _logit_scaler as self.config.tie_word_embeddings=False

* try to change decoder into gated ffn

* fix bug

* update for debugging lm_head

* change ls_mt5.py to not run lightseq and huggingface at the same time

* change format

* try to support protobuf read for T5 model

* fix format problem

* MT5 protocol buffer does not support added prompts yet

Co-authored-by: weiyang.god <[email protected]>
Co-authored-by: lidao <[email protected]>
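The T5/MT5 commits add a gated feed-forward network ("first gelu", an element-wise multiply, and an extra ffn_third_kernel projection). A minimal PyTorch sketch of that FFN shape, with illustrative weight names:

```python
# Illustrative only: the gated-GeLU FFN used by T5.1.1 / MT5:
# gelu(x @ W1) * (x @ W2), then a third projection back to d_model.
import torch
import torch.nn.functional as F

def gated_gelu_ffn(x, w1, w2, w3):
    # x: [batch, seq, d_model]; w1, w2: [d_model, d_ff]; w3: [d_ff, d_model]
    gate = F.gelu(x @ w1)         # "first gelu"
    linear = x @ w2               # second projection, no activation
    return (gate * linear) @ w3   # element-wise multiply, then project back

x = torch.randn(1, 4, 512)
w1, w2 = torch.randn(512, 1024), torch.randn(512, 1024)
w3 = torch.randn(1024, 512)
y = gated_gelu_ffn(x, w1, w2, w3)  # [1, 4, 512]
```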

* Crf (#364)

* format code

* add viterbi kernel

* stage 1

* stage 2

* fix acc bug

* fix crf batch size

* finish viterbi
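The CRF commits implement Viterbi decoding as a CUDA kernel; the algorithm itself, for a single sequence, looks roughly like this plain PyTorch sketch:

```python
# Illustrative only: Viterbi decoding for a linear-chain CRF, unbatched.
import torch

def viterbi_decode(emissions, transitions):
    # emissions: [seq_len, num_tags], transitions[i, j]: score of tag i -> tag j
    seq_len, num_tags = emissions.shape
    score = emissions[0]                      # best score ending in each tag so far
    history = []
    for t in range(1, seq_len):
        # candidate score for (prev tag i, current tag j)
        next_score = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        history.append(next_score.argmax(dim=0))   # best previous tag for each tag j
        score = next_score.max(dim=0).values
    # backtrack from the best final tag
    best_tag = score.argmax().item()
    path = [best_tag]
    for best_prev in reversed(history):
        best_tag = best_prev[best_tag].item()
        path.append(best_tag)
    return list(reversed(path))

tags = viterbi_decode(torch.randn(6, 5), torch.randn(5, 5))
```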

* New arch develop - encoder layer forward (#366)

* add transformer_encoder_layer

* 2022.08.22 modify

* lightseq new arch develop

* temporary develop

* fix error

* remove useless log

* format

* Ls new arch - finish bert (#369)

* complete bert develop

* format

* LightSeq QAT (#307)

* ls embedding support qat

* [WIP]ls transformer qat

* fix fairseq transformer cli shape bug of output projection

* ln_bw_i8 test passed!

* test with_mean of ln_i8

* ls encoder attn add qat

* dropout_relu_bias_i8 passed!

* dropout_gelu_bias unit test passed!

* dropout_relu_bias_bwd_i8 passed!

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* polish unit test

* [WIP] ls encoder qat test

* quant_bias_add_transform_20314, quant_transform4d_0213 unit test passed!

* fix unit test bug

* [WIP] ls encoder qat unit test

* fix bug

* set default module to disable quant, fix bugs in examples

* fix encoder bug

* encoder qat test pass

* decoder qat forward test pass

* fix bug in encoder bw

* fix bug of cmax grad

* fix bug of act mask

* fix bug in tensor quantizer

* fix cmax grad bug

* [WIP] decoder support qat

* ls decoder qat pass

* ls encoder qat pass

* add unit test for quant bert encoder

* fix memory bug

* fix cmax grad bug in huggingface

* quant bert enc fw&bw test passed!

* fix hf cmax export bug

* fix fairseq out_proj bug

* fix fairseq shell bug

* fix decoder mem bug

* modify initial lr of fairseq quant training

* decoupled qat code

* modify huggingface training scripts

* add cmax grad

* delete enc_kv output quant

* modify ffn2gemm quant like inference

* fuse dequantize

* fix post ln mem bug

* add decoder self attn qkv cache quant

* export quant model (stage 1)

* export quant model (stage 2)

* export quant model (stage 3)

* support vit quant train

* add gradient clip

* fix hf export bug

* fix quant gpt bug

* support quant gpt training

* modify huggingface training scripts

* support ls bert, gpt export

* support custom quant transformer export

* optimize ffn fake quant and dcmax

* support quant gpt export

* support quant vit export

* add quant linear layer

* fix quant linear layer bug

* support quant vit infer

* speed up cublas igemm on A100 (by huxingwu)

* optimize ls_quant_dropout_act_bias_bwd_kernel

* polish training gemm algo code

* support gemm best algo search on different GPUs and shapes

* search in the range (min_bsz, 512, 1) and (512, max_bsz, 32)

* add configs_sm75/h512_i2048_b1-10016.json

* support col32 igemm

* add configs_sm75/h768_i3072_b1-10016.json

* add configs_sm80/h512_i2048_b1-10016.json

* add configs_sm75/h1024_i4096_b1-10016.json

* add configs_sm80/h768_i3072_b1-10016.json

* fix syntax error

* configs_sm80/h1024_i4096_b1-10016.json

* modify gemm test config format

* merge all the configs to one

* support search all shapes which are not in the config

* polish the merged config

* add cublas_algo_map cpp code

* move get_sm func to lightseq kernels

* move gemm_test to lightseq ops

* modify default config dir, fix algo_map bug

* fix col32 bug

* col major igemm become default

* fix dcmax kernel bug

* loosen cuda 11.6 requirement

* add vit cpp example

* fix bug from col32 gemm and a100 tuned col gemm

* support training encoder qkv_linear auto-tune gemm (in comment)

* add required header file

* dynamically use col32 or col4 on different GPUs

* fix multidefinition bug

* fix weight transform col32 bug

* add best algo for inference gemm (in comments)

* support easy benchmark for gpt and transformer

* support benchmark for huggingface

* fix embedding clip_max bug

* ls quant linear supports more shapes

* fix quant linear bug

* fix quant linear bug

* update pad function for older torch

* fix quant linear bug

* remove redundant code

* fix export bug

* fix format

* fix custom train&infer bug

* fix quant infer size overflow

* fix ls gpt export bug (extra_decode_length)

* fix hf bart cmax init and state

* fix max-batch-tokens bug of bart predict

Co-authored-by: Ying Xiong <[email protected]>
Co-authored-by: duanrenchong <[email protected]>
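Many of the QAT commits above deal with the gradient of the learnable clip bound (cmax, dcmax). A PACT-style sketch of how such a gradient can be defined, illustrative only and not LightSeq's kernel:

```python
# Illustrative only: a learnable symmetric clip bound ("cmax").
# Values inside the range pass gradients straight through to x; values outside
# contribute +/-1 gradient to cmax, so the clip range itself can be trained.
import torch

class LearnableClip(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, cmax):
        ctx.save_for_backward(x, cmax)
        return torch.clamp(x, -cmax.item(), cmax.item())

    @staticmethod
    def backward(ctx, grad_out):
        x, cmax = ctx.saved_tensors
        inside = x.abs() <= cmax
        grad_x = grad_out * inside                  # straight-through inside the range
        # d(clip)/d(cmax) is +1 where x > cmax and -1 where x < -cmax
        grad_cmax = (grad_out * (~inside) * x.sign()).sum().reshape_as(cmax)
        return grad_x, grad_cmax

cmax = torch.tensor(1.0, requires_grad=True)
x = torch.randn(16, requires_grad=True)
y = LearnableClip.apply(x, cmax)
y.sum().backward()                                  # populates x.grad and cmax.grad
```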

* Fix bart eval oom (#370)

* ls embedding support qat

* [WIP]ls transformer qat

* fix fairseq transformer cli shape bug of output projection

* ln_bw_i8 test passed!

* test with_mean of ln_i8

* ls encoder attn add qat

* dropout_relu_bias_i8 passed!

* dropout_gelu_bias unit test passed!

* dropout_relu_bias_bwd_i8 passed!

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* dropout_gelu_bias_bwd_i8 unit test passed!

* format

* polish unit test

* [WIP] ls encoder qat test

* quant_bias_add_transform_20314, quant_transform4d_0213 unit test passed!

* fix unit test bug

* [WIP] ls encoder qat unit test

* fix bug

* set default module to disable quant, fix bugs in examples

* fix encoder bug

* encoder qat test pass

* decoder qat forward test pass

* fix bug in encoder bw

* fix bug of cmax grad

* fix bug of act mask

* fix bug in tensor quantizer

* fix cmax grad bug

* [WIP] decoder support qat

* ls decoder qat pass

* ls encoder qat pass

* add unit test for quant bert encoder

* fix memory bug

* fix cmax grad bug in huggingface

* quant bert enc fw&bw test passed!

* fix hf cmax export bug

* fix fairseq out_proj bug

* fix fairseq shell bug

* fix decoder mem bug

* modify initial lr of fairseq quant training

* decoupled qat code

* modify huggingface training scripts

* add cmax grad

* delete enc_kv output quant

* modify ffn2gemm quant like inference

* fuse dequantize

* fix post ln mem bug

* add decoder self attn qkv cache quant

* export quant model (stage 1)

* export quant model (stage 2)

* export quant model (stage 3)

* support vit quant train

* add gradient clip

* fix hf export bug

* fix quant gpt bug

* support quant gpt training

* modify huggingface training scripts

* support ls bert, gpt export

* support custom quant transformer export

* optimize ffn fake quant and dcmax

* support quant gpt export

* support quant vit export

* add quant linear layer

* fix quant linear layer bug

* support quant vit infer

* speed up cublas igemm on A100 (by huxingwu)

* optimize ls_quant_dropout_act_bias_bwd_kernel

* polish training gemm algo code

* support gemm best algo search on different GPUs and shapes

* search in the range (min_bsz, 512, 1) and (512, max_bsz, 32)

* add configs_sm75/h512_i2048_b1-10016.json

* support col32 igemm

* add configs_sm75/h768_i3072_b1-10016.json

* add configs_sm80/h512_i2048_b1-10016.json

* add configs_sm75/h1024_i4096_b1-10016.json

* add configs_sm80/h768_i3072_b1-10016.json

* fix syntax error

* configs_sm80/h1024_i4096_b1-10016.json

* modify gemm test config format

* merge all the configs to one

* support search all shapes which are not in the config

* polish the merged config

* add cublas_algo_map cpp code

* move get_sm func to lightseq kernels

* move gemm_test to lightseq ops

* modify default config dir, fix algo_map bug

* fix col32 bug

* col major igemm become default

* fix dcmax kernel bug

* loosen cuda 11.6 requirement

* add vit cpp example

* fix bug from col32 gemm and a100 tuned col gemm

* support training encoder qkv_linear auto-tune gemm (in comment)

* add required header file

* dynamically use col32 or col4 on different GPUs

* fix multidefinition bug

* fix weight transform col32 bug

* add best algo for inference gemm (in comments)

* support easy benchmark for gpt and transformer

* support benchmark for huggingface

* fix embedding clip_max bug

* ls quant linear supports more shapes

* fix quant linear bug

* fix quant linear bug

* update pad function for older torch

* fix quant linear bug

* remove redundant code

* fix export bug

* fix format

* fix custom train&infer bug

* fix quant infer size overflow

* fix ls gpt export bug (extra_decode_length)

* fix hf bart cmax init and state

* fix max-batch-tokens bug of bart predict

* fix bart eval oom

Co-authored-by: Ying Xiong <[email protected]>
Co-authored-by: duanrenchong <[email protected]>

* Remove LayerWeight, simplify the use of computational graphs (#372)

* remove layer_weight

* fix test_ls_layers_new

* remove useless log

* fix

* pre-commit fix format

* format

* finish Encoder bw (#373)

* encoder_bw

* format

* Add gradient communication quantization (GCQ) (#367)

* add gra_comm_quantization

* add comments

* add script

* optimize gcq code

* update ls_hf_tranformer_layer

* fix ops bug

* add ls_fs_train_cli

* add huggingface gcq

* fix bug

* fix bug

* add comments

* fix bug

* optimize gcq code

* fix launch error on V100

* optimize GCQ code

* add torch version check

* remove redundant trainer for hf

* remove type annotations

* change torch version check into trainer

* add using multi NICs in script

* fix gcq fairseq training not found bug

* fix format

Co-authored-by: zhangying.1998 <[email protected]>
Co-authored-by: Ying Xiong <[email protected]>
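Gradient communication quantization (GCQ) compresses gradients before they are all-reduced across workers. A sketch of just the compress/decompress round trip, under the assumption of simple per-tensor int8 scaling; a real implementation hooks into DDP communication:

```python
# Illustrative only: int8 compression of gradients for communication.
import torch

def compress(grad: torch.Tensor):
    scale = grad.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(grad / scale), -127, 127).to(torch.int8)
    return q, scale                       # ~4x smaller payload plus one scale

def decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

grad = torch.randn(1024)
q, scale = compress(grad)
approx = decompress(q, scale)
print((grad - approx).abs().max())        # quantization error is bounded by ~scale/2
```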

* add debug log & canonical layer/node/tensor naming (#374)

* add debug log

* run pre-commit format

Co-authored-by: Yang Wei <[email protected]>

* Crf layer (#377)

* tmp layer

* finish crf layer

* fix compile bug

* Interface optimization & decoder development (#379)

* split total buffer into several mini-buffers (#382)

* Interface optimization & decoder development

* fix error

* split buffer

* Bert crf (#391)

* add bert-crf files

* support bert+crf weight loading

* add bert crf model (with bug)

* fix bug

* fix bug

* fix bug, restore memory.cpp to old version

* update manager to latest

* fix kernel bug when head_dim % 4 != 0

* delete emb lang_emb loading

* modify crf bias to fp32

* fix cmake dtype bug

* fix bug

* rename cmake outputs

* fix cudaStreamSynchronize error (#395)

* fix cudaStreamSynchronize error

* format

* update version to 3.0.0.dev20220926, update dockerfile to manylinux2014 (#392)

* update version to 3.0.0.dev20220926, update dockerfile to manylinux2014

* format

* support auto gemm for int8 inference and training (#393)

* save both col and col32 gemm config

* modify auto-gemm config name

* format

* add A100 igemm config

* fix gemm test workspace oom bug

* add T4 igemm config

* use col dataorder in inference transformer model

* add A30 igemm config

* support inference best igemm algo

* fix workspace bug

* fix workspace bug

* malloc workspace once

* fix workspacesize bug

* support auto igemm for bert

* support auto igemm for quant training

* support download igemm config from website

* save config to HOME, download only once
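The auto-gemm commits search for the fastest GEMM algorithm per shape and cache the result in a config file under HOME. A schematic sketch of that search-and-cache logic with placeholder candidates; the real code benchmarks cuBLASLt algorithms in C++, and the config path here is hypothetical:

```python
# Illustrative only: benchmark candidate GEMM implementations per (m, n, k)
# shape once, then cache the winner in a JSON config for later runs.
import json, os, time
import torch

CONFIG_PATH = os.path.expanduser("~/.lightseq_igemm_config.json")  # hypothetical path

def benchmark(fn, a, b, iters=10):
    fn(a, b)                                    # warm up
    start = time.perf_counter()
    for _ in range(iters):
        fn(a, b)
    return (time.perf_counter() - start) / iters

def best_algo_for_shape(m, n, k, candidates):
    cache = json.load(open(CONFIG_PATH)) if os.path.exists(CONFIG_PATH) else {}
    key = f"{m}x{n}x{k}"
    if key not in cache:                        # search once, then reuse
        a, b = torch.randn(m, k), torch.randn(k, n)
        cache[key] = min(candidates, key=lambda name: benchmark(candidates[name], a, b))
        json.dump(cache, open(CONFIG_PATH, "w"))
    return cache[key]

algos = {"matmul": torch.matmul, "mm": torch.mm}   # stand-ins for cuBLASLt algo ids
print(best_algo_for_shape(64, 512, 512, algos))
```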

* fix auto gemm & new arch pad mask bug (#399)

* save both col and col32 gemm config

* modify auto-gemm config name

* format

* add A100 igemm config

* fix gemm test workspace oom bug

* add T4 igemm config

* use col dataorder in inference transformer model

* add A30 igemm config

* support inference best igemm algo

* fix workspace bug

* fix workspace bug

* malloc workspace once

* fix workspacesize bug

* support auto igemm for bert

* support auto igemm for quant training

* support download igemm config from website

* save config to HOME, download only once

* move get_sm inside class

* fix logits igemm col bug

* fix new arch pad mask bug

* fix bug

* fix auto gemm bug (#401)

* save both col and col32 gemm config

* modify auto-gemm config name

* format

* add A100 igemm config

* fix gemm test workspace oom bug

* add T4 igemm config

* use col dataorder in inference transformer model

* add A30 igemm config

* support inference best igemm algo

* fix workspace bug

* fix workspace bug

* malloc workspace once

* fix workspacesize bug

* support auto igemm for bert

* support auto igemm for quant training

* support download igemm config from website

* save config to HOME, download only once

* move get_sm inside class

* fix logits igemm col bug

* fix new arch pad mask bug

* fix bug

* fix algo_map bug

* add A10 igemm config

* update url of igemm configs

* use col32 in T4

* delete useless algomap init func

* Fix train bug (#403)

* fix train relu quant bug

* fix auto gemm algomap bug

* update to 3.0.0 (#404)

* add auto igemm for gpt, vit (#408)

* make gcq params compatible (#409)

* fix gcq params

* fix format

Co-authored-by: zhangying.1998 <[email protected]>
Co-authored-by: xiongying.taka <[email protected]>

* Fix gpu name (#415)

* fix gpu name bug when on different gpus

* update version to 3.0.1

* Update README and doc (#417)

Update README and doc

* Hard gate moe (#424)

* hard gate moe dev

* moe hard gate dev

* moe hard gate dev 2

* hard gate moe dev 3

* hard gate moe dev 3

* add remark

* add annotation for moe.cc

* rename to moe_fw_batch1

* rename to moe_fw_hard_gate_batchn

* format code style for hard_gate_moe

* rename to moe_fw_hard_gate_batch1

* move initialization for hard_gate to construction method

* delete batch1 and batchn forward function for encoder

* delete batch1 and batchn forward function for hard gate moe decoder

* hard_moe decoder annotation

* hard_moe decoder annotation

* reformat code for hard gate moe

* fix bug for hard_gate_moe proto parse (#426)

* add hip lightseq

* test precommit

* test pre-commit

* modify readme_hip

Co-authored-by: Yang Wei <[email protected]>
Co-authored-by: xiongying.taka <[email protected]>
Co-authored-by: hexisyztem <[email protected]>
Co-authored-by: zhoubofan <[email protected]>
Co-authored-by: Yang Wei <[email protected]>
Co-authored-by: Xiaohui Wang <[email protected]>
Co-authored-by: Jersey <[email protected]>
Co-authored-by: zhangzhexi <[email protected]>
Co-authored-by: aachong <[email protected]>
Co-authored-by: duanrenchong <[email protected]>
Co-authored-by: xian8 <[email protected]>
Co-authored-by: lidao <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: anaivebird <[email protected]>
Co-authored-by: naivebird <[email protected]>
Co-authored-by: Ying Zhang <[email protected]>
Co-authored-by: zhangying.1998 <[email protected]>
Co-authored-by: AnYang <[email protected]>