support QAT, export and inference for quantized BERT, GPT2 (#285)
* modify readme of examples

* modify table in example readme

* add cpp example of quant_transformer

* support huggingface bert ptq (stage 1)

* fix huggingface bert weight loading fp16 bug

* finetune quant bert from fp16 ckpt

* add emb quant of bert

* add example of hf bert squad training, modify dir of huggingface training

* format

* rename huggingface dir to fix conflict with datasets

* fix typo of gpt

* export fairseq models to hdf5

* quant hdf5 load (stage 1)

* quant hdf5 transformer finished

* fix fairseq infer bug

* export quant bert, delete hf quant pos emb

* add quant bert files

* support quant bert inference (not tested)

* fix quant bert export name bug

* support quant bert inference

* update black pre-commit version

* add quant bert test example

* support cpp quant bert example

* format

* modify readme

* do not use ffn2 out quant if using gelu

* polish gemm test

* fix gemm test lt col bug

* support gpt2 qat

* add causal mask for gpt encoder

* support quant gpt export

* add quant gpt required files

* support quant gpt inference (stage 1)

* add fake quant for logits gemm

* support quant gpt inference (stage 2)

* support quant gpt inference (stage 3)

* support quant gpt inference (ppl)

* support quant gpt inference (TODO: fix qkv bias out clip_max, sampling)

* support quant gpt inference (ppl)

* support quant gpt inference (sampling)

* support quant decoder sampling

* modify readme (add install command)

* optimize quant gpt gemm, fix gelu bug

* optimize cpp example

* replace quant gpt cache memcpy with pointer switch

* fuse quant gpt softmax kernel

* optimize quant gpt arrange-qkv kernel

* fix PyPI spelling

* fix gpt memory spelling
godweiyang authored May 7, 2022
1 parent 3c1c506 commit 4024ae1
Showing 111 changed files with 9,766 additions and 1,715 deletions.
20 changes: 16 additions & 4 deletions README.md
@@ -41,7 +41,7 @@ The following is a support matrix of LightSeq **inference** library compared wit
## Performance

### [>>> Training](./lightseq/training)
Here we present the experimental results on WMT14 English to German translation task based on Transformer-big models. We train Transformer models of different sizes on eight NVIDIA Tesla V100/NVIDIA Ampere A100 GPUs with data parallel and fp16 mixed precision.
Here we present the experimental results on WMT14 English to German translation task based on Transformer-big models. We train Transformer models of different sizes on eight NVIDIA Tesla V100/NVIDIA Tesla A100 GPUs with data parallel and fp16 mixed precision.
[Fairseq](https://github.com/pytorch/fairseq) with [Apex](https://github.com/NVIDIA/apex) is chosen as our baseline.

<img src="./docs/training/images/single_step.png" width="80%" aligned="middle">
@@ -66,6 +66,20 @@ More results is available [here](./docs/inference/performance.md).
## Quick Start
Complete user guide is available [here](docs/guide.md).

### Installation
You can install LightSeq from PyPI:
```shell
$ pip install lightseq
```

LightSeq installation from PyPI only supports Python 3.6 to 3.8 on Linux for now. Consider compiling from source if you have other environments:
```shell
$ PATH=/usr/local/hdf5/:$PATH ENABLE_FP32=0 ENABLE_DEBUG=0 pip install -e $PROJECT_DIR
```

Detailed building introduction is available [here](docs/inference/build.md).
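
If you want to verify the installation before running any example, a minimal import check is enough. A small sketch, assuming nothing beyond the package importing cleanly:

```python
# quick sanity check that the wheel and its inference bindings are importable
import lightseq
import lightseq.inference as lsi

print(lightseq.__file__)  # where the package was installed
print(dir(lsi))           # available model classes, e.g. Transformer, Bert, Gpt
```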


### Fast training from Fairseq

You can experience lightning-fast training by running the following commands,
@@ -97,12 +111,10 @@ $ cd examples/inference/python
then you can check the performance by simply running the following commands. `hf_bart_export.py` is used to transform PyTorch weights to LightSeq protobuf.

```shell
$ python export/hf_bart_export.py
$ python export/huggingface/hf_bart_export.py
$ python test/ls_bart.py
```
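
For orientation, here is a minimal sketch of what the export-then-test flow boils down to in Python. It assumes the `lightseq.inference` API with a `Transformer` class; the exported file name and the exact layout of the returned ids depend on the export script and LightSeq version:

```python
# load an exported BART model with LightSeq and run generation (sketch)
from transformers import BartTokenizer
import lightseq.inference as lsi

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
# file written by hf_bart_export.py; adjust to the actual output path
model = lsi.Transformer("lightseq_bart_base.pb", 128)  # 128 = max batch size

ids = tokenizer(["I love that girl, but she does not love me."])["input_ids"]
outputs = model.infer(ids)  # generated token ids (exact shape depends on version)
print(outputs)
```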

LightSeq installation from pypi only supports python 3.6 to 3.8 on Linux for now. Consider compiling from source if you have other environments.

More usage is available [here](./lightseq/inference/README.md).

### Fast deploy inference server
2 changes: 1 addition & 1 deletion docker/README.md
@@ -1,5 +1,5 @@
## Dockerfiles of lightseq

Pypi: for publish python package.
PyPI: for publish python package.

Tritonserver: for publish tritonserver
4 changes: 2 additions & 2 deletions docs/guide.md
@@ -119,7 +119,7 @@ These functions can export the configuration, embedding, encoder and decoder wei
LightSeq provides export examples of native Hugging Face BERT/BART/GPT2, Fairseq trained with LightSeq and LightSeq Transformer. All codes are available [here](../examples/inference/python/export).

#### Fairseq
The main code is as follows (some parameters are omitted). Complete code is available [here](../examples/inference/python/export/ls_fs_transformer_export.py).
The main code is as follows (some parameters are omitted). Complete code is available [here](../examples/inference/python/export/fairseq/ls_fs_transformer_export.py).
```python
model = Transformer()
encoder_state_dict, decoder_state_dict = _extract_weight(state_dict)
@@ -136,7 +136,7 @@ First, you need to divide the state dict into two parts of encoder and decoder,
The above functions export the checkpoints to protobuf by default. Specify `save_pb=False` to export to hdf5 files. You can use the [Fairseq training example](../examples/training/fairseq) to obtain the trained checkpoints.

#### Hugging Face
LightSeq provides three examples of exporting native Hugging Face models ([BERT](../examples/inference/python/export/hf_bert_export.py), [BART](../examples/inference/python/export/hf_bart_export.py) and [GPT2](../examples/inference/python/export/hf_gpt2_export.py)). Because these native models were not pretrained with LightSeq modules, users must write the export rules manually.
LightSeq provides three examples of exporting native Hugging Face models ([BERT](../examples/inference/python/export/huggingface/hf_bert_export.py), [BART](../examples/inference/python/export/huggingface/hf_bart_export.py) and [GPT2](../examples/inference/python/export/huggingface/hf_gpt2_export.py)). Because these native models were not pretrained with LightSeq modules, users must write the export rules manually.

#### LightSeq Transformer
LightSeq provides an example of exporting its own Transformer module, which is similar to the Fairseq model export. You can use the [custom training example](../examples/training/custom) to obtain the trained checkpoints. This export example can also compare the results and speed of forward propagation between the training library and the inference library loading either protobuf or hdf5 files. The results show that the inference library is about 2x faster than forward propagation in the training library.
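
To make the hdf5 option above concrete, here is a minimal sketch written with plain `h5py` rather than the actual LightSeq export helpers; the checkpoint path and the group layout are illustrative and do not match the exact schema the inference engine expects:

```python
# split a Fairseq checkpoint into encoder/decoder parts and dump them to hdf5,
# mirroring the _extract_weight step described above (illustrative layout only)
import h5py
import numpy as np
import torch

state_dict = torch.load("checkpoint_best.pt", map_location="cpu")["model"]
encoder_sd = {k: v for k, v in state_dict.items() if k.startswith("encoder.")}
decoder_sd = {k: v for k, v in state_dict.items() if k.startswith("decoder.")}

with h5py.File("transformer.hdf5", "w") as f:
    for name, tensor in {**encoder_sd, **decoder_sd}.items():
        # one dataset per parameter; the real exporter regroups these into the
        # per-layer layout the inference engine expects
        f.create_dataset(name.replace(".", "/"),
                         data=tensor.float().numpy().astype(np.float16))
```
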
Binary file modified docs/training/images/single_step.png
9 changes: 9 additions & 0 deletions examples/inference/cpp/CMakeLists.txt
@@ -3,11 +3,20 @@ cmake_minimum_required(VERSION 3.18)
add_executable(transformer_example transformer_example.cc)
target_link_libraries(transformer_example PUBLIC liblightseq)

add_executable(quant_transformer_example quant_transformer_example.cc)
target_link_libraries(quant_transformer_example PUBLIC liblightseq)

add_executable(bert_example bert_example.cc)
target_link_libraries(bert_example PUBLIC liblightseq)

add_executable(quant_bert_example quant_bert_example.cc)
target_link_libraries(quant_bert_example PUBLIC liblightseq)

add_executable(gpt_example gpt_example.cc)
target_link_libraries(gpt_example PUBLIC liblightseq)

add_executable(quant_gpt_example quant_gpt_example.cc)
target_link_libraries(quant_gpt_example PUBLIC liblightseq)

add_executable(transformer_decoder_example decoder_example.cc.cu)
target_link_libraries(transformer_decoder_example PUBLIC transformer_model)
24 changes: 20 additions & 4 deletions examples/inference/cpp/bert_example.cc
@@ -8,15 +8,31 @@ Example of how to run Bert inference using our implementation.

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {2859, 2758, 2051, 2157,
2005, 6629, 7566, 1012};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"Bert", model_weights_path, max_batch_size);

int batch_size = 1;
int batch_seq_len = 8;
std::vector<int> host_input = {101, 4931, 1010, 2129, 2024, 2017, 102, 0};

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
25 changes: 20 additions & 5 deletions examples/inference/cpp/gpt_example.cc
@@ -8,15 +8,30 @@ Example of how to run gpt inference using our implementation.

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {40, 1842, 345, 11, 475, 345, 910, 326};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"Gpt", model_weights_path, max_batch_size);

int batch_size = 1;
int batch_seq_len = 5;
std::vector<int> host_input = {3666, 1438, 318, 402, 11571};

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
@@ -58,7 +73,7 @@ int main(int argc, char* argv[]) {
}
std::cout << std::endl;

lightseq::cuda::print_vec(d_output, "output", 5);
lightseq::cuda::print_vec(d_output, "output", 10);
}

return 0;
81 changes: 81 additions & 0 deletions examples/inference/cpp/quant_bert_example.cc
@@ -0,0 +1,81 @@
#include "model_base.h"
#include "util.h"

/**
@file
Example of how to run QuantBert inference using our implementation.
*/

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {2859, 2758, 2051, 2157,
2005, 6629, 7566, 1012};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"QuantBert", model_weights_path, max_batch_size);

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
lightseq::cuda::CHECK_GPU_ERROR(cudaMemcpy(
d_input, host_input.data(), sizeof(int) * batch_size * batch_seq_len,
cudaMemcpyHostToDevice));

model->set_input_ptr(0, d_input);
model->set_input_shape(0, {batch_size, batch_seq_len});

for (int i = 0; i < model->get_output_size(); i++) {
void* d_output;
std::vector<int> shape = model->get_output_max_shape(i);
int total_size = 1;
for (int j = 0; j < shape.size(); j++) {
total_size *= shape[j];
}
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_output, total_size * sizeof(int)));
model->set_output_ptr(i, d_output);
}
lightseq::cuda::CHECK_GPU_ERROR(cudaStreamSynchronize(0));
std::cout << "infer preprocessing finished" << std::endl;

/* ---step5. infer and log--- */
for (int i = 0; i < 10; i++) {
auto start = std::chrono::high_resolution_clock::now();
model->Infer();
lightseq::cuda::print_time_duration(start, "one infer time", 0);
}

for (int i = 0; i < model->get_output_size(); i++) {
const float* d_output;
d_output = static_cast<const float*>(model->get_output_ptr(i));
std::vector<int> shape = model->get_output_shape(i);
std::cout << "output shape: ";
for (int j = 0; j < shape.size(); j++) {
std::cout << shape[j] << " ";
}
std::cout << std::endl;

lightseq::cuda::print_vec(d_output, "output", 5);
}

return 0;
}
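
For comparison with the C++ driver above, a Python sketch of the same flow. It assumes the Python wheel exposes a `QuantBert` class mirroring the "QuantBert" name registered with the C++ model factory, and that `infer` accepts a batch of token ids; treat both as assumptions rather than documented API:

```python
# hypothetical Python counterpart of quant_bert_example.cc
import numpy as np
import lightseq.inference as lsi

# same example token ids as the C++ driver
example_input = [2859, 2758, 2051, 2157, 2005, 6629, 7566, 1012]
batch = np.array([example_input], dtype=np.int32)  # [batch_size, batch_seq_len]

# class name assumed from the C++ factory registration ("QuantBert")
model = lsi.QuantBert("lightseq_quant_bert.hdf5", 128)  # 128 = max batch size

output = model.infer(batch)  # per-token encoder output, as in the C++ example
print(output)
```
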
80 changes: 80 additions & 0 deletions examples/inference/cpp/quant_gpt_example.cc
@@ -0,0 +1,80 @@
#include "model_base.h"
#include "gpt.h"

/**
@file
Example of how to run gpt inference using our implementation.
*/

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {40, 1842, 345, 11, 475, 345, 910, 326};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"QuantGpt", model_weights_path, max_batch_size);

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
lightseq::cuda::CHECK_GPU_ERROR(cudaMemcpy(
d_input, host_input.data(), sizeof(int) * batch_size * batch_seq_len,
cudaMemcpyHostToDevice));

model->set_input_ptr(0, d_input);
model->set_input_shape(0, {batch_size, batch_seq_len});

for (int i = 0; i < model->get_output_size(); i++) {
void* d_output;
std::vector<int> shape = model->get_output_max_shape(i);
int total_size = 1;
for (int j = 0; j < shape.size(); j++) {
total_size *= shape[j];
}
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_output, total_size * sizeof(int)));
model->set_output_ptr(i, d_output);
}
lightseq::cuda::CHECK_GPU_ERROR(cudaStreamSynchronize(0));
std::cout << "infer preprocessing finished" << std::endl;

/* ---step5. infer and log--- */
for (int i = 0; i < 10; i++) {
auto start = std::chrono::high_resolution_clock::now();
model->Infer();
lightseq::cuda::print_time_duration(start, "one infer time", 0);
}

for (int i = 0; i < model->get_output_size(); i++) {
const int* d_output;
d_output = static_cast<const int*>(model->get_output_ptr(i));
std::vector<int> shape = model->get_output_shape(i);
std::cout << "output shape: ";
for (int j = 0; j < shape.size(); j++) {
std::cout << shape[j] << " ";
}
std::cout << std::endl;

lightseq::cuda::print_vec(d_output, "output", 10);
}

return 0;
}