support QAT, export and inference for quantized BERT, GPT2 #285

Merged
merged 53 commits on May 7, 2022

Changes from all commits
53 commits
f391565
modify readme of examples
godweiyang Mar 15, 2022
5ed629a
modify table in example readme
godweiyang Mar 15, 2022
35adcfd
add cpp example of quant_transformer
godweiyang Mar 16, 2022
e8fa612
support huggingface bert ptq (stage 1)
godweiyang Mar 16, 2022
8c72f26
fix huggingface bert weight loading fp16 bug
godweiyang Mar 17, 2022
1d6c376
finetune quant bert from fp16 ckpt
godweiyang Mar 17, 2022
872b540
add emb quant of bert
godweiyang Mar 18, 2022
cf0caa6
add example of hf bert squad training, modify dir of huggingface trai…
godweiyang Mar 18, 2022
ae3ffbd
format
godweiyang Mar 18, 2022
6657d86
rename huggingface dir to fix conflict with datasets
godweiyang Mar 18, 2022
4a986ae
fix typo of gpt
godweiyang Mar 18, 2022
76aa5d8
export fairseq models to hdf5
godweiyang Mar 22, 2022
ebe1071
quant hdf5 load (stage 1)
godweiyang Mar 23, 2022
ee296bb
quant hdf5 transformer finished
godweiyang Mar 23, 2022
b964832
fix fairseq infer bug
godweiyang Mar 23, 2022
5f52dd6
export quant bert, delete hf quant pos emb
godweiyang Mar 24, 2022
63e90d9
add quant bert files
godweiyang Mar 24, 2022
d228cf2
support quant bert inference (not test)
godweiyang Mar 28, 2022
3400a1d
fix quant bert export name bug
godweiyang Mar 28, 2022
968f9ac
support quant bert inference
godweiyang Mar 30, 2022
df94d69
update black pre-commit version
godweiyang Mar 30, 2022
6d3e74c
add quant bert test example
godweiyang Apr 6, 2022
0189252
support cpp quant bert example
godweiyang Apr 6, 2022
1078cc2
format
godweiyang Apr 6, 2022
33bb905
modify readme
godweiyang Apr 8, 2022
3bb9f73
do not use ffn2 out quant if using gelu
godweiyang Apr 8, 2022
fa7b8cb
polish gemm test
godweiyang Apr 11, 2022
e8912b7
fix gemm test lt col bug
godweiyang Apr 14, 2022
ff64270
support gpt2 qat
godweiyang Apr 18, 2022
c17fdbb
add causal mask for gpt encoder
godweiyang Apr 18, 2022
19dd24a
support quant gpt export
godweiyang Apr 19, 2022
88ae1d7
add quant gpt required files
godweiyang Apr 19, 2022
a594523
fix conflict
godweiyang Apr 19, 2022
61cb0c4
support quant gpt inference (stage 1)
godweiyang Apr 20, 2022
436799d
fix conflict
godweiyang Apr 21, 2022
bc0a7d5
fix conflict
godweiyang Apr 21, 2022
c5f6aa2
add fake quant for logits gemm
godweiyang Apr 21, 2022
292cc3c
support quant gpt inference (stage 2)
godweiyang Apr 21, 2022
7ba1c6a
support quant gpt inference (stage 3)
godweiyang Apr 24, 2022
1ab6bfc
support quant gpt inference (ppl)
godweiyang Apr 25, 2022
ca9739b
support quant gpt inference (TODO: fix qkv bias out clip_max, sampling)
godweiyang Apr 25, 2022
56eb950
support quant gpt inference (ppl)
godweiyang Apr 26, 2022
d3a5807
support quant gpt inference (sampling)
godweiyang Apr 27, 2022
a37e20f
support quant decoder sampling
godweiyang Apr 27, 2022
305ef73
modify readme (add install command)
godweiyang Apr 27, 2022
c1141d8
optimize quant gpt gemm, fix gelu bug
godweiyang Apr 27, 2022
7ff1fc4
optimize cpp example
godweiyang Apr 27, 2022
6a4c705
replace quant gpt cache memcpy with pointer switch
godweiyang Apr 28, 2022
e68bf8f
fuse quant gpt softmax kernel
godweiyang Apr 28, 2022
9e40037
optimize quant gpt arrange-qkv kernel
godweiyang Apr 28, 2022
dd71c87
modify PyPI spelling
godweiyang May 5, 2022
c687af2
Merge branch 'master' into opt-example
godweiyang May 5, 2022
8c4b81e
fix gpt memory spelling
godweiyang May 5, 2022
20 changes: 16 additions & 4 deletions README.md
@@ -41,7 +41,7 @@ The following is a support matrix of LightSeq **inference** library compared wit
## Performance

### [>>> Training](./lightseq/training)
Here we present the experimental results on WMT14 English to German translation task based on Transformer-big models. We train Transformer models of different sizes on eight NVIDIA Tesla V100/NVIDIA Ampere A100 GPUs with data parallel and fp16 mixed precision.
Here we present the experimental results on WMT14 English to German translation task based on Transformer-big models. We train Transformer models of different sizes on eight NVIDIA Tesla V100/NVIDIA Tesla A100 GPUs with data parallel and fp16 mixed precision.
[Fairseq](https://github.com/pytorch/fairseq) with [Apex](https://github.com/NVIDIA/apex) is chosen as our baseline.

<img src="./docs/training/images/single_step.png" width="80%" aligned="middle">
@@ -66,6 +66,20 @@ More results are available [here](./docs/inference/performance.md).
## Quick Start
Complete user guide is available [here](docs/guide.md).

### Installation
You can install LightSeq from PyPI:
```shell
$ pip install lightseq
```

LightSeq installation from PyPI only supports Python 3.6 to 3.8 on Linux for now. Consider compiling from source if you have other environments:
```shell
$ PATH=/usr/local/hdf5/:$PATH ENABLE_FP32=0 ENABLE_DEBUG=0 pip install -e $PROJECT_DIR
```

Detailed build instructions are available [here](docs/inference/build.md).
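If the install succeeds, a quick sanity check is to import the package from Python (a minimal sketch; it only confirms that the wheel or source build is importable):
```shell
$ python -c "import lightseq; print(lightseq.__file__)"
```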


### Fast training from Fairseq

You can experience lightning fast training by running the following commands,
@@ -97,12 +111,10 @@ $ cd examples/inference/python
then you can check the performance by simply running the following commands. `hf_bart_export.py` is used to transform PyTorch weights to LightSeq protobuf.

```shell
$ python export/hf_bart_export.py
$ python export/huggingface/hf_bart_export.py
$ python test/ls_bart.py
```
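For reference, the test step boils down to loading the exported weights with the Python inference API, roughly as sketched below (the weight file name, maximum batch size, and token ids are illustrative assumptions; `test/ls_bart.py` is the authoritative version):
```python
import lightseq.inference as lsi

# Illustrative values only: the exported file name and max batch size are assumptions.
model = lsi.Transformer("lightseq_bart_base.pb", 128)  # exported weights, max batch size
output_token_ids = model.infer([[63, 47, 65, 1507, 88, 74]])  # a batch of token id lists
print(output_token_ids)
```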

LightSeq installation from pypi only supports python 3.6 to 3.8 on Linux for now. Consider compiling from source if you have other environments.

More usage is available [here](./lightseq/inference/README.md).

### Fast deploy inference server
2 changes: 1 addition & 1 deletion docker/README.md
@@ -1,5 +1,5 @@
## Dockerfiles of lightseq

Pypi: for publish python package.
PyPI: for publish python package.

Tritonserver: for publish tritonserver
4 changes: 2 additions & 2 deletions docs/guide.md
@@ -119,7 +119,7 @@ These functions can export the configuration, embedding, encoder and decoder wei
LightSeq provides export examples of native Hugging Face BERT/BART/GPT2, Fairseq models trained with LightSeq, and the LightSeq Transformer. All code is available [here](../examples/inference/python/export).

#### Fairseq
The main code is as follows (some parameters are omitted). Complete code is available [here](../examples/inference/python/export/ls_fs_transformer_export.py).
The main code is as follows (some parameters are omitted). Complete code is available [here](../examples/inference/python/export/fairseq/ls_fs_transformer_export.py).
```python
model = Transformer()
encoder_state_dict, decoder_state_dict = _extract_weight(state_dict)
@@ -136,7 +136,7 @@ First, you need to divide the state dict into two parts of encoder and decoder,
The above functions export the checkpoints to protobuf by default. Specify `save_pb=False` to export to hdf5 files. You can use the [Fairseq training example](../examples/training/fairseq) to obtain the trained checkpoints.
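As a concrete illustration of the splitting step, here is a minimal sketch (the `encoder.`/`decoder.` key prefixes follow Fairseq's naming and are an assumption here; the real logic lives in `_extract_weight` in the example script):
```python
# Hedged sketch: split a Fairseq-style state dict into encoder and decoder parts by key prefix.
def split_state_dict(state_dict):
    encoder_state, decoder_state = {}, {}
    for name, tensor in state_dict.items():
        if name.startswith("encoder."):
            encoder_state[name] = tensor
        elif name.startswith("decoder."):
            decoder_state[name] = tensor
    return encoder_state, decoder_state

# Toy usage with dummy values standing in for real tensors.
enc, dec = split_state_dict(
    {"encoder.layers.0.fc1.weight": 0, "decoder.layers.0.fc1.weight": 1}
)
print(len(enc), len(dec))  # 1 1
```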

#### Hugging Face
LightSeq provides three examples of exporting native Hugging Face models ([BERT](../examples/inference/python/export/hf_bert_export.py), [BART](../examples/inference/python/export/hf_bart_export.py) and [GPT2](../examples/inference/python/export/hf_gpt2_export.py)). Because these native models were not pretrained with LightSeq modules, users must write the export rules manually.
LightSeq provides three examples of exporting native Hugging Face models ([BERT](../examples/inference/python/export/huggingface/hf_bert_export.py), [BART](../examples/inference/python/export/huggingface/hf_bart_export.py) and [GPT2](../examples/inference/python/export/huggingface/hf_gpt2_export.py)). Because these native models were not pretrained with LightSeq modules, users must write the export rules manually.

#### LightSeq Transformer
LightSeq provides an example of exporting its own Transformer module, which is similar to the Fairseq model export. You can use the [custom training example](../examples/training/custom) to obtain the trained checkpoints. This export example can also compare the results and speeds of forward propagation in the training library with those of the inference library loading both protobuf and hdf5 files. The results show that the inference library is about 2x faster than the forward propagation of the training library.
Binary file modified docs/training/images/single_step.png
9 changes: 9 additions & 0 deletions examples/inference/cpp/CMakeLists.txt
@@ -3,11 +3,20 @@ cmake_minimum_required(VERSION 3.18)
add_executable(transformer_example transformer_example.cc)
target_link_libraries(transformer_example PUBLIC liblightseq)

add_executable(quant_transformer_example quant_transformer_example.cc)
target_link_libraries(quant_transformer_example PUBLIC liblightseq)

add_executable(bert_example bert_example.cc)
target_link_libraries(bert_example PUBLIC liblightseq)

add_executable(quant_bert_example quant_bert_example.cc)
target_link_libraries(quant_bert_example PUBLIC liblightseq)

add_executable(gpt_example gpt_example.cc)
target_link_libraries(gpt_example PUBLIC liblightseq)

add_executable(quant_gpt_example quant_gpt_example.cc)
target_link_libraries(quant_gpt_example PUBLIC liblightseq)

add_executable(transformer_decoder_example decoder_example.cc.cu)
target_link_libraries(transformer_decoder_example PUBLIC transformer_model)
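With these targets added, the new binaries build like the existing examples; a hedged sketch of the build commands, assuming an already configured `build` directory:
```shell
$ cd build
$ cmake --build . --target quant_transformer_example quant_bert_example quant_gpt_example
```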
24 changes: 20 additions & 4 deletions examples/inference/cpp/bert_example.cc
@@ -8,15 +8,31 @@ Example of how to run Bert inference using our implementation.

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {2859, 2758, 2051, 2157,
2005, 6629, 7566, 1012};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"Bert", model_weights_path, max_batch_size);

int batch_size = 1;
int batch_seq_len = 8;
std::vector<int> host_input = {101, 4931, 1010, 2129, 2024, 2017, 102, 0};

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
25 changes: 20 additions & 5 deletions examples/inference/cpp/gpt_example.cc
@@ -8,15 +8,30 @@ Example of how to run gpt inference using our implementation.

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {40, 1842, 345, 11, 475, 345, 910, 326};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"Gpt", model_weights_path, max_batch_size);

int batch_size = 1;
int batch_seq_len = 5;
std::vector<int> host_input = {3666, 1438, 318, 402, 11571};

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
@@ -58,7 +73,7 @@ int main(int argc, char* argv[]) {
}
std::cout << std::endl;

lightseq::cuda::print_vec(d_output, "output", 5);
lightseq::cuda::print_vec(d_output, "output", 10);
}

return 0;
81 changes: 81 additions & 0 deletions examples/inference/cpp/quant_bert_example.cc
@@ -0,0 +1,81 @@
#include "model_base.h"
#include "util.h"

/**
@file
Example of how to run QuantBert inference using our implementation.
*/

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {2859, 2758, 2051, 2157,
2005, 6629, 7566, 1012};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"QuantBert", model_weights_path, max_batch_size);

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
lightseq::cuda::CHECK_GPU_ERROR(cudaMemcpy(
d_input, host_input.data(), sizeof(int) * batch_size * batch_seq_len,
cudaMemcpyHostToDevice));

model->set_input_ptr(0, d_input);
model->set_input_shape(0, {batch_size, batch_seq_len});

for (int i = 0; i < model->get_output_size(); i++) {
void* d_output;
std::vector<int> shape = model->get_output_max_shape(i);
int total_size = 1;
for (int j = 0; j < shape.size(); j++) {
total_size *= shape[j];
}
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_output, total_size * sizeof(int)));
model->set_output_ptr(i, d_output);
}
lightseq::cuda::CHECK_GPU_ERROR(cudaStreamSynchronize(0));
std::cout << "infer preprocessing finished" << std::endl;

/* ---step5. infer and log--- */
for (int i = 0; i < 10; i++) {
auto start = std::chrono::high_resolution_clock::now();
model->Infer();
lightseq::cuda::print_time_duration(start, "one infer time", 0);
}

for (int i = 0; i < model->get_output_size(); i++) {
const float* d_output;
d_output = static_cast<const float*>(model->get_output_ptr(i));
std::vector<int> shape = model->get_output_shape(i);
std::cout << "output shape: ";
for (int j = 0; j < shape.size(); j++) {
std::cout << shape[j] << " ";
}
std::cout << std::endl;

lightseq::cuda::print_vec(d_output, "output", 5);
}

return 0;
}
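Given the argument handling above, a typical run might look like the following (the weights path is illustrative; the optional batch size and sequence length must keep the batch size within the 128 maximum). The `quant_gpt_example` below follows the same convention:
```shell
$ ./quant_bert_example /path/to/quant_bert_weights.hdf5 8 64
```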
80 changes: 80 additions & 0 deletions examples/inference/cpp/quant_gpt_example.cc
@@ -0,0 +1,80 @@
#include "model_base.h"
#include "gpt.h"

/**
@file
Example of how to run QuantGpt inference using our implementation.
*/

int main(int argc, char* argv[]) {
std::string model_weights_path = argv[1];
std::vector<int> example_input = {40, 1842, 345, 11, 475, 345, 910, 326};
int eg_seq_len = example_input.size();
int max_batch_size = 128;
int batch_size = 1;
int batch_seq_len = eg_seq_len;

if (argc == 4) {
batch_size = atoi(argv[2]);
batch_seq_len = atoi(argv[3]);
}
if (batch_size > max_batch_size) {
throw std::runtime_error("batch_size exceeds the maximum (128)!");
}

std::vector<int> host_input;
for (int i = 0; i < batch_size; ++i) {
for (int j = 0; j < batch_seq_len; ++j) {
host_input.push_back(example_input[j % eg_seq_len]);
}
}

auto model = lightseq::cuda::LSModelFactory::GetInstance().CreateModel(
"QuantGpt", model_weights_path, max_batch_size);

void* d_input;
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_input, sizeof(int) * batch_size * batch_seq_len));
lightseq::cuda::CHECK_GPU_ERROR(cudaMemcpy(
d_input, host_input.data(), sizeof(int) * batch_size * batch_seq_len,
cudaMemcpyHostToDevice));

model->set_input_ptr(0, d_input);
model->set_input_shape(0, {batch_size, batch_seq_len});

for (int i = 0; i < model->get_output_size(); i++) {
void* d_output;
std::vector<int> shape = model->get_output_max_shape(i);
int total_size = 1;
for (int j = 0; j < shape.size(); j++) {
total_size *= shape[j];
}
lightseq::cuda::CHECK_GPU_ERROR(
cudaMalloc(&d_output, total_size * sizeof(int)));
model->set_output_ptr(i, d_output);
}
lightseq::cuda::CHECK_GPU_ERROR(cudaStreamSynchronize(0));
std::cout << "infer preprocessing finished" << std::endl;

/* ---step5. infer and log--- */
for (int i = 0; i < 10; i++) {
auto start = std::chrono::high_resolution_clock::now();
model->Infer();
lightseq::cuda::print_time_duration(start, "one infer time", 0);
}

for (int i = 0; i < model->get_output_size(); i++) {
const int* d_output;
d_output = static_cast<const int*>(model->get_output_ptr(i));
std::vector<int> shape = model->get_output_shape(i);
std::cout << "output shape: ";
for (int j = 0; j < shape.size(); j++) {
std::cout << shape[j] << " ";
}
std::cout << std::endl;

lightseq::cuda::print_vec(d_output, "output", 10);
}

return 0;
}