ONNXRuntime extensions (C++ API): tokenizer gives the same output on different inputs. #23021

Valoisa · 2024-12-05T09:32:47Z

Hi guys. I converted a RobertaTokenizer to ONNX format, and now I need to use it from C++ code. The model takes a string tensor as input and is supposed to return int64 tensors for the tokens, attention mask and offset mapping (according to an inspection of its inputs/outputs). But the model outputs [0, 2] for every input. I tried it from C# and everything worked as expected, so most likely I'm doing something wrong here. I'm also not quite sure whether I'm handling memory ownership correctly. The debug-mode logs don't look suspicious. Here's what I'm doing:

	Ort::SessionOptions session_options;
	void* handle;
	try
	{
		Ort::ThrowOnError(Ort::GetApi().RegisterCustomOpsLibrary(session_options, EXTENSIONS_DLL.c_str(), &handle));
	}
	catch(Ort::Exception& e)
	{
		std::cerr << "Error during customops library loading: " << e.what() << std::endl;
	}
	std::cout << "ONNX Runtime Version: " << Ort::GetVersionString() << std::endl;
		
	Ort::Env env;
	Ort::Session session_{ env, FILE_PATH.c_str(), session_options };

	//	Initializing input names
	std::vector<const char*> input_names;
	input_names.reserve(session_.GetInputCount());
	for (int i = 0; i < session_.GetInputCount(); ++i)
		input_names.push_back(session_.GetInputNameAllocated(i, Ort::AllocatorWithDefaultOptions()).get());
	//	Initializing output names
	std::vector<const char*> output_names;
	output_names.reserve(session_.GetOutputCount());
	for (int i = 0; i < session_.GetOutputCount(); ++i)
		output_names.push_back(session_.GetOutputNameAllocated(i, Ort::AllocatorWithDefaultOptions()).get());

	//	Creating an input string tensor.
	std::vector<const char*> input_strings = { "mercedes benz" };
	size_t len = 1;
	for (const char* str : input_strings)
		len += strlen(str);
	std::vector<int64_t> string_shape = { static_cast<int64_t>(input_strings.size()) };

	Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
	Ort::Value input_tensor = [&]()
	{
		try
		{
			Ort::Value t = Ort::Value::CreateTensor(
				(OrtMemoryInfo*)memory_info, input_strings.data(), 32 + len * sizeof(char), string_shape.data(),
				string_shape.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
			return t;
		}
		catch(const Ort::Exception& e)
		{
			std::cerr << "Exception during CreateTensor: " << e.what() << std::endl;
		}
		return Ort::Value(nullptr);
	}();

	//	Finally, running a tokenizer
	Ort::RunOptions run_options;
	if ((OrtValue*)input_tensor != nullptr)
	{
		try
		{
			std::vector<Ort::Value> output_tensors = session_.Run(run_options, input_names.data(), &input_tensor,
				session_.GetInputCount(), output_names.data(), session_.GetOutputCount());
			Ort::TensorTypeAndShapeInfo shape_info = output_tensors[0].GetTensorTypeAndShapeInfo();
			std::vector<int64_t> shape = shape_info.GetShape();
			size_t rows = shape[0];
			size_t cols = shape[1];
			const int64_t* raw_data = output_tensors[0].GetTensorData<int64_t>();
			std::vector<std::vector<int64_t>> output;
			// Converting the output to readable result:
			output.resize(rows, std::vector<int64_t>(cols));
			for (size_t i = 0; i < rows; ++i) {
				for (size_t j = 0; j < cols; ++j) {
					output[i][j] = raw_data[i * cols + j];
				}
			}
			const int a = 0;
		}
		catch (const Ort::Exception& e)
		{
			std::cerr << "Error caught on Run: " << e.what() << std::endl;
		}
	}
	else
	{
		std::cerr << "Tensor was nullptr.\n";
	}
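
One thing worth checking in the snippet above is how the string input tensor is built. `CreateTensor` is called there with a buffer of `const char*` pointers and `ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING`. As far as I know, the usual route for string tensors in the C++ API is to let ONNX Runtime allocate the tensor and then copy the strings in with `FillStringTensor`. An output of [0, 2] is also what a RoBERTa tokenizer would give for an empty string (just `<s>` and `</s>`), so the text may not be reaching the model at all, though that is a guess and not a confirmed diagnosis. A minimal sketch, where `StringTensorShape`, `ToCStrings` and `MakeStringTensor` are hypothetical helper names and the `__has_include` guard only exists so the plain helpers compile without the ONNX Runtime headers:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Shape of a 1-D batch of strings: one element per input text.
std::vector<int64_t> StringTensorShape(const std::vector<std::string>& texts) {
    return { static_cast<int64_t>(texts.size()) };
}

// Borrowed C-string views into `texts`; valid only while `texts` is alive
// and unmodified.
std::vector<const char*> ToCStrings(const std::vector<std::string>& texts) {
    std::vector<const char*> ptrs;
    ptrs.reserve(texts.size());
    for (const std::string& t : texts) ptrs.push_back(t.c_str());
    return ptrs;
}

#if __has_include(<onnxruntime_cxx_api.h>)
#include <onnxruntime_cxx_api.h>

// Hypothetical helper (not part of ONNX Runtime): creates an
// allocator-backed string tensor and copies the texts into it.
Ort::Value MakeStringTensor(const std::vector<std::string>& texts) {
    Ort::AllocatorWithDefaultOptions allocator;
    const std::vector<int64_t> shape = StringTensorShape(texts);

    // ONNX Runtime owns the storage of this tensor.
    Ort::Value tensor = Ort::Value::CreateTensor(
        allocator, shape.data(), shape.size(),
        ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
    assert(tensor.IsTensor());

    // FillStringTensor copies the characters, so `ptrs` and `texts`
    // do not need to outlive this call.
    const std::vector<const char*> ptrs = ToCStrings(texts);
    tensor.FillStringTensor(ptrs.data(), ptrs.size());
    return tensor;
}
#endif
```

With this, `input_tensor` in the snippet would become `MakeStringTensor({ "mercedes benz" })`, and the `len` and `memory_info` variables would no longer be needed for the input.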

Any help is appreciated.
Just in case, here's the tokenizer.
BPEtokenizer.zip
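
On the memory-ownership question: `GetInputNameAllocated` and `GetOutputNameAllocated` return an owning smart pointer (`Ort::AllocatedStringPtr`). Calling `.get()` on the temporary and storing only the raw pointer leaves `input_names` and `output_names` pointing at memory that is freed at the end of each statement. A common pattern is to keep the owners next to the raw views. The sketch below shows that pattern in a library-independent form; `NameList` and `CollectNames` are hypothetical names, and `get_name` stands in for any callable that returns an owning pointer-like object with a `.get()` method.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Owners keep the strings alive; views are what a C-style API expects.
template <typename OwnedName>
struct NameList {
    std::vector<OwnedName> owners;
    std::vector<const char*> views;
};

// Calls get_name(i) for i in [0, count) and stores both the owning object
// and a raw pointer into it. The views stay valid as long as the returned
// NameList (or whatever it is moved into) is alive.
template <typename GetName>
auto CollectNames(std::size_t count, GetName get_name) {
    using OwnedName = decltype(get_name(std::size_t{0}));
    NameList<OwnedName> names;
    names.owners.reserve(count);
    names.views.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        names.owners.push_back(get_name(i));
        const char* view = names.owners.back().get();
        assert(view != nullptr);
        names.views.push_back(view);
    }
    return names;
}
```

In the code above this could be used as `auto inputs = CollectNames(session_.GetInputCount(), [&](std::size_t i) { return session_.GetInputNameAllocated(i, allocator); });` with `allocator` being an `Ort::AllocatorWithDefaultOptions`, then passing `inputs.views.data()` to `Run` while `inputs` is still in scope. The same applies to the output names.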
