# tokenizers-cpp

This project provides a cross-platform C++ tokenizer binding library that can be universally deployed.
It wraps and binds the [HuggingFace tokenizers library](https://github.com/huggingface/tokenizers)
and [sentencepiece](https://github.com/google/sentencepiece), and provides a minimal common interface in C++.

The main goal of the project is to enable tokenizer deployment for language model applications
to native platforms with minimal dependencies, and to remove some of the barriers of
cross-language bindings. This project is developed in part with, and
used in, [MLC LLM](https://github.com/mlc-ai/mlc-llm). We have tested the following platforms:

- iOS
- Android
- Windows
- Linux
- Web browser

## Getting Started

The easiest way is to add this project as a submodule and then
include it via `add_subdirectory` in your CMake project.
You will also need to enable C++17 support.

- First, make sure you have Rust installed.
- If you are cross-compiling, make sure you install the necessary Rust target.
  For example, run `rustup target add aarch64-apple-ios` to install the iOS target.
- You can then link the library, as shown in the CMake sketch below.

See the [example](example) folder for an example CMake project.
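
Below is a minimal `CMakeLists.txt` sketch of that flow. The target name `demo`,
the submodule path `3rdparty/tokenizers-cpp`, and the CMake minimum version are
illustrative assumptions; adjust them to your own project layout.

```cmake
# Sketch: consume tokenizers-cpp via a git submodule.
# Assumption: the submodule is checked out at 3rdparty/tokenizers-cpp,
# and `demo` stands in for your own executable target.
cmake_minimum_required(VERSION 3.18)
project(demo_project CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Builds libtokenizers_c.a, libsentencepiece.a, and libtokenizers_cpp.a.
add_subdirectory(3rdparty/tokenizers-cpp)

add_executable(demo main.cc)
# Linking tokenizers_cpp pulls in the other two static libraries.
target_link_libraries(demo PRIVATE tokenizers_cpp)
# If the include path is not propagated automatically, add it explicitly.
target_include_directories(demo PRIVATE 3rdparty/tokenizers-cpp/include)
```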

### Example Code

```c++
#include <tokenizers_cpp.h>

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

using tokenizers::Tokenizer;

// Helper to load a file from disk into an in-memory string blob.
std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (fs.fail()) {
    std::cerr << "Cannot open " << path << std::endl;
    exit(1);
  }
  return std::string(std::istreambuf_iterator<char>(fs),
                     std::istreambuf_iterator<char>());
}

// Expects a HuggingFace tokenizer config at dist/tokenizer.json.
void HuggingFaceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.json");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility on how these blobs can be read.
  auto tok = Tokenizer::FromBlobJSON(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}

// Expects a sentencepiece model at dist/tokenizer.model.
void SentencePieceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.model");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility on how these blobs can be read.
  auto tok = Tokenizer::FromBlobSentencePiece(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}
```

### Extra Details

Currently, the project generates three static libraries:

- `libtokenizers_c.a`: the C binding to the tokenizers Rust library
- `libsentencepiece.a`: the sentencepiece static library
- `libtokenizers_cpp.a`: the C++ binding implementation

If you are using an IDE, you can likely first use CMake to generate
these libraries and add them to your development environment.
If you are using CMake, `target_link_libraries(yourlib tokenizers_cpp)`
will automatically link in the other two libraries.
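
If you instead consume the prebuilt archives directly (for example, from an IDE
project that does not run this CMake build), a hedged sketch of the manual
equivalent is below; the `/path/to/...` locations are placeholders that depend
on where your build placed the archives.

```cmake
# Hypothetical manual linking against the prebuilt archives.
# tokenizers_cpp depends on the other two, so it is listed first.
target_link_libraries(yourlib PRIVATE
  /path/to/build/libtokenizers_cpp.a
  /path/to/build/libtokenizers_c.a
  /path/to/build/libsentencepiece.a)
# The headers still need to be on the include path.
target_include_directories(yourlib PRIVATE /path/to/tokenizers-cpp/include)
```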

You can also check out [MLC LLM](https://github.com/mlc-ai/mlc-llm)
as an example of a complete LLM chat application integration.

## JavaScript Support

We use Emscripten to expose tokenizers-cpp to WebAssembly and JavaScript.
Check out [web](web) for more details.

## Acknowledgements

This project is only possible thanks to the shoulders of open-source ecosystems that we stand on.
It is built on top of the sentencepiece and tokenizers libraries.