ctoc: Count Tokens of Code
Token counts play a key role in shaping a Large Language Model's (LLM) memory and conversation history. They are vital for prompt engineering and for estimating token costs. Many prompt-engineering strategies (e.g., contextual filtering and reranking) aim primarily at token compression, to work within an LLM's context-size limit.
ctoc provides a lightweight tool for analyzing codebases at the token level. It incorporates all the features of cloc. (You can use ctoc in a cloc-consistent manner.)
Built on top of gocloc, ctoc is extremely fast.
What are Tokens? (in the context of Large Language Models)
- Tokens: the basic units of text/code that LLM AI models process and generate.
- Tokenization: splitting input/output text into smaller units that the model can handle.
- Vocabulary size: the number of distinct tokens a model uses; it varies across GPT models.
- Tokenization cost: affects the memory and computational resources a model needs, which in turn influences the cost and performance of running an OpenAI or Azure OpenAI model.
refs: https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens
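To make tokenization concrete, here is a minimal sketch of counting the tokens in a one-line code snippet, assuming the pkoukk/tiktoken-go library (the tiktoken-go project referenced later in this README for BPE caching):

package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	// cl100k_base is the BPE encoding used by gpt-4 and gpt-3.5-turbo.
	tke, err := tiktoken.GetEncoding("cl100k_base")
	if err != nil {
		panic(err)
	}
	code := `fmt.Println("Hello, World!")`
	tokens := tke.Encode(code, nil, nil)
	fmt.Printf("%q -> %d tokens: %v\n", code, len(tokens), tokens)
}

ctoc runs this kind of encoding over every source file it counts, which is why token totals depend on the encoding you select (see the --encoding flag below).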
Install from GitHub release:
curl -sL "https://github.com/yaohui-wyh/ctoc/releases/download/v0.0.1/ctoc_$(uname)_$(uname -m).tar.gz" | tar xz && chmod +x ctoc && ctoc -h
Alternatively, you can install via go install (requires Go 1.19+):
go install github.com/yaohui-wyh/ctoc/cmd/ctoc@latest
$ ctoc -h
Usage:
  ctoc [OPTIONS]

Application Options:
      --by-file                                                report results for every encountered source file
      --sort=[name|files|blank|comment|code|tokens]            sort based on a certain column (default: code)
      --output-type=                                           output type [values: default,cloc-xml,sloccount,json] (default: default)
      --exclude-ext=                                           exclude file name extensions (separated commas)
      --include-lang=                                          include language name (separated commas)
      --match=                                                 include file name (regex)
      --not-match=                                             exclude file name (regex)
      --match-d=                                               include dir name (regex)
      --not-match-d=                                           exclude dir name (regex)
      --debug                                                  dump debug log for developer
      --skip-duplicated                                        skip duplicated files
      --show-lang                                              print about all languages and extensions
      --version                                                print version info
      --show-encoding                                          print about all LLM models and their corresponding encodings
      --encoding=[cl100k_base|p50k_base|p50k_edit|r50k_base]   specify tokenizer encoding (default: cl100k_base)

Help Options:
  -h, --help                                                   Show this help message
$ ctoc .
------------------------------------------------------------------------------------------------
Language                            files        blank      comment         code       tokens
------------------------------------------------------------------------------------------------
Go                                     15          282          153         2096        21839
XML                                     3            0            0          140         1950
YAML                                    1            0            0           40          237
Markdown                                1           13            0           34          322
Makefile                                1            6            0           15          128
------------------------------------------------------------------------------------------------
TOTAL                                  21          301          153         2325        24476
------------------------------------------------------------------------------------------------
Specify the output type as JSON:
$ ctoc --output-type=json .
{"languages":[{"name":"Go","files":16,"code":2113,"comment":155,"blank":285,"tokens":22000},{"name":"XML","files":3,"code":149,"comment":0,"blank":0,"tokens":1928},{"name":"Markdown","files":1,"code":136,"comment":0,"blank":31,"tokens":1874},{"name":"YAML","files":1,"code":40,"comment":0,"blank":0,"tokens":237},{"name":"Makefile","files":1,"code":19,"comment":0,"blank":7,"tokens":149}],"total":{"files":22,"code":2457,"comment":155,"blank":323,"tokens":26188}}
# For gpt-4, the price is $0.03/1k prompt tokens
$ echo "scale=2; 0.03*$(ctoc --output-type=json . | jq ".total.tokens")/1000" | bc
.79
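The same estimate can be scripted in Go. Below is a minimal sketch that shells out to ctoc and decodes the JSON shown above; the struct fields mirror the sample output, and the price constant is the gpt-4 prompt rate from the shell example:

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// ctocResult mirrors the shape of ctoc's --output-type=json result shown above.
type ctocResult struct {
	Total struct {
		Files  int `json:"files"`
		Code   int `json:"code"`
		Tokens int `json:"tokens"`
	} `json:"total"`
}

func main() {
	out, err := exec.Command("ctoc", "--output-type=json", ".").Output()
	if err != nil {
		panic(err)
	}
	var res ctocResult
	if err := json.Unmarshal(out, &res); err != nil {
		panic(err)
	}
	const pricePer1K = 0.03 // gpt-4 prompt tokens, per the comment above
	fmt.Printf("%d tokens ~= $%.2f\n", res.Total.Tokens, float64(res.Total.Tokens)/1000*pricePer1K)
}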
Print the token count for each Go file and sort the results by token count:
$ ctoc --by-file --include-lang=Go --sort=tokens .
-----------------------------------------------------------------------------------------------
File                               files        blank      comment         code       tokens
-----------------------------------------------------------------------------------------------
language.go                                       31            8          647         8673
file_test.go                                      72           13          481         4136
cmd/ctoc/main.go                                  39           16          267         2534
file.go                                           32            7          188         1720
utils.go                                          21            7          133          961
utils_test.go                                     17           78           13          891
language_test.go                                  22            0           79          661
xml.go                                            11           10           70          636
gocloc.go                                          9            4           62          448
json.go                                            6            4           47          402
json_test.go                                       4            1           33          312
option.go                                          5            5           29          266
examples/languages/main.go                         5            0           23          131
examples/files/main.go                             5            0           23          130
bspool.go                                          4            0           14           72
tools.go                                           2            2            4           27
-----------------------------------------------------------------------------------------------
TOTAL                                 16          285          155         2113        22000
-----------------------------------------------------------------------------------------------
List all supported languages and their extensions (same as gocloc):

$ ctoc --show-lang

List all supported LLM models and their corresponding encodings:

$ ctoc --show-encoding
text-embedding-ada-002 (cl100k_base)
code-search-babbage-code-001 (r50k_base)
gpt-3.5-turbo (cl100k_base)
text-embedding-3-large (cl100k_base)
gpt2 (gpt2)
babbage (r50k_base)
code-cushman-002 (p50k_base)
text-similarity-davinci-001 (r50k_base)
text-similarity-curie-001 (r50k_base)
text-similarity-ada-001 (r50k_base)
text-search-ada-doc-001 (r50k_base)
code-cushman-001 (p50k_base)
cushman-codex (p50k_base)
gpt-4 (cl100k_base)
text-davinci-002 (p50k_base)
text-davinci-001 (r50k_base)
davinci-codex (p50k_base)
text-search-babbage-doc-001 (r50k_base)
text-davinci-003 (p50k_base)
text-curie-001 (r50k_base)
text-babbage-001 (r50k_base)
text-ada-001 (r50k_base)
code-davinci-002 (p50k_base)
text-davinci-edit-001 (p50k_edit)
text-search-davinci-doc-001 (r50k_base)
code-search-ada-code-001 (r50k_base)
davinci (r50k_base)
curie (r50k_base)
code-davinci-001 (p50k_base)
gpt-4o (o200k_base)
ada (r50k_base)
code-davinci-edit-001 (p50k_edit)
text-embedding-3-small (cl100k_base)
text-similarity-babbage-001 (r50k_base)
text-search-curie-doc-001 (r50k_base)
The BPE dictionary for each encoding is automatically downloaded and cached on first use.
For additional information, please refer to tiktoken-go#cache
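If you want the dictionary cached in a known location (e.g., to pre-warm a CI image), tiktoken-go's documentation describes a TIKTOKEN_CACHE_DIR environment variable; a small sketch, assuming ctoc uses tiktoken-go's default loader:

package main

import (
	"fmt"
	"os"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	// Point tiktoken-go at a persistent directory so the BPE dictionary
	// is downloaded once and reused by later runs.
	os.Setenv("TIKTOKEN_CACHE_DIR", "/var/cache/tiktoken")
	if _, err := tiktoken.GetEncoding("cl100k_base"); err != nil {
		panic(err)
	}
	fmt.Println("cl100k_base dictionary downloaded and cached")
}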
You can specify the encoding with the --encoding flag: cl100k_base is the default encoding for gpt-4, gpt-4-turbo, and gpt-3.5-turbo; o200k_base is the default encoding for gpt-4o.
$ ctoc --encoding=cl100k_base .
------------------------------------------------------------------------------------------------
Language                            files        blank      comment         code       tokens
------------------------------------------------------------------------------------------------
Go                                     16          286          155         2113        21995
Markdown                                1           42            0          183         2585
YAML                                    1            0            0           40          237
Makefile                                1            7            0           19          149
------------------------------------------------------------------------------------------------
TOTAL                                  19          335          155         2355        24966
------------------------------------------------------------------------------------------------
$ ctoc --encoding=o200k_base .
------------------------------------------------------------------------------------------------
Language                            files        blank      comment         code       tokens
------------------------------------------------------------------------------------------------
Go                                     16          286          155         2113        22109
Markdown                                1           42            0          183         2596
YAML                                    1            0            0           40          237
Makefile                                1            7            0           19          150
------------------------------------------------------------------------------------------------
TOTAL                                  19          335          155         2355        25092
------------------------------------------------------------------------------------------------
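As the tables above show, the same codebase tokenizes to slightly different counts under cl100k_base and o200k_base. The model-to-encoding mapping printed by --show-encoding can also be resolved programmatically; a brief sketch using tiktoken-go's EncodingForModel helper:

package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	snippet := "func main() { fmt.Println(42) }"
	for _, model := range []string{"gpt-4", "gpt-4o"} {
		// Resolve each model's default encoding (cl100k_base / o200k_base).
		tke, err := tiktoken.EncodingForModel(model)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d tokens\n", model, len(tke.Encode(snippet, nil, nil)))
	}
}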
Benchmark: counting the Kubernetes codebase.
- CPU 2.6GHz 6-core Intel Core i7 / 32GB 2667MHz DDR4 / macOS 13.5.2
- ctoc 7473a0
- cl100k_base encoding (with the BPE dictionary cached)
➜ kubernetes git:(master) time ctoc .
------------------------------------------------------------------------------------------------
Language                            files        blank      comment         code       tokens
------------------------------------------------------------------------------------------------
Go                                  15172       503395       992193      3921496     53747627
JSON                                  430            2            0      1011821     10428573
YAML                                 1224          612         1464       156024       974131
Markdown                              461        24842          170        93141      3251948
BASH                                  318         6522        12788        33010       528217
Protocol Buffers                      130         5864        19379        12809       358110
Assembly                               50         2212          925         8447       129534
Plain Text                             31          203            0         6664        48218
Makefile                               58          594          940         2027        31548
Bourne Shell                            9          154          119          687         8055
sed                                     4            4           32          439         3138
Python                                  7          114          160          418         5435
Zsh                                     1           14            3          191         1872
PowerShell                              3           44           79          181         2496
C                                       5           42           55          140         1799
TOML                                    6           31          107          101         2049
HTML                                    2            0            0            2           21
Batch                                   1            2           17            2          170
------------------------------------------------------------------------------------------------
TOTAL                               17912       544651      1028431      5247600     69522941
------------------------------------------------------------------------------------------------

ctoc . 160.09s user 8.08s system 119% cpu 2:20.96 total
License: MIT