このリポジトリは、Trie(Prefix Tree)を LOUDS(Level-Order Unary Degree Sequence) 形式に変換し、バイナリに保存してロードできるようにする C++ 実装です。
- Writer 側(
LOUDS,LOUDSWithTermId)- PrefixTree を LOUDS に変換 z
- LOUDS をバイナリファイルとして保存(シリアライズ)
- Reader 側(
LOUDSReader,LOUDSWithTermIdReader)- Writer が出力した LOUDS バイナリをロード
SuccinctBitVector(rank/select の前計算)で探索を高速化
termId版について:
commonPrefixSearchは termId を同時に返しません。必要な場合はgetTermId(nodeIndex)を別途呼び出します。
project/
src/
common/
bit_vector.hpp
succinct_bit_vector.hpp
prefix/
prefix_tree.hpp
prefix_tree.cpp
louds/
louds.hpp
louds.cpp
converter.hpp
converter.cpp
louds_reader.hpp
louds_reader.cpp
prefix_with_term_id/
prefix_tree_with_term_id.hpp
prefix_tree_with_term_id.cpp
louds_with_term_id/
louds_with_term_id.hpp
louds_with_term_id.cpp
converter_with_term_id.hpp
converter_with_term_id.cpp
louds_with_term_id_reader.hpp
louds_with_term_id_reader.cpp
tests/
test_prefix_tree.cpp
test_louds.cpp
test_louds_with_term_id.cpp
test_louds_reader.cpp
test_louds_with_term_id_reader.cpp
- Linux / WSL / macOS を想定
- C++17 対応の
g++
Ubuntu / WSL の例:
sudo apt update
sudo apt install -y g++LOUDS は Trie の木構造を BFS(幅優先)順にビット列へエンコードします。
- 各ノードの子の数だけ
1を並べ、最後に0を 1 つ置く - 追加情報として:
labels[]:1に対応するエッジのラベル(文字)isLeaf: 単語終端ノードかどうかを示すビット列
探索のために以下を使います:
rank1(i):LBS[0..i]の 1 の数rank0(i):LBS[0..i]の 0 の数select1(k): k 番目の 1 の位置select0(k): k 番目の 0 の位置
SuccinctBitVector は Kotlin 実装と同様に、ブロック(256bit / 8bit)ごとの累積を前計算して高速化します。
本リポジトリでは、jawiki の「all titles (ns0)」gzip ファイルを入力として LOUDS を構築する想定です。
以下は jawiki_build 実行時に計測した統計および時間です。
- 総単語数(タイトル数): 2,412,128
- 総文字数: 20,334,605
- 入力 gzip サイズ: 15,417,082 bytes(約 15.42 MB / 14.70 MiB)
- UTF-8 展開後合計(バイト総量): 53,364,837 bytes(約 53.36 MB / 50.89 MiB)
- total: 6.03564 sec
- convert LOUDS: 0.944134 sec
- convert LOUDS with termId: 0.956271 sec
- Elapsed (wall clock): 0:11.02
- User time: 9.20 sec, System time: 1.80 sec
- Max RSS: 4,200,128 KB(約 4.01 GiB)
補足:
metrics.jsonの秒数は「処理の一部(内部区間)」の計測で、/usr/bin/timeはプロセス全体(I/O や初期化等を含む)を測定します。そのため、両者の値が一致しないのは自然です。
以下は project/ ディレクトリ直下で実行する想定です。
(-I src により src/... を include できるようにしています)
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/prefix/prefix_tree.cpp \
tests/test_prefix_tree.cpp \
-I src -o test_prefix_tree
./test_prefix_treecommonPrefixSearch と、Writer 側のバイナリ round-trip(saveToFile -> loadFromFile)を検証します。
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds/louds.cpp \
src/louds/converter.cpp \
src/prefix/prefix_tree.cpp \
tests/test_louds.cpp \
-I src -o test_louds
./test_loudsg++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds_with_term_id/louds_with_term_id.cpp \
src/louds_with_term_id/converter_with_term_id.cpp \
src/prefix_with_term_id/prefix_tree_with_term_id.cpp \
tests/test_louds_with_term_id.cpp \
-I src -o test_louds_with_term_id
./test_louds_with_term_idWriter が作った .bin を Reader がロードできること、commonPrefixSearch 等が動くことを検証します。
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds/louds_reader.cpp \
src/louds/louds.cpp \
src/louds/converter.cpp \
src/prefix/prefix_tree.cpp \
tests/test_louds_reader.cpp \
-I src -o test_louds_reader
./test_louds_readerWriter が作った .bin を Reader がロードできること、getTermId(nodeIndex) が動くことも含めて検証します。
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds_with_term_id/louds_with_term_id_reader.cpp \
src/louds_with_term_id/louds_with_term_id.cpp \
src/louds_with_term_id/converter_with_term_id.cpp \
src/prefix_with_term_id/prefix_tree_with_term_id.cpp \
tests/test_louds_with_term_id_reader.cpp \
-I src -o test_louds_with_term_id_reader
./test_louds_with_term_id_reader本プロジェクトは MIT License です。LICENSE ファイルを参照してください。
This repository provides a C++ implementation to convert a trie (prefix tree) into LOUDS (Level-Order Unary Degree Sequence), serialize it to a binary file, and load it back for fast query operations.
- Writer side (
LOUDS,LOUDSWithTermId)- Convert PrefixTree into LOUDS
- Save LOUDS as a binary file (serialization)
- Reader side (
LOUDSReader,LOUDSWithTermIdReader)- Load LOUDS binaries generated by the writer
- Use
SuccinctBitVector(precomputed rank/select) to accelerate traversal
About the
termIdvariant:
commonPrefixSearchdoes not return termIds together. If needed, callgetTermId(nodeIndex)separately.
project/
src/
common/
bit_vector.hpp
succinct_bit_vector.hpp
prefix/
prefix_tree.hpp
prefix_tree.cpp
louds/
louds.hpp
louds.cpp
converter.hpp
converter.cpp
louds_reader.hpp
louds_reader.cpp
prefix_with_term_id/
prefix_tree_with_term_id.hpp
prefix_tree_with_term_id.cpp
louds_with_term_id/
louds_with_term_id.hpp
louds_with_term_id.cpp
converter_with_term_id.hpp
converter_with_term_id.cpp
louds_with_term_id_reader.hpp
louds_with_term_id_reader.cpp
tests/
test_prefix_tree.cpp
test_louds.cpp
test_louds_with_term_id.cpp
test_louds_reader.cpp
test_louds_with_term_id_reader.cpp
- Linux / WSL / macOS recommended
g++with C++17 support
Example on Ubuntu / WSL:
sudo apt update
sudo apt install -y g++LOUDS encodes a trie shape into a bit sequence in BFS order:
- For each node, write
1for each child, followed by one terminating0. - Additional arrays:
labels[]: edge labels aligned with1bitsisLeaf: marks terminal nodes
Queries rely on:
rank1(i): number of1s inLBS[0..i]rank0(i): number of0s inLBS[0..i]select1(k): position of k-th1select0(k): position of k-th0
SuccinctBitVector precomputes block summaries (256-bit big blocks, 8-bit small blocks) to accelerate rank/select similar to the Kotlin implementation.
This repo commonly uses a jawiki “all titles (ns0)” gzip file as input.
Below are dataset stats and timing results recorded during jawiki_build.
- Total titles (word_count): 2,412,128
- Total characters (char_count): 20,334,605
- Input gzip size: 15,417,082 bytes (≈ 15.42 MB / 14.70 MiB)
- Total UTF-8 bytes after decoding: 53,364,837 bytes (≈ 53.36 MB / 50.89 MiB)
- total: 6.03564 sec
- convert LOUDS: 0.944134 sec
- convert LOUDS with termId: 0.956271 sec
- Elapsed (wall clock): 0:11.02
- User time: 9.20 sec, System time: 1.80 sec
- Max RSS: 4,200,128 KB (≈ 4.01 GiB)
Note:
metrics.jsonreflects internal section timings, while/usr/bin/timemeasures the full process (including I/O and initialization). Therefore it is expected that these numbers do not match exactly.
All commands assume you are in the project/ directory.
-I src is used so headers under src/... can be included.
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/prefix/prefix_tree.cpp \
tests/test_prefix_tree.cpp \
-I src -o test_prefix_tree
./test_prefix_treeThis validates commonPrefixSearch and writer-side round-trip (saveToFile -> loadFromFile).
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds/louds.cpp \
src/louds/converter.cpp \
src/prefix/prefix_tree.cpp \
tests/test_louds.cpp \
-I src -o test_louds
./test_loudsg++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds_with_term_id/louds_with_term_id.cpp \
src/louds_with_term_id/converter_with_term_id.cpp \
src/prefix_with_term_id/prefix_tree_with_term_id.cpp \
tests/test_louds_with_term_id.cpp \
-I src -o test_louds_with_term_id
./test_louds_with_term_idThis validates that LOUDSReader can load the .bin produced by the writer and run queries like commonPrefixSearch.
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds/louds_reader.cpp \
src/louds/louds.cpp \
src/louds/converter.cpp \
src/prefix/prefix_tree.cpp \
tests/test_louds_reader.cpp \
-I src -o test_louds_reader
./test_louds_readerThis validates that LOUDSWithTermIdReader can load the .bin and that getTermId(nodeIndex) works.
g++ -std=c++17 -O2 -Wall -Wextra -pedantic \
src/louds_with_term_id/louds_with_term_id_reader.cpp \
src/louds_with_term_id/louds_with_term_id.cpp \
src/louds_with_term_id/converter_with_term_id.cpp \
src/prefix_with_term_id/prefix_tree_with_term_id.cpp \
tests/test_louds_with_term_id_reader.cpp \
-I src -o test_louds_with_term_id_reader
./test_louds_with_term_id_readerThis project is licensed under the MIT License. See LICENSE for details.