- A **low-level API** for streaming lexer-generated tokens.
- A **high-level API** that structures tokens into parsed log events for easier consumption.

Check the [User's Guide](#users-guide) section for more details.

As the library prioritizes log parsing, the regex engine is not part of the default API. To access
regex-specific functionality, enable the `regex-engine` feature in the Cargo configuration. This
feature provides APIs for:
- Converting [regex_syntax::ast::Ast][regex-syntax-ast-Ast] into an NFA.
- Merging multiple NFAs into a single DFA.
- Simulating a DFA with character streams or strings.
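
The sketch below shows how these three pieces might fit together. It is a minimal, hedged
illustration: the `log_surgeon` type and method names (`Nfa::from_ast`, `Dfa::from_nfas`,
`simulate`) are assumptions made for the example, not necessarily the crate's actual
`regex-engine` API.
```rust
// Hedged sketch: build an NFA from a regex AST, merge it into a DFA, and
// simulate the DFA. The `log_surgeon` names below are assumed, not the real API.
use regex_syntax::ast::parse::Parser;

fn main() {
    // Parse a regex pattern into a `regex_syntax::ast::Ast`.
    let ast = Parser::new().parse(r"\d{4}-\d{2}-\d{2}").unwrap();

    // Convert the AST into an NFA (assumed constructor).
    let nfa = log_surgeon::nfa::Nfa::from_ast(&ast).unwrap();

    // Merge one or more NFAs into a single DFA (assumed API).
    let dfa = log_surgeon::dfa::Dfa::from_nfas(vec![nfa]).unwrap();

    // Simulate the DFA against an input string (assumed API).
    assert!(dfa.simulate("2024-01-01"));
}
```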

## User's Guide

log-surgeon is a Rust library for high-performance parsing of unstructured text logs. It is
shipped as a Rust crate and can be included in your Rust project by adding the following line to
your `Cargo.toml` file:
```toml
[dependencies]
log-surgeon = { git = "https://github.com/Toplogic-Inc/log-surgeon-rust", branch = "main" }
```

### Architecture Overview
![log-surgeon-arch-overview](docs/src/overall-arch-diagram.png)

### User-defined Schema Config

log-surgeon allows users to customize their own log parser using a schema; an illustrative sketch
of what a schema config might look like follows. For detailed instructions, refer to the
[Schema Documentation](docs/Schema.md).
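
Purely as an illustration, a schema config might resemble the following. The keys and syntax shown
here are assumptions, not the actual format; the authoritative references are the
[Schema Documentation](docs/Schema.md) and the real config in
[examples/schema.yaml](examples/schema.yaml).
```yaml
# Hypothetical schema sketch (assumed keys, not the real format):
# one or more timestamp patterns, plus named variable patterns.
timestamp:
  - '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'
variables:
  log_level: 'INFO|WARN|ERROR|FATAL'
  ipv4: '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
```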

### Lexer

log-surgeon provides its lexer as a low-level API, which parses log events into a stream of tokens.
Tokens are classified into the following types (a conceptual sketch of the token model follows the
tips below):

- **Timestamp**: A token that matches a defined timestamp pattern.
- **Variable**: A token that matches a defined variable pattern.
- **StaticText**: A token that matches neither a timestamp nor a variable pattern.
- **StaticTextWithNewline**: A variant of StaticText that ends with a newline character (`'\n'`).

**Tips**:
- Each token holds a byte buffer as its value.
- A timestamp token includes an ID that corresponds to the regex pattern defined in the schema
  config.
- A variable token includes an ID that maps to the variable name and its associated regex pattern
  in the schema config.
- Each token also retains source information, indicating the line in the input from which the
  token was extracted.
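
Conceptually, the token model described above might look like the following Rust sketch. This is
an illustration of the bullet points only; the names and fields are assumptions, not the crate's
actual token definition.
```rust
// Conceptual sketch of the token model (assumed names, not the real types).

/// The token categories described above.
enum TokenType {
    Timestamp { pattern_id: usize }, // ID of the matched timestamp pattern
    Variable { variable_id: usize }, // ID mapping to the variable's name/pattern
    StaticText,
    StaticTextWithNewline,
}

/// A token: its category, its value, and where it came from.
struct Token {
    token_type: TokenType,
    value: Vec<u8>, // byte buffer holding the token's value
    line: usize,    // line in the input the token was extracted from
}
```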

The lexer also allows users to define their own **custom input type**. To integrate a custom input
log stream, it must implement the [log_surgeon::lexer::LexerStream](src/lexer/lexer_stream.rs)
trait, which consumes the stream byte by byte. By default, we provide
[log_surgeon::Lexer::BufferedFileStream](src/lexer/streams.rs) to read a log file from the file
system.
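
As a rough illustration, a custom in-memory input source might be wired up as follows. The trait's
actual method name and signature are not reproduced here, so the `next_byte` method below is an
assumption; consult [src/lexer/lexer_stream.rs](src/lexer/lexer_stream.rs) for the real definition.
```rust
// Hedged sketch of a custom `LexerStream` (assumed method signature).
use log_surgeon::lexer::LexerStream;

/// Serves bytes from an in-memory buffer, e.g., logs received over the network.
struct InMemoryStream {
    data: Vec<u8>,
    pos: usize,
}

impl LexerStream for InMemoryStream {
    // Assumed signature: yield the next byte, or `None` at end of input.
    fn next_byte(&mut self) -> Option<u8> {
        let byte = self.data.get(self.pos).copied();
        if byte.is_some() {
            self.pos += 1;
        }
        byte
    }
}
```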

**Example**:

A simple example program is provided in [examples/lexer](examples/lexer/src/main.rs) to parse a
given log file and print all the tokens. You can use the following commands to run the program and
parse the sample logs:
```shell
cd examples/lexer
cargo run -- ../schema.yaml ../logs/hive-24h.log
# If you want to try some other inputs, run:
# cargo run -- <SCHEMA_FILE_PATH> <INPUT_FILE_PATH>
```

### Log Parser

log-surgeon provides a log parser as a high-level API. The log parser consumes the tokens produced
by the underlying lexer and constructs log events using the following parsing rules:
```
<log-event> ::= <timestamp> <msg-token-sequence> <end-of-line>

<msg-token-sequence> ::= <msg-token> <msg-token-sequence>
                       | ε  (* empty sequence *)

<msg-token> ::= <variable>
              | <static-text>

<timestamp> ::= TOKEN(Timestamp)

<variable> ::= TOKEN(Variable)

<static-text> ::= TOKEN(StaticText)
                | TOKEN(StaticTextWithNewline)

<end-of-line> ::= TOKEN(StaticTextWithNewline)
```
NOTE: In practice, the first log event may be missing its timestamp and the last log event may be
missing its end-of-line token due to file or stream truncation.
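
In code, driving the high-level API might look like the sketch below. The `LogParser` type, its
constructor arguments, and the `next_log_event` method are assumed names for illustration; the
crate's real usage is demonstrated in
[examples/simple-parser](examples/simple-parser/src/main.rs).
```rust
// Hedged sketch (assumed names, not the real API): iterate over log events.
fn main() {
    let mut parser = log_surgeon::parser::LogParser::new(
        "../schema.yaml",       // schema config with timestamp/variable patterns
        "../logs/hive-24h.log", // input log file
    )
    .unwrap();

    // Each iteration yields one <log-event> from the grammar above: a
    // timestamp, a sequence of variable/static-text tokens, and an end-of-line.
    while let Some(event) = parser.next_log_event().unwrap() {
        println!("{:?}", event);
    }
}
```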

**Example**:

A simple example program is provided in [examples/simple-parser](examples/simple-parser/src/main.rs)
to parse a given log file and print all the constructed log events. You can use the following
commands to run the program and parse the sample logs:
```shell
cd examples/simple-parser
cargo run -- ../schema.yaml ../logs/hive-24h.log
# If you want to try some other inputs, run:
# cargo run -- <SCHEMA_FILE_PATH> <INPUT_FILE_PATH>
```

## Experimental Results

We conducted tests and benchmarks on both the lexer and log parser APIs using real-world
unstructured log datasets. This section summarizes the results of our experiments.

| Dataset                               | Total Log Size (GB) | # Tokens   | Lexer Execution Time (s, real) | Lexer Throughput (tokens/s) | Parser Execution Time (s, real) | Parser Throughput (MB/s) |
|---------------------------------------|---------------------|------------|--------------------------------|-----------------------------|---------------------------------|--------------------------|
| [hive-24hr][log-hive]                 | 1.99                | 62334502   | 9.726                          | 6409058.40                  | 10.125                          | 201.08                   |
| [openstack-24hr][log-open-stack]      | 33.00               | 878471152  | 178.398                        | 4924220.85                  | 198.826                         | 169.94                   |
| [hadoop-cluster1-worker1][log-hadoop] | 84.77               | 2982800187 | 442.400                        | 6742010.28                  | 492.523                         | 176.25                   |

**NOTE**:
- The log datasets are hyperlinked in the table above for reference.
- Execution environment:
  - OS: Ubuntu 22.04.3 LTS on Windows 10.0.22631 x86_64
  - Kernel: 5.15.167.4-microsoft-standard-WSL2
  - CPU: Intel i9-14900K (32) @ 3.187GHz
  - Memory: 48173MiB
- The schema config used for these experiments is available [here](examples/schema.yaml).
- The experiments were executed using the example program [here](examples/benchmark/src/main.rs).

Given the time constraints, the team is satisfied with the current experimental results. While we
have further optimizations planned (see
[Lessons Learned and Concluding Remarks](#lessons-learned-and-concluding-remarks)), the current
throughput is already within a reasonable range. Notably, performance should remain similar even
with a more complex schema file: the use of delimiters and a deterministic finite automaton (DFA)
keeps time complexity almost linear, bounded by the number of bytes in the input log stream.

## Reproducibility Guide

### Testing
Let's start with the unit and integration tests, which you can run with the default cargo test
command:
```shell
cargo test
```
We also have GitHub CI enabled to automate the testing flow; see [here][project-gh-action] for our
recent workflow activity. Note that we use [cargo-nextest][nextest] for our internal development
and CI workflows because of its cleaner user interface. If you already have `cargo-nextest`
installed, you can run:
```shell
cargo nextest run --all-features
```

### Example Programs
We have provided two example programs, [lexer](examples/lexer) and
[simple-parser](examples/simple-parser), to demonstrate how to use log-surgeon. These programs
accept any valid schema file and log file as inputs, specified via the command line. For more
details, refer to the [User's Guide](#users-guide) section.

We have prepared a short [video demo][video-demo] showcasing log-surgeon in action. The demo uses
the simple-parser as an example. To reproduce the demo, run the following commands:
```shell
cd examples/simple-parser
cargo build --release
target/release/simple-parser ../schema.yaml ../logs/hive-24h.log
```
Note: The top-level structure of this project has changed slightly since the video was recorded;
however, running the commands above should produce the same results.

### Experimental Results
The experimental result statistics were measured using the example program
[benchmark](examples/benchmark). To reproduce these experiments or run them on other datasets, run
the following commands:
```shell
# Download the entire dataset into `$DATASET_DIR`
# Place the schema config for the experiments at `$SCHEMA_CONFIG_PATH`
cd examples/benchmark
cargo build --release
target/release/benchmark $SCHEMA_CONFIG_PATH $DATASET_DIR
```

## Contributions
1. **[Louis][github-siwei]**
    - Implemented the draft version of the AST-to-NFA conversion.
    - Implemented the conversion from one or more NFAs to a single DFA.

...

Both members contributed to the overall architecture, unit testing, integration testing, and
library finalization. Each member reviewed the other's implementation through GitHub pull requests.

## Lessons Learned and Concluding Remarks
This project provided us with an excellent opportunity to learn about the Rust programming language.
We gained hands-on experience with Rust's borrowing system, which helped us write safe and reliable
code.

...

Future work:
- Improve DFA simulation performance.
- Implement [tagged-DFA][wiki-tagged-dfa] to support more powerful variable extraction.
- Optimize the lexer to emit tokens based on buffer views, reducing internal data copying.

[badge-apache]: https://img.shields.io/badge/license-APACHE-blue.svg
[badge-build-status]: https://github.com/Toplogic-Inc/log-surgeon-rust/workflows/CI/badge.svg
...
[github-zhihao]: https://github.com/LinZhihao-723
[hadoop-logs]: https://zenodo.org/records/7114847
[home-page]: https://github.com/Toplogic-Inc/log-surgeon-rust
[log-hadoop]: https://zenodo.org/records/7114847
[log-hive]: https://zenodo.org/records/7094921
[log-open-stack]: https://zenodo.org/records/7094972
[mongodb-logs]: https://zenodo.org/records/11075361
[nextest]: https://nexte.st/
[project-gh-action]: https://github.com/Toplogic-Inc/log-surgeon-rust/actions
[regex-syntax-ast-Ast]: https://docs.rs/regex-syntax/latest/regex_syntax/ast/enum.Ast.html
[wiki-dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton