This project is a Lexical Analyzer for the C Programming Language.
It processes C source code (the input must already have its comments removed) and converts it into a structured sequence of tokens, capturing each token's type, value, line number, and column number.
The analyzer also generates a detailed log file for every analyzed source file, saved automatically inside a `/log` directory.
The project is modular and documented with Doxygen for easy understanding and maintainability. You can find the full HTML documentation here.
To streamline the build process, the project includes a `compile.sh` script. This script automates compilation and generates the Lexical Analyzer executable: it compiles all the `.c` files in the `src/` directory and places the binary output in the `bin/` directory.
To use the compile script:
- Open a terminal.
- Navigate to the project directory.
- Run the following command:

```sh
./compile.sh
```

- Ensure that you have `gcc` installed on your system.
- If you encounter permission errors, you may need to make the `compile.sh` script executable first:

```sh
chmod +x compile.sh
```
To run LexiC:

```sh
./bin/lexer <source-code-file>

# Example
./bin/lexer example_code.c
```
This module handles file chunk operations essential for lexical analysis.
It provides:
- Reading C source files (after removing comments) into manageable chunks.
- Writing processed tokens and logs into output files.
- Memory-managed file operations ensuring binary-safe reads and writes.
Container for storing a file chunk.
| Field | Type | Description |
|---|---|---|
| `buff` | `char *` | Buffer containing the file content |
| `chksz` | `size_t` | Size of the data in the buffer |
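From the fields above, the chunk container presumably looks something like this (a sketch; the real typedef lives in the project's headers):

```c
/* Sketch of fchnk_t as implied by the field table above. */
typedef struct {
    char  *buff;  /* buffer containing the file content */
    size_t chksz; /* size of the data in the buffer */
} fchnk_t;
```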
- `fchnk_t *fchnk_ctor()`: Constructor for file chunk objects.
- `void fchnk_dtor(fchnk_t *chnk)`: Destructor for releasing file chunk memory.
- `fchnk_t *fchnk_ptor(char *const buff, const size_t chksz)`: Initializes a file chunk with an existing buffer (ownership transferred).
- `bool fwrite_fchnk(const char *fname, const fchnk_t *chnk)`: Writes a file chunk's content to a file.
- `fchnk_t *get_fchnk(const char *fname)`: Reads the full content of a file into a file chunk object (binary-safe).
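A minimal usage sketch of this API, assuming the behavior described above (the header name `lexer_fchunk.h` is a guess):

```c
#include <stdio.h>
#include "lexer_fchunk.h" /* assumed header name */

int main(void)
{
    /* Read the whole file into a chunk (binary-safe). */
    fchnk_t *chnk = get_fchnk("example_code.c");
    if (chnk == NULL) {
        fprintf(stderr, "failed to read file\n");
        return 1;
    }

    /* Write the chunk's content back out to a copy. */
    if (!fwrite_fchnk("example_copy.c", chnk))
        fprintf(stderr, "failed to write copy\n");

    /* Release the chunk and its buffer. */
    fchnk_dtor(chnk);
    return 0;
}
```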
Token validation functions for lexical analysis.
Provides pattern-matching functions that identify:
- Language keywords (e.g., `if`, `while`)
- Operators and punctuation (e.g., `+`, `;`)
- Literals (numeric, string, character)
- Identifiers and preprocessor directives
Used during tokenization to classify raw lexemes into specific token types.
All functions are case-sensitive and follow standard C syntax rules.
Note: These are pure validation functions; they don't modify input or handle memory allocation.
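As an illustration of the style of these checks, here is a hedged sketch of a keyword validator; the real function names and keyword table in `lexer_validation.h` may differ:

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative only: checks a lexeme against a partial C keyword
   list without modifying the input or allocating memory. */
bool is_keyword_sketch(const char *lexeme)
{
    static const char *const keywords[] = {
        "if", "else", "while", "for", "return", "int", "char"
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return true; /* case-sensitive match, per standard C */
    return false;
}
```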
Lexical analyzer components for token processing.
Complete token handling system for source code analysis, consisting of:
- Token type classification
- Individual token representation
- Token collection management
Processing Pipeline:
- Classification categorizes lexemes.
- Structures store the results.
- Collections manage token sequences.
Includes: `lexer_validation.h`
Token type definitions and classification utilities.
Covers:
- Token categories: preprocessor, symbols, literals.
- Specific token types: keywords, operators, identifiers.
- Type conversion and identification utilities.
Token Category Enumeration.
| Enum Value | Description |
|---|---|
| `PRE_PROC` | Preprocessor directives (`#define`, `#include`) |
| `SYMBOLS` | Operators and punctuation (`+`, `;`) |
| `LITERAL` | String/character literals |
| `NFKI_LITERAL` | Numerical literals, floats, keywords, or identifiers |
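From the table, the category enum presumably looks something like this (a sketch: the value names come from the table, but their order and underlying values are assumptions):

```c
/* Sketch of tokcat_e as implied by the table above; the actual
   declaration in the project's headers may differ. */
typedef enum {
    PRE_PROC,     /* preprocessor directives */
    SYMBOLS,      /* operators and punctuation */
    LITERAL,      /* string/character literals */
    NFKI_LITERAL  /* numeric literals, floats, keywords, or identifiers */
} tokcat_e;
```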
Specific Token Type Enumeration.
| Enum Value | Description |
|---|---|
| `KEYWORD` | Reserved keywords (`int`, `if`, `return`) |
| `OPERATOR` | Operators (`+`, `-`, `*`, `/`, `&&`, `\|\|`) |
| `PUNCTUATION` | Punctuation characters (`;`, `,`, `()`, `{}`) |
| `NUMERIC_LITERAL` | Integer numbers (`123`, `456`) |
| `FLOATING_POINT_LITERAL` | Floating-point numbers (`3.14`) |
| `CHARACTER_LITERAL` | Character literals (`'a'`) |
| `STRING_LITERAL` | String literals (`"hello"`) |
| `INVALID_IDENTIFIER` | Malformed identifiers |
| `IDENTIFIER` | Valid identifiers (variable/function names) |
| `PRE_PROCESSOR_OPERATOR` | Preprocessor-related operators |
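Likewise, the specific token type enum might be declared along these lines (ordering and underlying values again assumed):

```c
/* Sketch of tok_e as implied by the table above. */
typedef enum {
    KEYWORD, OPERATOR, PUNCTUATION,
    NUMERIC_LITERAL, FLOATING_POINT_LITERAL,
    CHARACTER_LITERAL, STRING_LITERAL,
    INVALID_IDENTIFIER, IDENTIFIER,
    PRE_PROCESSOR_OPERATOR
} tok_e;
```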
- `const char *toktyp_rval(tok_e type)`: Returns a human-readable string for a token type.
- `tok_e get_toktyp(const char *value, tokcat_e type)`: Determines the specific token type from a token string and its category.
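For example, classifying a raw lexeme and printing its readable name could look like this (a sketch; the header name is assumed):

```c
#include <stdio.h>
#include "lexer_tokens.h" /* assumed header name */

int main(void)
{
    /* Classify a lexeme from the NFKI_LITERAL category, then
       print the human-readable name of the resulting type. */
    tok_e type = get_toktyp("while", NFKI_LITERAL);
    printf("'while' -> %s\n", toktyp_rval(type));
    return 0;
}
```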
Token instance representation and operations.
Defines:
- Container structure for tokens.
- Constructors, destructors, and utilities for individual tokens.
Container for a single token.
| Field | Type | Description |
|---|---|---|
| `val` | `char *` | Token string value |
| `typ` | `tok_e` | Token type |
| `ln` | `size_t` | Line number (1-indexed) |
| `col` | `size_t` | Column position (1-indexed) |
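The corresponding struct is presumably along these lines (a sketch based on the field table):

```c
/* Sketch of tok_t as implied by the field table above. */
typedef struct {
    char  *val; /* token string value */
    tok_e  typ; /* token type */
    size_t ln;  /* line number (1-indexed) */
    size_t col; /* column position (1-indexed) */
} tok_t;
```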
- `tok_t *tok_ctor()`: Allocates a new empty token.
- `tok_t **tok_nctor(size_t n)`: Allocates an array of `n` tokens.
- `tok_t *tok_ptor(char *value, tok_e type, size_t line, size_t col)`: Allocates and initializes a token.
- `void tok_dtor(tok_t *tok)`: Frees a token and its contents.
- `void printf_tok(const tok_t *tok)`: Prints token details to the console.
- `bool fwrite_tok(FILE *fp, const tok_t *tok)`: Writes token details to a file.
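A hedged lifecycle sketch using these functions (header name assumed; whether `tok_ptor` takes ownership of the string is also an assumption, implied by `tok_dtor` freeing the token's contents):

```c
#include <string.h>
#include "lexer_tokens.h" /* assumed header name */

int main(void)
{
    /* Duplicate the lexeme so the token owns its own string;
       tok_dtor is documented to free the token's contents. */
    char *val = strdup("count");
    if (val == NULL)
        return 1;

    /* Token for identifier "count" at line 3, column 9. */
    tok_t *tok = tok_ptor(val, IDENTIFIER, 3, 9);
    if (tok != NULL) {
        printf_tok(tok); /* print value, type, line, column */
        tok_dtor(tok);   /* frees the token and its string */
    }
    return 0;
}
```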
Collection of tokens and operations.
Manages:
- Token arrays
- Set metadata
- Bulk operations on groups of tokens
Container for multiple tokens.
| Field | Type | Description |
|---|---|---|
| `toks` | `tok_t **` | Dynamic array of token pointers |
| `toksz` | `size_t` | Number of tokens |
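Its layout is presumably along these lines (a sketch based on the field table):

```c
/* Sketch of tokset_t as implied by the field table above. */
typedef struct {
    tok_t **toks;  /* dynamic array of token pointers */
    size_t  toksz; /* number of tokens */
} tokset_t;
```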
- `tokset_t *tokset_ctor()`: Allocates and returns a new token set object.
- `tokset_t *tokset_ptor(const size_t toksz)`: Creates and initializes a token set with a specified number of tokens.
- `void tokset_dtor(tokset_t *set)`: Frees the memory associated with a token set.
- `size_t cnt_toktyp(const tokset_t *const set, const tok_e type)`: Counts the tokens of a specific type in a token set.
- `void printf_tokset(const tokset_t *const set)`: Prints the contents of a token set to standard output.
- `bool fwrite_tokset(FILE *fp, const tokset_t *const set)`: Writes the contents of a token set to a file.
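For instance, counting one token type in a set and dumping the set to a file could look like this (a sketch; the set comes from `toknz`, described in the next section, and the header names are assumed):

```c
#include <stdio.h>
#include "lexer_tokens.h"   /* assumed header name */
#include "lexer_tokenize.h"

int main(void)
{
    /* Tokenize a single declaration into a set. */
    tokset_t *set = toknz("int count = 0;");
    if (set == NULL)
        return 1;

    /* Count one specific token type in the set. */
    size_t ids = cnt_toktyp(set, IDENTIFIER);
    printf("identifiers: %zu\n", ids);

    /* Dump the whole set to a file. */
    FILE *fp = fopen("tokens.log", "w");
    if (fp != NULL) {
        fwrite_tokset(fp, set);
        fclose(fp);
    }
    tokset_dtor(set);
    return 0;
}
```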
The `lexer_tokenize.h` header implements the core functions of the tokenization process: converting source code into token streams, counting tokens, and segmenting code into lexical units. It provides the primary tokenization entry points as well as utilities for managing token sets.
- `size_t tokcnt(const char *const line)`: Counts the number of tokens in a given string (or file content).
- `void toknz_segtoset(tokset_t *const set, const size_t token_index, const char *const line, const size_t start, const size_t end, const size_t line_no, const tokcat_e category, const size_t column)`: Tokenizes a segment of a line and stores the resulting token in the token set.
- `tokset_t *toknz(const char *const line)`: Tokenizes a line (or multiple lines) of code into a set of tokens.
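Putting the pieces together, an end-to-end sketch of the pipeline (read a comment-free source file, tokenize it, print the tokens) might look like this; the non-documented header names are assumed, as is NUL-termination of the chunk buffer:

```c
#include <stdio.h>
#include "lexer_fchunk.h"   /* assumed header name */
#include "lexer_tokens.h"   /* assumed header name */
#include "lexer_tokenize.h"

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <source-code-file>\n", argv[0]);
        return 1;
    }

    /* 1. Read the (comment-free) source into a chunk. */
    fchnk_t *chnk = get_fchnk(argv[1]);
    if (chnk == NULL)
        return 1;

    /* 2. Tokenize the chunk's contents; this assumes the
       buffer is NUL-terminated, which may not hold for a
       strictly binary-safe read. */
    tokset_t *set = toknz(chnk->buff);
    if (set != NULL) {
        /* 3. Print every token (value, type, line, column). */
        printf_tokset(set);
        tokset_dtor(set);
    }
    fchnk_dtor(chnk);
    return 0;
}
```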