A Python lexical analyzer and parser written in C++23 that tokenizes Python source code and builds an Abstract Syntax Tree (AST).
Special thanks to DrkWithT for helping me refactor match and consume inside the Parser namespace to use metaprogramming. This way I don't have to write consume(T) || consume(T) for larger conditionals.
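As an illustration of the idea, here is a minimal sketch of how variadic match and consume helpers could be written with C++17 fold expressions. The TokenType values, the Token struct, and the Parser class layout are assumptions made for this example; the project's actual definitions live in token.hpp and parser.hpp and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical token kinds, for illustration only.
enum class TokenType { Plus, Minus, Star, Slash, Integer, EndOfFile };

struct Token {
    TokenType type;
    std::string text;
};

class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {
        assert(!tokens_.empty() && "token stream must end with EndOfFile");
    }

    // True if the current token is any of the given kinds; does not advance.
    template <typename... Ts>
    bool match(Ts... kinds) const {
        static_assert(sizeof...(Ts) > 0, "at least one token kind is required");
        static_assert((std::is_same_v<Ts, TokenType> && ...), "arguments must be TokenType");
        return ((peek().type == kinds) || ...);  // fold over ||
    }

    // Advances past the current token if it is any of the given kinds.
    template <typename... Ts>
    bool consume(Ts... kinds) {
        if (!match(kinds...)) return false;
        if (pos_ + 1 < tokens_.size()) ++pos_;  // never step past the final token
        return true;
    }

    const Token& peek() const { return tokens_[pos_]; }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```

With helpers of this shape, a condition such as consume(TokenType::Plus) || consume(TokenType::Minus) collapses into the single call consume(TokenType::Plus, TokenType::Minus).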
- Lexical Analysis: Tokenizes Python source code into meaningful tokens
- Syntax Parsing: Builds an AST using recursive descent parsing
- Python Language Support:
  - Function definitions (def)
  - Class definitions (class)
  - Control flow (if, elif, else, while, for)
  - Pattern matching (match/case)
  - Binary operators (+, -, *, /, //, %, **)
  - Assignment operators (=, +=, -=, *=, /=)
  - Literals (integers, floats, strings)
  - Return statements
- CMake 3.10 or higher
- C++23 compatible compiler (GCC 11+, Clang 14+)
cmake -B build
cmake --build build
The executable will be located at build/bin/lexical_analyzer.
./build/bin/lexical_analyzer <python_file>
./build/bin/lexical_analyzer example.py
This will:
- Tokenize the input Python file
- Parse the tokens into an AST
- Display the AST structure
lexical/
├── src/
│ ├── lexical.hpp # Lexer interface
│ ├── lexical.cpp # Lexer implementation
│ ├── parser.hpp # Parser interface
│ ├── parser.cpp # Parser implementation
│ ├── ast.hpp # AST node definitions
│ └── token.hpp # Token type definitions
├── main.cpp # Entry point
├── CMakeLists.txt # Build configuration
└── example.py # Example Python file
The parser generates an AST with the following node types:
- PROGRAM - Root node
- FUNCTION_DEF - Function definitions
- CLASS_DEF - Class definitions
- IF_STMT, ELIF_STMT, ELSE_STMT - Conditional statements
- WHILE_STMT, FOR_STMT - Loop statements
- MATCH_STMT, CASE_STMT - Pattern matching
- ASSIGNMENT - Variable assignments
- BINARY_OP - Binary operations
- RETURN_STMT - Return statements
- IDENTIFIER - Variable/function names
- INTEGER_LITERAL, FLOAT_LITERAL, STRING_LITERAL - Literals
For the following Python code:
def greet(name):
    message = "Hello, " + name
    return message
The parser generates:
Node: PROGRAM
  Node: FUNCTION_DEF (value: "greet")
    Node: PARAMETER_LIST
      Node: PARAMETER (value: "name")
    Node: BLOCK
      Node: ASSIGNMENT (value: "=")
        Node: IDENTIFIER (value: "message")
        Node: BINARY_OP (value: "+")
          Node: STRING_LITERAL (value: "Hello, ")
          Node: IDENTIFIER (value: "name")
      Node: RETURN_STMT
        Node: IDENTIFIER (value: "message")
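A dump of this shape could be produced by a small recursive printer. The sketch below is not the project's code: the Node struct, make_node, and print_ast are hypothetical names, node types are stored as plain strings for brevity, and the two-space indent per tree level is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node layout; the real definitions live in ast.hpp.
struct Node {
    std::string type;   // e.g. "BINARY_OP"
    std::string value;  // empty when the node carries no value
    std::vector<std::unique_ptr<Node>> children;
};

// Convenience factory for building small trees by hand.
std::unique_ptr<Node> make_node(std::string type, std::string value = "") {
    auto node = std::make_unique<Node>();
    node->type = std::move(type);
    node->value = std::move(value);
    return node;
}

// Writes one line per node, indenting two spaces per tree level.
void print_ast(const Node& node, std::ostream& out, int depth = 0) {
    assert(depth >= 0);
    out << std::string(static_cast<std::size_t>(depth) * 2, ' ') << "Node: " << node.type;
    if (!node.value.empty()) out << " (value: \"" << node.value << "\")";
    out << '\n';
    for (const auto& child : node.children) print_ast(*child, out, depth + 1);
}
```

Building a RETURN_STMT node with a single IDENTIFIER child and passing it to print_ast with std::cout writes two lines, the second indented by two spaces.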
The parser uses recursive descent with the following precedence hierarchy:
- parse_statement() - Statement dispatcher
- parse_assignment() - Assignment expressions
- parse_operator() - Binary operators
- parse_expression() - Primary expressions
- Specialized parsers for control structures
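To make the layering concrete, here is a minimal sketch of how the lower three levels could fit together. The function names mirror the list above, but the bodies are guesses rather than the project's implementation: tokens are plain strings, the result is an S-expression string instead of AST nodes, and binary operators are handled with precedence climbing, with ** treated as right-associative as in Python.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class ExprParser {
public:
    explicit ExprParser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    // assignment := IDENT '=' assignment | operator-expression
    std::string parse_assignment() {
        if (pos_ + 1 < tokens_.size() && tokens_[pos_ + 1] == "=") {
            std::string target = tokens_[pos_];
            pos_ += 2;
            return "(= " + target + " " + parse_assignment() + ")";
        }
        return parse_operator(1);
    }

private:
    // Binding power of a binary operator; 0 if the token is not one.
    static int precedence(const std::string& op) {
        if (op == "+" || op == "-") return 1;
        if (op == "*" || op == "/" || op == "//" || op == "%") return 2;
        if (op == "**") return 3;
        return 0;
    }

    // Precedence climbing: handles operators binding at least as tightly as min_prec.
    std::string parse_operator(int min_prec) {
        assert(min_prec >= 1);
        std::string left = parse_expression();
        while (pos_ < tokens_.size()) {
            const std::string op = tokens_[pos_];
            const int prec = precedence(op);
            if (prec == 0 || prec < min_prec) break;
            ++pos_;
            // '**' is right-associative; the other operators are left-associative.
            const int next_min = (op == "**") ? prec : prec + 1;
            std::string right = parse_operator(next_min);
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    // primary := literal | identifier | '(' operator-expression ')'
    std::string parse_expression() {
        if (pos_ >= tokens_.size()) throw std::runtime_error("unexpected end of input");
        std::string tok = tokens_[pos_++];
        if (tok == "(") {
            std::string inner = parse_operator(1);
            if (pos_ >= tokens_.size() || tokens_[pos_] != ")") throw std::runtime_error("expected ')'");
            ++pos_;
            return inner;
        }
        if (tok == ")" || tok == "=" || precedence(tok) != 0) {
            throw std::runtime_error("unexpected token: " + tok);
        }
        return tok;
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

Each level only calls the level below it (or itself), which is what lets lower levels bind more tightly: given the tokens x = 1 + 2 * 3, the multiplication is grouped first, then the addition, and the assignment wraps the whole expression.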
The lexer recognizes:
- Keywords: def, class, if, while, for, match, case, return, etc.
- Operators: +, -, *, /, //, %, **, =, +=, etc.
- Literals: integers, floats, strings
- Delimiters: (), [], {}, :, ,
- Indentation: INDENT, DEDENT, NEWLINE
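The indentation tokens are the least obvious part of lexing Python, so here is a rough sketch of the usual stack-based approach. It is an assumption about how such tokens can be produced, not a description of lexical.cpp: layout_tokens is a hypothetical function that returns only the layout tokens, assumes indentation uses spaces, and ignores tabs, comments, and line continuations.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Returns only the layout tokens (INDENT, DEDENT, NEWLINE) for a source string.
std::vector<std::string> layout_tokens(const std::string& source) {
    std::vector<std::string> out;
    std::vector<std::size_t> indents{0};  // stack of active indentation widths
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t width = line.find_first_not_of(' ');
        if (width == std::string::npos) continue;  // blank line: no layout tokens
        assert(!indents.empty());
        if (width > indents.back()) {
            // Deeper than the enclosing block: open a new one.
            indents.push_back(width);
            out.push_back("INDENT");
        } else {
            // Shallower: close blocks until the widths line up again.
            while (width < indents.back()) {
                indents.pop_back();
                out.push_back("DEDENT");
            }
            if (width != indents.back()) throw std::runtime_error("inconsistent dedent");
        }
        out.push_back("NEWLINE");  // end of this logical line
    }
    // Close any blocks still open at end of input.
    while (indents.size() > 1) {
        indents.pop_back();
        out.push_back("DEDENT");
    }
    return out;
}
```

The stack holds one entry per open block, so a single line that drops several levels at once yields several DEDENT tokens in a row.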
This project is provided as-is for educational purposes.
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.