Completed Tasks for Project

Completed Tasks

Revise the Index for the Entire Site

Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
Rewrite each of the introductory paragraphs so that they focus on the concept of a "Prosegrammer".
Rewrite the course overview so that it connects to the concept of a "Prosegrammer" and a course on document engineering.
Create a new Python source code example that connects to the topic of document engineering and does not require any dependencies.
Delete any content related to performance engineering, as this is not the focus of a course on document engineering.
Give the correct link to the Discord server, which is specified in the _quarto.yml file.

Create Slides for Week One in `slides/weekone/index.qmd`

Create Slides for Week Four in `slides/weekfour/index.qmd`

Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
The new slides that I want you to create should be in the file slides/weekfour/index.qmd. The purpose of these slides is to introduce the basics of Markdown and Quarto. Here are some basic features of Markdown and Quarto that you should include in these slides:
- Structure of a Markdown document
- Headings and subheadings
- Paragraphs and line breaks
- Bold and italic text
- Lists (ordered and unordered)
- Links and images
- Code blocks and inline code
- Tables
- Blockquotes
- Horizontal rules
- Mathematical expressions
You can look at the other slide decks that I have already prepared and see how I am currently using Markdown and Quarto in my slides:
- slides/weekone/index.qmd
- slides/weektwo/index.qmd
- slides/weekthree/index.qmd
I want students to be able to understand all of these examples and know how to write them on their own for their own documentation.
When you create all of these examples, make sure that they connect to the concepts of document engineering and prosegramming, as I have already defined in the contents of this GitHub repository. For instance, you can connect the need to write Markdown to the Markdown files that students write to document their own document engineering projects, where they are building tools that input, process, and analyze text documents in JSON, YAML, Markdown, and plaintext. You can add slides that make suggestions on how they could use the features of Markdown inside of the README.md file for the tool. You can also add slides that explain how they could create a website for their tool using Quarto.
You must show the actual source code of the basic feature (e.g., bold and italic text) and then show how that actually renders. This means that the student should be able to look at the slide and see both the source code and the rendered output.
Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
Make sure that all the content has concrete examples that make points clear to beginners.

Create Slides for Week Six in `slides/weeksix/index.qmd`

Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
The new slides that I want you to create should be in the file slides/weeksix/index.qmd. The purpose of these slides is to introduce the basics of software testing. I previously created these slides for a course on Algorithm Analysis. However, I would like you to customize all of this content for document engineering.
You can look at the other slide decks that I have already prepared and see how I am currently using Markdown and Quarto in my slides:
- slides/weekone/index.qmd
- slides/weektwo/index.qmd
- slides/weekthree/index.qmd
Please note that I am currently working on the slides in slides/weekfour/index.qmd that introduce Markdown and Quarto. So, please don't use those as an example because they are not yet complete.
I want students to understand the basics of software testing so that when they are building their document engineering tools they can also test them.
Please customize all the examples in the slides so that they connect to document engineering and are accessible to beginners. However, you should keep the simple DaysOfTheWeek source code because that one is easy to understand and accessible to beginners.
If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. You should illustrate how to write test cases for a function like this one.
Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
Make sure that all the content has concrete examples that make points clear to beginners.
I added some template content to start off the slides. You should keep the first and last slides, but make sure to customize them for a course on document engineering and the module that is about software testing.

Create Slides for Week Eight in `slides/weekeight/index.qmd`

Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
The new slides that I want you to create should be in the file slides/weekeight/index.qmd. The purpose of these slides is to introduce the basics of Python data containers for document engineering. I previously created these slides for a course on Algorithm Analysis. However, I would like you to customize all of this content for document engineering.
You can look at the other slide decks that I have already prepared and see how I am currently using Markdown and Quarto in my slides:
- slides/weekone/index.qmd
- slides/weektwo/index.qmd
- slides/weekthree/index.qmd
- slides/weekfour/index.qmd
- slides/weekfive/index.qmd
- slides/weeksix/index.qmd
Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
Please remember that I am currently working on the slides in slides/weekeight/index.qmd that introduce how to use data containers in Python. I want the slides to cover these topics:
- Lists (single-dimensional and two-dimensional)
  - A single-dimensional list that stores five different documents
  - A two-dimensional list that stores a list of the same document but with different metadata or arising from different formats
- Tuples and sets
  - A tuple that stores the title, author, and date of a document
  - A set that stores the unique keywords or tags associated with a document
- Any other basic content about lists, sets, or tuples, including:
  - Creating and initializing lists, tuples, and sets
  - Accessing elements in lists, tuples, and sets
  - Adding and removing elements from lists and sets, and explaining that tuples are immutable, so their elements cannot be added or removed
If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use containers like lists, tuples, and sets in Python.
Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.

Create Slides for Week Nine in `slides/nine/index.qmd`

Support for Content

Support for Index.qmd Revision Content

Definition and Etymology of "Prosegrammer"

The term "prosegrammer" effectively combines "prose" (written text) and "programmer" (software developer)
This portmanteau reflects the interdisciplinary nature of document engineering
Similar compound terms exist in technical fields (e.g., "bioinformatics" combines biology and informatics)

Document Engineering as Academic Field

Document engineering is recognized as a legitimate academic discipline combining computer science and technical communication
Research in this field includes automated document generation, content management systems, and text analytics
Universities offer courses in technical writing, computational linguistics, and information design that align with document engineering principles

Python for Document Processing

Python's string manipulation capabilities, regex support, and libraries like NLTK make it ideal for document analysis
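Python's built-in str methods (such as lower(), strip(), and split()) handle text cleaning and processing, and the string module supplies helpful constants such as string.punctuation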
Dictionary-based word frequency analysis is a standard technique in natural language processing and computational linguistics
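
The dictionary-based counting described above can be sketched as follows. This is a minimal illustration that uses only the standard library; the function name `word_frequency` matches the one mentioned elsewhere in these notes, but the body and the sample sentence are assumptions, not the code from index.qmd.

```python
import string
from typing import Dict


def word_frequency(text: str) -> Dict[str, int]:
    """Count how often each word appears in the text."""
    counts: Dict[str, int] = {}
    for raw_word in text.lower().split():
        # Strip punctuation from both ends so "docs," matches "docs".
        word = raw_word.strip(string.punctuation)
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts


frequencies = word_frequency("Docs as code: write docs, test docs.")
print(frequencies)  # {'docs': 3, 'as': 1, 'code': 1, 'write': 1, 'test': 1}
```

The `get(word, 0)` call supplies a starting count of zero the first time a word is seen, which avoids a separate membership check.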

Document Analysis Metrics

Word count, sentence count, and readability metrics are standard measures in content analysis
The Flesch-Kincaid readability score and similar metrics rely on words-per-sentence and syllables-per-word calculations
Document summary statistics help technical writers assess and improve content quality
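
A rough sketch of these summary statistics is shown below. The name `document_summary` appears elsewhere in these notes, but this body is an assumption; in particular, splitting sentences on runs of `.`, `!`, and `?` is a simplification that miscounts abbreviations such as "e.g.", and syllable counting is left out entirely.

```python
import re
from typing import Dict


def document_summary(text: str) -> Dict[str, float]:
    """Return word count, sentence count, and average words per sentence."""
    words = text.split()
    # Simplifying assumption: runs of ., !, or ? end a sentence.
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    average = len(words) / len(sentences) if sentences else 0.0
    return {
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": average,
    }


summary = document_summary("Write clearly. Test often! Does it render?")
print(summary["words"])      # 7
print(summary["sentences"])  # 3
```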

Support for Week One Slides Content

Document Engineering Definition and Scope

Document engineering combines computational methods with technical writing principles, as evidenced by academic programs at institutions like Carnegie Mellon and MIT that integrate technical communication with computer science
The field encompasses automated document generation, content analysis, and workflow optimization, supported by research in computational linguistics and information science

Python for Text Processing and Analysis

Python's built-in str methods, string module, and re module provide robust text processing capabilities that are fundamental to document engineering tasks
Word frequency analysis using dictionary-based counting is a standard technique in natural language processing, as demonstrated in NLTK documentation and computational linguistics textbooks
Document statistics like word count, sentence count, and readability metrics are established measures in content analysis research

Development Tools for Document Engineering

VS Code with the Quarto extension provides an integrated development environment for combining code and prose, as documented in Quarto's official documentation
The UV package manager, developed by Astral, represents modern Python dependency management and builds on packaging standards defined in Python Enhancement Proposals (PEPs), such as project metadata in pyproject.toml
Git and GitHub form the industry standard for version control and collaboration, with widespread adoption in both software development and academic publishing workflows

Cross-Platform Tool Installation

Installation instructions for UV, Python, and Quarto are designed to work across Windows, macOS, and Linux systems, following each tool's official documentation and installation guides
Command-line verification methods (--version flags) provide standard approaches for confirming successful installations across operating systems
Documentation links provided (UV docs, Quarto docs, VS Code docs) offer official troubleshooting resources maintained by tool creators

AI Tool Responsibility

Responsible use of AI coding assistants is emphasized in academic literature on AI ethics and educational guidelines from institutions implementing AI tools in curricula
The principle that users remain responsible for AI-generated content aligns with emerging best practices in AI-assisted writing and coding, as outlined by organizations like the ACM and IEEE

Support for Week Two Slides Content

Python as Beginner-Friendly Programming Language

Python's syntax is designed to be readable and intuitive, following the Zen of Python principle "Readability counts" (PEP 20)
Python consistently ranks at or near the top of IEEE Spectrum's annual programming language rankings and is widely used in introductory programming courses
The language's emphasis on clear, English-like syntax reduces cognitive load for new programmers, as documented in educational research on programming language design

Python Collections for Document Engineering

Python's built-in collection types (strings, lists, tuples, dictionaries, sets) provide comprehensive data structures for organizing document information
String objects in Python are immutable sequences, making them safe for storing document content that shouldn't be accidentally modified
Lists provide mutable sequences ideal for document sections and chapters that may change during editing
Dictionaries implement hash tables for efficient key-value storage, perfect for document metadata and properties
Sets ensure unique elements, valuable for maintaining collections of document keywords and tags without duplicates

Sequence, Selection, and Iteration in Programming

These three fundamental programming concepts (sequence, selection, iteration) form the theoretical foundation of structured programming, as defined by computer science pioneers like Edsger Dijkstra
Sequential execution ensures predictable program behavior, essential for document processing workflows
Conditional statements (selection) enable adaptive document formatting and content generation based on different criteria
Loops (iteration) facilitate processing of document collections and repetitive text operations
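
The three concepts above can be seen together in a few lines. This is an illustrative sketch; the file names are made up for the example.

```python
documents = ["README.md", "notes.txt", "config.yaml", "guide.md"]

# Sequence: statements run one after another, top to bottom.
markdown_files = []
other_files = []

# Iteration: visit each file name in the list.
for name in documents:
    # Selection: choose a branch based on the file extension.
    if name.endswith(".md"):
        markdown_files.append(name)
    else:
        other_files.append(name)

print(markdown_files)  # ['README.md', 'guide.md']
print(other_files)     # ['notes.txt', 'config.yaml']
```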

Document Engineering Applications of Python Concepts

Text processing functions like word_frequency and document_summary demonstrate practical applications of Python for document analysis
Containment checking operations (in operator) are fundamental for search functionality in document management systems
Collection slicing enables extraction of document sections and content segments for analysis and manipulation
String methods like lower(), upper(), and title() provide essential text formatting capabilities for document standardization
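
A short sketch of the operations listed above, using a made-up title and section list as sample data:

```python
title = "introduction to document engineering"
sections = ["Overview", "Installation", "Usage", "Testing", "License"]

# Containment checking with the in operator.
print("document" in title)   # True: substring check on a string
print("Usage" in sections)   # True: membership check on a list

# Slicing extracts a range of sections (index 1 up to, not including, 3).
print(sections[1:3])         # ['Installation', 'Usage']

# String methods return new strings; the original is unchanged.
print(title.title())         # Introduction To Document Engineering
print(title.upper())         # INTRODUCTION TO DOCUMENT ENGINEERING
print("README".lower())      # readme
```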

Python Type System for Document Engineering

Python's type hints (introduced in PEP 484) improve code readability and help catch errors in document processing functions
Strong typing helps ensure data integrity when working with document metadata and content
Type annotations serve as inline documentation, making code more maintainable for collaborative document engineering projects
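
As a small illustration of type hints acting as inline documentation, consider the hypothetical function below (the name `count_long_words` and its default threshold are assumptions made for this sketch). Note that Python does not enforce annotations at runtime; a separate static checker such as mypy is what reports mismatches.

```python
from typing import List


def count_long_words(words: List[str], minimum_length: int = 8) -> int:
    """Count the words that have at least minimum_length characters."""
    return sum(1 for word in words if len(word) >= minimum_length)


total = count_long_words(["document", "engineering", "tool"])
print(total)  # 2
```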

Support for Week Three Slides Content

Object-Oriented Programming for Document Engineering

Object-oriented programming (OOP) is a fundamental paradigm in software engineering that models real-world entities as objects with properties and behaviors
OOP principles (abstraction, inheritance, encapsulation, polymorphism) provide structured approaches to software design, as established in design patterns literature like the Gang of Four book "Design Patterns"
Document engineering benefits from OOP by modeling documents, sections, and processing operations as objects with well-defined interfaces

Document Classes and Inheritance

Class-based inheritance allows creation of specialized document types while maintaining shared functionality, following the Liskov Substitution Principle
The Document base class encapsulates common document properties (title, author, content, metadata) following encapsulation principles
Technical documents often require specialized attributes like complexity scoring and code examples, justifying inheritance hierarchies in document systems
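
One possible shape for such a hierarchy is sketched below. The attribute names follow the description above, but the class bodies are assumptions rather than the code used in the slides, and the complexity scoring rule is purely hypothetical.

```python
class Document:
    """Base class holding properties shared by every document."""

    def __init__(self, title, author, content, metadata=None):
        self.title = title
        self.author = author
        self.content = content
        self.metadata = metadata if metadata is not None else {}

    def word_count(self):
        return len(self.content.split())


class TechnicalDocument(Document):
    """Specialized document that also tracks code examples."""

    def __init__(self, title, author, content, code_examples=None):
        super().__init__(title, author, content)
        self.code_examples = code_examples if code_examples is not None else []

    def complexity_score(self):
        # Hypothetical rule: one point per code example,
        # plus one point per 100 words of content.
        return len(self.code_examples) + self.word_count() // 100


guide = TechnicalDocument(
    "Parser Guide", "A. Writer", "Parse the file first.", ["print('hi')"]
)
print(guide.word_count())        # 4 (inherited from Document)
print(guide.complexity_score())  # 1
```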

Polymorphism in Document Processing

Polymorphic interfaces enable uniform processing of different document types, supporting the Open/Closed Principle of software design
Abstract base classes define contracts for document processors, ensuring consistent interfaces across different implementations
Duck typing in Python allows flexible object interactions based on behavior rather than inheritance, supporting agile document processing architectures
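
The sketch below contrasts an abstract base class contract with duck typing. All class names here are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod


class DocumentProcessor(ABC):
    """Contract: every processor must implement process()."""

    @abstractmethod
    def process(self, text: str) -> str:
        ...


class WhitespaceProcessor(DocumentProcessor):
    def process(self, text: str) -> str:
        # Collapse runs of whitespace into single spaces.
        return " ".join(text.split())


class UppercaseProcessor(DocumentProcessor):
    def process(self, text: str) -> str:
        return text.upper()


class ReverseWords:
    """Not a subclass, but it has a process() method (duck typing)."""

    def process(self, text: str) -> str:
        return " ".join(reversed(text.split()))


processors = [WhitespaceProcessor(), UppercaseProcessor(), ReverseWords()]
# The same call works on every object, whatever its class.
results = [processor.process("docs  as code") for processor in processors]
print(results)  # ['docs as code', 'DOCS  AS CODE', 'code as docs']
```

Trying to instantiate `DocumentProcessor` directly raises a `TypeError`, which is how the abstract base class enforces its contract.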

Composition for Document Generation

Composition over inheritance promotes flexible object assembly, as recommended in software design best practices
Document generators composed of sections provide greater flexibility than rigid inheritance hierarchies
The Strategy pattern (using composition) enables runtime selection of document generation strategies, supporting diverse output formats
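
A minimal sketch of composition with a swappable strategy follows. Here the strategy is a plain function passed to the generator, which is one common Python variant of the pattern; the names `Section`, `DocumentGenerator`, `render_markdown`, and `render_plaintext` are assumptions made for this example.

```python
class Section:
    def __init__(self, heading, body):
        self.heading = heading
        self.body = body


def render_markdown(section):
    return f"## {section.heading}\n\n{section.body}\n"


def render_plaintext(section):
    return f"{section.heading.upper()}\n{section.body}\n"


class DocumentGenerator:
    """Composed of sections and a rendering strategy chosen at runtime."""

    def __init__(self, sections, renderer):
        self.sections = sections
        self.renderer = renderer

    def generate(self):
        return "\n".join(self.renderer(section) for section in self.sections)


sections = [
    Section("Install", "Run the installer."),
    Section("Usage", "Call the tool."),
]
markdown = DocumentGenerator(sections, render_markdown).generate()
plain = DocumentGenerator(sections, render_plaintext).generate()
print(markdown)
print(plain)
```

Adding a new output format means writing one more rendering function; neither `Section` nor `DocumentGenerator` has to change.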

OOP Principles Applied to Document Engineering

Abstraction hides complexity of document operations, allowing users to work with high-level document objects rather than low-level file operations
Encapsulation protects document state and ensures data integrity through controlled access methods
Inheritance creates type hierarchies for different document categories while maintaining shared behavior
Polymorphism enables extensible document processing systems that can handle new document types without modifying existing code

Interactive Code Examples

Pyodide enables browser-based Python execution, providing immediate feedback for educational content
Interactive examples reinforce learning through hands-on experimentation
Real-time code execution helps students understand OOP concepts through practical application in document engineering scenarios

Support for Week Two Skill-Check Slides Content

Document Engineering Skill-Check Assessment

Skill-checks provide formative assessment opportunities that measure student progress in programming skills, following educational best practices for frequent, low-stakes testing
Friday skill-checks create regular learning checkpoints that help students maintain consistent engagement with course material
GitHub Classroom provides industry-standard workflow experience, mirroring professional software development practices that students will encounter in their careers

Automated Assessment with GatorGrade

Automated testing and code quality checking reflect industry practices where continuous integration and automated testing are standard procedures
The gatorgrade tool provides objective, consistent assessment criteria that ensure fairness across all student submissions
Real-time feedback enables students to iteratively improve their solutions, supporting mastery-based learning approaches
Pytest integration follows Python testing best practices and prepares students for professional development workflows

Git Version Control Workflow

Regular commits and pushes reinforce version control best practices that are essential for collaborative software development
Git workflow mirrors professional development environments where frequent commits and proper version control are critical skills
GitHub Actions integration provides experience with automated testing pipelines common in modern software development

Programming Task Structure

TODO markers and function stubs provide scaffolding that supports learning progression from novice to expert, following educational research on cognitive load theory
Docstring-driven development encourages clear documentation practices essential for maintainable code
The type annotation requirement reinforces modern Python best practices and, combined with a static type checker, helps catch errors before the code runs

Honor Code and Academic Integrity

Explicit honor code requirements establish ethical frameworks for academic work, particularly important when AI tools are available
Citation requirements for AI assistance teach responsible use of emerging technologies in educational settings
Individual assessment format ensures authentic demonstration of student learning and skill development

Support for Week Four Slides Content

Markdown as Lightweight Markup Language

Markdown was created by John Gruber in 2004 as a lightweight markup language designed to be easy to read and write in plain text form
The CommonMark specification, first published in 2014, standardizes Markdown syntax to ensure consistency across different implementations and platforms
GitHub Flavored Markdown (GFM) extends CommonMark with features like tables, task lists, strikethrough, and autolinks, making it a de facto standard for technical documentation

Markdown for Technical Documentation

Stack Overflow Developer Survey consistently shows Markdown as one of the most loved markup languages among developers for its simplicity and readability
Major platforms like GitHub, GitLab, Reddit, and Discord use Markdown for content creation, demonstrating its widespread adoption in technical communities
README files in Markdown format are industry standard for project documentation, with GitHub automatically rendering README.md files as project homepages

Quarto as Publishing Platform

Quarto is developed by Posit (formerly RStudio) and represents the evolution of scientific publishing tools, combining the best features of R Markdown, Jupyter notebooks, and modern web technologies
The ability to execute code blocks in multiple languages (Python, R, Julia) makes Quarto suitable for reproducible research and technical documentation
Quarto's support for multiple output formats (HTML, PDF, Word, presentations) follows the principle of single-source publishing used in professional technical writing

Document Engineering Workflow Applications

Version control of documentation with Git follows software engineering best practices, enabling collaborative editing and change tracking for technical documents
Automated documentation generation from code comments and Markdown source files is standard practice in software development, using tools like Sphinx, mkdocs, and Quarto
The concept of "docs-as-code" treats documentation with the same rigor as source code, applying version control, review processes, and automated testing

Accessibility and SEO Benefits

Alt text for images is required by Web Content Accessibility Guidelines (WCAG) 2.1 Success Criterion 1.1.1 (a Level A requirement, so it is also part of Level AA conformance), making content accessible to users with visual impairments
Semantic HTML structure generated from Markdown headings improves search engine optimization (SEO) and document navigation
Proper heading hierarchy (H1, H2, H3, H4) creates logical document structure essential for screen readers and automated content analysis

Mathematical Expression Support

MathJax and KaTeX provide browser-based rendering of LaTeX mathematical notation, enabling complex mathematical expressions in web documents
Mathematical markup in documentation is essential for technical fields like data science, engineering, and computer science algorithm documentation
The example document quality formula demonstrates how mathematical concepts can be applied to evaluate documentation effectiveness

Interactive Code Execution

Pyodide enables client-side Python execution in web browsers, providing immediate feedback for educational content without requiring server resources
Interactive code examples improve learning outcomes by allowing students to experiment with code modifications and see immediate results
Live code execution in documentation serves as both tutorial and testing mechanism, ensuring code examples remain functional and up-to-date

Support for Week Six Slides Content

Quarto and Markdown for Document Engineering and Prosegramming

Quarto is developed by Posit and is widely used for technical publishing, supporting reproducible research and professional documentation (see Quarto documentation).
Markdown is the de facto standard for readable, plain-text documentation in software projects, with widespread adoption on platforms like GitHub, Stack Overflow, and Discord.
Combining Quarto and Markdown enables prosegrammers to automate, analyze, and publish documentation that is clear, interactive, and professional (see Quarto and Markdown official docs).
The "docs-as-code" approach treats documentation with the same rigor as source code, applying version control, review, and automated testing (see Sphinx, mkdocs, Quarto docs-as-code philosophy).
Document engineering blends code and prose to create resources for both humans and machines, as supported by academic research in technical communication and computational linguistics.
Mastery of Quarto and Markdown helps coders become document engineers: prosegrammers who craft content that informs, inspires, and endures (see ACM/IEEE guidelines on technical documentation).

Software Testing for Document Engineering Tools

Software testing principles apply directly to document processing systems, ensuring reliability and correctness in text analysis, parsing, and generation
The IEEE Standard for Software Unit Testing (IEEE 1008) provides established methodologies that adapt well to document processing validation
Test-driven development practices help create robust document analysis functions by defining expected behavior before implementation

Document Analysis Testing Best Practices

Testing document processing functions requires validation of text parsing, content extraction, and format conversion accuracy
Edge cases in document processing include empty documents, malformed markup, encoding issues, and extremely large text files
Automated testing frameworks like pytest enable systematic validation of document engineering tools across diverse input scenarios
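
The edge cases above translate into small test functions. The sketch below uses plain `assert` statements so it runs without any dependencies; pytest would discover the `test_` functions automatically if they were saved in a file such as `test_count_words.py`. The function under test, `count_words`, is hypothetical.

```python
def count_words(text: str) -> int:
    """Hypothetical function under test: count whitespace-separated words."""
    return len(text.split())


def test_count_words_typical_sentence():
    assert count_words("Markdown is easy to read") == 5


def test_count_words_empty_document():
    # Edge case: an empty document has zero words.
    assert count_words("") == 0


def test_count_words_mixed_whitespace():
    # Edge case: tabs, newlines, and padding should not create extra words.
    assert count_words("  tabs\tand\nnewlines  ") == 3


# Calling the tests directly gives a quick check without a test runner.
test_count_words_typical_sentence()
test_count_words_empty_document()
test_count_words_mixed_whitespace()
print("all tests passed")
```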

Python Testing Ecosystem for Document Tools

pytest provides parameterized testing capabilities ideal for testing document processing functions with varied input formats and content types
coverage.py helps ensure comprehensive testing of document analysis code paths, critical for tools that process diverse document structures
Property-based testing with hypothesis generates diverse text inputs to discover edge cases in document processing algorithms

Testing Integration with Document Workflows

Continuous integration testing ensures document processing tools remain reliable as codebases evolve and new document formats are supported
Mutation testing with tools like mutmut validates test suite quality by introducing controlled defects to verify test detection capabilities
Performance testing of document processing tools helps identify bottlenecks in text analysis and generation pipelines

Support for Week Eight Slides Content

Python Data Containers for Document Engineering

Python's built-in container types (lists, tuples, sets) provide fundamental data structures for document processing without requiring external libraries
Lists enable mutable sequences ideal for document collections that change during processing workflows
Tuples offer immutable records perfect for document metadata that should remain constant
Sets automatically handle uniqueness, making them ideal for keyword and tag management in document categorization systems

Container Characteristics and Use Cases

Lists: Mutable and ordered containers that allow duplicates, supporting dynamic document collections and hierarchical structures like chapter organizations
Tuples: Immutable and ordered containers that allow duplicates, ensuring document metadata integrity and providing structured access to fixed properties
Sets: Mutable and unordered containers that prevent duplicates, enabling efficient keyword deduplication and set-based document analysis operations
Container choice depends on document engineering requirements: mutability needs, ordering requirements, and duplicate handling preferences
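
The differences in mutability, ordering, and duplicate handling can be checked directly. The file names, metadata values, and tags below are sample data invented for this sketch.

```python
# List: ordered, mutable, allows duplicates.
documents = ["intro.md", "setup.md", "intro.md"]
documents.append("faq.md")

# Tuple: ordered and immutable, suited to fixed metadata (title, author, date).
metadata = ("Style Guide", "J. Doe", "2024-01-15")
tuple_is_immutable = False
try:
    metadata[0] = "New Title"
except TypeError:
    # Item assignment is not allowed on a tuple.
    tuple_is_immutable = True

# Set: unordered, mutable, drops duplicates automatically.
tags = {"markdown", "quarto", "markdown"}
tags.add("yaml")
tags.discard("quarto")

print(len(documents))      # 4
print(tuple_is_immutable)  # True
print(sorted(tags))        # ['markdown', 'yaml']
```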

Document Processing with Container Integration

Combining containers (e.g., lists of tuples, sets for unique words) enables sophisticated document analysis workflows
Type hints (List[str], Set[str], Dict[str, Any]) improve code readability and enable static analysis tools to catch errors
Container operations like list comprehensions and set operations provide efficient document processing without complex algorithms
Real-world document engineering applications include content management systems, automated documentation generators, and text analysis tools

Interactive Code Examples in Educational Slides

Pyodide enables browser-based Python execution, allowing students to experiment with container operations immediately
Code examples demonstrate practical document engineering scenarios like file organization, metadata management, and keyword analysis
Interactive examples reinforce learning through hands-on experimentation and immediate feedback
Progressive complexity from basic operations to integrated document analysis builds student confidence and understanding

Support for Week Nine Slides Content

Python Dictionaries for Document Engineering

Dictionaries are Python's implementation of hash tables, providing O(1) average-case lookup time for key-value pairs, making them ideal for document indexing and metadata storage
The dictionary data structure maps keys to values, mirroring real-world document organization systems like catalogs, indexes, and bibliographic databases
Python dictionaries maintain insertion order (a language guarantee since Python 3.7), enabling predictable iteration through document collections while preserving fast lookups
Dictionary comprehensions and methods like get(), items(), keys(), and values() provide efficient document processing operations
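
A brief sketch of these dictionary methods, using an invented mapping from file names to word counts:

```python
word_counts = {"intro.md": 120, "setup.md": 340, "faq.md": 85}

print(word_counts.get("setup.md"))       # 340
print(word_counts.get("missing.md", 0))  # 0: a default instead of a KeyError
print(list(word_counts.keys()))          # ['intro.md', 'setup.md', 'faq.md']

# Dictionary comprehension: keep only documents longer than 100 words.
long_documents = {
    name: count for name, count in word_counts.items() if count > 100
}
print(long_documents)                    # {'intro.md': 120, 'setup.md': 340}
print(sum(word_counts.values()))         # 545
```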

JSON and Dictionary Interoperability

JSON (JavaScript Object Notation) is defined by RFC 8259 and serves as the standard data interchange format for web APIs and configuration files
Python's json module in the standard library enables seamless conversion between JSON strings and Python dictionaries using json.loads() and json.dumps()
JSON's object notation directly maps to Python dictionaries, with JSON arrays becoming Python lists and nested objects becoming nested dictionaries
Modern document management systems extensively use JSON for metadata storage, configuration files, and API communication
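
The round trip between JSON text and Python objects can be sketched with the standard library `json` module. The JSON string here is sample metadata made up for the example.

```python
import json

json_text = (
    '{"title": "User Guide", "tags": ["markdown", "quarto"], '
    '"stats": {"words": 1200}}'
)

# json.loads() turns JSON text into Python objects.
document = json.loads(json_text)
print(type(document).__name__)          # dict
print(type(document["tags"]).__name__)  # list
print(document["stats"]["words"])       # 1200

# Modify the dictionary, then turn it back into JSON text.
document["stats"]["words"] += 50
print(json.dumps(document, indent=2))
```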

Dictionary Operations for Document Management

Key existence checking with the in operator provides O(1) average-case performance, enabling efficient document verification in catalogs
The dictionary update() method merges multiple document collections, overwriting values for keys that already exist, which supports aggregation workflows common in document management systems
The del statement removes an entry but raises a KeyError if the key is missing, while pop() with a default value enables safe removal of document entries from catalogs and indexes
Iterating through keys(), values(), and items() provides flexible access patterns for document processing pipelines
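
These catalog operations can be sketched on a small invented catalog that maps file names to titles:

```python
catalog = {"intro.md": "Introduction", "setup.md": "Setup"}
new_entries = {"faq.md": "FAQ", "setup.md": "Installation and Setup"}

print("intro.md" in catalog)  # True

# update() adds new keys and overwrites values for existing keys.
catalog.update(new_entries)

# del raises KeyError when the key is missing; pop() with a default does not.
del catalog["intro.md"]
removed = catalog.pop("missing.md", None)
print(removed)                # None

for name, title in catalog.items():
    print(f"{name}: {title}")
# setup.md: Installation and Setup
# faq.md: FAQ
```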

Nested Dictionaries for Complex Document Structures

Nested dictionaries model hierarchical document structures like document catalogs with metadata, sections with subsections, and bibliographic entries with detailed attributes
Accessing nested values with bracket notation (dict[key1][key2]) provides clear syntax for retrieving specific document properties
Complex document schemas benefit from nested dictionary structures that can represent arbitrary levels of organization
Type hints like Dict[str, Dict[str, Any]] document nested structure expectations and enable static type checking
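
A nested catalog with the type hint mentioned above might look like the sketch below; the entries and author names are sample data.

```python
from typing import Any, Dict

catalog: Dict[str, Dict[str, Any]] = {
    "guide": {
        "title": "User Guide",
        "metadata": {"author": "A. Writer", "words": 1200},
    },
    "faq": {
        "title": "FAQ",
        "metadata": {"author": "B. Editor", "words": 300},
    },
}

# Bracket notation walks down one level at a time.
print(catalog["guide"]["metadata"]["author"])  # A. Writer

# Chained get() calls avoid a KeyError when a level is missing.
missing_title = catalog.get("tutorial", {}).get("title", "unknown")
print(missing_title)                           # unknown
```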

Dictionaries with Varied Value Types

Python dictionaries support heterogeneous value types, allowing storage of strings, integers, lists, tuples, sets, and nested dictionaries in the same container
Document metadata naturally requires mixed types: strings for titles, integers for word counts, lists for authors, and dictionaries for nested properties
Using lists as dictionary values enables storage of document sections, version histories, and related file collections
Using sets as dictionary values automatically deduplicates tags, keywords, and categories associated with documents

Practical Document Engineering Applications

Word frequency analysis with dictionaries demonstrates a fundamental text processing technique used in search engines and content analysis tools
Document indexing with inverted indexes (mapping words to document IDs) enables efficient full-text search functionality
Metadata filtering and querying operations support document library management systems used in academic and professional settings
JSON-based configuration files for document processing tools follow industry standard practices in software development
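
One way to sketch the inverted index idea is shown below. It assumes whitespace tokenization and lowercase matching; the function name and sample documents are hypothetical.

```python
def build_inverted_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical two-document collection.
docs = {
    "d1": "Markdown makes writing simple",
    "d2": "Quarto renders Markdown documents",
}
index = build_inverted_index(docs)
hits = sorted(index.get("markdown", set()))
```

Looking up a word is then a single dictionary access rather than a scan of every document.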

Educational Slide Design Principles

Progressive complexity introduction starts with basic dictionary creation and advances to nested structures and JSON parsing
Interactive Pyodide code blocks enable hands-on learning and immediate feedback for dictionary operations
Concrete document engineering examples connect abstract programming concepts to practical prosegrammer applications
Clear slide titles, appropriate icon usage, and incremental content display support effective learning and presentation quality

Support for Week Eleven Slides Content

Searching and Sorting for Document Engineering

Searching and sorting are fundamental algorithmic operations that directly apply to document engineering workflows including indexing, retrieval, and organization of large document collections
Binary search provides O(log n) lookup time for sorted document indexes, significantly faster than linear search's O(n) complexity for large collections
Sorting algorithms organize documents by various criteria (date, title, relevance score) enabling efficient browsing and retrieval in documentation systems, digital libraries, and content management platforms
Real-world applications include search engine indexing, bibliography sorting, API function alphabetization, and chronological blog post organization

Big-O Notation for Algorithm Analysis

Big-O notation describes algorithmic time complexity in terms of input size, providing a standardized way to compare algorithm efficiency as documented in computer science literature (Cormen et al., "Introduction to Algorithms")
O(1) represents constant time operations like dictionary key lookups, O(log n) represents logarithmic operations like binary search, O(n) represents linear operations like sequential search, and O(n²) represents quadratic operations like nested iterations
Understanding big-O notation helps document engineers select appropriate algorithms for different dataset sizes and performance requirements
Time complexity analysis is essential for building scalable document processing systems that handle growing content volumes

Binary Search in Document Systems

Binary search requires sorted data and provides logarithmic search time by repeatedly dividing the search space in half, making it efficient for large sorted document indexes
Practical applications include searching documentation indexes, alphabetized API references, and sorted bibliographic databases
The algorithm's O(log n) complexity means search time grows slowly even as document collections scale to millions of entries
Binary search demonstrates the performance benefits of maintaining sorted data structures in document management systems
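
The halving step described above can be sketched as an iterative function. This is a textbook-style version written for illustration, with made-up index entries.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1      # discard the lower half
        else:
            high = mid - 1     # discard the upper half
    return -1

# The input must be sorted before searching.
titles = sorted(["glossary", "api", "tutorial", "faq", "index"])
position = binary_search(titles, "index")
absent = binary_search(titles, "changelog")
```

In practice the standard library's bisect module offers the same idea without a hand-written loop.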

Sorting Algorithms for Document Organization

Merge sort provides O(n log n) worst-case performance and stable sorting, making it suitable for sorting documents while preserving original order of equal elements
Quick sort offers average O(n log n) performance with in-place sorting, reducing memory overhead for large document collections, though its worst case is O(n²)
Practical document engineering applications include sorting blog posts by publication date, organizing API functions alphabetically, and ranking search results by relevance scores
Python's built-in sorted() function and .sort() method use Timsort, a hybrid algorithm combining merge sort and insertion sort optimized for real-world data patterns
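
A short sketch of sorting with sorted() and a key function follows. The post records are hypothetical; the example also shows that equal keys keep their original relative order because the sort is stable.

```python
# Hypothetical blog posts; two share the same date.
posts = [
    {"title": "Regex Basics", "date": "2025-03-02"},
    {"title": "Intro to Quarto", "date": "2025-01-15"},
    {"title": "Markdown Tables", "date": "2025-03-02"},
]

# ISO 8601 date strings sort chronologically when compared as text.
by_date = sorted(posts, key=lambda post: post["date"])
titles_by_date = [post["title"] for post in by_date]

# Case-insensitive alphabetical order by title.
titles_by_name = [post["title"] for post in sorted(posts, key=lambda post: post["title"].lower())]
```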

Document Engineering Context for Algorithms

Search functionality in documentation systems (ReadTheDocs, Sphinx) relies on indexing and search algorithms to quickly locate relevant content
Content management systems sort articles by date, category, and popularity using efficient sorting algorithms
Alphabetical organization of API documentation improves discoverability and navigation for developers consulting reference materials
Understanding algorithmic complexity helps prosegrammers make informed decisions about data structure choices and algorithm selection for document processing tasks

Beginner-Friendly Algorithm Presentation

Simplified recursion coverage focuses on practical usage rather than theoretical computer science details, making concepts accessible to beginners
Concrete examples with small datasets (5-10 elements) demonstrate algorithm behavior without overwhelming complexity
Visual representation through code examples shows step-by-step algorithm execution with actual document-related data
Focus on "when to use" rather than "how to implement" supports practical application over theoretical analysis

Support for Week Ten Slides Content

File Input/Output for Document Engineering

File I/O operations are fundamental to document engineering, enabling prosegrammers to read source documents, write processed output, and persist analysis results
Python's built-in open() function provides basic file access with modes ('r' for reading, 'w' for writing, 'a' for appending) as documented in Python's official documentation
Context managers using the with statement ensure proper file closure and resource management, following Python best practices outlined in PEP 343
File operations enable the entire document processing pipeline: ingestion, transformation, analysis, and output generation
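
The three file modes and the with statement can be sketched as follows. A temporary directory is used so the example leaves nothing behind; the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")

    with open(path, "w", encoding="utf-8") as handle:   # 'w' creates or truncates
        handle.write("first line\n")

    with open(path, "a", encoding="utf-8") as handle:   # 'a' appends to the end
        handle.write("second line\n")

    with open(path, "r", encoding="utf-8") as handle:   # 'r' reads existing content
        lines = handle.read().splitlines()

    # The context manager has already closed the file at this point.
    closed_after_block = handle.closed
```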

pathlib.Path for Cross-Platform File Management

The pathlib module (introduced in Python 3.4, PEP 428) provides object-oriented filesystem path handling that works consistently across Windows, macOS, and Linux
Path objects offer methods like read_text(), write_text(), exists(), and iterdir() that abstract platform-specific filesystem differences
Using pathlib.Path instead of string-based paths prevents common errors from backslash/forward slash differences between operating systems
Modern Python projects prefer pathlib over older os.path module for filesystem operations, as recommended in Python documentation
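
The Path methods named above might be combined like this. It is a sketch that writes into a temporary directory; the directory and file names are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    docs_dir = Path(folder) / "docs"     # the / operator joins path segments portably
    docs_dir.mkdir()

    (docs_dir / "intro.md").write_text("# Intro\n", encoding="utf-8")
    (docs_dir / "setup.md").write_text("# Setup\n", encoding="utf-8")

    names = sorted(path.name for path in docs_dir.iterdir())
    intro_text = (docs_dir / "intro.md").read_text(encoding="utf-8")
    faq_exists = (docs_dir / "faq.md").exists()
```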

JSON as Structured Document Format

JSON (JavaScript Object Notation) is defined by RFC 8259 as a lightweight, human-readable data interchange format widely used for configuration files, APIs, and structured document storage
Python's standard library json module provides json.loads() for parsing JSON strings into dictionaries and json.dumps() for serializing dictionaries to JSON format
JSON's nested object structure naturally maps to Python dictionaries, enabling seamless integration between document storage and processing
Document metadata, bibliographic records, and structured content benefit from JSON's combination of readability and machine-parseability

Document Analysis Through Counting Operations

Counting operations are fundamental text analysis techniques used in search engines, content recommendation systems, and bibliometric research
Frequency analysis identifies important terms, common patterns, and statistical properties of document collections
Dictionary-based counting (using dict.get() with default values) provides efficient accumulation of frequency statistics
Count-based metrics enable document classification, similarity detection, and quality assessment
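
The dict.get() accumulation pattern mentioned above can be sketched as a small function. It assumes lowercase, whitespace-separated tokens; the function name is hypothetical.

```python
def count_words(text):
    """Count occurrences of each lowercase, whitespace-separated word."""
    counts = {}
    for word in text.lower().split():
        # get() returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words("the cat and the hat and the bat")
```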

Statistical Aggregations for Document Collections

Computing minimum, maximum, and average values provides statistical summaries essential for understanding document collection characteristics
Python's built-in min(), max(), and sum() functions work with dictionary values to compute aggregate statistics
Statistical analysis of document properties (word counts, author counts, keyword frequencies) supports collection management and quality control
Unique value counting and frequency distributions reveal collection diversity and content patterns
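
As a sketch of these aggregations, the built-in functions can be applied directly to dictionary values. The word counts below are invented for the example.

```python
# Hypothetical word counts per document.
word_counts = {"intro.md": 420, "setup.md": 310, "faq.md": 170}

shortest = min(word_counts.values())
longest = max(word_counts.values())
average = sum(word_counts.values()) / len(word_counts)

# Passing key=word_counts.get returns the key whose value is largest.
longest_doc = max(word_counts, key=word_counts.get)

# A set of the values gives the number of distinct counts.
distinct_counts = len(set(word_counts.values()))
```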

Complete Analysis Pipelines

End-to-end document processing workflows combine reading, parsing, analyzing, and summarizing into integrated pipelines
Modular function design (separate functions for reading, parsing, analyzing) follows software engineering best practices and enables code reuse
Pipeline architecture mirrors professional document processing systems used in content management, digital libraries, and publishing workflows
Writing analysis results back to JSON files closes the processing loop, enabling iterative refinement and long-term storage of insights

Document Engineering with Standard Library Only

Using only Python's standard library (open(), pathlib, json) ensures maximum portability and minimal dependency management
Standard library tools provide sufficient functionality for most document processing tasks without requiring external packages
Learning standard library approaches builds foundational understanding before introducing specialized libraries like pandas or nltk
Dependency-free code simplifies deployment, reduces maintenance burden, and works across different Python environments

Create Slides for Week Ten in `slides/weekten/index.qmd`

Create Slides for Week Twelve in `slides/weektwelve/index.qmd`

Support for Week Twelve Slides Content

Regular Expressions for Document Engineering

Regular expressions (regex) are patterns that match character combinations in text; they originate in formal language theory and are supported by most programming languages
Python's re module in the standard library provides comprehensive regex functionality with a syntax similar to Perl's, though it is not the PCRE library itself
Regex patterns enable efficient text parsing, validation, extraction, and transformation operations essential to document engineering workflows
Common document engineering applications include email validation, date extraction, markdown parsing, and structured text processing

Regex Components and Syntax

Metacharacters (., ^, $, *, +, ?, {n,m}) provide pattern building blocks; most of them also appear in the POSIX regular expression specification, although Python's re module follows Perl-style rather than POSIX syntax
Character classes ([abc], \d, \w, \s) define sets of matching characters, following standard regex notation used across programming languages
Quantifiers control repetition in patterns, enabling flexible matching of variable-length text segments in documents
The re.compile() function improves performance by pre-compiling patterns for reuse, as recommended in Python's official documentation

Pattern Matching Methods in Python

re.search() finds the first match anywhere in a string, ideal for locating patterns within large documents
re.match() matches only at the start of a string, useful for validating document format headers and structured text beginnings
re.findall() returns all non-overlapping matches, supporting comprehensive pattern extraction from document collections
re.sub() performs pattern-based substitution, enabling automated document cleaning and transformation pipelines
re.split() divides strings on pattern matches, facilitating document parsing and tokenization
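
The five functions above can be compared side by side on one string. The sample text and the deliberately loose email pattern are assumptions for illustration; the pattern is not a full email validator.

```python
import re

text = "Draft v1 ready. Contact: ana@example.com, bo@example.org"

# Simplified pattern for illustration only.
EMAIL = re.compile(r"[\w.]+@[\w.]+")

first_email = EMAIL.search(text).group()       # first match anywhere in the string
at_start = re.match(r"Draft", text)            # a match object: the text starts with "Draft"
not_at_start = re.match(r"Contact", text)      # None: "Contact" is not at the start
all_emails = EMAIL.findall(text)               # every non-overlapping match
redacted = EMAIL.sub("<email>", text)          # replace each match
parts = re.split(r"\s*[,;]\s*", "markdown, quarto;python")  # split on separators
```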

Document Engineering Applications of Regex

Date extraction from documents uses patterns like \d{4}-\d{2}-\d{2} to identify ISO 8601 formatted dates common in technical documentation
Email validation patterns ensure proper format in contact information and bibliographic metadata
Markdown syntax parsing relies on regex to identify headers, links, code blocks, and formatting markers
Log file analysis and structured text parsing benefit from regex pattern matching for extracting relevant information

Testing Regular Expressions

The unittest framework provides structured testing for regex patterns, ensuring pattern reliability across different text inputs
Test cases should cover positive matches (valid patterns), negative matches (invalid patterns), and edge cases (boundary conditions, special characters)
Regex testing validates pattern correctness before deployment in production document processing systems
Tools like regex101.com provide interactive regex testing and debugging with visual pattern explanation
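
A sketch of positive, negative, and edge-case tests for a date pattern follows. The `find_dates` helper and the test class are hypothetical; the pattern checks only the digit layout and does not validate month or day ranges.

```python
import re
import unittest

# Matches the YYYY-MM-DD layout only; it does not check calendar validity.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_dates(text):
    """Return every substring that has the YYYY-MM-DD layout."""
    return ISO_DATE.findall(text)

class FindDatesTest(unittest.TestCase):
    def test_positive_match(self):
        self.assertEqual(find_dates("Released 2025-03-02."), ["2025-03-02"])

    def test_negative_match(self):
        self.assertEqual(find_dates("Released 03/02/2025."), [])

    def test_edge_case_extra_digits(self):
        # Word boundaries reject dates embedded in longer digit runs.
        self.assertEqual(find_dates("id 12025-03-021"), [])

# Run the suite programmatically so the example also works outside the CLI.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FindDatesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```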

Benefits and Limitations of Regex

Benefits include powerful pattern matching, built-in language support, concise syntax for complex text operations, and widespread adoption across tools and platforms
Limitations include readability challenges with complex patterns, performance concerns with catastrophic backtracking, and difficulty debugging intricate regular expressions
Alternatives like parser libraries (e.g., pyparsing) offer better solutions for highly structured document formats
Best practices recommend using regex for pattern-based tasks while choosing specialized parsers for formal grammars and complex document structures

Create Slides for Week Eleven in `slides/weekeleven/index.qmd`

Create Slides for Week Thirteen in `slides/weekthirteen/index.qmd`

Support for Week Thirteen Slides Content

Natural Language Processing for Document Engineering

Natural language processing (NLP) is a field of computer science and linguistics concerned with interactions between computers and human language, enabling automated text analysis and understanding
Basic NLP techniques like tokenization, stemming, and frequency analysis form the foundation of search engines, content management systems, and document analysis tools
Implementing NLP functions from scratch helps learners understand core concepts before using specialized libraries like NLTK, spaCy, or Gensim
Document engineering benefits from NLP through automated keyword extraction, content summarization, and document classification

Tokenization and Segmentation

Tokenization breaks text into individual units (tokens) such as words, numbers, or punctuation marks, forming the basis for all text analysis operations
Python's str.split() method provides basic whitespace tokenization, while re.findall() enables pattern-based tokenization to extract specific token types
Segmentation divides documents into larger units like sentences or paragraphs, using punctuation patterns or structural markers like double newlines
Regular expressions enable sophisticated tokenization and segmentation patterns for handling complex document structures
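
The difference between whitespace and pattern-based tokenization, and between sentence and paragraph segmentation, can be sketched as follows. The sample text and the simple patterns are assumptions; real text needs more careful rules for abbreviations and the like.

```python
import re

text = "Prosegrammers write docs. They test code, too!\n\nNew paragraph here."
sentence = "They test code, too!"

# Whitespace tokenization keeps punctuation attached to words.
whitespace_tokens = sentence.split()

# Pattern-based tokenization keeps only runs of letters and apostrophes.
word_tokens = re.findall(r"[A-Za-z']+", sentence)

# Paragraphs are separated by a blank line (double newline).
paragraphs = [part for part in text.split("\n\n") if part.strip()]

# Sentences end at ., !, or ? followed by whitespace.
sentences = [part for part in re.split(r"(?<=[.!?])\s+", text) if part]
```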

Stemming and Lemmatization

Stemming reduces words to their root form by removing common suffixes (e.g., "running" becomes "run"), enabling matching of related word forms in search and analysis
The Porter Stemmer algorithm is a well-known stemming approach, but simple rule-based suffix removal provides effective results for basic applications
Lemmatization maps words to their dictionary base forms (lemmas) using linguistic knowledge, providing more accurate normalization than stemming
Dictionary-based lemmatization offers beginner-friendly implementation using Python dictionaries to map word forms to base forms
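
A deliberately crude sketch of rule-based suffix removal and dictionary-based lemmatization is given below. The suffix list, the length check, and the lemma table are all assumptions made for the example; this is not the Porter algorithm, and it mishandles doubled consonants.

```python
# Suffixes are checked in this order (illustrative list, not Porter's rules).
SUFFIXES = ("ing", "ed", "s")

def simple_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table mapping word forms to base forms.
LEMMAS = {"ran": "run", "better": "good", "documents": "document"}

def lemmatize(word):
    """Return the dictionary base form, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)

stems = [simple_stem(w) for w in ["indexed", "documents", "running", "is"]]
lemmas = [lemmatize(w) for w in ["ran", "better", "quarto"]]
```

Here "running" becomes "runn" rather than "run", which illustrates why fuller algorithms such as Porter's add rules beyond plain suffix stripping.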

Stop Word Removal

Stop words are high-frequency function words (articles, prepositions, conjunctions) that carry little semantic meaning in text analysis
Removing stop words focuses analysis on content-bearing terms, improving keyword extraction and topic identification accuracy
Python sets provide efficient stop word filtering using the in operator for fast membership testing
Stop word lists vary by application domain, with different sets appropriate for general text versus technical documentation
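
Set-based filtering might be sketched like this. The stop word list is a tiny illustrative sample, not a recommended list.

```python
# Small illustrative stop word set; real lists are longer and domain-specific.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = "The structure of a document is important".split()
content_words = remove_stop_words(tokens)
```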

Word Frequency Analysis

Word frequency counting identifies the most common terms in documents using dictionary-based accumulation patterns
The dict.get() method with default values provides concise frequency counting without explicitly checking for key existence
Frequency distributions reveal document themes, author style, and content patterns useful for classification and summarization
Top-N word extraction helps identify key topics without reading entire documents

Keyword Extraction and KWIC

Keyword extraction combines frequency analysis with stop word filtering to identify terms that best represent document content
Scoring keywords by frequency, length, or TF-IDF metrics enables ranking of term importance for indexing and search applications
Keyword in context (KWIC) displays show how terms are used within surrounding text, supporting concordance building and usage analysis
KWIC tools align keywords in formatted output for visual scanning and linguistic pattern identification
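
A minimal keyword-in-context sketch is shown below, assuming whitespace tokens and a fixed window of neighboring words. The function names, window size, and sample sentence are hypothetical.

```python
def kwic(text, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Ignore trailing punctuation and case when comparing.
        if token.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - window): i])
            right = " ".join(tokens[i + 1: i + 1 + window])
            rows.append((left, token, right))
    return rows

def format_kwic(rows):
    """Right-align the left context so the keywords line up in a column."""
    width = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left:>{width}} | {kw} | {right}" for left, kw, right in rows]

text = "Markdown is simple. Many tools render Markdown quickly and well."
rows = kwic(text, "markdown")
lines = format_kwic(rows)
```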

Educational Slide Design for NLP

Progressive complexity introduction starts with basic tokenization and builds to integrated analysis pipelines combining multiple techniques
Interactive Pyodide code blocks enable hands-on experimentation with NLP functions, providing immediate feedback for learning
Document engineering examples demonstrate practical applications of NLP techniques to real prosegrammer workflows
Simplified implementations using only Python standard library features ensure accessibility and portability across environments

Create Slides for Week Fifteen in `slides/weekfifteen/index.qmd`

Support for Week Fifteen Slides Content

Retrieval Augmented Generation for Document Engineering

Retrieval Augmented Generation (RAG) is a technique that combines information retrieval with natural language generation to produce more accurate, context-grounded responses, as described in the original RAG paper by Lewis et al. (2020) from Facebook AI Research
RAG addresses the limitation of language models generating "hallucinated" information by grounding responses in retrieved factual documents from a knowledge base
The approach has become fundamental to modern AI applications including ChatGPT plugins, documentation assistants, and question-answering systems
Document engineering benefits from RAG through automated technical documentation assistants, code explanation systems, and knowledge base chatbots

RAG Pipeline Components

Document ingestion and preprocessing forms the foundation of RAG systems by loading, cleaning, and normalizing text data from various sources
Text chunking divides long documents into smaller, semantically coherent segments that fit within context windows and improve retrieval precision
Vector embeddings transform text into numerical representations that enable semantic similarity calculations, typically using transformer-based models like BERT or sentence transformers
Retrieval mechanisms rank document chunks by relevance to queries using similarity metrics like cosine similarity or dot product
Context combination merges retrieved chunks with user queries to create comprehensive prompts for language models
Response generation uses language models (GPT, Claude, Llama) to synthesize answers grounded in retrieved information

Document Chunking Strategies

Sentence-based chunking splits text at sentence boundaries using punctuation patterns, providing natural semantic units for retrieval
Fixed-size word chunking creates uniform segments with consistent lengths, useful for controlling context window sizes in language models
Semantic chunking groups related content together, though this approach requires more sophisticated analysis beyond simple rule-based splitting
Overlapping chunks include content from adjacent segments to preserve context at boundaries, improving retrieval quality for queries spanning chunk boundaries
The choice of chunking strategy impacts retrieval precision, context completeness, and system performance
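
Fixed-size chunking with overlap can be sketched as follows. The function name, default sizes, and sample text are assumptions; the sizes are far smaller than a real system would use.

```python
def chunk_words(text, size=5, overlap=2):
    """Split text into chunks of `size` words that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break   # the last chunk already reaches the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text, size=4, overlap=1)
```

Each chunk begins with the last word of the previous one, so content at a boundary appears in both neighbors.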

Simple Vector Representations

Word frequency vectors represent text as dictionaries mapping words to occurrence counts, providing a simple lexical baseline for text similarity
Set-based overlap metrics such as Jaccard similarity calculate the ratio of shared words to the total number of distinct words across both texts, demonstrating basic retrieval relevance scoring
Real-world systems use dense embeddings from neural models (sentence transformers, OpenAI embeddings) that capture semantic meaning beyond simple word overlap
Vector databases (FAISS, ChromaDB, Pinecone) enable efficient similarity search over large document collections using approximate nearest neighbor algorithms
The simplified representations in course slides demonstrate core concepts without requiring external dependencies or API access
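
The set-based overlap score can be sketched in a few lines. This version assumes lowercase whitespace tokens and computes Jaccard similarity (shared words divided by all distinct words); the function name and sample strings are hypothetical.

```python
def jaccard(text_a, text_b):
    """Return |A ∩ B| / |A ∪ B| for the word sets of the two texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Two shared words (markdown, quarto) out of five distinct words.
score = jaccard("markdown tables in quarto", "quarto renders markdown")
```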

Retrieval and Ranking Mechanisms

Similarity scoring quantifies relevance between queries and document chunks using metrics like Jaccard similarity, cosine similarity, or BM25
Top-k selection retrieves the most relevant chunks while balancing context size and precision, with typical values ranging from 2 to 10 chunks
Re-ranking strategies apply multiple relevance signals to improve initial retrieval results, such as combining keyword match with semantic similarity
Source attribution tracks which chunks contributed to responses, enabling transparency and verification of generated information
Retrieval quality directly impacts response accuracy, making it a critical component of RAG systems
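
Scoring, top-k selection, and source attribution can be sketched together. The word-overlap score stands in for the richer metrics named above, and the function names, chunks, and query are made up for the example.

```python
def overlap_score(query, chunk):
    """Jaccard overlap between the word sets of a query and a chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def retrieve(query, chunks, k=2):
    """Return up to k (chunk index, chunk, score) tuples, best first."""
    scored = [(overlap_score(query, chunk), i, chunk) for i, chunk in enumerate(chunks)]
    # Sort by descending score; ties keep the earlier chunk first.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(i, chunk, score) for score, i, chunk in scored[:k] if score > 0]

chunks = [
    "quarto renders markdown into slides",
    "binary search needs sorted data",
    "markdown tables use pipes",
]
results = retrieve("how does quarto render markdown", chunks, k=2)
source_ids = [index for index, _, _ in results]   # kept for attribution
```

Note that "render" and "renders" do not match here, which is the kind of gap stemming or dense embeddings would help close.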

Context Construction and Response Generation

Context formatting structures retrieved chunks with queries into prompts that language models can effectively process
Template-based generation provides deterministic responses from retrieved information, suitable for educational demonstrations without requiring LLM APIs
Language model integration (GPT, Claude, Llama) enables flexible, natural-sounding responses that synthesize information from multiple sources
Prompt engineering techniques guide language models to ground responses in retrieved context and avoid hallucination
Multi-turn conversations maintain context across interactions, enabling follow-up questions and clarifications in RAG applications
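
Context formatting and template-based generation might be sketched without any language model. The prompt layout, function names, and retrieved passages below are hypothetical choices made for illustration.

```python
def build_context(query, retrieved):
    """Format a query and its retrieved (id, text) passages as one prompt string."""
    sources = "\n".join(f"[{source_id}] {text}" for source_id, text in retrieved)
    return f"Question: {query}\nSources:\n{sources}"

def template_answer(retrieved):
    """Produce a deterministic answer that cites each passage by its id."""
    if not retrieved:
        return "No relevant passages were found."
    cited = "; ".join(f"{text} [{source_id}]" for source_id, text in retrieved)
    return f"Based on the retrieved passages: {cited}"

retrieved = [(0, "quarto renders markdown into slides"), (2, "markdown tables use pipes")]
prompt = build_context("how does quarto render markdown", retrieved)
answer = template_answer(retrieved)
```

The prompt string is what a language model would receive in a full system; the template answer replaces that call so the example stays dependency-free.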

Real-World RAG Applications

Technical documentation assistants use RAG to answer developer questions by retrieving relevant documentation sections and generating contextual explanations
Customer support chatbots combine company knowledge bases with conversational AI to provide accurate, source-grounded responses
Research paper assistants help scholars find relevant citations and summarize academic literature through retrieval-augmented summarization
Code explanation systems retrieve related code examples and documentation to explain programming concepts and provide usage examples
These applications demonstrate RAG's value for document-intensive domains where factual accuracy and source attribution are critical

Educational Slide Design for RAG

Progressive pipeline introduction builds from document ingestion through retrieval to generation, showing how components connect
Simplified implementations using only Python standard library demonstrate core concepts without external dependencies
Interactive Pyodide code blocks enable hands-on experimentation with chunking, retrieval, and generation functions
Tool reference slides list professional RAG libraries (LangChain, LlamaIndex) and vector databases while keeping examples dependency-free
Document engineering context frames RAG as a practical skill for building intelligent documentation systems and knowledge management tools
Beginner-accessible explanations focus on conceptual understanding over mathematical complexity or advanced NLP techniques
