Completed Tasks for Project

Completed Tasks

Revise the Index for the Entire Site

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Rewrite each of the introductory paragraphs so that they focus on the concept of a "Prosegrammer".
  • Rewrite the course overview so that it connects to the concept of a "Prosegrammer" and a course on document engineering.
  • Create a new Python source code example that connects to the topic of document engineering and does not require any dependencies.
  • Delete any content related to performance engineering, as this is not the focus of a course on document engineering.
  • Give the correct link to the discord server, which is specified in the _quarto.yml file.

Create Slides for Week One in slides/weekone/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
  • Following the guidelines for creating slides, translate the content in the index.qmd in the root of the repository to slides that introduce the course in the slides/weekone/index.qmd file.
  • Note that the existing content was from slides that Gregory M. Kapfhammer previously created to introduce a course in the field of algorithm analysis. You should revise all the technical content in these slides to fit into a course about document engineering. However, you should also use this content in these slides as good examples for what your generated slides should look like. Make sure to use similar formatting, layout, and content as the provided slides.
  • The slides that introduce the course should contain Python source code like that which you found in the index.qmd file in the root of the repository. Make sure that students can run this source code and see the output.
  • After finishing the slides that introduce the course, create more slides that introduce the following technologies and explain how to install them:
    • Terminal window
    • Git, GitHub, and GitHub CLI (i.e., gh)
    • Register for a free GitHub Student Developer Pack
    • Register for the free use of GitHub Copilot at the pro level
    • VS Code
    • uv (stress the use of uv for virtual environments, dependencies, and Python installation, especially focusing on the fact that a learner should not use alternative mechanisms for installing Python or any of the other aforementioned tasks like creating virtual environments)
    • Python 3.12 or 3.13 (which should come from using the uv tool)
    • Quarto
    • Quarto extension for VS Code
    • Customize VS Code by picking a theme and installing extensions
    • Npm and Node.js and all affiliated tools like npx
    • Google Gemini CLI (run through the use of npx)
    • OpenCode (run through the use of npx)
  • Ensure that the instructions in the slides from the previous step will work correctly regardless of the operating system (Windows, macOS, Linux).
  • Ensure that the instructions for installing each of the aforementioned tools clearly explain what the tool does, why it is important, and how it can help a prosegrammer to create, maintain, and analyze documents.
  • Ensure that the instructions for installing each of the aforementioned tools stress the importance of testing the setup to make sure that they are working. There should be links to online documentation that a learner can read if they have trouble installing or testing any of the tools.
  • Add content about the responsible use of artificial intelligence (AI) coding and writing tools that use large language models (LLMs) like Claude Sonnet 4 or GPT-4. Make it clear that the prosegrammer who uses these tools is ultimately responsible for wielding them correctly and ethically.

Create Slides for Week Four in slides/weekfour/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • The new slides that I want you to create should be in the file slides/weekfour/index.qmd. The purpose of these slides is to introduce the basics of Markdown and Quarto. Here are some basic features of Markdown and Quarto that you should include in these slides:
    • Structure of a Markdown document
    • Headings and subheadings
    • Paragraphs and line breaks
    • Bold and italic text
    • Lists (ordered and unordered)
    • Links and images
    • Code blocks and inline code
    • Tables
    • Blockquotes
    • Horizontal rules
    • Mathematical expressions
  • You can look at the other slide decks that I have already prepared to see how I am currently using Markdown and Quarto in my slides:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
  • I want students to be able to understand all of these examples and know how to write them on their own for their own documentation.
  • When you create all of these examples, make sure that they connect to the concepts of document engineering and prosegramming, as I have already defined in the contents of this GitHub repository. For instance, you can connect the need to write Markdown to the Markdown files that they write to document their own document engineering projects, where they are building tools that input and process and analyze text documents in JSON, YAML, Markdown, and plaintext. You can add slides that make suggestions on how they could use the features of Markdown inside of the README.md file for the tool. You can also add slides that explain how they could create a website for their tool using Quarto.
  • You must show the actual source code of the basic feature (e.g., bold and italic text) and then show how that actually renders. This means that the student should be able to look at the slide and see both the source code and the rendered output.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners.

Create Slides for Week Six in slides/weeksix/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • The new slides that I want you to create should be in the file slides/weeksix/index.qmd. The purpose of these slides is to introduce the basics of software testing. I previously created these slides for a course on Algorithm Analysis. However, I would like you to customize all of this content for document engineering.
  • You can look at the other slide decks that I have already prepared to see how I am currently using Markdown and Quarto in my slides:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
  • Please note that I am currently working on the slides in slides/weekfour/index.qmd that introduce Markdown and Quarto. So, please don't use those as an example because they are not yet complete.
  • I want students to understand the basics of software testing so that when they are building their document engineering tools they can also test them.
  • Please customize all the examples in the slides so that they connect to document engineering and are accessible to beginners. However, you should keep the simple DaysOfTheWeek source code because that one is easy to understand and accessible to beginners.
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. You should illustrate how to write test cases for a function like this one.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners.
  • I added some template content to start off the slides. You should keep the first and last slides, but make sure to customize them for a course on document engineering and the module that is about software testing.

Create Slides for Week Eight in slides/weekeight/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • The new slides that I want you to create should be in the file slides/weekeight/index.qmd. The purpose of these slides is to introduce the basics of Python data containers for document engineering. I previously created these slides for a course on Algorithm Analysis. However, I would like you to customize all of this content for document engineering.
  • You can look at the other slide decks that I have already prepared to see how I am currently using Markdown and Quarto in my slides:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • Please remember that I am currently working on the slides in slides/weekeight/index.qmd that introduce how to use data containers in Python. I want the slides to cover these topics:
    • Lists (single-dimensional and two-dimensional)
      • A single-dimensional list that stores five different documents
      • A two-dimensional list that stores a list of the same document but with different metadata or arising from different formats
    • Tuples and sets
      • A tuple that stores the title, author, and date of a document
      • A set that stores the unique keywords or tags associated with a document
    • Any other basic content about lists, sets, or tuples, including:
      • Creating and initializing lists, tuples, and sets
      • Accessing elements in lists, tuples, and sets
      • Adding and removing elements from lists, tuples, and sets
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use containers like lists, tuples, and sets in Python.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.

Create Slides for Week Nine in slides/weeknine/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering. All the slides that you create have to specifically connect to the theme of prosegramming and document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • The new slides that I want you to create should be in the file slides/weeknine/index.qmd. The purpose of these example slides is to introduce the basics of Python data containers for document engineering. However, I would like you to customize all of this content so that it is about using dictionaries for document engineering. I have only provided you these slides so that you can see what some introductory slides for my slide decks look like.
  • You can look at the other slide decks that I have already prepared to see how I am currently using Markdown and Quarto in my slides:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
    • slides/weekeight/index.qmd
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • Please remember that I am currently working on the slides in slides/weeknine/index.qmd that introduce how to use data containers in Python. I want the slides to cover these topics:
    • Dictionaries for storing data in document engineering
      • A dictionary that maps document IDs to document content
      • Dictionaries that have key-value pairs of different types
        • Dictionary that uses a string to map to a string
        • Dictionary that uses a string to map to an integer
        • Dictionary that uses a string to map to a list
        • Dictionary that uses a string to map to another dictionary
        • Dictionary that uses a string to map to a tuple
        • Dictionary that uses a string to map to a set
    • Reading in a JSON file and parsing it into a dictionary
    • Make sure that all work with JSON files uses only packages that are part of the standard Python library.
    • Do not use any external libraries in the code that you implement.
    • Any other basic content about dictionaries, including:
      • Creating and initializing a dictionary
      • Accessing a key or a value in a dictionary
      • Iterating through the keys or the values in a dictionary
      • Adding and removing elements from a dictionary
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use dictionaries in Python.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.
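
As a rough illustration of the dictionary topics listed above, the following sketch parses JSON text into a dictionary using only the standard library `json` module. The document IDs, field names, and values are hypothetical and are not taken from the course slides; the JSON is held in a string so the example stays self-contained.

```python
import json

# Hypothetical JSON text; in practice this might come from a file
# opened with open(...) and parsed with json.load(file).
raw = """
{
  "doc-001": {"title": "Style Guide", "words": 1200, "tags": ["markdown", "quarto"]},
  "doc-002": {"title": "Release Notes", "words": 300, "tags": ["changelog"]}
}
"""

documents = json.loads(raw)  # a JSON object becomes a Python dict

# Access a value by key, then a nested value.
print(documents["doc-001"]["title"])

# Iterate through the keys and values together.
for doc_id, info in documents.items():
    print(doc_id, info["words"], info["tags"])

# Add one entry and remove another.
documents["doc-003"] = {"title": "FAQ", "words": 450, "tags": []}
removed = documents.pop("doc-002")

# Combine the values that remain.
total_words = sum(info["words"] for info in documents.values())
print(total_words)
```

Each document ID is a string key that maps to another dictionary, which in turn maps strings to a string, an integer, and a list.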

Support for Content

Support for Index.qmd Revision Content

Definition and Etymology of "Prosegrammer"

  • The term "prosegrammer" effectively combines "prose" (written text) and "programmer" (software developer)
  • This portmanteau reflects the interdisciplinary nature of document engineering
  • Similar compound terms exist in technical fields (e.g., "bioinformatics" combines biology and informatics)

Document Engineering as Academic Field

  • Document engineering is recognized as a legitimate academic discipline combining computer science and technical communication
  • Research in this field includes automated document generation, content management systems, and text analytics
  • Universities offer courses in technical writing, computational linguistics, and information design that align with document engineering principles

Python for Document Processing

  • Python's string manipulation capabilities, regex support, and libraries like NLTK make it ideal for document analysis
  • The built-in str type provides methods for text cleaning and processing, and the standard-library string module supplies helpful constants such as string.punctuation
  • Dictionary-based word frequency analysis is a standard technique in natural language processing and computational linguistics
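
The dictionary-based counting mentioned above can be sketched as follows. This is a minimal version that assumes words are separated by whitespace and that all ASCII punctuation should be discarded; the `word_frequency` function in the course's index.qmd may be written differently.

```python
import string


def word_frequency(text: str) -> dict[str, int]:
    """Count how often each word appears, ignoring case and punctuation."""
    # Build a translation table that deletes every ASCII punctuation mark.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.lower().translate(table)
    counts: dict[str, int] = {}
    for word in cleaned.split():
        # dict.get supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "Prose and code. Code and prose!"
print(word_frequency(sample))
```

The standard library's `collections.Counter` would do the same job in fewer lines; the explicit loop is shown because it makes the dictionary update visible.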

Document Analysis Metrics

  • Word count, sentence count, and readability metrics are standard measures in content analysis
  • The Flesch-Kincaid readability score and similar metrics rely on words-per-sentence and syllables-per-word calculations
  • Document summary statistics help technical writers assess and improve content quality
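
A small sketch of these summary statistics appears below. It assumes that a sentence ends at a run of periods, exclamation marks, or question marks, which is a rough heuristic that abbreviations such as "e.g." will break; the function name and the keys of the returned dictionary are illustrative choices.

```python
import re


def document_summary(text: str) -> dict[str, float]:
    """Return word count, sentence count, and average words per sentence."""
    words = text.split()
    # Split on runs of ., !, or ? and drop the empty pieces that result.
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    # Guard against division by zero for an empty document.
    average = len(words) / len(sentences) if sentences else 0.0
    return {
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": average,
    }


summary = document_summary("Write clearly. Test often! Does it render?")
print(summary)
```

The words-per-sentence value is one of the inputs to readability formulas; counting syllables, which those formulas also need, is left out of this sketch.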

Support for Week One Slides Content

Document Engineering Definition and Scope

  • Document engineering combines computational methods with technical writing principles, as evidenced by academic programs at institutions like Carnegie Mellon and MIT that integrate technical communication with computer science
  • The field encompasses automated document generation, content analysis, and workflow optimization, supported by research in computational linguistics and information science

Python for Text Processing and Analysis

  • Python's built-in str methods, together with the standard-library string and re modules, provide robust text processing capabilities that are fundamental to document engineering tasks
  • Word frequency analysis using dictionary-based counting is a standard technique in natural language processing, as demonstrated in NLTK documentation and computational linguistics textbooks
  • Document statistics like word count, sentence count, and readability metrics are established measures in content analysis research

Development Tools for Document Engineering

  • VS Code with Quarto extension provides integrated development environment for combining code and prose, as documented in Quarto's official documentation
  • The uv package manager represents modern Python dependency management and builds on the packaging standards defined in Python Enhancement Proposals (PEPs)
  • Git and GitHub form the industry standard for version control and collaboration, with widespread adoption in both software development and academic publishing workflows

Cross-Platform Tool Installation

  • Installation instructions for uv, Python, and Quarto are designed to work across Windows, macOS, and Linux systems, following each tool's official documentation and installation guides
  • Command-line verification methods (--version flags) provide standard approaches for confirming successful installations across operating systems
  • Documentation links provided (uv docs, Quarto docs, VS Code docs) offer official troubleshooting resources maintained by tool creators
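
The --version check described above can also be scripted so that it behaves the same way on every operating system. The sketch below is one possible approach, not part of the course materials: the helper name `tool_version` is hypothetical, and it assumes each tool prints its version when given `--version`.

```python
import shutil
import subprocess
import sys
from typing import Optional


def tool_version(command: str) -> Optional[str]:
    """Return the first line printed by `<command> --version`, or None."""
    # shutil.which searches PATH in a cross-platform way.
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


for name in ("uv", "quarto", "git"):
    found = tool_version(name)
    print(f"{name}: {found if found else 'not found on PATH'}")

# The running interpreter is always available, so it makes a safe check.
print(tool_version(sys.executable))
```

A tool reported as "not found on PATH" is either not installed or installed in a location that the terminal does not search, which is a common cause of setup trouble.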

AI Tool Responsibility

  • Responsible use of AI coding assistants is emphasized in academic literature on AI ethics and educational guidelines from institutions implementing AI tools in curricula
  • The principle that users remain responsible for AI-generated content aligns with emerging best practices in AI-assisted writing and coding, as outlined by organizations like the ACM and IEEE

Support for Week Two Slides Content

Python as Beginner-Friendly Programming Language

  • Python's syntax is designed to be readable and intuitive, following the Zen of Python principle "Readability counts" (PEP 20)
  • Python consistently ranks among the top programming languages in IEEE Spectrum's annual programming language rankings
  • The language's emphasis on clear, English-like syntax reduces cognitive load for new programmers, as documented in educational research on programming language design

Python Collections for Document Engineering

  • Python's built-in collection types (strings, lists, tuples, dictionaries, sets) provide comprehensive data structures for organizing document information
  • String objects in Python are immutable sequences, making them safe for storing document content that shouldn't be accidentally modified
  • Lists provide mutable sequences ideal for document sections and chapters that may change during editing
  • Dictionaries implement hash tables for efficient key-value storage, perfect for document metadata and properties
  • Sets ensure unique elements, valuable for maintaining collections of document keywords and tags without duplicates
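
The collection types above can be seen side by side in a short sketch. All titles, names, and values here are made up for illustration.

```python
# A string holds document content; it cannot be changed in place.
title = "Style Guide"
try:
    title[0] = "s"  # type: ignore[index]
except TypeError:
    immutable = True  # strings reject item assignment

# A list holds sections that may change during editing.
sections = ["Introduction", "Usage"]
sections.append("Conclusion")

# A tuple holds a fixed record: (title, author, year).
record = (title, "A. Writer", 2025)

# A dictionary maps metadata keys to values.
metadata = {"format": "markdown", "words": 1200}
metadata["status"] = "draft"

# A set keeps only unique tags, so the duplicate is dropped.
tags = {"markdown", "quarto", "markdown"}

print(sections)
print(record[0])
print(metadata["status"])
print(sorted(tags))
```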

Sequence, Selection, and Iteration in Programming

  • These three fundamental programming concepts (sequence, selection, iteration) form the theoretical foundation of structured programming, as defined by computer science pioneers like Edsger Dijkstra
  • Sequential execution ensures predictable program behavior, essential for document processing workflows
  • Conditional statements (selection) enable adaptive document formatting and content generation based on different criteria
  • Loops (iteration) facilitate processing of document collections and repetitive text operations
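
A few lines are enough to show all three concepts working together. The file names below are hypothetical.

```python
documents = ["README.md", "notes.txt", "guide.md", "data.json"]

markdown_files = []
for name in documents:            # iteration: visit each file name once
    if name.endswith(".md"):      # selection: keep only Markdown files
        markdown_files.append(name)

# Sequence: these statements run in order, after the loop finishes.
count = len(markdown_files)
print(f"{count} Markdown files: {markdown_files}")
```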

Document Engineering Applications of Python Concepts

  • Text processing functions like word_frequency and document_summary demonstrate practical applications of Python for document analysis
  • Containment checking operations (in operator) are fundamental for search functionality in document management systems
  • Collection slicing enables extraction of document sections and content segments for analysis and manipulation
  • String methods like lower(), upper(), and title() provide essential text formatting capabilities for document standardization
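
The containment, slicing, and string-method operations listed above might look like this in practice. The sample heading and lines are invented for the example.

```python
heading = "  document ENGINEERING basics  "

# String methods standardize the text of a heading.
normalized = heading.strip().title()
print(normalized)

lines = ["# Title", "Intro paragraph.", "## Usage", "Run the tool.", "## License"]

# Slicing extracts a section of the document.
first_two = lines[:2]
after_usage = lines[3:]

# The in operator checks containment in a list and in a string.
has_usage = "## Usage" in lines
mentions_tool = "tool" in lines[3].lower()

print(first_two, after_usage, has_usage, mentions_tool)
```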

Python Type System for Document Engineering

  • Python's type hints (introduced in PEP 484) improve code readability and help catch errors in document processing functions
  • Strong typing helps ensure data integrity when working with document metadata and content
  • Type annotations serve as inline documentation, making code more maintainable for collaborative document engineering projects
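
A brief sketch of annotated document-processing functions follows; the function names are illustrative. Note that Python does not enforce type hints at run time, so catching errors with them relies on a separate checker such as mypy or on editor support.

```python
def count_words(text: str) -> int:
    """Return the number of whitespace-separated words in the text."""
    return len(text.split())


def longest_document(documents: dict[str, str]) -> str:
    """Return the ID of the document that contains the most words."""
    # The annotations say: keys are document IDs, values are document text.
    return max(documents, key=lambda doc_id: count_words(documents[doc_id]))


library = {"readme": "Install and run", "guide": "Install, configure, run, and test"}
print(longest_document(library))
```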

Support for Week Three Slides Content

Object-Oriented Programming for Document Engineering

  • Object-oriented programming (OOP) is a fundamental paradigm in software engineering that models real-world entities as objects with properties and behaviors
  • OOP principles (abstraction, inheritance, encapsulation, polymorphism) provide structured approaches to software design, as established in design patterns literature like the Gang of Four book "Design Patterns"
  • Document engineering benefits from OOP by modeling documents, sections, and processing operations as objects with well-defined interfaces

Document Classes and Inheritance

  • Class-based inheritance allows creation of specialized document types while maintaining shared functionality, following the Liskov Substitution Principle
  • The Document base class encapsulates common document properties (title, author, content, metadata) following encapsulation principles
  • Technical documents often require specialized attributes like complexity scoring and code examples, justifying inheritance hierarchies in document systems
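
One way to picture this hierarchy is sketched below. The attribute names follow the properties listed above, but the method names and the complexity formula are assumptions made for illustration, not the definitions used in the slides.

```python
from typing import Optional


class Document:
    """Base class holding the properties shared by every document."""

    def __init__(self, title: str, author: str, content: str,
                 metadata: Optional[dict[str, str]] = None) -> None:
        self.title = title
        self.author = author
        self.content = content
        self.metadata = metadata if metadata is not None else {}

    def word_count(self) -> int:
        return len(self.content.split())

    def describe(self) -> str:
        return f"{self.title} by {self.author} ({self.word_count()} words)"


class TechnicalDocument(Document):
    """Specialized document that also tracks its code examples."""

    def __init__(self, title: str, author: str, content: str,
                 code_examples: Optional[list[str]] = None) -> None:
        super().__init__(title, author, content)
        self.code_examples = code_examples if code_examples is not None else []

    def complexity_score(self) -> float:
        # Hypothetical metric: number of code examples per word of prose.
        words = self.word_count()
        return len(self.code_examples) / words if words else 0.0


guide = TechnicalDocument(
    "Parser Guide", "A. Writer", "Parse the file then print it",
    ["print('hi')", "open('f')"],
)
print(guide.describe())          # inherited from Document
print(guide.complexity_score())  # added by TechnicalDocument
```

Because a TechnicalDocument can be used anywhere a Document is expected, code written against the base class keeps working with the specialized type.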

Polymorphism in Document Processing

  • Polymorphic interfaces enable uniform processing of different document types, supporting the Open/Closed Principle of software design
  • Abstract base classes define contracts for document processors, ensuring consistent interfaces across different implementations
  • Duck typing in Python allows flexible object interactions based on behavior rather than inheritance, supporting agile document processing architectures
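
The sketch below contrasts an abstract base class with duck typing. The class names and the single `process` method are hypothetical; the point is only that `run_pipeline` accepts any object that offers a compatible `process` method.

```python
from abc import ABC, abstractmethod


class DocumentProcessor(ABC):
    """Contract: every processor turns text into text."""

    @abstractmethod
    def process(self, text: str) -> str:
        ...


class WhitespaceProcessor(DocumentProcessor):
    def process(self, text: str) -> str:
        # Collapse runs of whitespace into single spaces.
        return " ".join(text.split())


class UppercaseProcessor(DocumentProcessor):
    def process(self, text: str) -> str:
        return text.upper()


class ReverseWords:
    """Not a DocumentProcessor subclass, but it has a process method."""

    def process(self, text: str) -> str:
        return " ".join(reversed(text.split()))


def run_pipeline(text: str, processors: list) -> str:
    # Polymorphism: the same call works for every processor type.
    for processor in processors:
        text = processor.process(text)
    return text


result = run_pipeline(
    "  docs   as  code ",
    [WhitespaceProcessor(), UppercaseProcessor(), ReverseWords()],
)
print(result)
```

Trying to instantiate `DocumentProcessor` directly raises a `TypeError`, which is how the abstract base class enforces its contract.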

Composition for Document Generation

  • Composition over inheritance promotes flexible object assembly, as recommended in software design best practices
  • Document generators composed of sections provide greater flexibility than rigid inheritance hierarchies
  • The Strategy pattern (using composition) enables runtime selection of document generation strategies, supporting diverse output formats
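
A compact sketch of composition with a swappable strategy follows. Here the strategy is a plain function passed to the generator, which is one common way to express the pattern in Python; the renderer functions and the `DocumentGenerator` class are illustrative names.

```python
from typing import Callable

Renderer = Callable[[str, list[str]], str]


def render_markdown(title: str, sections: list[str]) -> str:
    lines = [f"# {title}"] + [f"## {name}" for name in sections]
    return "\n".join(lines)


def render_plaintext(title: str, sections: list[str]) -> str:
    lines = [title.upper()] + [f"- {name}" for name in sections]
    return "\n".join(lines)


class DocumentGenerator:
    """Holds sections and delegates formatting to a renderer strategy."""

    def __init__(self, title: str, renderer: Renderer) -> None:
        self.title = title
        self.sections: list[str] = []
        self.renderer = renderer  # composed, not inherited

    def add_section(self, name: str) -> None:
        self.sections.append(name)

    def generate(self) -> str:
        return self.renderer(self.title, self.sections)


generator = DocumentGenerator("Guide", render_markdown)
generator.add_section("Setup")
markdown_output = generator.generate()

# Swap the strategy at run time to get a different output format.
generator.renderer = render_plaintext
plaintext_output = generator.generate()

print(markdown_output)
print(plaintext_output)
```

Supporting a new output format means writing one more renderer function; the generator class itself does not change.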

OOP Principles Applied to Document Engineering

  • Abstraction hides complexity of document operations, allowing users to work with high-level document objects rather than low-level file operations
  • Encapsulation protects document state and ensures data integrity through controlled access methods
  • Inheritance creates type hierarchies for different document categories while maintaining shared behavior
  • Polymorphism enables extensible document processing systems that can handle new document types without modifying existing code

Interactive Code Examples

  • Pyodide enables browser-based Python execution, providing immediate feedback for educational content
  • Interactive examples reinforce learning through hands-on experimentation
  • Real-time code execution helps students understand OOP concepts through practical application in document engineering scenarios

Support for Week Two Skill-Check Slides Content

Document Engineering Skill-Check Assessment

  • Skill-checks provide formative assessment opportunities that measure student progress in programming skills, following educational best practices for frequent, low-stakes testing
  • Friday skill-checks create regular learning checkpoints that help students maintain consistent engagement with course material
  • GitHub Classroom provides industry-standard workflow experience, mirroring professional software development practices that students will encounter in their careers

Automated Assessment with GatorGrade

  • Automated testing and code quality checking reflects industry practices where continuous integration and automated testing are standard procedures
  • The gatorgrade tool provides objective, consistent assessment criteria that ensure fairness across all student submissions
  • Real-time feedback enables students to iteratively improve their solutions, supporting mastery-based learning approaches
  • Pytest integration follows Python testing best practices and prepares students for professional development workflows

Git Version Control Workflow

  • Regular commits and pushes reinforce version control best practices that are essential for collaborative software development
  • Git workflow mirrors professional development environments where frequent commits and proper version control are critical skills
  • GitHub Actions integration provides experience with automated testing pipelines common in modern software development

Programming Task Structure

  • TODO markers and function stubs provide scaffolding that supports learning progression from novice to expert, following educational research on cognitive load theory
  • Docstring-driven development encourages clear documentation practices essential for maintainable code
  • Type annotations requirement reinforces modern Python best practices and helps prevent runtime errors

Honor Code and Academic Integrity

  • Explicit honor code requirements establish ethical frameworks for academic work, particularly important when AI tools are available
  • Citation requirements for AI assistance teach responsible use of emerging technologies in educational settings
  • Individual assessment format ensures authentic demonstration of student learning and skill development

Support for Week Four Slides Content

Markdown as Lightweight Markup Language

  • Markdown was created by John Gruber in 2004 as a lightweight markup language designed to be easy to read and write in plain text form
  • The CommonMark specification, first released in 2014, standardizes Markdown syntax to ensure consistency across different implementations and platforms
  • GitHub Flavored Markdown (GFM) extends standard Markdown with features like tables, task lists, and syntax highlighting, making it the de facto standard for technical documentation

Markdown for Technical Documentation

  • Stack Overflow Developer Survey consistently shows Markdown as one of the most loved markup languages among developers for its simplicity and readability
  • Major platforms like GitHub, GitLab, Reddit, and Discord use Markdown for content creation, demonstrating its widespread adoption in technical communities
  • README files in Markdown format are industry standard for project documentation, with GitHub automatically rendering README.md files as project homepages

Quarto as Publishing Platform

  • Quarto is developed by Posit (formerly RStudio) and represents the evolution of scientific publishing tools, combining the best features of R Markdown, Jupyter notebooks, and modern web technologies
  • The ability to execute code blocks in multiple languages (Python, R, Julia) makes Quarto suitable for reproducible research and technical documentation
  • Quarto's support for multiple output formats (HTML, PDF, Word, presentations) follows the principle of single-source publishing used in professional technical writing

Document Engineering Workflow Applications

  • Version control of documentation with Git follows software engineering best practices, enabling collaborative editing and change tracking for technical documents
  • Automated documentation generation from code comments and Markdown source files is standard practice in software development, using tools like Sphinx, mkdocs, and Quarto
  • The concept of "docs-as-code" treats documentation with the same rigor as source code, applying version control, review processes, and automated testing

Accessibility and SEO Benefits

  • Alt text for images is required by Web Content Accessibility Guidelines (WCAG) 2.1 Success Criterion 1.1.1 (Level A, and therefore also part of Level AA conformance), making content accessible to users with visual impairments
  • Semantic HTML structure generated from Markdown headings improves search engine optimization (SEO) and document navigation
  • Proper heading hierarchy (H1, H2, H3, H4) creates logical document structure essential for screen readers and automated content analysis

Mathematical Expression Support

  • MathJax and KaTeX provide browser-based rendering of LaTeX mathematical notation, enabling complex mathematical expressions in web documents
  • Mathematical markup in documentation is essential for technical fields like data science, engineering, and computer science algorithm documentation
  • The example document quality formula demonstrates how mathematical concepts can be applied to evaluate documentation effectiveness

Interactive Code Execution

  • Pyodide enables client-side Python execution in web browsers, providing immediate feedback for educational content without requiring server resources
  • Interactive code examples improve learning outcomes by allowing students to experiment with code modifications and see immediate results
  • Live code execution in documentation serves as both tutorial and testing mechanism, ensuring code examples remain functional and up-to-date

Support for Week Six Slides Content

Quarto and Markdown for Document Engineering and Prosegramming

  • Quarto is developed by Posit and is widely used for technical publishing, supporting reproducible research and professional documentation (see Quarto documentation).
  • Markdown is the de facto standard for readable, plain-text documentation in software projects, with widespread adoption on platforms like GitHub, Stack Overflow, and Discord.
  • Combining Quarto and Markdown enables prosegrammers to automate, analyze, and publish documentation that is clear, interactive, and professional (see Quarto and Markdown official docs).
  • The "docs-as-code" approach treats documentation with the same rigor as source code, applying version control, review, and automated testing (see Sphinx, mkdocs, Quarto docs-as-code philosophy).
  • Document engineering blends code and prose to create resources for both humans and machines, as supported by academic research in technical communication and computational linguistics.
  • Mastery of Quarto and Markdown transforms coders into document engineers—prosegrammers who craft content that informs, inspires, and endures (see ACM/IEEE guidelines on technical documentation).

Software Testing for Document Engineering Tools

  • Software testing principles apply directly to document processing systems, ensuring reliability and correctness in text analysis, parsing, and generation
  • The IEEE Standard for Software Unit Testing (IEEE 1008) describes a unit testing process and IEEE 829 covers test documentation; both provide established methodologies that adapt well to document processing validation
  • Test-driven development practices help create robust document analysis functions by defining expected behavior before implementation

Document Analysis Testing Best Practices

  • Testing document processing functions requires validation of text parsing, content extraction, and format conversion accuracy
  • Edge cases in document processing include empty documents, malformed markup, encoding issues, and extremely large text files
  • Automated testing frameworks like pytest enable systematic validation of document engineering tools across diverse input scenarios
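
The edge cases listed above can be written down as a small table of inputs and expected results. The following is a minimal sketch using only plain assertions; `count_words` is a hypothetical helper invented for illustration, not a function from the course materials.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words (hypothetical helper for illustration)."""
    return len(text.split())


# Each pair is (input document, expected word count); all values are made up.
edge_cases = [
    ("", 0),                       # empty document
    ("   \n\t  ", 0),              # whitespace only
    ("naïve café résumé", 3),      # non-ASCII characters
    ("word " * 100_000, 100_000),  # very large input
]

for text, expected in edge_cases:
    assert count_words(text) == expected

print("all edge cases passed")
```

With pytest installed, the same table could be handed to `@pytest.mark.parametrize` so that each row is reported as a separate test.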

Python Testing Ecosystem for Document Tools

  • pytest provides parameterized testing capabilities ideal for testing document processing functions with varied input formats and content types
  • coverage.py helps ensure comprehensive testing of document analysis code paths, critical for tools that process diverse document structures
  • Property-based testing with the Hypothesis library generates diverse text inputs to discover edge cases in document processing algorithms

Testing Integration with Document Workflows

  • Continuous integration testing ensures document processing tools remain reliable as codebases evolve and new document formats are supported
  • Mutation testing with tools like mutmut validates test suite quality by introducing controlled defects to verify test detection capabilities
  • Performance testing of document processing tools helps identify bottlenecks in text analysis and generation pipelines

Support for Week Eight Slides Content

Python Data Containers for Document Engineering

  • Python's built-in container types (lists, tuples, sets) provide fundamental data structures for document processing without requiring external libraries
  • Lists enable mutable sequences ideal for document collections that change during processing workflows
  • Tuples offer immutable records perfect for document metadata that should remain constant
  • Sets automatically handle uniqueness, making them ideal for keyword and tag management in document categorization systems

Container Characteristics and Use Cases

  • Lists: Mutable and ordered containers that allow duplicates, supporting dynamic document collections and hierarchical structures like chapter organizations
  • Tuples: Immutable and ordered containers that allow duplicates, ensuring document metadata integrity and providing structured access to fixed properties
  • Sets: Mutable and unordered containers that prevent duplicates, enabling efficient keyword deduplication and set-based document analysis operations
  • Container choice depends on document engineering requirements: mutability needs, ordering requirements, and duplicate handling preferences
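
The three container behaviors summarized above can be seen side by side in a short sketch. The chapter names, metadata values, and keywords below are made-up sample data.

```python
# List: ordered and mutable, so a document collection can grow.
chapters = ["intro", "methods", "results"]
chapters.append("conclusion")

# Tuple: ordered and immutable, so fixed metadata cannot be reassigned.
metadata = ("Style Guide", "2024-01-15", "md")  # (title, date, format)
try:
    metadata[0] = "New Title"
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Set: unordered and unique, so the repeated keyword is stored once.
keywords = {"python", "markdown", "python", "quarto"}

print(len(chapters), tuple_changed, sorted(keywords))
```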

Document Processing with Container Integration

  • Combining containers (e.g., lists of tuples, sets for unique words) enables sophisticated document analysis workflows
  • Type hints (List[str], Set[str], Dict[str, Any]) improve code readability and enable static analysis tools to catch errors
  • Container operations like list comprehensions and set operations provide efficient document processing without complex algorithms
  • Real-world document engineering applications include content management systems, automated documentation generators, and text analysis tools
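
As a sketch of combining containers with type hints, the function below takes a list of tuples and returns a set of unique words. The name `unique_words` and the sample documents are assumptions made for this example.

```python
from typing import List, Set, Tuple


def unique_words(documents: List[Tuple[str, str]]) -> Set[str]:
    """Collect the distinct lowercase words across (title, text) pairs."""
    return {word.lower() for _title, text in documents for word in text.split()}


# Made-up sample data: each tuple is (file name, document text).
docs = [("a.md", "Write clear docs"), ("b.md", "Clear docs help readers")]
vocabulary = unique_words(docs)
print(sorted(vocabulary))
```

The set comprehension removes duplicates such as "Clear" and "clear" in one step, which is why no explicit loop or membership check is needed.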

Interactive Code Examples in Educational Slides

  • Pyodide enables browser-based Python execution, allowing students to experiment with container operations immediately
  • Code examples demonstrate practical document engineering scenarios like file organization, metadata management, and keyword analysis
  • Interactive examples reinforce learning through hands-on experimentation and immediate feedback
  • Progressive complexity from basic operations to integrated document analysis builds student confidence and understanding

Support for Week Nine Slides Content

Python Dictionaries for Document Engineering

  • Dictionaries are Python's implementation of hash tables, providing O(1) average-case lookup time for key-value pairs, making them ideal for document indexing and metadata storage
  • The dictionary data structure maps keys to values, mirroring real-world document organization systems like catalogs, indexes, and bibliographic databases
  • Python dictionaries preserve insertion order (a language guarantee since Python 3.7), enabling predictable iteration through document collections while preserving fast lookups
  • Dictionary comprehensions and methods like get(), items(), keys(), and values() provide efficient document processing operations
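
A minimal sketch of these dictionary operations, assuming a made-up catalog that maps file names to word counts:

```python
# Hypothetical catalog: file name -> word count.
catalog = {"guide.md": 1200, "api.md": 3400}
catalog["faq.md"] = 800

# get() returns a default instead of raising KeyError for a missing key.
missing = catalog.get("notes.md", 0)

# Keys come back in insertion order.
order = list(catalog.keys())

# A dictionary comprehension filters the catalog in one expression.
long_docs = {name: count for name, count in catalog.items() if count > 1000}

print(missing, order, long_docs)
```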

JSON and Dictionary Interoperability

  • JSON (JavaScript Object Notation) is defined by RFC 8259 and serves as the standard data interchange format for web APIs and configuration files
  • Python's json module in the standard library enables seamless conversion between JSON strings and Python dictionaries using json.loads() and json.dumps()
  • JSON's object notation directly maps to Python dictionaries, with JSON arrays becoming Python lists and nested objects becoming nested dictionaries
  • Modern document management systems extensively use JSON for metadata storage, configuration files, and API communication
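
The mapping between JSON and Python types can be checked with a short round trip. The JSON string below is invented sample metadata.

```python
import json

# Made-up metadata record as a JSON string.
raw = '{"title": "Style Guide", "tags": ["docs", "python"], "stats": {"words": 1200}}'

# JSON object -> dict, JSON array -> list, nested object -> nested dict.
record = json.loads(raw)
record["stats"]["words"] += 50

# Serialize back to a JSON string; indent and sort_keys only affect layout.
text = json.dumps(record, indent=2, sort_keys=True)
print(text)
```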

Dictionary Operations for Document Management

  • Key existence checking with the in operator provides O(1) average-case performance, enabling efficient document verification in catalogs
  • Dictionary update() method merges multiple document collections, supporting aggregation workflows common in document management systems
  • The del statement removes a document entry from a catalog or index but raises KeyError if the key is absent, while pop() with a default value removes an entry safely even when the key is missing
  • Iterating through keys(), values(), and items() provides flexible access patterns for document processing pipelines
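
The following sketch walks through merging, membership testing, and removal on a small catalog. File names and status values are made up for illustration.

```python
# Hypothetical catalogs: file name -> publication status.
catalog = {"guide.md": "draft", "api.md": "published"}
incoming = {"faq.md": "draft", "guide.md": "published"}

# update() merges; on duplicate keys the incoming value wins.
catalog.update(incoming)

has_api = "api.md" in catalog

# pop() with a default never raises, even for an absent key.
removed = catalog.pop("faq.md", None)
absent = catalog.pop("missing.md", None)

# del removes a key that is known to exist.
del catalog["api.md"]

for name, status in catalog.items():
    print(name, status)
```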

Nested Dictionaries for Complex Document Structures

  • Nested dictionaries model hierarchical document structures like document catalogs with metadata, sections with subsections, and bibliographic entries with detailed attributes
  • Accessing nested values with bracket notation (dict[key1][key2]) provides clear syntax for retrieving specific document properties
  • Complex document schemas benefit from nested dictionary structures that can represent arbitrary levels of organization
  • Type hints like Dict[str, Dict[str, Any]] document nested structure expectations and enable static type checking
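
A brief sketch of nested access, using an invented library structure. The chained `get()` calls are one possible way to read a nested value without risking a KeyError; they are an illustration, not a required pattern.

```python
from typing import Any, Dict

# Made-up nested catalog: file name -> metadata -> statistics.
library: Dict[str, Dict[str, Any]] = {
    "guide.md": {"title": "Style Guide", "stats": {"words": 1200, "sections": 5}},
}

# Bracket notation reads a value that is known to exist.
words = library["guide.md"]["stats"]["words"]

# Chained get() calls fall back to a default when any level is missing.
fallback = library.get("missing.md", {}).get("stats", {}).get("words", 0)

print(words, fallback)
```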

Dictionaries with Varied Value Types

  • Python dictionaries support heterogeneous value types, allowing storage of strings, integers, lists, tuples, sets, and nested dictionaries in the same container
  • Document metadata naturally requires mixed types: strings for titles, integers for word counts, lists for authors, and dictionaries for nested properties
  • Using lists as dictionary values enables storage of document sections, version histories, and related file collections
  • Using sets as dictionary values automatically deduplicates tags, keywords, and categories associated with documents

Practical Document Engineering Applications

  • Word frequency analysis with dictionaries demonstrates a fundamental text processing technique used in search engines and content analysis tools
  • Document indexing with inverted indexes (mapping words to document IDs) enables efficient full-text search functionality
  • Metadata filtering and querying operations support document library management systems used in academic and professional settings
  • JSON-based configuration files for document processing tools follow industry standard practices in software development
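
Word frequency counting and an inverted index can be built in the same pass over a collection. This is a minimal sketch with two made-up documents; real tools would also normalize case and punctuation.

```python
# Made-up collection: document ID -> text.
documents = {
    "doc1": "markdown makes docs readable",
    "doc2": "quarto renders markdown docs",
}

frequency = {}  # word -> number of occurrences
index = {}      # word -> set of document IDs containing it (inverted index)

for doc_id, text in documents.items():
    for word in text.split():
        frequency[word] = frequency.get(word, 0) + 1
        index.setdefault(word, set()).add(doc_id)

print(frequency["markdown"], sorted(index["markdown"]))
```

Looking up `index["quarto"]` then answers "which documents mention this word?" without rescanning any text.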

Educational Slide Design Principles

  • Progressive complexity introduction starts with basic dictionary creation and advances to nested structures and JSON parsing
  • Interactive Pyodide code blocks enable hands-on learning and immediate feedback for dictionary operations
  • Concrete document engineering examples connect abstract programming concepts to practical prosegrammer applications
  • Clear slide titles, appropriate icon usage, and incremental content display support effective learning and presentation quality

Support for Week Eleven Slides Content

Searching and Sorting for Document Engineering

  • Searching and sorting are fundamental algorithmic operations that directly apply to document engineering workflows including indexing, retrieval, and organization of large document collections
  • Binary search provides O(log n) lookup time for sorted document indexes, significantly faster than linear search's O(n) complexity for large collections
  • Sorting algorithms organize documents by various criteria (date, title, relevance score) enabling efficient browsing and retrieval in documentation systems, digital libraries, and content management platforms
  • Real-world applications include search engine indexing, bibliography sorting, API function alphabetization, and chronological blog post organization

Big-O Notation for Algorithm Analysis

  • Big-O notation describes algorithmic time complexity in terms of input size, providing a standardized way to compare algorithm efficiency as documented in computer science literature (Cormen et al., "Introduction to Algorithms")
  • O(1) represents constant time operations like dictionary key lookups, O(log n) represents logarithmic operations like binary search, O(n) represents linear operations like sequential search, and O(n²) represents quadratic operations like nested iterations
  • Understanding big-O notation helps document engineers select appropriate algorithms for different dataset sizes and performance requirements
  • Time complexity analysis is essential for building scalable document processing systems that handle growing content volumes

Binary Search in Document Systems

  • Binary search requires sorted data and provides logarithmic search time by repeatedly dividing the search space in half, making it efficient for large sorted document indexes
  • Practical applications include searching documentation indexes, alphabetized API references, and sorted bibliographic databases
  • The algorithm's O(log n) complexity means search time grows slowly even as document collections scale to millions of entries
  • Binary search demonstrates the performance benefits of maintaining sorted data structures in document management systems
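
The halving step described above can be sketched with an iterative function over a sorted list of page titles. The function name and titles are illustrative assumptions; Python's standard `bisect` module offers a ready-made alternative.

```python
def binary_search(sorted_titles, target):
    """Return the index of target in sorted_titles, or -1 if it is absent."""
    low, high = 0, len(sorted_titles) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_titles[middle] == target:
            return middle
        if sorted_titles[middle] < target:
            low = middle + 1   # discard the lower half
        else:
            high = middle - 1  # discard the upper half
    return -1


# Binary search only works on sorted data, so sort the made-up titles first.
titles = sorted(["install", "api", "faq", "tutorial", "changelog"])
print(binary_search(titles, "faq"), binary_search(titles, "zebra"))
```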

Sorting Algorithms for Document Organization

  • Merge sort provides O(n log n) worst-case performance and stable sorting, making it suitable for sorting documents while preserving original order of equal elements
  • Quick sort offers average O(n log n) performance with in-place sorting, reducing memory overhead for large document collections
  • Practical document engineering applications include sorting blog posts by publication date, organizing API functions alphabetically, and ranking search results by relevance scores
  • Python's built-in sorted() function and .sort() method use Timsort, a hybrid algorithm combining merge sort and insertion sort optimized for real-world data patterns
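
A short sketch of sorting blog posts by date with `sorted()` and a key function. The post titles and dates are made up; two posts share a date to show that the stable sort keeps them in their original relative order.

```python
# Made-up posts: (title, ISO 8601 publication date).
posts = [
    ("Release notes", "2024-03-01"),
    ("Getting started", "2024-01-10"),
    ("API changes", "2024-03-01"),
    ("FAQ", "2024-02-05"),
]

# ISO 8601 date strings sort chronologically when compared as text.
by_date = sorted(posts, key=lambda post: post[1])
titles_by_date = [title for title, _date in by_date]

# Sorting on the title gives an alphabetical listing instead.
alphabetical = sorted(title for title, _date in posts)

print(titles_by_date)
print(alphabetical)
```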

Document Engineering Context for Algorithms

  • Search functionality in documentation systems (ReadTheDocs, Sphinx) relies on indexing and search algorithms to quickly locate relevant content
  • Content management systems sort articles by date, category, and popularity using efficient sorting algorithms
  • Alphabetical organization of API documentation improves discoverability and navigation for developers consulting reference materials
  • Understanding algorithmic complexity helps prosegrammers make informed decisions about data structure choices and algorithm selection for document processing tasks

Beginner-Friendly Algorithm Presentation

  • Simplified recursion coverage focuses on practical usage rather than theoretical computer science details, making concepts accessible to beginners
  • Concrete examples with small datasets (5-10 elements) demonstrate algorithm behavior without overwhelming complexity
  • Visual representation through code examples shows step-by-step algorithm execution with actual document-related data
  • Focus on "when to use" rather than "how to implement" supports practical application over theoretical analysis

Support for Week Ten Slides Content

File Input/Output for Document Engineering

  • File I/O operations are fundamental to document engineering, enabling prosegrammers to read source documents, write processed output, and persist analysis results
  • Python's built-in open() function provides basic file access with modes ('r' for reading, 'w' for writing, 'a' for appending) as documented in Python's official documentation
  • Context managers using the with statement ensure proper file closure and resource management, following Python best practices outlined in PEP 343
  • File operations enable the entire document processing pipeline: ingestion, transformation, analysis, and output generation
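
The three file modes and the `with` statement can be tried out safely in a temporary directory. The file name and contents below are assumptions for the example.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "notes.txt")

with open(path, "w", encoding="utf-8") as handle:   # 'w' creates or overwrites
    handle.write("first line\n")

with open(path, "a", encoding="utf-8") as handle:   # 'a' appends to the end
    handle.write("second line\n")

with open(path, "r", encoding="utf-8") as handle:   # 'r' reads
    content = handle.read()

# The with statement closed the file when the block ended.
closed_after_block = handle.closed
print(content, closed_after_block)
```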

pathlib.Path for Cross-Platform File Management

  • The pathlib module (introduced in Python 3.4, PEP 428) provides object-oriented filesystem path handling that works consistently across Windows, macOS, and Linux
  • Path objects offer methods like read_text(), write_text(), exists(), and iterdir() that abstract platform-specific filesystem differences
  • Using pathlib.Path instead of string-based paths prevents common errors from backslash/forward slash differences between operating systems
  • Modern Python projects prefer pathlib over older os.path module for filesystem operations, as recommended in Python documentation
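
A minimal sketch of the same kind of file work with `pathlib.Path`, again inside a temporary directory. The `drafts` folder and `intro.md` file are hypothetical names.

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# The / operator joins path parts with the right separator for the platform.
draft = root / "drafts" / "intro.md"
draft.parent.mkdir(parents=True, exist_ok=True)

draft.write_text("# Introduction\n", encoding="utf-8")
text = draft.read_text(encoding="utf-8")

# iterdir() lists the entries of a directory.
names = sorted(child.name for child in draft.parent.iterdir())
print(draft.exists(), draft.suffix, names)
```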

JSON as Structured Document Format

  • JSON (JavaScript Object Notation) is defined by RFC 8259 as a lightweight, human-readable data interchange format widely used for configuration files, APIs, and structured document storage
  • Python's standard library json module provides json.loads() for parsing JSON strings into dictionaries and json.dumps() for serializing dictionaries to JSON format
  • JSON's nested object structure naturally maps to Python dictionaries, enabling seamless integration between document storage and processing
  • Document metadata, bibliographic records, and structured content benefit from JSON's combination of readability and machine-parseability

Document Analysis Through Counting Operations

  • Counting operations are fundamental text analysis techniques used in search engines, content recommendation systems, and bibliometric research
  • Frequency analysis identifies important terms, common patterns, and statistical properties of document collections
  • Dictionary-based counting (using dict.get() with default values) provides efficient accumulation of frequency statistics
  • Count-based metrics enable document classification, similarity detection, and quality assessment

Statistical Aggregations for Document Collections

  • Computing minimum, maximum, and average values provides statistical summaries essential for understanding document collection characteristics
  • Python's built-in min(), max(), and sum() functions work with dictionary values to compute aggregate statistics
  • Statistical analysis of document properties (word counts, author counts, keyword frequencies) supports collection management and quality control
  • Unique value counting and frequency distributions reveal collection diversity and content patterns
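
The aggregate functions named above can be applied directly to a dictionary's values. The word counts here are made-up sample data.

```python
# Hypothetical collection: file name -> word count.
word_counts = {"guide.md": 1200, "api.md": 3400, "faq.md": 800}

values = word_counts.values()
summary = {
    "documents": len(word_counts),
    "min": min(values),
    "max": max(values),
    "average": sum(values) / len(word_counts),
}
print(summary)
```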

Complete Analysis Pipelines

  • End-to-end document processing workflows combine reading, parsing, analyzing, and summarizing into integrated pipelines
  • Modular function design (separate functions for reading, parsing, analyzing) follows software engineering best practices and enables code reuse
  • Pipeline architecture mirrors professional document processing systems used in content management, digital libraries, and publishing workflows
  • Writing analysis results back to JSON files closes the processing loop, enabling iterative refinement and long-term storage of insights
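
One way to sketch such a pipeline is with three small functions for reading, analyzing, and writing. The function names, file names, and the `tags` field are all assumptions made for this example, and the input file is created in a temporary directory so the sketch is self-contained.

```python
import json
import tempfile
from pathlib import Path


def read_records(path):
    """Read a JSON file and parse it into a dictionary."""
    return json.loads(path.read_text(encoding="utf-8"))


def analyze(records):
    """Count documents and how often each tag appears."""
    tag_counts = {}
    for record in records.values():
        for tag in record["tags"]:
            tag_counts[tag] = tag_counts.get(tag, 0) + 1
    return {"documents": len(records), "tag_counts": tag_counts}


def write_summary(summary, path):
    """Write the summary dictionary out as a JSON file."""
    path.write_text(json.dumps(summary, indent=2), encoding="utf-8")


# Create made-up input data, then run the pipeline end to end.
folder = Path(tempfile.mkdtemp())
source = folder / "documents.json"
source.write_text(json.dumps({
    "guide.md": {"tags": ["howto", "python"]},
    "api.md": {"tags": ["reference", "python"]},
}), encoding="utf-8")

summary = analyze(read_records(source))
write_summary(summary, folder / "summary.json")
print(summary)
```

Keeping each stage in its own function means the analysis step can be tested with an in-memory dictionary, without touching the filesystem.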

Document Engineering with Standard Library Only

  • Using only Python's standard library (open(), pathlib, json) ensures maximum portability and minimal dependency management
  • Standard library tools provide sufficient functionality for most document processing tasks without requiring external packages
  • Learning standard library approaches builds foundational understanding before introducing specialized libraries like pandas or nltk
  • Dependency-free code simplifies deployment, reduces maintenance burden, and works across different Python environments

Create Slides for Week Ten in slides/weekten/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering. All the slides that you create have to specifically connect to the theme of prosegramming and document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • The new slides that I want you to create should be in the file slides/weekten/index.qmd. The purpose of these slides is to introduce how to read and write files from the filesystem in Python. The files that will be input and output are JSON files. I would like you to create all of this content so that it is about using JSON files for document engineering.
  • Please remember that I am currently working on the slides in slides/weekten/index.qmd that introduce how to use file input and output in Python. I want the slides to cover these topics:
    • File input and output using the open() function
    • Reading and writing files using the read() and write() methods
    • File input and output using the pathlib module and the Path class
    • Using the json module to parse and analyze JSON files
    • Reading in a JSON file and parsing it into a dictionary
    • Analyzing the contents of a JSON file that has been parsed into a dictionary according to the following analyses:
      • Counting the number of key, value pairs in the dictionary
      • Counting the min, max, and average count of unique values for all the values in the dictionary
    • Creating a dictionary to store the summary data of the counts of the unique values stored in the dictionary
    • Writing the summary data dictionary back out to a new JSON file
    • Make sure that you have an initial JSON file that can be input by these code examples in the slides. You should create a JSON file that contains sample data that can be used for document engineering analysis.
    • Make sure that all work with JSON files and their input and output only uses packages that are part of the standard Python library.
    • Do not use any external libraries in the code that you implement.
  • You can look at the other slide decks that I have already prepared:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
    • slides/weekeight/index.qmd
    • slides/weeknine/index.qmd ... and see how I am currently using Markdown and Quarto in my slides!
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use containers like lists, tuples, and sets in Python.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.
  • Always include "signposting" slides that clearly state what you are going to explain in the next block of slides about a specific topic. These signposting slides are at the # level in the Markdown file.
  • Keep the total number of slides to a count less than the prior slide decks that I have created. For instance, this topic is less complicated than the material that I produced for weeknine and thus the slide count should be less.

Create Slides for Week Twelve in slides/weektwelve/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering. All the slides that you create have to specifically connect to the theme of prosegramming and document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • Please remember that I am currently working on the slides in slides/weektwelve/index.qmd that introduce how to create and use regular expressions for the purposes of document engineering. I want the slides to define what a regular expression is and explain their benefits and drawbacks.
  • I have already written some slides and I don't want you to delete them. However, I need you to add more slides about tasks like:
    • Create a regular expression in Python
    • The various components of a regular expression in Python
    • How to use regular expressions in Python
    • How to perform pattern matching with a regular expression
    • How to use regular expressions in Python for document engineering
    • How to test a program (maybe use unittest) using a regex
    • Other simple and easy to understand topics about regex
  • You can look at the other slide decks that I have already prepared:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
    • slides/weekeight/index.qmd
    • slides/weeknine/index.qmd
    • slides/weekten/index.qmd
    • slides/weekeleven/index.qmd ... and see how I am currently using Markdown and Quarto in my slides!
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use containers like lists, tuples, and sets in Python.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.
  • Always include "signposting" slides that clearly state what you are going to explain in the next block of slides about a specific topic. These signposting slides are at the # level in the Markdown file.
  • Do not create slides that break the formatting or use icons that do not exist. You should use short titles that fit on a single line. You should use bullet points for ideas and make them short and clear.
  • Keep the total number of slides about the same as in the prior slide decks that I have created.

Support for Week Twelve Slides Content

Regular Expressions for Document Engineering

  • Regular expressions (regex) are patterns used to match character combinations in text, as defined by formal language theory and implemented across virtually all programming languages
  • Python's re module in the standard library provides comprehensive regex functionality with a syntax similar to Perl's regular expressions, though it is not identical to the syntax of the PCRE library
  • Regex patterns enable efficient text parsing, validation, extraction, and transformation operations essential to document engineering workflows
  • Common document engineering applications include email validation, date extraction, markdown parsing, and structured text processing

Regex Components and Syntax

  • Metacharacters (., ^, $, *, +, ?, {n,m}) provide pattern building blocks as standardized in IEEE POSIX regex specifications
  • Character classes ([abc], \d, \w, \s) define sets of matching characters, following standard regex notation used across programming languages
  • Quantifiers control repetition in patterns, enabling flexible matching of variable-length text segments in documents
  • The re.compile() function builds a reusable pattern object; Python's official documentation notes that this is more efficient when a pattern is used several times, although the re module also caches recently compiled patterns
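
The components above can be combined into small compiled patterns. The two patterns below are illustrative assumptions: one for Markdown headings of level one to three and one for version strings such as `v1.2`.

```python
import re

# ^ anchors at the start, #{1,3} is a quantifier, \s and \w are character classes.
heading = re.compile(r"^#{1,3}\s+\w+")

# \d+ means one or more digits; \. matches a literal dot.
version = re.compile(r"v\d+\.\d+")

is_heading = heading.search("## Setup") is not None
too_deep = heading.search("#### Too deep") is not None
versions = version.findall("Upgraded from v1.2 to v2.0")

print(is_heading, too_deep, versions)
```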

Pattern Matching Methods in Python

  • re.search() finds the first match anywhere in a string, ideal for locating patterns within large documents
  • re.match() matches only at the start of a string, useful for validating document format headers and structured text beginnings
  • re.findall() returns all non-overlapping matches, supporting comprehensive pattern extraction from document collections
  • re.sub() performs pattern-based substitution, enabling automated document cleaning and transformation pipelines
  • re.split() divides strings on pattern matches, facilitating document parsing and tokenization
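
The five functions can be compared on one made-up sentence. This is a sketch of typical usage, with the date pattern and sample text chosen for illustration.

```python
import re

text = "Draft 2 of the guide was revised on 2024-03-01 and 2024-04-15."
date_pattern = r"\d{4}-\d{2}-\d{2}"

first = re.search(date_pattern, text)        # first match anywhere
at_start = re.match(r"Draft \d+", text)      # matches only at the start
not_at_start = re.match(date_pattern, text)  # None: text does not start with a date
all_dates = re.findall(date_pattern, text)   # every non-overlapping match
redacted = re.sub(date_pattern, "<date>", text)
parts = re.split(r"\s*,\s*", "docs , python,markdown")

print(first.group(), at_start.group(), not_at_start)
print(all_dates, parts)
print(redacted)
```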

Document Engineering Applications of Regex

  • Date extraction from documents uses patterns like \d{4}-\d{2}-\d{2} to identify ISO 8601 formatted dates common in technical documentation
  • Email validation patterns ensure proper format in contact information and bibliographic metadata
  • Markdown syntax parsing relies on regex to identify headers, links, code blocks, and formatting markers
  • Log file analysis and structured text parsing benefit from regex pattern matching for extracting relevant information
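
As a sketch of Markdown parsing with regex, the patterns below pull headings and inline links out of a made-up snippet. They are deliberately simplified assumptions and ignore cases such as headings inside code blocks.

```python
import re

# Made-up Markdown source.
markdown = """# Title
Some text with a [link](https://example.com) inside.
## Section One
See [docs](https://example.org/docs).
"""

# re.MULTILINE lets ^ and $ match at the start and end of every line.
headings = re.findall(r"^(#{1,6})\s+(.+)$", markdown, flags=re.MULTILINE)

# Capture the link text and the URL of each inline link.
links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", markdown)

print(headings)
print(links)
```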

Testing Regular Expressions

  • The unittest framework provides structured testing for regex patterns, ensuring pattern reliability across different text inputs
  • Test cases should cover positive matches (valid patterns), negative matches (invalid patterns), and edge cases (boundary conditions, special characters)
  • Regex testing validates pattern correctness before deployment in production document processing systems
  • Tools like regex101.com provide interactive regex testing and debugging with visual pattern explanation
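
A minimal unittest sketch covering a positive match, a negative match, and edge cases for a date pattern. The class name and the choice of `fullmatch()` are assumptions for this example; the suite is run explicitly so the result can be inspected.

```python
import re
import unittest

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


class TestIsoDate(unittest.TestCase):
    def test_valid_date_matches(self):
        self.assertIsNotNone(ISO_DATE.fullmatch("2024-03-01"))

    def test_other_format_is_rejected(self):
        self.assertIsNone(ISO_DATE.fullmatch("03/01/2024"))

    def test_edge_cases(self):
        # The pattern checks shape only, so an impossible date still matches.
        self.assertIsNotNone(ISO_DATE.fullmatch("2024-99-99"))
        self.assertIsNone(ISO_DATE.fullmatch(""))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIsoDate)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

The edge case test documents a known limit of the pattern: validating that a date actually exists needs more than a regex, for example `datetime.date.fromisoformat()`.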

Benefits and Limitations of Regex

  • Benefits include powerful pattern matching, built-in language support, concise syntax for complex text operations, and widespread adoption across tools and platforms
  • Limitations include readability challenges with complex patterns, performance concerns with catastrophic backtracking, and difficulty debugging intricate regular expressions
  • Alternatives like parser libraries (e.g., pyparsing) offer better solutions for highly structured document formats
  • Best practices recommend using regex for pattern-based tasks while choosing specialized parsers for formal grammars and complex document structures

Create Slides for Week Eleven in slides/weekeleven/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering. All the slides that you create have to specifically connect to the theme of prosegramming and document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • Please remember that I am currently working on the slides in slides/weekeleven/index.qmd that introduce how to perform searching and sorting. I created these slides for the algorithm analysis class that I also teach. Here is what I need you to add to them:
    • One slide that explains the basics of the big-O notation that I use throughout other parts of this slide deck.
    • Big-picture slides that explain why searching and sorting are important operations in computer science and document engineering.
  • Here is what I need you to revise in the slides:
    • Currently, the slides are too complicated! Please simplify the content so that it is accessible to beginners who do not have extensive experience with programming or algorithm analysis.
    • There are too many slides about the details of recursion, which I have not talked about during the other weeks of the course. Please simplify the content about recursion and focus on the big-picture ideas of searching and sorting.
    • Add some concrete examples that illustrate how searching and sorting are used in document engineering applications. For instance, you can illustrate how searching is used in search engines and how sorting is used in organizing documents.
    • As you revise the slides, please keep about the same number of slides as in the other slide decks that I have created for this course. You can use your tools to track the number of slides that I created in each of the other slide decks.
  • You can look at the other slide decks that I have already prepared:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
    • slides/weekeight/index.qmd
    • slides/weeknine/index.qmd
    • slides/weekten/index.qmd ... and see how I am currently using Markdown and Quarto in my slides!
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use containers like lists, tuples, and sets in Python.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.
  • Always include "signposting" slides that clearly state what you are going to explain in the next block of slides about a specific topic. These signposting slides are at the # level in the Markdown file.
  • Keep the total number of slides to a count less than the prior slide decks that I have created. For instance, this topic is less complicated than the material that I produced for weeknine and thus the slide count should be less.

Create Slides for Week Thirteen in slides/weekthirteen/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering. All the slides that you create have to specifically connect to the theme of prosegramming and document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • Please remember that I am currently working on the slides in slides/weekthirteen/index.qmd that introduce how to implement your own functions for natural language processing. I don't want any of the slides to use functions from packages like nltk or spacy or gensim. I want to only introduce basic concepts of natural language processing by explaining them with bullet points and examples and then adding source code examples that students can try out and run in the slides with {pyodide} blocks.
  • Here is what I want you to add to the slides:
    • Tokenization
    • Segmentation
    • Stemming
    • Lemmatization
    • Stop word removal
    • Part of speech tagging
    • Named entity recognition
    • Keyword extraction
    • Word frequency analysis
    • Keyword in context (KWIC) data structure and use
  • Again, none of the slides should use functions from packages like nltk or spacy or gensim. They have to be simple and implemented from scratch!
  • You can look at the other slide decks that I have already prepared:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
    • slides/weekeight/index.qmd
    • slides/weeknine/index.qmd
    • slides/weekten/index.qmd
  • Review these slide decks to see how I am currently using Markdown and Quarto in my slides! There are also slides in other directories and you can preview them as well.
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to use containers like lists, tuples, and sets in Python.
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.
  • Always include "signposting" slides that clearly state what you are going to explain in the next block of slides about a specific topic. These signposting slides are at the # level in the Markdown file.
  • Always include a conclusion slide that has the title Key takeaways for prosegrammers and then summarizes the key points that students learned.
  • Keep the total number of slides to a count less than the prior slide decks that I have created. For instance, this topic is less complicated than the material that I produced for weeknine and thus the slide count should be less.

Support for Week Thirteen Slides Content

Natural Language Processing for Document Engineering

  • Natural language processing (NLP) is a field of computer science and linguistics concerned with interactions between computers and human language, enabling automated text analysis and understanding
  • Basic NLP techniques like tokenization, stemming, and frequency analysis form the foundation of search engines, content management systems, and document analysis tools
  • Implementing NLP functions from scratch helps learners understand core concepts before using specialized libraries like NLTK, spaCy, or Gensim
  • Document engineering benefits from NLP through automated keyword extraction, content summarization, and document classification

Tokenization and Segmentation

  • Tokenization breaks text into individual units (tokens) such as words, numbers, or punctuation marks, forming the basis for all text analysis operations
  • Python's str.split() method provides basic whitespace tokenization, while re.findall() enables pattern-based tokenization to extract specific token types
  • Segmentation divides documents into larger units like sentences or paragraphs, using punctuation patterns or structural markers like double newlines
  • Regular expressions enable sophisticated tokenization and segmentation patterns for handling complex document structures
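
The tokenization and segmentation ideas above can be sketched with only the standard library. This is a minimal illustration, not the version used in the slides: the function names (`tokenize`, `split_sentences`, `split_paragraphs`) and the regular expressions are assumptions chosen for readability.

```python
import re


def tokenize(text):
    """Return lowercase word and number tokens, dropping punctuation."""
    # Keeps simple contractions such as "don't" as a single token.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_paragraphs(text):
    """Split text into paragraphs at blank lines (double newlines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


sample = "Prosegrammers write docs. They also write code!\n\nDon't skip tests."
print(sample.split())            # whitespace tokens keep punctuation attached
print(tokenize(sample))          # pattern-based tokens drop punctuation
print(split_sentences(sample))
print(split_paragraphs(sample))
```

The sentence splitter is deliberately naive: abbreviations such as "e.g." would be treated as sentence endings, which is a reasonable trade-off for a beginner example.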

Stemming and Lemmatization

  • Stemming reduces words to a common stem by removing suffixes (e.g., "running" becomes "run"), enabling matching of related word forms in search and analysis, although a stem is not always a dictionary word
  • The Porter Stemmer algorithm is a well-known stemming approach, but simple rule-based suffix removal is often adequate for basic applications
  • Lemmatization maps words to their dictionary base forms (lemmas) using linguistic knowledge, providing more accurate normalization than stemming
  • Dictionary-based lemmatization offers beginner-friendly implementation using Python dictionaries to map word forms to base forms
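
A small sketch can make the contrast between the two approaches concrete. The suffix list, the minimum stem length, and the `LEMMAS` table below are hypothetical choices for illustration; this is crude suffix stripping and not the Porter Stemmer.

```python
# Assumed suffix list, checked in order; not the Porter Stemmer rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hand-written lemma table (hypothetical); a real one would be much larger.
LEMMAS = {
    "ran": "run",
    "running": "run",
    "wrote": "write",
    "written": "write",
    "documents": "document",
}


def simple_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for word in ["running", "edited", "documents", "wrote"]:
    print(word, simple_stem(word), lemmatize(word))
```

Note that this crude stemmer turns "running" into "runn" rather than "run", and leaves "wrote" unchanged, while the dictionary lookup maps both to their base forms. That gap is the practical difference between suffix stripping and lemmatization.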

Stop Word Removal

  • Stop words are high-frequency function words (articles, prepositions, conjunctions) that carry little semantic meaning in text analysis
  • Removing stop words focuses analysis on content-bearing terms, improving keyword extraction and topic identification accuracy
  • Python sets provide efficient stop word filtering using the in operator for fast membership testing
  • Stop word lists vary by application domain, with different sets appropriate for general text versus technical documentation
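
As a minimal sketch of set-based filtering, the example below uses a short, assumed stop word list; the `STOP_WORDS` contents and the `remove_stop_words` name are illustrative rather than taken from the slides.

```python
# A deliberately short, assumed stop word list; real lists are longer
# and vary by domain.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "the structure of a document is important to the reader".split()
print(remove_stop_words(tokens))
# → ['structure', 'document', 'important', 'reader']
```

Passing a different set as `stop_words` is one way to adapt the filter to technical documentation, where words like "the" still carry little meaning but terms like "for" or "in" may matter inside code samples.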

Word Frequency Analysis

  • Word frequency counting identifies the most common terms in documents using dictionary-based accumulation patterns
  • The dict.get() method with default values provides concise frequency counting without explicitly checking for key existence
  • Frequency distributions reveal document themes, author style, and content patterns useful for classification and summarization
  • Top-N word extraction helps identify key topics without reading entire documents
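
The accumulation pattern described above might look like the following sketch. The helper names `count_words` and `top_n` are hypothetical, and ties are broken alphabetically as an arbitrary but predictable choice.

```python
def count_words(tokens):
    """Count token occurrences using dict.get() with a default of 0."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def top_n(counts, n):
    """Return the n most frequent (word, count) pairs, ties sorted by word."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


tokens = "docs code docs tests code docs".split()
counts = count_words(tokens)
print(counts)            # → {'docs': 3, 'code': 2, 'tests': 1}
print(top_n(counts, 2))  # → [('docs', 3), ('code', 2)]
```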

Keyword Extraction and KWIC

  • Keyword extraction combines frequency analysis with stop word filtering to identify terms that best represent document content
  • Scoring keywords by frequency, length, or TF-IDF metrics enables ranking of term importance for indexing and search applications
  • Keyword in context (KWIC) displays show how terms are used within surrounding text, supporting concordance building and usage analysis
  • KWIC tools align keywords in formatted output for visual scanning and linguistic pattern identification
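
To illustrate both ideas together, here is a hedged sketch of frequency-based keyword extraction and a KWIC display. The function names, the short stop word list, and the two-word context window are assumptions made for this example.

```python
import re

# Short assumed stop word list for the keyword ranking.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def tokenize(text):
    """Lowercase word tokens without punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_keywords(text, n=3):
    """Rank non-stop-words by frequency, breaking ties alphabetically."""
    counts = {}
    for token in tokenize(text):
        if token not in STOP_WORDS:
            counts[token] = counts.get(token, 0) + 1
    return sorted(counts, key=lambda word: (-counts[word], word))[:n]


def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


def format_kwic(rows):
    """Right-align the left context so the keywords line up in a column."""
    if not rows:
        return []
    pad = max(len(left) for left, _, _ in rows)
    return [f"{left:>{pad}} | {kw} | {right}" for left, kw, right in rows]


text = "Good docs help users. Clear docs save time."
print(extract_keywords(text))
for line in format_kwic(kwic(text, "docs")):
    print(line)
```

Because the context window works on tokens rather than sentences, the left context of the second match crosses a sentence boundary. A sentence-aware variant would segment first and then build the rows.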

Educational Slide Design for NLP

  • Progressive complexity introduction starts with basic tokenization and builds to integrated analysis pipelines combining multiple techniques
  • Interactive Pyodide code blocks enable hands-on experimentation with NLP functions, providing immediate feedback for learning
  • Document engineering examples demonstrate practical applications of NLP techniques to real prosegrammer workflows
  • Simplified implementations using only Python standard library features ensure accessibility and portability across environments

Create Slides for Week Fifteen in slides/weekfifteen/index.qmd

  • Review the content in the GEMINI.md file (or the AGENTS.md file) that explains the theme of the course on document engineering.
  • Review the content in the index.qmd file in the root of the repository that explains the idea of a "Prosegrammer" and the concept of document engineering. All the slides that you create have to specifically connect to the theme of prosegramming and document engineering.
  • Review the content in the index.qmd file in the syllabus directory of the repository that explains rules and regulations for this course on document engineering. Note that these are the rules that students follow and not specifically the rules and regulations that you follow as a coding agent.
  • Please remember that I am currently working on the slides in slides/weekfifteen/index.qmd that introduce how to perform retrieval augmented generation (RAG). I do not want any of the slides to use functions from packages like nltk or spacy or gensim or SentenceTransformers. I want to only introduce basic concepts of retrieval augmented generation by explaining them with bullet points and examples and then adding source code examples that students can try out and run in the slides with {pyodide} blocks.
  • If there are specific tools that would normally be used for a specific step in the RAG pipeline, then please reference them in a bulleted list and explain what they do. However, you cannot make the slides import those packages in Python or install a tool, like a vector database, on a computer.
  • The high-level workflow that you should start with is as follows:
    • Document ingestion
    • Data cleaning and preprocessing
    • Converting text to chunks
    • Create and store vector embeddings
    • Retrieve relevant documents based on user queries
    • Make it clear how the documents are relevant with simple examples
    • Combine retrieved documents with user queries
    • Generate responses using a language model
  • Please note that you cannot use any external packages that are not part of the Python standard library to complete any of these steps.
  • Again, none of the slides should use functions that come from any external package that is available on PyPI or through any external site. All the content must be from the standard Python library. It has to be simple and implemented from scratch!
  • You can look at the other slide decks that I have already prepared:
    • slides/weekone/index.qmd
    • slides/weektwo/index.qmd
    • slides/weekthree/index.qmd
    • slides/weekfour/index.qmd
    • slides/weekfive/index.qmd
    • slides/weeksix/index.qmd
    • slides/weekeight/index.qmd
    • slides/weeknine/index.qmd
    • slides/weekten/index.qmd
  • Review these slide decks to see how I am currently using Markdown and Quarto in my slides! There are also slides in other directories and you can preview them as well.
  • You should definitely review the slides from weekthirteen called "Natural Language Processing for Document Engineering" as this will illustrate how I taught the students about some of the NLP concepts you can build on in this slide deck. Again, notice that none of that content uses an external package, it is all "simple" and also "built from scratch".
  • Please do not use Markdown or Quarto formats that I am not currently using in my slides to make sure that the content has a consistent format.
  • If you check the index.qmd file in this GitHub repository, you can see a simple example of word frequency analysis. Please use simple examples like this one to illustrate how to perform retrieval augmented generation (RAG).
  • Make sure that all the content is accessible to beginners who do not have extensive experience with programming or the documentation of a software tool.
  • Make sure that all the content has concrete examples that make points clear to beginners. Provide simple summaries of the concrete code examples.
  • Always include "signposting" slides that clearly state what you are going to explain in the next block of slides about a specific topic. These signposting slides are at the # level in the Markdown file.
  • Always include a conclusion slide that has the title Key takeaways for prosegrammers and then summarizes the key points that students learned.
  • Keep the total number of slides to a count less than the prior slide decks that I have created. For instance, this topic is less complicated than the material that I produced for weeknine and thus the slide count should be less.

Support for Week Fifteen Slides Content

Retrieval Augmented Generation for Document Engineering

  • Retrieval Augmented Generation (RAG) is a technique that combines information retrieval with natural language generation to produce more accurate, context-grounded responses, as described in the original RAG paper by Lewis et al. (2020) from Facebook AI Research
  • RAG addresses the limitation of language models generating "hallucinated" information by grounding responses in retrieved factual documents from a knowledge base
  • The approach has become fundamental to modern AI applications including ChatGPT plugins, documentation assistants, and question-answering systems
  • Document engineering benefits from RAG through automated technical documentation assistants, code explanation systems, and knowledge base chatbots

RAG Pipeline Components

  • Document ingestion and preprocessing forms the foundation of RAG systems by loading, cleaning, and normalizing text data from various sources
  • Text chunking divides long documents into smaller, semantically coherent segments that fit within context windows and improve retrieval precision
  • Vector embeddings transform text into numerical representations that enable semantic similarity calculations, typically using transformer-based models like BERT or sentence transformers
  • Retrieval mechanisms rank document chunks by relevance to queries using similarity metrics like cosine similarity or dot product
  • Context combination merges retrieved chunks with user queries to create comprehensive prompts for language models
  • Response generation uses language models (GPT, Claude, Llama) to synthesize answers grounded in retrieved information
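
The first pipeline stage, ingestion and cleaning, can be sketched without any external tools. The in-memory `raw_documents` dictionary stands in for files loaded from disk, and the cleaning rules (dropping simple HTML-style tags, collapsing whitespace, lowercasing) are assumptions chosen for illustration.

```python
import re


def clean_text(raw):
    """Normalize one document: drop simple tags, collapse spaces, lowercase."""
    text = re.sub(r"<[^>]+>", " ", raw)   # remove simple HTML-style tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip().lower()


def ingest(documents):
    """Clean every document in a {name: raw text} dictionary."""
    return {name: clean_text(raw) for name, raw in documents.items()}


# Hypothetical in-memory documents standing in for files on disk.
raw_documents = {
    "intro.html": "<p>Quarto   renders Markdown</p>\n",
    "notes.txt": "  Python COUNTS\twords  ",
}
print(ingest(raw_documents))
```

Later stages (chunking, scoring, retrieval, and prompt construction) would consume the cleaned text produced here.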

Document Chunking Strategies

  • Sentence-based chunking splits text at sentence boundaries using punctuation patterns, providing natural semantic units for retrieval
  • Fixed-size word chunking creates uniform segments with consistent lengths, useful for controlling context window sizes in language models
  • Semantic chunking groups related content together, though this approach requires more sophisticated analysis beyond simple rule-based splitting
  • Overlapping chunks include content from adjacent segments to preserve context at boundaries, improving retrieval quality for queries spanning chunk boundaries
  • The choice of chunking strategy impacts retrieval precision, context completeness, and system performance
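
The sentence-based, fixed-size, and overlapping strategies above can be sketched as follows. The function names and the default `size` and `overlap` values are hypothetical; semantic chunking is omitted because it needs more than rule-based splitting.

```python
import re


def chunk_by_sentence(text):
    """Split text into one chunk per sentence using end punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_words(text, size=5, overlap=0):
    """Create chunks of `size` words, repeating `overlap` words between them."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_by_sentence("Docs explain code. Code needs docs!"))
print(chunk_by_words("a b c d e f g", size=3))
# → ['a b c', 'd e f', 'g']
print(chunk_by_words("a b c d e f g", size=3, overlap=1))
# → ['a b c', 'c d e', 'e f g']
```

With `overlap=1`, the last word of each chunk is repeated at the start of the next one, so a query that spans a boundary still finds both words in at least one chunk.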

Simple Vector Representations

  • Word frequency vectors represent text as dictionaries mapping words to occurrence counts, providing a simple lexical baseline for comparing texts
  • Set-based overlap metrics such as Jaccard similarity calculate the ratio of shared words to all distinct words across both texts, demonstrating basic retrieval relevance scoring
  • Real-world systems use dense embeddings from neural models (sentence transformers, OpenAI embeddings) that capture semantic meaning beyond simple word overlap
  • Vector search libraries and databases (FAISS, ChromaDB, Pinecone) enable efficient similarity search over large document collections using approximate nearest neighbor algorithms
  • The simplified representations in course slides demonstrate core concepts without requiring external dependencies or API access
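
A minimal sketch of these simplified representations, using only dictionaries, sets, and the `math` module, is shown below. The function names are assumptions, and the scores measure word overlap only; they do not capture meaning the way dense neural embeddings do.

```python
import math


def frequency_vector(text):
    """Represent text as a dictionary mapping each word to its count."""
    vector = {}
    for word in text.lower().split():
        vector[word] = vector.get(word, 0) + 1
    return vector


def jaccard(text_a, text_b):
    """Shared distinct words divided by all distinct words in both texts."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine(vec_a, vec_b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * vec_b.get(word, 0) for word, count in vec_a.items())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a, b = "write clear docs", "write good docs"
print(jaccard(a, b))  # → 0.5 (2 shared words out of 4 distinct words)
print(cosine(frequency_vector(a), frequency_vector(b)))
```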

Retrieval and Ranking Mechanisms

  • Similarity scoring quantifies relevance between queries and document chunks using metrics like Jaccard similarity, cosine similarity, or BM25
  • Top-k selection retrieves the most relevant chunks while balancing context size and precision, with typical values ranging from 2 to 10 chunks
  • Re-ranking strategies apply multiple relevance signals to improve initial retrieval results, such as combining keyword match with semantic similarity
  • Source attribution tracks which chunks contributed to responses, enabling transparency and verification of generated information
  • Retrieval quality directly impacts response accuracy, making it a critical component of RAG systems
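
Top-k retrieval with source attribution could be sketched like this. Jaccard similarity is used here as the scoring metric purely for simplicity, and the `retrieve` function, its return shape, and the sample chunks are hypothetical.

```python
def jaccard(text_a, text_b):
    """Shared distinct words divided by all distinct words in both texts."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k (chunk index, score, chunk) tuples, best score first."""
    scored = [
        (jaccard(query, chunk), index, chunk)
        for index, chunk in enumerate(chunks)
    ]
    # Sort by descending score; earlier chunks win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Keeping the index records which source chunk produced each result.
    return [
        (index, round(score, 2), chunk)
        for score, index, chunk in scored[:k]
        if score > 0
    ]


chunks = [
    "quarto renders markdown documents",
    "python counts word frequency",
    "markdown documents use headings",
]
query = "how does quarto render markdown documents"
for index, score, chunk in retrieve(query, chunks):
    print(index, score, chunk)
```

Exact word matching means "render" in the query does not match "renders" in the first chunk. Applying stemming to both the query and the chunks before scoring is one way to close that gap.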

Context Construction and Response Generation

  • Context formatting structures retrieved chunks with queries into prompts that language models can effectively process
  • Template-based generation provides deterministic responses from retrieved information, suitable for educational demonstrations without requiring LLM APIs
  • Language model integration (GPT, Claude, Llama) enables flexible, natural-sounding responses that synthesize information from multiple sources
  • Prompt engineering techniques guide language models to ground responses in retrieved context and avoid hallucination
  • Multi-turn conversations maintain context across interactions, enabling follow-up questions and clarifications in RAG applications
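
Context formatting and template-based generation can both be shown without calling a language model. The prompt wording, the numbered source markers, and the function names `build_prompt` and `template_answer` are assumptions made for this sketch.

```python
def build_prompt(query, retrieved):
    """Combine retrieved chunks and the query into one prompt string."""
    lines = ["Answer the question using only the context below.", "", "Context:"]
    for number, chunk in enumerate(retrieved, start=1):
        lines.append(f"[{number}] {chunk}")
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)


def template_answer(query, retrieved):
    """Build a deterministic answer from the top chunk, citing all sources."""
    if not retrieved:
        return f"No relevant information was found for: {query}"
    sources = ", ".join(f"[{n}]" for n in range(1, len(retrieved) + 1))
    return f"According to the retrieved context: {retrieved[0]} (sources: {sources})"


retrieved = [
    "quarto renders markdown documents",
    "markdown documents use headings",
]
query = "What does Quarto render?"
print(build_prompt(query, retrieved))
print(template_answer(query, retrieved))
```

In a full system the string from `build_prompt` would be sent to a language model. Here `template_answer` stands in for that step so the example stays deterministic and dependency-free.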

Real-World RAG Applications

  • Technical documentation assistants use RAG to answer developer questions by retrieving relevant documentation sections and generating contextual explanations
  • Customer support chatbots combine company knowledge bases with conversational AI to provide accurate, source-grounded responses
  • Research paper assistants help scholars find relevant citations and summarize academic literature through retrieval-augmented summarization
  • Code explanation systems retrieve related code examples and documentation to explain programming concepts and provide usage examples
  • These applications demonstrate RAG's value for document-intensive domains where factual accuracy and source attribution are critical

Educational Slide Design for RAG

  • Progressive pipeline introduction builds from document ingestion through retrieval to generation, showing how components connect
  • Simplified implementations using only the Python standard library demonstrate core concepts without external dependencies
  • Interactive Pyodide code blocks enable hands-on experimentation with chunking, retrieval, and generation functions
  • Tool reference slides list professional RAG libraries (LangChain, LlamaIndex) and vector databases while keeping examples dependency-free
  • Document engineering context frames RAG as a practical skill for building intelligent documentation systems and knowledge management tools
  • Beginner-accessible explanations focus on conceptual understanding over mathematical complexity or advanced NLP techniques