Preferences and patterns learned from building the PyFlink streaming workshop.
- Introduce services one at a time, not all at once. Start with one container (e.g., Redpanda), explain it, use it. Then add the next (PostgreSQL), etc.
- Start with the simplest version that works (plain Python consumer), then motivate the more complex tool (Flink) by showing what's missing.
- Use
docker compose up <service> -dto start services selectively during the gradual buildup.docker compose up --build -donly when everything is ready.
- Use real datasets, not fake test data. NYC taxi data
(
yellow_tripdata_YYYY-MM.parquet) is a good go-to. - Limit to manageable sizes (e.g., first 1000 rows) for workshop speed.
- Assume starting from scratch:
uv init -p 3.12+uv add <package>. - Add dependencies gradually as they're needed in the narrative
(e.g.,
uv add kafka-python pandas pyarrowfirst,uv add psycopg2-binarylater when PostgreSQL is introduced). - Always note "if you cloned the repo, run
uv syncinstead" as a blockquote.
- Break large code blocks into small, focused blocks. Each block should do one thing. Don't dump a full script in one block.
- Pattern for code blocks: short intro line (what it does), then the code, then the explanation of how it works below. Don't put detailed explanations before the code - let the reader see the code first.
- Keep imports local to each block - don't introduce all imports upfront. Each block should only import what it uses.
- Introduce functions and utilities where they're first used, not earlier.
For example, show
dataclasses.asdict()in the block that calls it, not in the block that defines the dataclass. - When introducing a function, show a test with sample data before using it in the real code. For example, create a test binary string to verify a deserializer, then pass it to the consumer.
- Prefer named functions over inline lambdas. A named function is reusable,
testable, and easier to explain step by step. For example,
value_deserializer=ride_deserializerinstead ofvalue_deserializer=lambda m: json.loads(m.decode('utf-8')). - Extract repetitive logic into named functions. For example, row-to-object
conversion that appears in multiple places should be a function like
ride_from_row(row). - Split one-liner functions into multiple lines. Each step (decode, parse, construct) on its own line is easier to follow and explain.
- Show the simple approach first, then improve it. For example, show a
generic
json_serializerwith manualdataclasses.asdict()calls, then introduce a specializedride_serializerthat handles the conversion internally. Let the student feel the friction before showing the fix. - Extract shared code (dataclasses, serializers, deserializers, converters)
into shared modules (e.g.,
models.py) so multiple scripts can import from one place. - Reference the complete script at the end (e.g., "> The complete script is
in
src/producers/producer.py."). - For infrastructure files that are long or complex (Dockerfile, YAML configs),
link to the file on GitHub and provide a short summary list of what it does.
Use
wgetto download from the GitHub repo instead of asking students to type them. - Mention that students can run Python code in Jupyter notebooks
(
uv add jupyter,uv run jupyter lab) as an alternative to .py scripts. The small-block style maps naturally to notebook cells. - Flink jobs must remain as .py files (they're submitted to the cluster via
docker compose exec). Add a note explaining this distinction.
- No bold formatting (
**text**) in README files. Use plain text. - No em dashes. Use hyphens with spaces (
-) instead. - Use
pythonnotpython3. - Use
docker composenotdocker-compose. - Use
uvx pgclinot justpgcli. - Use
uv run pythonnotpythonfor running scripts.
- Use meaningful names that reflect purpose, not generic placeholders.
For example,
group_id='rides-console'orgroup_id='rides-to-postgres', notgroup_id='test-consumer-group'.
- For complex configurations (like Redpanda's docker-compose command), explain every parameter in a table or list.
- Explain the "why" not just the "what" (why two Kafka addresses? why checkpointing every 10 seconds? why watermarks?).
- Use tables for parameter explanations and comparisons.
- Include sample output for every command students will run.
- Use
>blockquotes for tips, notes about the repo, and common mistakes from original workshops/streams. - For complex concepts (watermarks, task slots, parallelism), pull the explanation out of bullet lists into its own multi-paragraph section. State the value or syntax in the bullet, then explain the concept below in separate paragraphs for easier reading.
- Use lists for multi-point summaries instead of packing everything into one long sentence.
- When showing a development shortcut (like mounting local files into Docker), add a note explaining how it works in production. Students benefit from understanding real-world deployment patterns alongside the workshop setup.
- Define the source (where you read from) before the sink (where you write to) when presenting code blocks. Set up the consumer/reader first, then the database connection or output destination.
- Don't use
container_nameorhostname- Docker Compose handles naming automatically. - Don't use
extra_hostsunless specifically needed. - Service names are automatically resolvable as hostnames within the Docker network.
- Prefer short service names (e.g.,
redpandanotredpanda-1). - Keep
restart: on-failureonly for services that need it (like databases).
- Always use the latest stable versions of images and libraries.
- Pin exact versions for Flink and its connectors (they must match).
- Use
uvfor everything Python-related (package management, running scripts, even installing Python itself inside Docker). - Prefer
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/in Dockerfiles instead ofapt-get install.
- Credit the original stream/video at the top with a link.
- If the new video is not yet available, put "TBA" with a sign-up link (e.g., Luma).
- Brief description of what we'll build and prerequisites.
- Introduce the first component (message broker, database, etc.)
- Set up with docker-compose (explain parameters)
- Create a simple producer/writer
- Create a simple consumer/reader
- Add a database, save data
- Show limitations of the simple approach
- Introduce the framework (Flink, Spark, etc.)
- Reproduce the simple case with the framework
- Do something the simple approach can't (aggregation, windowing)
- Explain advanced concepts (window types, offsets, etc.)
- Cleanup
- Q&A - questions and answers from the original stream. Include production deployment topics here rather than as standalone sections.