add prefect to organize into flows and tasks #17
Changes from all commits
1085bc0
e9f81a7
24c1569
893c755
2355e6a
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,17 +14,35 @@ poetry install --no-root | |
| poetry self add poetry-plugin-shell | ||
| poetry shell | ||
|
|
||
| # Install Jupyter kernel for this environment (needed for Jupyter notebooks) | ||
| python -m ipykernel install --user --name=tgov-scraper --display-name="TGOV Scraper" | ||
| # Set up pre-commit hooks | ||
| poetry run pre-commit install | ||
|
|
||
| # Verify pre-commit hooks are working | ||
| poetry run pre-commit run --all-files | ||
|
|
||
| # See notebook_precommit.md for more details on how notebook outputs are automatically stripped | ||
| ``` | ||
|
|
||
| ## Running | ||
| ### Jupyter notebooks | ||
|
|
||
| ```bash | ||
| # Install Jupyter kernel for this environment (needed for Jupyter notebooks) | ||
| python -m ipykernel install --user --name=tgov-scraper --display-name="TGOV Scraper" | ||
|
|
||
| jupyter notebook | ||
| ``` | ||
|
|
||
| ## Running Tests | ||
| ### Prefect flows | ||
| See https://docs.prefect.io/get-started | ||
|
|
||
| ```bash | ||
| prefect server start # to start the persistent server | ||
|
|
||
| python -m flows.translate_meetings # to run a specific flow | ||
| ``` | ||
|
|
||
| ### Tests | ||
|
|
||
| ```bash | ||
| # Run all tests | ||
|
|
@@ -39,12 +57,16 @@ pytest -v | |
|
|
||
| ## Project Structure | ||
|
|
||
| - `data/`: local data artifacts | ||
| - `flows/`: prefect flows | ||
| - `notebooks/`: Jupyter notebooks for analysis and exploration | ||
| - `scripts/`: one off scripts for downloading, conversions, etc | ||
| - `src/`: Source code for the scraper | ||
| - `models/`: Pydantic models for data representation | ||
| - 'scripts`: one off scripts for downloading, conversions, etc | ||
| - `tasks/`: prefect tasks | ||
|
Member
Is the next task to convert all scripts to prefect tasks?
Member
Author
Not necessarily. The only thing I'm sure of is that we should move as much of the core logic (fetching, parsing, transforming, invoking models, writing outputs, etc.) as possible into either "src" or "functions" modules that do not import or depend on any orchestration library. That way, we can keep our core logic as decoupled as possible from prefect, or airflow, or langchain, or whatever other orchestration tool we want to try. So, I think it would be better to convert or refactor code from scripts into code in "src" or "functions" modules. |
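To make that split concrete, here is a rough sketch of what it could look like. Everything in it is illustrative rather than code from this PR: the `Meeting` fields, `meetings_to_csv`, and `create_meetings_csv_task` are hypothetical names, and the `try`/`except` fallback exists only so the sketch runs without Prefect installed.

```python
import csv
import io
from dataclasses import dataclass


# --- Core logic: would live in "src" or "functions"; no orchestration imports ---

@dataclass(frozen=True)
class Meeting:
    # Hypothetical fields, chosen only for illustration.
    title: str
    date: str
    video_url: str


def meetings_to_csv(meetings: list[Meeting]) -> str:
    """Serialize meetings to CSV text. Plain function, testable on its own."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "date", "video_url"])
    for meeting in meetings:
        writer.writerow([meeting.title, meeting.date, meeting.video_url])
    return buffer.getvalue()


# --- Orchestration layer: would live in "tasks"; the only place Prefect appears ---

try:
    from prefect import task
except ImportError:
    # Assumption for this sketch only: a no-op stand-in when Prefect is absent.
    def task(fn):
        return fn


@task
def create_meetings_csv_task(meetings: list[Meeting]) -> str:
    # Thin wrapper: delegates straight to the core function.
    return meetings_to_csv(meetings)
```

With this shape, swapping prefect for another orchestrator would only mean rewriting the thin wrapper, and the tests can call `meetings_to_csv` directly without starting a server.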
||
| - `tests/`: Test files | ||
| - `notebooks/`: Jupyter notebooks for analysis and exploration | ||
| - `data/`: output from notebooks | ||
| - `data/`: output from notebooks | ||
|
|
||
|
|
||
| ## Running the transcription scripts | ||
|
|
||
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| from prefect import flow | ||
|
|
||
| from tasks.meetings import create_meetings_csv | ||
|
|
||
|
|
||
| @flow(log_prints=True) | ||
| async def translate_meetings(): | ||
| await create_meetings_csv() | ||
| # TODO: await download_videos() | ||
| # TODO: await transcribe_videos() | ||
| # TODO: await diarize_transcriptions() | ||
| # TODO: await translate_transcriptions() | ||
| # TODO: await create_subtitled_video_pages() | ||
|
|
||
| if __name__ == "__main__": | ||
| import asyncio | ||
| asyncio.run(translate_meetings()) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,11 +2,11 @@ | |
| name = "tgov scraper" | ||
| version = "0.1.0" | ||
| description = "A set of scripts and notebooks for exploring Tulsa Government Access Television" | ||
| authors = ["jdungan <[email protected]>"] | ||
| authors = ["jdungan <[email protected]>", "groovecoder <[email protected]>"] | ||
| readme = "README.md" | ||
|
|
||
| [tool.poetry.dependencies] | ||
| python = "3.11.*" | ||
| python = ">=3.11,<3.13" | ||
| selectolax = "^0.3.28" | ||
| aiohttp = "^3.11.13" | ||
| pytest-asyncio = "^0.25.3" | ||
|
|
@@ -25,14 +25,17 @@ jupyter-nbextensions-configurator = "^0.6.4" | |
| python-dotenv = "^1.0.1" | ||
| aiofiles = "^24.1.0" | ||
| faster-whisper = "^1.1.1" | ||
| prefect = "^3.3.0" | ||
| boto3 = "^1.37.24" | ||
|
|
||
|
|
||
| [tool.poetry.group.dev.dependencies] | ||
| jupyter = "^1.1.1" | ||
| ipdb = "^0.13.13" | ||
| ipykernel = "^6.29.5" | ||
| pytest = "^8.0.0" | ||
| pre-commit = "^4.2.0" | ||
| jupyter = "^1.1.1" | ||
| nbstripout = "^0.8.1" | ||
| pre-commit = "^4.2.0" | ||
| pytest = "^8.0.0" | ||
|
|
||
| [build-system] | ||
| requires = ["poetry-core"] | ||
|
|
||
This file was deleted.
@jdungan: be sure to remove this line from the `README.md` too.