Commit 54f1e94

KMDS refactoring.
1 parent 69a11f7 commit 54f1e94

92 files changed: +9350 −8771 lines changed


.github/workflows/ci.yml

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+name: CI
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          version: "latest"
+      - name: Load cached venv
+        id: cached-uv-dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/uv.lock') }}
+      - name: Install dependencies
+        run: uv sync --no-install-project
+      - name: Run tests
+        run: uv run pytest
+
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          version: "latest"
+      - name: Load cached venv
+        id: cached-uv-dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-3.10-${{ hashFiles('**/uv.lock') }}
+      - name: Install dependencies
+        run: uv sync
+      - name: Build package
+        run: uv build
+      - name: Upload build artifacts
+        uses: actions/upload-artifact@v3
+        with:
+          name: dist
+          path: dist/

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -62,4 +62,6 @@ _solr/schema.xml
 _src/*
 local/*
 .env
+.venv/
+.pytest_cache/
 

Changelog.md

Lines changed: 5 additions & 0 deletions
@@ -7,3 +7,8 @@
 * Wiki page on design perspective added
 * Examples for Minio based file read and write added. Docker files added.
 * Examples of generating semantic meta-data for clean data representations that are used for modelling are illustrated with woodwork.
+
+[0.0.3.0]
+* Removed cloud features and packages related to documenting meta-data for datasets; these can be captured by code-generation tools now
+* Refactored examples to illustrate the use of generative AI in data science and/or data analysis projects
+* Created better example documentation using generative AI tools.

README.md

Lines changed: 11 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -11,15 +11,11 @@
 
 </div>
 
-### What is this tool used for?
+### What is the tool and why do you need it?
 
-This tool is used for knowledge management in data science. In software design, conceptual domain models inform lower-level models. In data science work, experiments yield knowledge that informs modeling choices. Data science models are almost always informed by a variety of analysis experiments. Experimentation and organization of knowledge artifacts supporting modeling decisions is a requirement for reproducible data science models and improving model quality and performance. ([video](https://www.youtube.com/watch?v=ckr8YQJxF9I)).
+This tool is used for knowledge management in data science. For data scientists, incremental experimentation is a way of life. The problem is that experiments are numerous, and even small projects accumulate context, decisions, and rationale over time. This is not an issue if both the need for each experiment (the design question or issue) and its results are documented as you go, but this tends to be done in an ad hoc manner, so when it is time to rebuild or revisit a particular question, we cannot find the research and results related to it. This is the need this tool fulfils.
 
-Please see [knowledge application development context](https://github.com/rajivsam/KMDS/blob/main/feature_documentation/knowledge_management_in_DS.md) for a description of a typical knowledge application development setting.
-
-### Why do you need this tool?
-
-The above narrative suggests that ability to retrieve knowledge about experiments and historical models in an adhoc manner is critical in data science. It is. It is also grossly underserved. Knowledge management tools for domain specific models exist, knowledge management tools for dev-ops and ML-Ops exist, but tools for analytics and model development are siloed. Information gets fragmented over time. So analysts and data scientists often have to go to experiment tools, data catalogs or ML-Ops tools to fetch information they need to develop a model. In a subsequent iteration of this model, the contextual information that informed the development of this model is often lost, and the development team, possibly with new team members, have the task of reconstructing this contextual information again. This library is a step in fixing this problem. The central idea is to organize tasks in terms of a sequence of steps that are pretty standard in data analysis work and capture knowledge in context while these tasks are performed.
+Please see [knowledge application development context](https://github.com/rajivsam/KMDS/blob/main/feature_documentation/knowledge_management_in_DS.md) for a description of a typical knowledge application development setting. Please see [the video](example_documentation/video/Knowledge_Management_for_Data_Science_comp.mp4) for a quick overview.
 
 ### How is it related to process guidelines and vocabularies for machine learning?
 Initiatives such as [CRISP DM](https://www.datascience-pm.com/crisp-dm-2/) provide guidelines and processes for developing data science projects. Projects such as [Open ML](https://openml.github.io/openml-python/main/index.html) provide semantic vocabulary standardization for machine learning tasks. These are excellent tools. However, the guidelines they provide are task focussed. The gap between a conceptual idea and the final, or even candidate, data science tasks for a project is filled with many assumptions and experimental evaluations. The information in these assumptions and experimental evaluations is what this tool aims to capture, as is the ordering among them.
@@ -29,20 +25,16 @@ Initiatives such as [CRISP DM](https://www.datascience-pm.com/crisp-dm-2/) provi
 This tool is for data scientists and data analysts.
 
 ### How do you use this tool?
+This version of the tool takes the recent advances in generative AI (as of early 2026) into consideration. It is a Python package, and it is assumed that you have an API key for a model provider. The basic usage scenario is as follows:
+1. Install the Python package in the environment where you intend to experiment and do your data science analysis.
+2. Work through the analysis plan for your model development or experiment. The tool does not offer help with how your analysis or experiment is done; it assumes you are the expert. Of course, you can use __Jupyternaut__ or a similar generative AI tool to generate your code for you.
+3. As you work through your exploratory data analysis, data representation, and modeling phases, log your findings to ```kmds```.
+4. Run a report to fetch the details of your design rationale as needed. To communicate your findings to your team or management, simply export your knowledge base, point a generative AI tool such as __NotebookLM__ at the export, and generate your report, video, or other documentation artifact.
 
-1. You install this library along with the other python dependecies you need for your analysis task
-2. Review the [basic recipe](https://github.com/rajivsam/KMDS/blob/main/examples_of_use/workflow_recipe.md) for capturing your observations.
-3. Review [the templates section](https://github.com/rajivsam/KMDS/blob/main/examples_of_use/workflow_recipe.md) to find the example relevant to you. For analytics projects, review the analytics template. For machine learning projects, review the machine learning template.
-4. Start using the tool in your projects using the information from your review of the above two steps.
-
-_Note:_
-1. The examples are based on using the files in the package, but it is quite straight forward to connect to S3 storage to get your data files, see [the connection notes document](https://github.com/rajivsam/KMDS/blob/main/examples_of_use/connection_notes.md) for details. Minio provides a sandbox where you can try this
-
-2. _Please the [wiki pages](https://github.com/rajivsam/KMDS/wiki/KMDS-Design-Perspectives) section of the repository for design perspectives and documentation_. This is work in progress.
-
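The "log findings, then export" workflow in the new usage steps can be sketched in plain Python. This is a hypothetical stand-in, not the real ```kmds``` API; the class and method names below are illustrative assumptions only.

```python
# Hypothetical sketch of logging observations per phase and exporting a
# knowledge base. NOT the actual kmds API; names are illustrative only.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Observation:
    phase: str       # e.g. "eda", "data_representation", "modeling"
    finding: str     # the insight worth remembering
    timestamp: str

class KnowledgeLog:
    def __init__(self):
        self.observations = []

    def log(self, phase, finding):
        self.observations.append(
            Observation(phase, finding, datetime.now(timezone.utc).isoformat())
        )

    def export(self, path):
        # Export the knowledge base so a generative AI tool can turn it
        # into a report, video script, or other documentation artifact.
        with open(path, "w") as f:
            json.dump([asdict(o) for o in self.observations], f, indent=2)

kb = KnowledgeLog()
kb.log("eda", "Ticket volume is heavily skewed toward weekdays in Q2.")
kb.log("modeling", "Two principal components capture most of the variance.")
kb.export("knowledge_base.json")
print(len(kb.observations))  # 2
```

The design point is simply that each observation is captured in context, tagged with the phase it came from, so the rationale can be retrieved later.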
+### Examples of use
+The repository contains two examples of use: one from analytics and one from machine learning. The notebooks for analytics are in [the analytics example](examples_of_use/analytics) and the notebooks for machine learning are in [the machine learning example](examples_of_use/machine_learning). The analytics example evaluates the effectiveness of a ticket-resolution help desk. Using ticket resolution data for a particular quarter, Q2 2016, it illustrates how the effectiveness of the organization can be evaluated. The reader can explore the notebooks for the details of the implementation and of how findings in each phase of the model-building cycle are logged. The findings in the resulting knowledge base can be exported to create materials that communicate the details of the project to team members and management; see [this video](examples_of_use/analytics/Help_Desk_Analytics%20_comp.mp4) and this [infographic](examples_of_use/analytics/usecase_overview_mindmap.png).
 
+The machine learning example illustrates how Principal Components Analysis can be used to summarize the sales activity in an online store for a particular quarter. The reader can view the notebooks under [the machine learning example](examples_of_use/machine_learning) for details of the implementation. As with the analytics example, generative AI tools (NotebookLM in this case) can be used to communicate the findings and results from the knowledge base; see [this infographic](examples_of_use/machine_learning/ml_infographic_kmds.png).
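The PCA summarization idea the machine-learning example describes can be sketched on synthetic data (the real example uses the store's actual quarterly sales; the shapes and column meanings below are illustrative assumptions):

```python
# Minimal PCA-by-SVD sketch: summarize a quarter of sales activity in two numbers
# per row. Synthetic data stands in for the store's real quarterly sales.
import numpy as np

rng = np.random.default_rng(0)
# Pretend rows are customers and columns are weekly sales totals for one quarter.
sales = rng.normal(loc=100.0, scale=15.0, size=(200, 13))

# Center the data, then use SVD to obtain the principal components.
centered = sales - sales.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first two components: a two-number summary per customer.
scores = centered @ Vt[:2].T

# Fraction of total variance each component explains.
explained = (S**2) / (S**2).sum()
print(scores.shape)  # (200, 2)
print(round(float(explained[:2].sum()), 3))
```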
 
 ### Licensing and Feature Questions
 
build_instructions/build_instructions.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
### Building KMDS
2-
This package is built with poetry. Install poetry, then run ``` poetry build```
2+
This package is built with poetry. Install poetry, then run ``` uv pip build ```
33

44
### Updating Documentation
55
This is the sequence of steps to follow to make changes to the documentation.
Binary file (11.5 MB) not shown.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+1. The context in which development is done for most enterprise projects is provided in [this document](../feature_documentation/knowledge_management_in_DS.md)
+2. A basic recipe for capturing observations is provided in [this document](../feature_documentation/km_app_pipeline.md)
+3. The rationale for using an ontology for knowledge capture is explained in [this document](../feature_documentation/ontology_management.md)
+4. A glossary of observation types is provided in [this document](../feature_documentation/glossary_observation_types.md)
File renamed without changes.
File renamed without changes.
