|
| 1 | +# Create Separate Repo for User Utilities |
| 2 | + |
| 3 | +## Idea Overview |
| 4 | + |
| 5 | +Create a separate repository within the `instructlab` GitHub org to house scripts that help prepare documents before they are used in the Instructlab Synthetic Data Generation (SDG) process. |
| 6 | + |
| 7 | +Scripts in this repository may become features of, or be incorporated into, the Instructlab core repository after use and review by users and developers. |
| 8 | + |
| 9 | +## Motivation for this Proposal |
| 10 | + |
| 11 | +Instructlab does an excellent job creating data samples based on a document in the SDG process. |
| 12 | +The SDG process ingests qna.yaml files and documents. After ingestion, SDG chunks those documents and creates samples based on those chunks. |
| 13 | + |
| 14 | +Generating qna.yaml files is currently done by hand and is a major pain point for users. |
| 15 | +Additionally, the content of source documents varies widely. Documents can contain graphs, mathematical notation, pictures, tables, and other symbols that carry information users want to teach models. |
| 16 | +Even after documents are converted by a document conversion tool like [docling](https://github.com/DS4SD/docling), some information is not rendered correctly and needs to be fixed manually. |
| 17 | +In large documents, manual fixes are very labor-intensive. |
| 18 | +Creating great samples hinges on the documents being converted properly. To push the boundaries of Instructlab's utility, ingesting large, complex documents is a must. |
| 19 | + |
| 20 | +This repository is a place where the community can submit personal scripts that aid data preparation before the SDG step, as well as other parts of the Instructlab process. |
| 21 | +These scripts may be related to assessing document readiness, document conversion for certain types of documents, automated document review and fixing, inspecting generated data, and automated qna.yaml creation. |
| 22 | +Many users and community members already have scripts they use day to day when using Instructlab. The `utils` repo would be a place where the maintainers of the project can collect and curate them for the benefit of the community. |
| 23 | + |
| 24 | +## Additional Info |
| 25 | + |
| 26 | +A few areas of focus for scripts that will be added to the repository are: |
| 27 | + |
| 28 | +- Assessing document readiness, given the known limitations of docling |
| 29 | +- Automating qna.yaml creation |
| 30 | +- Automating document review and fixing |
| 31 | + |
| 32 | +The docs, testing, CI, and release cadence for this repository will be determined iteratively. |