SDG fails on markdown table with preceding text #548

@cfchase

Describe the bug
When a source document contains a table preceded by text, SDG fails with `failed to generate data with exception: list index out of range`.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Markdown file in a git repo, such as
     https://github.com/cfchase/sample-md/blob/main/README.md:

     ```markdown
     Hello World

     | Hello | Hello |
     |-------|-------|
     | World | World |
     ```

  2. Create a qna.yaml in your taxonomy referring to the Markdown file, such as
     https://github.com/cfchase/sample-md/blob/main/qna.yaml:

     ```yaml
     #~/.local/share/instructlab/taxonomy/knowledge/qna.yaml
     ...snip...
     document:
       repo: 'https://github.com/cfchase/sample-md.git'
       commit: b5bbdd7516fd5f06956f2a1e3f207790a750c00e
       patterns:
         - 'README.md'
     ```

  3. Run `ilab data generate`.
  4. See the error: `failed to generate data with exception: list index out of range`
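For illustration only: the error message suggests a table-parsing step that indexes into a cell list which turns out to be empty when prose precedes the table. The sketch below is hypothetical (`table_cells` and `parse_table` are made-up names, not instructlab-sdg internals) and simply reproduces the same `list index out of range` failure mode with the repro document:

```python
def table_cells(line):
    # "| A | B |" -> ["A", "B"]; a prose line has no pipes, so it yields []
    return [cell.strip() for cell in line.split("|")[1:-1]]

def parse_table(md_lines):
    # Hypothetical buggy parser: it assumes the document's first line is the
    # table header row. When the document starts with prose instead,
    # table_cells() returns an empty list and header[0] raises
    # IndexError("list index out of range").
    header = table_cells(md_lines[0])
    first_column = header[0]  # IndexError here when md_lines[0] is prose
    body = [table_cells(l) for l in md_lines[2:] if l.startswith("|")]
    return first_column, body

# The README.md from the reproduction steps, as a list of lines.
doc = [
    "Hello World",        # preceding text that triggers the failure
    "",
    "| Hello | Hello |",
    "|-------|-------|",
    "| World | World |",
]

try:
    parse_table(doc)
except IndexError as exc:
    print(f"failed to generate data with exception: {exc}")
    # -> failed to generate data with exception: list index out of range
```

A document that begins directly with the table parses fine under the same code, which matches the observed behavior that only tables with preceding text fail.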

Expected behavior
SDG continues past the document ingestion

Command Used
`ilab data generate --pipeline=simple`

Device Info (please complete the following information):

  • Hardware Specs: Apple M3 Pro Chip, 36 GB Memory
  • OS Version: macOS 15.3
  • Python Version: Python 3.11.9
  • InstructLab Version:
  sys.version: 3.11.9 (main, Aug 26 2024, 10:26:18) [Clang 15.0.0 (clang-1500.3.9.4)]
  sys.platform: darwin
  os.name: posix
  platform.release: 24.3.0
  platform.machine: arm64
  platform.node: cchase-mac
  platform.python_version: 3.11.9
  platform.cpu_brand: Apple M3 Pro
  memory.total: 36.00 GB
  memory.available: 12.11 GB
  memory.used: 18.85 GB

InstructLab:
  instructlab.version: 0.23.0rc1.dev124
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.1.dev46
  instructlab-training.version: 0.7.0

Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: NO AVX
  torch.version.cuda: None
  torch.version.hip: None
  torch.cuda.available: False
  torch.backends.cuda.is_built: False
  torch.backends.mps.is_built: True
  torch.backends.mps.is_available: True

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: True

Labels: bug