SDG fails on markdown table with preceding text #548

@cfchase

Describe the bug
When a source document contains a table preceded by text, SDG fails with `failed to generate data with exception: list index out of range`.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Markdown file in a git repo, such as
     https://github.com/cfchase/sample-md/blob/main/README.md:

     ```markdown
     Hello World

     | Hello | Hello |
     |-------|-------|
     | World | World |
     ```

  2. Create a qna.yaml in your taxonomy referring to the Markdown file, such as
     https://github.com/cfchase/sample-md/blob/main/qna.yaml:

     ```yaml
     #~/.local/share/instructlab/taxonomy/knowledge/qna.yaml
     ...snip...
     document:
       repo: 'https://github.com/cfchase/sample-md.git'
       commit: b5bbdd7516fd5f06956f2a1e3f207790a750c00e
       patterns:
         - 'README.md'
     ```

  3. Run `ilab data generate`.
  4. See the error: `failed to generate data with exception: list index out of range`
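For illustration only: the error message suggests a table-parsing step that indexes into a cell list which turns out to be empty when prose precedes the table. The sketch below is hypothetical (`table_cells` and `parse_table` are made-up names, not instructlab-sdg internals) and simply reproduces the same `list index out of range` failure mode with the repro document:

```python
def table_cells(line):
    # "| A | B |" -> ["A", "B"]; a prose line has no pipes, so it yields []
    return [cell.strip() for cell in line.split("|")[1:-1]]

def parse_table(md_lines):
    # Hypothetical buggy parser: it assumes the document's first line is the
    # table header row. When the document starts with prose instead,
    # table_cells() returns an empty list and header[0] raises
    # IndexError("list index out of range").
    header = table_cells(md_lines[0])
    first_column = header[0]  # IndexError here when md_lines[0] is prose
    body = [table_cells(l) for l in md_lines[2:] if l.startswith("|")]
    return first_column, body

# The README.md from the reproduction steps, as a list of lines.
doc = [
    "Hello World",        # preceding text that triggers the failure
    "",
    "| Hello | Hello |",
    "|-------|-------|",
    "| World | World |",
]

try:
    parse_table(doc)
except IndexError as exc:
    print(f"failed to generate data with exception: {exc}")
    # -> failed to generate data with exception: list index out of range
```

A document that begins directly with the table parses fine under the same code, which matches the observed behavior that only tables with preceding text fail.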

Expected behavior
SDG continues past the document ingestion

Command Used
`ilab data generate --pipeline=simple`

Device Info (please complete the following information):

  • Hardware Specs: Apple M3 Pro Chip, 36 GB Memory
  • OS Version: macOS 15.3
  • Python Version: Python 3.11.9
  • InstructLab Version:
  sys.version: 3.11.9 (main, Aug 26 2024, 10:26:18) [Clang 15.0.0 (clang-1500.3.9.4)]
  sys.platform: darwin
  os.name: posix
  platform.release: 24.3.0
  platform.machine: arm64
  platform.node: cchase-mac
  platform.python_version: 3.11.9
  platform.cpu_brand: Apple M3 Pro
  memory.total: 36.00 GB
  memory.available: 12.11 GB
  memory.used: 18.85 GB

InstructLab:
  instructlab.version: 0.23.0rc1.dev124
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.1.dev46
  instructlab-training.version: 0.7.0

Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: NO AVX
  torch.version.cuda: None
  torch.version.hip: None
  torch.cuda.available: False
  torch.backends.cuda.is_built: False
  torch.backends.mps.is_built: True
  torch.backends.mps.is_available: True

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: True

Labels: bug