Commit 00035a9

Merge pull request #4 from SamPlaysKeys/feature/multifile-output
Feature/multifile output
2 parents b2de0c5 + 8653378 commit 00035a9

2 files changed: +158 -53 lines changed

README.md

Lines changed: 53 additions & 15 deletions
@@ -9,8 +9,11 @@ It’s nothing too fancy, but it helps bridge the gap when you need to pull code
 ## Features

 * **Batch Processing:** Convert a whole folder of Markdown files into corresponding YAML files.
-* **Single File Mode:** targeted extraction of a single file to a specific output name.
-* **Custom Triggers:** You can define the specific text string (e.g., `## Solution`) that tells the script where to start looking for code.
+* **Single File Mode:** Targeted extraction of a single file to a specific output name.
+* **Custom Triggers:** Define specific text strings or a list of strings (e.g., `## Solution`) to locate code blocks.
+* **Multiple Blocks:** Extract every code block after a trigger using `-a` or `--all`.
+* **Concatenation:** Combine multiple extracted blocks into one YAML file using `-c` or `--concat`.
+* **Output Placeholders:** Use `{filename}` and `{firstword}` in output paths for dynamic file naming.

 ## Setup

@@ -19,24 +22,26 @@ It’s nothing too fancy, but it helps bridge the gap when you need to pull code

 ```bash
 pip install -r requirements.txt
-
 ```

 ## How to Use

-The script `extract_cli.py` is run from the command line. It accepts three arguments:
+The script `extract_cli.py` is run from the command line. It accepts several arguments:

 * `-i` or `--input`: The path to a file or a folder of files.
-* `-o` or `--output`: The destination path.
-* `--trigger`: (Optional) The specific string to look for. Defaults to `## Solution`.
+* `-o` or `--output`: The destination path (supports `{filename}` and `{firstword}` placeholders).
+* `-t` or `--trigger`: (Optional) The string(s) to look for. Defaults to `YAML`.
+    * Single trigger: `--trigger "## Solution"`
+    * Multiple triggers: `--trigger "([Trigger 1], [Trigger 2])"`
+* `-a` or `--all`: (Optional) Extract all code blocks found after the trigger.
+* `-c` or `--concat`: (Optional) When using `-a`, combine all blocks into a single YAML file instead of splitting them.
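
To make the two `--trigger` forms concrete, here is a minimal sketch of how such a value could be split into a list of literal strings. It follows the parsing added to `extract_cli.py` in this commit, but `parse_triggers` is a hypothetical helper name used only for illustration; the script does this inline.

```python
import re


def parse_triggers(raw: str) -> list[str]:
    """Split a --trigger value into a list of literal trigger strings.

    Hypothetical helper (not part of the script): a value wrapped in
    parentheses is read as "([A], [B])", anything else is one trigger.
    """
    if raw.startswith("(") and raw.endswith(")"):
        # Each trigger is the text between one pair of square brackets.
        return re.findall(r"\[(.*?)\]", raw)
    return [raw]


print(parse_triggers("## Solution"))
print(parse_triggers("([Metadata], [Configuration])"))
```

Under this scheme the brackets are delimiters, so `([Metadata], [Configuration])` searches for the words `Metadata` and `Configuration`. A parenthesised value with no brackets, such as `(note)`, would yield an empty trigger list.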

 ### 1. Processing a specific file

 Use this if you want to convert one `.md` file into one specific `.yaml` file.

 ```bash
 python extract_cli.py -i ./docs/my_notes.md -o ./exports/output_data.yaml
-
 ```

 ### 2. Processing a whole folder (Batch)
@@ -46,36 +51,69 @@ Use this if you have a directory of markdown files. The script will create a YAM
 ```bash
 # Takes all .md files in 'raw_docs' and saves .yaml files to 'processed_data'
 python extract_cli.py -i ./raw_docs -o ./processed_data
+```
+
+### 3. Extracting Multiple Blocks
+
+If a file has multiple code blocks after a trigger, use `-a` to get them all. By default, they are saved as separate files (`output.yaml`, `output_2.yaml`, etc.).

+```bash
+python extract_cli.py -a -i ./docs/file.md -o ./output/data.yaml
 ```
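
The `_N` naming used for split output can be sketched as follows. This mirrors the suffix logic in this commit's `save_extracted_blocks`, but `split_output_paths` is a hypothetical name chosen for the example, and it only computes paths rather than writing files.

```python
import os


def split_output_paths(dest_path: str, block_count: int) -> list[str]:
    """Return the file names used when each block gets its own file.

    Hypothetical helper: the first block keeps dest_path unchanged, and
    later blocks get _2, _3, ... inserted before the extension.
    """
    base, ext = os.path.splitext(dest_path)
    paths = []
    for i in range(block_count):
        paths.append(dest_path if i == 0 else f"{base}_{i + 1}{ext}")
    return paths


print(split_output_paths("./output/data.yaml", 3))
# ['./output/data.yaml', './output/data_2.yaml', './output/data_3.yaml']
```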

-### 3. Changing the search trigger
+### 4. Concatenating Blocks

-If your markdown files use a different header or string to denote where the code starts (e.g., `### Code Example`), you can override the default:
+To combine all blocks from one file into a single multi-document YAML file:

 ```bash
-python extract_cli.py -i ./docs -o ./output --trigger "### Code Example"
+python extract_cli.py -a -c -i ./docs/file.md -o ./output/combined.yaml
+```
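
How `-a` and `-c` interact can be summarised with a small sketch. The flag definitions match the ones added in this commit, while `build_parser`, `output_mode`, and the returned labels are hypothetical and exist only to show the decision: without `-a` only the first block is kept, and splitting happens only when `-a` is set and `-c` is not.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Only the two flags relevant here; the real script also takes -i, -o, -t.
    parser = argparse.ArgumentParser()
    parser.add_argument("-a", "--all", action="store_true")
    parser.add_argument("-c", "--concat", action="store_true")
    return parser


def output_mode(argv: list[str]) -> str:
    """Describe what would be written for a given set of flags."""
    args = build_parser().parse_args(argv)
    if not args.all:
        # Without -a the block list is truncated to its first entry.
        return "first block only"
    return "one combined file" if args.concat else "one file per block"


print(output_mode([]))            # first block only
print(output_mode(["-a"]))        # one file per block
print(output_mode(["-a", "-c"]))  # one combined file
print(output_mode(["-c"]))        # first block only
```

Note that `-c` on its own changes nothing in this sketch, since only one block is kept when `-a` is absent.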
+
+### 5. Using Multiple Triggers

+You can search for multiple different markers in the file:
+
+```bash
+python extract_cli.py --trigger "([Metadata], [Configuration])" -i ./docs -o ./output
 ```
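
As a rough illustration of what searching for several triggers means, the sketch below collects every fenced block that follows the first occurrence of each trigger and deduplicates blocks by their position in the text. It is modelled on this commit's `extract_from_file` but works on an in-memory string; `extract_blocks` and the sample document are assumptions made for the example, and unlike the script it returns an empty list rather than `None` when nothing matches.

```python
import re

# Same pattern the script uses: optional language tag, then the block body.
CODE_BLOCK = re.compile(r"```(?:\w+)?\n(.*?)```", re.DOTALL)


def extract_blocks(content: str, triggers: list[str]) -> list[str]:
    """Return code blocks found after any trigger, in document order."""
    unique = {}  # absolute start offset -> block text
    for trigger in triggers:
        pos = content.find(trigger)  # first literal occurrence only
        if pos == -1:
            continue
        start = pos + len(trigger)
        # Passing a start position keeps match offsets absolute.
        for match in CODE_BLOCK.finditer(content, start):
            unique.setdefault(match.start(), match.group(1).strip())
    return [unique[k] for k in sorted(unique)]


FENCE = "`" * 3  # built at runtime so this example stays a single fenced block
sample = "\n".join([
    "## Metadata",
    f"{FENCE}yaml",
    "name: demo",
    FENCE,
    "## Configuration",
    f"{FENCE}yaml",
    "debug: true",
    FENCE,
])

print(extract_blocks(sample, ["Metadata", "Configuration"]))
# ['name: demo', 'debug: true']
```

The second block is reachable from both triggers but appears once in the result, because both scans report it at the same offset.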

-## Example
+## Examples

+### Dynamic Batch Output
 ```bash
-python extract_cli.py -i ./docs -o ./output/{filename}/output.yaml
+python extract_cli.py -a -i ./docs -o ./output/{filename}/output.yaml
 ```

+**Structure:**
 ```
 docs/
-├── file1.md
-└── file2.md
+├── file1.md (contains 2 blocks)
+└── file2.md (contains 1 block)

 output/
 ├── file1/
-│   └── output.yaml
+│   ├── output.yaml
+│   └── output_2.yaml
 └── file2/
     └── output.yaml
 ```
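
The placeholder substitution behind these paths can be sketched like this. The replacement loop follows the `replacements` dictionary visible in this commit, but `expand_output_path` is a hypothetical wrapper, and because the diff does not show how `first_word` is derived it is passed in as a plain argument here.

```python
import os


def expand_output_path(template: str, source_path: str, first_word: str) -> str:
    """Fill {filename} and {firstword} in an output path template.

    Hypothetical helper: file_stem is the source file name without its
    extension; first_word is supplied by the caller (an assumption).
    """
    file_stem = os.path.splitext(os.path.basename(source_path))[0]
    replacements = {
        "{filename}": file_stem,
        "{firstword}": first_word,
    }
    for placeholder, replacement in replacements.items():
        template = template.replace(placeholder, replacement)
    return template


print(expand_output_path("./output/{filename}/output.yaml", "./docs/file1.md", "intro"))
# ./output/file1/output.yaml
print(expand_output_path("./output/{firstword}_{filename}.yaml", "./docs/file1.md", "intro"))
# ./output/intro_file1.yaml
```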

+### Concatenated Batch Output
+```bash
+python extract_cli.py -a -c -i ./docs -o ./output/{filename}_all.yaml
+```
+
+**Structure:**
+```
+docs/
+├── file1.md (contains 2 blocks)
+└── file2.md (contains 1 block)
+
+output/
+├── file1_all.yaml (multi-document YAML)
+└── file2_all.yaml
+```
+
 ## Future Plans

 I am planning to do more work on this as I learn more. Some features I'm hoping to add include:

extract_cli.py

Lines changed: 105 additions & 38 deletions
@@ -18,25 +18,44 @@ class IndentedDumper(yaml.Dumper):
     def increase_indent(self, flow=False, indentless=False):
         return super(IndentedDumper, self).increase_indent(flow, False)

-def extract_from_file(file_path, trigger_string):
+def extract_from_file(file_path, triggers):
     """
-    Helper function to read a file and extract the first code block after the trigger.
-    Returns a list containing the first code block (string) or None if no trigger found.
+    Helper function to read a file and extract code blocks after the triggers.
+    Returns a combined list of code blocks (strings) found for all triggers,
+    deduplicated by their position in the file.
     """
     try:
         with open(file_path, "r", encoding="utf-8") as f:
             content = f.read()

-        if trigger_string in content:
-            # Split and take content after the trigger
-            relevant_content = content.split(trigger_string, 1)[1]
-
-            # Regex for code blocks
-            code_block_pattern = re.compile(r"```(?:\w+)?\n(.*?)```", re.DOTALL)
-            match = code_block_pattern.search(relevant_content)
-
-            return [match.group(1).strip()] if match else []
-        return None
+        unique_blocks = {} # Map start_index -> block_content
+
+        # Ensure triggers is a list
+        if isinstance(triggers, str):
+            triggers = [triggers]
+
+        for trigger in triggers:
+            # Find the first occurrence of the trigger
+            # We use re.escape to treat the trigger string literally
+            trigger_match = re.search(re.escape(trigger), content)
+            if trigger_match:
+                start_search_pos = trigger_match.end()
+                relevant_content = content[start_search_pos:]
+
+                # Regex for code blocks
+                code_block_pattern = re.compile(r"```(?:\w+)?\n(.*?)```", re.DOTALL)
+
+                for match in code_block_pattern.finditer(relevant_content):
+                    # Calculate absolute position in the file to identify unique blocks
+                    abs_start = start_search_pos + match.start()
+
+                    if abs_start not in unique_blocks:
+                        unique_blocks[abs_start] = match.group(1).strip()
+
+        # Return blocks sorted by their appearance in the file
+        sorted_blocks = [unique_blocks[k] for k in sorted(unique_blocks.keys())]
+        return sorted_blocks if sorted_blocks else None
+
     except Exception as e:
         print(f"Error reading {file_path}: {e}")
         return None
@@ -56,35 +75,75 @@ def main():
         help="Path to output file (if input is file) OR output directory (if input is dir)."
     )
     parser.add_argument(
-        "--trigger",
+        "-t", "--trigger",
         default="YAML",
         help="The string to search for. Defaults to the string 'YAML'."
     )
+    parser.add_argument(
+        "-a", "--all",
+        action="store_true",
+        help="Extract all code blocks into separate files. If not set, only the first block is extracted."
+    )
+    parser.add_argument(
+        "-c", "--concat",
+        action="store_true",
+        help="Concatenate multiple code blocks into a single YAML file. Used with -a."
+    )

     args = parser.parse_args()

+    # --- TRIGGER PARSING ---
+    triggers = []
+    if args.trigger.startswith("(") and args.trigger.endswith(")"):
+        # Format: ([Trigger 1], [Trigger 2])
+        # Find content inside brackets
+        raw_triggers = re.findall(r'\[(.*?)\]', args.trigger)
+        triggers = raw_triggers
+    else:
+        triggers = [args.trigger]
+
     # --- LOGIC ---

-    def save_extracted_blocks(blocks, dest_path):
+    def save_single_file(data, path):
+        """Helper to write a list of data objects to a single YAML file."""
+        with open(path, "w", encoding="utf-8") as outfile:
+            if len(data) == 1:
+                yaml.dump(data[0], outfile, Dumper=IndentedDumper, default_flow_style=False, allow_unicode=True, sort_keys=False)
+            else:
+                yaml.dump_all(data, outfile, Dumper=IndentedDumper, default_flow_style=False, allow_unicode=True, sort_keys=False)
+
+    def save_extracted_blocks(blocks, dest_path, split_files=False):
         """
-        Parses extracted blocks as YAML and saves them to dest_path.
-        Unwraps single blocks to be the root object.
+        Parses extracted blocks as YAML and saves them.
+        If split_files is True, saves each block to a separate file (appending _N).
+        Otherwise, saves all blocks to the single dest_path.
         """
-        data_to_dump = []
+        # Parse all blocks first
+        parsed_blocks = []
         for b in blocks:
+            block_data = []
             try:
-                # Use safe_load_all to handle potential multi-document blocks
                 for doc in yaml.safe_load_all(b):
-                    data_to_dump.append(doc)
+                    block_data.append(doc)
             except Exception as e:
-                print(f"Warning: Failed to parse block as YAML in {dest_path}. Keeping as string. Error: {e}")
-                data_to_dump.append(b)
+                print(f"Warning: Failed to parse block as YAML. Keeping as string. Error: {e}")
+                block_data.append(b)
+            parsed_blocks.append(block_data)

-        with open(dest_path, "w", encoding="utf-8") as outfile:
-            if len(data_to_dump) == 1:
-                yaml.dump(data_to_dump[0], outfile, Dumper=IndentedDumper, default_flow_style=False, allow_unicode=True)
-            else:
-                yaml.dump_all(data_to_dump, outfile, Dumper=IndentedDumper, default_flow_style=False, allow_unicode=True)
+        if split_files:
+            base, ext = os.path.splitext(dest_path)
+            for i, data in enumerate(parsed_blocks):
+                # Construct filename: file.yaml, file_2.yaml, file_3.yaml...
+                if i == 0:
+                    current_path = dest_path
+                else:
+                    current_path = f"{base}_{i+1}{ext}"
+
+                save_single_file(data, current_path)
+        else:
+            # Flatten all parsed data into one list for a single file
+            all_data = [item for sublist in parsed_blocks for item in sublist]
+            save_single_file(all_data, dest_path)

     # CASE 1: Input is a Directory
     if os.path.isdir(args.input):
@@ -99,9 +158,7 @@ def save_extracted_blocks(blocks, dest_path):
             target_path = args.output
             replacements = {
                 "{filename}": file_stem,
-                "{firstword}": first_word,
-                "$filename": file_stem,
-                "$firstword": first_word
+                "{firstword}": first_word
             }

             for placeholder, replacement in replacements.items():
@@ -120,28 +177,38 @@ def save_extracted_blocks(blocks, dest_path):
             if dest_dir and not os.path.exists(dest_dir):
                 os.makedirs(dest_dir, exist_ok=True)

-            blocks = extract_from_file(source_path, args.trigger)
+            blocks = extract_from_file(source_path, triggers)

             if blocks:
-                save_extracted_blocks(blocks, dest_path)
-                print(f"Processed: {filename} -> {dest_path}")
+                # Filter blocks based on -a flag
+                if not args.all:
+                    blocks = blocks[:1]
+
+                should_split = args.all and not args.concat
+                save_extracted_blocks(blocks, dest_path, split_files=should_split)
+                print(f"Processed: {filename} -> {dest_path} (Split: {should_split})")
                 count += 1
-        print(f"--- Batch Complete. Created {count} YAML files. ---")
+        print(f"--- Batch Complete. Processed {count} source files. ---")

     # CASE 2: Input is a Single File
     elif os.path.isfile(args.input):
-        blocks = extract_from_file(args.input, args.trigger)
+        blocks = extract_from_file(args.input, triggers)

         if blocks:
             # If output path is a directory, verify valid filename provided or derive it
             if os.path.isdir(args.output):
                 print("Error: Input is a file, but output is a directory. Please specify a full output filename.")
                 sys.exit(1)

-            save_extracted_blocks(blocks, args.output)
-            print(f"Success! Extracted to '{args.output}'")
+            # Filter blocks based on -a flag
+            if not args.all:
+                blocks = blocks[:1]
+
+            should_split = args.all and not args.concat
+            save_extracted_blocks(blocks, args.output, split_files=should_split)
+            print(f"Success! Extracted to '{args.output}' (Split: {should_split})")
         else:
-            print(f"No blocks found in '{args.input}' after trigger '{args.trigger}'")
+            print(f"No blocks found in '{args.input}' with provided triggers.")

     else:
         print("Error: Input path does not exist.")
