A robust ETL (Extract, Transform, Load) pipeline built on LibreOffice Basic scripting that transforms unstructured dictionary data output from the 10ten Japanese Reader browser extension, by Birchill (dictionary data from JMdict/EDICT, KANJIDIC and JMnedict/ENAMDICT), into normalized, structured LibreOffice Calc spreadsheet data.
- Complete ETL Workflow: Validates, parses, transforms, loads, and formats data in a single execution
- Data Normalization: Converts semi-structured text into structured LibreOffice Calc columns
- Error Handling: Multi-layer validation to maintain data integrity
- Zero Dependencies: Pure UNO API implementation, no external requirements
- Built-in Safeguards: 50-line limit per entry prevents infinite processing loops
- Dynamic Row Insertion: Automatically creates rows when entry components exceed available space
- Verbose Commentary: Built for LibreOffice Basic learning and ease of maintenance
(Tested on 10ten Japanese Reader 1.26.1, Mozilla Firefox 147.0.3, and LibreOffice 25.8.4.2)
git clone https://github.com/sanchesfilho/10tenkaizen
- Open LibreOffice Calc
- At the program's main window, navigate: Tools → Macros → Organize Macros → Basic... (keyboard shortcut: Alt + F11)
- On the BASIC Macros dialog box, navigate to My Macros → Standard → Module 1, and click "Edit". (Alternatively, you can create new Modules by clicking "Organizer...", navigating to My Macros → Standard, clicking "New", and clicking "Ok" on the "New Module" dialog box.)
- The "My Macros & Dialogs" window will appear. There will be a template of code, which may look something like this:
1 REM ***** BASIC *****
2
3 Sub Main
4
5 End Sub
6
- Select and delete all that pre-existing code in the module (e.g. by clicking on the code editor window and pressing Ctrl + A → Delete). Make sure that the cursor is positioned at line 1.
- Navigate to File → Import BASIC...
- On the Open File dialog, navigate to /10tenkaizen/, select "10tenKaizen_v1.0.0.bas" and click "Open"
- Click "Save" (keyboard shortcut: Ctrl + S). You can now run the macro by clicking the "Run" button (keyboard shortcut: F5) inside the "My Macros & Dialogs" window
- At the program's main window, navigate to Tools → Customize...
- Under the "Keyboard" tab, go into the "Category" panel, and navigate to Application Macros → My Macros → Standard → Module 1 (or other Module you have created).
- Select "Macro10tenKaizen" on the "Function" panel.
- Select your desired shortcut key or key combination inside the menu "Shortcut Keys", click "Assign", and click "OK".
- Copy Data: There are two main ways to copy an entry with the 10ten Japanese Reader browser extension:
- Method 1 (Menu navigation): Hover over a Japanese word → wait for the popup → click the word inside the popup → click "Entry" under the "Copy" tab.
- Method 2 (Keyboard shortcut): Hover over a word → wait for the popup → press the sequence C → E on your keyboard.
IMPORTANT: Other copying methods are NOT supported by this code. Only the "Copy entry" format produces bracketed readings (e.g. [かな]) which are required by this macro.
- Paste: Paste the copied 10ten Japanese Reader output (keyboard shortcut: Ctrl + V) into any LibreOffice Calc cell. The Text Import dialog is expected to appear automatically; then, click "OK". This spreads the output vertically across rows in the same column.
(If the Text Import dialog does not appear, use Paste Special (keyboard shortcut: Ctrl + Shift + V) and select "Use text import dialog").
You can paste either one or multiple 10ten Japanese Reader outputs, as long as they are all in the same column and there are no empty cells between them—the macro will automatically detect and process all consecutive entries until an empty cell is encountered. - Execute: Select the first cell of the first entry (i.e., the headword) and run
Macro10tenKaizenas explained above, either via the "My Macros & Dialogs" window or the keyboard shortcut you have assigned. - Result: The macro automatically parses, normalizes, structures, merges, numbers, and formats the data across columns A–E.
Raw Data Output From 10ten Japanese Reader:
仮名, 仮字, 假名 [かな]
(n) kana; Japanese syllabaries (i.e. hiragana and katakana)
Normalized Data Output Schema via 10tenKaizen:
| A | B | C | D | E |
|---|---|---|---|---|
1 |
仮名 |
仮字假名 |
かな |
(n) kana; Japanese syllabaries (i.e. hiragana and katakana) |
Auto-generatedgroup identifier |
Headword |
Alternative forms |
Reading |
Definitions |
- Parse: Tokenize the first line from the output ("仮名, 仮字, 假名 [かな]") into semantic components
- Extract: Isolate headword, alternative writings, and enclosed reading ("かな") from parsed tokens
- Collect: Aggregate definition lines vertically below the main entry (row 2 onwards) in the source column
- Calculate: Determine required block height based on whichever is largest: word count, reading count, or definition count
- Expand: Dynamically insert rows if calculated height exceeds available space between entries
- Distribute: Map parsed elements to normalized columns B–E
- Number: Assign sequential Group ID numbers to column A (auto-incremented based on previous blocks)
- Structure: Apply cell merging (columns A–D) and border formatting for visual grouping
- Cleanup: Erase processed source data to maintain sheet cleanliness
- Iterate: Continue processing subsequent entries in the same manner until empty cell is encountered
Jayme Sanches Filho
- GitHub: @sanchesfilho
- Email: jssfo@protonmail.com
Distributed under the MIT License. See LICENSE for more information.