Skip to content

MdEhsanAhsan/CustomTextParser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

36 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ“‚ CustomTextParser

πŸ”„ Concordance .DAT File Toolkit

Easily convert and manipulate Concordance .DAT load files β€” perfect for legal e-discovery, metadata extraction, and bulk processing.


πŸ›  What It Does

A powerful Python CLI tool designed to handle complex .DAT files with custom delimiters (ΓΎ, control characters), broken encodings, and Excel-incompatible data.

This tool can:

  • βœ… Convert .DAT to .CSV
  • πŸ” Compare two .DAT files (with optional field mapping)
  • 🧠 Replace or remap headers
  • πŸ”— Merge multiple .DAT files intelligently
  • 🧹 Delete rows based on field values
  • 🎯 Extract and export selected fields

βš™οΈ Key Features

  • Handles Concordance .DAT files with embedded line breaks
  • Supports various encodings: UTF-8, UTF-16, Windows-1252, and more
  • Robust parsing even with Excel’s 32,767 character cell limit
  • CLI-first design β€” ideal for automation and scripting

πŸš€ Use Cases

  • Legal eDiscovery processing
  • Metadata cleanup and normalization
  • Custom conversions and field extraction
  • Comparing vendor-delivered load files

πŸ“¦ Installation

πŸ“₯ Download EXE

Clone the repo

git clone https://github.com/yourusername/dat-file-tool.git
cd dat-file-tool

Install dependencies (optional)

This tool uses only built-in libraries β€” no external packages required!


✨ Features

  • βœ… Convert .dat to .csv or keep as .dat
  • πŸ”€ Compare two .dat files (with optional header mapping)
  • 🧹 Delete specific rows from .dat using a value list
  • πŸ” Merge .dat files by common headers
  • πŸ”€ Auto-detect encoding (UTF-8, UTF-16, Windows-1252, Latin-1)
  • πŸ’¬ Smart line reader handles embedded newlines and quoted fields
  • πŸ“ Output directory support via -o DIR
  • ⚠️ Excel field-length warning for long text fields (>32,767 chars)
  • 🎯 Select only specific fields from a DAT file using --select
Feature Description
--csv Export DAT file to CSV format (Comma Separated Value)
--tsv Export DAT file to TSV format (Tab Separated Value)
--dat Export to DAT format (default if none specified)
-c, --compare Compare two DAT files line-by-line
-r, --replace-header Replace headers using a mapping file (old_header,new_header)
--merge Merge multiple DAT files grouped by matching headers
--delete Delete rows based on field values listed in a file
--select Export only selected fields from the DAT file
-o DIR Specify output directory for generated files

πŸ§ͺ Usage Examples

πŸ” Convert DAT to CSV / TSV

python Main.py input.dat --csv
# Output: input_converted.csv

python Main.py input.dat --tsv
# Output: input_converted.tsv

You can also specify custom output paths:

python Main.py input.dat --csv output.csv

πŸ†š Compare Two DAT Files

python Main.py file1.dat file2.dat -c

With optional header mapping:

python Main.py file1.dat file2.dat -c -m mapping.csv

Outputs differences to value_diff.csv (or .dat).


πŸ”„ Replace Headers Using Mapping File

Create a mapping file (mapping.csv) like:

OldHeader1,NewHeader1
OldHeader2,NewHeader2

Run:

python Main.py input.dat -r mapping.csv --csv
# Output: input_Replaced.csv

πŸ“¦ Merge Multiple DAT Files

Create a merge list file (merge_list.csv) containing one file path per line:

file1.dat
file2.dat
file3.dat

Then run:

python Main.py --merge merge_list.csv
# Outputs: merged_group_1.dat, merged_group_2.dat, etc.

Also creates a log file: merged_group_log.csv.


πŸ—‘οΈ Delete Rows Based on Field Value

Create a delete file (delete.csv) with the field name on the first line and values to delete below:

ID
1001
1003
1007

Run:

python Main.py input.dat --delete delete.csv --csv
# Outputs: input{kept}.csv and input{removed}.csv

πŸ” Select Only Specific Fields

Create a selection file (select.txt) with one header per line:

Name
Age
City

Run:

python Main.py input.dat --select select.txt --csv
# Output: input_selected.csv

βš™οΈ Optional Arguments

Flag Description
-o DIR Set output directory
--help Show help message

πŸ“¦ Output Files

  • All exports go to the directory specified by -o, or default to the input file's folder.
  • Output filenames include tags like {kept}, {removed}, or _Replaced.

πŸ’‘ Encoding Detection Logic

Handles:

  • UTF-8 BOM
  • UTF-16 LE/BE
  • Windows-1252 (via printable category)
  • Latin-1 fallback

πŸ§ͺ Excel Limit Check

Warns if any field exceeds Excel's max cell limit (32,767 chars).


πŸ“ Requirements

  • Python 3.7+
  • No external libraries (uses standard library only)

🧰 Development Tips

VS Code Debug Setup (optional)

Add .vscode/launch.json:

{
  "name": "Debug Merge Example",
  "type": "python",
  "request": "launch",
  "program": "${workspaceFolder}/Main.py",
  "console": "integratedTerminal",
  "args": [
    "--merge", "File_list.csv", "--csv", "-o", "merged/"
  ]
}

🀝 Contributing

Feel free to fork, enhance, or report issues! Contributions are welcome πŸ’¬


πŸ‘€ Author

Md Ehsan Ahsan πŸ“§ MyGitHub πŸ› οΈ Built with love using Python 🐍


⚠️ Disclaimer

This tool is provided as-is without any warranties.
Use it at your own risk.
I am not responsible if it eats your files, breaks your computer, or ruins your spreadsheet.

πŸš€ But Hey, if it helps you automate the boring stuff β€” you're welcome! πŸ˜„


πŸ“ License

This project is free to use under the MIT License.