How to Safely Rescue Mixed-Charset Files in Bulk: Why I Built This Tool
Introduction
I built this tool for one simple reason: handling this manually, every single time, was exhausting.
CSV files from internal systems, CSV converted from Excel, and configuration files exported from Linux, Windows, or databases all arrive with different origins. Whenever the origin changes, the character encoding, line endings, and BOM usage drift too.
- I expect UTF-8, but it is Shift_JIS
- I expect LF, but it is CRLF
- Some systems require BOM, while others break when BOM exists
Relying on human eyes and intuition for detection and conversion does not scale. So I built a tool that accepts files from any origin and normalizes output to the exact import requirements.
The real pain is not one broken file
If only one file is corrupted, we can usually recover it. The hard case is when 10 or 100 files arrive together from mixed origins.
Even worse, each destination system has a different definition of “correct.”
- System A accepts only UTF-8 without BOM
- System B is stable with UTF-8 with BOM
- System C assumes Shift_JIS and fails import even if UTF-8 looks readable
In that situation, the generic advice “just use UTF-8” is useless. What we need is a safe, repeatable operation tuned to the destination.
What I wanted this tool to do
The design has only three principles.
- Accept files from any origin
- Convert based on destination import requirements
- Avoid increasing incidents even in batch processing
This is not a showcase of conversion tricks. It is an operational tool that reduces friction.
The procedure I actually use
1. Decide the output specification first
Do not start from input assumptions; lock output requirements first. Fix these three items:
- Encoding (UTF-8 / Shift_JIS, etc.)
- Line endings (LF / CRLF)
- BOM (with / without)
If this stays ambiguous, results will change whenever the operator changes.
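One way to keep the decision from drifting between operators is to pin the three items in an immutable spec object. This is a hypothetical sketch (the names `OutputSpec`, `SYSTEM_A`, and `SYSTEM_B` are illustrative, not part of the actual tool):

```python
from dataclasses import dataclass

# Lock the three output decisions in one frozen object so every run,
# by every operator, uses exactly the same values.
@dataclass(frozen=True)
class OutputSpec:
    encoding: str   # e.g. "utf-8" or "shift_jis"
    newline: str    # "\n" (LF) or "\r\n" (CRLF)
    bom: bool       # prepend a UTF-8 BOM or not

# One spec per destination system, written down once and reused.
SYSTEM_A = OutputSpec(encoding="utf-8", newline="\n", bom=False)
SYSTEM_B = OutputSpec(encoding="utf-8", newline="\r\n", bom=True)
```

Because the dataclass is frozen, nobody can quietly tweak a field mid-run; changing a destination's requirements means writing a new spec, which is visible in review.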
2. Split conversion targets into small groups
Do not process everything at once. Split by system, time period, or file type, and start with small batches.
Why: rollback is possible when something fails. A full-volume one-shot run is fast only when it succeeds.
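The batching itself can be trivial. A minimal sketch, assuming paths have already been grouped by system or file type (the helper name `batches` is illustrative):

```python
from itertools import islice

def batches(paths, size):
    """Yield lists of at most `size` paths, so a failed batch
    can be inspected and rolled back on its own."""
    it = iter(paths)
    while chunk := list(islice(it, size)):
        yield chunk
```

Running the first small batch end to end, including verification, before touching the rest is what makes rollback cheap.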
3. Freeze conditions and run bulk conversion
Use the Character Encoding Converter with fixed conditions for each destination. Do not tweak settings mid-run. The key is repeatability.
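The fixed-condition pass can be sketched as a single pure function: a fixed list of input-encoding candidates, and an output that follows the destination spec exactly. This is an illustrative helper under those assumptions, not the Character Encoding Converter's actual implementation (note that Shift_JIS decoding can succeed on bytes that are not really Shift_JIS, which is exactly why the verification step below the conversion matters):

```python
def convert(data: bytes,
            out_encoding: str = "utf-8",
            out_newline: str = "\n",
            with_bom: bool = False,
            in_candidates=("utf-8-sig", "shift_jis", "cp932")) -> bytes:
    """Decode with a fixed candidate list, then re-encode to the
    destination's encoding, newline, and BOM requirements."""
    text = None
    for enc in in_candidates:
        try:
            text = data.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        raise ValueError("no candidate encoding decoded cleanly")
    # Normalize all line endings to LF, then apply the destination newline.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\n", out_newline)
    out = text.encode(out_encoding)
    if with_bom and out_encoding.lower().startswith("utf-8"):
        out = b"\xef\xbb\xbf" + out
    return out
```

Because the candidate list and output parameters are function arguments rather than interactive choices, the same inputs always produce the same outputs, which is the repeatability the step calls for.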
4. Never rely on visual checks only
“The file opens fine” is how incidents begin. At minimum, verify:
- Row count matches before and after
- CSV/TSV column counts are preserved
- The replacement character (U+FFFD, �) count does not increase
- Key columns (ID/code) keep their length and character type
If any check fails, do not proceed to import.
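The first three checks can be automated in a few lines. A minimal sketch (the function name `verify` and its signature are assumptions; the key-column check is omitted because it depends on the file's schema):

```python
def verify(before: bytes, after: bytes,
           src_enc: str, dst_enc: str,
           delimiter: str = ",") -> list[str]:
    """Return a list of failed checks; empty list means safe to import."""
    old = before.decode(src_enc, errors="replace")
    new = after.decode(dst_enc, errors="replace")
    old_rows, new_rows = old.splitlines(), new.splitlines()
    problems = []
    if len(old_rows) != len(new_rows):
        problems.append("row count changed")
    elif any(a.count(delimiter) != b.count(delimiter)
             for a, b in zip(old_rows, new_rows)):
        problems.append("column count changed")
    # U+FFFD appearing where it was absent means data was destroyed.
    if new.count("\ufffd") > old.count("\ufffd"):
        problems.append("replacement characters increased")
    return problems
```

Counting delimiters is a deliberate simplification: it misses quoted fields containing commas, so for real CSV a proper parser (e.g. Python's `csv` module) is the safer choice. The point is that the checks are mechanical, not visual.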
BOM is not ideology; decide by counterpart requirements
Arguments about BOM vs no BOM are mostly pointless in operations. Only one thing matters: how the counterpart system consumes files.
- If the counterpart is stable with BOM, output with BOM
- If the counterpart breaks with BOM, output without BOM
Prioritize “a format that does not break the other side” over “the theoretically correct format.” That is practical operations.
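In Python this decision is literally one codec name: `"utf-8-sig"` writes (and, on read, strips) the BOM, while plain `"utf-8"` never emits one. A sketch with an illustrative helper name:

```python
def encode_for_destination(text: str, wants_bom: bool) -> bytes:
    """Encode per the counterpart's requirement, not per preference.
    'utf-8-sig' prepends the BOM (EF BB BF); plain 'utf-8' does not."""
    return text.encode("utf-8-sig" if wants_bom else "utf-8")
```

Keeping the flag in the destination's spec, rather than as a per-run choice, is what turns the BOM argument into a non-issue.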
The value is not conversion itself
The true value is twofold:
- Decisions do not vary by person
- Repeated verification cost goes down
Encoding incidents cannot be reduced to absolute zero. But we can stop repeating the same incident patterns. That is why I prepared a tool that accepts anything and returns files normalized for the target use.
Conclusion
This tool was not built from ideal theory. It was built to eliminate recurring, tedious encoding adjustments in real operations.
Accept files from different origins, align encoding/line endings/BOM to destination requirements, and return normalized outputs. Automating this flow alone makes operations much easier.
Mojibake handling is not a job to survive with willpower. Turn it into a reproducible procedure so anyone gets the same result. Only then can we say we can safely rescue mixed files in bulk.