DiffMate


How to Fix Garbled Text When Comparing Files (Encoding Guide)

April 1, 2025

Have you ever opened a file for comparison only to see garbled characters, or found that a perfectly normal file shows everything as "changed" in comparison results? In most cases, encoding issues are the cause.

This article provides an in-depth look at common encoding problems during file comparison, covering the history of character encoding, practical debugging techniques, and real-world solutions.

A Brief History of Character Encoding: From ASCII to UTF-8

Understanding the history of character encoding helps explain why these problems exist in the first place.

**ASCII (1963)**: Created in the early days of computing, ASCII used 7 bits to represent 128 characters. It covered English letters (upper and lowercase), digits, and basic punctuation, but had zero support for non-English scripts like Chinese, Korean, Japanese, Arabic, or Cyrillic.

**ISO-8859 Series (1987)**: Extended ASCII to 8 bits, allowing 256 characters. ISO-8859-1 (Latin-1) covered Western European languages, ISO-8859-2 handled Central European, and so on. The problem was that each region had its own standard, making it impossible to create documents mixing multiple scripts in a single file.

**Regional Encodings (1990s)**: Countries developed their own encoding systems. Korea created EUC-KR (based on KS X 1001, supporting 2,350 precomposed Hangul characters), Japan developed Shift_JIS and EUC-JP, and China adopted GB2312 and later GBK. These encodings were mutually incompatible. A file encoded in EUC-KR would display as garbage if opened with Shift_JIS. Legacy systems from this era continue to produce encoding headaches to this day.

**Unicode (1991)**: The ambitious project to unify all writing systems under a single standard. Unicode assigns each character a unique code point (e.g., U+AC00 = '가', U+4E2D = '中'). As of Unicode 15.0, it includes 149,186 characters spanning 161 scripts, covering everything from ancient Egyptian hieroglyphs to modern emoji.

**UTF-8 (1993)**: The most widely adopted encoding for Unicode. As of 2024, over 98% of all web pages use UTF-8. Its variable-length design maintains full backward compatibility with ASCII while supporting every Unicode character. This is the encoding you should use for all new files and projects.

How UTF-8 Actually Works: Variable-Length Encoding

Understanding UTF-8's byte-level mechanics is invaluable for debugging encoding issues.

UTF-8 uses 1 to 4 bytes per character, with the leading bits indicating how many bytes are in the sequence:

  • **1 byte (0xxxxxxx)**: ASCII range (U+0000 to U+007F). English letters, digits, basic punctuation. Example: 'A' = 0x41
  • **2 bytes (110xxxxx 10xxxxxx)**: U+0080 to U+07FF. Latin extensions, Greek, Cyrillic, Arabic, Hebrew. Example: 'ü' = 0xC3 0xBC
  • **3 bytes (1110xxxx 10xxxxxx 10xxxxxx)**: U+0800 to U+FFFF. CJK characters (Chinese, Japanese, Korean), most Asian scripts. Example: '가' = 0xEA 0xB0 0x80, '中' = 0xE4 0xB8 0xAD
  • **4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)**: U+10000 to U+10FFFF. Emoji, musical notation, historic scripts, mathematical symbols. Example: '😀' = 0xF0 0x9F 0x98 0x80

The key insight is that any byte starting with 0 is a single-byte ASCII character, any byte starting with 110, 1110, or 11110 begins a multi-byte sequence, and any byte starting with 10 is a continuation byte. This self-synchronizing property means that if you drop into the middle of a UTF-8 stream, you can always find the next character boundary by scanning forward for a byte that does not start with 10.
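The byte-classification rules above can be sketched in a few lines of Python (a minimal illustration, not a full decoder):

```python
def utf8_unit_length(first_byte: int) -> int:
    """Expected sequence length for a UTF-8 lead byte,
    or 0 for a continuation (or invalid) byte."""
    if first_byte < 0x80:           # 0xxxxxxx: single-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: 4-byte sequence
        return 4
    return 0                        # 10xxxxxx: continuation byte

def next_boundary(data: bytes, pos: int) -> int:
    """Self-synchronization: from any offset, scan forward
    to the next character boundary."""
    while pos < len(data) and utf8_unit_length(data[pos]) == 0:
        pos += 1
    return pos

data = "A가😀".encode("utf-8")  # 1 + 3 + 4 = 8 bytes
print([utf8_unit_length(b) for b in data])  # → [1, 3, 0, 0, 4, 0, 0, 0]
print(next_boundary(data, 2))               # → 4 (start of the emoji)
```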

This design means English text in UTF-8 is byte-identical to ASCII (a massive compatibility win), while CJK characters take 3 bytes each. Compared to EUC-KR where Korean characters use 2 bytes, UTF-8 files with Korean text are about 50% larger, but the universality trade-off is overwhelmingly worthwhile.

Encoding Comparison Table

Here is a practical comparison of encodings you are most likely to encounter:

**UTF-8**: Variable 1-4 bytes. The global standard. Supports all Unicode characters. ASCII-compatible. Default encoding for the web and modern systems.

**UTF-16**: Variable 2-4 bytes. Used internally by Windows, Java, and .NET. Typically prefixed with a BOM (Byte Order Mark) to signal byte order. Comes in Little Endian (LE) and Big Endian (BE) variants. Not ASCII-compatible.

**EUC-KR**: Variable 1-2 bytes. Korean-specific. Supports 2,350 precomposed Hangul characters. Still found in legacy Korean government systems and older websites.

**CP949 (MS949)**: Microsoft's extension of EUC-KR. Supports all 11,172 modern Hangul syllables. Default encoding on Korean Windows systems.

**GB2312 / GBK / GB18030**: Chinese encodings. GB2312 covers 6,763 characters, GBK extends to 21,886, and GB18030 is a full Unicode-compatible encoding mandated by the Chinese government.

**Shift_JIS**: Japanese encoding. Variable 1-2 bytes. Famous for the backslash/yen sign problem: byte 0x5C renders as the yen sign (¥) in Japanese environments but as the backslash in ASCII, causing path-related bugs on Japanese Windows.

**ISO-8859-1 (Latin-1)**: Fixed 1-byte encoding. Covers Western European languages. Can only represent 256 characters. Historically significant as the default HTTP encoding.

Identifying Causes by Symptoms

"Characters appear garbled" — This occurs when the file's actual encoding differs from what the program uses to interpret it. For example, the Korean word "한글" stored as EUC-KR bytes (\xc7\xd1\xb1\xdb) displays as "ÇѱÛ" when those bytes are misread as Latin-1, because every byte maps to some accented Latin character. Misread as UTF-8 instead, the same bytes form invalid sequences and typically show up as replacement characters (�).

"File content is the same but comparison shows differences" — This may be due to BOM (Byte Order Mark) presence differences. UTF-8 with BOM and UTF-8 without BOM look identical to human eyes but differ at the byte level (the first 3 bytes). Line ending differences (Windows CRLF vs Unix LF) can also cause every single line to show as changed.

"Only certain characters are broken" — Some special characters or emoji are not supported in the encoding. EUC-KR only supports a subset of Korean syllables; less common combinations or emoji will break. A file that was transcoded through an intermediate encoding may have had unsupported characters replaced with question marks or the Unicode replacement character (U+FFFD).

"Strange characters at the beginning of the file" — This is the BOM being displayed as text. The UTF-8 BOM bytes (EF BB BF) render as "ï»¿" when an editor interprets the file as Latin-1 or Windows-1252. This is particularly problematic in CSV files, where the invisible BOM attaches to the first column header, causing field-matching failures in code.

Real-World Encoding Horror Stories

**Story 1 — Database Migration Gone Wrong**: A company migrating from MySQL to PostgreSQL discovered that their MySQL tables were declared as latin1 but actually contained EUC-KR data. During migration, the data was double-encoded: first the raw bytes were interpreted as latin1 (producing mojibake), then that mojibake was encoded to UTF-8. Recovering hundreds of thousands of customer records required writing a custom reverse-encoding script and took two weeks.
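The recovery trick for this kind of double encoding is to reverse each step: decode the stored UTF-8, re-encode it as latin1 to recover the original raw bytes, then decode those bytes as EUC-KR. A minimal sketch, assuming cleanly double-encoded data with no lossy substitutions along the way:

```python
def repair_double_encoded(stored: bytes) -> str:
    """Undo EUC-KR data that was misread as latin1 and then saved as UTF-8."""
    mojibake = stored.decode("utf-8")   # the garbled string, e.g. "ÇѱÛ"
    raw = mojibake.encode("latin-1")    # back to the original EUC-KR bytes
    return raw.decode("euc-kr")         # the intended text

# Simulate the corruption described above, then repair it.
corrupted = "한글".encode("euc-kr").decode("latin-1").encode("utf-8")
print(repair_double_encoded(corrupted))  # → 한글
```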

**Story 2 — API Data Exchange**: A government agency's API returned responses in EUC-KR, but the Content-Type header lacked a charset parameter. Clients defaulting to UTF-8 received garbled Korean text. The fix was simple (adding charset=euc-kr to the header), but debugging took days because the raw bytes looked plausible in a hex editor.

**Story 3 — CSV File Merge Catastrophe**: Multiple departments submitted CSV files for consolidation. Department A used UTF-8, Department B used EUC-KR, and Department C used UTF-8 with BOM. Naively concatenating the files resulted in encoding breaks mid-file and BOM characters appearing as data in middle rows. The solution required detecting each file's encoding individually before merging.

**Story 4 — Git Repository Encoding Chaos**: Some team members on Windows committed source files with CP949-encoded comments, while macOS users committed in UTF-8. Every git diff showed all Korean comments as changed. Adding a .gitattributes file with encoding normalization rules and re-encoding all files fixed the issue.

Solution 1: Check File Encoding

The first step is always to identify the actual encoding of your files.

**Using text editors**: In VS Code, the encoding is displayed in the bottom-right status bar. Click it to access "Reopen with Encoding" which lets you try different encodings. In Notepad++, check the "Encoding" menu.

**Command-line tools**:

The `file` command on macOS/Linux provides basic encoding detection:

```
file -bi document.txt
# Output: text/plain; charset=utf-8
```

Python's chardet library offers more accurate detection with confidence scores:

```
pip install chardet
chardetect document.txt
# Output: document.txt: EUC-KR with confidence 0.99
```

The uchardet tool, based on Mozilla's encoding detection library, is another excellent option:

```
brew install uchardet   # macOS
apt install uchardet    # Ubuntu/Debian
uchardet document.txt
# Output: EUC-KR
```

For batch detection of multiple files, you can combine these tools with shell scripts to audit an entire directory.

Solution 2: Convert Encoding

When comparing two files with different encodings, unify them to a single encoding first. UTF-8 is the best target.

**Using iconv on the command line**:

```
iconv -f EUC-KR -t UTF-8 input.txt > output.txt
```

**Python conversion**:

```
with open('input.txt', 'r', encoding='euc-kr') as f:
    content = f.read()

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)
```

**Node.js with iconv-lite**:

```
const iconv = require('iconv-lite');
const fs = require('fs');

const buffer = fs.readFileSync('input.txt');
const content = iconv.decode(buffer, 'euc-kr');
fs.writeFileSync('output.txt', iconv.encode(content, 'utf-8'));
```

For bulk conversions, write a shell script that iterates through a directory and converts each file, optionally backing up the originals.
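As a sketch, such a bulk conversion pass might look like this in Python (the file pattern, source encoding, and backup suffix are illustrative choices):

```python
from pathlib import Path

def convert_tree(root: str, src_enc: str = "euc-kr", dst_enc: str = "utf-8") -> int:
    """Re-encode every .txt file under root, keeping a .bak copy of the original."""
    converted = 0
    for path in Path(root).rglob("*.txt"):
        raw = path.read_bytes()
        try:
            text = raw.decode(src_enc)
        except UnicodeDecodeError:
            continue  # skip files that are not actually in src_enc
        path.with_suffix(path.suffix + ".bak").write_bytes(raw)  # backup first
        path.write_text(text, encoding=dst_enc)
        converted += 1
    return converted
```

Files that fail to decode are skipped rather than half-converted, so a wrongly guessed source encoding cannot destroy data.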

Solution 3: BOM Deep Dive

The BOM (Byte Order Mark) is a special marker indicating encoding and byte order.

**Types of BOM**:

  • UTF-8 BOM: EF BB BF (3 bytes)
  • UTF-16 LE BOM: FF FE (2 bytes)
  • UTF-16 BE BOM: FE FF (2 bytes)
  • UTF-32 LE BOM: FF FE 00 00 (4 bytes)
  • UTF-32 BE BOM: 00 00 FE FF (4 bytes)

**When BOM helps**: On Windows, Excel requires a BOM to correctly open UTF-8 CSV files with non-ASCII characters. For UTF-16 files, BOM is essential for specifying byte order.

**When BOM hurts**: BOM before PHP opening tags causes "headers already sent" errors. BOM in JSON files can cause parsing failures. BOM before a shell script's shebang (#!/bin/bash) prevents execution. BOM in CSV files parsed programmatically adds an invisible character before the first field name.

**Removing BOM**:

```
# Using GNU sed on Linux (BSD/macOS sed handles -i and \x escapes differently)
sed -i '1s/^\xEF\xBB\xBF//' file.txt

# Using Python (utf-8-sig automatically strips BOM)
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

with open('file.txt', 'w', encoding='utf-8') as f:
    f.write(content)
```

**Adding BOM** (for Excel CSV compatibility):

```
# Python: write UTF-8 with BOM
with open('file.csv', 'w', encoding='utf-8-sig') as f:
    f.write(content)
```

Excel's Encoding Traps

Excel is one of the most common sources of encoding problems.

**CSV export issues**: When you save as CSV in Excel, it uses the system's default encoding (CP949 on Korean Windows, Windows-1252 on English Windows). You must specifically choose "CSV UTF-8 (Comma delimited)" to get UTF-8 output, and even then Excel adds a BOM.

**CSV import issues**: Double-clicking a CSV file opens it with the system default encoding, so a UTF-8 file without a BOM may display Korean, Chinese, or Japanese characters as garbled text. The workaround is to import via the Data tab's "From Text/CSV" option and manually select encoding 65001 (UTF-8).

**The hidden BOM trap**: Excel's UTF-8 CSV export always includes a BOM. Uploading this file to a Linux server and parsing it programmatically means the first column header has an invisible 3-byte prefix, causing column name mismatches that are extremely difficult to debug because the BOM is invisible in most text displays.
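In Python, reading with the utf-8-sig codec sidesteps this trap: it strips the BOM when one is present and behaves like plain UTF-8 when it is absent:

```python
import csv
import io

# Simulate an Excel "CSV UTF-8" export: BOM followed by the header row
exported = "\ufeffname,age\nKim,30\n".encode("utf-8")

# Decoded as plain utf-8, the BOM sticks to the first header field
plain = next(csv.DictReader(io.StringIO(exported.decode("utf-8"))))
print(list(plain))  # first key is '\ufeffname', not 'name'

# Decoded as utf-8-sig, the BOM is stripped and header matching works
safe = next(csv.DictReader(io.StringIO(exported.decode("utf-8-sig"))))
print(safe["name"])  # → Kim
```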

Cross-Platform Encoding Pitfalls

Differences between Windows, macOS, and Linux extend beyond encoding to line endings and filename normalization.

**Line ending differences**:

  • Windows: CRLF (\r\n, bytes 0x0D 0x0A)
  • macOS/Linux: LF (\n, byte 0x0A)
  • Classic Mac OS (pre-OS X): CR (\r, byte 0x0D)

When comparing a file created on Windows with one from macOS, every line may show as "changed" because the invisible line-ending bytes differ, even though the visible content is identical.
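A common pre-comparison step is therefore to normalize line endings, for example:

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR to LF so only real content differences remain."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

windows = "line one\r\nline two\r\n"
unix = "line one\nline two\n"
print(windows == unix)                                          # → False
print(normalize_newlines(windows) == normalize_newlines(unix))  # → True
```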

**Filename encoding**: macOS (historically, via the HFS+ filesystem) normalizes filenames to NFD (decomposed) Unicode, while Windows and most Linux tools produce NFC (composed). The Korean filename "가.txt" is therefore stored as different byte sequences on each OS, causing filename mismatches when extracting ZIP archives or accessing network shares across platforms.
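Python's unicodedata module makes the mismatch visible: the NFC and NFD forms of the same filename compare unequal until normalized:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "가.txt")  # composed, as on Windows/Linux
nfd = unicodedata.normalize("NFD", "가.txt")  # decomposed, as stored by HFS+

print(nfc == nfd)          # → False: different code point sequences
print(len(nfc), len(nfd))  # → 5 6 (the syllable splits into two code points)
print(unicodedata.normalize("NFC", nfd) == nfc)  # → True once normalized
```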

**Default encoding differences**: Korean Windows defaults to CP949; macOS and Linux default to UTF-8. Files created with Notepad on Windows 10 and later default to UTF-8, but older versions and many legacy applications still use CP949.

Solution 4: How DiffMate's Encoding Cascade Works

DiffMate uses an automatic encoding detection cascade when opening files. Here is the detailed process:

**Step 1 — BOM Detection**: The first bytes of the file are examined for BOM markers. If found, the encoding is determined immediately: UTF-8 BOM (EF BB BF), UTF-16 LE BOM (FF FE), or UTF-16 BE BOM (FE FF).

**Step 2 — UTF-8 Attempt**: Without a BOM, DiffMate attempts UTF-8 decoding. Because UTF-8 has strict byte-pattern rules, invalid sequences (like a lone continuation byte or an overlong encoding) cause immediate failure, triggering a fallback.

**Step 3 — EUC-KR Attempt**: If UTF-8 fails, EUC-KR (CP949) is tried next. This covers the most common legacy encoding in Korean environments.

**Step 4 — ISO-8859-1 Fallback**: If EUC-KR also fails, ISO-8859-1 is used. Since this encoding maps every byte value (0x00-0xFF) to a valid character, it never fails. However, non-Latin text will not display correctly.

**Step 5 — UTF-16 Attempt**: As a final option, UTF-16 decoding is attempted for files that may be in that format.

This cascade means users never need to manually specify encoding. DiffMate handles the complexity automatically, letting you focus on comparing content rather than debugging encoding.
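A simplified sketch of such a cascade in Python illustrates the approach (an illustration of the technique, not DiffMate's actual implementation; it folds the UTF-16 step into BOM detection):

```python
import codecs

def detect_and_decode(data: bytes) -> tuple[str, str]:
    """Return (encoding_name, decoded_text) using a BOM-first cascade."""
    # Step 1: a BOM settles the encoding immediately.
    for bom, name in [(codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16"),
                      (codecs.BOM_UTF16_BE, "utf-16")]:
        if data.startswith(bom):
            return name, data.decode(name)
    # Steps 2-4: strict decoders in order of likelihood. UTF-8's tight
    # byte-pattern rules make false positives rare; cp949 covers Korean
    # legacy files; latin-1 accepts any byte sequence and so never fails.
    for name in ("utf-8", "cp949", "latin-1"):
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 never fails")

print(detect_and_decode("한글".encode("utf-8"))[0])   # → utf-8
print(detect_and_decode("한글".encode("euc-kr"))[0])  # → cp949
```

Note that cp949 can occasionally accept non-Korean legacy data by coincidence; a production cascade would weigh confidence scores rather than take the first successful decode.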

Programming Language Solutions

**Python**: Use the chardet or cchardet library for automatic encoding detection. Python 3 strings are Unicode by default; always specify the encoding parameter in open(). Use 'utf-8-sig' to automatically handle BOM.

**Node.js**: The iconv-lite package is the standard solution, supporting EUC-KR, Shift_JIS, GB2312, and dozens more. Pair it with jschardet for detection. Remember that Buffer.toString() supports only a handful of built-in encodings (utf8, utf16le, latin1, base64, hex); for EUC-KR and other legacy encodings, iconv-lite is mandatory.

**Java**: Specify a Charset in InputStreamReader's constructor. Java uses UTF-16 internally, so always be explicit about file encoding during I/O. The StandardCharsets class provides constants for common encodings.

Prevention Tips for Teams

  • Save all new files as UTF-8 without BOM
  • Share encoding standards via .editorconfig files in your repository
  • Add .gitattributes to normalize line endings across platforms
  • Verify encoding of externally received files before comparison
  • Use UTF-8 when exporting from databases
  • Document the encoding when exchanging CSV files with partners
  • Add encoding validation steps to CI/CD pipelines
  • Standardize IDE default encoding settings to UTF-8 across the team
  • When integrating with legacy systems, implement explicit encoding conversion layers

Conclusion

Encoding problems are systematic and solvable once you understand their root causes. Knowing the history from ASCII through regional encodings to UTF-8, understanding how each encoding represents characters at the byte level, and recognizing common symptoms lets you diagnose most issues quickly. Building a habit of checking and unifying encoding before file comparison eliminates hours of unnecessary debugging. DiffMate's automatic encoding detection cascade handles different encodings transparently, letting you compare files without manual conversion and significantly improving your workflow efficiency.

Compare Files with DiffMate