Tech 7 min read·By NexTool Team

Guide to Text Encoding: UTF-8, ASCII & Character Sets

Understand text encoding systems including UTF-8, ASCII, and Unicode. Learn why encoding matters, common issues, and how to fix character corruption.

What Is Character Encoding

Character encoding maps human-readable characters to numbers (code points) that computers can store and process. When you type the letter 'A', your computer stores the number 65 (its value in both ASCII and Unicode). The encoding determines which number, and ultimately which bytes, correspond to which character. Without knowing the correct encoding, a sequence of bytes is ambiguous: the same bytes can represent different characters in different encodings, which produces garbled text (mojibake). Every text file, web page, database field, and API response has an encoding, and a mismatch between the encoding used to write the data and the one used to read it is the root cause of most character-display problems.
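
As a small illustration of the paragraph above, the Python sketch below looks up the number behind 'A' and then reads one and the same byte under two different encodings. The byte value and the choice of Latin-1 and Windows-1251 are only illustrative picks, not something the article prescribes.

```python
# The letter 'A' is stored as the number 65.
code_point = ord("A")

# One byte, interpreted under two different single-byte encodings.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")  # Western European reading of the byte
as_cp1251 = raw.decode("cp1251")   # Cyrillic reading of the same byte

print(code_point, as_latin1, as_cp1251)
```

The byte never changes; only the assumed encoding does, and that is exactly how mojibake arises.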

From ASCII to Unicode

ASCII (American Standard Code for Information Interchange) encodes 128 characters using 7 bits: the English alphabet (uppercase and lowercase), digits, punctuation, and control characters. It works well for English but cannot represent accented characters, Chinese, Arabic, or emoji. Various extended encodings (ISO-8859-1 for Western European languages, Shift_JIS for Japanese, GB2312 for Simplified Chinese) solved regional needs but were incompatible with one another, so text written in one region was often unreadable in another. Unicode solved this by assigning a unique code point (number) to every character in every supported writing system: over 149,000 characters across 161 scripts as of Unicode 15. Unicode is the universal character set; UTF-8 is its most popular encoding.
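
The coverage gaps described above can be checked directly. The following is a minimal Python sketch; `can_encode` is a hypothetical helper written for this example, and the sample characters are arbitrary choices.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# An English letter, an accented letter, and a Japanese/Chinese character.
samples = ["A", "\u00e9", "\u65e5"]
encodings = ("ascii", "latin-1", "shift_jis", "utf-8")

for ch in samples:
    support = {enc: can_encode(ch, enc) for enc in encodings}
    print(f"{ch!r} is U+{ord(ch):04X}: {support}")
```

Each legacy encoding covers only its own region's characters, while UTF-8 can encode every sample because it covers all Unicode code points.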

Why UTF-8 Won

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. ASCII characters use 1 byte (making UTF-8 backward-compatible with ASCII), most European and Middle Eastern characters use 2 bytes, most East Asian characters use 3 bytes, and most emoji use 4 bytes. This efficiency (English text is the same size as in ASCII, yet all of Unicode is supported) made UTF-8 the dominant encoding on the web: over 98 percent of web pages use UTF-8. Virtually all modern programming languages, databases, and operating systems support UTF-8, and many use it by default. Always use UTF-8 unless you have a specific legacy requirement for another encoding.
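
The byte counts above are easy to verify. This short Python sketch measures the UTF-8 length of one sample character per group; the particular characters and the `utf8_length` helper are illustrative choices, not part of any standard API.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given text."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter (e with acute)",
    "\u0639": "Arabic letter (ain)",
    "\u65e5": "CJK ideograph",
    "\U0001F600": "emoji (grinning face)",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")
```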

Fixing Encoding Problems

Common symptoms of encoding issues include garbled characters (café showing as cafÃ©), question marks or boxes replacing characters, and double-encoded strings (é turning into Ã© and then, after a second bad conversion, ÃƒÂ©). To fix: first identify the actual encoding of the data (tools like chardet for Python can guess it), then convert to UTF-8 using iconv (command line), the encoding parameter in your programming language's file reading function, or database ALTER commands (for example, ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4 in MySQL). Prevent encoding issues by declaring UTF-8 everywhere: HTML meta tag (<meta charset="UTF-8">), HTTP Content-Type header (charset=utf-8), database connection settings (SET NAMES utf8mb4 in MySQL), and source file encoding (most editors default to UTF-8 now).
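
The two fixes described above can be sketched in Python using only the standard library. Both function names are hypothetical helpers for this example, and the sketch assumes the wrong decoder was Latin-1 and that no bytes were lost along the way; real data often involves Windows-1252 instead, and detection with a tool like chardet is only a guess.

```python
import tempfile
from pathlib import Path


def repair_mojibake(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 text that was decoded with the wrong single-byte encoding."""
    return garbled.encode(wrong_encoding).decode("utf-8")


def convert_file_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite a text file as UTF-8, given its known current encoding."""
    text = path.read_text(encoding=source_encoding)
    path.write_text(text, encoding="utf-8")


# Reproduce the symptom (UTF-8 bytes read as Latin-1), then reverse it.
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")
repaired = repair_mojibake(garbled)
print(garbled, "->", repaired)

# Convert a legacy Latin-1 file to UTF-8 in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes("caf\u00e9".encode("latin-1"))
    convert_file_to_utf8(legacy, "latin-1")
    converted_bytes = legacy.read_bytes()
```

The repair only works when the garbled string still contains every original byte; if characters were replaced with question marks, the information is gone and the text must be recovered from the source.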

Frequently Asked Questions

What is the difference between Unicode and UTF-8?

Unicode is a standard that assigns a unique number (code point) to every character; it is the universal catalog of characters. UTF-8 is one way to encode those Unicode code points as bytes for storage and transmission. Other encodings of Unicode exist: UTF-16 uses 2 or 4 bytes per code point (common in Windows and Java internals) and UTF-32 uses exactly 4 bytes per code point (simple but wasteful). UTF-8 is the most popular because of its efficiency and ASCII compatibility.
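
To make the comparison concrete, this Python sketch encodes the same code points in all three forms and counts the bytes. The `encoded_sizes` helper is hypothetical, and the little-endian codec variants are used here only so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of the text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {
        enc: len(ch.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


# An ASCII letter, a CJK ideograph, and an emoji.
for ch in ("A", "\u65e5", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {encoded_sizes(ch)}")
```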

Why do some emoji show as boxes or question marks?

Emoji display requires three things: the correct encoding (UTF-8 with 4-byte support, for example utf8mb4 rather than the 3-byte utf8 in MySQL), a font that includes the emoji glyph, and an operating system or browser that renders it. Missing any of these causes the emoji to show as a box (missing glyph), a question mark (encoding error), or two or more separate characters (missing support for combined sequences, such as emoji joined by a zero-width joiner). Keeping your operating system and browser updated typically resolves emoji display issues.
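
The "separate characters" symptom is easier to understand once you see how a combined emoji is built. The Python sketch below takes one family emoji as an illustrative example and lists the code points inside it; the variable names are arbitrary.

```python
# A family emoji: man + zero-width joiner + woman + zero-width joiner + girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# The single visible glyph is really a sequence of code points.
code_points = [f"U+{ord(c):04X}" for c in family]
utf8_bytes = len(family.encode("utf-8"))

# Code points above U+FFFF are the ones that need 4 bytes in UTF-8.
needs_four_bytes = [c for c in family if ord(c) > 0xFFFF]

print(code_points)
print(f"{len(family)} code points, {utf8_bytes} UTF-8 bytes")
```

A renderer that does not know this sequence falls back to drawing the three people side by side, and a storage layer limited to 3-byte UTF-8 cannot hold the three 4-byte code points at all.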

What is UTF-8 with BOM?

BOM (Byte Order Mark) is a special Unicode character (U+FEFF) placed at the beginning of a file to signal its encoding. In UTF-8, the BOM is the bytes EF BB BF. While unnecessary for UTF-8 (which has no byte-order ambiguity), some Windows applications add it. The BOM can cause issues in Unix/Linux systems, particularly in shell scripts and PHP files where it may produce unexpected output. Modern best practice is to use UTF-8 without BOM unless a specific application requires it.
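
In Python, the difference shows up in which codec you choose. This is a minimal sketch using the standard library's `utf-8` and `utf-8-sig` codecs; the sample script line is just an illustration of where a stray BOM tends to cause trouble.

```python
import codecs

# A file's bytes as some Windows editors would write them: BOM, then content.
data = codecs.BOM_UTF8 + "#!/bin/sh".encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible U+FEFF at the start of the text.
plain = data.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present and otherwise behaves like utf-8.
stripped = data.decode("utf-8-sig")

# Writing with "utf-8-sig" adds the BOM; writing with "utf-8" does not.
with_bom = "x".encode("utf-8-sig")
without_bom = "x".encode("utf-8")

print(repr(plain[0]), repr(stripped))
```

Reading with `utf-8-sig` is a reasonable default when input files may come from tools that add a BOM, since it handles both cases.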