How Binary Text Is Stored: Computer Science Explained
What Actually Happens When You Save a Text File
Every time you save a document, write an email, or post a comment, your computer converts human-readable text into binary code. This isn't magic. It's a systematic conversion process that has been standardized for decades.
Here's how it works, step by step.
The Foundation: Binary Basics
Computers only understand two states: on and off. This maps perfectly to 1s and 0s. Each individual 1 or 0 is called a bit. Eight bits grouped together form a byte.
A byte can represent 256 different values (2^8 = 256). This is the basic unit of storage for text.
Quick Reference: Binary to Decimal
- 00000000 = 0
- 00000001 = 1
- 00001010 = 10
- 11111111 = 255
Every character you type gets assigned a numeric value. That number gets converted to binary and stored on disk.
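The quick-reference values above can be checked with a few lines of Python. This is only a sketch: `int` and `format` are standard built-ins, while the variable names are made up for illustration.

```python
# Convert 8-bit binary strings to decimal and back again.
binary_values = ["00000000", "00000001", "00001010", "11111111"]

# int(text, 2) parses a string of 0s and 1s as a base-2 number.
decimals = [int(bits, 2) for bits in binary_values]

# format(n, "08b") renders n in binary, zero-padded to 8 digits (one byte).
round_trip = [format(n, "08b") for n in decimals]

print(decimals)
print(round_trip)
```

The round trip returning the original strings is a quick sanity check that a byte and its decimal value are two views of the same number.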
Character Encoding: The Translation Layer
You need a system that maps characters to numbers. This is called character encoding. Without a shared encoding standard, your saved file would be gibberish when opened on another system.
The computer industry developed several encoding schemes over time. Each one addressed specific limitations of its predecessors.
ASCII: The Original Standard
ASCII (American Standard Code for Information Interchange) uses 7 bits per character, giving 128 possible values (0-127). It covers:
- Uppercase and lowercase letters (A-Z, a-z)
- Numbers 0-9
- Common punctuation marks
- Control characters (newline, tab, carriage return)
Here's the letter 'A' in ASCII: decimal value 65, binary 1000001.
Here's the letter 'a': decimal value 97, binary 1100001.
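As a rough illustration, Python's built-in `ord` and `chr` expose the same character-to-number mapping. The loop and variable names below are illustrative only.

```python
# ord() returns a character's numeric value; chr() goes the other way.
for letter in ("A", "a"):
    value = ord(letter)
    # "07b" prints the value as 7 binary digits, matching ASCII's 7-bit width.
    print(letter, value, format(value, "07b"))

# Going back from number to character.
restored = chr(65)

# The gap between an uppercase letter and its lowercase form.
case_gap = ord("a") - ord("A")
print(restored, case_gap)
```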
The problem with ASCII: it only works for English. It can't represent characters from other languages, special symbols, or emoji.
Extended ASCII and ISO Standards
To fill the gap, systems used all 8 bits of a byte. This gave 256 values. Different regions created their own extensions:
- ISO-8859-1 (Latin-1): Western European languages
- ISO-8859-5: Cyrillic alphabet
- Windows-1252: Windows-specific extensions
This created chaos. A file saved with one encoding would display wrong on systems using another. You still see this problem today with garbled text called mojibake.
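To see why this caused trouble, here is a small sketch that decodes one and the same byte with three of the encodings listed above. The byte value 0xE9 is an arbitrary example, and the codec names are the ones Python's standard library uses.

```python
# One byte, three interpretations. 0xE9 is just an example value above 127.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European reading
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic reading
as_cp1252 = raw.decode("cp1252")        # Windows-1252 reading

print(as_latin1, as_cyrillic, as_cp1252)
```

The byte never changes. Only the table used to look it up does, which is exactly how mojibake arises.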
Unicode: One Standard to Rule Them All
Unicode was designed to assign a unique number to every character in every language. As of Unicode 15.0 it covers more than 149,000 characters across 161 scripts.
Unicode is not an encoding itself. It's a character set: a massive list of characters with assigned code points. A code point looks like this: U+0041 (that's the letter A).
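A code point is just a number written in hexadecimal with a `U+` prefix. The helper below, `code_point_label`, is a hypothetical name used only for this sketch.

```python
def code_point_label(char: str) -> str:
    """Return the U+XXXX label for a single character (illustrative helper)."""
    # ord() gives the code point as an integer; 04X pads to at least 4 hex digits.
    return f"U+{ord(char):04X}"


for char in ("A", "é", "😀"):
    print(char, code_point_label(char))
```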
UTF-8: The Dominant Encoding
UTF-8 is the most common encoding form for Unicode. It uses 1 to 4 bytes per character depending on the character:
- 1 byte: ASCII characters (0-127). This means ASCII files are 100% compatible with UTF-8.
- 2 bytes: Most European languages, Arabic, Hebrew
- 3 bytes: Asian scripts, mathematical symbols
- 4 bytes: Emoji, rare historical scripts, some Chinese characters
UTF-8 is a variable-width encoding. Common characters use less space. Rare characters use more. This makes files smaller for typical Western text.
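The byte counts in the list above can be observed directly. This sketch picks one sample character per tier; the choice of characters is an assumption for illustration.

```python
# One sample character for each UTF-8 width.
samples = ["A", "é", "€", "😀"]

# str.encode("utf-8") returns the bytes; len() counts them.
byte_lengths = {char: len(char.encode("utf-8")) for char in samples}

for char, size in byte_lengths.items():
    print(char, size)
```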
UTF-16 and UTF-32
Other Unicode encodings exist:
- UTF-16: Uses 2 or 4 bytes per character. Used internally by Windows and Java.
- UTF-32: Fixed 4 bytes for every character. Simple but wasteful of space.
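For comparison, the same measurement can be made for UTF-16 and UTF-32. This sketch uses the little-endian codec names so that Python does not prepend a byte order mark, which would inflate the counts. `encoded_size` is a hypothetical helper.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


for char in ("A", "😀"):
    utf16 = encoded_size(char, "utf-16-le")
    utf32 = encoded_size(char, "utf-32-le")
    print(char, "UTF-16:", utf16, "UTF-32:", utf32)
```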
How Text Files Are Actually Stored
A plain text file contains raw bytes. The file itself has no embedded information about its encoding. When you open a file, your operating system or application must guess the encoding.
This is why specifying UTF-8 encoding matters when saving files. The file contents are just bytes. The encoding tells the reader how to interpret those bytes.
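A minimal sketch of that idea: write a short string to a temporary file, read the raw bytes back, and interpret them two ways. The sample text and the temporary file are assumptions made for the demonstration.

```python
import os
import tempfile

text = "café"

# Write the text as UTF-8 to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as handle:
    handle.write(text)
    path = handle.name

# Read it back in binary mode: these are the bytes on disk, with no label.
with open(path, "rb") as handle:
    raw = handle.read()
os.remove(path)

right = raw.decode("utf-8")    # interpreted the way it was written
wrong = raw.decode("latin-1")  # same bytes, different interpretation

print(raw)
print(right)
print(wrong)
```

Nothing in `raw` says which reading is correct. That knowledge has to come from outside the file.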
Byte Order Mark (BOM)
Some UTF-8 files include a BOM (byte order mark) at the start: the bytes EF BB BF. This signals the file is UTF-8 encoded. It's optional and sometimes causes problems with programs that don't expect it.
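In Python, the standard `utf-8-sig` codec writes and strips this marker, and `codecs.BOM_UTF8` holds the three bytes. The sketch below shows what a program that does not expect the BOM ends up with.

```python
import codecs

text = "Hi"

with_bom = text.encode("utf-8-sig")  # "utf-8-sig" prepends the BOM on encode
without_bom = text.encode("utf-8")   # plain UTF-8, no marker

starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" removes a leading BOM if one is present.
stripped = with_bom.decode("utf-8-sig")

# Decoding with plain "utf-8" keeps it as the invisible character U+FEFF.
kept = with_bom.decode("utf-8")

print(with_bom, without_bom, starts_with_bom)
print(repr(stripped), repr(kept))
```

The stray U+FEFF at the start of `kept` is the kind of invisible character that breaks parsers and string comparisons.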
Encoding Comparison Table
| Encoding | Bytes per Character | Character Range | Common Use |
|---|---|---|---|
| ASCII | 1 | 128 (English only) | Legacy systems, config files |
| ISO-8859-1 | 1 | 256 (Western Europe) | Old web pages, databases |
| UTF-8 | 1-4 variable | All Unicode | Web, modern files, email |
| UTF-16 | 2 or 4 variable | All Unicode | Windows, Java internal |
| UTF-32 | 4 fixed | All Unicode | Program internal processing |
How Text Encoding Works: A Concrete Example
Let's trace what happens when you save the word "Hi":
- The letter 'H' has Unicode code point U+0048. In UTF-8, this is encoded as byte 0x48 (binary: 01001000).
- The letter 'i' has code point U+0069. In UTF-8, this is byte 0x69 (binary: 01101001).
- These two bytes get written to disk.
- When opened, the reader interprets 0x48 as 'H' and 0x69 as 'i'.
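The steps above can be reproduced in a few lines. This is a sketch rather than what an editor literally does internally; pairing characters with bytes via `zip` only works here because every character in "Hi" takes exactly one byte.

```python
word = "Hi"

# Characters -> bytes.
encoded = word.encode("utf-8")

# Show code point, byte in hex, and byte in binary for each character.
# zip() lines up one character per byte, which holds for ASCII-only text.
for char, byte in zip(word, encoded):
    print(char, f"U+{ord(char):04X}", f"0x{byte:02X}", format(byte, "08b"))

# Bytes -> characters, as a reader would do when opening the file.
decoded = encoded.decode("utf-8")
print(decoded)
```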
Now let's look at an emoji: "😀" (grinning face)
- This emoji has code point U+1F600.
- In UTF-8, this requires 4 bytes: F0 9F 98 80.
- That's four times the storage space of an ASCII letter such as 'H'.
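To show where those four bytes come from, the sketch below packs a code point into the 4-byte UTF-8 layout by hand and compares the result with Python's built-in encoder. `encode_four_byte` is a hypothetical helper that handles only the 4-byte case.

```python
def encode_four_byte(code_point: int) -> bytes:
    """Pack a code point from U+10000 to U+10FFFF into the 4-byte UTF-8
    layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (illustrative helper)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points that need 4 bytes are handled here")
    return bytes([
        0xF0 | (code_point >> 18),           # lead byte: marker + top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # continuation: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # continuation: next 6 bits
        0x80 | (code_point & 0x3F),          # continuation: low 6 bits
    ])


manual = encode_four_byte(0x1F600)
builtin = "😀".encode("utf-8")

print(manual.hex(" "))   # bytes.hex with a separator needs Python 3.8+
print(manual == builtin)
```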
Getting Started: How to Work with Text Encoding
You don't need to manually encode text, but you should know how to handle encoding issues.
Checking File Encoding in Practice
- Linux/Mac: Use the `file` command in terminal: `file -b myfile.txt`
- Windows: Use Notepad++ → Encoding menu to see current encoding
- VS Code: Click the encoding in the bottom-right status bar
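If you want a quick check from code, the sketch below makes a rough guess based on the bytes alone. It is not a real detector: `guess_encoding` and the label `unknown-8bit` are hypothetical, and it cannot tell one single-byte legacy encoding from another.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Rough guess at the encoding of raw bytes (illustrative, not reliable)."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so probably a legacy encoding; which one is unknown.
        return "unknown-8bit"


print(guess_encoding(b"Hi"))
print(guess_encoding("é".encode("utf-8")))
print(guess_encoding(b"\xe9"))
```

This works reasonably well in one direction only: text that decodes cleanly as UTF-8 is very likely UTF-8, but a failure tells you little about what the file actually is.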
Converting Between Encodings
If you have a file in the wrong encoding, you can convert it:
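- Python: `content.decode('latin-1').encode('utf-8')`, where `content` holds the file's raw bytes (decode with the current encoding first, then encode as UTF-8)
- iconv command line: `iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt`
- Notepad++: Open file → Encoding menu → Convert to UTF-8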
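A file-level version of the Python approach might look like the sketch below. `convert_file` is a hypothetical helper, and the ISO-8859-1 source encoding and the file names are assumptions for the demonstration; in practice you need to know the real source encoding first.

```python
import os
import tempfile


def convert_file(src: str, dst: str, src_encoding: str,
                 dst_encoding: str = "utf-8") -> None:
    """Re-encode a text file (illustrative helper).

    newline="" keeps the original line endings untouched.
    """
    with open(src, "r", encoding=src_encoding, newline="") as reader:
        text = reader.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as writer:
        writer.write(text)


# Usage: create a Latin-1 file, convert it, and inspect the resulting bytes.
with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "input.txt")
    dst = os.path.join(folder, "output.txt")
    with open(src, "wb") as handle:
        handle.write("café".encode("iso-8859-1"))

    convert_file(src, dst, "iso-8859-1")

    with open(dst, "rb") as handle:
        converted = handle.read()

print(converted)
```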
Specifying Encoding in Code
Always declare encoding in your projects:
- HTML files: Add `<meta charset="UTF-8">` in the `<head>`
- Python: Include `# -*- coding: utf-8 -*-` at the top (needed only for Python 2; Python 3 source files already default to UTF-8)
- Config files: Most modern formats (JSON, YAML) default to UTF-8
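The same habit applies when a program reads or writes text: pass the encoding explicitly instead of relying on a default that can depend on the platform. A minimal sketch using `pathlib`, with a made-up file name:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    note = Path(folder) / "note.txt"

    # State the encoding on both sides so the bytes are read the way
    # they were written, whatever the machine's default happens to be.
    note.write_text("naïve café", encoding="utf-8")
    restored = note.read_text(encoding="utf-8")

print(restored)
```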
Common Encoding Problems and Fixes
Mojibake (Garbled Text)
When you see é instead of é, or Ð¾Ð¿ÐºÐ° instead of the Cyrillic опка, the file was saved in one encoding but opened with another. Both examples are UTF-8 bytes read as Latin-1 or Windows-1252.
Fix: Reopen the file with the correct encoding. In most text editors, you can try different encodings until the text displays correctly.
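When the garbled text has already been copied somewhere and the original file is gone, the wrong decoding step can sometimes be reversed. The sketch below assumes the text was UTF-8 misread as Latin-1; `fix_mojibake` is a hypothetical helper, and with a different wrong encoding (for example `cp1252`) the argument has to change accordingly.

```python
def fix_mojibake(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo a wrong decode: turn the text back into the original bytes,
    then decode those bytes as UTF-8 (illustrative helper)."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = "Ã©"
repaired = fix_mojibake(garbled)
print(garbled, "->", repaired)
```

This only works if no bytes were lost along the way. If the round trip raises an error, the assumed encoding was probably wrong.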
Question Marks or Boxes
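If you see ???? or □ characters, either the encoding used to save the text cannot represent those characters, or the font has no glyph for them.
Fix: Convert the file to UTF-8, which supports all Unicode characters. If the characters were already replaced with ? when the file was saved, the originals are gone, so convert from the original source rather than from the damaged copy.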
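The question marks typically appear at save time, when a character has no slot in the target encoding. A short sketch with an arbitrary sample word:

```python
text = "naïve"

# ASCII has no "ï", so errors="replace" substitutes a question mark.
lossy = text.encode("ascii", errors="replace")

# The default (strict) mode refuses instead of silently losing data.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8 can represent the character, so the round trip is lossless.
lossless = text.encode("utf-8").decode("utf-8")

print(lossy, strict_failed, lossless)
```

Once the `?` is written, the original character cannot be recovered from the file.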
Accented Characters Broken in URLs
URLs must be ASCII. Browsers encode special characters automatically using percent-encoding (e.g., é becomes %C3%A9 in UTF-8).
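Percent-encoding can be reproduced with `urllib.parse` from Python's standard library. `quote` uses UTF-8 unless told otherwise; the Latin-1 call is included only to show that the result depends on the encoding chosen.

```python
from urllib.parse import quote, unquote

# Each UTF-8 byte of the character becomes %XX.
encoded = quote("é")
decoded = unquote(encoded)

# The same character under a legacy encoding yields a different URL form.
legacy = quote("é", encoding="latin-1")

print(encoded, decoded, legacy)
```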
Why UTF-8 Won
UTF-8 is the default for:
- Web pages (over 98% of websites)
- Email (SMTP standards)
- JSON and XML files
- Most programming languages
- Unix/Linux systems
It won because it's backward compatible with ASCII, handles all languages, and produces smaller files for English-heavy content. There's no reason to use anything else for new projects.
The Bottom Line
Text storage is a conversion process: characters → numbers → bytes. The encoding system determines how that mapping works.
Use UTF-8 for everything. It's the universal standard that handles every character you'll ever need. If you're dealing with legacy files, identify the encoding first, then convert to UTF-8.
Understanding this isn't academic. Encoding bugs cause data corruption, security vulnerabilities, and display errors. Now you know what's actually happening when you save a text file.