How Binary Text Is Stored: Computer Science Explained

What Actually Happens When You Save a Text File

Every time you save a document, write an email, or post a comment, your computer converts human-readable text into binary code. This isn't magic. It's a systematic conversion process that has been standardized for decades.

Here's how it works, step by step.

The Foundation: Binary Basics

Computers only understand two states: on and off. This maps perfectly to 1s and 0s. Each individual 1 or 0 is called a bit. Eight bits grouped together form a byte.

A byte can represent 256 different values (2^8 = 256). This is the basic unit of storage for text.

Quick Reference: Binary to Decimal

00000000 = 0
00000001 = 1
00001010 = 10
11111111 = 255
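
The quick reference above can be reproduced in a few lines of Python. This is only a sketch using the built-in int() and format() functions; the variable names are illustrative.

```python
# Convert between binary strings and decimal integers.
binary_strings = ["00000000", "00000001", "00001010", "11111111"]

# int(text, 2) parses a string of 0s and 1s as a base-2 number.
decimal_values = [int(bits, 2) for bits in binary_strings]

# format(value, "08b") goes the other way, padding to 8 bits (one byte).
round_trip = [format(value, "08b") for value in decimal_values]

for bits, value in zip(binary_strings, decimal_values):
    print(f"{bits} = {value}")

# One byte has 8 bits, so it can hold 2**8 distinct values.
values_per_byte = 2 ** 8
```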

Every character you type gets assigned a numeric value. That number gets converted to binary and stored on disk.

Character Encoding: The Translation Layer

You need a system that maps characters to numbers. This is called character encoding. Without a shared encoding standard, your saved file would be gibberish when opened on another system.

The computer industry developed several encoding schemes over time. Each one solves specific limitations of the previous versions.

ASCII: The Original Standard

ASCII (American Standard Code for Information Interchange) uses 7 bits per character, giving 128 possible values (0-127). It covers:

Uppercase and lowercase letters (A-Z, a-z)
Numbers 0-9
Common punctuation marks
Control characters (newline, tab, carriage return)

Here's the letter 'A' in ASCII: decimal value 65, binary 1000001.

Here's the letter 'a': decimal value 97, binary 1100001.
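
You can check these values yourself. The sketch below assumes Python 3 and uses only the built-in ord() and format() functions; the variable names are made up for illustration.

```python
# ord() maps a character to its numeric value.
for letter in ("A", "a"):
    value = ord(letter)
    # "07b" renders the value as 7 binary digits, matching ASCII's 7 bits.
    print(letter, value, format(value, "07b"))

upper_a_bits = format(ord("A"), "07b")
lower_a_bits = format(ord("a"), "07b")

# Uppercase and lowercase letters are 32 apart, which is a single bit.
case_gap = ord("a") - ord("A")
```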

The problem with ASCII: it only works for English. It can't represent characters from other languages, special symbols, or emoji.

Extended ASCII and ISO Standards

To fill the gap, systems used all 8 bits of a byte. This gave 256 values. Different regions created their own extensions:

ISO-8859-1 (Latin-1): Western European languages
ISO-8859-5: Cyrillic alphabet
Windows-1252: Windows-specific extensions

This created chaos. A file saved with one encoding would display incorrectly on systems using another. You still see this problem today as garbled text, known as mojibake.
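
To see the conflict concretely, here is a small Python sketch that decodes the same byte under different 8-bit encodings. The byte values were picked for illustration; the codec names are the ones Python's standard library uses.

```python
# The same byte means different things under different 8-bit encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European reading
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic reading

# Windows-1252 and Latin-1 disagree in the 0x80-0x9F range.
euro_cp1252 = bytes([0x80]).decode("cp1252")        # a printable symbol
ctrl_latin1 = bytes([0x80]).decode("iso-8859-1")    # an invisible control code

print(as_latin1, as_cyrillic, euro_cp1252)
```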

Unicode: One Standard to Rule Them All

Unicode was designed to assign a unique number to every character in every language. As of version 15, it covers more than 149,000 characters across 161 scripts, and new versions keep adding more.

Unicode is not an encoding itself. It's a character set—a massive list of characters with assigned code points. A code point looks like this: U+0041 (that's the letter A).
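
A quick way to look up a code point is shown below. This is a sketch in Python 3; code_point_label is a hypothetical helper name, not a standard function.

```python
def code_point_label(char: str) -> str:
    """Return the U+XXXX label for a single character."""
    # ord() gives the code point; the notation uses at least 4 hex digits.
    return f"U+{ord(char):04X}"

for char in ("A", "é", "€", "😀"):
    print(char, code_point_label(char))
```

Note that the label says nothing about how many bytes the character takes on disk. That depends on the encoding.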

UTF-8: The Dominant Encoding

UTF-8 is the most common encoding form for Unicode. It uses 1 to 4 bytes per character depending on the character:

1 byte: ASCII characters (0-127). This means ASCII files are 100% compatible with UTF-8.
2 bytes: Most European languages, Arabic, Hebrew
3 bytes: Asian scripts, mathematical symbols
4 bytes: Emoji, rare historical scripts, some Chinese characters

UTF-8 is a variable-width encoding. Common characters use less space. Rare characters use more. This makes files smaller for typical Western text.
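
The byte counts above are easy to confirm. The sketch below encodes one sample character from each size class; the sample characters and labels are chosen for illustration.

```python
samples = {
    "A": "ASCII letter",
    "é": "accented Latin letter",
    "中": "CJK ideograph",
    "😀": "emoji",
}

utf8_lengths = {}
for char, description in samples.items():
    encoded = char.encode("utf-8")
    utf8_lengths[char] = len(encoded)
    print(f"{description}: {len(encoded)} byte(s), hex {encoded.hex()}")
```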

UTF-16 and UTF-32

Other Unicode encodings exist:

UTF-16: Uses 2 or 4 bytes per character. Used internally by Windows and Java.
UTF-32: Fixed 4 bytes for every character. Simple but wasteful of space.
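
For a rough size comparison, the sketch below encodes the same two-character string in all three forms. The sample string is arbitrary.

```python
text = "A😀"

# The "-le" variants pick little-endian byte order and omit the BOM,
# so the lengths below count only the characters themselves.
sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)
```

The letter costs 1, 2, and 4 bytes respectively, while the emoji costs 4 bytes in every form.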

How Text Files Are Actually Stored

A plain text file contains raw bytes. The file itself has no embedded information about its encoding. When you open a file, your operating system or application must guess the encoding.

This is why specifying UTF-8 encoding matters when saving files. The file contents are just bytes. The encoding tells the reader how to interpret those bytes.
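
Here is a minimal sketch of that round trip in Python, assuming a throwaway file in a temporary folder (the file name is hypothetical). It writes text with an explicit encoding, then reads the same file both as raw bytes and as text.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "note.txt"  # hypothetical file name

    # Writing: the encoding decides which bytes land on disk.
    path.write_text("café", encoding="utf-8")

    # Reading in binary mode shows the raw bytes, with no interpretation.
    raw = path.read_bytes()

    # Reading as text requires naming the encoding again.
    text = path.read_text(encoding="utf-8")

print(raw, text)
```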

Byte Order Mark (BOM)

Some UTF-8 files include a BOM (byte order mark) at the start: the bytes EF BB BF. This signals the file is UTF-8 encoded. It's optional and sometimes causes problems with programs that don't expect it.
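
In Python, the standard "utf-8-sig" codec handles the BOM for you. The sketch below shows what happens with and without it; the variable names are illustrative.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# The "utf-8-sig" codec writes the BOM when encoding...
with_bom = "Hi".encode("utf-8-sig")

# ...and strips it when decoding. Plain "utf-8" keeps it as U+FEFF,
# an invisible character at the start of the text.
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")
```

That leftover U+FEFF is the usual cause of BOM-related problems, such as a first line or first column name that doesn't match what you expect.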

Encoding Comparison Table

Encoding	Bytes per Character	Character Range	Common Use
ASCII	1	128 (English only)	Legacy systems, config files
ISO-8859-1	1	256 (Western Europe)	Old web pages, databases
UTF-8	1-4 variable	All Unicode	Web, modern files, email
UTF-16	2 or 4 variable	All Unicode	Windows, Java internal
UTF-32	4 fixed	All Unicode	Program internal processing

How Text Encoding Works: A Concrete Example

Let's trace what happens when you save the word "Hi":

The letter 'H' has Unicode code point U+0048. In UTF-8, this is encoded as byte 0x48 (binary: 01001000).
The letter 'i' has code point U+0069. In UTF-8, this is byte 0x69 (binary: 01101001).
These two bytes get written to disk.
When opened, the reader interprets 0x48 as 'H' and 0x69 as 'i'.

Now let's look at an emoji: "😀" (grinning face)

This emoji has code point U+1F600.
In UTF-8, this requires 4 bytes: F0 9F 98 80.
That's four times the storage of an ASCII letter such as 'H'.
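
To show where those bytes come from, here is a hand-written sketch of the UTF-8 bit layout. The function name utf8_encode is hypothetical, and the sketch skips validation (it does not reject surrogates or values above U+10FFFF). In real code you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 using the standard bit layout."""
    if code_point < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

for char in "Hi😀":
    encoded = utf8_encode(ord(char))
    print(char, " ".join(f"{byte:02X}" for byte in encoded))
```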

Getting Started: How to Work with Text Encoding

You don't need to manually encode text, but you should know how to handle encoding issues.

Checking File Encoding in Practice

Linux/Mac: Use the file command in terminal: file -b myfile.txt
Windows: Use Notepad++ → Encoding menu to see current encoding
VS Code: Click the encoding in the bottom-right status bar
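
If you'd rather check from a script, one simple heuristic is to try a strict UTF-8 decode. The helper below is a sketch with a hypothetical name, not a full encoding detector.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict by default: raises on invalid bytes
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("café".encode("utf-8")))    # valid UTF-8
print(looks_like_utf8("café".encode("latin-1")))  # lone 0xE9 is not valid UTF-8
```

The test only works in one direction. A single-byte encoding such as Latin-1 accepts any byte sequence, so a successful Latin-1 decode proves nothing.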

Converting Between Encodings

If you have a file in the wrong encoding, you can convert it:

Python: decode the raw bytes with the old encoding, then encode as UTF-8: raw_bytes.decode('latin-1').encode('utf-8')
Iconv command line: iconv -f ISO-8859-1 -t UTF-8 input.txt -o output.txt
Notepad++: Open file → Encoding menu → Convert to UTF-8
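
A file-level version of the Python approach might look like the sketch below. convert_encoding and the file names are hypothetical, and the source encoding is assumed to be known in advance.

```python
import tempfile
from pathlib import Path

def convert_encoding(src: Path, dst: Path, from_enc: str, to_enc: str = "utf-8") -> None:
    """Decode src using from_enc, then re-encode the text into dst."""
    text = src.read_bytes().decode(from_enc)   # bytes -> characters
    dst.write_bytes(text.encode(to_enc))       # characters -> bytes

with tempfile.TemporaryDirectory() as folder:
    legacy = Path(folder) / "input.txt"
    modern = Path(folder) / "output.txt"
    legacy.write_bytes("café".encode("latin-1"))  # simulate a Latin-1 file

    convert_encoding(legacy, modern, from_enc="latin-1")
    converted = modern.read_bytes()

print(converted)
```

The order matters: decode with the encoding the file is actually in, then encode with the one you want. Doing it the other way around produces mojibake instead of fixing it.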

Specifying Encoding in Code

Always declare encoding in your projects:

HTML files: Add <meta charset="UTF-8"> in the <head>
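Python: Pass encoding='utf-8' to open() when reading or writing text. Python 3 source files are UTF-8 by default, so the # -*- coding: utf-8 -*- header is only needed for Python 2 or for source files saved in another encoding.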
Config files: Most modern formats (JSON, YAML) default to UTF-8

Common Encoding Problems and Fixes

Mojibake (Garbled Text)

When you see Ã© instead of é, or Ð¾Ð¿ÐºÐ° instead of Russian text, the file was saved in one encoding but opened with another.

Fix: Reopen the file with the correct encoding. In most text editors, you can try different encodings until the text displays correctly.
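
The sketch below reproduces the Ã© case in Python and then reverses it. It assumes the wrong encoding was Latin-1; with real data you may have to try Windows-1252 or others.

```python
original = "é"

# Simulate the bug: UTF-8 bytes read as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse it: recover the original bytes, then decode them correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled, "->", repaired)
```

This only works while the underlying bytes are intact. Once garbled text has been edited and saved again, the repair may no longer be possible.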

Question Marks or Boxes

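If you see ???? or □ characters, one of two things happened. Question marks usually mean the text was converted to an encoding that can't represent those characters. Boxes usually mean the font has no glyph for them.

Fix: If the original file is intact, open it with its real encoding and save it as UTF-8, which supports all Unicode characters. If the question marks are already stored in the file, the original characters are gone and must be restored from the source. For boxes, switch to a font that covers the script.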
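
Here is how the question marks get created in the first place. This Python sketch forces text into ASCII with replacement turned on; the sample string is arbitrary.

```python
text = "naïve 😀"

# errors="replace" substitutes "?" for anything the target can't represent.
lossy = text.encode("ascii", errors="replace")

# The default, errors="strict", raises instead of silently losing data.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8 can represent every character, so nothing is lost.
utf8_round_trip = text.encode("utf-8").decode("utf-8")

print(lossy, strict_failed, utf8_round_trip == text)
```

Once a character has been replaced by "?", the stored bytes contain a real question mark, and no later conversion can bring the original back.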

Accented Characters Broken in URLs

URLs must be ASCII. Browsers encode special characters automatically using percent-encoding (e.g., é becomes %C3%A9 in UTF-8).
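
When you build URLs in code, you have to do this step yourself. A minimal sketch using Python's standard urllib.parse module (the path is made up):

```python
from urllib.parse import quote, unquote

# quote() encodes the text as UTF-8 by default, then writes each
# non-ASCII byte as %XX.
encoded = quote("é")

# "/" is treated as safe by default, so path separators survive.
path = quote("/café/menu")

# unquote() reverses the process.
decoded = unquote(encoded)

print(encoded, path, decoded)
```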

Why UTF-8 Won

UTF-8 is the default for:

Web pages (over 98% of websites)
Email (most modern clients send UTF-8 via MIME)
JSON and XML files
Most programming languages
Unix/Linux systems

It won because it's backward compatible with ASCII, handles all languages, and produces smaller files for English-heavy content. There's no reason to use anything else for new projects.

The Bottom Line

Text storage is a conversion process: characters → numbers → bytes. The encoding system determines how that mapping works.

Use UTF-8 for everything. It's the universal standard that handles every character you'll ever need. If you're dealing with legacy files, identify the encoding first, then convert to UTF-8.

Understanding this isn't academic. Encoding bugs cause data corruption, security vulnerabilities, and display errors. Now you know what's actually happening when you save a text file.