How to Split Data from One Table into Multiple Tables in Python: A Step-by-Step Guide
Why Splitting Data Tables Matters
Working with large datasets is messy. One massive table with thousands of rows and dozens of columns is hard to manage, slow to process, and creates bottlenecks in your pipeline.
Splitting one table into multiple smaller tables solves these problems. You get cleaner data, faster queries, and easier maintenance.
This guide shows you exactly how to split data from a single table into multiple tables using Python. No fluff, just code and explanations that work.
The Core Methods for Splitting Tables
Three main approaches exist. Each has a use case.
1. Split by Column Values (Group-Based Splitting)
You have a column with categorical data. You want each unique value to become its own table. This is the most common scenario.
Example: A sales table with a "region" column. You want separate tables for North, South, East, and West.
2. Split by Row Count (Chunk-Based Splitting)
You have a huge file that crashes your memory when loaded at once. You split it into chunks of N rows each.
Example: A 5-million-row CSV file split into 10 files of 500,000 rows each.
3. Split by Column Groups (Vertical Splitting)
You have many columns and want to separate them logically. Maybe keep identifiers in one table and metrics in another.
Example: A user table with profile info, preferences, and activity logs split into three separate tables.
Method Comparison
| Method | Best For | Speed | Memory Usage |
|---|---|---|---|
| Group-Based (pandas) | Categorical splits | Fast | Medium |
| Chunk-Based (read_csv) | Large file handling | Moderate | Low |
| Vertical Split (column selection) | Column separation | Fast | High (full load) |
| SQL-based export | Database tables | Varies | Medium |
Getting Started: Prerequisites
Install pandas if you haven't already:
pip install pandas openpyxl
openpyxl handles Excel files. You need it for .xlsx exports.
Splitting by Column Values (Group-Based)
This is the most practical approach for most use cases.
import pandas as pd
# Load your data
df = pd.read_csv('sales_data.csv')
# Check what column you want to split on
print(df['region'].unique())
# Split into separate DataFrames
for region in df['region'].unique():
    subset = df[df['region'] == region]
    subset.to_csv(f'sales_{region.lower()}.csv', index=False)
    print(f'Created sales_{region.lower()}.csv with {len(subset)} rows')
This loops through each unique region value, filters the rows, and saves them to individual CSV files.
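The same split can also be written with groupby, which walks the data once instead of filtering once per value. This is a minimal sketch, not a drop-in replacement: the small in-memory frame stands in for sales_data.csv, its values are made up, and dropna=False assumes pandas 1.1 or later.

```python
import pandas as pd

# Illustrative data standing in for sales_data.csv (values are made up)
df = pd.DataFrame({
    'region': ['North', 'South', 'North', None, 'East'],
    'amount': [100, 250, 75, 40, 310],
})

# One pass over the data; dropna=False keeps rows whose region is missing
tables = {}
for region, subset in df.groupby('region', dropna=False):
    key = 'unknown' if pd.isna(region) else str(region).lower()
    tables[key] = subset.reset_index(drop=True)

for key, subset in tables.items():
    print(f'{key}: {len(subset)} rows')
```

Holding the pieces in a dict keeps them available for further processing before anything is written to disk.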
Saving to Excel Instead
Sometimes you need Excel output with multiple sheets:
import pandas as pd
df = pd.read_csv('sales_data.csv')
with pd.ExcelWriter('sales_split.xlsx') as writer:
    for region in df['region'].unique():
        subset = df[df['region'] == region]
        sheet_name = region.replace(' ', '_')[:31]  # Excel sheet name limit
        subset.to_excel(writer, sheet_name=sheet_name, index=False)
Excel sheet names have a 31-character limit. The code handles that automatically.
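Truncation alone can still produce invalid or duplicate sheet names. Excel also rejects the characters \ / ? * [ ] : in sheet names, and two long values can collapse to the same 31 characters. The helper below is a hypothetical sketch (safe_sheet_name is not part of pandas or openpyxl) that covers those two rules and nothing more.

```python
import re

def safe_sheet_name(value, used):
    """Return a sheet name that is valid for Excel and not already in `used`."""
    # Excel rejects \ / ? * [ ] : in sheet names and caps them at 31 characters
    base = re.sub(r'[\\/?*\[\]:]', '_', str(value)).strip() or 'sheet'
    name = base[:31]
    counter = 2
    # Excel compares sheet names case-insensitively, so track lowercase forms
    while name.lower() in used:
        suffix = f'_{counter}'
        name = base[:31 - len(suffix)] + suffix
        counter += 1
    used.add(name.lower())
    return name

used_names = set()
long_a = 'Northern Territories and Outlying Islands (A)'
long_b = 'Northern Territories and Outlying Islands (B)'
names = [safe_sheet_name(v, used_names) for v in ['East/West', long_a, long_b]]
print(names)
```

Pass the result as sheet_name when calling to_excel, and share one used set across the whole workbook.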
Splitting Large Files by Row Chunks
When your file is too large to load into memory, read it in chunks.
import pandas as pd
chunk_size = 500000 # rows per file
output_prefix = 'chunk_'
file_number = 0
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    chunk.to_csv(f'{output_prefix}{file_number}.csv', index=False)
    print(f'Written chunk {file_number} with {len(chunk)} rows')
    file_number += 1
print(f'Total files created: {file_number}')
This processes 500,000 rows at a time. Adjust chunk_size based on your available RAM. Start with 100,000 if you're unsure.
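Chunked reading also combines with group-based splitting when the file is too big to load but you still want one output per category. The sketch below builds a tiny demo file in a temporary directory so it runs on its own. The file names, column names, and chunksize=2 are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Build a small demo file; in practice this would be your large CSV
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'large_file.csv')
pd.DataFrame({
    'region': ['North', 'South', 'East', 'North', 'South', 'North'],
    'amount': [10, 20, 30, 40, 50, 60],
}).to_csv(source, index=False)

written = set()  # output files that already have a header row
for chunk in pd.read_csv(source, chunksize=2):
    for region, subset in chunk.groupby('region'):
        path = os.path.join(workdir, f'sales_{region.lower()}.csv')
        # Append to the region's file; write the header only the first time
        subset.to_csv(path, mode='a', header=path not in written, index=False)
        written.add(path)

north = pd.read_csv(os.path.join(workdir, 'sales_north.csv'))
print(f'{len(written)} files, north has {len(north)} rows')
```

Only one chunk is in memory at a time, so peak memory depends on chunksize rather than on the size of the source file.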
Splitting by Column Groups (Vertical Split)
Separate columns into different tables by logical group.
import pandas as pd
df = pd.read_csv('users_full.csv')
# Define your column groups
id_columns = ['user_id', 'email', 'created_at']
profile_columns = ['first_name', 'last_name', 'phone']
metrics_columns = ['purchases', 'total_spent', 'last_login']
# Create separate DataFrames
users_ids = df[id_columns]
users_profile = df[profile_columns]
users_metrics = df[metrics_columns]
# Export
users_ids.to_csv('users_ids.csv', index=False)
users_profile.to_csv('users_profile.csv', index=False)
users_metrics.to_csv('users_metrics.csv', index=False)
As written, users_profile and users_metrics carry no user_id, so their rows can only be matched back by position. Include a key column (like user_id) in every group if you need to rejoin the tables later.
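One way to keep the link is to prepend the key to every column group and rejoin with merge. This is a sketch under assumptions: the frame below is made-up data standing in for users_full.csv, and the group names in column_groups are illustrative.

```python
import pandas as pd

# Illustrative data standing in for users_full.csv
df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'c@example.com'],
    'first_name': ['Ana', 'Ben', 'Cy'],
    'purchases': [4, 0, 7],
    'total_spent': [120.5, 0.0, 310.0],
})

key = 'user_id'
column_groups = {
    'profile': ['email', 'first_name'],
    'metrics': ['purchases', 'total_spent'],
}

# Prepend the key to every group so each table can be joined back later
tables = {name: df[[key] + cols] for name, cols in column_groups.items()}

# Rejoin on the key; validate raises if user_id is not unique on either side
rejoined = tables['profile'].merge(tables['metrics'], on=key, validate='one_to_one')
print(rejoined.columns.tolist())
```

The validate argument is optional, but it catches duplicate keys early instead of silently multiplying rows.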
Using SQL for Database Table Splits
If your data lives in a database, do the splitting there. It's faster and uses less Python memory.
import sqlite3
import pandas as pd
conn = sqlite3.connect('database.db')
# Split by category in SQL
categories = pd.read_sql("SELECT DISTINCT category FROM products", conn)
for category in categories['category']:
    query = f"SELECT * FROM products WHERE category = '{category}'"
    df = pd.read_sql(query, conn)
    filename = f"products_{category.lower().replace(' ', '_')}.csv"
    df.to_csv(filename, index=False)
    print(f'Exported {filename}')
conn.close()
Warning: This SQL approach is vulnerable to injection if the category names come from untrusted input. Use parameterized queries instead:
import sqlite3
import pandas as pd
conn = sqlite3.connect('database.db')
cursor = conn.cursor()
categories = cursor.execute("SELECT DISTINCT category FROM products").fetchall()
for (category,) in categories:
    query = "SELECT * FROM products WHERE category = ?"
    df = pd.read_sql(query, conn, params=(category,))
    filename = f"products_{category.lower().replace(' ', '_')}.csv"
    df.to_csv(filename, index=False)
conn.close()
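To see the parameterized version at work without a real database, the sketch below uses an in-memory SQLite database. The table contents are made up, and one category deliberately contains an apostrophe, which would break the string-formatted query.

```python
import sqlite3
import pandas as pd

# In-memory database with a made-up products table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE products (name TEXT, category TEXT)')
conn.executemany(
    'INSERT INTO products VALUES (?, ?)',
    [('Rattle', "Kids' Toys"), ('Blocks', "Kids' Toys"), ('Kettle', 'Kitchen')],
)

exports = {}
rows = conn.execute('SELECT DISTINCT category FROM products').fetchall()
for (category,) in rows:
    # The ? placeholder passes the value separately from the SQL text,
    # so the apostrophe in "Kids' Toys" cannot break the query
    exports[category] = pd.read_sql(
        'SELECT * FROM products WHERE category = ?', conn, params=(category,)
    )

conn.close()
print({name: len(table) for name, table in exports.items()})
```

The ? placeholder style is specific to sqlite3. Other database drivers use different markers, so check the driver's documentation before porting this.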
Preserving Data Types and Formatting
Splitting through CSV files can mess up your data types. Dates come back as strings. Numeric columns with stray text come back as objects.
import pandas as pd
# Define types upfront
dtype_spec = {
    'user_id': 'int64',
    'purchase_amount': 'float64',
}
# Dates can't be set through dtype; parse_dates converts them instead
df = pd.read_csv('data.csv', dtype=dtype_spec, parse_dates=['purchase_date'])
# Verify types before splitting
print(df.dtypes)
Define your types when reading the file. This prevents headaches later when you try to do math on string columns.
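A quick way to check the effect is to read the same text twice, once with type hints and once without. The CSV content below is made up and read from memory, so the sketch runs without a data.csv on disk.

```python
import io
import pandas as pd

# Made-up CSV text standing in for data.csv
csv_text = (
    'user_id,purchase_amount,purchase_date\n'
    '1,19.99,2024-01-05\n'
    '2,5,2024-02-10\n'
)

# Without hints, the date column is read as text rather than as dates
raw = pd.read_csv(io.StringIO(csv_text))

# Numeric types go in dtype; date columns go in parse_dates
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'user_id': 'int64', 'purchase_amount': 'float64'},
    parse_dates=['purchase_date'],
)

print(raw.dtypes['purchase_date'], typed.dtypes['purchase_date'])
```

Apply the same dtype and parse_dates arguments whenever you read one of the split files back in.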
Common Problems and Fixes
- Missing values in split keys: Use dropna=False in your groupby (or an isna() filter) to capture rows with null values separately.
- Special characters in names: Replace spaces and symbols before using them as file names.
- Unicode errors: Specify encoding when reading: pd.read_csv('file.csv', encoding='utf-8')
- Memory errors: Reduce chunk_size or use the chunk-based reading method instead.
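For the special-characters problem, a small helper can normalize any split value before it becomes a file name. The function below is a hypothetical sketch (safe_filename is not a pandas function). It is deliberately conservative and keeps only ASCII letters, digits, dots, underscores, and hyphens.

```python
import re

def safe_filename(value, fallback='unnamed'):
    """Turn an arbitrary split value into a conservative file-name stem."""
    text = str(value).strip().lower()
    # Keep ASCII letters, digits, dot, underscore, hyphen; collapse the rest to "_"
    text = re.sub(r'[^a-z0-9._-]+', '_', text).strip('._')
    return text or fallback

print(safe_filename('North / East (2024)'))  # north_east_2024
print(safe_filename('???'))                  # unnamed
```

Two different values can map to the same stem, so check for collisions if your categories differ only in punctuation.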
Automating the Process
Wrap the splitting logic in a reusable function:
import pandas as pd
import os
def split_by_column(input_file, split_column, output_dir='output', file_format='csv'):
    os.makedirs(output_dir, exist_ok=True)
    df = pd.read_csv(input_file)
    for value in df[split_column].unique():
        if pd.isna(value):
            # NaN never equals itself, so select null rows with isna()
            subset = df[df[split_column].isna()]
            filename = f'{output_dir}/null_values.{file_format}'
        else:
            subset = df[df[split_column] == value]
            safe_value = str(value).replace(' ', '_').replace('/', '_')
            filename = f'{output_dir}/{safe_value}.{file_format}'
        if file_format == 'csv':
            subset.to_csv(filename, index=False)
        else:
            subset.to_excel(filename, index=False)
        print(f'Saved {len(subset)} rows to {filename}')
# Usage
split_by_column('sales_data.csv', 'region', output_dir='regional_splits')
One function call handles the entire split. Point it at any CSV with any split column.
When Not to Split
Splitting isn't always the answer.
- Small datasets under 10MB are easier to keep as one file.
- If you constantly need to join the split tables back together, splitting adds unnecessary complexity.
- For real-time applications, multiple files mean multiple read operations. A single indexed table is faster.
Split when you have a clear operational reason. Not just because it feels organized.
Bottom Line
Splitting tables in Python is straightforward with pandas. The method you choose depends on your goal:
- Group-based splitting for categorical data separation
- Chunk-based splitting for memory management with large files
- Vertical splitting for logical column grouping
Copy the code blocks above, swap in your file names and column names, and run. That's it.