Python Performance: How I Optimized a Slow Data Processing Script by 90%
Python is loved for its simplicity, but it’s often criticized for being slow. Recently, I was tasked with processing a 5GB CSV file containing millions of rows of user transaction data. My first attempt was a disaster—it was projected to take 4 hours to complete. Here is how I refactored it to run in under 10 minutes.
The Initial Problem: The "Row-by-Row" Trap
The original script used a standard for loop to iterate through the data. In Python, iterating over millions of rows using a native loop is incredibly expensive because of the overhead of the Python interpreter for each iteration.
```python
# What NOT to do with large data
import csv

with open('transactions.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Complex calculation here
        tax = float(row['amount']) * 0.23
        save_to_db(row['id'], tax)
```
This approach was slow because of per-row interpreter overhead, constant Disk I/O, and a separate database call for every single row. The script was essentially "choking" on the data volume.
Step 1: Implementing Generators for Memory Efficiency
The first step was to stop loading too much into memory. By using Generators (the yield keyword), I ensured that only one chunk of data was processed at a time, keeping the RAM usage constant regardless of file size.
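Here is a minimal sketch of that idea. The function name, chunk size, and the placeholder per-chunk work are illustrative, not the exact code from my pipeline:

```python
import csv

def read_in_chunks(path, chunk_size=100_000):
    """Yield lists of rows so only one chunk is ever held in memory."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:  # don't drop the final partial chunk
            yield chunk

# Each chunk is processed and discarded before the next one is read
for rows in read_in_chunks('transactions.csv'):
    ...  # per-chunk work goes here (e.g. the tax calculation)
```

Because the function yields instead of returning a full list, memory usage stays flat whether the file is 500MB or 50GB.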
Step 2: Vectorization with Pandas
The real "game changer" was moving from row-by-row logic to Vectorization. Instead of telling Python how to loop, I used Pandas to describe what to do with the entire column at once. Vectorized operations are executed in C, making them orders of magnitude faster.
```python
import pandas as pd

# Processing in chunks to keep memory low
chunks = pd.read_csv('transactions.csv', chunksize=100000)

for chunk in chunks:
    # Vectorized operation: Happens almost instantly!
    chunk['tax'] = chunk['amount'] * 0.23
    # Bulk insert into database
    upload_to_db(chunk)
```
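Here, upload_to_db is a stand-in for whatever bulk loader you use. With a SQLAlchemy engine, DataFrame.to_sql can handle the batched insert for you; this is a sketch assuming a local SQLite database (the connection string, table name, and batch size are placeholders to tune for your own driver):

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumption: any SQLAlchemy-supported database works; SQLite shown for simplicity
engine = create_engine('sqlite:///transactions.db')

def upload_to_db(chunk: pd.DataFrame) -> None:
    # method='multi' packs many rows into each INSERT statement;
    # chunksize controls the batch size (tune to your driver's parameter limits)
    chunk.to_sql('transactions_with_tax', engine,
                 if_exists='append', index=False,
                 method='multi', chunksize=1000)
```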
Step 3: Profiling the Bottlenecks
To find exactly where the time was being wasted, I used the cProfile module. I discovered that 30% of the time was spent on database connections. By switching to Bulk Inserts (sending 5,000 rows in one SQL command instead of 5,000 individual commands), the speed tripled again.
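Profiling the pipeline only takes a few lines; this is a minimal sketch, where process_file is a hypothetical wrapper around the chunked loop shown above:

```python
import cProfile
import pstats

def process_file():
    ...  # hypothetical wrapper around the chunked Pandas loop

profiler = cProfile.Profile()
profiler.enable()
process_file()
profiler.disable()

# Print the 10 functions with the highest cumulative time
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(10)
```

Sorting by cumulative time is what surfaced the database connection overhead in my case; sorting by 'tottime' instead highlights functions that are expensive on their own.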
Final Results & Comparison
| Method | Time Taken | Memory Usage |
|---|---|---|
| Standard Loop | ~4 Hours | High (Spiky) |
| Pandas + Vectorization | 8 Minutes | Stable (Low) |
Summary of Lessons Learned
- Don't loop if you can vectorize: Libraries like NumPy and Pandas are your best friends for large datasets.
- Use Chunks: Never load a multi-gigabyte file into memory all at once.
- Bulk your I/O: Whether it's API calls or Database writes, batching is almost always faster than issuing the same requests one by one.
Optimizing code isn't just about making it run faster; it's about making your infrastructure more cost-effective and your users happier. Have you ever optimized a script that felt "unfixable"? Share your experience below!
