Python Performance: How I Optimized a Slow Data Processing Script by 90%

Python is loved for its simplicity, but it’s often criticized for being slow. Recently, I was tasked with processing a 5GB CSV file containing millions of rows of user transaction data. My first attempt was a disaster—it was projected to take 4 hours to complete. Here is how I refactored it to run in under 10 minutes.

The Initial Problem: The "Row-by-Row" Trap

The original script used a standard for loop to iterate through the data. In Python, iterating over millions of rows using a native loop is incredibly expensive because of the overhead of the Python interpreter for each iteration.

# What NOT to do with large data
import csv

with open('transactions.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Complex calculation here
        tax = float(row['amount']) * 0.23
        # One database call per row: a separate round trip every time
        save_to_db(row['id'], tax)

This approach was slow because of constant disk and database I/O: every single row triggered its own read, calculation, and database write, on top of the interpreter overhead for each iteration. The script was essentially "choking" on the data volume.

Step 1: Implementing Generators for Memory Efficiency

The first step was to stop loading too much into memory. By using Generators (the yield keyword), I ensured that only one chunk of data was processed at a time, keeping the RAM usage constant regardless of file size.
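
For illustration, here is a minimal sketch of that generator pattern. The function name and the batch size are my own choices, not the exact code from the project:

import csv

def read_in_batches(path, batch_size=10_000):
    """Yield lists of rows so only one batch lives in memory at a time."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            # Flush the final, partially filled batch
            yield batch

# Usage: RAM stays flat no matter how large the CSV file is
for batch in read_in_batches('transactions.csv'):
    taxes = [float(row['amount']) * 0.23 for row in batch]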

Step 2: Vectorization with Pandas

The real "game changer" was moving from row-by-row logic to Vectorization. Instead of telling Python how to loop, I used Pandas to describe what to do with the entire column at once. Vectorized operations are executed in C, making them orders of magnitude faster.

import pandas as pd

# Processing in chunks to keep memory low
chunks = pd.read_csv('transactions.csv', chunksize=100000)

for chunk in chunks:
    # Vectorized operation: Happens almost instantly!
    chunk['tax'] = chunk['amount'] * 0.23
    
    # Bulk insert into database
    upload_to_db(chunk)

Step 3: Profiling the Bottlenecks

To find exactly where the time was being wasted, I used the cProfile module. I discovered that 30% of the time was spent on database connections. By switching to Bulk Inserts (sending 5,000 rows in one SQL command instead of 5,000 individual commands), the speed tripled again.
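
To make both ideas concrete, here is a rough sketch of what the profiling run and a bulk-insert version of upload_to_db could look like. The script name, the sqlite3 backend, and the taxes table are placeholders of mine, not the project's actual database setup:

import sqlite3

# Finding the bottleneck: run the script under cProfile and sort the
# report by cumulative time, e.g.
#   python -m cProfile -s cumulative process_transactions.py

def upload_to_db(chunk, db_path='transactions.db'):
    """Insert an entire Pandas chunk with a single executemany call."""
    records = list(chunk[['id', 'tax']].itertuples(index=False, name=None))
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS taxes (id INTEGER, tax REAL)")
        # One SQL command for thousands of rows instead of one command per row
        conn.executemany("INSERT INTO taxes (id, tax) VALUES (?, ?)", records)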

Final Results & Comparison

Method                    Time Taken    Memory Usage
Standard Loop             ~4 Hours      High (Spiky)
Pandas + Vectorization    8 Minutes     Stable (Low)

Summary of Lessons Learned

  • Don't loop if you can vectorize: Libraries like NumPy and Pandas are your best friends for large datasets (see the quick comparison sketched after this list).
  • Use Chunks: Never load a multi-gigabyte file into memory all at once.
  • Bulk your I/O: Whether it's API calls or Database writes, batching is almost always faster than individual calls.
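
As a quick, self-contained illustration of the first lesson (the data here is synthetic and the exact timings will vary by machine):

import timeit
import numpy as np

# One million synthetic transaction amounts
amounts = np.random.default_rng(0).uniform(1, 500, size=1_000_000)

def loop_tax():
    # Plain Python loop: the interpreter handles every element individually
    return [a * 0.23 for a in amounts]

def vectorized_tax():
    # NumPy pushes the whole multiplication down into C
    return amounts * 0.23

print("loop:      ", timeit.timeit(loop_tax, number=10))
print("vectorized:", timeit.timeit(vectorized_tax, number=10))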

Optimizing code isn't just about making it run faster; it's about making your infrastructure more cost-effective and your users happier. Have you ever optimized a script that felt "unfixable"? Share your experience below!
