Pandas: Speed up merges

November 9, 2017

You can speed up a merge by first setting the merge key column as the index of each dataframe, and then using join instead of merge.

The following example shows an improvement by a factor of about 10 (the one-time cost of set_index is not included in the timing):

>>> import pandas as pd
>>> left = pd.DataFrame(
    {
        'key': ['K0', 'K1', 'K2', 'K3'],
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
    }
)
>>> right = pd.DataFrame(
    {
        'key': ['K0', 'K1', 'K2', 'K3'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3'],
    }
)
>>> left2 = left.set_index('key')
>>> right2 = right.set_index('key')
>>> %timeit result2 = left2.join(right2)
416 µs ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit result = pd.merge(left, right, on='key')
4.81 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
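
The %timeit magic above only works in IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library timeit module, together with a check that both approaches return the same rows. The names joined, merged, same_result, and n_runs are illustrative choices, not part of the original example. Note that join defaults to how="left" while merge defaults to how="inner", so the results only match here because every key appears exactly once in both dataframes. Timings will vary with machine and pandas version, so the factor of 10 is not guaranteed.

```python
import timeit

import pandas as pd

left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

# One-time setup: move the key column into the index.
left2 = left.set_index("key")
right2 = right.set_index("key")

joined = left2.join(right2)               # defaults to how="left"
merged = pd.merge(left, right, on="key")  # defaults to how="inner"

# With unique keys present on both sides, the two results hold the same rows.
same_result = joined.reset_index().equals(merged)

# Time each approach; absolute numbers depend on machine and pandas version.
n_runs = 200  # arbitrary repeat count chosen for this sketch
join_seconds = timeit.timeit(lambda: left2.join(right2), number=n_runs)
merge_seconds = timeit.timeit(lambda: pd.merge(left, right, on="key"), number=n_runs)

print(f"same result: {same_result}")
print(f"join:  {join_seconds / n_runs * 1e6:.0f} µs per call")
print(f"merge: {merge_seconds / n_runs * 1e6:.0f} µs per call")
```

If the dataframes are merged only once, include the set_index calls in the timed code, since that setup cost may offset part of the gain.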

Via StackOverflow.



Francis T. O'Donovan