Pandas: Speed up merges
You can improve the speed of a merge
by first specifying the key column of the merge
as the index of your dataframes,
and then using join
instead of merge
:
The following example shows a improvement by a factor of about 10:
>>> import pandas as pd
>>> left = pd.DataFrame(
{
'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
}
)
>>> right = pd.DataFrame(
{
'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
}
)
>>> left2 = left.set_index('key')
>>> right2 = right.set_index('key')
>>> %timeit result2 = left2.join(right2)
416 µs ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit result = pd.merge(left, right, on='key')
4.81 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Via StackOverflow.
Leave a comment