Pandas concat basics and tricks

Let us continue to explore the IPL dataset on Kaggle. The first part of this series can be found here.

Now that we know how to fetch the basic properties of our CSV, group our data, and extract some meaningful values, let us see how to combine the data from two dataframes.

Let's dive into `pd.concat()`

- It is used to concatenate dataframes along the rows or the columns.
- It takes a list of dataframes to concatenate as input.
- It takes an optional argument axis that specifies whether to concatenate along the rows or the columns:
  - axis defaults to 0, which concatenates along the rows.
  - axis=1 concatenates along the columns.

Let's look at some examples from the IPL dataset now

Let's load the csv file and see what columns it has

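import pandas as pd

# Load the ball-by-ball data from the Kaggle input directory
balldf = pd.read_csv('/kaggle/input/ipl-complete-dataset-20082020/IPL Ball-by-Ball 2008-2020.csv')
balldf.shape   # (number of rows, number of columns)
balldf.dtypes  # data type of each column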

Interested in seeing the list of all the players who batted?

batgrp = balldf.groupby('batsman')
batgrp.groups.keys()

Let us fetch all the rows for Virender Sehwag and Yuvraj Singh individually. Since we have already grouped our data by batsman, we can call get_group() to get the data for any of the players.

sehwag = batgrp.get_group('V Sehwag')
yuvraj = batgrp.get_group('Yuvraj Singh')

Here's what their individual dataframes look like

We can see here that Sehwag has 1833 rows while Yuvraj has 2205 rows. Let us now concatenate these two dataframes along the rows and columns and see what our new dataframe looks like.

1. Concatenate the rows

When we concatenate the two dataframes along the rows, we can see that now the resultant dataframe has 1833 + 2205 = 4038 rows, each with 18 columns.
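The call that produces this result is not shown above, so here is a minimal sketch of row-wise concatenation. The two small frames are made-up stand-ins for sehwag and yuvraj (their names, columns and values are illustrative, not taken from the dataset); with the real data the call would presumably be pd.concat([sehwag, yuvraj]).

```python
import pandas as pd

# Hypothetical stand-ins for the per-batsman frames; the values are made up.
sehwag_demo = pd.DataFrame(
    {"batsman": ["V Sehwag"] * 3, "batsman_runs": [4, 0, 6]},
    index=[10, 11, 15],
)
yuvraj_demo = pd.DataFrame(
    {"batsman": ["Yuvraj Singh"] * 2, "batsman_runs": [1, 6]},
    index=[42, 43],
)

# axis=0 is the default, so the rows of the second frame are stacked under the first.
combined = pd.concat([sehwag_demo, yuvraj_demo])

print(combined.shape)        # row counts add up; the column count is unchanged
print(list(combined.index))  # the original row labels are kept
```

Because both frames share the same columns, no NaN values appear in the result.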

If the two dataframes have a different number of columns, the missing values are filled with NaN (not a number).

df1 = pd.DataFrame({'c1':[1,2], 'c2':[3,4], 'c3':[5,6]},
                   index=['r1','r2'])
df2 = pd.DataFrame({'c1':[5,6], 'c2':[7,8]},
                   index=['r1','r2'])
pd.concat([df2, df1])

    c1  c2   c3
r1   5   7  NaN
r2   6   8  NaN
r1   1   3  5.0
r2   2   4  6.0

2. Concatenate the columns

To concatenate the data along the columns, we need to pass axis=1. Note that the row labels in both dataframes are the same here.

df1 = pd.DataFrame({'c1':[1,2], 'c2':[3,4], 'c3':[5,6]},
                   index=['r1','r2'])
df2 = pd.DataFrame({'c1':[5,6], 'c2':[7,8]},
                   index=['r1','r2'])
pd.concat([df1, df2], axis=1)

    c1  c2  c3  c1  c2
r1   1   3   5   5   7
r2   2   4   6   6   8

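We can see here that there are now 2 rows with 5 columns in total (3 columns from df1, 2 columns from df2). The column labels remain the same as in the original dataframes, so c1 and c2 each appear twice.

Now let's see what happens if the row labels differ between the two dataframes. pd.concat() aligns the rows on their labels and fills in NaN wherever a label is missing from one of the dataframes.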

df1 = pd.DataFrame({'c1':[1,2], 'c2':[3,4], 'c3':[5,6]},
                   index=['r1','r2'])
df2 = pd.DataFrame({'c1':[5,6], 'c2':[7,8]},
                   index=['r3','r2'])
pd.concat([df1, df2], axis=1)

     c1   c2   c3   c1   c2
r1  1.0  3.0  5.0  NaN  NaN
r2  2.0  4.0  6.0  6.0  8.0
r3  NaN  NaN  NaN  5.0  7.0

Tricks to use `concat()` efficiently.

1. Ignore the index values

In the example above, where we concatenated the dataframes sehwag and yuvraj along the rows, the rows kept their original row indices. We can pass the optional argument ignore_index=True so that the original indices are ignored. The new DataFrame index is then labeled 0, …, n-1.
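A minimal sketch of the difference, using two small made-up frames in place of sehwag and yuvraj (the names first and second and their values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-ins with non-contiguous row labels, as a groupby slice would have.
first = pd.DataFrame({"batsman_runs": [4, 0, 6]}, index=[10, 11, 15])
second = pd.DataFrame({"batsman_runs": [1, 6]}, index=[42, 43])

# Default behaviour: the original row labels are carried over.
kept = pd.concat([first, second])

# ignore_index=True discards them and builds a fresh 0..n-1 index.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

The data itself is unchanged; only the row labels differ between the two results.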

2. Avoid duplicate index

If we want to maintain integrity and avoid rows with duplicate index values, we can use the optional argument verify_integrity=True. When it is True, pd.concat() raises a ValueError if the indices contain duplicates.

df1 = pd.DataFrame({'c1':[1,2], 'c2':[3,4], 'c3':[5,6]},
                   index=['r1','r2'])
df2 = pd.DataFrame({'c1':[5,6], 'c2':[7,8]},
                   index=['r3','r2'])
pd.concat([df2, df1], verify_integrity=True)

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    concat = pd.concat([df2, df1], verify_integrity=True)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 363, in __init__
    self.new_axes = self._get_new_axes()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 443, in _get_new_axes
    new_axes[self.axis] = self._get_concat_axis()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 500, in _get_concat_axis
    self._maybe_check_integrity(concat_axis)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/concat.py", line 509, in _maybe_check_integrity
    '{overlap!s}'.format(overlap=overlap))
ValueError: Indexes have overlapping values: ['r2']
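In a script we may prefer to handle this error instead of letting it stop the program. The sketch below reuses df1 and df2 from above and catches the ValueError; falling back to ignore_index=True is just one possible choice made for this illustration, not the only reasonable response.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4], 'c3': [5, 6]},
                   index=['r1', 'r2'])
df2 = pd.DataFrame({'c1': [5, 6], 'c2': [7, 8]},
                   index=['r3', 'r2'])

try:
    # 'r2' is present in both frames, so this raises a ValueError.
    result = pd.concat([df2, df1], verify_integrity=True)
except ValueError as err:
    print(f"Duplicate row labels found: {err}")
    # Fallback chosen for this sketch: drop the labels and renumber the rows.
    result = pd.concat([df2, df1], ignore_index=True)

print(list(result.index))
```

Whether renumbering is acceptable depends on whether the row labels carry meaning; if they do, the duplicates should be resolved before concatenating.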

3. Add multi-level index

We can add a hierarchical index using the keys keyword.

pd.concat([sehwag, yuvraj], keys=['Sehwag', 'Yuvraj'])
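To see what the extra index level buys us, here is a small sketch with made-up stand-in frames (first and second are hypothetical and their values are illustrative). Each key becomes the outer level of the row index, so the rows that came from one input can be selected back with .loc.

```python
import pandas as pd

# Hypothetical stand-ins for the per-batsman frames; the values are made up.
first = pd.DataFrame({"batsman_runs": [4, 0, 6]}, index=[10, 11, 15])
second = pd.DataFrame({"batsman_runs": [1, 6]}, index=[42, 43])

# Each key labels the rows that came from the corresponding frame.
labelled = pd.concat([first, second], keys=["Sehwag", "Yuvraj"])

print(labelled.index.nlevels)  # the row index now has two levels

# Select only the rows that originated from the second frame.
yuvraj_rows = labelled.loc["Yuvraj"]
print(list(yuvraj_rows.index))
```

Selecting by the outer key returns the rows with their original inner labels, which makes it easy to trace every row back to its source dataframe.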

That’s it

Thanks for reading. Please check out the notebook for the source code.

Stay tuned if you are interested in learning ML-related Python libraries and the practical aspects of machine learning.