Lost in Pandas - Part 3

Data Science
Author

Nodar Okroshiashvili

Published

April 20, 2020

Introduction

I’ll show you how to split one column with repetitive index and/or value into multiple rows with unique index.

import pandas as pd
data = {
    "ID": [100, 100, 200, 200, 200, 300, 300, 400],
    "Product_Name": ["Apple", "Banana", "Cherry", "Apricot", "Apple", "Avocado", "Avocado", "Orange"],
}

df = pd.DataFrame(data)

df
ID Product_Name
0 100 Apple
1 100 Banana
2 200 Cherry
3 200 Apricot
4 200 Apple
5 300 Avocado
6 300 Avocado
7 400 Orange

We see that our dataframe contains repetitive index (ID) and also not unique values for each unique index. Our aim is to convert Product_Name column into rows such as to keep index unique and in each new colum to be each product name. We need the following format:

ID 0 1 2
100 Apple Banana NaN
200 Cherry Apricot Apple
300 Avocado Avocado NaN
400 Orange NaN NaN

To achieve our aim we can use cumcount for new columns names to MultiIndex by set_index and reshape by unstack. The second way to do this will be to create Series of lists and new DataFrame by contructor:

df.set_index(["ID", df.groupby("ID").cumcount()])["Product_Name"].unstack()
0 1 2
ID
100 Apple Banana NaN
200 Cherry Apricot Apple
300 Avocado Avocado NaN
400 Orange NaN NaN
# Second way of doing the same as above

s = df.groupby("ID")["Product_Name"].apply(list)

pd.DataFrame(s.values.tolist(), index=s.index)
0 1 2
ID
100 Apple Banana None
200 Cherry Apricot Apple
300 Avocado Avocado None
400 Orange None None

That’s good. We found the solution, but what if we want to have just two columns. I mean to aggregate Product_Name column values in one cell for each unique index? That’s even easy to do. Use groupby then apply join and reset_index

df.groupby("ID")["Product_Name"].apply(" ".join).reset_index(name="New_Product_Name")
ID New_Product_Name
0 100 Apple Banana
1 200 Cherry Apricot Apple
2 300 Avocado Avocado
3 400 Orange