from collections import defaultdict
import pandas as pd
import numpy as np
Introduction
Much has been said and written about Pandas and its capabilities. I cannot imagine a data scientist or data analyst who has not heard about Pandas or has not used it at least once. We all use it, every day, every week; it does not matter how many times. It is a great tool. I use it all the time when I want to do data analysis, whether it is simple calculations or complex data transformations, and it keeps surprising me. Pandas is so simple in its form, and yet, imagine how much you can do with some simple method chaining.
With all of that said, this blog aims to share my experience and amazement with Pandas. This series is not meant for beginners and will not be short. It is based on my experience, and I will try to give a detailed explanation of every step, from defining the problem to solving it. That's enough for now. Let's get down to business.
Problem Statement
We have data that comes from the HR department of a company. The data contains two columns: the company name and information about its employees. Each row of the employee information column is a list of lists. The inner lists can be duplicated, can contain duplicate values, and have at most two elements.
Disclaimer: any resemblance of names, phones, emails, and titles to real ones is pure coincidence. The data is random and fake.
Here is our data.
data = {
    "company_name": ["A", "B", "C", "D", "E", "F"],
    "info": [
        [["Name", "David Jones"], ["Title", "CEO"], ["Phone", "207-685-1626"], ["Email", "djones@example.org"]],
        [
            ["Name", "Kate Brown"],
            ["Title", "Senior Lawyer"],
            ["Phone", "316-978-7791"],
            ["Email", "Kate.Brown@example.edu"],
            ["Name", "Darin White"],
            ["Title", "Associate Vice President"],
            ["Phone", "316-978-3887"],
            ["Email", "Darin.White@example.edu"],
        ],
        [
            ["Name", "Scott Lamb"],
            ["Title", "Actuary"],
            ["Phone", "316-978-3804"],
            ["Email", "scott.lamb@example.edu"],
            ["Name", "Scott Lamb"],
            ["Title", "Senior Officer"],
            ["Title", "Application Developer"],
            ["Title", "Blockchain Architect"],
            ["Title", "Director of External Affairs"],
            ["Name", "Scott"],
            ["Name", "Scott"],
            ["Title", "Director of Medicine"],
            ["Title", "Product Owner"],
            ["Name", "Mike"],
            ["Title", "Domain Expert"],
            ["Title", "Growth Hacker"],
            ["Title", "Engineering Head"],
            ["Title", "Event Manager"],
            ["Name", "Joe"],
            ["Name", "Mike"],
            ["Title", "Fundraising"],
            ["Title", "VP of Customers"],
            ["Name", "Mike"],
            ["Title", "Venture Capital Analyst"],
            ["Title", "UX Designer"],
            ["Name", "Mike"],
            ["Name", "Susan"],
            ["Name", "Bryan"],
            ["Name", "Mia"],
            ["Title", "Songwriter"],
        ],
        [
            ["Name", "Rose Smith Rose Smith"],
            ["Title", "Vice President"],
            ["Title", "Finance and Operations Head"],
            ["Phone", "316-978-3810"],
            ["Email", "rose.smith@example.edu"],
            ["Name", "Rose Smith"],
            ["Title", "Foundation"],
            ["Name", "Susan"],
            ["Title", "Foundation"],
            ["Title", "Accountant"],
            ["Title", "Accountant"],
            ["Title", "Executive"],
            ["Title", "Director"],
            ["Title", "Executive"],
            ["Name", "Ray"],
            ["Title", "Strategic Planning"],
            ["Title", "Financial Analyst"],
            ["Title", "Foundation"],
            ["Title", "Foundation"],
            ["Name", "Susan"],
            ["Title", "member of the board"],
            ["Title", "board of directors"],
            ["Title", "president"],
            ["Title", "board of directors"],
        ],
        [
            ["Name", "Carl Clark"],
            ["Title", "Chief "],
            ["Title", "Operating Officer"],
            ["Title", "PhD"],
            ["Phone", "413-534-2745"],
            ["Email", "Clark_Carl@example.com"],
        ],
        [
            ["Title", "Board Member"],
            ["Name", "Taylor Garcia"],
            ["Phone", "307-733-2164"],
            ["Phone", "307-733-4568"],
            ["Email", "Garcia@example.org"],
        ],
    ],
}
Let's convert this dictionary into a Pandas DataFrame and see what data we have.
df = pd.DataFrame(data)

df.head()
 | company_name | info |
---|---|---|
0 | A | [[Name, David Jones], [Title, CEO], [Phone, 20... |
1 | B | [[Name, Kate Brown], [Title, Senior Lawyer], [... |
2 | C | [[Name, Scott Lamb], [Title, Actuary], [Phone,... |
3 | D | [[Name, Rose Smith Rose Smith], [Title, Vice P... |
4 | E | [[Name, Carl Clark], [Title, Chief ], [Title, ... |
We see that the first column looks fine, but the second one does not. There we have one big list containing smaller two-element lists.
"info"].iloc[2] df[
[['Name', 'Scott Lamb'],
['Title', 'Actuary'],
['Phone', '316-978-3804'],
['Email', 'scott.lamb@example.edu'],
['Name', 'Scott Lamb'],
['Title', 'Senior Officer'],
['Title', 'Application Developer'],
['Title', 'Blockchain Architect'],
['Title', 'Director of External Affairs'],
['Name', 'Scott'],
['Name', 'Scott'],
['Title', 'Director of Medicine'],
['Title', 'Product Owner'],
['Name', 'Mike'],
['Title', 'Domain Expert'],
['Title', 'Growth Hacker'],
['Title', 'Engineering Head'],
['Title', 'Event Manager'],
['Name', 'Joe'],
['Name', 'Mike'],
['Title', 'Fundraising'],
['Title', 'VP of Customers'],
['Name', 'Mike'],
['Title', 'Venture Capital Analyst'],
['Title', 'UX Designer'],
['Name', 'Mike'],
['Name', 'Susan'],
['Name', 'Bryan'],
['Name', 'Mia'],
['Title', 'Songwriter']]
Now that we have figured out the data structure, let's define the aim.
We need to unpack the lists from the second column and flatten them into a tabular format in a way that preserves the order. That means, in the example above, Scott Lamb has to get the title Actuary, and no other titles are allowed for that pairing. Long story short, we need to flatten the list of lists and build a proper DataFrame from it.
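To make the aim concrete, here is a minimal sketch of the shape we are after for that record, built by hand from the sample values above:
# The desired flat format: one row per person, one column per tag.
pd.DataFrame(
    [["Scott Lamb", "Actuary", "316-978-3804", "scott.lamb@example.edu"]],
    columns=["Name", "Title", "Phone", "Email"],
)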
How I approached this problem
The first idea that came to my mind was to use the Pandas DataFrame .explode()
method to unpack the outer lists, which leaves one two-element list per row. After that, I extracted those two elements into two separate columns.
= df.explode("info")
df_exploded
df_exploded.head()
 | company_name | info |
---|---|---|
0 | A | [Name, David Jones] |
0 | A | [Title, CEO] |
0 | A | [Phone, 207-685-1626] |
0 | A | [Email, djones@example.org] |
1 | B | [Name, Kate Brown] |
# Add two new columns
df_exploded.loc[:, "tag"] = df_exploded["info"].map(lambda x: x[0])
df_exploded.loc[:, "result"] = df_exploded["info"].map(lambda x: x[1])

df_exploded.head()
 | company_name | info | tag | result |
---|---|---|---|---|
0 | A | [Name, David Jones] | Name | David Jones |
0 | A | [Title, CEO] | Title | CEO |
0 | A | [Phone, 207-685-1626] | Phone | 207-685-1626 |
0 | A | [Email, djones@example.org] | Email | djones@example.org |
1 | B | [Name, Kate Brown] | Name | Kate Brown |
Although the list of lists is now unpacked, this is still not the format I wanted, so an extra transformation is needed.
df_exploded_final = (
    df_exploded.groupby(["company_name", "tag"])["result"]
    .apply(lambda x: pd.Series(x.values))
    .unstack(1)
    .reset_index()
    .drop(["level_1"], axis=1)
)

df_exploded_final.head()
tag | company_name | Email | Name | Phone | Title |
---|---|---|---|---|---|
0 | A | djones@example.org | David Jones | 207-685-1626 | CEO |
1 | B | Kate.Brown@example.edu | Kate Brown | 316-978-7791 | Senior Lawyer |
2 | B | Darin.White@example.edu | Darin White | 316-978-3887 | Associate Vice President |
3 | C | scott.lamb@example.edu | Scott Lamb | 316-978-3804 | Actuary |
4 | C | NaN | Scott Lamb | NaN | Senior Officer |
It seems we did a good job. However, this approach is prone to errors. Namely, it does not preserve the pairing of the list values: groupby collects the values of each tag independently, so as soon as one person has more entries of one tag than of another, the positional alignment breaks and an email may end up assigned to a different person. You can already see it in row 4 above, where Scott Lamb gets a second title but no phone or email. That was not what I needed.
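As a quick sanity check (an extra step of mine, not required for the pipeline), we can list the rows that received a Name but no Email; with correctly paired records, such rows should only appear for the problematic companies C and D:
# Rows with a Name but no Email indicate broken tag alignment.
broken = df_exploded_final[
    df_exploded_final["Name"].notna() & df_exploded_final["Email"].isna()
]
print(broken[["company_name", "Name", "Title"]])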
So, I decided to solve this problem in another way. Notably, as the real data contained millions of rows, being perfectly precise seemed impossible, but I wanted to reduce the number of non-matching cases. To achieve this, I iterated over the values of the info
column and converted each one into a dict of lists, where the keys are tags (identifiers) and the values are the actual employee information.
out = []

for x in df["info"].tolist():
    groups = defaultdict(list)
    for g, v in x:
        groups[g].append(v)
    out.append(dict(groups))

df.loc[:, "new_info"] = out

df["new_info"].iloc[0]
{'Name': ['David Jones'],
'Title': ['CEO'],
'Phone': ['207-685-1626'],
'Email': ['djones@example.org']}
That's a step forward. Next, I wanted to count the values for each key in the dict of every row. I made small changes to the above code and applied it to the new_info
column.
out = []

for x in df["new_info"]:
    groups = defaultdict(int)
    for g, v in x.items():
        groups[g] = len(list(filter(None, v)))
    out.append(dict(groups))

df.loc[:, "new_info_stats"] = out

df["new_info_stats"].iloc[0]
{'Name': 1, 'Title': 1, 'Phone': 1, 'Email': 1}
Now that we have the value counts for each dict, we need to add three helper columns to the dataset for further use. These helper columns will help us differentiate matching cases from non-matching ones.
"_max"] = df["new_info_stats"].apply(lambda x: max(x.values()))
df[
"_min"] = df["new_info_stats"].apply(lambda x: min(x.values()))
df[
"max_equal_min"] = pd.Series(np.where((df["_max"] == df["_min"]), 1, 0)) df.loc[:,
The column max_equal_min
is a dummy variable that differentiates matching from non-matching cases: the value 1 indicates a matching case, and 0 a non-matching one. Based on this column, I split the data into two parts: the first contains only matching cases, and the second only non-matching ones.
= df[df["max_equal_min"] > 0].reset_index(drop=True)
df_first
= df[df["max_equal_min"] == 0].reset_index(drop=True) df_second
The pre-processing of the first DataFrame is done, and it is ready to be flattened. To do so, I iterate over the new_info
column and turn each row into its own Pandas DataFrame. After this step, the data is flat.
new_data = []

for j in df_first["new_info"]:
    new_data.append(pd.DataFrame(j))

df_first_final_i = (
    pd.concat(new_data, axis=0, sort=False)
    .drop_duplicates()
    .dropna(subset=["Name"])
    .reset_index(drop=True)
)

df_first_final_i.head()
 | Name | Title | Phone | Email |
---|---|---|---|---|
0 | David Jones | CEO | 207-685-1626 | djones@example.org |
1 | Kate Brown | Senior Lawyer | 316-978-7791 | Kate.Brown@example.edu |
2 | Darin White | Associate Vice President | 316-978-3887 | Darin.White@example.edu |
Woohoo, it works! However, imagine how slow this approach would be with millions of rows. For that reason, I tried another method and found it to be much faster. Here it is.
df_first_final_ii = (
    df_first["new_info"]
    .apply(pd.Series)
    .apply(lambda x: x.explode())
    .drop_duplicates()
    .reset_index(drop=True)
)

df_first_final_ii.head()
 | Name | Title | Phone | Email |
---|---|---|---|---|
0 | David Jones | CEO | 207-685-1626 | djones@example.org |
1 | Kate Brown | Senior Lawyer | 316-978-7791 | Kate.Brown@example.edu |
2 | Darin White | Associate Vice President | 316-978-3887 | Darin.White@example.edu |
A faster and cleaner solution. But what about the second DataFrame? It turned out that this approach did not fit the second DataFrame and raised an error: ValueError: cannot reindex from a duplicate axis
. It fails because the per-tag lists have different lengths there, so the exploded columns end up with duplicated index labels of unequal counts and cannot be realigned. Before finding a fix for this error, let's take a look at the data.
df_second.head()
 | company_name | info | new_info | new_info_stats | _max | _min | max_equal_min |
---|---|---|---|---|---|---|---|
0 | C | [[Name, Scott Lamb], [Title, Actuary], [Phone,... | {'Name': ['Scott Lamb', 'Scott Lamb', 'Scott',... | {'Name': 12, 'Title': 16, 'Phone': 1, 'Email': 1} | 16 | 1 | 0 |
1 | D | [[Name, Rose Smith Rose Smith], [Title, Vice P... | {'Name': ['Rose Smith Rose Smith', 'Rose Smith... | {'Name': 5, 'Title': 17, 'Phone': 1, 'Email': 1} | 17 | 1 | 0 |
2 | E | [[Name, Carl Clark], [Title, Chief ], [Title, ... | {'Name': ['Carl Clark'], 'Title': ['Chief ', '... | {'Name': 1, 'Title': 3, 'Phone': 1, 'Email': 1} | 3 | 1 | 0 |
3 | F | [[Title, Board Member], [Name, Taylor Garcia],... | {'Title': ['Board Member'], 'Name': ['Taylor G... | {'Title': 1, 'Name': 1, 'Phone': 2, 'Email': 1} | 2 | 1 | 0 |
In the rows with index 2 and 3, the new_info_stats
column shows one Name
with three Title
entries and one Name
with two Phone
entries, respectively. This is probably an artifact of how the data was entered or extracted; I am not 100% sure that this is the case, but the likelihood is very high. Either way, we have to handle it properly. One solution is to concatenate the strings in the values of the Title
and Phone
keys.
def process_info(record: dict) -> dict:
    # One person (a single Name, Email, Phone) with several Titles:
    # merge the titles into one string.
    if (
        len(record.keys()) == 4
        and len(record.get("Name")) == 1
        and len(record.get("Title")) > 1
        and len(record.get("Email")) == 1
        and len(record.get("Phone")) == 1
    ):
        record["Title"] = [" ".join(record.get("Title"))]
    # One person with several Emails: merge the emails.
    elif (
        len(record.keys()) == 4
        and len(record.get("Name")) == 1
        and len(record.get("Title")) == 1
        and len(record.get("Email")) > 1
        and len(record.get("Phone")) == 1
    ):
        record["Email"] = [",".join(record.get("Email"))]
    # One person with several Phones: merge the phone numbers.
    elif (
        len(record.keys()) == 4
        and len(record.get("Name")) == 1
        and len(record.get("Title")) == 1
        and len(record.get("Email")) == 1
        and len(record.get("Phone")) > 1
    ):
        record["Phone"] = [",".join(record.get("Phone"))]
    else:
        pass
    return record


df_second["new_info"] = df_second["new_info"].apply(process_info)
With this simple check in place, the second DataFrame is ready to be flattened. As I mentioned above, the good old method did not give me the desired result in this case, so I came up with the following:
def flatten(df, column):
    # Build one DataFrame per record; pd.Series pads the shorter
    # columns with NaN, so unequal list lengths are no longer a problem.
    data = []
    for i in df[column]:
        data.append(pd.DataFrame(dict([(k, pd.Series(v)) for k, v in i.items()])))

    new_df = (
        pd.concat(data, axis=0, sort=False)
        .drop_duplicates()
        .dropna(subset=["Name"])
        .drop_duplicates(subset=["Name", "Title", "Phone", "Email"])
        .reset_index(drop=True)
    )
    return new_df


df_second_final = flatten(df_second, "new_info")

df_second_final.head()
 | Name | Title | Phone | Email |
---|---|---|---|---|
0 | Scott Lamb | Actuary | 316-978-3804 | scott.lamb@example.edu |
1 | Scott Lamb | Senior Officer | NaN | NaN |
2 | Scott | Application Developer | NaN | NaN |
3 | Scott | Blockchain Architect | NaN | NaN |
4 | Mike | Director of External Affairs | NaN | NaN |
Applying this function to the second DataFrame flattened it. Combining the first and second DataFrames gives the final result.
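For completeness, here is a minimal sketch of that last step, assuming no further clean-up is needed:
# Stack the two flattened DataFrames into one final table.
df_final = pd.concat([df_first_final_ii, df_second_final], axis=0, sort=False).reset_index(drop=True)

df_final.head()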
Conclusion
To sum up, this was only a tiny part of the whole data pre-processing journey. Still, it was a great one, with lots to learn. What do you think? Did you find a more elegant solution? Please share it in the comments.