Lost in Pandas - Part 2

Data Science
Author

Nodar Okroshiashvili

Published

April 14, 2020

Introduction

I will show you how to fill missing values with the average of the nearest preceding and following non-missing values when our dataframe has the following form:

Problem Statement

How can we perform a groupby and fill NaNs with the average of their preceding and following values?

import pandas as pd
import numpy as np
data = {
    "type": [
        "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
        "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
    ],
    "date": [
        "2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02", "2019-03",
        "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09", "2019-10",
        "2019-11", "2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02",
        "2019-03", "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09",
        "2019-10", "2019-11",
    ],
    "v1": [
        10, np.nan, np.nan, 20, np.nan, np.nan, 30, np.nan, np.nan, 40, np.nan,
        np.nan, 50, np.nan, np.nan, 60, np.nan, np.nan, 70, np.nan, np.nan, 80,
        np.nan, np.nan, 90, np.nan, np.nan, 100, np.nan, np.nan,
    ],
    "v2": [
        10, np.nan, np.nan, 20, np.nan, np.nan, 30, np.nan, np.nan, 40, np.nan,
        np.nan, 50, np.nan, np.nan, 60, np.nan, np.nan, 70, np.nan, np.nan, 80,
        np.nan, np.nan, 90, np.nan, np.nan, 100, np.nan, np.nan,
    ],
}


df = pd.DataFrame(data)

df.head()
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 NaN NaN
2 a 2018-11 NaN NaN
3 a 2018-12 20.0 20.0
4 a 2019-01 NaN NaN

We see that the values in the type and date columns are repetitive. Moreover, it does not matter what values the other two columns hold, as long as they are numeric. Our aim is to fill each missing value in v1 and v2 with the average of the nearest non-missing values before and after it, within each type group. Notice that simply sorting and then filling will not give the desired result; we need something different. Before we get to that solution, the value columns must be of a numeric type so that we can compute averages. The Pandas to_numeric() method with the "errors" parameter set to "coerce" handles this conversion, whereas the DataFrame astype() method will not work in this case because it raises an error on unparseable entries.
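As a quick illustration of that difference (the toy series below is not part of our dataset), to_numeric() with errors="coerce" silently turns unparseable entries into NaN, where astype(float) would raise a ValueError:

```python
import pandas as pd

# A toy column containing numeric strings and one stray non-numeric entry
s = pd.Series(["10", "20", "oops", "40"])

# s.astype(float) would raise a ValueError on "oops";
# to_numeric with errors="coerce" turns it into NaN instead
converted = pd.to_numeric(s, errors="coerce")
```

After the conversion the column is float-typed and ready for averaging, with the bad entry marked as missing.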

We have two solutions here. To try both of them, let's make copies of the initial dataframe. The first solution groups by type, applies the "bfill" and "ffill" aggregators, then groups the resulting columns and computes their mean. The second solution does almost the same, but uses the apply() method instead of aggregation.

# The first solution

df_first = df.copy(deep=True)

# Backward-fill and forward-fill within each "type" group, then pair up
# the "bfill" and "ffill" results for each original column and average them
df_first[["v1", "v2"]] = (df_first.groupby("type")[["v1", "v2"]]
                                  .agg(["bfill", "ffill"])
                                  .groupby(level=0, axis=1)
                                  .mean())

df_first
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 15.0 15.0
2 a 2018-11 15.0 15.0
3 a 2018-12 20.0 20.0
4 a 2019-01 25.0 25.0
5 a 2019-02 25.0 25.0
6 a 2019-03 30.0 30.0
7 a 2019-04 35.0 35.0
8 a 2019-05 35.0 35.0
9 a 2019-06 40.0 40.0
10 a 2019-07 45.0 45.0
11 a 2019-08 45.0 45.0
12 a 2019-09 50.0 50.0
13 a 2019-10 50.0 50.0
14 a 2019-11 50.0 50.0
15 b 2018-09 60.0 60.0
16 b 2018-10 65.0 65.0
17 b 2018-11 65.0 65.0
18 b 2018-12 70.0 70.0
19 b 2019-01 75.0 75.0
20 b 2019-02 75.0 75.0
21 b 2019-03 80.0 80.0
22 b 2019-04 85.0 85.0
23 b 2019-05 85.0 85.0
24 b 2019-06 90.0 90.0
25 b 2019-07 95.0 95.0
26 b 2019-08 95.0 95.0
27 b 2019-09 100.0 100.0
28 b 2019-10 100.0 100.0
29 b 2019-11 100.0 100.0
# The second solution

df_second = df.copy(deep=True)

# Group once; group_keys=False keeps the original index when using apply()
g = df_second.groupby(["type"], group_keys=False)[["v1", "v2"]]

# Average of the forward- and backward-filled values within each group;
# rows with no following (or preceding) value in their group remain NaN
df_second[["v1", "v2"]] = (g.ffill() + g.bfill()) / 2

# g still references df_second, so this fills the remaining edge NaNs
# within each group
df_second[["v1", "v2"]] = g.apply(lambda x: x.bfill().ffill())


df_second
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 15.0 15.0
2 a 2018-11 15.0 15.0
3 a 2018-12 20.0 20.0
4 a 2019-01 25.0 25.0
5 a 2019-02 25.0 25.0
6 a 2019-03 30.0 30.0
7 a 2019-04 35.0 35.0
8 a 2019-05 35.0 35.0
9 a 2019-06 40.0 40.0
10 a 2019-07 45.0 45.0
11 a 2019-08 45.0 45.0
12 a 2019-09 50.0 50.0
13 a 2019-10 50.0 50.0
14 a 2019-11 50.0 50.0
15 b 2018-09 60.0 60.0
16 b 2018-10 65.0 65.0
17 b 2018-11 65.0 65.0
18 b 2018-12 70.0 70.0
19 b 2019-01 75.0 75.0
20 b 2019-02 75.0 75.0
21 b 2019-03 80.0 80.0
22 b 2019-04 85.0 85.0
23 b 2019-05 85.0 85.0
24 b 2019-06 90.0 90.0
25 b 2019-07 95.0 95.0
26 b 2019-08 95.0 95.0
27 b 2019-09 100.0 100.0
28 b 2019-10 100.0 100.0
29 b 2019-11 100.0 100.0

The two methods above gave us the desired result. Let's try a plain method to achieve the same thing. Below is the naive, plain method, which fills missing values with the backward and forward average over the whole frame. However, it introduces a bug that is hard to detect. Let's see.

# The third method

df_third = df.copy(deep=True)

# Average of forward- and backward-filled values over the WHOLE frame,
# ignoring the "type" groups
df_third[["v1", "v2"]] = (df_third[["v1", "v2"]].ffill() + df_third[["v1", "v2"]].bfill()) / 2

# Fill the remaining edge NaNs, again ignoring the groups
df_third[["v1", "v2"]] = df_third[["v1", "v2"]].bfill().ffill()

df_third
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 15.0 15.0
2 a 2018-11 15.0 15.0
3 a 2018-12 20.0 20.0
4 a 2019-01 25.0 25.0
5 a 2019-02 25.0 25.0
6 a 2019-03 30.0 30.0
7 a 2019-04 35.0 35.0
8 a 2019-05 35.0 35.0
9 a 2019-06 40.0 40.0
10 a 2019-07 45.0 45.0
11 a 2019-08 45.0 45.0
12 a 2019-09 50.0 50.0
13 a 2019-10 55.0 55.0
14 a 2019-11 55.0 55.0
15 b 2018-09 60.0 60.0
16 b 2018-10 65.0 65.0
17 b 2018-11 65.0 65.0
18 b 2018-12 70.0 70.0
19 b 2019-01 75.0 75.0
20 b 2019-02 75.0 75.0
21 b 2019-03 80.0 80.0
22 b 2019-04 85.0 85.0
23 b 2019-05 85.0 85.0
24 b 2019-06 90.0 90.0
25 b 2019-07 95.0 95.0
26 b 2019-08 95.0 95.0
27 b 2019-09 100.0 100.0
28 b 2019-10 100.0 100.0
29 b 2019-11 100.0 100.0

At first glance everything seems OK, but let's check the equality of these three dataframes with the double equality sign (==) and the all() method.

df_first == df_second

# The results of the first and the second method are equal
type date v1 v2
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
10 True True True True
11 True True True True
12 True True True True
13 True True True True
14 True True True True
15 True True True True
16 True True True True
17 True True True True
18 True True True True
19 True True True True
20 True True True True
21 True True True True
22 True True True True
23 True True True True
24 True True True True
25 True True True True
26 True True True True
27 True True True True
28 True True True True
29 True True True True
df_third == df_first
type date v1 v2
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
10 True True True True
11 True True True True
12 True True True True
13 True True False False
14 True True False False
15 True True True True
16 True True True True
17 True True True True
18 True True True True
19 True True True True
20 True True True True
21 True True True True
22 True True True True
23 True True True True
24 True True True True
25 True True True True
26 True True True True
27 True True True True
28 True True True True
29 True True True True
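The elementwise comparison frames above can be collapsed to a single boolean per column with the all() method mentioned earlier. A minimal self-contained sketch, using toy frames that stand in for df_first and df_third:

```python
import pandas as pd

# Two small frames that differ in one cell, standing in for df_first and df_third
a = pd.DataFrame({"v1": [10.0, 50.0, 50.0]})
b = pd.DataFrame({"v1": [10.0, 50.0, 55.0]})

per_cell = a == b            # elementwise True/False frame
per_column = per_cell.all()  # one boolean per column
```

A single False anywhere in a column makes per_column report False for it, which is exactly how the mismatch at indices 13 and 14 surfaces in the real comparison.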

At index positions 13 and 14 we have False values. That is the source of the inequality. But why do we have them? The naive, plain method averaged the non-missing values sequentially, not within each type group. At indices 13 and 14 the correct value is 50, not 55, because the third method averaged 50 (the last value of group a) with 60 (the first value of group b) instead of respecting the group boundary, where the following value is NaN. That is the bug!
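When the frames are larger, scanning a wall of True/False by eye is error-prone. One way to pull out only the differing cells is DataFrame.compare(), available since pandas 1.1 (so this is a version-dependent sketch, using small stand-in frames):

```python
import pandas as pd

# Stand-ins for the correct (grouped) and buggy (ungrouped) fill results
correct = pd.DataFrame({"v1": [50.0, 50.0, 50.0], "v2": [50.0, 50.0, 50.0]})
buggy = pd.DataFrame({"v1": [50.0, 55.0, 55.0], "v2": [50.0, 55.0, 55.0]})

# compare() keeps only the rows and cells that differ,
# shown side by side under "self" and "other" sub-columns
diff = correct.compare(buggy)
```

Only the mismatching rows survive in diff, which makes a bug like this one jump out immediately.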

df.iloc[12:16]
type date v1 v2
12 a 2019-09 50.0 50.0
13 a 2019-10 NaN NaN
14 a 2019-11 NaN NaN
15 b 2018-09 60.0 60.0
df_first.iloc[12:16]
type date v1 v2
12 a 2019-09 50.0 50.0
13 a 2019-10 50.0 50.0
14 a 2019-11 50.0 50.0
15 b 2018-09 60.0 60.0
df_third.iloc[12:16]  # Not Correct
type date v1 v2
12 a 2019-09 50.0 50.0
13 a 2019-10 55.0 55.0
14 a 2019-11 55.0 55.0
15 b 2018-09 60.0 60.0