# Lost in Pandas - Part 2

Data Science
Author

Nodar Okroshiashvili

Published

April 14, 2020

# Introduction

I will show you how to fill missing values by the average of its before and after values if our dataframe has the following form:

## Problem Statement

How can we perform `groupby` and fill `Nan`’s with its preceding and following values?

``````import pandas as pd
import numpy as np``````
``````data = {
"type": [
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
],
"date": [
"2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02", "2019-03",
"2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09", "2019-10",
"2019-11", "2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02",
"2019-03", "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09",
"2019-10", "2019-11",
],
"v1": [
10, np.nan, np.nan, 20, np.nan, np.nan, 30, np.nan, np.nan, 40, np.nan,
np.nan, 50, np.nan, np.nan, 60, np.nan, np.nan, 70, np.nan, np.nan, 80,
np.nan, np.nan, 90, np.nan, np.nan, 100, np.nan, np.nan,
],
"v2": [
10, np.nan, np.nan, 20, np.nan, np.nan, 30, np.nan, np.nan, 40, np.nan,
np.nan, 50, np.nan, np.nan, 60, np.nan, np.nan, 70, np.nan, np.nan, 80,
np.nan, np.nan, 90, np.nan, np.nan, 100, np.nan, np.nan,
],
}

df = pd.DataFrame(data)

type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 NaN NaN
2 a 2018-11 NaN NaN
3 a 2018-12 20.0 20.0
4 a 2019-01 NaN NaN

We see that values in `type` and `date` column are repetitive. Moreover, it does not matter what values are in other two columns, unless it is numeric type. Our aim is to fill these missing values by the average of its before and after value for `v1` and `v2` columns. We also notice that sorting and then filling won’t give desired result. We need something different. But before we find that solution we need to convert columns in numeric type in order to compute average by using Pandas `to_numeric()` method with parameter “errors” set to “coerce”, because Pandas DataFrame `astype()` method won’t work in this case.

We have two solution here. To use both of them let make copy of initial dataframe. The first uses `groupby` and then applies aggregator with “backward fill” and “forward fill” and then again `groupby` and then computes mean. The second solution does almost the same as the first but uses `apply()` method instead of aggregation.

``````# The first solution

df_first = df.copy(deep=True)

df_first[["v1", "v2"]] = (df_first.groupby("type")[["v1", "v2"]]
.agg(["bfill", "ffill"])
.groupby(level=0, axis=1)
.mean())

df_first``````
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 15.0 15.0
2 a 2018-11 15.0 15.0
3 a 2018-12 20.0 20.0
4 a 2019-01 25.0 25.0
5 a 2019-02 25.0 25.0
6 a 2019-03 30.0 30.0
7 a 2019-04 35.0 35.0
8 a 2019-05 35.0 35.0
9 a 2019-06 40.0 40.0
10 a 2019-07 45.0 45.0
11 a 2019-08 45.0 45.0
12 a 2019-09 50.0 50.0
13 a 2019-10 50.0 50.0
14 a 2019-11 50.0 50.0
15 b 2018-09 60.0 60.0
16 b 2018-10 65.0 65.0
17 b 2018-11 65.0 65.0
18 b 2018-12 70.0 70.0
19 b 2019-01 75.0 75.0
20 b 2019-02 75.0 75.0
21 b 2019-03 80.0 80.0
22 b 2019-04 85.0 85.0
23 b 2019-05 85.0 85.0
24 b 2019-06 90.0 90.0
25 b 2019-07 95.0 95.0
26 b 2019-08 95.0 95.0
27 b 2019-09 100.0 100.0
28 b 2019-10 100.0 100.0
29 b 2019-11 100.0 100.0
``````# The second solution

df_second = df.copy(deep=True)

g = df_second.groupby(["type"], group_keys=False)[["v1", "v2"]]

df_second[["v1", "v2"]] = (g.ffill() + g.bfill()) / 2

df_second[["v1", "v2"]] = g.apply(lambda x: x.bfill().ffill())

df_second``````
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 15.0 15.0
2 a 2018-11 15.0 15.0
3 a 2018-12 20.0 20.0
4 a 2019-01 25.0 25.0
5 a 2019-02 25.0 25.0
6 a 2019-03 30.0 30.0
7 a 2019-04 35.0 35.0
8 a 2019-05 35.0 35.0
9 a 2019-06 40.0 40.0
10 a 2019-07 45.0 45.0
11 a 2019-08 45.0 45.0
12 a 2019-09 50.0 50.0
13 a 2019-10 50.0 50.0
14 a 2019-11 50.0 50.0
15 b 2018-09 60.0 60.0
16 b 2018-10 65.0 65.0
17 b 2018-11 65.0 65.0
18 b 2018-12 70.0 70.0
19 b 2019-01 75.0 75.0
20 b 2019-02 75.0 75.0
21 b 2019-03 80.0 80.0
22 b 2019-04 85.0 85.0
23 b 2019-05 85.0 85.0
24 b 2019-06 90.0 90.0
25 b 2019-07 95.0 95.0
26 b 2019-08 95.0 95.0
27 b 2019-09 100.0 100.0
28 b 2019-10 100.0 100.0
29 b 2019-11 100.0 100.0

Above two methods gave us desirable results. Let try some plain method to achieve the same. Below is the dry, plain method which fill missing values by backward and forward average. However by using this we introduce bug which will be hard to detect. Let see.

``````# The third method

df_third = df.copy(deep=True)

df_third[["v1", "v2"]] = (df_third[["v1", "v2"]].ffill() + df_third[["v1", "v2"]].bfill()) / 2

df_third[["v1", "v2"]] = df_third[["v1", "v2"]].bfill().ffill()

df_third``````
type date v1 v2
0 a 2018-09 10.0 10.0
1 a 2018-10 15.0 15.0
2 a 2018-11 15.0 15.0
3 a 2018-12 20.0 20.0
4 a 2019-01 25.0 25.0
5 a 2019-02 25.0 25.0
6 a 2019-03 30.0 30.0
7 a 2019-04 35.0 35.0
8 a 2019-05 35.0 35.0
9 a 2019-06 40.0 40.0
10 a 2019-07 45.0 45.0
11 a 2019-08 45.0 45.0
12 a 2019-09 50.0 50.0
13 a 2019-10 55.0 55.0
14 a 2019-11 55.0 55.0
15 b 2018-09 60.0 60.0
16 b 2018-10 65.0 65.0
17 b 2018-11 65.0 65.0
18 b 2018-12 70.0 70.0
19 b 2019-01 75.0 75.0
20 b 2019-02 75.0 75.0
21 b 2019-03 80.0 80.0
22 b 2019-04 85.0 85.0
23 b 2019-05 85.0 85.0
24 b 2019-06 90.0 90.0
25 b 2019-07 95.0 95.0
26 b 2019-08 95.0 95.0
27 b 2019-09 100.0 100.0
28 b 2019-10 100.0 100.0
29 b 2019-11 100.0 100.0

At the first glance everything seems ok, but let check the equality of these three dataframes with double equality sign and `all` method.

``````df_first == df_second

# The result of the first and the second method are equal``````
type date v1 v2
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
10 True True True True
11 True True True True
12 True True True True
13 True True True True
14 True True True True
15 True True True True
16 True True True True
17 True True True True
18 True True True True
19 True True True True
20 True True True True
21 True True True True
22 True True True True
23 True True True True
24 True True True True
25 True True True True
26 True True True True
27 True True True True
28 True True True True
29 True True True True
``df_third == df_first``
type date v1 v2
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
10 True True True True
11 True True True True
12 True True True True
13 True True False False
14 True True False False
15 True True True True
16 True True True True
17 True True True True
18 True True True True
19 True True True True
20 True True True True
21 True True True True
22 True True True True
23 True True True True
24 True True True True
25 True True True True
26 True True True True
27 True True True True
28 True True True True
29 True True True True

At index position 13 and 14 we have `False` values. That was the reason of inequality. But why do we have these `False` values? The dry plain method calculated the average of non-missing values sequentially not by `type`. At index 13 and 14 the average is 50 not 55, because the third method calculated the average between 50 and 60 instead of 50 and NaN. That resulted the bug!

``df.iloc[12:16]``
type date v1 v2
12 a 2019-09 50.0 50.0
13 a 2019-10 NaN NaN
14 a 2019-11 NaN NaN
15 b 2018-09 60.0 60.0
``df_first.iloc[12:16]``
type date v1 v2
12 a 2019-09 50.0 50.0
13 a 2019-10 50.0 50.0
14 a 2019-11 50.0 50.0
15 b 2018-09 60.0 60.0
``df_third.iloc[12:16]  # Not Correct``
type date v1 v2
12 a 2019-09 50.0 50.0
13 a 2019-10 55.0 55.0
14 a 2019-11 55.0 55.0
15 b 2018-09 60.0 60.0