```python
import pandas as pd
import numpy as np
```
## Introduction
In this post I will show how to fill missing values with the average of their preceding and following values when our dataframe has the following form:
## Problem Statement

How can we perform a `groupby` and fill `NaN`s with the average of their preceding and following values?
```python
data = {
    "type": [
        "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
        "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
    ],
    "date": [
        "2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02", "2019-03",
        "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09", "2019-10",
        "2019-11", "2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02",
        "2019-03", "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09",
        "2019-10", "2019-11",
    ],
    "v1": [
        10, np.nan, np.nan, 20, np.nan, np.nan, 30, np.nan, np.nan, 40, np.nan,
        np.nan, 50, np.nan, np.nan, 60, np.nan, np.nan, 70, np.nan, np.nan, 80,
        np.nan, np.nan, 90, np.nan, np.nan, 100, np.nan, np.nan,
    ],
    "v2": [
        10, np.nan, np.nan, 20, np.nan, np.nan, 30, np.nan, np.nan, 40, np.nan,
        np.nan, 50, np.nan, np.nan, 60, np.nan, np.nan, 70, np.nan, np.nan, 80,
        np.nan, np.nan, 90, np.nan, np.nan, 100, np.nan, np.nan,
    ],
}

df = pd.DataFrame(data)
df.head()
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
0 | a | 2018-09 | 10.0 | 10.0 |
1 | a | 2018-10 | NaN | NaN |
2 | a | 2018-11 | NaN | NaN |
3 | a | 2018-12 | 20.0 | 20.0 |
4 | a | 2019-01 | NaN | NaN |
We see that the values in the `type` and `date` columns repeat. Moreover, the exact values in the other two columns do not matter, as long as they are numeric. Our aim is to fill the missing values in the `v1` and `v2` columns with the average of the preceding and following values within each `type` group. Notice that sorting and then filling will not give the desired result; we need something different. But first, if the value columns are not already numeric, we need to convert them with the pandas `to_numeric()` function, setting the `errors` parameter to `"coerce"` so that the averages can be computed; the DataFrame `astype()` method will not work in this case, because it raises an error on unparseable values instead of converting them to `NaN`.
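As a quick aside, the difference between the two conversions can be seen on a small throwaway series (the string values here are purely illustrative):

```python
import pandas as pd
import numpy as np

# A small series with one unparseable entry (illustrative only)
s = pd.Series(["10", "x", "20"])

# to_numeric with errors="coerce" turns "x" into NaN
coerced = pd.to_numeric(s, errors="coerce")

# astype(float) raises a ValueError instead of coercing
try:
    s.astype(float)
except ValueError:
    astype_failed = True
```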
We have two solutions here. To demonstrate both, let us make copies of the initial dataframe. The first uses `groupby`, applies an aggregator with backward fill and forward fill, then groups the resulting columns again and computes their mean. The second solution does almost the same as the first but uses the `apply()` method instead of aggregation.
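Before looking at the real solutions, here is the core averaging idea sketched on a tiny standalone series (the values are made up for illustration): `(ffill + bfill) / 2` handles interior gaps, and a follow-up `bfill().ffill()` handles the edges, where one of the two fills is `NaN`.

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 20.0, np.nan])

# Interior NaNs become the average of the surrounding values: (10 + 20) / 2 = 15
avg = (s.ffill() + s.bfill()) / 2

# The trailing NaN has no following value, so it is filled from the last known one
filled = avg.bfill().ffill()
```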
```python
# The first solution
df_first = df.copy(deep=True)

df_first[["v1", "v2"]] = (df_first.groupby("type")[["v1", "v2"]]
                          .agg(["bfill", "ffill"])   # per-group backward and forward fill
                          .groupby(level=0, axis=1)  # regroup the (v1, bfill)/(v1, ffill) column pairs
                          .mean())                   # average the two fills

df_first
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
0 | a | 2018-09 | 10.0 | 10.0 |
1 | a | 2018-10 | 15.0 | 15.0 |
2 | a | 2018-11 | 15.0 | 15.0 |
3 | a | 2018-12 | 20.0 | 20.0 |
4 | a | 2019-01 | 25.0 | 25.0 |
5 | a | 2019-02 | 25.0 | 25.0 |
6 | a | 2019-03 | 30.0 | 30.0 |
7 | a | 2019-04 | 35.0 | 35.0 |
8 | a | 2019-05 | 35.0 | 35.0 |
9 | a | 2019-06 | 40.0 | 40.0 |
10 | a | 2019-07 | 45.0 | 45.0 |
11 | a | 2019-08 | 45.0 | 45.0 |
12 | a | 2019-09 | 50.0 | 50.0 |
13 | a | 2019-10 | 50.0 | 50.0 |
14 | a | 2019-11 | 50.0 | 50.0 |
15 | b | 2018-09 | 60.0 | 60.0 |
16 | b | 2018-10 | 65.0 | 65.0 |
17 | b | 2018-11 | 65.0 | 65.0 |
18 | b | 2018-12 | 70.0 | 70.0 |
19 | b | 2019-01 | 75.0 | 75.0 |
20 | b | 2019-02 | 75.0 | 75.0 |
21 | b | 2019-03 | 80.0 | 80.0 |
22 | b | 2019-04 | 85.0 | 85.0 |
23 | b | 2019-05 | 85.0 | 85.0 |
24 | b | 2019-06 | 90.0 | 90.0 |
25 | b | 2019-07 | 95.0 | 95.0 |
26 | b | 2019-08 | 95.0 | 95.0 |
27 | b | 2019-09 | 100.0 | 100.0 |
28 | b | 2019-10 | 100.0 | 100.0 |
29 | b | 2019-11 | 100.0 | 100.0 |
```python
# The second solution
df_second = df.copy(deep=True)

g = df_second.groupby(["type"], group_keys=False)[["v1", "v2"]]

# Average of the per-group forward and backward fills
df_second[["v1", "v2"]] = (g.ffill() + g.bfill()) / 2
# Fill the NaNs left at the group edges
df_second[["v1", "v2"]] = g.apply(lambda x: x.bfill().ffill())

df_second
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
0 | a | 2018-09 | 10.0 | 10.0 |
1 | a | 2018-10 | 15.0 | 15.0 |
2 | a | 2018-11 | 15.0 | 15.0 |
3 | a | 2018-12 | 20.0 | 20.0 |
4 | a | 2019-01 | 25.0 | 25.0 |
5 | a | 2019-02 | 25.0 | 25.0 |
6 | a | 2019-03 | 30.0 | 30.0 |
7 | a | 2019-04 | 35.0 | 35.0 |
8 | a | 2019-05 | 35.0 | 35.0 |
9 | a | 2019-06 | 40.0 | 40.0 |
10 | a | 2019-07 | 45.0 | 45.0 |
11 | a | 2019-08 | 45.0 | 45.0 |
12 | a | 2019-09 | 50.0 | 50.0 |
13 | a | 2019-10 | 50.0 | 50.0 |
14 | a | 2019-11 | 50.0 | 50.0 |
15 | b | 2018-09 | 60.0 | 60.0 |
16 | b | 2018-10 | 65.0 | 65.0 |
17 | b | 2018-11 | 65.0 | 65.0 |
18 | b | 2018-12 | 70.0 | 70.0 |
19 | b | 2019-01 | 75.0 | 75.0 |
20 | b | 2019-02 | 75.0 | 75.0 |
21 | b | 2019-03 | 80.0 | 80.0 |
22 | b | 2019-04 | 85.0 | 85.0 |
23 | b | 2019-05 | 85.0 | 85.0 |
24 | b | 2019-06 | 90.0 | 90.0 |
25 | b | 2019-07 | 95.0 | 95.0 |
26 | b | 2019-08 | 95.0 | 95.0 |
27 | b | 2019-09 | 100.0 | 100.0 |
28 | b | 2019-10 | 100.0 | 100.0 |
29 | b | 2019-11 | 100.0 | 100.0 |
The above two methods gave us the desired results. Let us try a plainer approach to achieve the same. Below is the naive method, which fills missing values with the backward/forward average but skips the grouping. By using it, however, we introduce a bug that is hard to detect. Let us see.
```python
# The third method
df_third = df.copy(deep=True)

df_third[["v1", "v2"]] = (df_third[["v1", "v2"]].ffill() + df_third[["v1", "v2"]].bfill()) / 2
df_third[["v1", "v2"]] = df_third[["v1", "v2"]].bfill().ffill()

df_third
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
0 | a | 2018-09 | 10.0 | 10.0 |
1 | a | 2018-10 | 15.0 | 15.0 |
2 | a | 2018-11 | 15.0 | 15.0 |
3 | a | 2018-12 | 20.0 | 20.0 |
4 | a | 2019-01 | 25.0 | 25.0 |
5 | a | 2019-02 | 25.0 | 25.0 |
6 | a | 2019-03 | 30.0 | 30.0 |
7 | a | 2019-04 | 35.0 | 35.0 |
8 | a | 2019-05 | 35.0 | 35.0 |
9 | a | 2019-06 | 40.0 | 40.0 |
10 | a | 2019-07 | 45.0 | 45.0 |
11 | a | 2019-08 | 45.0 | 45.0 |
12 | a | 2019-09 | 50.0 | 50.0 |
13 | a | 2019-10 | 55.0 | 55.0 |
14 | a | 2019-11 | 55.0 | 55.0 |
15 | b | 2018-09 | 60.0 | 60.0 |
16 | b | 2018-10 | 65.0 | 65.0 |
17 | b | 2018-11 | 65.0 | 65.0 |
18 | b | 2018-12 | 70.0 | 70.0 |
19 | b | 2019-01 | 75.0 | 75.0 |
20 | b | 2019-02 | 75.0 | 75.0 |
21 | b | 2019-03 | 80.0 | 80.0 |
22 | b | 2019-04 | 85.0 | 85.0 |
23 | b | 2019-05 | 85.0 | 85.0 |
24 | b | 2019-06 | 90.0 | 90.0 |
25 | b | 2019-07 | 95.0 | 95.0 |
26 | b | 2019-08 | 95.0 | 95.0 |
27 | b | 2019-09 | 100.0 | 100.0 |
28 | b | 2019-10 | 100.0 | 100.0 |
29 | b | 2019-11 | 100.0 | 100.0 |
At first glance everything seems OK, but let us check whether these three dataframes are equal, comparing them element-wise with `==` (the `all()` method can then collapse the result to a single boolean).
```python
# The results of the first and the second methods are equal
df_first == df_second
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
0 | True | True | True | True |
1 | True | True | True | True |
2 | True | True | True | True |
3 | True | True | True | True |
4 | True | True | True | True |
5 | True | True | True | True |
6 | True | True | True | True |
7 | True | True | True | True |
8 | True | True | True | True |
9 | True | True | True | True |
10 | True | True | True | True |
11 | True | True | True | True |
12 | True | True | True | True |
13 | True | True | True | True |
14 | True | True | True | True |
15 | True | True | True | True |
16 | True | True | True | True |
17 | True | True | True | True |
18 | True | True | True | True |
19 | True | True | True | True |
20 | True | True | True | True |
21 | True | True | True | True |
22 | True | True | True | True |
23 | True | True | True | True |
24 | True | True | True | True |
25 | True | True | True | True |
26 | True | True | True | True |
27 | True | True | True | True |
28 | True | True | True | True |
29 | True | True | True | True |
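Rather than scanning the whole table of booleans, chaining the `all()` method collapses such an element-wise comparison to a single value; a minimal sketch with two throwaway frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# First all() reduces over rows, second over the remaining columns
identical = bool((a == b).all().all())
```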
```python
df_third == df_first
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
0 | True | True | True | True |
1 | True | True | True | True |
2 | True | True | True | True |
3 | True | True | True | True |
4 | True | True | True | True |
5 | True | True | True | True |
6 | True | True | True | True |
7 | True | True | True | True |
8 | True | True | True | True |
9 | True | True | True | True |
10 | True | True | True | True |
11 | True | True | True | True |
12 | True | True | True | True |
13 | True | True | False | False |
14 | True | True | False | False |
15 | True | True | True | True |
16 | True | True | True | True |
17 | True | True | True | True |
18 | True | True | True | True |
19 | True | True | True | True |
20 | True | True | True | True |
21 | True | True | True | True |
22 | True | True | True | True |
23 | True | True | True | True |
24 | True | True | True | True |
25 | True | True | True | True |
26 | True | True | True | True |
27 | True | True | True | True |
28 | True | True | True | True |
29 | True | True | True | True |
At index positions 13 and 14 we have `False` values; that was the reason for the inequality. But why do we have these `False` values? The plain method computed each average from the nearest non-missing values in the whole column, sequentially, ignoring `type`. At indices 13 and 14 the fill should be 50, not 55, because the third method averaged 50 (the last value of group `a`) with 60 (the first value of group `b`) instead of 50 with `NaN`. That produced the bug!
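The group-boundary bug can be reproduced on a minimal two-group frame (hypothetical data, for illustration):

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({"type": ["a", "a", "b"], "v": [50.0, np.nan, 60.0]})

# Plain fill crosses the group boundary: the NaN becomes (50 + 60) / 2 = 55
plain = (demo["v"].ffill() + demo["v"].bfill()) / 2

# Grouped fill respects the boundary: bfill finds no later value inside group "a"
g = demo.groupby("type")["v"]
grouped = (g.ffill() + g.bfill()) / 2
```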
```python
df.iloc[12:16]
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
12 | a | 2019-09 | 50.0 | 50.0 |
13 | a | 2019-10 | NaN | NaN |
14 | a | 2019-11 | NaN | NaN |
15 | b | 2018-09 | 60.0 | 60.0 |
```python
df_first.iloc[12:16]
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
12 | a | 2019-09 | 50.0 | 50.0 |
13 | a | 2019-10 | 50.0 | 50.0 |
14 | a | 2019-11 | 50.0 | 50.0 |
15 | b | 2018-09 | 60.0 | 60.0 |
```python
df_third.iloc[12:16]  # Not correct
```
| | type | date | v1 | v2 |
|---|---|---|---|---|
12 | a | 2019-09 | 50.0 | 50.0 |
13 | a | 2019-10 | 55.0 | 55.0 |
14 | a | 2019-11 | 55.0 | 55.0 |
15 | b | 2018-09 | 60.0 | 60.0 |