How can a groupby object calculate the variance via np.var only for groups with count >= 10?

Question

I have a groupby object created from a DataFrame. I need to calculate the variance for each category, but only for categories in which the number of observations is >= 10.

Source data:

megretrans_new = megretrans_otr.groupby(['new'])['amount']  # this is how the object was created

I tried this, but the code is wrong:

megretrans_new.agg(['count' > 10, np.var])

python python-3.x pandas numpy group-by

Author: 0xdb, 2020-08-18

1 answer

score 4 · Accepted Answer

Example:

Creating a frame for the demo:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "new": np.random.choice(list("abcde"), 100, p=[.3,.25,.2,.2,.05]),
    "amount": np.random.rand(100)*1000})

In [48]: df
Out[48]:
   new      amount
0    a  469.617984
1    b   87.851712
2    a  795.669208
3    a  954.550734
4    b   34.985337
..  ..         ...
95   a  361.697281
96   d  245.245859
97   d  963.222224
98   b  545.422079
99   a  630.812729

[100 rows x 2 columns]

In [51]: df["new"].value_counts()
Out[51]:
a    30
c    25
b    22
d    19
e     4
Name: new, dtype: int64

Solution 1:

res = df.groupby("new").filter(lambda x: len(x) >= 10).groupby("new")["amount"].var()

Solution 2:

res = df.groupby("new")["amount"].agg(["var", "count"]).query("count >= 10")["var"]

Result:

In [53]: res
Out[53]:
new
a     75210.670184
b     84411.914567
c     72483.512171
d    101631.615241
Name: amount, dtype: float64
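
As a quick sanity check, the two solutions can be compared on a reproducible frame. This is only a minimal sketch: the seed and the use of np.random.default_rng are my assumptions and are not part of the original demo, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so the frame is reproducible (not in the original demo).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "new": rng.choice(list("abcde"), 100, p=[.3, .25, .2, .2, .05]),
    "amount": rng.random(100) * 1000,
})

# Solution 1: drop small groups first, then compute the variance per group.
res1 = df.groupby("new").filter(lambda x: len(x) >= 10).groupby("new")["amount"].var()

# Solution 2: compute variance and count together, then keep rows with count >= 10.
res2 = df.groupby("new")["amount"].agg(["var", "count"]).query("count >= 10")["var"]

# Both should contain the same groups with the same (ddof=1) variances.
same = list(res1.index) == list(res2.index) and np.allclose(res1.to_numpy(), res2.to_numpy())
print(same)
```

Solution 2 makes a single pass over the groups, while Solution 1 groups the data twice, so the second form may be preferable on large frames.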

UPD: it is worth mentioning that pandas calculates the variance with ddof=1 by default (the sample variance, dividing by n - 1), whereas np.var() defaults to ddof=0 (the population variance, dividing by n). Thus, to get the same values in pandas as from np.var(), you must explicitly specify ddof=0:

In [72]: df.loc[df["new"]=="a", "amount"].var()
Out[72]: 75210.67018445666

In [73]: np.var(df.loc[df["new"]=="a", "amount"])
Out[73]: 72703.64784497478

In [74]: df.loc[df["new"]=="a", "amount"].var(ddof=0)
Out[74]: 72703.64784497478

And vice versa, to get the same value from np.var() as from pandas, pass ddof=1:

In [75]: np.var(df.loc[df["new"]=="a", "amount"], ddof=1)
Out[75]: 75210.67018445666