groupby mean and median
There is a dataframe:
df.head()
title birthday today age
0 Group 1 1997-07-24 2020-05-04 8320 days
1 Group 1 1997-07-24 2020-05-04 8320 days
2 Group 1 1997-07-24 2020-05-04 8320 days
3 Group 1 1997-07-24 2020-05-04 8320 days
4 Group 1 1997-07-24 2020-05-04 8320 days
<class 'pandas.core.frame.DataFrame'>
Int64Index: 329739 entries, 0 to 350675
Data columns (total 4 columns):
title 329739 non-null object
birthday 329739 non-null datetime64[ns]
today 329739 non-null datetime64[ns]
age 329739 non-null timedelta64[ns]
dtypes: datetime64[ns](2), object(1), timedelta64[ns](1)
memory usage: 22.6+ MB
I need to know the maximum, minimum, average, and median age in each course. But the application of
df[['age', 'title']].groupby('title').agg(['max', 'min', 'mean', 'median'])
Returns the error
ValueError: no results
However, the maximum and minimum values are considered, as well as mean and median without grouping:
df['age'].median()
Timedelta('10684 days 00:00:00')
Thank you in advance!
2 answers
The problem is that the timedelta64[ns]
data type is non-numeric and therefore Pandas does not allow aggregation. Therefore, you can first create a virtual numeric column (for example, extract the number of days from timedelta64[ns]
) and then do arithmetic with it:
In [51]: (df
.assign(days=df["age"].dt.days)
.groupby('title')
["days"]
.agg(['max', 'min', 'mean', 'median']))
Out[51]:
max min mean median
title
Group 1 8320 8320 8320 8320
A short solution (one of the possible ones) is to apply the methods of the series:
df[['age', 'title']].groupby('title').agg(['max', 'min', pd.Series.mean, pd.Series.median])
Result:
age max min mean median title Group 1 8320 days 8320 days 8320 days 8320 days
Problem analysis and explanation:
Many methods of the DataFrame
class have a key parameter numeric_only
; the method of the Series
class does not have such a parameter.
Examples of such methods: All that you applied: .max()
, .min()
, .mean()
, .median()
.
The numeric_only
parameter defines the columns to which the method will be accepted. It has 3 possible values:
-
None
(standard value) - same asFalse
, -
True
- only columns of the typeint
,float
andbool
, -
False
- all columns for which the method can work.
When applying methods of the DataFrame
class to objects of the DataFrameGroupBy
type (your case), the individual value of the numeric_only
parameter is used in depending on the specific method, for example in your case:
- for the
.max()
method: standard value (None
), - for the
.min()
method: standard value (None
), - for the method
.mean()
:.mean(numeric_only=True)
- for the method
.median()
:.median(numeric_only=True)
You apply these 4 methods to your 'age'
column, which is of type neither int
, nor float
, nor bool
. This means that it will be excluded from using the .mean()
and .median()
methods, which prevents you from making the complete result in in the form of a table with all 4 columns you need and returns the error
ValueError: no results
How to get around this - a few examples:
-
Apply methods of the
Series
class that work whenever possible-
Right (as I did at the beginning of the answer):
df[['age', 'title']].groupby('title').agg(['max', 'min', pd.Series.mean, pd.Series.median])
-
Using this idea for a function you created or lambda, for example
def my_mean(series): return series.mean() df[['age', 'title']].groupby('title').agg(['max', 'min', my_mean, lambda s: s.median()])
-
-
Using class methods
DataFrame
with the parameternumeric_only=False
to create (and subsequently apply) its own functions, for example:from functools import partial my_mean = partial(pd.DataFrame.mean, numeric_only=False) my_median = partial(pd.DataFrame.median, numeric_only=False) df[['age', 'title']].groupby('title').agg(['max', 'min', my_mean, my_median])
-
Use NumPy functions (after all, this is done in the background by pandas itself - always, and in the previous two paragraphs), but indirectly, so that pandas "does not notice", for example:
def my_mean(x): return np.mean(x) my_median = lambda x: np.median(x) df[['age', 'title']].groupby('title').agg(['max', 'min', my_mean, my_median])