groupby mean and median

Question

groupby mean and median

There is a dataframe:

df.head()

     title    birthday     today      age
0   Group 1  1997-07-24  2020-05-04 8320 days
1   Group 1  1997-07-24  2020-05-04 8320 days
2   Group 1  1997-07-24  2020-05-04 8320 days
3   Group 1  1997-07-24  2020-05-04 8320 days
4   Group 1  1997-07-24  2020-05-04 8320 days

<class 'pandas.core.frame.DataFrame'>
Int64Index: 329739 entries, 0 to 350675
Data columns (total 4 columns):
title       329739 non-null object
birthday    329739 non-null datetime64[ns]
today       329739 non-null datetime64[ns]
age         329739 non-null timedelta64[ns]
dtypes: datetime64[ns](2), object(1), timedelta64[ns](1)
memory usage: 22.6+ MB

I need to know the maximum, minimum, average, and median age in each course. But the application of

df[['age', 'title']].groupby('title').agg(['max', 'min', 'mean', 'median'])

Returns the error

ValueError: no results

However, the maximum and minimum values are considered, as well as mean and median without grouping:

df['age'].median()

Timedelta('10684 days 00:00:00')

Thank you in advance!

2

python pandas анализ-данных group-by aggregate

Author: MarianD, 2020-05-11

Source

2 answers

A short solution (one of the possible ones) is to apply the methods of the series:

df[['age', 'title']].groupby('title').agg(['max', 'min', pd.Series.mean, pd.Series.median])

Result:

                                                age
               max        min       mean     median
  title                 
Group 1  8320 days  8320 days  8320 days  8320 days

Problem analysis and explanation:

Many methods of the DataFrame class have a key parameter numeric_only; the method of the Series class does not have such a parameter.

Examples of such methods: All that you applied: .max(), .min(), .mean(), .median().

The numeric_only parameter defines the columns to which the method will be accepted. It has 3 possible values:

None (standard value) - same as False,
True - only columns of the type int, float and bool,
False - all columns for which the method can work.

When applying methods of the DataFrame class to objects of the DataFrameGroupBy type (your case), the individual value of the numeric_only parameter is used in depending on the specific method, for example in your case:

for the .max() method: standard value (None),
for the .min() method: standard value (None),
for the method .mean(): .mean(numeric_only=True)
for the method .median(): .median(numeric_only=True)

You apply these 4 methods to your 'age' column, which is of type neither int, nor float, nor bool. This means that it will be excluded from using the .mean() and .median() methods, which prevents you from making the complete result in in the form of a table with all 4 columns you need and returns the error

ValueError: no results

How to get around this - a few examples:

Apply methods of the Series class that work whenever possible

Right (as I did at the beginning of the answer):

df[['age', 'title']].groupby('title').agg(['max', 'min',
                                           pd.Series.mean, pd.Series.median])

Using this idea for a function you created or lambda, for example

def my_mean(series):
    return series.mean()

df[['age', 'title']].groupby('title').agg(['max', 'min',
                                           my_mean,
                                           lambda s: s.median()])

Using class methods DataFrame with the parameter numeric_only=False to create (and subsequently apply) its own functions, for example:

from functools import partial

my_mean   = partial(pd.DataFrame.mean,   numeric_only=False)
my_median = partial(pd.DataFrame.median, numeric_only=False)

df[['age', 'title']].groupby('title').agg(['max', 'min', my_mean, my_median])

Use NumPy functions (after all, this is done in the background by pandas itself - always, and in the previous two paragraphs), but indirectly, so that pandas "does not notice", for example:
```
def my_mean(x):
     return np.mean(x)

my_median = lambda x: np.median(x)

df[['age', 'title']].groupby('title').agg(['max', 'min', my_mean, my_median])
```

2

Author: MarianD, 2020-05-13 23:47:15

score 2 · Accepted Answer

The problem is that the timedelta64[ns] data type is non-numeric and therefore Pandas does not allow aggregation. Therefore, you can first create a virtual numeric column (for example, extract the number of days from timedelta64[ns]) and then do arithmetic with it:

In [51]: (df
          .assign(days=df["age"].dt.days)
          .groupby('title')
          ["days"]
          .agg(['max', 'min', 'mean', 'median']))
Out[51]:
          max   min  mean  median
title
Group 1  8320  8320  8320    8320