Using groupby on pandas dataframe

Question

Using groupby on pandas dataframe

Good afternoon.

I don't have much skill with Python, I'm having some doubts. Anyone who can help me, since I thank you.

I opened my csv file in python as follows:

import pandas as pd

caminhoArquivo = r'\\Desktop\Base\dias.csv'

baseDados = pd.read_csv(caminhoArquivo,sep=';',decimal=',',encoding='latin-1')

File example:

Index  |  Nome  |  Dia
  0    | Pedro  |   3
  1    | Pedro  |   3
  2    | Pedro  |   24
  3    | Antonio|   24
  4    | Antonio|   24
  5    | Antonio|   24
  6    | Carlos |   4
  7    | Carlos |   4
  8    | Carlos |   28
  9    |  Jose  |   1
  10   |  Jose  |   2
  11   |  Jose  |   2

I removed the duplicate data using the command:

colunas = ['Nome','Dia']

diaDuplicado = baseDados.drop_duplicates(subset = colunas)

diaDuplicado = diaDuplicado.reset_index()

So it was:

 Index |  index  |  Nome  |  Dia
  0    |    0    | Pedro  |   3
  1    |    2    | Pedro  |   24
  2    |    3    | Antonio|   24
  3    |    6    | Carlos |   4
  4    |    8    | Carlos |   28
  5    |    9    |  Jose  |   1
  6    |    10   |  Jose  |   2

Now my doubt. I needed to group the days by names, to stay this way:

Index |  Nome  |  Dia
  0   | Pedro  |   3, 24
  1   | Antonio|   24
  2   | Carlos |   4, 28
  3   |  Jose  |   1, 2

However, the only solution what I found was:

diasgroup = diaDuplicado.groupby(by=['Nome'])['Dia'].apply(list)

But this way it turns the column "name" into index and gets into an "object"format/type.

Index  |  Dia
Pedro  |  3, 24
Antonio|  24
Carlos |  4, 28   
 Jose  |  1, 2

Could you help me?

4

python pandas group-by

Author: Sidon, 2019-03-25

Source

3 answers

score 2 · Answer 1

Using the function groupby:

df[['Nome', 'Dia']].groupby('Nome').agg(lambda x: list(set(x))).reset_index()
Out[6]:
      Nome      Dia
0  Antonio     [24]
1   Carlos  [4, 28]
2     Jose   [1, 2]
3    Pedro  [24, 3] 
dtype: object

score 1 · Answer 2

TL; DR

Edited
Rereading the question I saw that the author of the question had achieved what he wanted but said that the result is an object and that the names become clues, or something like that. I was wondering if even if it can "navigate" that object and throw the elements into a list, dictionary or qq other variable, wouldn't it answer? I will keep my original answer, in case the requirement has to be a DataFrame but I will put below the code to iterate in the object that is obtained by converting the DataFrame to list, a pandas.core.series.Series, (I will use the code fragment that it uses in the question, to create the object):

# Criando o pandas.core.series.Series
diasgroup = diaDuplicado.groupby(by=['Nome'])['Dia'].apply(list)

# Navegando em diasgroup
for i in diasgroup.items():
    print(i)

Saida:

('Antonio', [24])
('Carlos', [4, 28])
('Jose', [1, 2])
('Pedro', [3, 24])

Henceforth it is for the case where result needs to be a DataFrame:

I'm not sure if pandas presents the data exactly the way vc wants in groupby, but vc can convert to an empty multindex dataframe, which presents something like this:

I'm presenting the photo, because when I can only get this output, in a jupyter notebook when I type the name of df, without the function print, with the print function, the result is this:

Empty DataFrame
Columns: []
Index: [(Antonio, 24), (Carlos, 4), (Carlos, 28), (Jose, 1), (Jose, 2), 
(Pedro, 3), (Pedro, 24)]

See that qq way to navigate the indexes and extract the information you need.

Let's go to the code:

import io
import pandas as pd

s = '''
Nome,Dia
Pedro,3
Pedro,3
Pedro,24
Antonio,24
Antonio,24
Antonio,24
Carlos,4
Carlos,4
Carlos,28
Jose,1
Jose,2
Jose,2
'''

df = pd.read_csv(io.StringIO(s), parse_dates=True)
df = df.drop_duplicates(subset = ['Nome','Dia'])

grouped = df.groupby(['Nome', 'Dia']).sum()

print(grouped)

Output:

Empty DataFrame
Columns: []
Index: [(Antonio, 24), (Carlos, 4), (Carlos, 28), (Jose, 1), (Jose, 2),
(Pedro, 3), (Pedro, 24)]

See working on repl.it.

score 1 · Answer 3

The answer involves a few steps.

Creating the already deduplicated DataFrame:

import pandas as pd    
diaDuplicado = pd.DataFrame(columns=["Nome", "dia"], 
                 data=[["Pedro", 3], 
                       ["Pedro", 24], 
                       ["Antonio", 24], 
                       ["Carlos", 4], 
                       ["Carlos",28], 
                       ["Jose", 1], 
                       ["Jose", 2]])

A DataFrame:

print(diaDuplicado)
      Nome  dia
0    Pedro    3
1    Pedro   24
2  Antonio   24
3   Carlos    4
4   Carlos   28
5     Jose    1
6     Jose    2

In sequence, generating tuples in a series. The reason we generate tuples (and not lists here) is that lists are not hashable , which in practice implies that Pandas cannot deduplicate lists:

d = k.groupby(by=['Nome'])['dia'].apply(tuple)

The result is:

Nome
Antonio      (24,)
Carlos     (4, 28)
Jose        (1, 2)
Pedro      (3, 24)
Name: dia, dtype: object

Merging the two Dataframes by the correct keys:

p = pd.merge(k, d, left_on="Nome", right_index=True)
print(p)

      Nome  dia_x    dia_y
0    Pedro      3  (3, 24)
1    Pedro     24  (3, 24)
2  Antonio     24    (24,)
3   Carlos      4  (4, 28)
4   Carlos     28  (4, 28)
5     Jose      1   (1, 2)
6     Jose      2   (1, 2)

Now just deduplicate again, considering the columns of interest:

colunas = ['Nome','dia_y']
diaDuplicado = p.drop_duplicates(subset = colunas)

What results in the dataframe:

      Nome  dia_x    dia_y
 0    Pedro      3  (3, 24)
 2  Antonio     24    (24,)
 3   Carlos      4  (4, 28)
 5     Jose      1   (1, 2)

Now just convert the column "dia_y" to list and drop the column dia_x and dia_y:

diaDuplicado["DiaLista"] = diaDuplicado["dia_y"].apply(list)
diaLista=diaDuplicado.drop(["dia_x", "dia_y"], axis=1)

What results in the "dialist" DataFrame:

      Nome  DiaLista
0    Pedro   [3, 24]
2  Antonio      [24]
3   Carlos   [4, 28]
5     Jose    [1, 2]