How to perform a loop on a dataframe, removing rows conditionally and restarting the loop on a recalculated dataframe with each removal?
Hello,
I am processing some data and I need my algorithm to perform the comparison between two lines and remove the one that is worse, according to some conditions.
Each time the algorithm performs the removal, it must recalculate the dataframe, since one of the conditions depends on the relationship of the bottom line to the top.
My dataframe has the following format
df = n° de fios 1 diametro 1 n° de fios 2 diametro 2 seção total relação
n11 d11 n21 d21 st1 st2/st1
n12 d12 n22 d22 st2 st3/st2
This continues for several lines.
O what my algorithm does is:
- reads the dataframe of an excel, which has the first 5 columns
- calculates the last column
- adds another column with a rank based on the first column (coming from a dict)
- scroll through the rows, through a loop
for n in range(0, len(df)-1)
- if the relation is less than 1.03, you must remove one of the two rows
- There is a set of conditions that must be met for the row
n+1
to be removed, otherwise removes the linen
(conditions include therank
). - when a row is removed, reset the
index
, remove the columnrank
, add the dataframe to a excel, and apply the loop to the new excel.
My code is as follows:
class Rank_and_drop(object):
def __init__(self): # define os dicts
#exemplo
self.dict_mc_112_2p_cu = {'0.71': 0, '0.75': 1, '0.67': 2, '0.8': 3, '0.85': 4, '0.63': 5, '0.45': 6, '0.475': 7, '0.5': 8, '0.53': 9, '0.56': 10, '0.6': 11}
def recalc_df(self, file, sheet):
# lê a tabela
df_final = pd.read_excel(file, sheet_name=sheet)
# converte os numeros para strings para poder usar o self.dict
df_final = df_final.astype(str)
df_final = df_final.drop(columns=['rsc'])
df_final = df_final.sort_values(by='sc')
df_final = df_final.reset_index(drop=True)
aux = df_final.loc[:, 'sc']
aux = aux.reset_index(drop=True)
aux2 = aux.drop(0)
aux2 = aux2.reset_index(drop=True)
aux2 = aux2.append(aux2.tail(1), ignore_index=True)
for x in range(len(aux)):
aux[x] = (round(float(aux2[x])/float(aux[x]), 3))
aux = aux.rename('rsc')
df_final['rsc'] = aux
df_final = df_final.astype(str)
df_final['rnk'] = df_final['d1'].map(dict_)
return df_final
def loop(self, file, sheet):
df = self.recalc_df(file, sheet)
for n in range(0,len(df)-1):
c_0 = float(df.at[n, 'rsc']) <= 1.03
c_1 = (df.at[n, 'n2'] == '0') and (df.at[n+1, 'n2'] == '0')
c_1_1 = df.at[n+1, 'rnk'] >= df.loc[n, 'rnk']
c_1_2 = float(df.at[n, 'n1']) < float(df.at[n+1, 'n1'])
c_2 = (df.at[n, 'n2'] == '0') and (df.at[n+1, 'n2'] != '0')
c_4 = (df.at[n, 'n2'] != '0') and (df.at[n+1, 'n2'] != '0')
c_4_1 = df.at[n+1, 'rnk'] == df.at[n, 'rnk']
c_4_1_1 = float(df.at[n+1, 'd2']) < float(df.at[n, 'd2'])
c_4_2 = df.at[n+1, 'rnk'] > df.at[n, 'rnk']
if c_0:
if (((c_1) and ((c_1_1) or (c_1_2))) or (c_2) or ((c_4) and ((((c_4_1) and (c_4_1_1)) or (c_4_2))))):
df = df.drop(n+1)
else:
df = df.drop(n)
df = df.reset_index(drop=True)
df = df.drop(columns=['rnk'])
df.to_excel('{0}.xlsx'.format(sheet), index=False, sheet_name=sheet)
self.loop('{0}.xlsx'.format(sheet), sheet)
else:
pass
n = n+1
df.to_excel('{0}.xlsx'.format(sheet), index=False, sheet_name=sheet)
return df
Now the problem is that my loop doesn't work the way I expect it to...
It starts correctly, and to a certain extent even does the expected, however, from a certain point, the algorithm looks like it gives a bugged , and removes several rows it shouldn't, and in the end returns a KeyError
.
Any idea what I can do to accomplish this task, more optimally and without these errors?
Edit:
I have several dataframes that I want to do this, of varying sizes, in the smallest of them (35 lines), the code correctly performs the function, however, returns the KeyError
. In another dataframe, of 690 rows, the problem of removing several rows occurs.
1 answers
Reproducing the defect
Note the following code
import pandas as pd
df = pd.DataFrame(
[
[6, 5, 7, 3],
[1, 7, 8, 9],
[7, 8, 3, 10],
],
columns=list('ABCD'))
# Condição de remoção: uma linha deve ser removida se seu valor na coluna "A" é maior que 5
condição = lambda dados, n: dados.loc[n, "A"] > 5
for n in range(0,len(df)-1):
print("*"*30)
print("Antes:\n%s"%df)
if condição(df, n):
print("Condição verdadeira na linha %s:\n%s\n" % (n, df.loc[[n]]))
df = df.drop(n+1)
else:
print("Condição falsa na linha %s:\n%s\n" % (n, df.loc[[n]]))
df = df.drop(n)
print("Depois:\n%s\n\n"%df)
Running this code generates the following output:
Antes:
A B C D
0 6 5 7 3
1 1 7 8 9
2 7 8 3 10
Condição verdadeira na linha 0:
A B C D
0 6 5 7 3
Depois:
A B C D
0 6 5 7 3
2 7 8 3 10
******************************
Antes:
A B C D
0 6 5 7 3
2 7 8 3 10
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1790, in _validate_key
error()
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1] is not in the [index]'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "outro.py", line 18, in <module>
if condição(df, n):
File "outro.py", line 13, in <lambda>
condição = lambda dados, n: dados.loc[n, "A"] > 5
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 998, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1911, in _getitem_axis
self._validate_key(key, axis)
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1798, in _validate_key
error()
File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1] is not in the [index]'
As can be seen, the first execution of the loop erases row 1. In the second step, one tries to access the same line 1 that was removed, and hence the KeyError
occurs.
You may have developed the algorithm expecting the dataframe to behave similar to lists. Therefore, you could add a
df = df.reset_index(drop=True)
Logo after removing the row to reorder the indexes, but this still would not solve the problem. Note that, your loop must access n-1
dataframe positions, and therefore if during loop execution, the code removes 2
rows from the dataframe, when it arrives at the time of accessing the n-1
position in the dataframe of n-2
rows, another KeyError
would occur.
Solutions
- reconsidering row removal conditions - first of all, one should evaluate if the purpose of the program is correct. In many data sets, the rows are independent of each other, and therefore it is unusual to have to remove a line from the evaluation of other lines.
-
modifying the program to work properly - as we have seen, modifying a
Dataframe
(or lists) while trying to iterate through its indexes is problematic. An alternative would be to construct the list of indexes to be removed, and then remove them with adf.drop
only.
In the above example, it would be enough to replace the for loop with:
índicesParaSeremRemovidos = [n for n in range(len(df)) if condição(df,n)] + [n+1 for n in range(len(df)) if not condição(df,n) and n<3]
df.drop(índicesParaSeremRemovidos, inplace=True)
print(df)
Resulting in
A B C D
1 1 7 8 9