How to perform a loop on a dataframe, removing rows conditionally and restarting the loop on a recalculated dataframe with each removal?

Hello,

I am processing some data and I need my algorithm to perform the comparison between two lines and remove the one that is worse, according to some conditions.

Each time the algorithm performs the removal, it must recalculate the dataframe, since one of the conditions depends on the relationship of the bottom line to the top.

My dataframe has the following format

df = n° de fios 1     diametro 1     n° de fios 2     diametro 2     seção total   relação
           n11             d11             n21             d21           st1       st2/st1
           n12             d12             n22             d22           st2       st3/st2

This continues for several lines.

O what my algorithm does is:

  • reads the dataframe of an excel, which has the first 5 columns
  • calculates the last column
  • adds another column with a rank based on the first column (coming from a dict)
  • scroll through the rows, through a loop for n in range(0, len(df)-1)
  • if the relation is less than 1.03, you must remove one of the two rows
  • There is a set of conditions that must be met for the row n+1 to be removed, otherwise removes the line n (conditions include the rank).
  • when a row is removed, reset the index, remove the column rank, add the dataframe to a excel, and apply the loop to the new excel.

My code is as follows:

class Rank_and_drop(object):
    def __init__(self): # define os dicts
    #exemplo
    self.dict_mc_112_2p_cu = {'0.71': 0, '0.75': 1, '0.67': 2, '0.8': 3, '0.85': 4, '0.63': 5, '0.45': 6, '0.475': 7, '0.5': 8, '0.53': 9, '0.56': 10, '0.6': 11}
def recalc_df(self, file, sheet):
    # lê a tabela
    df_final = pd.read_excel(file, sheet_name=sheet)
    # converte os numeros para strings para poder usar o self.dict
    df_final = df_final.astype(str)

    df_final = df_final.drop(columns=['rsc'])
    df_final = df_final.sort_values(by='sc')
    df_final = df_final.reset_index(drop=True)

    aux = df_final.loc[:, 'sc']
    aux = aux.reset_index(drop=True)

    aux2 = aux.drop(0)
    aux2 = aux2.reset_index(drop=True)
    aux2 = aux2.append(aux2.tail(1), ignore_index=True)

    for x in range(len(aux)):
        aux[x] = (round(float(aux2[x])/float(aux[x]), 3))

    aux = aux.rename('rsc')

    df_final['rsc'] = aux

    df_final = df_final.astype(str)

    df_final['rnk'] = df_final['d1'].map(dict_)

    return df_final

def loop(self, file, sheet):
    df = self.recalc_df(file, sheet)

    for n in range(0,len(df)-1):
        c_0 = float(df.at[n, 'rsc']) <= 1.03

        c_1 = (df.at[n, 'n2'] == '0') and (df.at[n+1, 'n2'] == '0')
        c_1_1 = df.at[n+1, 'rnk'] >= df.loc[n, 'rnk']
        c_1_2 = float(df.at[n, 'n1']) < float(df.at[n+1, 'n1'])
        c_2 = (df.at[n, 'n2'] == '0') and (df.at[n+1, 'n2'] != '0')
        c_4 = (df.at[n, 'n2'] != '0') and (df.at[n+1, 'n2'] != '0')
        c_4_1 = df.at[n+1, 'rnk'] == df.at[n, 'rnk']
        c_4_1_1 = float(df.at[n+1, 'd2']) < float(df.at[n, 'd2'])
        c_4_2 = df.at[n+1, 'rnk'] > df.at[n, 'rnk']

        if c_0:
            if (((c_1) and ((c_1_1) or (c_1_2))) or (c_2) or ((c_4) and ((((c_4_1) and (c_4_1_1)) or (c_4_2))))):
                df = df.drop(n+1)

            else:
                df = df.drop(n)


            df = df.reset_index(drop=True)
            df = df.drop(columns=['rnk'])
            df.to_excel('{0}.xlsx'.format(sheet), index=False, sheet_name=sheet)
            self.loop('{0}.xlsx'.format(sheet), sheet)

        else:
            pass

        n = n+1

    df.to_excel('{0}.xlsx'.format(sheet), index=False, sheet_name=sheet)
    return df 

Now the problem is that my loop doesn't work the way I expect it to... It starts correctly, and to a certain extent even does the expected, however, from a certain point, the algorithm looks like it gives a bugged , and removes several rows it shouldn't, and in the end returns a KeyError.

Any idea what I can do to accomplish this task, more optimally and without these errors?

Edit: I have several dataframes that I want to do this, of varying sizes, in the smallest of them (35 lines), the code correctly performs the function, however, returns the KeyError. In another dataframe, of 690 rows, the problem of removing several rows occurs.

Author: Lucas, 2019-06-03

1 answers

Reproducing the defect

Note the following code

import pandas as pd

df = pd.DataFrame(
    [   
        [6, 5, 7, 3],
        [1, 7, 8, 9],
        [7, 8, 3, 10],
    ],
    columns=list('ABCD'))

# Condição de remoção: uma linha deve ser removida se seu valor na coluna "A" é maior que 5
condição = lambda dados, n: dados.loc[n, "A"] > 5

for n in range(0,len(df)-1):
    print("*"*30)
    print("Antes:\n%s"%df)
    if condição(df, n):
        print("Condição verdadeira na linha %s:\n%s\n" % (n, df.loc[[n]]))
        df = df.drop(n+1)
    else:
        print("Condição falsa na linha %s:\n%s\n" % (n, df.loc[[n]]))
        df = df.drop(n)
    print("Depois:\n%s\n\n"%df)

Running this code generates the following output:

Antes:
   A  B  C   D
0  6  5  7   3
1  1  7  8   9
2  7  8  3  10
Condição verdadeira na linha 0:
   A  B  C  D
0  6  5  7  3

Depois:
   A  B  C   D
0  6  5  7   3
2  7  8  3  10


******************************
Antes:
   A  B  C   D
0  6  5  7   3
2  7  8  3  10
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1790, in _validate_key
    error()
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1] is not in the [index]'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "outro.py", line 18, in <module>
    if condição(df, n):
  File "outro.py", line 13, in <lambda>
    condição = lambda dados, n: dados.loc[n, "A"] > 5
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 870, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 998, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1911, in _getitem_axis
    self._validate_key(key, axis)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1798, in _validate_key
    error()
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1] is not in the [index]'

As can be seen, the first execution of the loop erases row 1. In the second step, one tries to access the same line 1 that was removed, and hence the KeyError occurs.

You may have developed the algorithm expecting the dataframe to behave similar to lists. Therefore, you could add a

df = df.reset_index(drop=True)

Logo after removing the row to reorder the indexes, but this still would not solve the problem. Note that, your loop must access n-1 dataframe positions, and therefore if during loop execution, the code removes 2 rows from the dataframe, when it arrives at the time of accessing the n-1 position in the dataframe of n-2 rows, another KeyError would occur.

Solutions

  • reconsidering row removal conditions - first of all, one should evaluate if the purpose of the program is correct. In many data sets, the rows are independent of each other, and therefore it is unusual to have to remove a line from the evaluation of other lines.
  • modifying the program to work properly - as we have seen, modifying a Dataframe (or lists) while trying to iterate through its indexes is problematic. An alternative would be to construct the list of indexes to be removed, and then remove them with a df.drop only.

In the above example, it would be enough to replace the for loop with:

índicesParaSeremRemovidos = [n for n in range(len(df)) if condição(df,n)] + [n+1 for n in range(len(df)) if not condição(df,n) and n<3]
df.drop(índicesParaSeremRemovidos, inplace=True)
print(df)

Resulting in

   A  B  C  D
1  1  7  8  9
 1
Author: danielbb, 2019-06-13 15:53:28