An algorithm for finding repeated files?

Question

I need to write a program that finds repeated files on my computer, so that the user can decide what to do with them (e.g. delete the copies). For now, I only care about a binary comparison between files (i.e. a file counts as a duplicate only if it is 100% identical to another).

I know that searching only by file name is insufficient, since the same file may have been saved under another name.

Is there an algorithm for comparing the files?

I imagine that generating a checksum for every file and comparing all against all is wasteful, since it is not normal to have that many duplicate files. I also imagine that the file size alone is not enough. And there may be cases where a file is duplicated more than once.

4

algoritmo arquivo

Author: Victor Stafusa, 2014-01-30

4 answers

List all the files;

For each file, perform the steps below:

Generate a hash from the contents of the file and store it in a hash table;
If that hash is already in the table, compare the file byte by byte with the files that have the same hash. If the contents are the same, you have found a duplicate.
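The two steps above can be sketched in Python as follows. This is only an illustration: the names `file_digest`, `same_bytes` and `find_duplicates` are made up here, and SHA-256 is an assumed choice of hash (the answer does not name one).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(a, b, chunk_size=1 << 16):
    """Compare two files byte by byte (chunk by chunk)."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk_size), fb.read(chunk_size)
            if ca != cb:
                return False
            if not ca:  # both files ended at the same point
                return True


def find_duplicates(root):
    """Return a list of groups (lists of paths) with identical content."""
    table = defaultdict(list)  # digest -> list of groups of identical files
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        groups = table[file_digest(path)]
        for group in groups:
            # Same hash: confirm with a byte-by-byte comparison.
            if same_bytes(group[0], path):
                group.append(path)
                break
        else:
            groups.append([path])
    return [g for groups in table.values() for g in groups if len(g) > 1]
```

Each file is read once for hashing, and a second time only when another file already has the same hash, so a file that appears three or more times simply ends up in a group of three or more.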

1

Author: pablosaraiva, 2014-01-30 19:14:46

I don't believe it's possible. You would have to compare all the files with each other, and the runtime of the program would grow quadratically with the number of files.

You can do something that starts by enumerating the files. Then each file would have to be compared with all the others. An optimization would be to compare the sizes first, then perhaps a checksum, and only if those still match compare byte by byte.

For a few files this will work fine, but as the number of files increases, the runtime of the algorithm will quickly rise to impractical scales.
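To make the cost concrete, here is a rough Python sketch of that all-against-all comparison, with the size, checksum and byte-by-byte checks applied in that order. The function names are invented for the example and CRC-32 is an assumed checksum; `filecmp.cmp` with `shallow=False` is the standard-library call that compares contents.

```python
import filecmp
import os
import zlib
from itertools import combinations


def crc32_of(path, chunk_size=1 << 16):
    """CRC-32 of a file's contents, computed in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def is_duplicate(a, b):
    """Cheapest test first: size, then checksum, then the actual bytes."""
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    if crc32_of(a) != crc32_of(b):
        return False
    return filecmp.cmp(a, b, shallow=False)


def duplicate_pairs(paths):
    """Compare every file with every other: n * (n - 1) / 2 pairs."""
    return [(a, b) for a, b in combinations(paths, 2) if is_duplicate(a, b)]
```

With 1,000 files this already means 499,500 pairs, although most of them are rejected by the size check without reading any file contents. As written, the sketch also recomputes the checksum for every pair that passes the size check; caching it per file would avoid that.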

0

Author: C. E. Gesser, 2014-01-30 12:33:07

You can use the basename function in PHP:

$inicio = "file:///C://";    // You can change the path to point to other folders.
$arquivo = basename($inicio);    
$file = basename($inicio, "Nome");

function stribet($inputstr, $deliLeft, $deliRight) {
    $posLeft = stripos($inputstr, $deliLeft) + strlen($deliLeft);
    $posRight = stripos($inputstr, $deliRight, $posLeft);
    return substr($inputstr, $posLeft, $posRight - $posLeft);
}

Get the contents:

$res = file_get_contents($inicio);

Locate:

$x = @stribet($res, $file, '[1]');

This should pick up the files with [1] in the name:

$d = '$this->file($x)';

Check whether the file has [1]:

if ($file == "'.$d.'"){
}

This may not be accurate, or it may not work at all; if it doesn't work, just let me know. It can only catch files with [1] in the name.

0

Author: Mega, 2014-01-30 13:05:39

score 3 · Accepted Answer

Step by step:

list everything with basic information: location (disk/directory), name, date and size;
separate the files that have the same name (exactly the same, including case);
likewise, separate the files that have the same size (in bytes);
discard the "non-repeated" files (those with no matching name or size);
select the "repeated Level 1" files (same name, size and date), apply a checksum within each block separately, and mark the ones that are really equal;
select the "repeated Level 2" files (same name or size, and same date), apply a checksum within each block separately, and mark the ones that are really equal;
select the "repeated Level 3" files (same name and size, different date), apply a checksum within each block separately, and mark the ones that are really equal;
select the "repeated Level 4" files (same name or size, different date), apply a checksum within each block separately, and mark the ones that are really equal;
with the really equal ones, present each block to the user so that the user can decide which ones will be eliminated;
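A simplified Python sketch of this pipeline could look like the one below. It is not a full implementation of the four levels: it groups candidates by size only, because two files of different sizes can never be byte-for-byte equal, and it keeps the name and date in each record so the resulting blocks could still be ordered by level. `FileInfo` and the function names are hypothetical, and SHA-256 stands in for the unspecified checksum.

```python
import hashlib
import os
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    name: str
    size: int
    mtime: float


def list_files(root):
    """List everything with location, name, date and size."""
    infos = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.islink(path):
                continue  # skip symbolic links in this sketch
            st = os.stat(path)
            infos.append(FileInfo(path, name, st.st_size, st.st_mtime))
    return infos


def candidate_blocks(infos):
    """Group by size and discard the files with no size match."""
    by_size = defaultdict(list)
    for info in infos:
        by_size[info.size].append(info)
    return [block for block in by_size.values() if len(block) > 1]


def checksum(path, chunk_size=1 << 16):
    """SHA-256 of the file contents (assumed choice of checksum)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def really_equal(block):
    """Apply the checksum inside one block; keep the groups that match."""
    by_sum = defaultdict(list)
    for info in block:
        by_sum[checksum(info.path)].append(info)
    return [group for group in by_sum.values() if len(group) > 1]


def find_repeated(root):
    """Return the groups of really equal files to present to the user."""
    groups = []
    for block in candidate_blocks(list_files(root)):
        groups.extend(really_equal(block))
    return groups
```

Only files that share their size with at least one other file are ever read, which is what keeps the checksum step cheap when duplicates are rare.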

I suggest you add some options: let the user go to the location of each file, open it in the default editor to view its content, and move the chosen/repeated file to a specific folder.

One option that I believe is very useful: when a single file is selected (in the Windows environment, for example), the program could be invoked from the context menu (right mouse click) to find any repeated copies of the selected file.

Just keep in mind that the contents of compressed folders are ignored; that is, if the repeated file is inside a ZIP/RAR, it will never be evaluated and therefore will never be considered repeated (put this in the usage instructions of your future application). And then send me a copy to test ;-)
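The search behind such a context-menu entry could be sketched in Python like this. Registering the menu entry itself is not shown, and `duplicates_of` is a hypothetical name. With a single target no hash table is needed: only files of the same size are compared directly against it.

```python
import filecmp
import os


def duplicates_of(target, root):
    """Return the paths under root whose content equals that of target."""
    target = os.path.abspath(target)
    size = os.path.getsize(target)
    matches = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.abspath(os.path.join(folder, name))
            if path == target or os.path.islink(path):
                continue  # skip the selected file itself and symlinks
            if os.path.getsize(path) != size:
                continue  # different size, so it cannot be identical
            if filecmp.cmp(target, path, shallow=False):
                matches.append(path)
    return matches
```

A call such as `duplicates_of(selected_path, search_root)` would return the list to show to the user, where both arguments are whatever the menu handler receives.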