Ignore duplicates

Post by **Stretto** » Mon Sep 28, 2020 6:06 pm

I have a lot of duplicates because (1) I have some nearly duplicate drives (backups, but with differences) and (2) I sometimes use junctions to link different drives.

This unfortunately causes a ton of duplicates in everything. I have over 10M files and probably 1/3 are duplicates.

Can we not have a simple filter to remove duplicates? Basically, a check box in the options that compares files by name and size and, if they are the same, displays only one. (I don't care which one, since they are duplicates; maybe add an option that allows one to quickly expand duplicates.)

I can't mess with Everything's options to skip the offending drives, because figuring that out would be a huge pain, and it is simply not worth it when a few lines of code can solve the problem.

Post by **therube** » Tue Sep 29, 2020 3:56 pm

If you know dupes are kept on particular drives, you can exclude those drives

!Z:

.
You can exclude dupes,

!dupe:

or

!sizedupe:

.
(Note that dupe: [sizedupe:] are global in scope.)

You can set up a Filter, say a NO-DUP Filter, that is an Everything filter - except that excludes dupes.
So you can switch to that when you don't want to see them, & then can drop back to the Everything filter when you do want to see them or to ensure that !dupe: isn't hiding something from you.

Post by **Stretto** » Fri Oct 02, 2020 12:12 am

therube wrote: ↑Tue Sep 29, 2020 3:56 pm If you know dupes are kept on particular drives, you can exclude those drives
!Z:
.
You can exclude dupes,
!dupe:
or
!sizedupe:
.
(Note that dupe: [sizedupe:] are global in scope.)

You can set up a Filter, say a NO-DUP Filter, that is an Everything filter - except that excludes dupes.
So you can switch to that when you don't want to see them, & then can drop back to the Everything filter when you do want to see them or to ensure that !dupe: isn't hiding something from you.

As I said, excluding drives and folders is not a solution. The drives are too complex in layout and not everything is duplicated. I'd have to spend months to get everything working just to avoid a few extra lines of code? Then if things change, like drive letters or a new OS, I'd have to redo it all? Again, just to avoid adding a good feature to the software?

OK, I'm new to Everything, but I see it does have extensive search logic. I think I was able to remove dupes...

But initially I could not get it to work because I had regex enabled.

1. It doesn't seem like dupe: and regex work together?
2. Can I set a filter using something like !dupe: and then search (from the search bar) within that? The issue is that !dupe: returns only non-duplicates while dupe: returns only duplicates. That isn't really what I want. I want it to return at most one entry from each group of duplicates.

Hence it almost defeats the point.

That is, if I have 10 files and 3 are dupes, then !dupe: gives me 7 files and dupe: gives me 3 files. I want to get 8 files: all the non-dupes plus one from each group of dupes.

If Everything can't do this, then creating a new command such as "unique:" would work. It would return a unique entry: if a file is duplicated in name or content, only one copy will show.

The dupe command seems like it is meant for finding duplicate filenames, which is handy, but useless for what I want UNLESS I can somehow use a compound macro to combine !dupe and dupe to get what I need. E.g., !dupe+dupe[1]: which would express "all the non-dupes + the first entry of the dupes".
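To make the difference concrete, here is a minimal Python sketch of the three result sets described above. It is not Everything's implementation; the function name `split_by_duplication`, the `(path, name, size)` tuples, and the choice to compare names case-insensitively are all assumptions made for illustration.

```python
from collections import defaultdict

def split_by_duplication(files):
    """files: iterable of (path, name, size) tuples (hypothetical layout).

    Returns (non_dupes, dupes, unique) where
      non_dupes ~ what "!dupe:" shows in the post's description,
      dupes     ~ what "dupe:" shows,
      unique    ~ the proposed "unique:": every non-dupe plus one per group.
    """
    groups = defaultdict(list)
    for path, name, size in files:
        # Assumption: two files are "the same" if name (case-insensitive)
        # and size both match, as suggested in the first post.
        groups[(name.lower(), size)].append(path)

    non_dupes = [paths[0] for paths in groups.values() if len(paths) == 1]
    dupes = [p for paths in groups.values() if len(paths) > 1 for p in paths]
    unique = [paths[0] for paths in groups.values()]
    return non_dupes, dupes, unique
```

With the ten-file example from the post (seven distinct files plus one group of three copies), the three lists contain 7, 3 and 8 entries respectively.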

Post by **therube** » Fri Oct 02, 2020 2:56 pm

(Again, not what you're wanting, but, Show exactly one copy of each filename?
And further off the track, Question: Comparing drives/folders for files.

Anyhow, a unique: might not be a bad idea, if it is doable.)

Post by **Stretto** » Fri Oct 02, 2020 5:49 pm

therube wrote: ↑Fri Oct 02, 2020 2:56 pm (Again, not what you're wanting, but, Show exactly one copy of each filename?
And further off the track, Question: Comparing drives/folders for files.

Anyhow, a unique: might not be a bad idea, if it is doable.)

It is not hard to do. A few lines of code.

There could be an option to ignore duplicates, or a unique: command. Possibly something like unique[k]: to select the kth item from each group of duplicates.

I'm not sure how the code structures the list of items, but if they are in an associative list/dictionary it would be rather simple.

The first thing to do is check the file size, since if the sizes do not match there is no way the files can be the same. This speeds up the checking tremendously, and if the file size is stored then it doesn't have to be read from disk.

If they have the exact same file size, then one has to check the contents. This can be slow. Simply reading a few bytes at the start and end should be enough; if they match, more is read. This can still be slow when comparing very large files that are the same, since one would have to compare the entire contents to prove they are identical. Of course it could bail out after a few KB and just show them as likely duplicates. Alternatively, it could do a full compare, store the results in the database, and not compare again.
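The size-first, then start/end bytes, then full-compare idea above could look roughly like this in Python. It is only a sketch under assumptions: the names `probe`, `same_content`, `find_duplicate_groups` and the 4 KB probe size are made up here, and it reads file sizes from disk with `os.path.getsize` rather than from an index as Everything could.

```python
import os
from collections import defaultdict

PROBE = 4096  # assumed probe size: bytes read from the start and the end

def probe(path, size):
    """Return (first bytes, last bytes) of a file as a cheap fingerprint."""
    with open(path, "rb") as f:
        head = f.read(PROBE)
        tail = b""
        if size > PROBE:
            # Start the tail after the head so the two never overlap.
            f.seek(max(size - PROBE, PROBE))
            tail = f.read(PROBE)
    return head, tail

def same_content(a, b, chunk=1 << 20):
    """Full byte-for-byte comparison of two files of equal size."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            block_a, block_b = fa.read(chunk), fb.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:  # both files hit end-of-file together
                return True

def find_duplicate_groups(paths, full_compare=True):
    """Group paths that are duplicates (or likely duplicates)."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)  # step 1: size must match

    groups = []
    for size, candidates in by_size.items():
        if len(candidates) < 2:
            continue
        by_probe = defaultdict(list)
        for p in candidates:
            by_probe[probe(p, size)].append(p)  # step 2: start/end bytes
        for likely in by_probe.values():
            if len(likely) < 2:
                continue
            if not full_compare:
                groups.append(likely)  # "likely duplicates", no full read
                continue
            remaining = likely
            while remaining:  # step 3: confirm with a full compare
                first, rest = remaining[0], remaining[1:]
                same = [first] + [p for p in rest if same_content(first, p)]
                if len(same) > 1:
                    groups.append(same)
                remaining = [p for p in rest if p not in same]
    return groups
```

Passing `full_compare=False` corresponds to bailing out early and reporting likely duplicates; it can group files that differ only somewhere in the middle.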

That would get a basic unique: done.

The [k] just selects which one in the list of dupes to return. [0] is the first, which is the default; [1] skips the first and returns the second; [2] returns the third, and so on. If k runs past the end of the list, return the last one. Not the best way, but it allows one to systematically view the dupes in some order.

Now, I do not know if this might cause problems with other commands. Probably one needs a pre-unique and a post-unique that run before and after any other commands but are otherwise independent.
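As a sketch of the [k] selection rule just described (zero-based, falling back to the last entry when k is too large), assuming each group of dupes is simply an ordered Python list; `pick_kth` is a hypothetical helper name, not an Everything function:

```python
def pick_kth(group, k=0):
    """Return the k-th entry of a duplicate group (0 = first, the default).

    If k is past the end of the group, the last entry is returned.
    """
    if not group:
        raise ValueError("duplicate group must not be empty")
    if k < 0:
        raise ValueError("k must be zero or positive")
    return group[min(k, len(group) - 1)]
```

For a group `["C:/a.txt", "D:/a.txt", "E:/a.txt"]`, k=0 picks the C: copy, k=1 the D: copy, and any k of 2 or more the E: copy.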

Another thing that would be desired is a way to hide files. E.g., an attribute where one can mark a file and say "do not show".

Then one can have a command like "ignoreMarked:" or "showMarked:", with the default being toggleable.

It should be a few lines of code, probably no more than 100.

One, of course, can compare filename dupes too but a filename does not represent much. It could be a looser or tighter check but probably should be additional.

I don't need anything perfect, just something that works well enough to reduce most of the duplication I'm seeing. A simple file size and first-cluster content compare would work fine. If it misses some dupes and shows them, it's no big deal. Of course, I don't want it to thrash my drive trying to compare a bunch of files either, which is why I think having them marked/cached in the database would be necessary.
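One way the caching idea could work is sketched below. Note that it swaps in a technique the post does not mention: instead of storing pairwise compare results, it stores a SHA-256 digest per file in a small SQLite table and only rereads a file when its size or modification time has changed. The table layout and the names `open_cache` and `cached_digest` are assumptions for illustration, not anything Everything's database actually does.

```python
import hashlib
import os
import sqlite3

def open_cache(db_path=":memory:"):
    """Open (or create) the digest cache; in-memory by default."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS digests ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, digest TEXT)"
    )
    return conn

def cached_digest(conn, path):
    """Return the file's SHA-256 hex digest, reading the file only if needed."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT size, mtime_ns, digest FROM digests WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_mtime_ns:
        return row[2]  # unchanged since last time: no disk read of contents

    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO digests VALUES (?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns, digest),
    )
    conn.commit()
    return digest
```

Files with equal size and equal cached digest can then be treated as duplicates without touching the drive again on later searches.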

Thanks.

voidtools forum
