I found a way to support dupe-from:/multiple-dupes with the old/current behavior, so nothing will change.
old/current find dupes behavior:
Check the length difference between two items; if it is less than or equal to length_dupe_tolerance, add them both as a result.
In more detail:
Everything gathers the length for all items.
Everything sorts by length, so you will have a nice list like:
Code: Select all
C:\TEST\DUPE\Length\0.wav 00:00 -\
C:\TEST\DUPE\Length\1.wav 00:00 |
C:\TEST\DUPE\Length\999.wav 00:00 |
C:\TEST\DUPE\Length\1000.wav 00:01 |- considered "the same" because the difference from the previous item is <= 1000
C:\TEST\DUPE\Length\1001.wav 00:01 |
C:\TEST\DUPE\Length\1999.wav 00:01 |
C:\TEST\DUPE\Length\2999.wav 00:02 -/
C:\TEST\DUPE\Length\4999.wav 00:04
(using a length_dupe_tolerance of 1000)
Everything walks over the list.
If the difference between the previous length and the current length is <= length_dupe_tolerance, the previous item and the current item are added to the results.
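The walk described above can be sketched like this (hypothetical names; a sketch of the behavior, not Everything's actual implementation):

```python
LENGTH_DUPE_TOLERANCE = 1000  # milliseconds

def find_length_dupes(lengths_ms):
    """Flag lengths as dupes by comparing each item to the previous one."""
    results = set()
    items = sorted(lengths_ms)
    for prev, cur in zip(items, items[1:]):
        if cur - prev <= LENGTH_DUPE_TOLERANCE:
            results.add(prev)
            results.add(cur)
    return results

print(sorted(find_length_dupes([0, 1, 999, 1000, 1001, 1999, 2999, 4999])))
# [0, 1, 999, 1000, 1001, 1999, 2999]
# 4999 is the only non-dupe: every other neighboring step is <= 1000ms
```

Because each item only has to be close to its neighbor, dupe groups can chain together far beyond the tolerance.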
There's a lot of weirdness with this method: you can end up with a really long range of lengths that are all considered duplicates, far exceeding the tolerance.
In the example above, with a length_dupe_tolerance of 1000, the lengths 0 - 2999 are all considered the same, because each step between neighbors is small enough.
That 0 - 2999 range is almost three times the length_dupe_tolerance.
What I had proposed for the new behavior was to put lengths into buckets by dividing the length by length_dupe_tolerance, then use the buckets to find duplicates.
bucket example using length_dupe_tolerance of 1000:
Code: Select all
C:\TEST\DUPE\Length\0.wav 00:00 ( 0 / 1000 == bucket: 0) -\
C:\TEST\DUPE\Length\1.wav 00:00 ( 1 / 1000 == bucket: 0) |- first dupe group
C:\TEST\DUPE\Length\999.wav 00:00 ( 999 / 1000 == bucket: 0) -/
C:\TEST\DUPE\Length\1000.wav 00:01 (1000 / 1000 == bucket: 1) -\
C:\TEST\DUPE\Length\1001.wav 00:01 (1001 / 1000 == bucket: 1) |- second dupe group
C:\TEST\DUPE\Length\1999.wav 00:01 (1999 / 1000 == bucket: 1) -/
C:\TEST\DUPE\Length\2999.wav 00:02 (2999 / 1000 == bucket: 2)
C:\TEST\DUPE\Length\4999.wav 00:04 (4999 / 1000 == bucket: 4)
Buckets are easy to calculate and check.
Unfortunately, buckets can miss really small differences: for example, 999ms and 1000ms differ by only 1ms, yet they are not treated as dupes because they fall into different buckets.
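The bucket approach can be sketched like this (hypothetical names; a sketch of the idea, not Everything's actual implementation):

```python
from collections import defaultdict

LENGTH_DUPE_TOLERANCE = 1000  # milliseconds

def bucket_dupes(lengths_ms):
    """Group lengths by integer-dividing them into tolerance-sized buckets."""
    buckets = defaultdict(list)
    for length in lengths_ms:
        buckets[length // LENGTH_DUPE_TOLERANCE].append(length)
    # Only buckets with more than one member are dupe groups.
    return {b: group for b, group in buckets.items() if len(group) > 1}

print(bucket_dupes([0, 1, 999, 1000, 1001, 1999, 2999, 4999]))
# {0: [0, 1, 999], 1: [1000, 1001, 1999]}
# 999 and 1000 differ by only 1ms, yet land in buckets 0 and 1 and are missed
```

Bucket membership is a single division per item, which is what makes this method cheap to calculate and check.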
Users really want the old/current method because it compares each item with the previous one, catching all of these small differences.