dupe-from not working with tolerance

Discussion related to "Everything" 1.5 Alpha.
dougbenham
Posts: 34
Joined: Wed Mar 15, 2023 8:19 pm

dupe-from not working with tolerance

Post by dougbenham »

Set length_dupe_tolerance to non-zero (e.g. /length_dupe_tolerance=1500).

The query looks like:
<"C:\work\Photos\"|E:\> dupe-from:"C:\work\Photos\" dupe:length


When length_dupe_tolerance is 0, the dupe-from keyword properly requires at least one of the dupes to be from "C:\work\Photos\". But after setting length_dupe_tolerance to non-zero, the dupes returned no longer care about the dupe-from keyword.
dougbenham
Posts: 34
Joined: Wed Mar 15, 2023 8:19 pm

Re: dupe-from not working with tolerance

Post by dougbenham »

This is still an issue.
void
Developer
Posts: 19046
Joined: Fri Oct 16, 2009 11:31 pm

Re: dupe-from not working with tolerance

Post by void »

Thanks for bringing up the issue again, dougbenham.

This is currently a limitation with length_dupe_tolerance.

I am going to abandon the current behavior of length_dupe_tolerance, where the distance between the values is <= length_dupe_tolerance, and replace it with dividing the values by length_dupe_tolerance to lower the resolution.

Old behavior with a length_dupe_tolerance of 1000 would treat 999 and 1001 the same (the distance is <= 1000).
New behavior with a length_dupe_tolerance of 1000 will treat 999 as 0 and 1001 as 1 (different buckets, even though the distance is only 2).

With the new behavior, dupe-from: and multiple dupes will work as expected.
It's not feasible to use the old behavior with dupe-from: and multiple dupes.



For now, one workaround is to use custom columns:

<"C:\work\Photos\"|E:\> a:=$length:/10000 dupe-from:"C:\work\Photos\" dupe:a
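The workaround relies on integer division: assuming $length:/10000 truncates to a whole number (an assumption on my part about the formula syntax), two lengths match on dupe:a exactly when they fall into the same 10000-unit bucket. A minimal Python sketch of that idea:

```python
# Sketch of the bucketing idea behind the a:=$length:/10000 workaround,
# assuming the division truncates to an integer (not Everything's code).
def custom_column_a(length):
    # Equal bucket numbers => treated as duplicates by dupe:a
    return length // 10000

print(custom_column_a(123456), custom_column_a(129999))  # both 12: same bucket
print(custom_column_a(129999), custom_column_a(130001))  # 12 vs 13: different buckets
```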
dougbenham
Posts: 34
Joined: Wed Mar 15, 2023 8:19 pm

Re: dupe-from not working with tolerance

Post by dougbenham »

Hmm, that would be a bummer, because I'm using tolerance for exactly the feature you're saying you're going to remove.

See this thread viewtopic.php?p=73909#p73909

Is there some kind of workaround where I can get proper length tolerance checking plus all these other cool features you've built?
therube
Posts: 5491
Joined: Thu Sep 03, 2009 6:48 pm

Re: dupe-from not working with tolerance

Post by therube »

I am going to abandon the current behavior of length_dupe_tolerance, where the distance between the values is <= length_dupe_tolerance, and replace it with dividing the values by length_dupe_tolerance to lower the resolution.

Old behavior with a length_dupe_tolerance of 1000 would treat 999 and 1001 the same (the distance is <= 1000).
New behavior with a length_dupe_tolerance of 1000 will treat 999 as 0 and 1001 as 1 (different buckets, even though the distance is only 2).
Brain isn't working. Explain that in a different way, if you would.
(My length_dupe_tolerance works as I need [dupe:length]. My dupe-from: deals with name, so not lengths.)
void
Developer
Posts: 19046
Joined: Fri Oct 16, 2009 11:31 pm

Re: dupe-from not working with tolerance

Post by void »

I found a way to support dupe-from:/multiple-dupes with the old/current behavior so nothing will change.

old/current find dupes behavior:

Check the length difference between two items; if it is less than or equal to length_dupe_tolerance, add both items as results.



In more detail:

Everything gathers the length for all items.
Everything sorts by length, so you will have a nice list like:

Code:

C:\TEST\DUPE\Length\0.wav	 00:00 -\
C:\TEST\DUPE\Length\1.wav	 00:00  |
C:\TEST\DUPE\Length\999.wav	 00:00  |
C:\TEST\DUPE\Length\1000.wav	 00:01  |- considered "the same" because the difference from the previous item is <= 1000
C:\TEST\DUPE\Length\1001.wav	 00:01  |
C:\TEST\DUPE\Length\1999.wav	 00:01  |
C:\TEST\DUPE\Length\2999.wav	 00:02 -/
C:\TEST\DUPE\Length\4999.wav	 00:04
(using a length_dupe_tolerance of 1000)

Everything walks over the list.
If the difference between the previous length and the current length is <= length_dupe_tolerance, both the previous item and the current item are added to the results.

There's a lot of weirdness with this method: you can end up with a really long range of lengths that are all considered duplicates, far exceeding the 1000ms tolerance.
In the example above, with a length_dupe_tolerance of 1000, 0 - 2999 are all considered the same, because the steps between them are small enough.
The 0 - 2999 range far exceeds the tolerance.
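The walk described above can be sketched in a few lines of Python. This is a rough illustration of the old/current behavior, not Everything's actual code:

```python
# Rough sketch of the old/current tolerance behavior: walk a sorted
# list of lengths and mark any adjacent pair whose difference is
# within the tolerance as duplicates. Groups can chain well beyond
# the tolerance, as the forum example shows.
def find_dupes_by_chain(lengths_ms, tolerance):
    """Return the set of lengths considered duplicates.

    lengths_ms must be sorted; any two adjacent values within
    `tolerance` of each other are both marked as dupes.
    """
    dupes = set()
    for prev, cur in zip(lengths_ms, lengths_ms[1:]):
        if cur - prev <= tolerance:
            dupes.add(prev)
            dupes.add(cur)
    return dupes

# The lengths from the example above, with a tolerance of 1000ms:
lengths = [0, 1, 999, 1000, 1001, 1999, 2999, 4999]
print(sorted(find_dupes_by_chain(lengths, 1000)))
# [0, 1, 999, 1000, 1001, 1999, 2999]
# 0 through 2999 chain together; 4999 is 2000ms past 2999, so it is excluded.
```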



What I had proposed as the new behavior was to put lengths into buckets by dividing each length by length_dupe_tolerance, then use the bucket to find duplicates.

bucket example using length_dupe_tolerance of 1000:

Code:

C:\TEST\DUPE\Length\0.wav	 00:00 (   0 / 1000 == bucket: 0) -\
C:\TEST\DUPE\Length\1.wav	 00:00 (   1 / 1000 == bucket: 0)  |- first dupe group
C:\TEST\DUPE\Length\999.wav	 00:00 ( 999 / 1000 == bucket: 0) -/
C:\TEST\DUPE\Length\1000.wav	 00:01 (1000 / 1000 == bucket: 1) -\
C:\TEST\DUPE\Length\1001.wav	 00:01 (1001 / 1000 == bucket: 1)  |- second dupe group
C:\TEST\DUPE\Length\1999.wav	 00:01 (1999 / 1000 == bucket: 1) -/
C:\TEST\DUPE\Length\2999.wav	 00:02 (2999 / 1000 == bucket: 2)
C:\TEST\DUPE\Length\4999.wav	 00:04 (4999 / 1000 == bucket: 4)
Buckets are easy to calculate and check.
Unfortunately, they can miss really small differences: for example, 999ms and 1000ms, which are only 1ms apart, would not be treated as dupes because they fall into different buckets.
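The bucket approach can also be sketched briefly. Again, this is an illustration of the idea described above, not Everything's actual code:

```python
# Rough sketch of the proposed bucket behavior: integer-divide each
# length by the tolerance and group lengths by bucket number. Any
# bucket holding more than one length is a dupe group.
from collections import defaultdict

def find_dupes_by_bucket(lengths_ms, tolerance):
    buckets = defaultdict(list)
    for length in lengths_ms:
        buckets[length // tolerance].append(length)
    return {b: items for b, items in buckets.items() if len(items) > 1}

lengths = [0, 1, 999, 1000, 1001, 1999, 2999, 4999]
print(find_dupes_by_bucket(lengths, 1000))
# {0: [0, 1, 999], 1: [1000, 1001, 1999]}
# 999 and 1000 are only 1ms apart but land in different buckets.
```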



Users really want the old/current method because it looks at the difference between the previous item and the current item, catching all small differences.
therube
Posts: 5491
Joined: Thu Sep 03, 2009 6:48 pm

Re: dupe-from not working with tolerance

Post by therube »

There's a lot of weirdness with this method: you can end up with a really long range of lengths that are all considered duplicates, far exceeding the 1000ms tolerance.
I'd have to see that in action to get a feel for what it means, but I'm thinking it is something that would not be wanted.


From (for a moment - a very brief moment) looking at current, my take is...


Everything, Length Tolerance, 900ms


> distinct:size;name dupe:Length !/$recycle !/music !/corrupt

---
group1:
58.52.722
58.52.660
58.52.544
---
group2:
58.52.000
58.51.917
58.51.600
---

(now, that a file is 1 group or another is not particularly important)
so what, it is breaking on the seconds: 51, & 51.600 + .900 includes 52.000
but then 58.52.xxx starts a new group, which covers .544, .660 & .722
so you end up with one 52 in group2: & the other 52s in group1:
(so be it, no big deal [to me])

so long as something along those lines continues (in newer Everything versions), i'm good
(absolute time, or absolute time+tolerance is not as important as similarly timed files
being grouped together - within reason)
[maybe a tolerance of tolerance is needed ;-)]


In the above, 58.51.600 does include 58.52.000 within tolerance. But I'm thinking I would not want 58.52.000 to then say 58.52.544 is within tolerance - even though it is - and then have all the rest of the 58.52.xxx (not to mention 58.53.xxx & 58.54.xxx ...) included too, because they are also within tolerance of each other. In the end you'd get some huge group that, from start to finish, is way over tolerance, even if each step is within tolerance. That would make things extremely confusing.
dougbenham
Posts: 34
Joined: Wed Mar 15, 2023 8:19 pm

Re: dupe-from not working with tolerance

Post by dougbenham »

I think I agree with that. Picking a primary/leader for each bucket/group is totally fine. Maybe it will even simplify the logic in the code by doing that.
void
Developer
Posts: 19046
Joined: Fri Oct 16, 2009 11:31 pm

Re: dupe-from not working with tolerance

Post by void »

Everything 1.5.0.1397a improves length_dupe_tolerance.

length_dupe_tolerance will now work with distinct:, unique:, dupe-from: and dupe: with multiple properties.