I found a way to support dupe-from:/multiple-dupes with the old/current behavior, so nothing will change.
old/current find dupes behavior:
Check the length difference between two items; if it is less than or equal to length_dupe_tolerance, add them both as a result.
In more detail:
Everything gathers the length for all items.
Everything sorts by length, so you will have a nice list like:
Code: Select all
C:\TEST\DUPE\Length\0.wav 00:00 -\
C:\TEST\DUPE\Length\1.wav 00:00 |
C:\TEST\DUPE\Length\999.wav 00:00 |
C:\TEST\DUPE\Length\1000.wav 00:01 |- considered "the same" because the difference from the previous item is <= 1000
C:\TEST\DUPE\Length\1001.wav 00:01 |
C:\TEST\DUPE\Length\1999.wav 00:01 |
C:\TEST\DUPE\Length\2999.wav 00:02 -/
C:\TEST\DUPE\Length\4999.wav 00:04
(using a length_dupe_tolerance of 1000)
Everything walks over the list.
If the difference between the previous length and the current length is <= length_dupe_tolerance, the previous item and the current item are added to the results.
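The walk described above can be sketched like this (hypothetical names; a sketch of the behavior, not Everything's actual implementation):

```python
LENGTH_DUPE_TOLERANCE = 1000  # milliseconds

def find_length_dupes(lengths_ms):
    """Flag lengths as dupes by comparing each item to the previous one."""
    results = set()
    items = sorted(lengths_ms)
    for prev, cur in zip(items, items[1:]):
        if cur - prev <= LENGTH_DUPE_TOLERANCE:
            results.add(prev)
            results.add(cur)
    return results

print(sorted(find_length_dupes([0, 1, 999, 1000, 1001, 1999, 2999, 4999])))
# [0, 1, 999, 1000, 1001, 1999, 2999]
# 4999 is the only non-dupe: every other neighboring step is <= 1000ms
```

Because each item only has to be close to its neighbor, dupe groups can chain together far beyond the tolerance.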
There's a lot of weirdness with this method: you can end up with a really long range of lengths that are all considered duplicates, far exceeding the tolerance.
In the example above, with a length_dupe_tolerance of 1000, the lengths 0 - 2999 are all considered the same, because each step between neighbors is small enough.
That 0 - 2999 range is almost three times the length_dupe_tolerance.
What I had proposed for the new behavior was to put lengths into buckets by dividing the length by length_dupe_tolerance, then use the buckets to find duplicates.
bucket example using length_dupe_tolerance of 1000:
Code: Select all
C:\TEST\DUPE\Length\0.wav 00:00 ( 0 / 1000 == bucket: 0) -\
C:\TEST\DUPE\Length\1.wav 00:00 ( 1 / 1000 == bucket: 0) |- first dupe group
C:\TEST\DUPE\Length\999.wav 00:00 ( 999 / 1000 == bucket: 0) -/
C:\TEST\DUPE\Length\1000.wav 00:01 (1000 / 1000 == bucket: 1) -\
C:\TEST\DUPE\Length\1001.wav 00:01 (1001 / 1000 == bucket: 1) |- second dupe group
C:\TEST\DUPE\Length\1999.wav 00:01 (1999 / 1000 == bucket: 1) -/
C:\TEST\DUPE\Length\2999.wav 00:02 (2999 / 1000 == bucket: 2)
C:\TEST\DUPE\Length\4999.wav 00:04 (4999 / 1000 == bucket: 4)
Buckets are easy to calculate and check.
Unfortunately, buckets can miss really small differences: for example, 999ms and 1000ms differ by only 1ms, yet they are not treated as dupes because they fall into different buckets.
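The bucket approach can be sketched like this (hypothetical names; a sketch of the idea, not Everything's actual implementation):

```python
from collections import defaultdict

LENGTH_DUPE_TOLERANCE = 1000  # milliseconds

def bucket_dupes(lengths_ms):
    """Group lengths by integer-dividing them into tolerance-sized buckets."""
    buckets = defaultdict(list)
    for length in lengths_ms:
        buckets[length // LENGTH_DUPE_TOLERANCE].append(length)
    # Only buckets with more than one member are dupe groups.
    return {b: group for b, group in buckets.items() if len(group) > 1}

print(bucket_dupes([0, 1, 999, 1000, 1001, 1999, 2999, 4999]))
# {0: [0, 1, 999], 1: [1000, 1001, 1999]}
# 999 and 1000 differ by only 1ms, yet land in buckets 0 and 1 and are missed
```

Bucket membership is a single division per item, which is what makes this method cheap to calculate and check.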
Users really want the old/current method because it compares each item with the previous one, catching all of these small differences.