Function to find Partial Name Dupes

Post by **raccoon** » Thu Apr 02, 2020 12:51 am

One of the tasks I keep running up against is locating partial name dupes. This is tricky because I'm not searching for a verbatim string, but for whether multiple files or folders share the same substring of a given length or position.

Simple Method: Find all records with duplicates matching the first N characters. Eg, All files with the same first 15 characters.

DupeLeft:15
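
As a rough illustration of the simple method outside of Everything, here is a minimal Python sketch. `dupe_left` is a hypothetical helper name, and the case-insensitive comparison and the skipping of names shorter than N are assumptions, not anything Everything is known to do.

```python
from collections import defaultdict

def dupe_left(names, n):
    """Return groups of names that share the same first n characters."""
    groups = defaultdict(list)
    for name in names:
        if len(name) >= n:  # assumption: names shorter than n cannot qualify
            groups[name[:n].lower()].append(name)  # assumption: ignore case
    # Keep only prefixes shared by more than one name.
    return [group for group in groups.values() if len(group) > 1]

names = [
    "Quarterly Report 2008.xls",
    "Quarterly Report 2009.xls",
    "Budget 2013.xls",
]
print(dupe_left(names, 15))
```

Building the prefix table takes one pass over the names, so no name is compared directly against every other name.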

Advanced Method: Allow the user to compose a regular expression pattern that defines the length and composition of the substring a record must contain to be compared against other records. The portions of the pattern captured as backrefs are resolved and matched against other records for dupe comparison; the other portions of the pattern act as generic qualifying filters.

DupeRegex:"^(.{15,})"

DupeRegex:"^(.*)(?:19|20)\d\d"
In the above example, any files that contain /(?:19|20)\d\d/ are compared for substring duplication of the portion of the name preceding that number, the /(.*)/ backref. The number (year) need not match between duplicates, only the substring to the left of it.
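
The year example could be sketched like this in Python. It uses Python's `re` module rather than PCRE (this pattern behaves the same in both), and `dupe_by_prefix_before_year` is a hypothetical name used only for illustration.

```python
import re
from collections import defaultdict

# Same pattern as above: group 1 is everything to the left of the year.
YEAR_PATTERN = re.compile(r"^(.*)(?:19|20)\d\d")

def dupe_by_prefix_before_year(names):
    """Group names by the text preceding a 19xx/20xx year; keep shared ones."""
    groups = defaultdict(list)
    for name in names:
        match = YEAR_PATTERN.search(name)
        if match:  # names without a year are filtered out entirely
            groups[match.group(1)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}

names = [
    "Holiday Trip 2019.jpg",
    "Holiday Trip 2020.jpg",
    "Holiday Trip.jpg",
    "Tax Return 1998.pdf",
]
print(dupe_by_prefix_before_year(names))
```

Because `(.*)` is greedy, a name containing several years is keyed on the text before the last one.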

Thoughts?

Post by **void** » Thu Apr 02, 2020 10:42 am

I like the dupeleft: idea.

The DupeRegex: search could work. Although, performing a regex search for each filename would be very slow.

Maybe something like:
regex:"^(.*)(?:19|20)\d\d" dupestartwith:\1
-the result would have to match the first regex search and a duplicate would have to exist that starts with the first captured sub-expression.
-a startwith search for each filename would be instant.

Thank you for the suggestions.

Post by **raccoon** » Thu Apr 02, 2020 1:02 pm

Aye, I recognize the regex thing would have to be a multi-pass recursion. Though, I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record, and then all records scanned again in kind. Seems basically like my idea, but limiting \1 to the left-side of the string. Maybe some savings if plain string compare is faster than PCRE. But, perhaps, just creating an index of the value of \1\2\3\4\5... for each record is enough, and just fast search / sort / compare / hashtable lookup those.

DupeRegex:"\b(\w{8,})\b"

Records:

foo documents bar.ext           resolves to:  \1 == documents
baz documents quux.ext          resolves to:  \1 == documents  (match)
aaa raspberries bbb butts.ext   resolves to:  \1 == raspberries
butts ccc raspberries ddd.ext   resolves to:  \1 == raspberries  (match)

I'm not sure it's necessary to attempt to match multiple resolves per entry. ie, no need to pull your hair out over supporting //g patterns.
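
A minimal Python sketch of that index idea: resolve \1 once per record, store it in a hash table, and read the dupes straight out of the table. The function name is hypothetical, and only the first match per name is used, in line with skipping //g support.

```python
import re
from collections import defaultdict

def index_by_first_capture(pattern, names):
    """Single pass: map group 1 of the first match to the names producing it."""
    regex = re.compile(pattern)
    index = defaultdict(list)
    for name in names:
        match = regex.search(name)  # first match only, no //g-style repetition
        if match:
            index[match.group(1)].append(name)
    return index

records = [
    "foo documents bar.ext",
    "baz documents quux.ext",
    "aaa raspberries bbb butts.ext",
    "butts ccc raspberries ddd.ext",
]
index = index_by_first_capture(r"\b(\w{8,})\b", records)
dupes = {key: found for key, found in index.items() if len(found) > 1}
print(dupes)
```

The regex runs once per record; finding the duplicates afterwards is only a scan over the table.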

Another example but with multiple backrefs. We just clobber them together into a single [invalid-file-character...say-colon] delimited string.

DupeRegex:"(\w{6,}) \d{4}.*\.([^.]+)" // 6-or-more word characters, followed by a space, 4 numbers, and a file extension.
Records must match the 6-or-more word characters and have the same file extension.

doctor voidtools 2020 DVD.mp4    resolves to:  \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4:::::::
mr voidtools 2019 facebook.mp4   resolves to:  \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4:::::::  (match)
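
The clobbered key could be built as in the following Python sketch. `capture_key` and `find_dupes` are hypothetical names, and padding to nine slots simply mirrors the \1 through \9 layout shown above.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"(\w{6,}) \d{4}.*\.([^.]+)")

def capture_key(match, slots=9):
    """Join capture groups 1..slots with ':' (not valid in Windows file names)."""
    groups = [group or "" for group in match.groups()][:slots]
    groups += [""] * (slots - len(groups))  # pad unused groups with empty strings
    return ":".join(groups)

def find_dupes(names):
    """Bucket names by their joined capture key; keep keys seen more than once."""
    buckets = defaultdict(list)
    for name in names:
        match = PATTERN.search(name)
        if match:
            buckets[capture_key(match)].append(name)
    return {key: found for key, found in buckets.items() if len(found) > 1}

names = [
    "doctor voidtools 2020 DVD.mp4",
    "mr voidtools 2019 facebook.mp4",
    "holiday 2020.jpg",
]
print(find_dupes(names))
```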

Post by **void** » Sat Apr 04, 2020 2:58 am

raccoon wrote: I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record

Everything has this information already from the previous regex: search term. It is stored in the regex matches.
Generating the startwith term from \1 would be instant.

A startwith search is really a lookup in Everything and is instant (it really is only a few instructions).

Currently I have the following search planned for a future release:
path:regex:(.*)\.mp4 fileexists:\1\.jpg

This will return mp4 files where a jpg exists with the same stem in the same folder.
This search is instant.
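
To illustrate why a lookup like this can be cheap, here is a rough Python sketch that treats the index as a set of full paths. It is not how Everything implements fileexists:. The helper name and the case-insensitive matching are assumptions.

```python
import ntpath  # Windows-style path handling; usable on any platform

def mp4_with_jpg(paths):
    """Return .mp4 paths for which a .jpg with the same stem is in the index."""
    index = {path.lower() for path in paths}  # assumption: case-insensitive
    hits = []
    for path in paths:
        stem, ext = ntpath.splitext(path)
        # One set lookup per candidate instead of a scan over all records.
        if ext.lower() == ".mp4" and (stem + ".jpg").lower() in index:
            hits.append(path)
    return hits

paths = [
    r"C:\videos\a.mp4",
    r"C:\videos\a.jpg",
    r"C:\videos\b.mp4",
    r"C:\other\b.jpg",
]
print(mp4_with_jpg(paths))
```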

dupestartwith: would be similar to fileexists:, except it will never match the current file.

I've added dupestartwith: to my TODO list.

Post by **raccoon** » Sat Apr 04, 2020 11:42 am

Cool beans, will be fun to get my hands on!

What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary).

Post by **void** » Mon Apr 06, 2020 4:08 am

raccoon wrote: What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary).

It might work, I'm not sure where to specify the \1:\2:\3:\4...
It could be done as a search term...

I'm looking into a "duplicate view" mode with the following options:
Show all items.
Show duplicates only.
Show unique items only.
Show only one instance of each value.

These dupe modes would be based on the current sort.

For example, sort by size and select duplicate view -> show duplicates only to show only the files whose size is shared by at least one other file.
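
The four modes can be sketched in Python as operations on runs of equal values in a sorted list. The mode names and `duplicate_view` are hypothetical, and this only illustrates the idea, not Everything's implementation.

```python
from itertools import groupby

MODES = ("all", "duplicates", "unique", "one_of_each")

def duplicate_view(items, key, mode="all"):
    """Filter items by how often their sort key occurs among the items."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    ordered = sorted(items, key=key)  # the view is based on the current sort
    result = []
    for _, run in groupby(ordered, key=key):
        run = list(run)  # consecutive items sharing the same key
        if mode == "all":
            result.extend(run)
        elif mode == "duplicates" and len(run) > 1:
            result.extend(run)
        elif mode == "unique" and len(run) == 1:
            result.extend(run)
        elif mode == "one_of_each":
            result.append(run[0])
    return result

files = [("a.txt", 10), ("b.txt", 10), ("c.txt", 20)]

def size_of(item):
    return item[1]

print(duplicate_view(files, size_of, "duplicates"))
```

Once the list is sorted, equal values sit next to each other, so each mode is a single pass over the results.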

There will be a new column 'regex match 1' to show the captured regex match 1. You will be able to sort by this column.

For example, search for regex:"\b(\w{8,})\b", sort by regex match 1, select duplicate view -> show duplicates only to show the files whose captured regex match 1 is shared with at least one other file.

This has the limitation of only matching one captured regex match.

I already had a solution working for size and name. However, I think I'll rewrite it to support all columns.

Post by **raccoon** » Mon Apr 06, 2020 7:30 am

void wrote: ↑Mon Apr 06, 2020 4:08 am
It might work, I'm not sure where to specify the \1:\2:\3:\4...

...

There will be a new column 'regex match 1' to show the captured regex match 1. You will be able to sort by this column.

Basically this, but instead of just 'regex match 1', it will contain all regex matches, and they would be tokenized with a colon delimiter (invalid file character). And within this column, you look for duplicates (or uniques if that's your thing). This way the regex pattern can contain more-than-one back-reference instead of just one. And, as well, the duplicate matching doesn't have to be left-aligned to the file name or file path, since the back-reference(s) may appear anywhere in the pattern.

Here's my example again, with coloring.

DupeRegex:"(\w{6,})\s\d{4}.*\.([^.]+)"

Name                              | Regex Matches
--------------------------------- | -------------
doctor voidtools 2020 DVD.mp4     | voidtools:mp4
mr voidtools 2019 facebook.mp4    | voidtools:mp4 <-- look, a duplicate!
Quarterly Report 2008.xls         | Report:xls
Copy of Quarterly Report 2008.xls | Report:xls <-- look, a duplicate!
Budget Finance Report 2013.xls    | Report:xls <-- look, another duplicate!

All we're really doing here is the same as any old Regex:"pattern" search, but pulling out [all] back references and seeing if any other records share matching back references.

Post by **void** » Mon Apr 06, 2020 7:37 am

I'll add an 'all regex matches' column.

Then you'll be able to search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"
sort by 'all regex matches'
select duplicate view -> find duplicates.

I'll consider a duperegex: search which will automate this, i.e. the sort by 'all regex matches' would be done behind the scenes.

Thanks for the suggestion.

Post by **void** » Mon Mar 15, 2021 12:07 pm

This functionality has been added to Everything 1.5 Alpha.

Search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"

Right click the column header, under the Search submenu, click Regular Expression Match 0.
Right click the Regular Expression Match 0 column header and click Find Regular Expression Match 0 Duplicates.

Post by **Phlashman** » Sat Mar 25, 2023 5:32 am

I'm trying to find all files with the pattern
filename (n).ext
I can find these with regex:^(.+)\s[(]\d+[)]\.([^.]+)
The filename is captured in the first set of brackets and the extension in the second. I also wanted any "unnumbered" files matching "filename.ext",
so I tried
regex:^(.+)\s[(]\d+[)]\.([^.]+)|^\1\.\2
however this did not work

I only get all the files matching "filename (n).ext"
Is this task possible? I want to find files that have the (n) at the end of the filename due to copy/paste of files into the same directory.

Thanks

Post by **void** » Sat Mar 25, 2023 5:53 am

You want to find filename (n).ext where filename.ext exists?

Please try:
regex:^(.+)\s[(]\d+[)]\.([^.]+)$ fileexists:\1\.\2

You want to find filename.ext where filename (2).ext or filename (3).ext or filename (4).ext or ... exists?

Please try:
regex:^(.+)\.([^.]+)$ fileexists:\1" "\(2\)\.\2 | fileexists:\1" "\(3\)\.\2 | fileexists:\1" "\(4\)\.\2 | fileexists:\1" "\(5\)\.\2 | fileexists:\1" "\(6\)\.\2 | fileexists:\1" "\(7\)\.\2 | fileexists:\1" "\(8\)\.\2 | fileexists:\1" "\(9\)\.\2

You want to combine both of these?
<regex:^(.+)\s[(]\d+[)]\.([^.]+)$ fileexists:\1\.\2> | <regex:^(.+)\.([^.]+)$ fileexists:\1" "\(2\)\.\2 | fileexists:\1" "\(3\)\.\2 | fileexists:\1" "\(4\)\.\2 | fileexists:\1" "\(5\)\.\2 | fileexists:\1" "\(6\)\.\2 | fileexists:\1" "\(7\)\.\2 | fileexists:\1" "\(8\)\.\2 | fileexists:\1" "\(9\)\.\2>
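
For comparison, the first of these searches (filename (n).ext where filename.ext exists) can be sketched in Python over a plain list of paths. The helper name is hypothetical, and treating the index as a case-insensitive set of full paths is an assumption.

```python
import ntpath  # Windows-style path handling; usable on any platform
import re

# Same pattern as the regex: term above, applied to the file name only.
NUMBERED = re.compile(r"^(.+)\s[(]\d+[)]\.([^.]+)$")

def numbered_copies_with_original(paths):
    """Return (copy, original) pairs where 'name (n).ext' sits beside 'name.ext'."""
    index = {path.lower() for path in paths}  # assumption: case-insensitive
    pairs = []
    for path in paths:
        folder, name = ntpath.split(path)
        match = NUMBERED.match(name)
        if match:
            # Rebuild the unnumbered name from the two capture groups.
            original = ntpath.join(folder, f"{match.group(1)}.{match.group(2)}")
            if original.lower() in index:
                pairs.append((path, original))
    return pairs

paths = [
    r"C:\docs\report.xls",
    r"C:\docs\report (1).xls",
    r"C:\docs\notes (2).txt",
]
print(numbered_copies_with_original(paths))
```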

Post by **Phlashman** » Sat Mar 25, 2023 8:00 am

Thanks for the quick reply. I tried putting regex: in front of fileexists: eg.

<regex:^(.+)\.([^.]+)$ regex:fileexists:^$1:" "\(\d+\)\.$2:$> but that gave nothing, so obviously can't use regex with fileexists: ...

So I used the last example you gave me with the hardcoded number (1) and used

file: online: <regex:^(.+)\s[(]1[)]\.([^.]+)$ fileexists:\1\.\2> | <regex:^(.+)\.([^.]+)$ fileexists:\1" "\(1\)\.\2>

It found suspect redundant copies, got size and folder matched (ran Dupe on size column).
Also ran pairs through "Beyond Compare" (Binary compare) just to be sure.....

When I confirmed the exact copy "filename (1)", I deleted it. However the partner "filename" is still left behind in the results.
How do I refresh so these disappear? F5 did nothing?

Post by **void** » Sat Mar 25, 2023 8:05 am

Change the search (eg: add a space to the end) to refresh the results.

Post by **Phlashman** » Sun Mar 26, 2023 1:28 am

Refined to DUPE on the "\1.\2" combo using column1

<regex:^(.+)\s[(]2[)]\.([^.]+)$ fileexists:\1\.\2 column1:=$regular-expression-match-1:.$regular-expression-match-2:> | <regex:^(.+)\.([^.]+)$ fileexists:\1" "\(2\).\2 column1:=$regular-expression-match-1:.$regular-expression-match-2:> addcolumn:column1 sort:path;name-descending dupe:column1;size

I notice that fileexists: uses "\" ahead of the ".", "(" and ")" characters. Is this the same escape "\" that is used in regex? Would it be possible for fileexists: to support full regex in future? Then \d could be used, and there would be no need to hardcode the 2, 3, etc. in a chain as in your answer above.

Post by **void** » Wed Apr 05, 2023 6:53 am

Everything 1.5.0.1341a makes some improvements to sibling:

Please try the following search:

regex:^(.+)\.([^.]+)$ regex:sibling:$1:\s\(\d+\)\.$2:

Please note this search is rather slow.

$1: will now be correctly escaped for a regex: search in 1341a+.
regex: will now override ; list syntax in 1341a+.

fileexists: will continue to match an absolute filename in the index.
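
As a rough analogy in Python, not Everything's implementation: build a regex from each name's stem and extension, escape both parts, and test it against the other names in the same folder. `re.escape` plays the role of the escaping described for $1: above. The helper name is hypothetical, and so is the case-insensitive matching.

```python
import ntpath  # Windows-style path handling; usable on any platform
import re
from collections import defaultdict

def originals_with_numbered_sibling(paths):
    """Return 'name.ext' paths that have a 'name (n).ext' sibling in their folder."""
    by_folder = defaultdict(list)
    for path in paths:
        folder, name = ntpath.split(path)
        by_folder[folder.lower()].append(name)

    hits = []
    for path in paths:
        folder, name = ntpath.split(path)
        stem, ext = ntpath.splitext(name)
        if not stem or not ext:
            continue  # mirrors ^(.+)\.([^.]+)$ needing both a stem and an extension
        # Escape the captured parts so characters like '+' or '(' match literally.
        sibling = re.compile(
            rf"^{re.escape(stem)}\s\(\d+\){re.escape(ext)}$", re.IGNORECASE
        )
        if any(sibling.match(other) for other in by_folder[folder.lower()]):
            hits.append(path)
    return hits

paths = [
    r"C:\docs\report.xls",
    r"C:\docs\report (1).xls",
    r"C:\docs\a+b.txt",
    r"C:\docs\a+b (3).txt",
    r"C:\misc\solo.txt",
]
print(originals_with_numbered_sibling(paths))
```

Each name is checked against every sibling with a freshly built regex, which hints at why a search of this shape is slower than an exact fileexists: lookup.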

voidtools forum
