Function to find Partial Name Dupes

raccoon
Posts: 147
Joined: Thu Oct 18, 2018 1:24 am

Function to find Partial Name Dupes

Post by raccoon » Thu Apr 02, 2020 12:51 am

One of the tasks I keep running up against is the ability to locate partial name dupes. This is tricky because my search string isn't necessarily a verbatim string, but rather whether multiple files or folders share the same substring of a given length or position.

Simple Method: Find all records with duplicates matching the first N characters. Eg, All files with the same first 15 characters.

DupeLeft:15
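To make the proposal concrete, here is a minimal sketch of what a DupeLeft:N comparison could compute. It is not Everything's implementation; the function name `dupe_left`, the case-insensitive comparison, and the choice to skip names shorter than N are all assumptions made for illustration.

```python
from collections import defaultdict

def dupe_left(names, n):
    """Group names by their first n characters and keep groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        if len(name) < n:
            continue  # assumption: names shorter than n never count as dupes
        # assumption: compare case-insensitively, as Windows file names usually are
        groups[name[:n].lower()].append(name)
    return {prefix: members for prefix, members in groups.items() if len(members) > 1}

names = [
    "Quarterly Report 2008.xls",
    "Quarterly Report 2009.xls",
    "Quarterly Review.doc",
    "notes.txt",
]
result = dupe_left(names, 15)
```

With N = 15 the two "Quarterly Report" files share the prefix "Quarterly Repor" and are grouped together, while "Quarterly Review.doc" differs at the 15th character and is left out.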

Advanced Method: Allow the user to compose a regular expression pattern that defines the substring length and composition a record must match in order to be compared against other records. The portions of the pattern captured in backreferences are resolved and compared against other records to find dupes; the remaining portions of the pattern act only as a generic qualifying filter.

DupeRegex:"^(.{15,})"

DupeRegex:"^(.*)(?:19|20)\d\d"
In the above example, any files that contain /(?:19|20)\d\d/ are compared for substring duplication of the portion of the name preceding that number, the /(.*)/ backref. The number (year) need not match between duplicates, only the substring to the left of it.
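A rough sketch of the DupeRegex: idea for the year pattern, using Python's `re` module rather than PCRE. The helper name `dupe_regex` and the sample file names are hypothetical. Note that because `(.*)` is greedy, the capture runs up to the last year-like number in the name.

```python
import re
from collections import defaultdict

# The pattern from the post: capture everything before a 19xx/20xx number.
YEAR_PATTERN = re.compile(r"^(.*)(?:19|20)\d\d")

def dupe_regex(names, pattern):
    """Group names by capture group 1; names that do not match are filtered out."""
    groups = defaultdict(list)
    for name in names:
        m = pattern.search(name)
        if m:
            groups[m.group(1)].append(name)
    # Only keys shared by two or more names are duplicates.
    return {key: members for key, members in groups.items() if len(members) > 1}

names = [
    "Holiday Trip 2019.jpg",
    "Holiday Trip 2020 (edit).jpg",
    "Holiday Trip.jpg",       # no year, so it never qualifies
    "Budget 1999.xls",        # qualifies, but nothing else shares "Budget "
]
result = dupe_regex(names, YEAR_PATTERN)
```

The two dated "Holiday Trip" files end up under the key "Holiday Trip " even though their years differ.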

Thoughts?

void
Site Admin
Posts: 6709
Joined: Fri Oct 16, 2009 11:31 pm

Re: Function to find Partial Name Dupes

Post by void » Thu Apr 02, 2020 10:42 am

I like the dupeleft: idea.

The DupeRegex: search could work. Although, performing a regex search for each filename would be very slow.

Maybe something like:
regex:"^(.*)(?:19|20)\d\d" dupestartwith:\1
-the result would have to match the first regex search and a duplicate would have to exist that starts with the first captured sub-expression.
-a startwith search for each filename would be instant.
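One way to picture why a starts-with check per file can be cheap: if the names are kept in sorted order, a binary search jumps straight to the run of names sharing a prefix. The sketch below is only an analogy under that assumption, not how Everything stores its index; `dupe_startwith` is a hypothetical helper and the names are assumed to be unique.

```python
import bisect
import re

def dupe_startwith(names, pattern):
    """Names that match `pattern` and for which another name starts with group 1."""
    sorted_names = sorted(names)  # stand-in for a name-sorted index
    hits = []
    for name in names:
        m = pattern.search(name)
        if not m:
            continue
        prefix = m.group(1)
        # Binary search to the first name >= prefix, then walk the run sharing it.
        i = bisect.bisect_left(sorted_names, prefix)
        while i < len(sorted_names) and sorted_names[i].startswith(prefix):
            if sorted_names[i] != name:  # never match the current file
                hits.append(name)
                break
            i += 1
    return hits

pattern = re.compile(r"^(.*)(?:19|20)\d\d")
names = ["Alien 1979.mkv", "Alien 1986 Special.mkv", "Brazil 1985.avi", "readme.txt"]
hits = dupe_startwith(names, pattern)
```

Because names are unique here, the inner loop inspects at most two entries per file, so the cost per file is dominated by the binary search.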

Thank you for the suggestions.

raccoon
Posts: 147
Joined: Thu Oct 18, 2018 1:24 am

Re: Function to find Partial Name Dupes

Post by raccoon » Thu Apr 02, 2020 1:02 pm

Aye, I recognize the regex thing would have to be a multi-pass recursion. Though, I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record, and then all records scanned again in kind. Seems basically like my idea, but limiting \1 to the left-side of the string. Maybe some savings if plain string compare is faster than PCRE. But, perhaps, just creating an index of the value of \1\2\3\4\5... for each record is enough, and just fast search / sort / compare / hashtable lookup those.

DupeRegex:"\b(\w{8,})\b"

Records:

foo documents bar.ext           resolves to:  \1 == documents
baz documents quux.ext          resolves to:  \1 == documents  (match)
aaa raspberries bbb butts.ext   resolves to:  \1 == raspberries
butts ccc raspberries ddd.ext   resolves to:  \1 == raspberries  (match)
I'm not sure it's necessary to attempt to match multiple resolves per entry. ie, no need to pull your hair out over supporting //g patterns.
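The "index of \1, then hashtable lookup" idea can be sketched as two linear passes: resolve one key per record, count the keys, then keep the records whose key occurs more than once. This uses Python's `re` instead of PCRE, and `resolve` is a hypothetical helper; only the first match per record is used, as suggested above.

```python
import re
from collections import Counter

WORD8 = re.compile(r"\b(\w{8,})\b")

def resolve(name):
    """Return capture group 1 of the first match, or None if the name does not match."""
    m = WORD8.search(name)  # first match only, no //g handling
    return m.group(1) if m else None

records = [
    "foo documents bar.ext",
    "baz documents quux.ext",
    "aaa raspberries bbb butts.ext",
    "butts ccc raspberries ddd.ext",
    "short one.ext",  # no word of 8+ characters, so it is filtered out
]

# Pass 1: resolve \1 once per record.
keys = {name: resolve(name) for name in records}
# Pass 2: count each resolved value, then keep records whose value repeats.
counts = Counter(k for k in keys.values() if k is not None)
dupes = [name for name, k in keys.items() if k is not None and counts[k] > 1]
```

No record is compared against every other record; the work is one regex search per record plus dictionary lookups.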

Another example but with multiple backrefs. We just clobber them together into a single [invalid-file-character...say-colon] delimited string.

DupeRegex:"(\w{6,}) \d{4}.*\.([^.]+)" // 6-or-more word characters, followed by a space, 4 digits, and a file extension.
Records must match the 6-or-more word characters and have the same file extension.

doctor voidtools 2020 DVD.mp4    resolves to:  \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4:::::::
mr voidtools 2019 facebook.mp4   resolves to:  \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4:::::::  (match)
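A small sketch of the colon-delimited key, assuming nine backreference slots as in the listing above. `match_key` is a hypothetical helper, and Python's `re` stands in for PCRE. Unused or unmatched groups become empty strings so every key has the same shape.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"(\w{6,}) \d{4}.*\.([^.]+)")

def match_key(name, pattern, slots=9):
    """Join capture groups \\1..\\9 with ':' (a character not allowed in file names)."""
    m = pattern.search(name)
    if m is None:
        return None
    groups = [g or "" for g in m.groups()][:slots]
    groups += [""] * (slots - len(groups))  # pad the unused slots
    return ":".join(groups)

names = [
    "doctor voidtools 2020 DVD.mp4",
    "mr voidtools 2019 facebook.mp4",
    "voidtools 2020 notes.txt",  # same word, different extension
]

by_key = defaultdict(list)
for name in names:
    key = match_key(name, PATTERN)
    if key is not None:
        by_key[key].append(name)

dupes = {key: members for key, members in by_key.items() if len(members) > 1}
```

The .txt file resolves to a different key because \2 differs, so only the two .mp4 files are reported as duplicates of each other.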

void
Site Admin
Posts: 6709
Joined: Fri Oct 16, 2009 11:31 pm

Re: Function to find Partial Name Dupes

Post by void » Sat Apr 04, 2020 2:58 am

raccoon wrote:
I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record
Everything has this information already from the previous regex: search term. It is stored in the regex matches.
Generating the startwith term from \1 would be instant.

A startwith search is really a lookup in Everything and is instant (it really is only a few instructions).

Currently I have the following search planned for a future release:
path:regex:(.*)\.mp4 fileexists:\1\.jpg

This will return mp4 files where a jpg exists with the same stem in the same folder.
This search is instant.
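The planned fileexists: behaviour can be imitated with a set of known paths: capture the stem of each mp4, build the sibling jpg path, and test membership. This is a sketch only, not Everything's code. The helper name `mp4_with_jpg`, the sample paths, the case-insensitive comparison, and the use of a full match (so the pattern is anchored at both ends) are assumptions.

```python
import re

MP4 = re.compile(r"(.*)\.mp4", re.IGNORECASE)

def mp4_with_jpg(paths):
    """Return mp4 paths for which a jpg with the same stem exists in the same folder."""
    existing = {p.lower() for p in paths}  # assumption: case-insensitive file system
    hits = []
    for p in paths:
        m = MP4.fullmatch(p)  # group 1 is the full path without the extension
        if m and (m.group(1) + ".jpg").lower() in existing:
            hits.append(p)
    return hits

paths = [
    r"C:\clips\beach.mp4",
    r"C:\clips\beach.jpg",
    r"C:\clips\city.mp4",
    r"C:\other\city.jpg",  # same stem, but a different folder
]
hits = mp4_with_jpg(paths)
```

Because the captured stem includes the folder, city.mp4 is not matched by the city.jpg that lives elsewhere.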

dupestartwith: would be similar to fileexists:, except it will never match the current file.

I've added dupestartwith: to my TODO list.

raccoon
Posts: 147
Joined: Thu Oct 18, 2018 1:24 am

Re: Function to find Partial Name Dupes

Post by raccoon » Sat Apr 04, 2020 11:42 am

Cool beans, will be fun to get my hands on!

What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary).

void
Site Admin
Posts: 6709
Joined: Fri Oct 16, 2009 11:31 pm

Re: Function to find Partial Name Dupes

Post by void » Mon Apr 06, 2020 4:08 am

raccoon wrote:
What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary).
It might work, I'm not sure where to specify the \1:\2:\3:\4...
It could be done as a search term...

I'm looking into a "duplicate view" mode with the following options:
Show all items.
Show duplicates only.
Show unique items only.
Show only one instance of each value.

These dupe modes would be based on the current sort.

For example, sort by size and select duplicate view -> show duplicates only to show only the files that share their size with at least one other file.
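Since the modes are tied to the current sort, equal values end up next to each other and each mode reduces to a rule applied to every run of equal keys. The sketch below illustrates that idea only; `duplicate_view` and the mode names are hypothetical, not Everything's options or code.

```python
from itertools import groupby

MODES = {"all", "duplicates", "unique", "one_of_each"}

def duplicate_view(items, key, mode):
    """Filter items by how often their sort key occurs, following the sort order."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    out = []
    # Sorting by the key puts equal values next to each other, so groupby sees runs.
    for _, run in groupby(sorted(items, key=key), key=key):
        run = list(run)
        if mode == "all":
            out.extend(run)
        elif mode == "duplicates" and len(run) > 1:
            out.extend(run)
        elif mode == "unique" and len(run) == 1:
            out.extend(run)
        elif mode == "one_of_each":
            out.append(run[0])
    return out

def size(item):
    return item[1]

# (name, size) pairs standing in for search results
files = [("a.txt", 10), ("b.txt", 20), ("c.txt", 10), ("d.txt", 30)]
dupes = duplicate_view(files, size, "duplicates")
uniques = duplicate_view(files, size, "unique")
one_each = duplicate_view(files, size, "one_of_each")
```

Swapping `size` for a function that returns a captured regex match would give the "sort by regex match, show duplicates only" workflow with the same code.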

There will be a new column 'regex match 1' to show the captured regex match 1. You will be able to sort by this column.

For example, search for regex:"\b(\w{8,})\b", sort by regex match 1, select duplicate view -> show duplicates only to show only the files whose captured regex match 1 is shared with at least one other file.

This has the limitation of only matching one captured regex match.

I already had a solution working for size and name. However, I think I'll rewrite it to support all columns.

raccoon
Posts: 147
Joined: Thu Oct 18, 2018 1:24 am

Re: Function to find Partial Name Dupes

Post by raccoon » Mon Apr 06, 2020 7:30 am

void wrote:
Mon Apr 06, 2020 4:08 am
It might work, I'm not sure where to specify the \1:\2:\3:\4...

...

There will be a new column 'regex match 1' to show the captured regex match 1. You will be able to sort by this column.
Basically this, but instead of just 'regex match 1', it will contain all regex matches, and they would be tokenized with a colon delimiter (invalid file character). And within this column, you look for duplicates (or uniques if that's your thing). This way the regex pattern can contain more-than-one back-reference instead of just one. And, as well, the duplicate matching doesn't have to be left-aligned to the file name or file path, since the back-reference(s) may appear anywhere in the pattern.

Here's my example again, with coloring.

DupeRegex:"(\w{6,})\s\d{4}.*\.([^.]+)"

Name                              | Regex Matches
--------------------------------- | -------------------------------------
doctor voidtools 2020 DVD.mp4     | voidtools:mp4
mr voidtools 2019 facebook.mp4    | voidtools:mp4  <-- look, a duplicate!
Quarterly Report 2008.xls         | Report:xls
Copy of Quarterly Report 2008.xls | Report:xls  <-- look, a duplicate!
Budget Finance Report 2013.xls    | Report:xls  <-- look, another duplicate!

All we're really doing here is the same as any old Regex:"pattern" search, but pulling out [all] back references and seeing if any other records share matching back references.
Last edited by raccoon on Mon Apr 06, 2020 7:37 am, edited 1 time in total.

void
Site Admin
Posts: 6709
Joined: Fri Oct 16, 2009 11:31 pm

Re: Function to find Partial Name Dupes

Post by void » Mon Apr 06, 2020 7:37 am

I'll add an 'all regex matches' column.

Then you'll be able to search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"
sort by 'all regex matches'
select duplicate view -> find duplicates.

I'll consider a duperegex: search which will automate this, ie: the sort by 'all regex matches' would be done behind the scenes.

Thanks for the suggestion.

void
Site Admin
Posts: 6709
Joined: Fri Oct 16, 2009 11:31 pm

Re: Function to find Partial Name Dupes

Post by void » Mon Mar 15, 2021 12:07 pm

This functionality has been added to Everything 1.5 Alpha

Search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"

Right click the column header, under the Search submenu, click Regular Expression Match 0.
Right click the Regular Expression Match 0 column header and click Find Regular Expression Match 0 Duplicates.
