Function to find Partial Name Dupes
One of the tasks I keep running up against is locating partial name dupes. This is tricky because my search string isn't necessarily a verbatim string; the question is whether multiple files or folders share the same substring of a given length or position.
Simple Method: Find all records with duplicates matching the first N characters. E.g., all files with the same first 15 characters.
DupeLeft:15
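To make the simple method concrete, here is a minimal Python sketch of what DupeLeft:15 might compute. The function name dupe_left, the case-insensitive comparison, and the choice to skip names shorter than N are all assumptions for illustration, not anything Everything is known to do.

```python
from collections import defaultdict


def dupe_left(names, n):
    """Group names by their first n characters and keep groups of 2 or more."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: names shorter than n characters are never candidates.
        if len(name) >= n:
            # Assumption: comparison is case-insensitive, as on Windows.
            groups[name[:n].lower()].append(name)
    return {prefix: found for prefix, found in groups.items() if len(found) > 1}


names = [
    "Quarterly Report 2008.xls",
    "Quarterly Report 2009.xls",
    "Budget Finance Report 2013.xls",
]
print(dupe_left(names, 15))
# {'quarterly repor': ['Quarterly Report 2008.xls', 'Quarterly Report 2009.xls']}
```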
Advanced Method: Allow the user to compose a regular expression pattern that defines the length and composition of the substring a record must match to be compared against other records. The portions of the pattern in backrefs (capture groups) are resolved and matched against other records for dupe comparison; the remaining portions of the pattern act as generic qualifier filtering.
DupeRegex:"^(.{15,})"
DupeRegex:"^(.*)(?:19|20)\d\d"
In the above example, any files whose names contain /(?:19|20)\d\d/ are compared for duplication of the portion of the name preceding that number (the /(.*)/ backref). The number (year) need not match between duplicates, only the substring to the left of it.
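A rough Python sketch of the year example, using Python's re module as a stand-in for whatever regex engine the search would use. The function name dupe_regex and the sample file names are made up for illustration.

```python
import re
from collections import defaultdict


def dupe_regex(names, pattern):
    """Group names by the text captured in group 1; keep groups of 2 or more."""
    rx = re.compile(pattern)
    groups = defaultdict(list)
    for name in names:
        m = rx.search(name)
        if m:  # names that do not match the pattern are filtered out entirely
            groups[m.group(1)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


names = [
    "Holiday Video 2019.mp4",
    "Holiday Video 2020 (copy).mp4",
    "Other Clip 2020.mp4",
    "Notes.txt",
]
print(dupe_regex(names, r"^(.*)(?:19|20)\d\d"))
# {'Holiday Video ': ['Holiday Video 2019.mp4', 'Holiday Video 2020 (copy).mp4']}
```

Note that the greedy (.*) captures everything up to the last year-like number in the name, including the trailing space.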
Thoughts?
Re: Function to find Partial Name Dupes
I like the dupeleft: idea.
The DupeRegex: search could work, although performing a regex search for each filename would be very slow.
Maybe something like:
regex:"^(.*)(?:19|20)\d\d" dupestartwith:\1
- The result would have to match the first regex search, and a duplicate would have to exist that starts with the first captured sub-expression.
- A startwith search for each filename would be instant.
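One way to picture the regex plus dupestartwith:\1 combination is a binary search over a sorted name list. This is only a sketch under assumptions: the function name dupe_startwith is hypothetical, names are assumed to be unique, and a sorted Python list stands in for whatever index Everything actually keeps.

```python
import re
from bisect import bisect_left


def dupe_startwith(names, pattern):
    """Return names that match pattern and whose capture group 1 is also the
    start of at least one OTHER name."""
    rx = re.compile(pattern)
    ordered = sorted(names)  # built once; each prefix lookup is a binary search
    results = []
    for name in names:
        m = rx.search(name)
        if not m:
            continue
        prefix = m.group(1)
        # In sorted order, all names starting with prefix form one contiguous
        # run that begins at the first entry >= prefix.
        i = bisect_left(ordered, prefix)
        while i < len(ordered) and ordered[i].startswith(prefix):
            if ordered[i] != name:  # never count the current file itself
                results.append(name)
                break
            i += 1
    return results


names = [
    "Holiday Video 2019.mp4",
    "Holiday Video 2020 (copy).mp4",
    "Other Clip 2020.mp4",
    "Notes.txt",
]
print(dupe_startwith(names, r"^(.*)(?:19|20)\d\d"))
# ['Holiday Video 2019.mp4', 'Holiday Video 2020 (copy).mp4']
```

With unique names the inner loop inspects at most two entries per file, so the per-file cost is dominated by the regex match and one binary search.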
Thank you for the suggestions.
Re: Function to find Partial Name Dupes
Aye, I recognize the regex thing would have to be a multi-pass recursion. Though, I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record, and then all records scanned again in kind. Seems basically like my idea, but limiting \1 to the left-side of the string. Maybe some savings if plain string compare is faster than PCRE. But, perhaps, just creating an index of the value of \1\2\3\4\5... for each record is enough, and just fast search / sort / compare / hashtable lookup those.
I'm not sure it's necessary to attempt to match multiple resolves per entry, i.e., no need to pull your hair out over supporting //g patterns.
DupeRegex:"\b(\w{8,})\b"
Records:
foo documents bar.ext resolves to: \1 == documents
baz documents quux.ext resolves to: \1 == documents (match)
aaa raspberries bbb butts.ext resolves to: \1 == raspberries
butts ccc raspberries ddd.ext resolves to: \1 == raspberries (match)
Another example, but with multiple backrefs. We just join them together into a single string delimited by a character that is invalid in file names, say a colon.
DupeRegex:"(\w{6,}) \d{4}.*\.([^.]+)" // 6-or-more word characters, followed by a space, 4 digits, anything else, and a file extension.
Records must match the 6-or-more word characters and have the same file extension.
doctor voidtools 2020 DVD.mp4 resolves to: \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4:::::::
mr voidtools 2019 facebook.mp4 resolves to: \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4::::::: (match)
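The index-and-hashtable idea from this post could look roughly like the following in Python. The helper names capture_key and find_capture_dupes are mine, and the fixed nine slots simply mirror the \1 through \9 layout shown above; unused or unmatched groups resolve to empty strings.

```python
import re
from collections import defaultdict


def capture_key(match, slots=9, sep=":"):
    """Join capture groups 1..slots with sep, using '' for missing groups."""
    parts = []
    for i in range(1, slots + 1):
        value = match.group(i) if i <= match.re.groups else None
        parts.append(value or "")
    return sep.join(parts)


def find_capture_dupes(names, pattern):
    """Hash every matching name by its joined captures; keep shared keys."""
    rx = re.compile(pattern)
    index = defaultdict(list)
    for name in names:
        m = rx.search(name)  # one match per name, no //g style repetition
        if m:
            index[capture_key(m)].append(name)
    return {key: found for key, found in index.items() if len(found) > 1}


names = ["doctor voidtools 2020 DVD.mp4", "mr voidtools 2019 facebook.mp4"]
print(find_capture_dupes(names, r"(\w{6,}) \d{4}.*\.([^.]+)"))
# {'voidtools:mp4:::::::': ['doctor voidtools 2020 DVD.mp4', 'mr voidtools 2019 facebook.mp4']}
```

Each name is matched once and looked up once, so the whole pass is linear in the number of records rather than a rescan of all records per record.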
Re: Function to find Partial Name Dupes
"I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record"
Everything has this information already from the previous regex: search term. It is stored in the regex matches.
Generating the startwith term from \1 would be instant.
A startwith search is really a lookup in Everything and is instant (it really is only a few instructions).
Currently I have the following search planned for a future release:
path:regex:(.*)\.mp4 fileexists:\1\.jpg
This will return mp4 files where a jpg exists with the same stem in the same folder.
This search is instant.
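As an illustration only (this is not how Everything implements it), the planned fileexists: search can be modelled in Python as a regex capture followed by a set lookup. The function name mp4_with_jpg, the end-of-string anchor, the case-insensitive comparison, and the sample paths are assumptions.

```python
import re


def mp4_with_jpg(paths):
    """Return .mp4 paths for which a .jpg with the same stem and folder exists."""
    existing = {p.lower() for p in paths}  # set membership stands in for the index lookup
    rx = re.compile(r"(.*)\.mp4$", re.IGNORECASE)  # assumption: anchored at the end
    found = []
    for path in paths:
        m = rx.match(path)
        # Group 1 is the full path minus the extension, so the folder is included.
        if m and (m.group(1) + ".jpg").lower() in existing:
            found.append(path)
    return found


paths = [
    r"C:\media\a.mp4",
    r"C:\media\a.jpg",
    r"C:\media\b.mp4",
    r"C:\other\b.jpg",
]
print(mp4_with_jpg(paths))
# ['C:\\media\\a.mp4']
```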
dupestartwith: would be similar to fileexists:, except it will never match the current file.
I've added dupestartwith: to my TODO list.
Re: Function to find Partial Name Dupes
Cool beans, will be fun to get my hands on!
What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary).
Re: Function to find Partial Name Dupes
It might work. I'm not sure where to specify the \1:\2:\3:\4...
It could be done as a search term...
I'm looking into a "duplicate view" mode with the following options:
Show all items.
Show duplicates only.
Show unique items only.
Show only one instance of each value.
These dupe modes would be based on the current sort.
For example, sort by size and select duplicate view -> show duplicates only to show only files that share their size with at least one other file.
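The four view modes can be sketched in Python as a pass over runs of equal values in the sorted results. The mode names, the function name duplicate_view, and the (name, size) sample tuples are placeholders of mine, not the actual option names.

```python
from itertools import groupby

MODES = ("all", "duplicates", "unique", "distinct")


def duplicate_view(items, key, mode="all"):
    """Sort items by key, then filter each run of equal keys according to mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    ordered = sorted(items, key=key)  # the "current sort"
    shown = []
    for _, run in groupby(ordered, key=key):
        run = list(run)
        if mode == "all":
            shown.extend(run)
        elif mode == "duplicates" and len(run) > 1:
            shown.extend(run)  # values that occur more than once
        elif mode == "unique" and len(run) == 1:
            shown.extend(run)  # values that occur exactly once
        elif mode == "distinct":
            shown.append(run[0])  # one instance of each value
    return shown


files = [("a.txt", 10), ("b.txt", 20), ("c.txt", 10), ("d.txt", 30)]
by_size = lambda f: f[1]
print(duplicate_view(files, by_size, "duplicates"))
# [('a.txt', 10), ('c.txt', 10)]
```

Because the filtering only compares neighbours in the sorted list, the same routine works for any sortable column, including a captured regex match.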
There will be a new column 'regex match 1' to show the captured regex match 1. You will be able to sort by this column.
For example, search for regex:"\b(\w{8,})\b", sort by regex match 1, and select duplicate view -> show duplicates only to show files that share the same captured regex match 1.
This has the limitation of only matching one captured regex match.
I already had a solution working for size and name. However, I think I'll rewrite it to support all columns.
Re: Function to find Partial Name Dupes
Basically this, but instead of just 'regex match 1', it will contain all regex matches, and they would be tokenized with a colon delimiter (invalid file character). And within this column, you look for duplicates (or uniques if that's your thing). This way the regex pattern can contain more-than-one back-reference instead of just one. And, as well, the duplicate matching doesn't have to be left-aligned to the file name or file path, since the back-reference(s) may appear anywhere in the pattern.
Here's my example again, with coloring.
DupeRegex:"(\w{6,})\s\d{4}.*\.([^.]+)"
Name | Regex Matches
------------------------------------------- | --------------------------------------------------
doctor voidtools 2020 DVD.mp4 | voidtools:mp4
mr voidtools 2019 facebook.mp4 | voidtools:mp4 <-- look, a duplicate!
Quarterly Report 2008.xls | Report:xls
Copy of Quarterly Report 2008.xls | Report:xls <-- look, a duplicate!
Budget Finance Report 2013.xls | Report:xls <-- look, another duplicate!
All we're really doing here is the same as any old Regex:"pattern" search, but pulling out [all] back references and seeing if any other records share matching back references.
Last edited by raccoon on Mon Apr 06, 2020 7:37 am, edited 1 time in total.
Re: Function to find Partial Name Dupes
I'll add an 'all regex matches' column.
Then you'll be able to search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"
sort by 'all regex matches'
select duplicate view -> find duplicates.
I'll consider a duperegex: search which will automate this, i.e., the sort by 'all regex matches' would be done behind the scenes.
Thanks for the suggestion.
Re: Function to find Partial Name Dupes
This functionality has been added to Everything 1.5 Alpha.
Search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"
Right-click the column header and, under the Search submenu, click Regular Expression Match 0.
Right-click the Regular Expression Match 0 column header and click Find Regular Expression Match 0 Duplicates.