Setting to remove duplicated results from index

sugoro
Posts: 7
Joined: Sat Nov 29, 2014 9:19 pm

Setting to remove duplicated results from index

Post by sugoro »

Use-case: when remapping paths for software like DrivePool (viewtopic.php?t=1572 and viewtopic.php?f=4&p=17171), we will often have "duplicated" results in the database. I say "duplicated" because they are not technically duplicated entries, as they belong to different drives. But, with remapping, they will map to exact duplicates in the db.

For example, we could have two drives mounted to folders Drive1 and Drive2, in a "mirror" configuration, where File.txt is duplicated to both drives, like so:
C:\Drive1\File.txt
C:\Drive2\File.txt

We then remap those, to point to the actual pooled drive, say, at D:\
Then, File.txt is accessed with D:\File.txt

In the db, we'll have D:\File.txt twice.


This setting toggle would remove the duplicates (possibly after sorting) and the db would not contain any exact duplicate entries.
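
To illustrate the request, here is a minimal Python sketch of remapping followed by a sort-then-dedupe pass. It is not how Everything works internally: the REMAP table and the remap and dedupe_sorted helpers are hypothetical names made up for this example, and the case-insensitive comparison is an assumption about Windows paths.

```python
from pathlib import PureWindowsPath

# Hypothetical remap table: physical mount folder -> pooled drive root.
REMAP = {
    PureWindowsPath(r"C:\Drive1"): PureWindowsPath("D:\\"),
    PureWindowsPath(r"C:\Drive2"): PureWindowsPath("D:\\"),
}


def remap(path: str) -> str:
    """Rewrite a physical path to its pooled path, if a remap rule applies."""
    p = PureWindowsPath(path)
    for source, target in REMAP.items():
        try:
            return str(target / p.relative_to(source))
        except ValueError:
            continue  # p is not under this source folder
    return path


def dedupe_sorted(paths: list[str]) -> list[str]:
    """Sort case-insensitively, then drop entries equal to their predecessor."""
    result: list[str] = []
    for path in sorted(paths, key=str.lower):
        if not result or result[-1].lower() != path.lower():
            result.append(path)
    return result


# Both mirror copies end up with the same pooled path after remapping.
indexed = [remap(r"C:\Drive1\File.txt"), remap(r"C:\Drive2\File.txt")]
unique = dedupe_sorted(indexed)
print(indexed)
print(unique)
```

The first print shows the pooled path twice, the second shows it once, which is the behaviour the proposed toggle would give.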


Thanks for reading!
void
Developer
Posts: 15098
Joined: Fri Oct 16, 2009 11:31 pm

Re: Setting to remove duplicated results from index

Post by void »

Is excluding one of the mirror drives possible? eg: C:\Drive2

To exclude a folder in Everything:
  • In Everything, from the Tools menu, click Options.
  • Click the Exclude tab on the left.
  • Click Add Folder....
  • Select C:\Drive2 and click OK.
  • Click OK.
sugoro
Posts: 7
Joined: Sat Nov 29, 2014 9:19 pm

Re: Setting to remove duplicated results from index

Post by sugoro »

void wrote:Is excluding one of the mirror drives possible? eg: C:\Drive2

To exclude a folder in Everything:
  • In Everything, from the Tools menu, click Options.
  • Click the Exclude tab on the left.
  • Click Add Folder....
  • Select C:\Drive2 and click OK.
  • Click OK.

Yes, for simple cases. It won't work very well for more complicated duplication scenarios. I have specific duplication rules, to maximize space (no point duplicating backups that are already stored in another location, offsite).
Some folders are in 4 drives, others in 3, others in 2.

Also, you set rules like "keep 3 copies of this folder's contents" but you usually don't tell the program to "keep those folders in those 3 drives". It will place the files in whichever drive it determines to be the best, and files can be moved to other drives during its balancing routine.

Because of this, there's no way to ignore "this folder, on these drives, except this one", since parts of a folder will live on different drives, depending on how many drives you have in the pool and your duplication/placement rules.



Thanks for the reply!
dlong500
Posts: 13
Joined: Mon Sep 14, 2020 6:49 pm

Re: Setting to remove duplicated results from index

Post by dlong500 »

@void Adding a feature to hide duplicate full paths would be extremely useful in a complex configuration using pooling software like DrivePool. Excluding specific disks won't help because DrivePool handles its own duplication algorithms (disks aren't simple mirrors). But it seems like it should be fairly simple to track duplicated index entries in such a scenario, because the full path, size, and date will be exactly the same for duplicate files on drives that have been mapped to a virtual pooled drive.

For example, let's say we have drive P: and drive Q: representing volumes on physical disks, and we remap both of those to a virtual drive X:

If we have a file (test.txt) that exists on:
P:\PoolPart.xxx\test.txt
Q:\PoolPart.xxx\test.txt

the Everything index will show:
X:\test.txt
X:\test.txt

Couldn't there be a way to detect a duplicated index entry so we could hide one (or more) of the identical rows in the GUI?
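
To make the idea concrete, here is a rough Python sketch that treats two rows as the same file when their full path, size, and date all match. This is only an illustration of the suggestion, not Everything's code: the Entry class, the hide_duplicates function, and the sizes and timestamps are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    full_path: str      # path after remapping to the virtual drive
    size: int           # size in bytes (illustrative value)
    date_modified: int  # any comparable timestamp (illustrative value)


def hide_duplicates(entries: list[Entry]) -> list[Entry]:
    """Keep the first row for each (full path, size, date) combination."""
    seen: set[tuple[str, int, int]] = set()
    visible: list[Entry] = []
    for entry in entries:
        # Assumption: paths compare case-insensitively, as on NTFS by default.
        key = (entry.full_path.lower(), entry.size, entry.date_modified)
        if key not in seen:
            seen.add(key)
            visible.append(entry)
    return visible


rows = [
    Entry(r"X:\test.txt", 1024, 1600000000),   # copy indexed from P:
    Entry(r"X:\test.txt", 1024, 1600000000),   # copy indexed from Q:
    Entry(r"X:\other.txt", 10, 1600000500),
]
visible = hide_duplicates(rows)
print([entry.full_path for entry in visible])
```

Including size and date in the key means two rows are only collapsed when they really look like copies of one file, not just when the remapped paths collide.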
dlong500
Posts: 13
Joined: Mon Sep 14, 2020 6:49 pm

Re: Setting to remove duplicated results from index

Post by dlong500 »

Some of these posts are obscuring the original issue. The point of this thread is using Everything with storage pooling software like DrivePool. The FAQ covering duplicated results doesn't address the issues with pooling software, and while a folder index can technically be considered a workaround, it pretty much defeats the purpose of using Everything since you lose fast NTFS indexing. Remapping the volumes aggregates the separate drive indexes so that Everything correctly shows the virtual pooled drive, so that part is working great, but of course it shows multiple entries for the same file when there are redundant copies in the storage pool.

If a feature could be added to deduplicate the index itself when using remapped volumes that would fix the problem entirely and we wouldn't need to sacrifice the speed of NTFS journaling.
void
Developer
Posts: 15098
Joined: Fri Oct 16, 2009 11:31 pm

Re: Setting to remove duplicated results from index

Post by void »

If Everything is showing duplicated results for a single drive, please see: Duplicated results.

Moved unrelated posts here: Duplicated results
void
Developer
Posts: 15098
Joined: Fri Oct 16, 2009 11:31 pm

Re: Setting to remove duplicated results from index

Post by void »

I have recently added a distinct: search function to Everything 1.5.

Please try including distinct: sort:"full path" in your search.

The distinct: search lists only unique files based on the current sort (with a full path sort, it removes duplicated full paths from the results).

It is important to specify the sort with distinct:
You can change the sort after searching. Any duplicated results will remain removed.
Double click DUPE in the status bar to clear the distinct: search.
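
One way to read "unique files based on the current sort" is sketched below in Python: sort the rows by the active sort key, then keep one row per run of equal keys. This is an assumption about the behaviour for illustration only, not Everything's actual implementation, and distinct, by_full_path, and by_name are hypothetical names.

```python
from itertools import groupby
from typing import Callable


def distinct(rows: list[str], sort_key: Callable[[str], str]) -> list[str]:
    """Sort by sort_key, then keep the first row of each run of equal keys."""
    ordered = sorted(rows, key=sort_key)
    return [next(group) for _, group in groupby(ordered, key=sort_key)]


def by_full_path(path: str) -> str:
    return path.lower()


def by_name(path: str) -> str:
    # The file name is whatever follows the last backslash.
    return path.lower().rsplit("\\", 1)[-1]


rows = [r"X:\a\test.txt", r"X:\b\test.txt", r"X:\a\test.txt"]
unique_paths = distinct(rows, by_full_path)
unique_names = distinct(rows, by_name)
print(unique_paths)
print(unique_names)
```

Under this reading, a full path sort keeps both X:\a\test.txt and X:\b\test.txt, while a name sort would collapse them into one row, which is why the sort given alongside distinct: matters.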

There is a performance hit with sorting by full path, so combine distinct: with other search parameters for the best performance.

To improve full path sorting performance:
  • In Everything, from the Tools menu, click Options.
  • Click the Properties tab on the left.
  • Click Add....
  • Select Full Path and click OK.
  • Check Fast sort.
  • Click OK.
Please let me know if this search helps.
dlong500
Posts: 13
Joined: Mon Sep 14, 2020 6:49 pm

Re: Setting to remove duplicated results from index

Post by dlong500 »

@void Thanks so much for addressing this issue!

Adding distinct: in front of any search I make DOES seem to resolve my issue with using remapped drives (in the context of using DrivePool with the custom parameters specified in this thread). I see only one line in the index for each file in a DrivePool drive even when there is redundancy specified within DrivePool settings.

However, that appears to be ALL that is necessary. I don't need to use any path sorting, and the performance doesn't seem to suffer either. But if I add a new file within a DrivePool drive that has redundancy/duplication after the search then I see duplicates for the new file in the list. If I double click on "DUPE" in the status bar the duplicate listing for the newly created file goes away too.

Could there be any way to add an option for a permanent "distinct" setting? And also the ability for the distinct option to apply in realtime for any new additions to the index? If that were possible I think everything would work perfectly under my scenario.

I'm happy to provide clarification or feedback if you want, or to test any new builds.

Thanks again for all that you do!
void
Developer
Posts: 15098
Joined: Fri Oct 16, 2009 11:31 pm

Re: Setting to remove duplicated results from index

Post by void »

Consider adding distinct: sort:full-path to your Everything filter:
  • In Everything, from the Search menu, click Organize filters....
  • Select Everything and click Edit....
  • Change the Search to:
    distinct: sort:full-path
  • Click OK.
  • Click OK.
Or, consider adding a new filter:
  • In Everything, from the Search menu, click Add to filters....
  • Change the Name to:
    Distinct
  • Change the Search to:
    distinct: sort:full-path
  • Click OK.
Filters can be activated from the Search menu, Filter bar (View -> Filters), right clicking the status bar, filter macro or filter keyboard shortcut.
dlong500
Posts: 13
Joined: Mon Sep 14, 2020 6:49 pm

Re: Setting to remove duplicated results from index

Post by dlong500 »

@void, adding distinct to the base filter certainly improves on having to enter it each time, but it also makes the filtering system more complex (every filter would need to include the distinct parameter in addition to whatever other parameters are specified). That's a minor gripe, but still something to consider.

Of more importance is the issue of the distinct parameter only working as a snapshot and not in real time. Any new files that match the search will still show up duplicated on pooled storage with redundant file copies. The search has to be manually refreshed each time new files are created to get new duplicate results to go away. If there could be a way to make distinct operate on any newly indexed results (when monitor changes is active) in addition to the initial snapshot that would resolve the issue.

I guess the bigger issue for me is wondering why anyone would ever want a duplicated result to show up in the index at all in the context of pooled storage with remapped NTFS drives pointing to a single pooled virtual drive. What use case would there be to show completely duplicate lines in the result list? I certainly understand that many people who don't have a complicated storage situation wouldn't want the performance hit of forcing a dedupe operation, but for situations like mine it would greatly reduce the complexity to simply have a single "distinct" option in the Indexes > NTFS section for each physical drive such that it would deduplicate results in real time. This would eliminate the need to mess with filters at all for deduplication and keep the filtering more simplified for other "real" filtering choices. It should of course be disabled by default to avoid deduplication in more simple scenarios when there is no need.
void
Developer
Posts: 15098
Joined: Fri Oct 16, 2009 11:31 pm

Re: Setting to remove duplicated results from index

Post by void »

Thanks for the reply dlong500,

distinct: sort:fullpath is not the best option for deduping pooled storage.
A better solution is needed.


distinct: is not real-time.
There would be a large performance hit for Everything to re-check the distinct state for all duplicates on every single file change.
I will consider adding an option to do this.
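
For discussion, here is a hypothetical Python sketch of what incremental (real-time) dedupe by full path could look like: keep a count of copies per path and only add or remove a visible row when the count crosses zero. DistinctView and its methods are made-up names, this is not Everything's design, and it says nothing about the real cost at index scale or about distinct states that depend on the current sort.

```python
class DistinctView:
    """Tracks how many indexed copies exist for each full path."""

    def __init__(self) -> None:
        self._copies: dict[str, int] = {}

    def on_file_added(self, full_path: str) -> bool:
        """Return True if a new row should be shown for this path."""
        key = full_path.lower()
        self._copies[key] = self._copies.get(key, 0) + 1
        return self._copies[key] == 1

    def on_file_removed(self, full_path: str) -> bool:
        """Return True if the row for this path should be removed."""
        key = full_path.lower()
        if key not in self._copies:
            return False  # unknown path, nothing to hide
        self._copies[key] -= 1
        if self._copies[key] == 0:
            del self._copies[key]
            return True
        return False

    def visible_count(self) -> int:
        return len(self._copies)


view = DistinctView()
# Two physical copies of the same pooled file arrive as two change events.
added = [view.on_file_added(r"X:\new.txt"), view.on_file_added(r"X:\new.txt")]
# Removing one copy keeps the row; removing the last copy drops it.
removed = [view.on_file_removed(r"X:\new.txt"), view.on_file_removed(r"X:\new.txt")]
print(added, removed, view.visible_count())
```

The trade-off in this sketch is extra memory for the per-path counts in exchange for constant-time work per change event.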
dlong500
Posts: 13
Joined: Mon Sep 14, 2020 6:49 pm

Re: Setting to remove duplicated results from index

Post by dlong500 »

void wrote: Wed Sep 08, 2021 11:24 am distinct: sort:fullpath is not the best option for deduping pooled storage.
A better solution is needed.


distinct: is not real-time.
There would be a large performance hit for Everything to re-check the distinct state for all duplicates on every single file change.
I will consider adding an option to do this.
Just pinging this thread again to see if you've thought any more about a better solution for deduping pooled storage. The app is still useful to me even with duplicate results, but they certainly clutter up the interface and make it harder to use. I would love some type of optional real-time index dedupe (optional so any performance penalty wouldn't be forced onto users who don't want it).
void
Developer
Posts: 15098
Joined: Fri Oct 16, 2009 11:31 pm

Re: Setting to remove duplicated results from index

Post by void »

The Everything Server will now dedupe filenames.
dlong500
Posts: 13
Joined: Mon Sep 14, 2020 6:49 pm

Re: Setting to remove duplicated results from index

Post by dlong500 »

void wrote: Sat Jun 10, 2023 10:03 pm The Everything Server will now dedupe filenames.
Just following up here to say thanks. I've been testing v1.5 alpha with Everything Server for a few weeks now to dedupe file paths on a DrivePool volume and it has been working great! The configuration to get everything set up is a bit complex, but it does work well.