is there way to find dupes based on 'size on disk' value?

Discussion related to "Everything" 1.5.
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

I have two files that are basically identical, and their 'size on disk' values match, but "sizedupe:" doesn't pull them as a match. Is there some search function that would match this value?
void
Developer
Posts: 19870
Joined: Fri Oct 16, 2009 11:31 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by void »

dupe:size-on-disk


dupe:

This will be really slow as Everything will need to gather size on disk for all results.
Combine with other search filters for the best performance.

For example:

c:\folder\ ext:mp4;mp3;jpg dupe:size-on-disk
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

void wrote: Tue Sep 16, 2025 11:16 pm
dupe:size-on-disk


dupe:

This will be really slow as Everything will need to gather size on disk for all results.
Combine with other search filters for the best performance.

For example:

c:\folder\ ext:mp4;mp3;jpg dupe:size-on-disk
Ah! Thanks very much. I'll add size on disk to the indexed attributes
void
Developer
Posts: 19870
Joined: Fri Oct 16, 2009 11:31 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by void »

Another way to do this:

add-column:a a:=INT(($size:+4096-1)/4096)*4096 dupe:a


This will put sizes into buckets and you can specify the bucket size.

Typically, the cluster size is 4096, making the above the same as size on disk.
With the advantage of not having to go to disk to gather the size on disk value.
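
For illustration, the bucketing arithmetic above can be sketched in Python. This is only a sketch of the rounding in the formula, not Everything's own code; the function name `size_on_disk_estimate` is made up for the example, and the result only approximates the real size on disk when files are uncompressed, non-sparse, and not stored inside the MFT record.

```python
def size_on_disk_estimate(size: int, cluster: int = 4096) -> int:
    """Round a file size up to the next multiple of the cluster size.

    Mirrors INT((size + cluster - 1) / cluster) * cluster using integer
    division. The name and the default cluster size are assumptions.
    """
    if size < 0 or cluster <= 0:
        raise ValueError("size must be >= 0 and cluster must be > 0")
    return (size + cluster - 1) // cluster * cluster


if __name__ == "__main__":
    for size in (0, 1, 4095, 4096, 4097):
        print(size, "->", size_on_disk_estimate(size))
```

Every size from 1 to 4096 bytes lands in the same 4096 bucket, so two files matching on this value only tells you their sizes fall within the same cluster-sized range.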
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

void wrote: Thu Sep 18, 2025 2:24 am Another way to do this:

add-column:a a:=INT(($size:+4096-1)/4096)*4096 dupe:a


This will put sizes into buckets and you can specify the bucket size.

Typically, the cluster size is 4096, making the above the same as size on disk.
With the advantage of not having to go to disk to gather the size on disk value.
So is there a way to have the results be grouped by size AND play length? As it is, if I sort by size, it returns size dupes that are different lengths (and thus not duplicate files), or I can sort by length, but they aren't the same size. My search is turning up a lot of dupes, but there are still a LOT of false positives.
Herkules97
Posts: 220
Joined: Tue Oct 08, 2019 6:42 am

Re: is there way to find dupes based on 'size on disk' value?

Post by Herkules97 »

bruce656 wrote: Thu Oct 02, 2025 8:52 pm So is there a way to have the results be grouped by size AND play length? As it is, if I sort by size, it returns size dupes that are different lengths (and thus not duplicate files), or I can sort by length, but they aren't the same size. My search is turning up a lot of dupes, but there are still a LOT of false positives.
How well does it fare if you dupe both size-on-disk and play duration?

size-on-diskdupe: lengthdupe: in the same search? Or dupe:size-on-disk;length if you like that method more.

If too many of the results are unique files that merely share a size on disk or a length, you can also sort by one of those two columns and then shift+left-click the other column to use it as a secondary sort.

If you sorted by size on disk, hold shift and left-click the length column. I don't think there is a tertiary sorting shortcut, but I read recently that you can do that in advanced search or wherever. More than 3 might be possible to code, but EBV does not support that currently. Idk if it's just a "When would you need more than 3" or if both EBV and foobar2000 are limited to 3 because they use a Windows thing that is limited to 3 by Microsoft.

This wouldn't eliminate uniques, but I also don't see why you would only try to find based on these two. Using timestamps could help, like adding date-modifieddupe or dupe:size-on-disk;length;date-modified. Then secondary sorting (Hold shift + left-click column) date modified?

Size on disk and allocation size de-duplicating can be iffy because they use specific size increments, like 4096 for example. So any file that is smaller than 4096 would have that size-on-disk and thus you'd find any file under 4096b as a duplicate. On my OS device, 400K out of 1.8mil files are currently with size on disk 4096 bytes.

Allocation size is different from size on disk. I think it might be the more accurate one, sorting by size on disk I find many with 0b. But allocation size isn't 0. One file has size 371b, size on disk 0b and allocation size 376b. It is not an empty file. All 3 have to be inaccurate considering there are files where all 3 are 0. I imagine both the file itself existing and the filename occupy bytes. I wonder if there is a property that tells you the full size of a file..Maybe NTFS doesn't support that.
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

Herkules97 wrote: Fri Oct 03, 2025 4:16 pm
How well does it fare if you dupe both size-on-disk and play duration?

size-on-diskdupe: lengthdupe: in the same search? Or dupe:size-on-disk;length if you like that method more.
Doesn't really help too much. I guess it's a grouping issue that I'm having? If you sort by length, it will group all videos that have the same length, regardless of size. If you sort by size, it will group all videos with the same size, but with different play lengths. If you CTRL left click on a secondary column as you suggested, it will just change the sorting from ascending to descending within each group.

Size on disk and allocation size de-duplicating can be iffy because they use specific size increments, like 4096 for example. So any file that is smaller than 4096 would have that size-on-disk and thus you'd find any file under 4096b as a duplicate. On my OS device, 400K out of 1.8mil files are currently with size on disk 4096 bytes.
Yeah, I will use both dupe:size as well as dupe:size on disk, and they return different results, which is useful because they each find dupes the other doesn't. The problem I have is that I might have, say, 20 files that all have the same size/size on disk but only ONE pair of dupes within them. But if I could group by size AND play duration, it would ideally only display the one pair, you see? Or I just want to know that for every result that returns with dupe:size, there is a corresponding dupe:length result, which would produce the duplicate pairs that I'm looking for and eliminate all the false positives.
void
Developer
Posts: 19870
Joined: Fri Oct 16, 2009 11:31 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by void »

So is there a way to have the results be grouped by size AND play length?
dupe:size;length




dupe:size;sha256
will find duplicated content.
(very slow)



Finding duplicates in Everything
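
For illustration, matching on two properties at once amounts to grouping on a combined key: a file only counts as a dupe if another file shares both its size and its length. The Python sketch below shows that idea only and is not how Everything implements dupe:. The record fields (`name`, `size`, `length`) and the function name `dupe_groups` are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Iterable, Mapping, Sequence


def dupe_groups(records: Iterable[Mapping], keys: Sequence[str]) -> list[list[Mapping]]:
    """Group records that agree on every property listed in keys.

    Only groups with two or more members are returned, so a file that
    shares its size with one file and its length with a different file
    is not reported.
    """
    groups = defaultdict(list)
    for rec in records:
        # The combined key is a tuple such as (size, length).
        groups[tuple(rec[k] for k in keys)].append(rec)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    files = [
        {"name": "a.mp4", "size": 1000, "length": 60},
        {"name": "b.mp4", "size": 1000, "length": 60},
        {"name": "c.mp4", "size": 1000, "length": 75},  # same size, other length
        {"name": "d.mp4", "size": 2000, "length": 75},  # same length, other size
    ]
    for group in dupe_groups(files, ["size", "length"]):
        print([rec["name"] for rec in group])
```

With the sample data only a.mp4 and b.mp4 are reported, whereas grouping on size alone would also pull in c.mp4.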
Herkules97
Posts: 220
Joined: Tue Oct 08, 2019 6:42 am

Re: is there way to find dupes based on 'size on disk' value?

Post by Herkules97 »

bruce656 wrote: Fri Oct 03, 2025 6:37 pm Yeah, I will use both dupe:size as well as dupe:size on disk, and they return different results, which is useful because they each find dupes the other doesn't. The problem I have is that I might have, say, 20 files that all have the same size/size on disk but only ONE pair of dupes within them. But if I could group by size AND play duration, it would ideally only display the one pair, you see? Or I just want to know that for every result that returns with dupe:size, there is a corresponding dupe:length result, which would produce the duplicate pairs that I'm looking for and eliminate all the false positives.
You have files where some have the same length, some same names and some same sizes.
Regardless of which property you sort for, it will still find different files with the same length, or name or size sorted as if they're duplicates..

Void suggests using a hash system, which if sorted by should ensure you only find the duplicate pairing in the 20 individual files. That could be one way. If there aren't that many files, it would work fine to do it temporarily for only those files instead of adding the property to the entire instance.

You can also add the column for the hash you want and put it at the most-left of the columns. If you don't scroll sideways, you can then scroll the result list and EBV will read the property as you go, as the column is visible. Maybe dupe:sha256 does this automatically without needing to scroll, I'm not trying it because lazy.
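
To make the hash idea concrete, here is a rough Python sketch of size-then-hash matching outside of Everything. It only hashes files whose size is shared with at least one other file, which is why combining size with sha256 is less work than hashing everything. The helper names `sha256_of` and `content_dupes` are made up for the example, and it finds byte-for-byte identical files only.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_dupes(paths: list[str]) -> list[list[str]]:
    """Return groups of paths whose size and SHA-256 both match."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, skip hashing
        for path in same_size:
            by_hash[(size, sha256_of(path))].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `content_dupes` on the 20 same-sized files would return only the pair whose content is actually identical.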

Maybe another possibility is using another program for this specific purpose. I have a folder I want to de-duplicate that has lots of small duplicates.
I still haven't come to the point of re-installing it, but it's Duplicate Cleaner Pro. It has found different files counted as duplicates though, so it's not 100%..Maybe it is if you use a heavier hash algo. I think I used md5? Or none at all, my memory is shit.

It's around 800K files, maybe there is a way with EBV to remove all but one duplicate of each by excluding one duplicate for any duplicate groups, then removing it all wholesale. So far I've just done it manually, select a bunch of files and then click on one of them within each dupe line while holding ctrl to de-select one of each duplicate. This way you can also see if EBV thinks different files are duplicates..But I only did this for a folder I already have copied elsewhere and eventually it got so boring I just clicked on the bottom file for each dupe line, without caring about whether or not they're actually all duplicates within those groups. That could be another possibility? Copying the whole file list elsewhere, de-duplicating casually and then using a method to find what might be unique on the original file list. That could be a way to find if your casual de-duplicating removed any uniques or not. I think dupe-from: can be used here, with the original folder being what you are adding after the colon. That can find duplicates, then !dupe-from: maybe finds uniques that you can manually look through if there are any to see if you did it right.

I will probably use that for the 800K files' folder, use duplicate cleaner pro or find a way to hide one duplicate per duplicate group and then wholesale delete using EBV. !dupe-from:[folder path] to find uniques and see if there is anything there. This does require you to use dupe: as well, but you already are.
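
The "keep one per duplicate group, remove the rest" step could look roughly like this in Python if the duplicate groups were exported as lists of paths. This is a sketch under that assumption; `redundant_copies` is a made-up name, the rule for which copy to keep (here the alphabetically first path) is arbitrary, and the function only returns a list and deletes nothing.

```python
from typing import Callable, Iterable, Sequence


def redundant_copies(
    groups: Iterable[Sequence[str]],
    keep: Callable[[Sequence[str]], str] = min,
) -> list[str]:
    """Return every path except one keeper from each duplicate group.

    keep chooses the path to retain; the default keeps the path that
    sorts first. Groups with fewer than two paths contribute nothing.
    """
    to_remove = []
    for group in groups:
        if len(group) < 2:
            continue
        keeper = keep(group)
        to_remove.extend(path for path in group if path != keeper)
    return to_remove


if __name__ == "__main__":
    groups = [["b/x.mp3", "a/x.mp3"], ["c/y.mp3"], ["d/z.mp3", "e/z.mp3", "f/z.mp3"]]
    print(redundant_copies(groups))
```

Reviewing the returned list before deleting anything gives the same safety check as comparing against a copied folder.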

I don't know what the file list you have is looking like, but if they are actually proper copies and you've not used Windows Explorer's copy process and instead something better like robocopy with the timestamps included, like /COPY:DT, then copies should have the same size, date created and date modified at the least. If however you've copied with Windows Explorer, skip date created dupe finding as Explorer refreshes that for the copied file. Unless you're on Windows 11 and they've changed it to work more like robocopy. I use TeraCopy for general copying, since v4.0rc it also copies folder timestamps correctly :).
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

You have files where some have the same length, some same names and some same sizes.
Regardless of which property you sort for, it will still find different files with the same length, or name or size sorted as if they're duplicates..
Right, so what I'm asking is if there is a way to group based on TWO criteria, size AND length. I was pretty sure the answer was no, but I just wanted to check.
Void suggests using a hash system, which if sorted by should ensure you only find the duplicate pairing in the 20 individual files. That could be one way. If there aren't that many files, it would work fine to do it temporarily for only those files instead of adding the property to the entire instance.

You can also add the column for the hash you want and put it at the most-left of the columns. If you don't scroll sideways, you can then scroll the result list and EBV will read the property as you go, as the column is visible. Maybe dupe:sha256 does this automatically without needing to scroll, I'm not trying it because lazy.
And hashing is a good suggestion, and I do have the file hash added to my index, but that only checks for EXACT duplicates. Say an mp3 encoded at 320kbps. But if you have the same song encoded at v0, it's still the same song, but a different size. Or even a DIFFERENT encoding of 320kbps. Or from a different mastering from a re-release of the album. The file sizes would all be very close, but different hashes.
Maybe another possibility is using another program for this specific purpose. I have a folder I want to de-duplicate that has lots of small duplicates.
I still haven't come to the point of re-installing it, but it's Duplicate Cleaner Pro. It has found different files counted as duplicates though, so it's not 100%..Maybe it is if you use a heavier hash algo. I think I used md5? Or none at all, my memory is shit.
Yes, I've looked into deduping programs, but they're really not effective for my purpose, for the reason about the hashes I explained above. As to WHY I have so many dupe files with different hashes, I can't really say. But they're there, and I can't get rid of them very easily. At this point it's really just me scrolling through the files sorted by length and looking for visual matches...
Herkules97
Posts: 220
Joined: Tue Oct 08, 2019 6:42 am

Re: is there way to find dupes based on 'size on disk' value?

Post by Herkules97 »

bruce656 wrote: Tue Oct 07, 2025 3:40 pm Right, so what I'm asking is if there is a way to group based on TWO criteria, size AND length. I was pretty sure the answer was no, but I just wanted to check.

And hashing is a good suggestion, and I do have the file hash added to my index, but that only checks for EXACT duplicates. Say an mp3 encoded at 320kbps. But if you have the same song encoded at v0, it's still the same song, but a different size. Or even a DIFFERENT encoding of 320kbps. Or from a different mastering from a re-release of the album. The file sizes would all be very close, but different hashes.


Yes, I've looked into deduping programs, but they're really not effective for my purpose, for the reason about the hashes I explained above. As to WHY I have so many dupe files with different hashes, I can't really say. But they're there, and I can't get rid of them very easily. At this point it's really just me scrolling through the files sorted by length and looking for visual matches...
I see, yeah Idk then. Maybe you could load them into foobar2000 2.x x64, sort by title and find them that way..That would also reveal their metadata if you'd rather keep any that have more metadata in them like year and such.

Idk how you use music, but I use foobar with the playback stats add-on and it counts per metadata. So I pick the metadata version of a song I like the most and use that to play. If I used other files with worse metadata or none at all, I stop using them when I get something better.

If you've been deleting without a care for metadata, you can still use foobar to maybe find duplicates easier..Unless a bunch of duplicates lack the usual metadata. Much easier if they all have at least artist and title.
therube
Posts: 5723
Joined: Thu Sep 03, 2009 6:48 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by therube »

Just some dupes that I've accumulated over time... One might catch something that another may not...

dupe:size;!name;dm
dupe:size;!name !/$recycle !/corrupt
distinct:size;name dupe:size !/$recycle !/corrupt
distinct:size;name dupe:Length !/$recycle !/corrupt


Also, while you say 'size on disk', I gather you do not actually mean size-on-disk.
Herkules97
Posts: 220
Joined: Tue Oct 08, 2019 6:42 am

Re: is there way to find dupes based on 'size on disk' value?

Post by Herkules97 »

therube wrote: Wed Oct 08, 2025 6:59 pm Just some dupe's that I've accumulated over time... One might catch something that another may not...

dupe:size;!name;dm
dupe:size;!name !/$recycle !/corrupt
distinct:size;name dupe:size !/$recycle !/corrupt
distinct:size;name dupe:Length !/$recycle !/corrupt


Also while you say, 'size on disk', but I gather you are not meaning, size-on-disk.
He does say
Ah! Thanks very much. I'll add size on disk to the indexed attributes
It would be confusing on his part if he just meant the standard size display, little benefit removing that as one of the properties.
As per his most recent reply, none of the dupe stuff would catch what he's looking for. At best it would have to be something that looks into the audio stream itself for any file. Idk how accurate it is if the bitrate is different, or it's a flac vs mp3. Also Idk if you can even compare the streams like that, or if that would just find a bunch of other songs with similar structures.
At mid, loading them into a music player like foobar2000, granted that they all have at least artist and title.
At worst, listening to every single file and removing as you go.
Or a combination if some lack one or both..Loading them in, sorting by artist..Removing from the playlist any that have none. Sorting by title, removing any that have none.
Then de-dup what remains, select all and then de-select one of each song. All that is selected at the end, right-click for file operations and delete.
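
The tag-based approach can be sketched in Python, assuming the artist and title of each file have already been read by some tagging tool into plain dictionaries. The field names (`path`, `artist`, `title`) and the function names are made up for the example, and real-world tags usually need more cleanup than the simple case and whitespace folding shown here.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Fold case and collapse whitespace so near-identical tags match."""
    return " ".join(text.casefold().split())


def group_by_tags(tracks: list[dict]) -> tuple[dict, list[str]]:
    """Group tracks by (artist, title) and set aside untagged ones.

    Returns (dupes, untagged): dupes maps a normalized (artist, title)
    pair to the paths sharing it, and untagged lists the paths missing
    either tag, which would need checking by ear.
    """
    groups = defaultdict(list)
    untagged = []
    for track in tracks:
        artist = (track.get("artist") or "").strip()
        title = (track.get("title") or "").strip()
        if not artist or not title:
            untagged.append(track["path"])
            continue
        groups[(normalize(artist), normalize(title))].append(track["path"])
    dupes = {key: paths for key, paths in groups.items() if len(paths) > 1}
    return dupes, untagged
```

This catches the same song in different encodings as long as both copies carry matching artist and title tags, which a hash comparison cannot do.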
therube
Posts: 5723
Joined: Thu Sep 03, 2009 6:48 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by therube »

But what is the benefit of size-on-disk: (in this scenario)?
In what manner would that help, a size based upon cluster size?
You have 1,001 .mid (midi) files & all are < 4096 bytes. With size-on-disk:, they'll all have a size-on-disk of 4096. How is that of any benefit?

length: can have a tolerance factor set (in ms).
size: can be given a range.

But cluster size, I fail to see any value?
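
As a rough picture of what a length tolerance plus a size range buys you, the Python sketch below pairs files whose play lengths are within a few hundred milliseconds and whose sizes are within a chosen percentage of each other. It is not Everything search syntax. The function name, the tuple layout, and the default thresholds are all assumptions made up for the example.

```python
def near_duplicates(
    files: list[tuple[str, int, int]],
    length_tol_ms: int = 500,
    size_ratio: float = 0.10,
) -> list[tuple[str, str]]:
    """Return candidate pairs with similar length and similar size.

    files holds (name, size_bytes, length_ms) tuples. Two files are
    paired when their lengths differ by at most length_tol_ms and their
    sizes differ by at most size_ratio of the larger size.
    """
    ordered = sorted(files, key=lambda item: item[2])  # sort by length
    pairs = []
    for i, (name_a, size_a, len_a) in enumerate(ordered):
        for name_b, size_b, len_b in ordered[i + 1:]:
            if len_b - len_a > length_tol_ms:
                break  # all later files are even further away in length
            if abs(size_a - size_b) <= size_ratio * max(size_a, size_b):
                pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    sample = [
        ("a.mp3", 9_600_000, 240_000),
        ("b.mp3", 9_100_000, 240_120),  # close in both length and size
        ("c.mp3", 4_000_000, 240_050),  # close in length, far in size
        ("d.mp3", 9_600_000, 300_000),  # same size, far in length
    ]
    print(near_duplicates(sample))
```

Unlike a cluster-sized bucket, both thresholds can be tightened or loosened to trade missed matches against false positives.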
Herkules97
Posts: 220
Joined: Tue Oct 08, 2019 6:42 am

Re: is there way to find dupes based on 'size on disk' value?

Post by Herkules97 »

therube wrote: Fri Oct 10, 2025 8:14 pm But what is the benefit of size-on-disk: (in this scenario)?
In what manner would that help, a size based upon cluster size?
You have 1,001 .mid (midi) files & all are < 4096 bytes. With size-on-disk:, they'll all have a size-on-disk of 4096. How is that of any benefit?

length: can have a tolerance factor set (in ms).
size: can be given a range.

But cluster size, I fail to see any value?
A mystery :) I don't delete all copies of something typically, so I don't have experience trying to do this.
I wonder if he has progressed on it..Maybe he found a different solution. Or he's like me and he starts something and then every several days resumes it and then maybe forgets about resuming it some months later.
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

therube wrote: Fri Oct 10, 2025 8:14 pm But what is the benefit of size-on-disk: (in this scenario)?
In what manner would that help, a size based upon cluster size?
You have 1,001 .mid (midi) files & all are < 4096 bytes. With size-on-disk:, they'll all have a size-on-disk of 4096. How is that of any benefit?

length: can have a tolerance factor set (in ms).
size: can be given a range.

But cluster size, I fail to see any value?
dupe:size-on-disk returns different matches than dupe:size :shrug:

Adding length as per void's suggestion, I get 4,500 matched pairs with dupe:size;length, and using dupe:size-on-disk; length I get 35,000 matched pairs.
bruce656
Posts: 22
Joined: Fri May 13, 2022 3:36 pm

Re: is there way to find dupes based on 'size on disk' value?

Post by bruce656 »

Herkules97 wrote: Sat Oct 11, 2025 7:57 pm
I wonder if he has progressed on it..Maybe he found a different solution. Or he's like me and he starts something and then every several days resumes it and then maybe forgets about resuming it some months later.
It's an ongoing task :laughing: I work on it a little bit each day.