[Suggestion]: Search first-XX-bytes: for ascii-content:

DerekZiemba · Post by **DerekZiemba** » Fri Jul 11, 2025 7:07 pm

Problem:
TypeScript & MPEG-TS both use the `.ts` file extension.
Ideally, you'd be able to differentiate them with something like `first-64-bytes:ascii-content:FFmpeg`.
Unfortunately, `ascii-content:` & `first-XX-bytes:` can't be combined. And `first-XX-bytes:` only supports hex input, making it unwieldy.

Suggestions:

Add support for the `first-XX-bytes:` family of functions to combine with other functions, e.g. `ignore-case:`, `ansi-content:`, `ascii-content:`, `text-content:`, `regex:`, etc.
Add `first-1k-bytes:`, `first-2k-bytes:`, & `first-4k-bytes:` to the `first-XX-bytes:` function family.
- Little to no performance impact. Windows' standard allocation unit size is 4 KB, so I believe each read operation (which takes orders of magnitude more time than doing the compares) is always a 4 KB block regardless of whether you're doing `first-byte:` (1 compare), `first-64-bytes:` (naively 2 AVX256 compares or 1 AVX512 compare), or `first-512-bytes:` (naively 16 AVX256 or 8 AVX512 compares).
- I doubt every MPEG-TS file says "FFmpeg" in the first 64 bytes like in my example file.
  Therefore it would be nice to be able to scan to the metadata block, which in my example runs from byte 656 to 1383, for the word "MPEG".
  Note: it should also be possible to throw `ignore-case:` in there, i.e. something like `ignore-case:first-4k-bytes:ascii-content:MPEG`.
  Or better yet, yeet the fast compare performance out the window & regex: `first-4k-bytes:regex:"\b((?i:MPEG-?4)|(?-i:(H\.?|x)264))\b"`
NOTE: I'm aware the correct syntax might be `regex:first-4k-bytes:"\b((?i:MPEG-?4)|(?-i:(H\.?|x)264))\b"`, but I figured it's probably easier for `first-xx-bytes:` to see that the param is not hexadecimal and so must be a function. At a glance, the only functions that couldn't be combined in that fashion, because they are valid hex, are `da:` (shorthand for `date-accessed:`) and `dc:` (shorthand for `date-created:`), and I can't imagine a scenario where you'd use them here anyway.
Also because `ignore-case:first-4k-bytes:ascii-content:MPEG` pretty much has to have `ascii-content:` come last unless you require the param be quoted.
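
For anyone who wants to try the idea outside Everything, here is a minimal Python sketch of what `ignore-case:first-4k-bytes:ascii-content:MPEG` and the regex variant are meant to do. It only illustrates the request and says nothing about how Everything works internally; the helper names `head_matches` and `first_bytes_match` are made up for this example.

```python
import re

# The regex from the post, as a bytes pattern so it can run on raw file data.
POST_REGEX = rb"\b((?i:MPEG-?4)|(?-i:(H\.?|x)264))\b"


def head_matches(head: bytes, pattern: bytes, ignore_case: bool = False) -> bool:
    """Return True if the bytes regex `pattern` matches anywhere in `head`."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, head, flags) is not None


def first_bytes_match(path, pattern: bytes, window: int = 4096,
                      ignore_case: bool = False) -> bool:
    """Read only the first `window` bytes of the file and test them."""
    with open(path, "rb") as f:
        head = f.read(window)  # a single small read; the rest of the file is never touched
    return head_matches(head, pattern, ignore_case)


def ascii_needle(text: str) -> bytes:
    """Turn plain text into a literal (escaped) bytes pattern, like ascii-content:."""
    return re.escape(text.encode("ascii"))
```

With these, `first_bytes_match("video.ts", ascii_needle("MPEG"), ignore_case=True)` plays the role of the plain-text search and `first_bytes_match("video.ts", POST_REGEX)` the regex one (`video.ts` being a placeholder path).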

DerekZiemba · Post by **DerekZiemba** » Fri Jul 11, 2025 8:07 pm

In case someone smart comes in here: the example above is just that, an example.
I'm aware it's in essence solved with `size:>5m ext:ts` (types.d.ts is over 4 MB & everywhere).
A likely more common case for being able to search start/end by text would be for extensionless scripts:

ext:""   size:>64  size:<1m  first-32-bytes:ascii-content:<"#!/bin/bash";"#!/bin/sh";"#!/usr/bin/env bash";"#! /usr/bin/env node">

The variance in, for example, spacing is why it'd be nice to combine other modifiers/functions like `ignore-whitespace:`, `ignore-punc:`, or `regex:`.
Yes, in this specific example that can all largely be avoided by just looking for the shebang. But the functionality I'm asking for is more usable and convenient, & opens up more possibilities. It's inconvenient to figure out that #! in hexadecimal is 0x2321, or any text as hex for that matter.
If, for example, you just want extensionless node scripts, amongst literally 700k extensionless files in my case, as fast as possible, it's very inconvenient to write
first-2-bytes:2321 first-32-bytes:6E6F6465 vs. first-32-bytes:regex:"^#!.+?\b node\b"
Right now that's solved with content:regex:"^#!.+?\b node\b", but what if I indexed the first 32 bytes of every extensionless file for speed? For 700k files it'd only be ~22 MB of memory/database size, so it isn't unrealistic to want to do, especially with
!</.git/|^cache*/|*cache/|/IndexedDB/|/AppData/|/terminfo/|/tzdata/|/zoneinfo/|LICENSE|LOG> bringing it down to just ~75k.
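
To make the hex inconvenience and the index-size estimate concrete, here is a small Python sketch. It reuses the regex from the post verbatim; the function names are hypothetical and nothing here reflects Everything's own implementation.

```python
import re

# Regex from the post, applied to raw bytes. As written it expects a space
# before "node", so "#!/usr/bin/env node" matches but "#!/usr/bin/node" does not.
SHEBANG_NODE = re.compile(rb"^#!.+?\b node\b")


def to_hex(text: str) -> str:
    """Hex form of ASCII text, i.e. what first-XX-bytes: currently expects."""
    return text.encode("ascii").hex().upper()


def is_node_script(head: bytes) -> bool:
    """Test only the first 32 bytes, as first-32-bytes:regex:... would."""
    return SHEBANG_NODE.search(head[:32]) is not None


# Size of an index holding the first 32 bytes of 700k files.
INDEX_BYTES = 700_000 * 32  # 22,400,000 bytes, roughly 22 MB
```

Here `to_hex("#!")` gives `2321` and `to_hex("node")` gives `6E6F6465`, which are the two hex terms in the search above.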

I got too caught up in my example, so I never stated that what I'm really interested in is the ability to scan arbitrary byte ranges for text, not just the first or last. However, I figure that's a much harder ask. The existing infrastructure & syntax to express something like that is (I think?) missing, and usually the first/last 4k will do.

I saw the first/last 512 byte function families & reasoned they were probably created to optimize performance at a time when 512-byte sectors were the norm, which is why they stop there. If that was the logic, then with modern SSDs using 4 KB sector sizes it'd be a smaller ask to add 1k, 2k, & 4k variants, and it likely aligns with the original goals.

But yeah, to do what I really want to do would be something entirely new. Not sure how to express it in a way that fits with existing syntax but it'd be like:

[content-in-range:1k,2k,regex:"qwerty"]

or

[content-in-range:-500,2k,regex:"qwerty"]

That is, `[content-in-range:start,length,needle]`, where:

- start: Index to start from. If negative, then start from the end.
- length: How many bytes to search. If -1, then search to the end.
- Support for "k", "m", etc. suffixes (× 1000^n or × 1024^n) to avoid typing large numbers in the majority of cases.
- When numbers are big but need to be precise, support `_` in place of comma separators for readability.
- needle (in the haystack): the search term, which can be modified with functions/modifiers like `ascii-content:`, `utf16-content:`, `text-content:`, `hex:`, `wildcards:`, `number:`, `number-range:`, etc.
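
The start/length rules above can be sketched in Python roughly as follows. This is only one possible reading of the proposal: the function names are invented, the suffixes are assumed to be binary (× 1024^n) even though the proposal leaves that open, and the needle is assumed to be a bytes regex.

```python
import re

# Assumption: k/m/g are binary multiples; the proposal allows either base.
_SUFFIX = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Parse '2k', '-500' or '1_048_576' into a signed byte count."""
    m = re.fullmatch(r"(-?[\d_]+)([kmg]?)", text.strip().lower())
    if not m:
        raise ValueError(f"bad size: {text!r}")
    return int(m.group(1)) * _SUFFIX[m.group(2)]  # int() accepts '_' separators


def resolve_range(file_size: int, start: int, length: int):
    """Turn (start, length) into absolute [begin, end) offsets inside the file."""
    begin = start if start >= 0 else file_size + start  # negative: count from the end
    begin = max(0, min(begin, file_size))
    end = file_size if length == -1 else min(begin + length, file_size)
    return begin, end


def content_in_range(data: bytes, start: str, length: str, needle: bytes) -> bool:
    """Search only data[begin:end] for the bytes regex `needle`."""
    begin, end = resolve_range(len(data), parse_size(start), parse_size(length))
    return re.search(needle, data[begin:end]) is not None
```

Under these assumptions, `content_in_range(data, "1k", "2k", rb"qwerty")` corresponds to the first example and `content_in_range(data, "-500", "2k", rb"qwerty")` to the second, with the range clamped at the end of the file.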
By `number:` I'm referring to the binary representation, not text. Perhaps having functions like `i64:` and `f32:` would be useful here. The idea being, for example, that with publicly available source code but no insight into the build process or which version the source corresponds to, I could copy a numeric constant from the source code, then paste it directly into Everything like
ext:exe;dll [content-in-range:0,-1,f64:1.28943695621391310e+01]
That is: search each file from start to end for the binary form of the float64 number provided. If the previewer could display hex & highlight the location too... a man can dream... and also edit it... can really dream.

Basically: Everything takes a human-readable form, so that the human doesn't have to first convert it to hex, and then searches for the value as it'd be represented in a file.
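
As a rough illustration of what a hypothetical `f64:` modifier would have to do, this Python sketch converts the constant to its 8-byte IEEE 754 form and searches for it. Little-endian byte order is assumed here, which is how x86/x64 executables store doubles; the function names are made up.

```python
import struct


def f64_needle(value: float, little_endian: bool = True) -> bytes:
    """8-byte IEEE 754 double, in the byte order it would have in the file."""
    return struct.pack("<d" if little_endian else ">d", value)


def find_f64(data: bytes, value: float) -> int:
    """Offset of the first occurrence of the double in `data`, or -1."""
    return data.find(f64_needle(value))
```

So `find_f64(data, 1.28943695621391310e+01)` would do for one file's bytes what the search above asks Everything to do across all `.exe` and `.dll` files. A compiler may of course fold or rewrite a constant, so a miss doesn't prove the value is unused.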

ChrisGreaves · Post by **ChrisGreaves** » Sat Jul 12, 2025 11:14 am

I have not fully read these two posts, but I like what I think I understand.
Your example was an image file(?), which (files) I don't understand at all, but some fifteen years ago I was interested in audio files, such as MP3, and learned about the packets of data within those.

While I often trimmed applause from the start and end of MP3 files, it seems to me that a good way of checking for duplicates (back then!) would have been to isolate, say, a thousand bytes of actual music from the centre of a track, and use those packets to match against all packets of a second file. A match of selected packets would suggest a duplication of the track.
I follow with interest ...
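
A very rough Python sketch of that idea, assuming both files come from the same encode so the bytes in the middle are identical. It ignores MP3 frame boundaries entirely, and the function names are invented for illustration.

```python
def centre_sample(data: bytes, size: int = 1000) -> bytes:
    """Take `size` bytes from around the middle of the track."""
    mid = len(data) // 2
    begin = max(0, mid - size // 2)
    return data[begin:begin + size]


def likely_duplicate(a: bytes, b: bytes, size: int = 1000) -> bool:
    """True if the centre sample of `a` appears anywhere in `b`."""
    sample = centre_sample(a, size)
    return len(sample) > 0 and sample in b
```

A hit only suggests a duplicate; two different encodes of the same music would not share bytes, so this would miss them.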

I have not used `first-XX-bytes:`.
Yet!
Cheers, Chris

Post by **void** » Sun Jul 13, 2025 12:44 am

Ideally, you'd be able to differentiate them with something like
first-64-bytes:ascii-content:FFmpeg

Please try the following instead:

content-max-size:64 binary:content:FFmpeg

content-max-size: limits content searching to the first 64 bytes.
binary: treats the content and search as a byte stream.

Other useful searches for viewing/formatting the content:

content-max-size:64 regex:binary:content:(.*FFmpeg.*) addcol:1

content-max-size:64 regex:binary:content:(.*FFmpeg.*) addcol:a a:=UTF82HEX($1:)

content-max-size:
content-offset:
binary:

I will consider a content-in-range search function. (For now, please try content-max-size: and content-offset:)

I will look into treating first-x-bytes as binary (instead of hex)

Thank you for the suggestions.

voidtools forum
