[Suggestion]: Search first-XX-bytes: for ascii-content:

Discussion related to "Everything" 1.5.
DerekZiemba
Posts: 54
Joined: Thu Sep 27, 2018 4:46 pm

Post by DerekZiemba »

Problem:
TypeScript & MPEG-TS both use the `.ts` file extension.
Ideally, you'd be able to differentiate them with something like `first-64-bytes:ascii-content:FFmpeg`.
Unfortunately, `ascii-content:` & `first-XX-bytes:` can't be combined. And `first-XX-bytes:` only supports hex input, making it unwieldy.

Suggestions:
  1. Add support for the `first-XX-bytes:` family of functions to combine with other functions,
    i.e. ignore-case, ansi-content, ascii-content, text-content, regex, etc.
  2. Add `first-1k-bytes:`, `first-2k-bytes:`, & `first-4k-bytes:` to the `first-XX-bytes:` function family.
    • Little to no performance impact. Windows' standard allocation unit size is 4 KB, so I believe each read operation (which takes orders of magnitude more time than doing the compares) is always a 4 KB block regardless of whether you're doing `first-byte:` (1 compare), `first-64-bytes:` (naively 2 AVX256 compares or 1 AVX512 compare), or `first-512-bytes:` (naively 16 AVX256 or 8 AVX512 compares).
    • I doubt every MPEG-TS file says "FFmpeg" in the first 64 bytes like in my example file below.
      Therefore it would be nice to be able to scan to the metadata block, which in my example runs from byte 656 to 1383, for the word "MPEG".
      Note: it would also be nice to be able to throw `ignore-case:` in there, i.e. something like `ignore-case:first-4k-bytes:ascii-content:MPEG`.
      Or better yet, yeet the fast compare performance out the window & regex: `first-4k-bytes:regex:"\b((?i:MPEG-?4)|(?-i:(H\.?|x)264))\b"`
      [Attachment: image.png]
    NOTE: I'm aware the correct syntax might be `regex:first-4k-bytes:"\b((?i:MPEG-?4)|(?-i:(H\.?|x)264))\b"`. But I figured it's probably easier for `first-xx-bytes:` to see that the param is not hexadecimal and so must be a function. At a glance, the only functions that couldn't be combined in that fashion, because they are valid hex, are `da:` (shorthand for date-accessed:) and `dc:` (shorthand for date-created:), and I can't imagine a scenario where you'd use them here anyway.
    Also because `ignore-case:first-4k-bytes:ascii-content:MPEG` pretty much has to have `ascii-content:` come last unless you require the param be quoted.
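
For illustration, the behaviour being asked for can be sketched outside of Everything. The Python below is only an assumption about the intended semantics, not how Everything implements anything: `read_head`, `head_contains`, and `head_matches` are hypothetical helper names. It reads at most the first 4096 bytes in one call, then runs either a plain (optionally case-insensitive) byte search or the regex from the post.

```python
import re

def read_head(path, limit=4096):
    """Read at most `limit` bytes from the start of the file."""
    with open(path, "rb") as f:
        return f.read(limit)

def head_contains(head, needle, ignore_case=False):
    """Plain byte search, roughly ignore-case:first-4k-bytes:ascii-content:<needle>."""
    if ignore_case:
        # bytes.lower() only folds ASCII letters, which is what we want here.
        return needle.lower() in head.lower()
    return needle in head

# The regex from the post, written as a bytes pattern.
CODEC_RE = rb"\b((?i:MPEG-?4)|(?-i:(H\.?|x)264))\b"

def head_matches(head, pattern=CODEC_RE):
    """Regex search, roughly first-4k-bytes:regex:<pattern>."""
    return re.search(pattern, head) is not None
```

Typical use would be something like `head_contains(read_head("video.ts"), b"MPEG", ignore_case=True)`, where `video.ts` is a placeholder path.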
DerekZiemba
Posts: 54
Joined: Thu Sep 27, 2018 4:46 pm

Re: [Suggestion]: Search first-XX-bytes: for ascii-content:

Post by DerekZiemba »

In case someone smart comes in here. The example above is just that, an example.
I'm aware it's in essence solved with `size:>5m ext:ts` (types.d.ts is over 4mb & everywhere).
A likely more common case for being able to search start/end by text would be for extensionless scripts:
ext:"" size:>64 size:<1m first-32-bytes:ascii-content:<"#!/bin/bash";"#!/bin/sh";"#!/usr/bin/env bash";"#! /usr/bin/env node">

Variances in, for example, spacing are why it'd be nice to combine other modifiers/functions like `ignore-whitespace:`, `ignore-punc:`, or `regex:`.
Yes, that can all largely be avoided by just looking for the shebang in this specific example. But what I'm asking for is more usable, more convenient, & opens up more possibilities. It's inconvenient to figure out that #! in hexadecimal is 0x2321, or any text as hex for that matter.
If, for example, you just want extensionless node scripts (amongst literally 700k extensionless files in my case) as fast as possible, it's very inconvenient to write
first-2-bytes:2321 first-32-bytes:6E6F6465 vs. first-32-bytes:regex:"^#!.+?\b node\b"
Right now that's solved with content:regex:"^#!.+?\b node\b" but what if I indexed the first 32 bytes of every extensionless file for speed? For 700k files it'd only be about 22 MB of memory/database size, so it isn't unrealistic to want to do, especially with
!</.git/|^cache*/|*cache/|/IndexedDB/|/AppData/|/terminfo/|/tzdata/|/zoneinfo/|LICENSE|LOG>
bringing it down to just ~75k
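
To make the hex inconvenience concrete, here is a small Python sketch (outside of Everything; `to_hex` and `is_node_script` are hypothetical helper names) that does the text-to-hex conversion by hand, applies the shebang regex from above to a 32-byte head, and checks the index size estimate.

```python
import re

def to_hex(text):
    """ASCII text -> uppercase hex string, the form first-XX-bytes: currently expects."""
    return text.encode("ascii").hex().upper()

# The shebang regex from the post, written as a bytes pattern.
NODE_SHEBANG = re.compile(rb"^#!.+?\b node\b")

def is_node_script(head):
    """Apply the regex to the first 32 bytes only."""
    return NODE_SHEBANG.search(head[:32]) is not None

print(to_hex("#!"))    # 2321
print(to_hex("node"))  # 6E6F6465
print(700_000 * 32)    # 22400000 bytes, i.e. about 22 MB
```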


I got too caught up in my example and never stated that what I'm really interested in is the ability to scan arbitrary byte ranges for text, not just the first or last. However, I figure that's a much harder ask. The existing infrastructure & syntax to express something like that is (I think?) missing, and usually the first/last 4k will do.

I saw the first/last 512 byte function families & reasoned they were probably created to optimize performance at a time when 512-byte sectors were the norm, which is why they stop there. If that was the logic, then with modern SSDs using 4 KB sector sizes it'd be a smaller ask to add 1k, 2k, & 4k variants, which likely aligns with the original goals.

But yeah, to do what I really want to do would be something entirely new. Not sure how to express it in a way that fits with existing syntax but it'd be like:
[content-in-range:1k,2k,regex:"qwerty"]
or
[content-in-range:-500,2k,regex:"qwerty"]

That is, [content-in-range:start,length,needle]:
  • start: Index to start from. If negative, then start from end.
  • length: How many bytes to search. If -1, then search to the end.
  • Support for "k, m, etc." suffixes (× 1000^n or × 1024^n) to avoid typing large numbers in the majority of cases.
    When numbers are big but need to be precise, support `_` in place of comma separators for readability.
  • needle in the haystack: the search term, can be modified with function/modifiers like: `ascii-content:`, `utf16-content:`, `text-content:`, `hex:`, `wildcards:`, `number:`, `number-range:`, etc.
    By `number:` I'm referring to the binary representation, not text. Perhaps having functions like `i64:` or `f32:` would be useful here. The idea being: if, for example, I have publicly available source code but no insight into the build process or which version the source corresponds to, I could copy a number constant from the source code, then paste it directly into Everything like `ext:exe;dll [content-in-range:0,-1,f64:1.28943695621391310e+01]`.
    That is: search each file from start to end for the binary form of the float64 number provided. If the previewer could display hex & highlight the location too... a man can dream... and also edit it... can really dream.
Basically: Everything takes a human-readable form, so that the human doesn't have to first convert it to hex, and then searches for the value as it'd be represented in a file.
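
A minimal Python sketch of the proposed semantics, to pin down what start, length, and needle would mean. None of this is existing Everything syntax or behaviour: `parse_size`, `content_in_range`, and `f64` are hypothetical names, the suffixes are assumed to be 1024-based, and the float is assumed to be stored little-endian (as on x86).

```python
import re
import struct

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}  # assumption: 1024-based

def parse_size(text):
    """Parse '2k', '-500' or '1_000_000' into an integer byte count."""
    m = re.fullmatch(r"(-?\d[\d_]*)([kmg]?)", text.strip().lower())
    if m is None:
        raise ValueError(f"bad size: {text!r}")
    return int(m.group(1).replace("_", "")) * _UNITS[m.group(2)]

def content_in_range(data, start, length, needle):
    """Return the absolute offset of `needle` inside the window, or -1.

    start  < 0 counts back from the end of the data.
    length < 0 means "search to the end".
    """
    size = len(data)
    begin = size + start if start < 0 else start
    begin = max(0, min(begin, size))
    end = size if length < 0 else min(begin + length, size)
    return data.find(needle, begin, end)

def f64(value):
    """Binary form of a float64 constant (little-endian assumed)."""
    return struct.pack("<d", value)
```

With these, `[content-in-range:0,-1,f64:1.28943695621391310e+01]` would correspond to `content_in_range(data, 0, -1, f64(1.28943695621391310e+01))`, where `data` is the file's bytes.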
ChrisGreaves
Posts: 821
Joined: Wed Jan 05, 2022 9:29 pm

Re: [Suggestion]: Search first-XX-bytes: for ascii-content:

Post by ChrisGreaves »

I have not fully read these two posts, but I like what I think I understand.
Your example was an image file(?) which (files) I don't understand at all, but some fifteen years ago I was interested in audio files - such as MP3 - and learned about the packets of data within those.

While I often trimmed applause from the start and end of MP3 files, it seems to me that a good way of checking for duplicates (back then!) would have been to isolate, say, a thousand bytes of actual music from the centre of a track, and use that/those packets to match against all packets of a second file. A match of selected packets would suggest a duplication of the track.
I follow with interest ..

I have not used FirstxxBytes.
Yet!
Cheers, Chris
void
Developer
Posts: 19870
Joined: Fri Oct 16, 2009 11:31 pm

Re: [Suggestion]: Search first-XX-bytes: for ascii-content:

Post by void »

Ideally, you'd be able to differentiate them with something like
first-64-bytes:ascii-content:FFmpeg
Please try the following instead:

content-max-size:64 binary:content:FFmpeg


content-max-size: limits content searching to the first 64 bytes.
binary: treats the content and search as a byte stream.



Other useful searches for viewing/formatting the content:

content-max-size:64 regex:binary:content:(.*FFmpeg.*) addcol:1


content-max-size:64 regex:binary:content:(.*FFmpeg.*) addcol:a a:=UTF82HEX($1:)


content-max-size:
content-offset:
binary:



I will consider a content-in-range search function. (For now, please try content-max-size: and content-offset:)

I will look into treating first-x-bytes as binary (instead of hex).

Thank you for the suggestions.