Searching 50000 documents by content in memory?

Discussion related to "Everything" 1.5 Alpha.
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

Reading that the content index is kept in memory makes me ask whether this is a sane feature in my case. I think it is a waste of memory, e.g. when running a game or on a low-end computer. In my current 1.4.1.935 (x86) version, Everything (both the service and the UI instance) holds 85MB in Task Manager's Mem Usage column.

I was unable to find the following phrase, or any part of it, with the forum search or Google. The phrase is "Allow Indexing service to index this disk for fast file searching" in the WinXP drive properties, or "Allow files on this drive to have contents indexed in addition to file properties" in the Win10 drive properties. What relationship, if any, do these Windows system options have to the Everything search engine?

In my opinion, if Everything's content search is implemented like a common web search such as Google's or the forum's, it will give poor results. At the least, a word index (i.e. position) is needed for every word in every document, together with a special SQL query that finds not only documents having the search words at random places, as web search does, but also documents having them in sequence or close to each other.

Example
Basic table with 3 columns:
1st field: the document's index (number, key with duplicates allowed); 2nd field: the word's index (number, key with duplicates allowed); 3rd field: the position of that word inside the document.

Second basic table with 2 columns:
1st field: the document's index (unique key); 2nd field: the path, URL or URI of the document with that index.

Third basic table with 2 columns, containing all known words (e.g. words acquired at run time by the Everything engine):
1st field: the word's index number; 2nd field: the word itself. Both fields are unique keys, and the 2nd is also a primary key. All words need to be simplified and filtered, e.g. only upper case or only lower case, without diacritics, camel case, snake case, punctuation etc. A space doesn't need an index, but a full stop (.) may have one, so that a reconstructed phrase can be shown to the user.

For "words" which are not really words, like dates, automobile PIDs, license numbers, product serials, GUIDs, telephone numbers, long strings etc., there would be an additional table like the previous one, with the 2nd field being a memo field, which can be omitted in cases of GDPR, or for simplicity or efficiency.
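
For illustration only, the three-table layout proposed above can be sketched with Python's built-in sqlite3 module. This is not how Everything stores content; the table names, column names and helper functions (`tokenize`, `build_index`, `phrase_search`) are all assumptions made for this sketch, and the word filter is far simpler than the one described.

```python
import re
import sqlite3

def tokenize(text):
    # Lower-case and keep only letter/digit runs; a fuller filter would
    # also strip diacritics and split camelCase or snake_case.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs: mapping of path -> text. Returns an in-memory SQLite connection."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
        CREATE TABLE words     (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE postings  (doc_id INTEGER, word_id INTEGER, position INTEGER);
        CREATE INDEX postings_by_word ON postings (word_id, doc_id, position);
    """)
    for path, text in docs.items():
        doc_id = con.execute(
            "INSERT INTO documents (path) VALUES (?)", (path,)).lastrowid
        for position, word in enumerate(tokenize(text)):
            con.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            word_id = con.execute(
                "SELECT word_id FROM words WHERE word = ?", (word,)).fetchone()[0]
            con.execute("INSERT INTO postings VALUES (?, ?, ?)",
                        (doc_id, word_id, position))
    return con

def phrase_search(con, phrase):
    """Return paths of documents that contain the words of `phrase` in sequence."""
    terms = tokenize(phrase)
    if not terms:
        return []
    joins = []
    for i in range(len(terms)):
        clause = (f"JOIN postings p{i} ON p{i}.doc_id = d.doc_id "
                  f"AND p{i}.word_id = (SELECT word_id FROM words WHERE word = ?)")
        if i:
            # Each later word must sit exactly i positions after the first one.
            clause += f" AND p{i}.position = p0.position + {i}"
        joins.append(clause)
    sql = ("SELECT DISTINCT d.path FROM documents d "
           + " ".join(joins) + " ORDER BY d.path")
    return [row[0] for row in con.execute(sql, terms)]
```

With two small documents, `"To be, or not to be."` and `"Not to be confused: be or to."`, a search for `or not to be` returns only the first, while `to be` returns both. A "close to each other" query would replace the equality on `position` with a range.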
Attachments
allow indexing xp.png (2.88 KiB)
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

Thank you for your feedback aaathemtheyzzz,
What relationship have these windows system options to everything search engine if they had any?
None.
Everything does not use the Windows Search.
Everything will ignore the Not Content Indexed attribute.

Set the content include/exclude filters to limit which files are content indexed in Everything from Tools -> Options -> Content.

Indexing file content is intended for user documents where you want instant searching.

For example, you might like to index your documents folder only:
  • In Everything, from the Tools menu, click Options.
  • Click the Content tab on the left.
  • Set include only folders to a semicolon delimited list of folders, for example:
    c:\users\<my user name>\Documents
  • Set include only files to a semicolon delimited list of files to include, for example:
    *.doc;*.txt
  • Click OK.
If you don't need the instant searching, you can still content search in Everything without using content indexing.
It will just be much slower.
At least a word index (aka position) is needed for every word for every document
Content is not broken up into words.
I will consider a word-database for Everything.

Content is indexed as a single block of text.
Everything will perform your search on the entire file content as text.

Use double quotes to match an exact phrase.

For example: content:"abc 123"
matches the literal phrase: abc 123 (including the space)

Use the regex: search modifier for more control, for example:
regex:content:\babc.123
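
As a rough illustration of the difference between the two searches above, the following sketch treats file content as one block of text and tests it with a plain substring match and with a regular expression. Python's `re` module stands in for whatever regex engine Everything uses, the function names are made up for this example, and case handling is ignored.

```python
import re

def contains_phrase(content, phrase):
    # Plain substring test over the whole text, in the spirit of content:"abc 123".
    return phrase in content

def matches_regex(content, pattern):
    # Regex test over the whole text, in the spirit of regex:content:\babc.123
    return re.search(pattern, content) is not None

pattern = r"\babc.123"  # word boundary, "abc", any one character, "123"

for text in ["see abc 123 here", "xabc 123", "abc-123"]:
    print(text, contains_phrase(text, "abc 123"), matches_regex(text, pattern))
```

The substring test finds `abc 123` inside `xabc 123`, while the regex rejects it because `\b` requires a word boundary before `abc`. The regex accepts `abc-123` because `.` matches any single character, which the literal phrase does not.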


I am working on an option to search your Windows search index from within Everything.
horst.epp
Posts: 1455
Joined: Fri Apr 04, 2014 3:24 pm

Re: Searching 50000 documents by content in memory?

Post by horst.epp »

void wrote: Tue Apr 27, 2021 11:20 am ...
...
I am working on an option to search your Windows search index from within Everything.
For me that would be an absolute killer feature of Everything.
Now I'm waiting ... :)
NotNull
Posts: 5502
Joined: Wed May 24, 2017 9:22 pm

Re: Searching 50000 documents by content in memory?

Post by NotNull »

aaathemtheyzzz wrote: Tue Apr 27, 2021 4:37 am low end computer.
In my experience, the indexing of Windows Search on computers with low end specs is a disaster. It takes too many resources and even manages to interrupt the music I play, while the system is virtually doing nothing (never mess with my music! :))

Everything is much, *much* friendlier on resources in that regard (I disabled Windows search and indexing on those low end systems).
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

I suggest adding to the statistics window a breakdown that includes size stats for content and extra properties.

As v1.5 adds many more features and can easily become a memory beast, it would be useful to know the overhead of each feature...

Also, and this may be overshooting: add a user option to load the content index on demand from disk, e.g. when typing content:
This has the advantage of allowing a much larger content index without a constant memory footprint, but the disadvantage that search would not be as instantaneous as it is now...
(Although on modern SSDs, loading 500MB should take no more than 1 second, probably less...)
NotNull
Posts: 5502
Joined: Wed May 24, 2017 9:22 pm

Re: Searching 50000 documents by content in memory?

Post by NotNull »

aviasd wrote: Wed Apr 28, 2021 8:30 pm Also, and this maybe overshooting, adding a user option to load the content index on demand from disk, e.g when typing content:
You mean the content: search as it was in Everything 1.4?

content: in 1.5 will search only in indexed content, but with notindexed:content: you can search from disk instead of the database.

I created a macro allcontent: for that. It searches in content-indexed files as well as not content indexed files.

Macros-1.5a.csv (in %APPDATA%\Everything)

Name,Search
allcontent<%1>,< content:%1: | notindexed:content:%1:> 
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

NotNull wrote: Wed Apr 28, 2021 8:57 pm
You mean the content: search as it was in Everything 1.4?
No, I meant to allow the user to load the content that's already indexed on everything.db from disk into memory on demand as opposed to having it in memory all the time...



Macros-1.5a.csv (in %APPDATA%\Everything)

Name,Search
allcontent<%1>,< content:%1: | notindexed:content:%1:> 
That's a nice macro :) thanks
As you were posting this, I posted a relevant suggestion which touches the same issue that i guess caused you to write that macro :lol:
NotNull
Posts: 5502
Joined: Wed May 24, 2017 9:22 pm

Re: Searching 50000 documents by content in memory?

Post by NotNull »

aviasd wrote: Wed Apr 28, 2021 9:07 pm No, I meant to allow the user to load the content that's already indexed on everything.db from disk into memory on demand as opposed to having it in memory all the time...
Thanks for explaining! Upon re-reading I should have understood that the first time...

I had the same idea and *think* I even suggested it (more than 10 minutes ago, so not sure ;)): Just keep content in a separate unstructured NoSQL database on disk to be read when needed.
It does have another disadvantage though: when you are editing - let's say - a text document and save it, the entire database needs to be rewritten on disk (when scripting, I save every couple of minutes..).
A workaround would be to keep all new/changed content in RAM and flush it to disk after x hours. But still: lots of disk writes.
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

NotNull wrote: Wed Apr 28, 2021 7:08 pm
aaathemtheyzzz wrote: Tue Apr 27, 2021 4:37 am low end computer.
In my experience, the indexing of Windows Search on computers with low end specs is a disaster. It takes too many resources and even manages to interrupt the music I play, while the system is virtually doing nothing (never mess with my music! :))

Everything is much, *much* friendlier on resources in that regard (I disabled Windows search and indexing on those low end systems).
I too disable Windows Search, because I usually search from within Total Commander, or with Win+W assigned to Everything's "new window hotkey", which is much quicker than finding the Windows magnifying glass. Also, with Total Commander I can search inside the results, and so further eliminate the results that do not match my interest. Consider that I don't know regex (most people haven't even heard that word).
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

NotNull wrote: Wed Apr 28, 2021 10:29 pm
I had the same idea and *think* I even suggested it (more than 10 minutes ago, so not sure ;)): Just keep content in a separate unstructured NoSQL database on disk to be read when needed.
It does have another disadvantage though: when you are editing - let's say - a text document and save it, the entire database needs to be rewritten on disk (when scripting, I save every couple of minutes..).
Workaround would be to keep all new/changed content in RAM and flush it to disk after x hours. But still: lots of dsikwrites.
Yes, there will be some IO penalty for indexed content for machines with intense writes.

I'm not sure there would be a need for a separate db structure, as everything.db looks segmented already.

It also seems like the content index is kept compressed on disk and inflated when read into memory: Everything 1.5 does not have a compress database option, and 211MB worth of files to be content indexed takes an additional 55MB in everything.db, while in memory there's an almost exact 211MB increase, so there's some more CPU overhead there.
Edit:
Stats are wrong here:
211MB of files take 67MB more of everything.db and 173MB in memory.

The problem could be addressed in various ways:
Once the content index is already loaded to memory:
A user command to flush indexed content manually, flushing on idle IO threshold, flushing after a timeout -
None of those suggestions will solve the additional IO problem fully.

Your suggestion to buffer changes in memory and flush them in one go would still require loading the current index back into memory to inflate, re-sort, deflate etc. before writing it back (among a few more issues that I can think of).

So it's kind of a complex problem; that's why I felt it was over-reaching.

Note: I do feel there's a need to delay immediate indexing of property/content on creation. I.E: if one would like to index sha-256 property for example, every new short lived tmp file or rapidly changing files would be reindexed constantly which has a huge IO/CPU penalty in that scenario.
I encountered this issue on a regular basis with everything 1.4 when deleting huge directory trees while everything was running: everything would go a bit crazy trying to keep up - taking half the cpu and makes the system/delete operations a bit sluggish.
My workaround was to close everything before performing those operations ( or actually killing it sometimes ). Haven't had the chance on everything 1.5 to delete huge trees, so I dunno if this was fixed.
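
The idea of delaying indexing until a file has settled can be sketched as a simple debounce. This is only an illustration of the suggestion in the note above, not Everything's behaviour; the class name, method names and the explicit `now` argument (used instead of a real clock so the logic is easy to follow) are assumptions.

```python
class DelayedIndexer:
    """Hold back (re)indexing of a file until it has been quiet for a while."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.pending = {}  # path -> time of the most recent change

    def file_changed(self, path, now):
        # Every new change restarts the waiting period for that file.
        self.pending[path] = now

    def file_deleted(self, path):
        # A short-lived file that disappears is never indexed at all.
        self.pending.pop(path, None)

    def due(self, now):
        """Return the files that have been unchanged for at least `delay` seconds."""
        ready = [p for p, changed in self.pending.items()
                 if now - changed >= self.delay]
        for path in ready:
            del self.pending[path]
        return ready
```

With a delay of 5 seconds, a temp file changed at t=0 and t=3 and deleted at t=4 is never handed to the (expensive) property or content reader, while a document changed once at t=1 becomes due at t=6.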
Last edited by aviasd on Thu Apr 29, 2021 8:09 am, edited 6 times in total.
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

aaathemtheyzzz wrote: Thu Apr 29, 2021 6:47 am
I'm too disable the Windows search because I usually search from within total commander or having winkey-W assign to everything search "new window hotkey" which is must more quicker than to find the windozes magnify glass. Also with total commander I can search inside the results, thus eliminate further in the process those results does not much my interest. Consider that I don't know regex (most people don't even heard that word).
On recent TC versions there's option of integrating everything into TC functionality: for searching and folder size calculations, which produces much faster results.
For search: on the search window, press the everything checkbox
For folder sizes: Preferences-> Operations -> Everything ("Index folder size" should be ON ,in everything)
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

aviasd wrote: Wed Apr 28, 2021 9:07 pm
NotNull wrote: Wed Apr 28, 2021 8:57 pm No, I meant to allow the user to load the content that's already indexed on everything.db from disk into memory on demand as opposed to having it in memory all the time...
I will explain my opinion that the content index must not be in memory. Back in 2002, on an ordinary PC, I broke whole newspapers into words, for every newspaper, every edition and every day of the year, and the search results were instant. The whole database resided on disk. Now that we have SSDs and much more capable computers, why do we need to overload main memory? After all, as content changes it is easy to reflect that back and update the content index almost immediately. That isn't a lot of writes, because in the relational database world the basic elements are not the content but the pointers (aka indexes) that point to the content. So a 1000-word document is worth 1000 integers. Those 1000 integers are needed only if we want to keep the whole sequence of that document (which is more useful in searching); if we keep only the unique words, we need, let's say, 150 integers.

PS. I forgot to mention that in my 2002 project, when the user changed the content, a whole backup copy was also kept, so one could easily see how the content looked at successive steps up to its final version. A feature that even today does not exist.
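
The "one integer per word" argument from this post can be illustrated with a tiny sketch. The `encode` helper and the shared `vocabulary` dictionary are hypothetical names, and splitting on whitespace is a deliberate simplification.

```python
def encode(text, vocabulary):
    """Replace each word by its integer id, growing the vocabulary as needed."""
    ids = []
    for word in text.lower().split():
        # setdefault returns the existing id, or assigns the next free one.
        ids.append(vocabulary.setdefault(word, len(vocabulary)))
    return ids

vocabulary = {}
sequence = encode("to be or not to be", vocabulary)
print(sequence)            # [0, 1, 2, 3, 0, 1]: one integer per word, order kept
print(sorted(set(sequence)))  # [0, 1, 2, 3]: only the unique words
```

Keeping the full sequence allows phrase and proximity searches; keeping only the set of unique ids is smaller but can only answer "does this document contain the word at all".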
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

void wrote: Tue Apr 27, 2021 11:20 am Content is not broken up into words.
I will consider a word-database for Everything.
You don't need to pre-fill the word database, as each user may have a different language or different documents to index. One may have literature and another may have books about physics. The candidates for me were SQLite FTS3/4/5, the Extensible Storage Engine ("Jet Blue"), or the Microsoft Jet Database Engine ("Jet Red").

I used an Access 2003 mdb to draw up the table definitions and build the SQL parameter queries that I call from code, because its killer query interface clears things up in my head. Then I wrote some code loops that use recordset.Seek "=" almost everywhere to get a representation of my stored document really fast.
That is because I need to see that area of the document before I open the whole document and go to that place inside it.
I spent most of my coding time writing string-manipulation preprocessing and filtering functions that categorize and simplify each token-word. So in the case of "to be or not to be", only those exact words are stored. Regex is pretty much not needed to filter things, as those things are already clear. Of course, I also tried indexing some w3schools.com pages to see whether my code picked up other documents having no relation to the w3schools world, and the results were great, but I got ill and abandoned the project. Initially I tried to use Python for the string preprocessing, but got stuck on Python's Unicode handling and debugging crap, which is almost the same as with VB6's standard file read functions.
Content is indexed as a single block of text.
Everything will perform your search on the entire file content as text.
How do you index the whole block of text as a whole? Are you using some CRC/hash/BM25 algorithm that gives a long hash value representing the document, so that it can be searched against the hash produced by the search query?
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

NotNull wrote: Wed Apr 28, 2021 7:08 pm
In my experience, the indexing of Windows Search on computers with low end specs is a disaster. It takes too many resources and even manages to interrupt the music I play, while the system is virtually doing nothing (never mess with my music! :))

Everything is much, *much* friendlier on resources in that regard (I disabled Windows search and indexing on those low end systems).
Unfortunately on most machines, I cannot disable this disaster as outlook still uses the windows indexing service to search for emails.
I cannot say how much grief I had with trying to get this service to work as it should over the years.

I find it very titillating to my grief/irony bone that an alpha feature by a single developer can outperform a feature that has existed for well over a decade, with virtually infinite resources behind it, in terms of speed, user friendliness, resource hogging and *reliability*. All that, still in the alpha stage... I don't know if I'm happy or sad... :|
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

Test run of 1.5a x86 on a real system, with no options changed except enabling the content index.

Everything-1.5a.db is 101MB and Everything-1.5a.db.tmp is 13MB.

I tried it a couple of times with the same results, on XP 32-bit with 2GB RAM and a 4GB swap file.

The 2nd MessageBox appears 1-2 seconds after the 1st MessageBox. Is it a thread issue?
I deleted Everything-1.5a.db and Everything-1.5a.db.tmp and reran Everything 1.5a, but the databases were not created again.
Attachments
eveything15.png (62.97 KiB)
horst.epp
Posts: 1455
Joined: Fri Apr 04, 2014 3:24 pm

Re: Searching 50000 documents by content in memory?

Post by horst.epp »

aaathemtheyzzz wrote: Thu Apr 29, 2021 12:43 pm Test run 1.5a x86 without any options changed except the content index on a real system.

Everything-1.5a.db is 101MB and Everything-1.5a.db.tmp is 13MB.

Try it a couple of times with the same results at XP 32bit 2GB RAM and 4GB swap file.

The 2nd MessageBox appears 1-2 seconds after the 1st MessageBox. Is it a thread issue?
I delete Everything-1.5a.db and Everything-1.5a.db.tmp and rerun the Everything 1.5a but the databases was not then created.
Sorry but on a 32bit XP this is for me an expected result.
32bit programs are limited to 2GB RAM and even that is not fully usable.
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

How you index the whole block of text as a whole?
Everything indexes content as raw text, it is not hashed.
You can search for an exact phrase or partial text in file content with Everything.
The 2nd MessageBox appears 1-2 seconds after the 1st MessageBox. Is it a thread issue?
Everything uses multiple threads to index content.
The out-of-memory condition has most likely occurred in more than one thread.

Everything will store indexed content in memory for the best search performance.
Please try the x64 version or reduce the number of files that are content indexed.
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

horst.epp wrote: Thu Apr 29, 2021 1:15 pm Sorry but
Ability to sorry is restricted by but.
horst.epp wrote: Thu Apr 29, 2021 1:15 pm but on a 32bit XP this is for me an expected result.
For me it is not. "But" is not an excuse to use whenever it suits us; even then, malloc returns null if it fails.
Back in 1993 we had the same problem when a user drew complicated shapes (maps). We solved the problem by writing structures to disk. We demanded memory for our application only. We didn't demand extra memory to store user actions, other than some handles pointing to the real structures that held the data.
horst.epp wrote: Thu Apr 29, 2021 1:15 pm 32bit programs are limited to 2GB RAM and even that is not fully usable.
On the other hand, we can ask customers to switch to protected mode, which in most (but not all) cases will solve the problem. We don't, perhaps because we have the dignity to admit that this memory exhaustion issue is our software's problem and not a real barrier imposed by the system.
Even now some games load the whole level into memory when only a portion is used at a time. In contrast, look at GTA San Andreas: its engine loads the environment when it is needed.
Another example: the VLC player developers constantly abuse users in later versions. In VLC 2.2.1 the CPU usage is 35% for an H.265 video, while in the latest versions the CPU usage goes up to 98%, dropping frames etc. Not to mention the memory-hogging crap no one ever uses, called "UI function libraries". All for the same H.265 video on the same PC and OS, just with a different policy applied. Terry Pratchett vs selling out to evil.
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

void wrote: Fri Apr 30, 2021 5:15 am Everything will store indexed content in memory for the best search performance.
Can you please try the same operation on disk? The performance might not be as low as you think.
Another reason to have the index on disk is next-gen search abilities. I mean long after 1.5 is completely ready.
void wrote: Fri Apr 30, 2021 5:15 am .. or reduce the number of files that are content indexed.
How will I know beforehand by how much to reduce them? One wrong solution might be to run multiple Everything instances, one per file group, so that no single instance requires more than, let's say, 300MB. Even if that were possible by playing with the registry (e.g. setting up multiple services), all of RAM would constantly swap to the pagefile on disk, so the PC would be unusable.
In contrast, the built-in search of Win10 x86 does not require all of my memory, as its index does not reside in memory.
NotNull
Posts: 5502
Joined: Wed May 24, 2017 9:22 pm

Re: Searching 50000 documents by content in memory?

Post by NotNull »

aaathemtheyzzz wrote: Fri Apr 30, 2021 4:50 pm
horst.epp wrote: Thu Apr 29, 2021 1:15 pm 32bit programs are limited to 2GB RAM and even that is not fully usable.
On the other hand we can ask customers to switch to protected mode, which on most (but not all) cases will solve the problem. We don't, perhaps because we have the dignity and admit that this memory exaction issue is our software problem and not a real barrier impose by the system.
Sorry, but ... ;)

A 32-bit operating system uses 32-bit memory addresses, meaning 2^32 addresses = 4GB of RAM.
Typically 2GB of that is reserved for the system itself and 2GB is available in userspace. There are ways to increase the amount of RAM available to userspace [1], but as you didn't mention them, I assume you use the typical settings. And that means that Everything has max 2GB available and is likely running out of RAM (the 300MB/600MB are likely in use by other programs), causing memory allocation errors.

But your original idea - reading indexed content from a disk-based database instead of RAM - might be worth investigating.


[1] You can reconfigure Windows to let the system use less RAM, giving up to 3GB (?) to userspace.
You can also use a memory mapping technology that tells the CPU to use 36-bit memory addresses, giving you 2^36 = 64GB to address (requires the CPU to support that). Can't remember what it was called (almost 20 years ago ...)

EDIT: It is called Physical Address Extension
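
The figures in this post can be checked with a few lines of arithmetic. The helper name is made up for this sketch, and "GB" in the thread is taken to mean 2^30 bytes.

```python
GIB = 2**30  # bytes in one gibibyte (written as "GB" in the thread)

def addressable_gib(address_bits):
    """Size of an address space with the given address width, in GiB."""
    return 2**address_bits // GIB

virtual_total = addressable_gib(32)      # what a 32-bit address can reach
physical_with_pae = addressable_gib(36)  # physical RAM reachable with 36-bit addresses
default_user_space = virtual_total - 2   # 2 GiB of the 4 GiB is reserved for the system

print(virtual_total, physical_with_pae, default_user_space)  # 4 64 2
```

Note that Physical Address Extension widens physical addresses only: each 32-bit process still sees a 32-bit virtual address space, so it does not by itself give a single process more than the 2GB (or 3GB) of userspace described above.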
NotNull
Posts: 5502
Joined: Wed May 24, 2017 9:22 pm

Re: Searching 50000 documents by content in memory?

Post by NotNull »

aviasd wrote: Thu Apr 29, 2021 6:53 am I'm not sure there would be a need for seperate db structure as everything.db looks segmented already.
I don't know the layout of the Everything database, but I bet it is a relational database with several separate tables, that are linked together ("joined").
Neither do I know how the content is stored in the database, but a relational database structure is less suited for that.
There are databases that are optimized for saving and retrieving unstructured data. They use a key-value structure, where value can be an entire document. That is what Google uses for indexing webpages.


[...]So it's kinda complex problem, [...]
Agreed. Other disk-based databases have the same challenges. I (wildly) guess they have solved or at least mitigated that issue.
For example: from what I remember, Google uses 200MB 'blocks' with content, so not the entire database needs to be rewritten.
That could be stored next to Everything.db: 'block1.db', etc.

I think we are very lucky that void was able to include content indexing too, next to all the other new/improved features.
And now that it is there, we want even more! :)


I do feel there's a need to delay immediate indexing of property/content on creation. I.E: if one would like to index sha-256 property for example, every new short lived tmp file or rapidly changing files would be reindexed constantly which has a huge IO/CPU penalty in that scenario.
+1
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Re: Searching 50000 documents by content in memory?

Post by aaathemtheyzzz »

NotNull wrote: Fri Apr 30, 2021 6:47 pm A 32-bit operating system uses 32 bits memory addresses, meaning 2^^32 addresses = 4GB of RAM.
Typically ...
I really don't understand why some people/agents/AI trolls/whatever push that agenda, so that all of their explanations are so dogmatic/biased.
The 4GB issue with x86 XP is only an MS-imposed barrier on XP, while on Server 2003 all memory is available to the OS. Each process can have its own 2GB address space, although there is not even ONE program that is that big. So some top criminal minds push/pay/threaten/whatever the industry and its employees to make code/apps/games/libraries that require a ton of memory. They fill the memory with user data like oversized textures, sounds, fonts, linked lists, b-trees, registry and a bunch of other crap that is rarely used. Then they invent terms like thrashing, all because they refuse to make things clear right from the start. That user data should most of the time be on disk, loaded into memory efficiently when needed. So that's why there is no linked-list/b-tree win32/kernel API that goes straight to disk. Instead, every programmer has to write some kind of linked-list directory/sort variation algorithm that almost always resides in memory, whether it is a game, desktop publishing, sound processing, a database etc. And for that reason every application has its own format when it dumps its linked-list/b-tree/etc. internal structures (which are nothing more than a series of mallocs) to a file. They even abuse their own rules and write whole data sequences to the registry. So no. I don't understand why, when someone is criminal/evil and thus replicates his biased, evil state of mind inside a computer, we need to follow his steps. Maybe he had mother or religion issues. Why do we need to follow his steps to corruption?
NotNull wrote: Fri Apr 30, 2021 6:47 pm But your original idea - reading indexed content from a disk-based database instead of RAM - might be worth investigating.
Thanks. The VB code I have might need investigation if one needs to further optimize it in its C++/whatever equivalent.
NotNull wrote: Fri Apr 30, 2021 6:47 pm You can reconfigure Windows to let the system use less RAM, giving up to 3GB (?) to userspace.
I can use the /3GB switch, but that is not a real or good solution. Even if it works for 50000 docs, what about 80000 docs? Every program must work with the least possible MB of memory. The whole of OpenOffice Writer 4.1.9 alone requires 127MB (12MB+ of which is spelling) and goes up to 167MB when opening a 48MB document.
horst.epp
Posts: 1455
Joined: Fri Apr 04, 2014 3:24 pm

Re: Searching 50000 documents by content in memory?

Post by horst.epp »

Call me a troll, but none of this is necessary with today's hardware and OS versions.
I know the details of memory management because all of our servers have used the /PAE switch since its beginning.
Everything could be changed to store the content in a file, but that is already available with Windows indexing,
which runs fine on current hardware.
For me the benefit of Everything content indexing is the fast response for a smaller set of files to index.
Also void said there will be a way to query Windows search from Everything.

Additionally, there is the fact that SSD drives should not be written to if it can be avoided.
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

The goal of content indexing in Everything is to avoid going to disk.

I understand Everything will not work for large amounts of text content.
This will need to be made clear in the UI.

Everything 1.5.0.1258a adds support for searching your Windows Index with the systemindex: search function.

Advanced Query Syntax is supported.

Examples:
systemindex:foo
systemindex:"my content"
systemindex:kind:music
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »


Macros-1.5a.csv (in %APPDATA%\Everything)

Code: Select all

Name,Search
allcontent<%1>,< content:%1: | notindexed:content:%1:> 
Now this macro can be extended:

Code: Select all

Macro: allcontent<search>
Search: <content:search: |  systemindex:search: | notindexed:content:search: >
or

Code: Select all

/define allcontent<search>=<content:search: | notindexed:content:search: | systemindex:search:> 
BTW: Is the OR short-circuited in this scenario? I.e., if content: finds the text in some file, will that same file be searched using systemindex:?
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

BTW: Is the or short circuited in this scenario? I.E if content: finds the text in some file, will that same file be searched using systemindex: ?
Yes, but I wouldn't call it short circuiting, because..

Searches are re-ordered by search weight.

indexed content: has a small weight (200) and will be performed first.
systemindex: has a medium weight (20,000) and will be performed second.
not-indexed-content: has a huge weight (200,000) and will be performed last.

If indexed-content: finds a match, systemindex: will not be called.
If systemindex: finds a match, notindexed:content: will not be called.
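
The ordering described above can be sketched roughly as follows. This is only an illustration of "sort the OR-ed terms by weight, then stop at the first match"; the weights are the ones quoted in this post, while the function names, the tuple layout and the stub predicates are assumptions made up for the example and say nothing about how Everything is actually implemented.

```python
from typing import Callable, List, Tuple

# Weights as quoted in the post; everything else here is hypothetical.
WEIGHT_INDEXED_CONTENT = 200
WEIGHT_SYSTEMINDEX = 20_000
WEIGHT_NOT_INDEXED_CONTENT = 200_000

# (name, weight, predicate) for one OR-ed search term.
Term = Tuple[str, int, Callable[[str], bool]]


def matches_any(path: str, terms: List[Term], calls: List[str]) -> bool:
    """Evaluate OR-ed terms cheapest first and stop at the first match."""
    for name, _weight, predicate in sorted(terms, key=lambda t: t[1]):
        calls.append(name)  # record which terms actually ran
        if predicate(path):
            return True
    return False


def demo() -> List[str]:
    # Terms are listed in the order the macro wrote them, not by weight.
    terms: List[Term] = [
        ("content", WEIGHT_INDEXED_CONTENT, lambda p: False),
        ("notindexed:content", WEIGHT_NOT_INDEXED_CONTENT, lambda p: True),
        ("systemindex", WEIGHT_SYSTEMINDEX, lambda p: True),
    ]
    calls: List[str] = []
    matches_any("documents/a.txt", terms, calls)
    return calls
```

In this sketch the indexed content term misses, the systemindex term hits, and the expensive not-indexed content term is never run, even though the macro listed it second.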

Disable search reordering.

Documentation on search functions and weights is currently being written.
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

void wrote: Wed May 12, 2021 9:58 am
BTW: Is the or short circuited in this scenario? I.E if content: finds the text in some file, will that same file be searched using systemindex: ?
Yes, but I wouldn't call it short circuiting, because..

Searches are re-ordered by search weight.

indexed content: has a small weight (200) and will be performed first.
systemindex: has a medium weight (20,000) and will be performed second.
not-indexed-content: has a huge weight (200,000) and will be performed last.

If indexed-content: finds a match, systemindex: will not be called.
If systemindex: finds a match, notindexed:content: will not be called.

Disable search reordering.

Documentation on search functions and weights is currently being written.
Huh, that's good to know.

I've had a crash using the combinations of all three filters:
/define allcontent<search>=<content:search: | notindexed:content:search: | systemindex:search:>
then searching:

allcontent:שלום documents\
I've sent the logs/dumps via email.

Note: I suspect there may be a related issue with property indexing. After the crash, reopening Everything re-indexed the properties, even though property indexing had completed successfully before the crash.

If need be, I'll open a new bug thread.
NotNull
Posts: 5502
Joined: Wed May 24, 2017 9:22 pm

Re: Searching 50000 documents by content in memory?

Post by NotNull »

aviasd wrote: Fri May 21, 2021 1:55 pm Note: I suspect there may be a related issue with property indexing. After the crash, reopening Everything re-indexed the properties, even though property indexing had completed successfully before the crash.
When the properties were indexed, the database in RAM was updated. When Everything crashes, it does not have an opportunity to write this state to disk.
When you start Everything again, the (old) database is read from disk to RAM, causing the properties to be re-indexed.
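
A toy model of that behaviour, for anyone who wants to see it in isolation. This is a minimal sketch assuming a single JSON file as the on-disk database; the class name `TinyIndex`, the file name and the methods are hypothetical and are not Everything's actual format or API.

```python
import json
import os
import tempfile


class TinyIndex:
    """Hypothetical in-memory index with an explicit save step."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.properties = {}
        # On start, load whatever state was last written to disk.
        if os.path.exists(db_path):
            with open(db_path, "r", encoding="utf-8") as f:
                self.properties = json.load(f)

    def index_property(self, path, value):
        # Updates the copy in RAM only; nothing is written to disk here.
        self.properties[path] = value

    def save(self):
        # Write to a temp file first, then swap it in, so a crash during
        # the save does not leave a half-written database behind.
        tmp = self.db_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.properties, f)
        os.replace(tmp, self.db_path)


def simulate_crash_then_restart():
    with tempfile.TemporaryDirectory() as d:
        db = os.path.join(d, "index.json")
        first = TinyIndex(db)
        first.index_property("a.mp3", "3:05")
        first.save()
        first.index_property("b.mp3", "4:10")  # indexed, but never saved
        del first  # "crash": the process ends without calling save()
        second = TinyIndex(db)
        return sorted(second.properties)
```

After the simulated crash only `a.mp3` is known to the restarted index, so `b.mp3` would have to be indexed again.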
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

NotNull wrote: Fri May 21, 2021 3:09 pm
When the properties were indexed, the database in RAM was updated. When Everything crashes, it does not have an opportunity to write this state to disk.
When you start Everything again, the (old) database is read from disk to RAM, causing the properties to be re-indexed.
That's a shame, I've assumed that long-running operations are committed to disk or that a rebuild triggers a commit action...
Scanning properties after every upgrade (that needs db rebuild) takes about 20 minutes on my system...

Anyways, if anyone is looking: /update saves to disk without reopening everything, now need to remember doing it... - a crash might not be the only reason why everything would close abnormally.
/update does not seem to be saving to disk. I could not find the command that does...
Can anyone help with that?
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

/update is the command line option.

eg:
Everything.exe -update

/savedb is the search command.

eg: In Everything, type in the following search and press ENTER:
/savedb


Everything 1.5 will save your database to disk daily at 4am.
If this scheduled save is missed, Everything will save your database to disk the next time you close an Everything search window.
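
One way to read the rule above, as a rough sketch: a save counts as "missed" when the last successful save is older than the most recent 4am. The helper names and the idea of checking this when a search window closes are assumptions for illustration, not the actual logic inside Everything.

```python
from datetime import datetime, time, timedelta

SAVE_TIME = time(4, 0)  # daily save time mentioned in the post


def last_scheduled_save(now: datetime) -> datetime:
    """Return the most recent 4am at or before `now`."""
    today = datetime.combine(now.date(), SAVE_TIME)
    return today if now >= today else today - timedelta(days=1)


def save_was_missed(last_save: datetime, now: datetime) -> bool:
    """True if a scheduled 4am save has passed since the last save."""
    return last_save < last_scheduled_save(now)
```

A hypothetical window-close handler would call `save_was_missed(last_save, datetime.now())` and write the database to disk only when it returns True.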

The latest alpha build will also force a rebuild due to newer database format.



db_auto_save_type
db_save_on_rebuild
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

void wrote: Sun May 23, 2021 10:22 am
/savedb is the search command.

eg: In Everything, type in the following search and press ENTER:
/savedb
Ah, thanks,
I guess it's a 1.5 command since it does not work on 1.4 and it didn't appear Here
The latest alpha build will also force a rebuild due to newer database format.



db_auto_save_type
db_save_on_rebuild
Cool 8-)
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

I guess it's a 1.5 command since it does not work on 1.4 and it didn't appear Here
Type in the following search and press ENTER:
about:about

Everything 1.5 Search Commands
void
Developer
Posts: 17068
Joined: Fri Oct 16, 2009 11:31 pm

Re: Searching 50000 documents by content in memory?

Post by void »

allcontent:שלום documents\
I've sent the logs/dumps via email.
Thanks for the debug logs and crash dump aviasd,

Everything 1.5.0.1262a fixes a buffer overflow when searching content as ANSI.
aviasd
Posts: 135
Joined: Sat Oct 07, 2017 2:18 am

Re: Searching 50000 documents by content in memory?

Post by aviasd »

void wrote: Fri May 28, 2021 7:05 am
Everything 1.5.0.1262a fixes a buffer overflow when searching content as ANSI.
Thanks, V1.5.0.1262a fixes the bug above.
Type in the following search and press ENTER:
about:about

Everything 1.5 Search Commands
Also, Thanks for that..
Claudio Salvio
Posts: 13
Joined: Tue Jul 02, 2013 9:45 pm

Re: Searching 50000 documents by content in memory?

Post by Claudio Salvio »

void wrote: Fri May 07, 2021 5:26 am I understand Everything will not work for large amounts of text content.
...
Everything 1.5.0.1258a adds support for searching your Windows Index with the systemindex: search function.
...
Advanced Query Syntax is supported.
...
Dear Void,
Thank you🙏 very much for adding this functionality, I think it is the right way to add even more value to an excellent product like Search Everything (VSE).
From my point of view, VSE was and is unparalleled when it comes to non-content search.
However, fast content search requires indexing, and this is something that Windows Search (WS), originally known as Indexing Service (IS), has been doing for many years.

I consider WS to be another of those gems that Microsoft (MS) left unpolished and without understanding its real value.

I have been using WS/IS for many years, I think since the late 1990s.
It is true that throughout that time I have had to deal with inconsistencies, lack of documentation, shortages of free ifilters, high cost of paid ifilters, performance issues and other problems.
But it is also true that it has provided a great service to me as well as to many of the users whose systems I administered.

It seems to me that it can also be said that many of these problems were solved by MS, were solved by external factors (performance) or can be solved in other ways (for example with third party tools).

Greetings,
Claudio

I🧡Everything!