Learning about content indexing

ChrisGreaves · Post by **ChrisGreaves** » Mon Mar 20, 2023 6:19 pm

THIS IS A LOW-PRIORITY TOPIC.
I am not using content indexing, but am trying to learn how to use it.
I have re-read two topics: "High DB File and Failure at the end of Indexing" and "Enable content indexing?"
My standalone laptop is using 50% of an 850 GB data partition on an SSD and has 16 GB RAM (a long way from my first 640k XT chassis).
My most useful data is held in 16,400 documents and templates, many of which house VBA code.

I plan to experiment with content indexing on files: ext:doc;dot , and relax: I don't have a problem now. I am merely learning.

[Attachment: ContentIndexing_01.png (31.83 KiB)]

I am working from this dialogue box.
When I click APPLY there is a delay of about 30 seconds and "4.93 GB" appears in the lower-right status box.
At this point I suspect that all is well, and my 16,400 documents have their content indexed in RAM.
(1) It seems that a month ago I knew how to “Enable Content Indexing” but now my mind is blank. How do I enable content indexing?
(2) What is indexed? Is it the entire content of my 16,400 documents? Is that the 4.93 GB? If so that suggests about 300 KB per document, which seems high to me. I will do further analysis if required.
(3) At this point that 4.93 GB is stored in 16 GB of Ram, correct?
(4) Is there a way to store the content index from RAM to HDD (for those without an SSD) which might save time tomorrow morning. More to the point, a stored index could be reclaimed as a stable platform of data for future use.
(5) Is there but one content index in RAM, or when I have another shot in 30 minutes time, will I then have two content indexes in RAM?
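
The "about 300 KB per document" figure in question (2) is a plain division of the 4.93 GB status-bar value by the 16,400 documents. A minimal sketch of that arithmetic follows; whether the status bar reports decimal GB or binary GiB is an assumption here, so both are shown.

```python
def average_bytes(total_bytes: float, count: int) -> float:
    """Average size per item in bytes."""
    return total_bytes / count

GB = 1000 ** 3   # decimal gigabyte (assumption about the status bar unit)
GIB = 1024 ** 3  # binary gibibyte (the other possible reading)
DOCUMENTS = 16_400

avg_decimal = average_bytes(4.93 * GB, DOCUMENTS)
avg_binary = average_bytes(4.93 * GIB, DOCUMENTS)

print(f"{avg_decimal / 1000:.0f} KB per document if 4.93 GB is decimal")
print(f"{avg_binary / 1024:.0f} KiB per document if 4.93 GB is binary")
```

Either way the average lands in the low 300s of kilobytes, so the estimate in the question holds as arithmetic; what the 4.93 GB actually measures is a separate matter.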

I have some comments on the dialogue box, but I think I should hold off on those until I have a better grasp of what Content Indexing is.

Why should I be bothered with this? Because I have a 900 GB cumulative backup drive (FAT, which perhaps should be converted to NTFS), and that has files that go back fifteen years or so.
In the back of my mind is my searchable library of VBA code. I have an application which scours a hard drive for every scrap of VBA code (DOC, DOT, XLS, XLA, TXT, BAS, CLS, FRM and even text digests from forums) and loads it into a humungous string which can be searched by the InStr() function. A work of love, and I am curious about using Everything to replace my application.

Thanks, Chris

horst.epp · Post by **horst.epp** » Tue Mar 21, 2023 8:37 am

1. I don't understand what the problem is; your screenshot shows where you enable/disable content indexing.
2. All *.doc files in the whole T:\ tree
3. Yes, but the content index is of course much smaller than the data (see the size of your database)
4. No, but you should enable the Windows indexer for large amounts of data; you can query it from Everything using si:
5. You never have more than one database in RAM for one Everything instance; it's automatically updated all the time.

Post by **void** » Tue Mar 21, 2023 9:00 am

Please note that setting the maximum size is per file, not the entire content index size.

(1) It seems that a month ago I knew how to “Enable Content Indexing” but now my mind is blank. How do I enable content indexing?

Tools -> Options -> Content -> Index file content.

(2) What is indexed? Is it the entire content of my 16,400 documents? Is that the 4.93 GB? If so that suggests about 300 KB per document, which seems high to me. I will do further analysis if required.

Any files in your index that match the specified filters, eg: *.doc on the T: drive.
Only the text content is indexed.
For example, if you have a word document that has a 2 MB image and a couple paragraphs of text, only the text is indexed.
Minimal RAM will be used to store the text.

(3) At this point that 4.93 GB is stored in 16 GB of Ram, correct?

No, that is the file size, the content index will be much smaller.
The indexed content size can be obtained from Tools -> Debug -> Statistics -> File data size.
(This size includes other data, such as the filename, file size and date modified information)

(4) Is there a way to store the content index from RAM to HDD (for those without an SSD) which might save time tomorrow morning. More to the point, a stored index could be reclaimed as a stable platform of data for future use.

No, this defeats the purpose of content indexing in Everything.
If Everything needs to access that database on disk, you might as well disable content indexing and just read the content directly from disk.
On modern hardware, there would be no noticeable performance difference between accessing the Everything database from disk vs directly reading the content from disk.

(5) Is there but one content index in RAM, or when I have another shot in 30 minutes time, will I then have two content indexes in RAM?

Everything only has one content index.
The single content index is always stored in RAM.
If you rebuild your database, Everything will keep the old content index and new content index at the same time to avoid going back to disk.
Once the new index is built, the old content index is freed.
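
One way to picture the rebuild behaviour described above is a build-then-swap: the old index stays in memory while the new one is assembled, so unchanged files do not need to be read from disk again. The sketch below is only an illustration of that idea under stated assumptions. The `rebuild` function, the `(stamp, text)` layout and the "unchanged if the stamp matches" rule are all hypothetical and are not taken from Everything's implementation.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical layout: path -> (modified stamp, indexed text).
Index = Dict[str, Tuple[int, str]]

def rebuild(old: Index,
            files: Iterable[Tuple[str, int]],
            read_text: Callable[[str], str]) -> Tuple[Index, int]:
    """Build a new index, reusing text from the old one where possible."""
    new: Index = {}
    disk_reads = 0
    for path, stamp in files:
        cached = old.get(path)
        if cached is not None and cached[0] == stamp:
            new[path] = cached                     # reuse text already in RAM
        else:
            new[path] = (stamp, read_text(path))   # changed or new file: go to disk
            disk_reads += 1
    return new, disk_reads

old_index: Index = {"a.doc": (1, "alpha"), "b.doc": (1, "beta")}
current_files = [("a.doc", 1), ("b.doc", 2), ("c.doc", 1)]

# Both indexes exist side by side until the assignment on the last line.
new_index, reads = rebuild(old_index, current_files, lambda p: "fresh text of " + p)
old_index = new_index  # the previous index is now unreferenced and can be freed
```

In this toy run only the changed file and the new file trigger a read; the unchanged one is carried over from the old index.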

Why should I be bothered with this?

Generally, content indexing is not recommended.
If you have a good SSD, reading content directly from disk is going to be very fast.

If you want instant content searching for a few files, please consider content indexing. (An example might be a couple of hundred MB of source code files.)

Another use would be to index content on volumes that are often offline.

Consider using Windows indexing for content and accessing that information in Everything with si:

ChrisGreaves · Post by **ChrisGreaves** » Tue Mar 21, 2023 1:52 pm

void wrote: ↑Tue Mar 21, 2023 9:00 am Please note that setting the maximum size is per file, not the entire content index size.

Thank you. Might the dialogue box say this? /Maximum Size/Maximum Size of a file/ or similar.

(1) It seems that a month ago I knew how to “Enable Content Indexing” but now my mind is blank. How do I enable content indexing?
Tools -> Options -> Content -> Index file content.

Thank you. I think I picked up “enabling” by reading the Forums. In my ignorant state I was thus led to hunt for the word “Enable” in the dialogue box.

(2) What is indexed? Is it the entire content of my 16,400 documents? Is that the 4.93 GB? If so that suggests about 300 KB per document, which seems high to me. I will do further analysis if required.
Any files in your index that match the specified filters, eg: *.doc on the T: drive. Only the text content is indexed. For example, if you have a word document that has a 2 MB image and a couple paragraphs of text, only the text is indexed. A minimal amount of RAM will be used to store the text.

Thanks too for this. So my “T:\ *.doc” will obtain all MSWord documents on my data partition.
I understand that {images} will be avoided.
What about VBA code? Is that considered to be “text”, or is it ignored because it is part of a VBA project (or any other internal construct, such as the strings within {SEQ} field codes and many others)?
Shorter question: is the text that gets indexed equivalent to what would be carried forward with a Copy/Paste via Notepad.exe? (I think the answer to this question is “yes”.)

(3) At this point that 4.93 GB is stored in 16 GB of Ram, correct?
No, that is the file size, the content index will be much smaller. The indexed content size can be obtained from Tools -> Debug -> Statistics -> File data size.

“the file size”, which file?
I understand that the content index will be much smaller – each unique word appears only once, correct?
I think that “01d95b561165e21a” is your internal identifier to this index.
For reference, this morning’s search of “T:\ *.doc” says “15,497 objects; 4.66 GB”. I don’t see anything like that in the debug, and between you and me I am not in the business of decoding debug data from Everything.

so I am ready to back off a bit at this time :wipes brow with relief:


Database
Location:	C:\Users\Chris077\AppData\Local\Everything\Everything-1.5a.db
Indexed file properties:	Name, Path, Size, Date Modified, Content
Indexed folder properties:	Name, Path, Size, Date Modified
Fast sorts:	Name, Path, Size, Date Modified
Folder count:	74,113
File count:	484,419
Total item count:	558,532
FAT index count:	2
NTFS index count:	1
ReFS index count:	0
Network drive index count:	0
Folder index count:	0
File list index count:	0
Network index count:	0
Total index count:	3
Folder data size:	4,233,642 bytes
File data size:	165,557,761 bytes
Total data size:	169,791,403 bytes
Average folder data size:	57 bytes
Average file data size:	341 bytes
Folder index size:	2,371,616 bytes
File index size:	15,501,408 bytes
Total index size:	17,873,024 bytes
Total size:	187,664,427 bytes
Folders created:	11,037
Folders modified:	1,018,500
Folders deleted:	4,897
Folders moved:	235
Files created:	122,198
Files modified:	493,901
Files deleted:	77,666
Files moved:	20,440
Journal
Enabled:	Yes
ID:	01d95b561165e21a
Size:	103,442 bytes
Max size:	1,048,576 bytes
First item ID:	1
Next item ID:	637
Item count:	636
Build
Count:	20
Total duration:	01:14
Minimum duration:	0.010010 seconds
Maximum duration:	21.450872 seconds
Average duration:	3.729881 seconds
Last duration:	1.266843 seconds
Last build date:	03-20-2023 3:32 PM
Last rebuild reason:	Property structure changed.
Update
Count:	171,731
Total duration:	00:11
Minimum duration:	0.000000 seconds
Maximum duration:	0.117208 seconds
Average duration:	0.000069 seconds
Last duration:	0.007821 seconds
Last update date:	03-21-2023 9:52 AM
Load
Count:	22
Total duration:	00:03
Minimum duration:	0.073363 seconds
Maximum duration:	0.415929 seconds
Average duration:	0.137348 seconds
Last duration:	0.415929 seconds
Last load date:	03-21-2023 9:43 AM
Save
Count:	34
Total duration:	00:06
Minimum duration:	0.004985 seconds
Maximum duration:	0.553827 seconds
Average duration:	0.185798 seconds
Last duration:	0.334519 seconds
Last save date:	03-21-2023 7:58 AM
Next scheduled save date:	03-22-2023 4:00 AM
Total bytes written:	1,255,276,680
Query
Count:	568
Total duration:	00:04
Minimum duration:	0.000002 seconds
Maximum duration:	0.981268 seconds
Average duration:	0.007662 seconds
Last duration:	0.000745 seconds
Last query date:	03-21-2023 10:47 AM
Total result count:	41,130,285
Maximum result count:	1,473,811
Average result count:	72,412
Last result count:	15,497
Sort
Count:	7
Total duration:	00:00
Minimum duration:	0.000453 seconds
Maximum duration:	0.035992 seconds
Average duration:	0.007058 seconds
Last duration:	0.001832 seconds
Last sort date:	03-18-2023 3:25 PM
FAT Index
Volume name:	\\?\Volume{130b3437-919d-4d7f-ac7b-923bf5507013}
Path:	F:
Root:
Include only:
Drive type:	Fixed
Label:	7092187927
Index number:	0
Date indexed:	03-20-2023 2:37 PM
Out of date:	No
Online:	No
Disk device index:
Multithreaded:	Separate device thread
Folder count:	1,730
File count:	9,471
Last rescan date:	03-21-2023 9:51 AM
Last rescan successful:	No
Last successful rescan date:	03-20-2023 2:37 PM
Last successful rescan duration:	00:00:01.416
Next scheduled rescan date:	03-22-2023 3:00 AM
FAT Index
Volume name:	\\?\Volume{2fbb81a2-c283-11ed-83dd-141333b95f06}
Path:	T:
Root:
Include only:
Drive type:	Fixed
Label:
Index number:	1
Date indexed:	03-18-2023 4:15 PM
Out of date:	No
Online:	Yes
Disk device index:
Multithreaded:	Separate device thread
Folder count:	24,919
File count:	240,394
Last rescan date:	03-21-2023 9:52 AM
Last rescan successful:	Yes
Last successful rescan date:	03-21-2023 9:52 AM
Last successful rescan duration:	00:00:11.735
Next scheduled rescan date:	03-22-2023 3:00 AM
NTFS Index
Volume name:	\\?\Volume{dba064f0-1ec8-4e09-a933-1c3148a49947}
Path:	C:
Root:
Include only:
Drive type:	Fixed
Label:	Windows
Index number:	2
Date indexed:	03-18-2023 4:15 PM
Out of date:	No
Online:
Disk device index:	1
Multithreaded:	Separate device thread
Folder count:	47,464
File count:	234,554
USN Journal ID:	01d959c8c41f95d2
Next USN:	000000000001a068
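
The size lines in the statistics above can be cross-checked with a few lines of arithmetic. The sketch below copies a handful of the values from the dump, parses them, and recomputes the averages and totals. The `parse_stats` helper is hypothetical, written only for this illustration, and it assumes the averages are rounded down.

```python
import re

# A few lines copied from the statistics dump (label, tab, value).
STATS = """\
Folder count:\t74,113
File count:\t484,419
Folder data size:\t4,233,642 bytes
File data size:\t165,557,761 bytes
Total data size:\t169,791,403 bytes
Folder index size:\t2,371,616 bytes
File index size:\t15,501,408 bytes
Total index size:\t17,873,024 bytes
Total size:\t187,664,427 bytes
"""

def parse_stats(text: str) -> dict:
    """Map each label to its integer value, dropping thousands separators."""
    values = {}
    for line in text.splitlines():
        label, _, raw = line.partition(":")
        match = re.search(r"[\d,]+", raw)
        if match:
            values[label.strip()] = int(match.group().replace(",", ""))
    return values

stats = parse_stats(STATS)

avg_file = stats["File data size"] // stats["File count"]
avg_folder = stats["Folder data size"] // stats["Folder count"]
file_data_mb = stats["File data size"] / 1_000_000

print(f"Average file data size: {avg_file} bytes")
print(f"Average folder data size: {avg_folder} bytes")
print(f"File data size: {file_data_mb:.1f} MB")
```

The recomputed averages match the 341 and 57 bytes reported in the dump. The file data size works out to roughly 166 MB for all 484,419 files (names, sizes, dates and indexed content together), which is far below the 4.66 GB of .doc files on disk.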

(4) Is there a way to store the content index from RAM to HDD (for those without an SSD) which might save time tomorrow morning. More to the point, a stored index could be reclaimed as a stable platform of data for future use.
No, this defeats the purpose of content indexing in Everything. If Everything needs to access that database on disk, you might as well disable content indexing and just read the content directly from disk. There would be no noticeable performance difference between accessing the Everything database from disk vs. directly reading the content from disk.

For the user with a HDD performance MIGHT be an issue. For that same user with 50,000 MSWord documents, performance MIGHT be an issue.
I understand (but have not yet used Windows indexes) that for truly massive domains, “si” is preferable.
But what about the user who wants a stable testing platform? They want to set up a content index (at some production expense), and then return to that content index time and time again, perhaps over a period of several weeks while they perform analysis on content?
I have a program that converts and cleans 10,000 documents over night. In the past I would have loved a snapshot of the content so that the client and I could go back and forth on creating rules for fixing text.
I was thinking of something that is essentially a binary dump to disk, and the ability to reload that dump, so that I could continue analysis day after day.
I have no need for this now, but am trying to understand Content indexing, is all.
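
To make the "binary dump and reload" idea concrete, here is a minimal sketch outside Everything entirely. It assumes the text has already been extracted into a dictionary of path to text; the `save_snapshot`, `load_snapshot` and `search` helpers, the file name and the sample documents are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def save_snapshot(snapshot: dict, target: Path) -> None:
    """Write the path -> text mapping to disk as one binary dump."""
    with open(target, "wb") as handle:
        pickle.dump(snapshot, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_snapshot(source: Path) -> dict:
    """Reload a dump written by save_snapshot."""
    with open(source, "rb") as handle:
        return pickle.load(handle)

def search(snapshot: dict, needle: str) -> list:
    """Case-insensitive substring search over the frozen content."""
    needle = needle.lower()
    return sorted(path for path, text in snapshot.items() if needle in text.lower())

# Made-up sample content standing in for extracted document text.
snapshot = {
    r"T:\docs\report.doc": "Quarterly report with Sub CleanText() macro",
    r"T:\docs\letter.doc": "Dear client, the rules for fixing text follow",
}

with tempfile.TemporaryDirectory() as folder:
    dump_file = Path(folder) / "content_snapshot.pkl"
    save_snapshot(snapshot, dump_file)
    restored = load_snapshot(dump_file)

hits = search(restored, "fixing text")
print(hits)
```

The reloaded snapshot is the same on every run, whatever has happened to the source documents since, which is the "stable testing platform" property asked about. A pickle file should only be reloaded from a source you trust.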

(5) Is there but one content index in RAM, or when I have another shot in 30 minutes time, will I then have two content indexes in RAM?
Everything only has one content index. The single content index is always stored in RAM. If you rebuild your database, Everything will keep the old content index and new content index at the same time to avoid going back to disk. Once the new index is built, the old content index is freed.

Thank you; understood!

Why should I be bothered with this?
Generally, content indexing is not recommended. If you have a good SSD, reading content directly from disk is going to be very fast. If you want instant content searching for a few files, please consider content indexing. (An example might be a couple 100 MB of source code files) Another use would be to index content on volumes that are often offline.

Thank you. So a threshold region might be 100 MB to 200 MB: anything below that, use the plain garden variety of “Content:”; above that region, start using Windows “si”. I don’t hold you to that range, but I think a user needs a rough guideline as to when to, and when not to.

(later) Is the performance hit likely to be gradual ("Uh-oh! Better start using "si"") or is a user possibly going to face some sort of error, such as a crash of Everything?

(later still) Should the user make use of "si" always, as being (I imagine) simpler to set up whatever it is just once, and use one consistent technique?

I don’t apologize for the length of these posts. I have gained greatly in my understanding of how this all works; even better, I (with my SSD!) have learned not to worry about “content indexing” at this time.
Many, many thanks.
Chris

ChrisGreaves · Post by **ChrisGreaves** » Tue Mar 21, 2023 6:43 pm

horst.epp wrote: ↑Tue Mar 21, 2023 8:37 am 1. I don't understand what the problem is; your screenshot shows where you enable/disable content indexing.
2. All *.doc files in the whole T:\ tree
3. Yes, but the content index is of course much smaller than the data (see the size of your database)
4. No, but you should enable the Windows indexer for large amounts of data; you can query it from Everything using si:
5. You never have more than one database in RAM for one Everything instance; it's automatically updated all the time.

Hi Horst. You will see some replies to your questions in my reply to Void.
My experience with indexing applications is that sometimes the index size is Greater than the source data, for good reason: Back in 1997 I wrote an application to scour a hard drive soaking up every detected scrap of VBA code: DOC, DOT, XLS of course and BAS, CLS and FRM files, but also TXT files and daily digests from technical forums. The index is a humongous string (37 MB is the largest I am using right now). Speed of access (searching the library) and display was critical, so the lines of VBA code were stored as-is with the auxiliary pointers added.
The search mechanism was based on the Word VBA Instr() function.
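
For readers who have not seen this style of index, here is a rough sketch of the "one big string plus pointers" approach, written in Python rather than VBA, with `str.find` playing the role of InStr(). The `CodeLibrary` class and the sample snippets are hypothetical; they are not the 1997 application.

```python
from bisect import bisect_right

class CodeLibrary:
    """All code in one string, plus offsets recording where each source starts."""

    def __init__(self):
        self._parts = []
        self._starts = []   # start offset of each source inside the big string
        self._names = []    # source name for each start offset
        self._length = 0
        self.text = ""

    def add(self, name: str, code: str) -> None:
        chunk = code + "\n"
        self._starts.append(self._length)
        self._names.append(name)
        self._parts.append(chunk)
        self._length += len(chunk)

    def build(self) -> None:
        self.text = "".join(self._parts)

    def find_all(self, needle: str) -> list:
        """Return (source name, offset) for every occurrence of needle."""
        hits = []
        pos = self.text.find(needle)
        while pos != -1:
            source = self._names[bisect_right(self._starts, pos) - 1]
            hits.append((source, pos))
            pos = self.text.find(needle, pos + 1)
        return hits

library = CodeLibrary()
library.add("Module1.bas", "Sub A()\n  x = InStr(s, t)\nEnd Sub")
library.add("Utils.cls", "Function B()\n  B = InStr(1, a, b)\nEnd Function")
library.build()

hits = library.find_all("InStr(")
for source, offset in hits:
    print(source, offset)
```

The pointer table is what lets the index grow beyond the raw text: the code lines are stored as-is, and the offsets are extra data added on top.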

Cheers, Chris

horst.epp · Post by **horst.epp** » Tue Mar 21, 2023 9:04 pm

ChrisGreaves wrote: ↑Tue Mar 21, 2023 6:43 pm Hi Horst. You will see some replies to your questions in my reply to Void.
My experience with indexing applications is that sometimes the index size is Greater than the source data, for good reason: Back in 1997 I wrote an application to scour a hard drive soaking up every detected scrap of VBA code: DOC, DOT, XLS of course and BAS, CLS and FRM files, but also TXT files and daily digests from technical forums. The index is a humongous string (37 MB is the largest I am using right now). Speed of access (searching the library) and display was critical, so the lines of VBA code were stored as-is with the auxiliary pointers added.
The search mechanism was based on the Word VBA Instr() function.
Cheers, Chris

I have a directory tree indexed with these file types:
*.doc;*.docx;*.txt;*.xls;*.xlsx;*.odt;*.txt;*.eml;*.csv;*.md;*.ini;*.pdf;*.pptx;*.ppt
The tree is about 820 MB.
The database, which contains this content plus the rest of the indexed files, folders and some properties,
is about 75 MB in size.

This is my important private data; the rest is indexed by the Windows indexer.

tuska · Post by **tuska** » Tue Mar 21, 2023 11:07 pm

Here is another example:

1 Content
1.1 Index file content
1.2 Include only folders: D:\;C:\Everything\;C:\totalcmd
1.3 Include only files: *.csv;*.doc;*.docx;*.eml;*.ini;*.ion;*.md;*.odt;*.pdf;*.pps;*.ppt;*.pptx;*.txt;*.xls;*.xlsm;*.xlsx;*.xlt;*.xltm;*.xltx
1.3.1 Maximum size: 0 MB

2 Properties
2.1 Include files
2.2 Include only files:
2.3 Maximum size: 0 MB
2.4 Fast Sort
2.4.1 Authors: *.doc;*.docx;*.dot;*.dotx;*.pdf;*.ppt;*.pptx;*.xls;*.xlsm;*.xlsx;*.xlt;*.xltm;*.xltx
2.4.2 Comment: *.doc;*.docx;*.dot;*.dotx;*.jpg;*.pdf;*.ppt;*.pptx;*.xls;*.xlsm;*.xlsx;*.xlt;*.xltm;*.xltx
2.4.3 Container Filenames: *.zip
2.4.4 Date Taken: D:\**.jpg
2.4.5 Media Created: D:\**.3gp;D:\**.amr;D:\**.asf;D:\**.asx;D:\**.avi;D:\**.flv;D:\**.ifo;D:\**.mkv;D:\**.mov;D:\**.mp4;D:\**.mpg;D:\**.swf;D:\**.ts;D:\**.vob;D:\**.webm;D:\**.wm;D:\**.wmv
2.4.6 Subject: *.doc;*.docx;*.dot;*.dotx;*.pdf;*.ppt;*.pptx;*.xls;*.xlsm;*.xlsx;*.xlt;*.xltm;*.xltx
2.4.7 Tags: *.doc;*.docx;*.dot;*.dotx;*.pdf;*.ppt;*.pptx;*.xls;*.xlsm;*.xlsx;*.xlt;*.xltm;*.xltx
2.4.8 Title: *.doc;*.docx;*.dot;*.dotx;*.pdf;*.ppt;*.pptx;*.xls;*.xlsm;*.xlsx;*.xlt;*.xltm;*.xltx

3 Tools – Debug – Statistics
viewtopic.php?p=40442#p40442
Content indexing in Everything is designed for indexing less than 1GB of text.
3.1 File data size: 1 215 441 571 bytes
3.2 File index size:       60 040 064 bytes

4 Everything 1.5.0.1340a (x64)
4.1 Name/Ext         Size                          Date
      Everything.db 1 227 761 511 bytes 21.03.2023 23:29

5 Task-Manager
5.1 RAM: 21%
5.2 Everything: RAM 1 249,2 MB
5.3 Everything Service: RAM 1,3 MB

6 Windows Search (si:)
viewtopic.php?p=38346#p38346
Example & Windows -> Setup – Indexing Options (Example only!)
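
As a rough check of the figures in section 3 against the "less than 1GB of text" guidance, the sketch below parses the space-separated byte counts and compares them. It assumes a decimal gigabyte and treats the file data size as an upper bound on the indexed text, since that figure also includes names, sizes and dates. The `parse_bytes` helper is hypothetical.

```python
def parse_bytes(text: str) -> int:
    """Parse a byte count written with spaces as thousands separators."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)

GUIDELINE_BYTES = 1_000_000_000  # "1GB" read as a decimal gigabyte (assumption)

file_data = parse_bytes("1 215 441 571 bytes")
file_index = parse_bytes("60 040 064 bytes")

over_by = file_data - GUIDELINE_BYTES
print(f"File data size is {file_data / GUIDELINE_BYTES:.2f} times the guideline")
print(f"That is {over_by:,} bytes above 1 GB")
print(f"File data plus file index: {file_data + file_index:,} bytes")
```

On these numbers the example configuration sits a little above the guideline, which fits the advice elsewhere in this thread to keep an eye on RAM use in Task Manager.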

__________________________________________________________________________________
'Everything' 1.5.0.1340a (x64) | Windows 11 Pro (x64) Version 22H2 (OS build 22621.1413)
Processor Intel(R) Core(TM) i5-12600K 12th Gen, 3.70 GHz, 10 Cores, 16 Logical Processors
Installed RAM 32.0 GB (31.8 GB usable) | Everything <=> Total Commander

ChrisGreaves · Post by **ChrisGreaves** » Wed Mar 29, 2023 1:59 pm

tuska wrote: ↑Tue Mar 21, 2023 11:07 pm Here is another example:

Thank you Tuska, and I apologize for my tardy response.
I have saved these examples and will work through them one by one!
Cheers, Chris

tuska · Post by **tuska** » Wed Mar 29, 2023 2:43 pm

ChrisGreaves wrote: ↑Wed Mar 29, 2023 1:59 pm
tuska wrote: ↑Tue Mar 21, 2023 11:07 pm Here is another example:
Thank you Tuska, and I apologize for my tardy response.
I have saved these examples and will work through them one by one!
Cheers, Chris

Hi,
There is nothing to apologize for.
If you work through this occasionally, then I recommend that you check the RAM consumption
in the task manager from time to time.

doskoi · Post by **doskoi** » Thu May 11, 2023 5:48 am

Thanks to this thread, I now understand content indexing.
Please tell us about the future development policy for the content option.

Q: Are there any plans to work with external OSS that specializes in content indexes?

I'm currently using DocFetcher to browse my 5000 pages of personal social media backup content locally.
It's somewhat old OSS, but it's useful because you can index and search text content without HTML tags.
With the current behaviour of Everything, every time you run a search with the content: option, the indexing task is performed again, so unfortunately it is not practical.
My "search" UX on Windows is only Everything, so I'm looking for a way for the two to coexist.

1.5 is great!
We would like to thank the developers and members for their support.

Post by **void** » Thu May 11, 2023 8:30 am

Thank you for your feedback.

Everything can search the system index with si:

Does Windows have access to DocFetcher content index?

I will consider a search command to query your DocFetcher content index. (if possible)
Thank you for the suggestion.

doskoi · Post by **doskoi** » Thu May 11, 2023 3:10 pm

Mr. VOID, thank you for the prompt reply from the author himself.
Yes, DocFetcher is multi-platform and available for Windows.
DocFetcher is software that allows me to create a permanent index simply by specifying a folder, even though I lack technical knowledge.

However, DocFetcher is currently an unmaintained application, so another OSS tool that stores the index would be nice.

I think Elasticsearch is a modern OSS option that could be used with Everything.

Post by **therube** » Thu May 11, 2023 3:48 pm

(Heh.
I look at, https://docfetcher.sourceforge.net/en/index.html, & I see simple, clean.
I look at, https://www.elastic.co/, https://www.elastic.co/elastic-stack/, & I see busy, cloud, more of "just what we need", heh - but not for me!)
