Note: This article is reproduced from serverfault.com.


Posted on 2020-11-27 04:32:20

I have close to 10K very small JSON files, and I would like to provide search functionality over them. Since these JSON files are fixed for a specific release, I am thinking of pre-indexing the files and loading the index during website startup. I don't want to use an external search engine.

I am looking for libraries that support this. Lucene.NET is one popular library, but I am not sure whether it supports loading a pre-built index.

  • Console app: index the JSON documents, store the resulting index (probably in a single file), and save it to a file storage service such as S3.
  • ASP.NET Core app: load the index file and respond to queries.
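The first step above (the console app) could be sketched roughly as follows. This is only an illustration under assumptions: it targets Lucene.NET 4.8 (the Lucene.Net and Lucene.Net.Analysis.Common packages), the class name IndexBuilder and the field names "path" and "body" are hypothetical, and each JSON file is indexed as raw text because the structure of the files is not stated. The upload to S3 is left out.

```csharp
using System.IO;
using System.IO.Compression;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class IndexBuilder
{
    // Indexes every *.json file in jsonDir into indexDir, then packs the index
    // folder into a single zip archive. Returns the number of indexed files.
    // archivePath must not be located inside indexDir.
    public static int BuildAndPack(string jsonDir, string indexDir, string archivePath)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        int count = 0;

        using (var directory = FSDirectory.Open(indexDir))
        using (var analyzer = new StandardAnalyzer(version))
        {
            // CREATE replaces any index left over from a previous release.
            var config = new IndexWriterConfig(version, analyzer) { OpenMode = OpenMode.CREATE };
            using (var writer = new IndexWriter(directory, config))
            {
                foreach (string file in System.IO.Directory.EnumerateFiles(jsonDir, "*.json"))
                {
                    var doc = new Document
                    {
                        // Stored but not tokenized: maps a search hit back to its file.
                        new StringField("path", Path.GetFileName(file), Field.Store.YES),
                        // Tokenized but not stored: makes the raw JSON text searchable.
                        new TextField("body", File.ReadAllText(file), Field.Store.NO)
                    };
                    writer.AddDocument(doc);
                    count++;
                }
                writer.Commit();
            }
        }

        // A Lucene index is a folder of several files; zipping it yields one object to upload.
        if (File.Exists(archivePath)) File.Delete(archivePath);
        ZipFile.CreateFromDirectory(indexDir, archivePath);
        return count;
    }
}
```

A call such as IndexBuilder.BuildAndPack("data", "index", "index.zip") would leave index.zip ready to be uploaded by whatever storage client is in use.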

I am not sure whether this is possible. What options are available?

NightOwl888 2020-11-28 09:38:01

Since S3 is not a .NET-specific technology and Lucene.NET is a line-by-line port of Lucene, you can expand your search to include Lucene-related questions. There is an answer here that points to an S3 implementation meant for Lucene that could be ported to .NET. But, by the author's own admission, performance of the implementation is not great.

NOTE: I don't consider this to be a duplicate question due to the fact that the answer most appropriate to you is not the accepted answer, since you explicitly stated you don't want to use an external solution.

There are a couple of implementations for Lucene.NET that use Azure instead of AWS here and here. You may be able to get some ideas that help you to create a more optimal solution for S3, but creating your own Directory implementation is a non-trivial task.

Can IndexReader read index file from in-memory string?

It is possible to use a RAMDirectory, which has a copy constructor that copies the entire index from disk into memory. The copy constructor is only useful if your files are on disk, though. You could potentially read the files from S3 and put them into a RAMDirectory. This option is fast for small indexes but will not scale if your index grows over time. It is also not optimized for high-traffic websites that have multiple concurrent threads performing searches.
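As a rough sketch of both routes, assuming Lucene.NET 4.8: LoadFromDisk uses the copy constructor, and AddFile writes one index file from an arbitrary stream (for example a stream obtained from S3) into a RAMDirectory. The class and method names are hypothetical, and how the streams and file names are obtained is not shown.

```csharp
using System.IO;
using Lucene.Net.Store;

public static class InMemoryIndex
{
    // Copy constructor route: reads every file of an on-disk index into memory.
    public static RAMDirectory LoadFromDisk(string indexDir)
    {
        using (var onDisk = FSDirectory.Open(indexDir))
        {
            // The public copy constructor does not dispose the source directory,
            // so it is disposed here once the copy has finished.
            return new RAMDirectory(onDisk, IOContext.DEFAULT);
        }
    }

    // Stream route: writes one index file into the RAMDirectory under its
    // original file name (for example "segments_1").
    public static void AddFile(RAMDirectory ram, string name, Stream source)
    {
        using (IndexOutput output = ram.CreateOutput(name, IOContext.DEFAULT))
        {
            var buffer = new byte[8192];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.WriteBytes(buffer, 0, read);
            }
        }
    }
}
```

Once every index file has been added, the directory can be opened with DirectoryReader.Open(ram) like any other directory.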

From the documentation:

Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.

It is recommended to materialize large indexes on disk and use MMapDirectory, which is a high-performance directory implementation working directly on the file system cache of the operating system, so copying data to heap space is not useful.

When you call the FSDirectory.Open() method, it chooses a directory that is optimized for the current operating system. In most cases it returns MMapDirectory, which is an implementation that uses the System.IO.MemoryMappedFiles.MemoryMappedFile class under the hood with multiple views. This option will scale much better if the size of the index is large or if there are many concurrent users.
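A minimal sketch of a search service built on FSDirectory.Open() might look like the following. It assumes Lucene.NET 4.8 with the Lucene.Net.Analysis.Common and Lucene.Net.QueryParser packages, and an index with a stored "path" field and a tokenized "body" field. Those field names and the class name IndexSearchService are hypothetical.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public sealed class IndexSearchService : System.IDisposable
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;
    private readonly FSDirectory _directory;
    private readonly DirectoryReader _reader;
    private readonly IndexSearcher _searcher;

    public IndexSearchService(string indexDir)
    {
        // Open() picks the implementation for the current platform.
        _directory = FSDirectory.Open(indexDir);
        _reader = DirectoryReader.Open(_directory);
        // IndexSearcher is safe to share between threads, so one instance is kept.
        _searcher = new IndexSearcher(_reader);
    }

    // Name of the directory implementation that Open() selected.
    public string DirectoryType => _directory.GetType().Name;

    // Returns the stored "path" value of the best-matching documents.
    public IReadOnlyList<string> Search(string queryText, int maxHits = 10)
    {
        using (var analyzer = new StandardAnalyzer(MatchVersion))
        {
            var parser = new QueryParser(MatchVersion, "body", analyzer);
            Query query = parser.Parse(queryText);
            TopDocs hits = _searcher.Search(query, maxHits);

            var paths = new List<string>();
            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                paths.Add(_searcher.Doc(hit.Doc).Get("path"));
            }
            return paths;
        }
    }

    public void Dispose()
    {
        _reader.Dispose();
        _directory.Dispose();
    }
}
```

In an ASP.NET Core app, one such instance could be registered as a singleton so that all requests share the same reader. Because the index is fixed per release, the reader never needs to be reopened.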

To use Lucene.NET's built-in index file optimizations, you must put the index files on a medium that can be read like a normal file system. Rather than trying to roll a Lucene.NET solution that uses S3's APIs, you might want to look into using S3 as a file system instead. I am not sure, though, how that would perform compared to a local file system.
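A different way to meet the file-system requirement, offered here as an assumption rather than something tested against S3, is to copy the pre-built index to local disk once at startup and open it from there. The sketch below assumes the index was packed into a zip archive and that the archive is available as a stream. The class name IndexMaterializer is hypothetical, and fetching the stream from storage is not shown.

```csharp
using System.IO;
using System.IO.Compression;
using Lucene.Net.Store;

public static class IndexMaterializer
{
    // Unpacks a zipped index into targetDir and opens it as a file-system directory.
    public static FSDirectory Materialize(Stream archive, string targetDir)
    {
        // Start from an empty folder so files from an older release cannot leak in.
        if (System.IO.Directory.Exists(targetDir))
        {
            System.IO.Directory.Delete(targetDir, recursive: true);
        }
        System.IO.Directory.CreateDirectory(targetDir);

        // Disposing the ZipArchive also disposes the archive stream.
        using (var zip = new ZipArchive(archive, ZipArchiveMode.Read))
        {
            zip.ExtractToDirectory(targetDir);
        }

        return FSDirectory.Open(targetDir);
    }
}
```

The returned directory can then be passed to DirectoryReader.Open, so searches run against local files rather than against S3.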