This article is reproduced from serverfault.com.

Solr: Document size inexplicably large

Published on 2020-11-27 11:20:48

I updated to Solr 8.4.0 (from 6.x) on a test server and reindexed (this is an index of a complicated Moodle system, mainly lots of very small documents). It worked initially, but later ran out of disk space, so I deleted everything and tried indexing a smaller subset of the data; it still ran out of disk space.

Looking at the segment info chart, the first segment seems reasonable:

Segment _2a1u: #docs: 603,564 #dels: 1 size: 5,275,671,226 bytes age: 2020-11-25T22:10:05.023Z source: merge

That's 8,740 bytes per document - a little high but not too bad.

Segment _28ow: #docs: 241,082 #dels: 31 size: 5,251,034,504 bytes age: 2020-11-25T18:33:59.636Z source: merge

21,781 bytes per document

Segment _2ajc: #docs: 50,159 #dels: 1 size: 5,222,429,424 bytes age: 2020-11-25T23:29:35.391Z source: merge

104,117 bytes per document!

And it gets worse, looking at the small segments near the end:

Segment _2bff: #docs: 2 #dels: 0 size: 23,605,447 bytes age: 2020-11-26T01:36:02.130Z source: flush

None of our search documents will have anywhere near that much text.

On our production Solr 6.6 server, which has similar but slightly larger data (some of it gets replaced with short placeholder text in the test server for privacy reasons), the large 5GB-ish segments contain between 1.8 million and 5 million documents.

Does anyone know what could have gone wrong here? We are using Solr Cell/Tika, and I'm wondering whether it has somehow started storing the whole files instead of just the extracted text.
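
One check I can think of (a rough SolrJ sketch of my own, assuming a local core named moodle; the file name is a placeholder) is to post a suspect file to the extracting handler with extractOnly=true, which makes Solr return the Tika output instead of indexing it, so you can see exactly how much text Solr Cell produces for that file:

    import java.io.File;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.NamedList;

    public class ExtractOnlyCheck {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/moodle").build()) {
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                // Placeholder file name; the content type is the standard .pptx MIME type.
                req.addFile(new File("suspect.pptx"),
                        "application/vnd.openxmlformats-officedocument.presentationml.presentation");
                req.setParam("extractOnly", "true");   // return the extracted text, do not index it
                NamedList<Object> response = solr.request(req);
                // The extracted body comes back in the response; its size shows what
                // would have gone into the index for this file.
                System.out.println(response);
            }
        }
    }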

Questioner: sam marshall
Answer by sam marshall, 2020-12-01 22:51:46

It turns out that a 10 MB English-language PowerPoint file, containing mostly pictures and only about 50 words of text in the whole thing, is indexed (with metadata turned off) as nearly half a million terms, most of which are Chinese characters. Presumably, Tika has incorrectly extracted some of the binary content of the PowerPoint file as if it were text.
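
A quick way to confirm that the extraction itself is at fault (a sketch of my own, not what I originally did; the file name is a placeholder) is to run Tika directly against the file and look at how much text comes back. Bear in mind that Solr bundles a specific Tika version, so a standalone Tika jar might not behave identically:

    import java.io.File;

    import org.apache.tika.Tika;

    public class TikaExtractCheck {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            tika.setMaxStringLength(-1);   // disable truncation (the default caps output at 100k characters)
            String text = tika.parseToString(new File("suspect.pptx"));
            System.out.println("Extracted characters: " + text.length());
            // Print a short sample; in the bad case this is full of CJK characters.
            System.out.println(text.substring(0, Math.min(500, text.length())));
        }
    }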

I was only able to find this by reducing the index by trial and error until it contained only a handful of documents (3 documents, but using 13 MB of disk space). Luke's 'Overview' tab then let me see that one field (called solr_filecontent in my schema), which contains the indexed Tika results, has 451,029 terms, and clicking 'Show top terms' shows a bunch of Chinese characters.

I am not sure if there is a less laborious way than trial and error to find this problem, e.g. whether there is any way to find documents that have a large number of terms associated with them. (Obviously, it could be a very large PDF or something else that legitimately has that many terms, but in this case it isn't.) This would be useful because even if there are only a few such instances across our system, they could contribute quite a lot to the overall index size.
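
One idea for a less laborious check (this is an assumption on my part, not something I have verified end-to-end): with the default BM25 similarity, Lucene stores an approximate per-document field length in the norms, so a small program reading the norms of solr_filecontent straight from the index directory should be able to flag documents with an unusually large number of terms. Roughly, with placeholder paths and an arbitrary threshold:

    import java.nio.file.Paths;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.SmallFloat;

    public class LongFieldFinder {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/solr/core/data/index")))) {
                for (LeafReaderContext ctx : reader.leaves()) {
                    NumericDocValues norms = ctx.reader().getNormValues("solr_filecontent");
                    if (norms == null) {
                        continue;   // field absent in this segment, or norms omitted
                    }
                    while (norms.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        // BM25 encodes the field length with SmallFloat.intToByte4, so this
                        // decodes back to an approximate term count for the document.
                        int approxTerms = SmallFloat.byte4ToInt((byte) norms.longValue());
                        if (approxTerms > 100_000) {   // arbitrary threshold
                            System.out.println("doc " + (ctx.docBase + norms.docID())
                                    + " has roughly " + approxTerms + " terms");
                        }
                    }
                }
            }
        }
    }

Mapping the Lucene document number back to a Solr id would still need a stored-field lookup, and the decoded length is only approximate because of the SmallFloat encoding, but it would avoid the trial-and-error deletion.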

As for resolving the problem:

1 - I could hack something to stop it indexing this individual document (which is used repeatedly in our test data, otherwise I probably wouldn't have noticed), but presumably the problem may occur in other cases as well; a more general version of this idea is sketched after this list.

2 - I considered excluding the terms in some way, but we do have a small proportion of genuine content in various languages, including Chinese, in our index, so even if there were a way to configure it to index only ASCII text or something similar, this wouldn't help.

3 - My next step was to try different versions to see what happens with the same file, in case it is a bug in specific Tika versions. However, I've tested with a range of Solr versions - 6.6.2, 8.4.0, 8.6.3, and 8.7.0 - and the same behaviour occurs on all of them.
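
For what it's worth, the "hack" in option 1 could be generalised into a guard that runs Tika locally before a file is handed to Solr and skips anything whose extracted text is implausibly large. This is only an illustrative sketch (the threshold and the decision to skip rather than truncate are made up, and in practice the logic would live wherever files are submitted for indexing):

    import java.io.File;

    import org.apache.tika.Tika;

    public class ExtractionGuard {
        // Arbitrary cap for illustration: anything producing more text than this
        // is flagged instead of being sent to the extracting handler.
        private static final int MAX_EXTRACTED_CHARS = 1_000_000;

        private final Tika tika = new Tika();

        public ExtractionGuard() {
            tika.setMaxStringLength(MAX_EXTRACTED_CHARS + 1);   // no need to extract beyond the cap
        }

        /** Returns true if the file's extracted text stays under the cap. */
        public boolean safeToIndex(File file) {
            try {
                return tika.parseToString(file).length() <= MAX_EXTRACTED_CHARS;
            } catch (Exception e) {
                return false;   // treat extraction failures as "do not index the content"
            }
        }
    }

An alternative at the Solr end would be a LimitTokenCountFilterFactory in the field's analysis chain, which caps how many tokens a single document can contribute rather than skipping the file entirely.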

So my conclusion from all this is that:

  • Contrary to my initial assumption that this was related to the version upgrade, it isn't actually any worse now than it was in the older Solr version.

  • To get this working now, I will probably have to do a hack to stop it indexing that specific PowerPoint file (which occurs frequently in our test dataset). Presumably the real dataset wouldn't have too many files like that, otherwise it would already have run out of disk space there...