This article is reproduced from serverfault.com.

Solr: Document size inexplicably large

Published on 2020-11-27 11:20:48

I updated to Solr 8.4.0 (from 6.x) on a test server and reindexed (this is an index of a complicated Moodle system, mainly lots of very small documents). It worked initially, but later ran out of disk space, so I deleted everything and tried indexing a smaller subset of the data; it still ran out of disk space.

Looking at the segment info chart, the first segment seems reasonable:

Segment _2a1u: #docs: 603,564 #dels: 1 size: 5,275,671,226 bytes age: 2020-11-25T22:10:05.023Z source: merge

That's 8,740 bytes per document - a little high but not too bad.

Segment _28ow: #docs: 241,082 #dels: 31 size: 5,251,034,504 bytes age: 2020-11-25T18:33:59.636Z source: merge

21,781 bytes per document

Segment _2ajc: #docs: 50,159 #dels: 1 size: 5,222,429,424 bytes age: 2020-11-25T23:29:35.391Z source: merge

104,117 bytes per document!

And it gets worse, looking at the small segments near the end:

Segment _2bff: #docs: 2 #dels: 0 size: 23,605,447 bytes age: 2020-11-26T01:36:02.130Z source: flush

None of our search documents will have anywhere near that much text.

On our production Solr 6.6 server, which has similar but slightly larger data (some of it gets replaced with short placeholder text in the test server for privacy reasons), the large 5GB-ish segments contain between 1.8 million and 5 million documents.

Does anyone know what could have gone wrong here? We are using Solr Cell/Tika, and I'm wondering whether it has somehow started storing the whole files instead of just the extracted text.
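
One check I can think of (a rough SolrJ sketch of my own, assuming a local core named moodle; the file name is a placeholder) is to post a suspect file to the extracting handler with extractOnly=true, which makes Solr return the Tika output instead of indexing it, so you can see exactly how much text Solr Cell produces for that file:

    import java.io.File;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.NamedList;

    public class ExtractOnlyCheck {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/moodle").build()) {
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                // Placeholder file name; the content type is the standard .pptx MIME type.
                req.addFile(new File("suspect.pptx"),
                        "application/vnd.openxmlformats-officedocument.presentationml.presentation");
                req.setParam("extractOnly", "true");   // return the extracted text, do not index it
                NamedList<Object> response = solr.request(req);
                // The extracted body comes back in the response; its size shows what
                // would have gone into the index for this file.
                System.out.println(response);
            }
        }
    }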

Questioner: sam marshall
Answer by sam marshall, 2020-12-01 22:51:46

It turns out that a 10 MB English-language PowerPoint file, containing mostly pictures and only about 50 words of text in the whole thing, is indexed (with metadata turned off) as nearly half a million terms, most of which are Chinese characters. Presumably, Tika has incorrectly extracted some of the binary content of the PowerPoint file as if it were text.
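
A quick way to confirm that the extraction itself is at fault (a sketch of my own, not what I originally did; the file name is a placeholder) is to run Tika directly against the file and look at how much text comes back. Bear in mind that Solr bundles a specific Tika version, so a standalone Tika jar might not behave identically:

    import java.io.File;

    import org.apache.tika.Tika;

    public class TikaExtractCheck {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            tika.setMaxStringLength(-1);   // disable truncation (the default caps output at 100k characters)
            String text = tika.parseToString(new File("suspect.pptx"));
            System.out.println("Extracted characters: " + text.length());
            // Print a short sample; in the bad case this is full of CJK characters.
            System.out.println(text.substring(0, Math.min(500, text.length())));
        }
    }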

I was only able to find this by reducing the index by trial and error until it contained only a handful of documents (3 documents, but using 13 MB of disk space). Luke's 'Overview' tab then let me see that one field (called solr_filecontent in my schema), which contains the indexed Tika results, has 451,029 terms, and clicking 'Show top terms' shows a bunch of Chinese characters.

I am not sure if there is a less laborious way than trial and error to find this problem, e.g. whether there is any way to find documents that have a large number of terms associated with them. (Obviously, it could be a very large PDF or something else that legitimately has that many terms, but in this case it isn't.) This would be useful because even if there are only a few such instances across our system, they could contribute quite a lot to the overall index size.
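
One idea for a less laborious check (this is an assumption on my part, not something I have verified end-to-end): with the default BM25 similarity, Lucene stores an approximate per-document field length in the norms, so a small program reading the norms of solr_filecontent straight from the index directory should be able to flag documents with an unusually large number of terms. Roughly, with placeholder paths and an arbitrary threshold:

    import java.nio.file.Paths;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.SmallFloat;

    public class LongFieldFinder {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/solr/core/data/index")))) {
                for (LeafReaderContext ctx : reader.leaves()) {
                    NumericDocValues norms = ctx.reader().getNormValues("solr_filecontent");
                    if (norms == null) {
                        continue;   // field absent in this segment, or norms omitted
                    }
                    while (norms.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        // BM25 encodes the field length with SmallFloat.intToByte4, so this
                        // decodes back to an approximate term count for the document.
                        int approxTerms = SmallFloat.byte4ToInt((byte) norms.longValue());
                        if (approxTerms > 100_000) {   // arbitrary threshold
                            System.out.println("doc " + (ctx.docBase + norms.docID())
                                    + " has roughly " + approxTerms + " terms");
                        }
                    }
                }
            }
        }
    }

Mapping the Lucene document number back to a Solr id would still need a stored-field lookup, and the decoded length is only approximate because of the SmallFloat encoding, but it would avoid the trial-and-error deletion.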

As for resolving the problem:

1 - I could hack something to stop it indexing this individual document (which is used repeatedly in our test data, otherwise I probably wouldn't have noticed), but presumably the problem may occur in other cases as well; a more general version of this idea is sketched after this list.

2 - I considered excluding the terms in some way, but we do have a small proportion of genuine content in various languages, including Chinese, in our index, so even if there were a way to configure it to index only ASCII text or something similar, this wouldn't help.

3 - My next step was to try different versions to see what happens with the same file, in case it is a bug in specific Tika versions. However, I've tested with a range of Solr versions - 6.6.2, 8.4.0, 8.6.3, and 8.7.0 - and the same behaviour occurs on all of them.
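
For what it's worth, the "hack" in option 1 could be generalised into a guard that runs Tika locally before a file is handed to Solr and skips anything whose extracted text is implausibly large. This is only an illustrative sketch (the threshold and the decision to skip rather than truncate are made up, and in practice the logic would live wherever files are submitted for indexing):

    import java.io.File;

    import org.apache.tika.Tika;

    public class ExtractionGuard {
        // Arbitrary cap for illustration: anything producing more text than this
        // is flagged instead of being sent to the extracting handler.
        private static final int MAX_EXTRACTED_CHARS = 1_000_000;

        private final Tika tika = new Tika();

        public ExtractionGuard() {
            tika.setMaxStringLength(MAX_EXTRACTED_CHARS + 1);   // no need to extract beyond the cap
        }

        /** Returns true if the file's extracted text stays under the cap. */
        public boolean safeToIndex(File file) {
            try {
                return tika.parseToString(file).length() <= MAX_EXTRACTED_CHARS;
            } catch (Exception e) {
                return false;   // treat extraction failures as "do not index the content"
            }
        }
    }

An alternative at the Solr end would be a LimitTokenCountFilterFactory in the field's analysis chain, which caps how many tokens a single document can contribute rather than skipping the file entirely.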

So my conclusion from all this is that:

  • Contrary to my initial assumption that this was related to the version upgrade, it isn't actually any worse now than it was in the older Solr version.

  • To get this working now, I will probably have to do a hack to stop it indexing that specific PowerPoint file (which occurs frequently in our test dataset). Presumably the real dataset wouldn't have too many files like that, otherwise it would already have run out of disk space there...