Solr is giving the wrong field length

MatsLindh 2020-02-10 05:20

The titleLength feature uses the norm stored for the field. Norms are kept in a single byte and mapped through a lookup table of floats with 256 possible values. These values are not expected to be exact (a field can easily hold more than 256 terms); the encoding squeezes the whole space of 2^31 integer values into one byte.

This also includes any index-time boosts, so if a field is boosted when you're indexing it (for example by a Nutch plugin), this will be reflected in the norm stored for the field. You can't rely on titleLength being the exact number of terms stored in the field for that document; it represents the "boost" for the field.
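To illustrate why a one-byte value cannot carry an exact length, here is a minimal sketch of a toy 8-bit float (5 exponent bits, 3 mantissa bits). The format and the names `encode_length` and `decode_length` are made up for this example; this is not Lucene's actual norm encoding, only the same general idea.

```python
import math

def encode_length(x: float) -> int:
    """Pack a positive value into one byte: 5 exponent bits, 3 mantissa bits.

    Toy format for illustration only; NOT Lucene's actual norm encoding.
    """
    if x <= 0:
        return 0                      # 0 is reserved for "empty"
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= m < 1
    exp = e + 15                      # biased exponent
    if exp > 31:
        return 255                    # saturate: too large for 5 exponent bits
    if exp < 0:
        return 1                      # smallest non-zero code
    mant = int((m - 0.5) * 16)        # top 3 fraction bits, 0..7
    return max((exp << 3) | mant, 1)

def decode_length(b: int) -> float:
    """Map a byte back to the lower edge of the bucket it represents."""
    if b == 0:
        return 0.0
    exp = (b >> 3) - 15
    mant = b & 0b111
    return math.ldexp(0.5 + mant / 16, exp)

for n in (96, 100, 103, 104):
    b = encode_length(n)
    print(n, hex(b), decode_length(b))
```

In this toy format every length from 96 to 103 ends up in the same byte (0xb4) and decodes to 96.0, while 104 lands in the next bucket. The buckets get wider as the value grows, which is why the decoded number is only a rough indication of the original.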

Vedanshu 2020-02-11 21:55:42

But if you look at the titles in the response, most of them are of similar length ("Stack Overflow en español", "Hot Questions - Stack Exchange", "Code Review Stack Exchange"), yet they are given very different scores (2621.44, 655.36, 1820.4445). Why is that happening?

Vedanshu 2020-02-11 23:58:16

Also, how can I tell whether the field is being boosted by a Nutch plugin?

MatsLindh 2020-02-12 04:03:42

The titleLength is based on the norm stored for the document - that's a one-byte float, only capable of storing 256 different values. Those values are probably just a single value away from each other (i.e. 655.36, 1820.4445, 2621.44 could be stored as 0x78, 0x79, 0x7A in byte values). I'm not familiar with Nutch, but if you use any sort of scoring based on link structure, etc., I'm guessing that will affect it. The reason for the different scores may be that Nutch is boosting them differently.
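A quick arithmetic check supports the idea that these numbers come from a coarse grid rather than from term counts. The sketch below assumes the reported value behaves like 1 / norm², which is a guess that happens to fit these three numbers and is not confirmed anywhere in this thread; `implied_steps` is a helper name invented for the example.

```python
import math

# Scores quoted in the question for three titles of similar length.
scores = [2621.44, 655.36, 1820.4445]

def implied_steps(score: float) -> float:
    """Norm implied by score == 1 / norm**2, expressed in units of 1/256."""
    implied_norm = 1.0 / math.sqrt(score)   # assumption: score = 1 / norm**2
    return implied_norm * 256

for s in scores:
    steps = implied_steps(s)
    print(f"{s} -> implied norm is about {round(steps)}/256 (raw: {steps:.4f})")
```

Under that assumption the three scores correspond to norms of roughly 5/256, 10/256 and 6/256, so values only a few steps apart in the stored norm turn into very different reported numbers once they are squared and inverted.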

Vedanshu 2020-02-12 10:25:51

Is there any workaround to get an approximate field length, such as a function query? I can't rely on FieldLengthFeature, as its results are way off.

MatsLindh 2020-02-12 16:00:39

Depends on your use case - if you need it for a specific field and would like the actual character length of the field, I'd use an update processor to index the field length as a separate field.
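If you control the indexing client, the same idea can be sketched client-side: compute the length before the document is sent and store it in its own field. The field name `title_length_i` and the helper `add_length_field` are hypothetical (the `_i` suffix assumes an integer dynamic field exists in your schema). Solr also ships update processors such as FieldLengthUpdateProcessorFactory that can do this server-side, which is closer to what is suggested above.

```python
from typing import Any, Dict

def add_length_field(doc: Dict[str, Any],
                     source: str = "title",
                     target: str = "title_length_i") -> Dict[str, Any]:
    """Return a copy of doc with the character length of `source` in `target`.

    `target` is a hypothetical field name; adjust it to your schema.
    """
    out = dict(doc)
    value = doc.get(source)
    if isinstance(value, str):
        out[target] = len(value)      # character count, not term count
    return out

doc = add_length_field({"id": "1", "title": "Code Review Stack Exchange"})
print(doc)
```

The stored integer is exact and can be returned, sorted on, or used in a function query like any other numeric field, independent of whatever the norm contains.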
