I've got my feature list as follows:
[
{
"store": "myfeature_store",
"name" : "titleLength",
"class" : "org.apache.solr.ltr.feature.FieldLengthFeature",
"params" : {
"field":"title"
}
}
]
When I search for the following query:
curl -g 'http://localhost:8983/solr/nutch/select?indent=on&q=python&wt=json&fl=title,score,[features%20efi.query=python%20store=myfeature_store]'
I'm getting the following results:
{
"responseHeader":{
"status":0,
"QTime":8,
"params":{
"q":"python",
"indent":"on",
"fl":"title,score,[features efi.query=python store=myfeature_store]",
"wt":"json"}},
"response":{"numFound":793,"start":0,"maxScore":0.33828905,"docs":[
{
"title":"Newest 'python' Questions - Stack Overflow",
"score":0.33828905,
"[features]":"titleLength=1820.4445"},
{
"title":"Newest 'python-3.x' Questions - Stack Overflow",
"score":0.14434122,
"[features]":"titleLength=5349.8774"},
{
"title":"Geographic Information Systems Stack Exchange",
"score":0.08331977,
"[features]":"titleLength=1820.4445"},
{
"title":"Stack Overflow em Português",
"score":0.08331977,
"[features]":"titleLength=1820.4445"},
{
"title":"Stack Overflow en español",
"score":0.07460209,
"[features]":"titleLength=2621.44"},
{
"title":"Hot Questions - Stack Exchange",
"score":0.06534503,
"[features]":"titleLength=655.36"},
{
"title":"Code Review Stack Exchange",
"score":0.05356382,
"[features]":"titleLength=1820.4445"},
{
"title":"Software Recommendations Stack Exchange",
"score":0.05356382,
"[features]":"titleLength=1820.4445"},
{
"title":"Raspberry Pi Stack Exchange",
"score":0.042962566,
"[features]":"titleLength=1820.4445"},
{
"title":"Welcome to The Apache Software Foundation!",
"score":0.042862184,
"[features]":"titleLength=455.1111"}]
}}
As one can see, the titleLength values are completely wrong. For example, for the last result the title is "Welcome to The Apache Software Foundation!", so the titleLength should be 5, but it comes out as 455.1111. Where might the problem be?
The titleLength feature (FieldLengthFeature) uses the norms stored for the field - these are mapped to a lookup table of floats with 256 possible values. These values are not expected to be exact (since the length of a field can be larger than 256); instead they map the whole space of 2^31 integer values into a single byte.
This also includes any index-time boosts, so if a field is boosted when you're indexing it (for example by a Nutch plugin), this will be reflected in the norm stored for the field. You can't rely on titleLength to be the exact number of terms stored for the field for that document; it represents the "boost" for the field.
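To see where numbers like 455.1111 can come from, here is a rough Python sketch of the decoding. It assumes the Solr 6.x-era behavior, where FieldLengthFeature decodes the norm byte with Lucene's SmallFloat.byte315ToFloat and reports 1 / (norm * norm); newer versions encode norms differently, so treat this as an illustration rather than the exact code path in your install. The helper names (byte315_to_float, decoded_length, nearest_norm_byte) are made up for this example.

```python
import struct


def byte315_to_float(b: int) -> float:
    """Python port of Lucene's SmallFloat.byte315ToFloat (assumed 6.x behavior)."""
    if b == 0:
        return 0.0
    # Shift the byte into the upper float bits and rebase the exponent.
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def decoded_length(b: int) -> float:
    """Value the feature would report for norm byte b: 1 / norm^2 (assumption)."""
    norm = byte315_to_float(b)
    return 0.0 if norm == 0.0 else 1.0 / (norm * norm)


# All 256 values a one-byte norm can decode to.
table = {b: decoded_length(b) for b in range(256)}


def nearest_norm_byte(value: float) -> int:
    """Find the norm byte whose decoded value is closest to an observed feature value."""
    return min(range(1, 256), key=lambda b: abs(table[b] - value))


# Feature values copied from the response in the question.
observed = [455.1111, 655.36, 1820.4445, 2621.44, 5349.8774]
matches = {v: nearest_norm_byte(v) for v in observed}

for v, b in matches.items():
    print(f"{v} -> norm byte {b} (0x{b:02X}), norm {byte315_to_float(b)}")
```

Under that assumption, every value in the response lines up with exactly one norm byte (for example 455.1111 with byte 106 and 1820.4445 with byte 102). If the norm is stored as boost / sqrt(length), which is my understanding of the 6.x similarities, an unboosted field would decode to roughly its term count, so values in the hundreds or thousands for short titles are consistent with an index-time boost below 1 being folded into the norm.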
But if you look at the titles in the response, most of them are of similar length ("Stack Overflow en español", "Hot Questions - Stack Exchange", "Code Review Stack Exchange"), yet they are given different titleLength values (2621.44, 655.36, 1820.4445). Why is that happening?
Also, how can I tell whether the field is being boosted by a Nutch plugin or not?
The titleLength is based on the norm stored for the document - that's a one-byte float, only capable of storing 256 different values. Those values are probably just a single value away from each other (i.e. 655.36, 1820.4445, 2621.44 could be stored as 0x78, 0x79, 0x7A in byte values). I'm not familiar with Nutch, but if you use any sort of scoring based on link structure, etc., I'm guessing that will affect it. The reason for the different scores could be that Nutch is boosting them differently.
Is there any workaround to get the approximate field length, like a function query or something? I can't rely on FieldLengthFeature, as it is giving results that are way off.
Depends on your use case - if you need it for a specific field and would like the actual character length of the field, I'd use an update processor to index the field length as a separate field.
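As a rough illustration of the "index the length as a separate field" idea, here is the same thing done client-side in Python before the documents are sent to Solr. This is a swapped-in variant, not the update processor itself: the field name title_length and the helper with_title_length are hypothetical, and a whitespace split only approximates whatever your analyzer produces (stopword removal, for instance, would give a smaller count).

```python
def with_title_length(doc, source="title", target="title_length"):
    """Return a copy of doc with a whitespace-token count of `source` stored in `target`.

    `target` is a hypothetical field name; it would need to exist in the schema.
    """
    enriched = dict(doc)
    enriched[target] = len(str(doc.get(source, "")).split())
    return enriched


# Two titles taken from the response in the question.
docs = [
    {"title": "Welcome to The Apache Software Foundation!"},
    {"title": "Code Review Stack Exchange"},
]

enriched = [with_title_length(d) for d in docs]
for d in enriched:
    print(d["title_length"], d["title"])
```

On the Solr side, I believe FieldLengthUpdateProcessorFactory (which replaces a value with its character length) combined with CloneFieldUpdateProcessorFactory is the closest built-in equivalent, and the stored number could then be exposed to LTR with a FieldValueFeature instead of FieldLengthFeature.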