My feature list is as follows:
[
{
"store": "myfeature_store",
"name" : "titleLength",
"class" : "org.apache.solr.ltr.feature.FieldLengthFeature",
"params" : {
"field":"title"
}
}
]
When I run the following query:
curl -g 'http://localhost:8983/solr/nutch/select?indent=on&q=python&wt=json&fl=title,score,[features%20efi.query=python%20store=myfeature_store]'
I get the following result:
{
"responseHeader":{
"status":0,
"QTime":8,
"params":{
"q":"python",
"indent":"on",
"fl":"title,score,[features efi.query=python store=myfeature_store]",
"wt":"json"}},
"response":{"numFound":793,"start":0,"maxScore":0.33828905,"docs":[
{
"title":"Newest 'python' Questions - Stack Overflow",
"score":0.33828905,
"[features]":"titleLength=1820.4445"},
{
"title":"Newest 'python-3.x' Questions - Stack Overflow",
"score":0.14434122,
"[features]":"titleLength=5349.8774"},
{
"title":"Geographic Information Systems Stack Exchange",
"score":0.08331977,
"[features]":"titleLength=1820.4445"},
{
"title":"Stack Overflow em Português",
"score":0.08331977,
"[features]":"titleLength=1820.4445"},
{
"title":"Stack Overflow en español",
"score":0.07460209,
"[features]":"titleLength=2621.44"},
{
"title":"Hot Questions - Stack Exchange",
"score":0.06534503,
"[features]":"titleLength=655.36"},
{
"title":"Code Review Stack Exchange",
"score":0.05356382,
"[features]":"titleLength=1820.4445"},
{
"title":"Software Recommendations Stack Exchange",
"score":0.05356382,
"[features]":"titleLength=1820.4445"},
{
"title":"Raspberry Pi Stack Exchange",
"score":0.042962566,
"[features]":"titleLength=1820.4445"},
{
"title":"Welcome to The Apache Software Foundation!",
"score":0.042862184,
"[features]":"titleLength=455.1111"}]
}}
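As you can see, titleLength is completely wrong. For example, for the last result, whose title is "Welcome to The Apache Software Foundation!", titleLength should be 5, but it comes out as 455.1111. What could be the problem?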
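The titleLength feature uses the norm stored for the field. Norms are mapped to a lookup table of floats with only 256 possible values. These values are not exact (the length of a field can be greater than 256); instead, the whole space of integer values, up to 2^31, is mapped into a single byte.
This also includes any index-time boost, so if the field was boosted when it was indexed (for example, by a Nutch plugin), that boost is reflected in the norm stored for the field. You can't rely on titleLength being the exact number of terms stored in that field for the document; it also reflects the "boost" of the field.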
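To make the one-byte mapping concrete, here is a rough Python sketch. It assumes the norm byte is decoded the way Lucene's SmallFloat.byte315ToFloat does it, and that the feature reports 1 / norm^2 as the length, which is how I remember older Solr LTR versions working; check the source for your version before relying on it. The helper names (byte315_to_float, length_from_norm_byte, closest_norm_byte) are made up for this example.

```python
import struct

def byte315_to_float(b):
    """Decode one norm byte, mirroring Lucene's SmallFloat.byte315ToFloat."""
    b &= 0xFF
    if b == 0:
        return 0.0
    # Shift the byte into the float's exponent/mantissa bits and re-bias.
    bits = (b << 21) + (48 << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# The full lookup table: only 256 distinct norm values can ever be stored.
NORM_TABLE = [byte315_to_float(b) for b in range(256)]

def length_from_norm_byte(b):
    """Assumed FieldLengthFeature behaviour: length = 1 / norm^2."""
    norm = NORM_TABLE[b & 0xFF]
    if norm == 0.0:
        return float("inf")  # sketch only; a zero norm has no usable length
    return 1.0 / (norm * norm)

def closest_norm_byte(value):
    """Find the norm byte whose decoded length is nearest to value."""
    return min(range(1, 256), key=lambda b: abs(length_from_norm_byte(b) - value))

if __name__ == "__main__":
    # Distinct titleLength values copied from the response in the question.
    for observed in (455.1111, 655.36, 1820.4445, 2621.44, 5349.8774):
        b = closest_norm_byte(observed)
        print(observed, hex(b), length_from_norm_byte(b))
```

Under these assumptions, each of the five distinct values in the response lands on an entry of the 256-value table, which fits the explanation above: the numbers are quantized norms (boost included), not term counts.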
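But if you look at the titles in the response, most of them are about the same length ("Stack Overflow en español", "Hot Questions - Stack Exchange", "Code Review Stack Exchange"), yet their titleLength values differ (2621.44, 655.36, 1820.4445). Why is that?
Also, how can I tell whether a Nutch plugin is boosting the field?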
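titleLength is based on the norm stored for the document. That norm is a one-byte float, able to hold only 256 distinct values, so values that look far apart may be just one step apart in the encoding (i.e. 655.36, 1820.4445, and 2621.44 might be stored as the byte values 0x78, 0x79, 0x7A). I'm not familiar with Nutch, but if you're using any scoring based on link structure or similar, I'd guess that affects it. The values probably differ because Nutch is boosting those documents differently.

Is there any workaround to get the approximate field length, such as a function query or something similar? I can't rely on FieldLengthFeature, because the results it gives are way off.

That depends on your use case. If you need this for a specific field and want the actual character length of that field, I'd use an update processor to index the field length as a separate field.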
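As a rough illustration of the "separate field" idea, here is a small Python sketch that computes the lengths on the client before a document is sent to Solr. This is not the update-processor route itself: on the server side, Solr ships CloneFieldUpdateProcessorFactory and FieldLengthUpdateProcessorFactory, which I believe can be chained to do the same thing in solrconfig.xml. The function name and the field names title_chars_i and title_terms_i are hypothetical, and the whitespace split is only an approximation of what your analyzer would produce.

```python
def add_length_fields(doc, field="title",
                      chars_field="title_chars_i",   # hypothetical field name
                      terms_field="title_terms_i"):  # hypothetical field name
    """Return a copy of doc with exact length fields for one text field."""
    enriched = dict(doc)
    value = doc.get(field) or ""
    enriched[chars_field] = len(value)          # character length
    enriched[terms_field] = len(value.split())  # whitespace-separated terms
    return enriched

if __name__ == "__main__":
    doc = {"title": "Welcome to The Apache Software Foundation!"}
    print(add_length_fields(doc))
```

Once such a field is indexed, it could be exposed to LTR with a FieldValueFeature instead of FieldLengthFeature, so the value is no longer affected by norm quantization or boosts.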