Elasticsearch Study Notes, Advanced Series (11): Multi-Field Search (Part 2)

desdik / 412 reads

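Summary: cross_fields treats all fields as one big field and then searches for each term in any field. For the field-centric types, the rewritten query returned by the validate API shows the matching rule. For cross_fields with operator set to and, every term must appear in at least one field. Compared with copy_to, cross_fields lets you boost individual fields at query time.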

This post continues from the previous one: https://segmentfault.com/a/11...

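4. The most_fields query

most_fields is field-centric: it favours the documents in which the largest number of fields match.
Suppose we have an application that lets users search for addresses, and it holds the following two documents: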

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

A most_fields search can be written by hand as a bool query with one match clause per field:

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "street": "Poland Street W1V"
          }
        },
        {
          "match": {
            "city": "Poland Street W1V"
          }
        },
        {
          "match": {
            "country": "Poland Street W1V"
          }
        },
        {
          "match": {
            "postcode": "Poland Street W1V"
          }
        }
      ]
    }
  }
}

Repeating the query string for every field quickly becomes verbose, so we simplify it with multi_match:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.3835402,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3835402,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}
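
To make the relationship between the two request bodies above concrete, here is a minimal Python sketch that expands a field list into the hand-written bool/should form. The helper name most_fields_as_bool is hypothetical, and the sketch only builds the request body; it does not talk to a cluster.

```python
import json


def most_fields_as_bool(query, fields):
    """Build the bool/should body that repeats one match clause per field."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: query}} for field in fields]
            }
        }
    }


body = most_fields_as_bool(
    "Poland Street W1V", ["street", "city", "country", "postcode"]
)
print(json.dumps(body, indent=2))
```

The printed body has the same shape as the long bool query shown earlier, which is why the multi_match form is the more convenient way to write it.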

With best_fields, doc 2 ranks ahead of doc 1:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.99938464,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      }
    ]
  }
}
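
The ranking flips because the two types combine per-field scores differently. The sketch below is a simplified model, not Elasticsearch's actual implementation: most_fields adds up the scores of all matching fields, while best_fields keeps the highest one and adds tie_breaker times the rest (tie_breaker defaults to 0). The per-field scores are made-up illustrative values, not numbers taken from the results above.

```python
def most_fields_score(field_scores):
    """Sum the scores of every field that matched."""
    return sum(field_scores.values())


def best_fields_score(field_scores, tie_breaker=0.0):
    """Keep the best field's score; other fields count only via tie_breaker."""
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


# Hypothetical per-field scores: doc 1 matches weakly in four fields,
# doc 2 matches strongly in the street field only.
doc1 = {"street": 0.5, "city": 0.7, "country": 0.6, "postcode": 0.6}
doc2 = {"street": 1.0}

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    print(name, most_fields_score(doc), best_fields_score(doc))
```

Under these assumed scores doc 1 wins when scores are summed and doc 2 wins when only the best field counts, which mirrors the two orderings shown above.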
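
Problems with most_fields

(1) It is designed to find the largest number of fields that match any of the words, rather than the best-matching words across all fields.
(2) It cannot use the operator or minimum_should_match parameters across fields (they are applied to each field separately), so it cannot trim the long tail of low-relevance results.
(3) Term frequencies differ from field to field and can interfere with one another, which leads to poor ranking.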

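5. All-field search with the copy_to parameter

Having described the problems with most_fields, we now look at how to solve them. The first approach is the copy_to parameter.
copy_to lets us combine several fields into a single field.
Create the following index: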

DELETE /test_index
PUT /test_index
{
  "mappings": {
    "properties": {
      "street": {
        "type": "text",
        "copy_to": "full_address"
      },
      "city": {
        "type": "text",
        "copy_to": "full_address"
      },
      "country": {
        "type": "text",
        "copy_to": "full_address"
      },
      "postcode": {
        "type": "text",
        "copy_to": "full_address"
      },
      "full_address": {
        "type": "text"
      }
    }
  }
}

Index the same data as before:

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

Query:

GET /test_index/_search
{
  "query": {
    "match": {
      "full_address": "Poland Street W1V"
    }
  }
}

Result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.68370587,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68370587,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5469647,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}

As we can see, once everything is merged into the single field full_address, the problems with most_fields go away.
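
As a rough illustration of why the merged field helps, the Python sketch below gathers the tokens of all four source fields into one list, much as copy_to feeds every value into full_address. It is a simplification under stated assumptions: tokenize is a crude stand-in for the standard analyzer (lowercase, split on whitespace), and full_address_tokens is a hypothetical helper, not an Elasticsearch API.

```python
def tokenize(text):
    """Crude stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def full_address_tokens(doc, source_fields):
    """Collect the tokens of every source field into one combined field."""
    tokens = []
    for field in source_fields:
        tokens.extend(tokenize(doc[field]))
    return tokens


FIELDS = ["street", "city", "country", "postcode"]
doc1 = {"street": "5 Poland Street", "city": "Poland",
        "country": "United W1V", "postcode": "W1V 3DG"}
doc2 = {"street": "5 Poland Street W1V", "city": "London",
        "country": "United Kingdom", "postcode": "3DG"}

query_terms = tokenize("Poland Street W1V")
for doc in (doc1, doc2):
    combined = full_address_tokens(doc, FIELDS)
    # Term counts now describe the whole address, not a single field.
    print({term: combined.count(term) for term in query_terms})
```

Because term statistics are computed over one field, the counts are directly comparable between documents. In this toy model doc 1 contains poland and w1v twice each, which is at least consistent with it ranking first in the result above.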

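6. The cross_fields query

The second way to solve the problems with most_fields is the cross_fields query.
If we can always rely on the _all field (in Elasticsearch versions that still have it) or define copy_to before indexing any documents, there is no problem. However, Elasticsearch also offers a query-time solution: the cross_fields query. cross_fields takes a term-centric approach, which is very different from the field-centric approach used by best_fields and most_fields. It treats all the fields as one big field and then searches for each term in any of the fields.
The difference between field-centric and term-centric is explained below.

Field-centric

Run the query: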

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

which returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "((postcode:poland postcode:street postcode:w1v) | (country:poland country:street country:w1v) | (city:poland city:street city:w1v) | (street:poland street:street street:w1v))"
    }
  ]
}

((postcode:poland postcode:street postcode:w1v) |
(country:poland country:street country:w1v) |
(city:poland city:street city:w1v) |
(street:poland street:street street:w1v))
This is the matching rule.
Setting operator to and changes it to:
((+postcode:poland +postcode:street +postcode:w1v) |
(+country:poland +country:street +country:w1v) |
(+city:poland +city:street +city:w1v) |
(+street:poland +street:street +street:w1v))
This means that all three terms must appear in the same field.
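
The field-centric rule with operator set to and can be mimicked in a few lines of Python. This is only a sketch of the matching logic, with no scoring: tokenize is a crude stand-in for the standard analyzer, and field_centric_and is a hypothetical helper name.

```python
def tokenize(text):
    """Crude stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def field_centric_and(doc, fields, query):
    """True if at least one single field contains every query term."""
    terms = tokenize(query)
    return any(
        all(term in tokenize(doc[field]) for term in terms)
        for field in fields
    )


FIELDS = ["street", "city", "country", "postcode"]
doc1 = {"street": "5 Poland Street", "city": "Poland",
        "country": "United W1V", "postcode": "W1V 3DG"}
doc2 = {"street": "5 Poland Street W1V", "city": "London",
        "country": "United Kingdom", "postcode": "3DG"}

print(field_centric_and(doc1, FIELDS, "Poland Street W1V"))
print(field_centric_and(doc2, FIELDS, "Poland Street W1V"))
```

Doc 1 fails this check because its three terms are spread over different fields, while doc 2 passes because its street field holds all of them.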

Term-centric

Run the query:

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields", 
      "operator": "and", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

which returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])"
    }
  ]
}

+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])
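This is the matching rule. In other words, every term must appear in at least one of the fields, and any field will do.
The cross_fields type first analyzes the query string into a list of terms and then searches for each term in any field. It handles the term-frequency problem by blending the inverse document frequencies of the fields, which resolves the problems with most_fields described earlier.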
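
The term-centric rule can be sketched the same way, again as matching logic only; it does not model the blended scoring. tokenize is a crude stand-in for the standard analyzer and term_centric_and is a hypothetical helper name.

```python
def tokenize(text):
    """Crude stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def term_centric_and(doc, fields, query):
    """True if every query term appears in at least one of the fields."""
    return all(
        any(term in tokenize(doc[field]) for field in fields)
        for term in tokenize(query)
    )


FIELDS = ["street", "city", "country", "postcode"]
doc1 = {"street": "5 Poland Street", "city": "Poland",
        "country": "United W1V", "postcode": "W1V 3DG"}
doc2 = {"street": "5 Poland Street W1V", "city": "London",
        "country": "United Kingdom", "postcode": "3DG"}

print(term_centric_and(doc1, FIELDS, "Poland Street W1V"))
print(term_centric_and(doc2, FIELDS, "Poland Street W1V"))
```

Both documents pass here, because each term only has to be found somewhere in the document rather than all in one field.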
Compared with copy_to, cross_fields has the advantage that individual fields can be boosted at query time.
Example:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields", 
      "fields": ["street^2", "city", "country", "postcode"]
    }
  }
}

Here the street field has a boost of 2, and every other field has a boost of 1.
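
For readers unfamiliar with the caret notation, the short Python sketch below shows how a field list such as the one above can be read as a name-to-boost mapping, with 1 as the default. It only illustrates the notation; parse_field_boosts is a hypothetical helper and not Elasticsearch's own parser.

```python
def parse_field_boosts(fields):
    """Split each "name^boost" entry; entries without a caret get boost 1.0."""
    boosts = {}
    for spec in fields:
        name, caret, boost = spec.partition("^")
        boosts[name] = float(boost) if caret else 1.0
    return boosts


boosts = parse_field_boosts(["street^2", "city", "country", "postcode"])
print(boosts)
```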

If you repost this article, please cite its address: https://www.ucloud.cn/yun/34528.html