Searching by date in Elasticsearch
I recently indexed some documents in Elasticsearch at work and had issues retrieving what I wanted by date. Googling didn't turn up much that was useful beyond the official documentation. I thought it was worth sharing what wasn't obvious to me from reading that documentation.
Let's start a single-node Elasticsearch cluster for testing:
!docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.0
Indexing documents in Elasticsearch
Like in a previous blog post, I'll use the Python Elasticsearch client.
from datetime import datetime
from elasticsearch import Elasticsearch
es = Elasticsearch()
Let's first check the cluster is alive:
es.cat.health()
Here is the list of messages we want to index:
messages = [
{"date": "Fri, 11 Oct 2019 10:30:00 +0200",
"subject": "Beautiful is better than ugly"
},
{"date": "Wed, 09 Oct 2019 11:36:05 +0200",
"subject": "Explicit is better than implicit"
},
{"date": "Thu, 10 Oct 2019 19:16:25 +0200",
"subject": "Simple is better than complex"
},
{"date": "Fri, 01 Nov 2019 18:12:00 +0200",
"subject": "Complex is better than complicated"
},
{"date": "Wed, 09 Oct 2019 21:30:10 +0200",
"subject": "Flat is better than nested"
},
{"date": "Wed, 01 Jan 2020 09:23:00 +0200",
"subject": "Sparse is better than dense"
},
{"date": "Wed, 15 Jan 2020 14:06:07 +0200",
"subject": "Readability counts"
},
{"date": "Sat, 01 Feb 2020 12:00:00 +0200",
"subject": "Now is better than never"
},
]
Let's index those messages. Note that we delete the index first to make sure it doesn't exist when running this notebook several times.
es.indices.delete(index="test-index", ignore_unavailable=True)
for id_, message in enumerate(messages):
    es.index(index="test-index", id=id_, body=message, refresh=True)
es.indices.get_mapping(index="test-index")
Looking at the mapping, we see that the date field was indexed as text and not as the date datatype. Converting the field to ISO format should help.
for message in messages:
    message["date"] = datetime.strptime(message["date"], "%a, %d %b %Y %H:%M:%S %z").isoformat()
messages
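As a cross-check on the conversion above, the standard library can also parse this email-style (RFC 2822) date without a hand-written format string. This is only a minimal sketch: the helper name to_iso is made up for illustration and is not part of the original notebook.

```python
from email.utils import parsedate_to_datetime


def to_iso(rfc2822_date):
    """Convert an email-style date string to ISO 8601, keeping the UTC offset."""
    return parsedate_to_datetime(rfc2822_date).isoformat()


# Hypothetical helper applied to one of the dates from the messages list.
print(to_iso("Wed, 09 Oct 2019 11:36:05 +0200"))  # 2019-10-09T11:36:05+02:00
```

Either approach should produce the same string; strptime is stricter about the exact layout, which may or may not be what you want.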
es.indices.delete(index="test-index", ignore_unavailable=True)
for id_, message in enumerate(messages):
    es.index(index="test-index", id=id_, body=message, refresh=True)
es.indices.get_mapping(index="test-index")
This looks better. The date field was properly recognized thanks to the date detection that is enabled by default.
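Rather than reading the mapping by eye, the check can be scripted. The sketch below assumes the typeless response layout returned by Elasticsearch 7.x (index name, then mappings, then properties). The helper name field_type and the sample dictionary are illustrative, not something from the client library.

```python
def field_type(mapping_response, index, field):
    """Return the datatype of `field` from an indices.get_mapping() response."""
    return mapping_response[index]["mappings"]["properties"][field]["type"]


# Hand-written sample mimicking the assumed shape of a 7.x get_mapping response.
sample = {
    "test-index": {
        "mappings": {
            "properties": {
                "date": {"type": "date"},
                "subject": {"type": "text"},
            }
        }
    }
}
print(field_type(sample, "test-index", "date"))
```

Against the running cluster, the same helper would be called as field_type(es.indices.get_mapping(index="test-index"), "test-index", "date").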
Searching
We can first check that simple queries work as expected. Note that I'll use the query string syntax. I find it more natural and easier to integrate in a web application search box.
es.search(index="test-index", q="complex")
Let's define a function that just returns the list of hits.
def search(query):
    return es.search(index="test-index", q=query)["hits"]["hits"]
search("complex")
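The raw hits are fairly verbose. When comparing queries, it can help to look only at the subjects. Here is a small sketch of that idea; the subjects helper and the sample hits are hypothetical and only mimic the shape of what the search function returns.

```python
def subjects(hits):
    """Extract the subject of each hit, preserving the order of the hits."""
    return [hit["_source"]["subject"] for hit in hits]


# Hand-written hits with the usual _index/_id/_score/_source keys.
sample_hits = [
    {
        "_index": "test-index",
        "_id": "2",
        "_score": 1.0,
        "_source": {
            "date": "2019-10-10T19:16:25+02:00",
            "subject": "Simple is better than complex",
        },
    },
    {
        "_index": "test-index",
        "_id": "3",
        "_score": 1.0,
        "_source": {
            "date": "2019-11-01T18:12:00+02:00",
            "subject": "Complex is better than complicated",
        },
    },
]
print(subjects(sample_hits))
```

With a live cluster this would be used as subjects(search("complex")).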
Let's now try to search by date to retrieve the messages from the 9th of October 2019.
search("20191009")
Nothing... The date format is probably not recognized.
search("2019-10-09")
So we have to use - as the separator. OK, let's try to retrieve all messages from January 2020.
search("2020-01")
That's not really what we expected: there is also a message on the 15th of January. This shows that 2020-01 is in fact equivalent to 2020-01-01. This would be the same with 2020.
search("date:2020")
To get the full month, we have to use a range query.
search("[2020-01-01 TO 2020-01-31]")
Which is equivalent to:
search("[2020-01 TO 2020-02}")
Note that }, in the range query, excludes the 1st of February. Using ] would give us an additional message:
search("[2020-01 TO 2020-02]")
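If the search box offers a "pick a month" shortcut, the half-open range can be generated instead of typed. The following is a sketch under a few assumptions: the function name month_range is invented here, and I prefix the range with the field name (date:) to target that field explicitly, which the queries above did not do.

```python
def month_range(year, month, field="date"):
    """Build a query-string range covering one calendar month.

    The upper bound is the first day of the next month, excluded with }.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Roll over to January of the following year after December.
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{field}:[{year:04d}-{month:02d} TO {next_year:04d}-{next_month:02d}}}"


print(month_range(2020, 1))   # date:[2020-01 TO 2020-02}
print(month_range(2019, 12))  # date:[2019-12 TO 2020-01}
```

The resulting string can be passed directly to the search function defined earlier.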
Another way to retrieve messages from a specific period is to use date math:
search(r"2020-01\|\|\/M")
search(r"date:2020\|\|\/y")
This is a nice solution, but it's not easy to make occasional users remember the syntax, especially the escaping of the | and / characters. Range queries are probably more natural.
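One way to spare users the backslashes is to escape on their behalf, or to bypass the query string syntax entirely. The sketch below shows both. The names escape_date_math and month_query_body are invented for this example, and the second function assumes the standard range query of the query DSL, where gte rounds down and lte rounds up to the unit given after the slash.

```python
def escape_date_math(expression):
    """Backslash-escape the | and / characters of a date math expression."""
    return expression.replace("|", r"\|").replace("/", r"\/")


def month_query_body(month, field="date"):
    """Build a query DSL body matching a whole month, with no escaping needed."""
    rounded = f"{month}||/M"
    return {"query": {"range": {field: {"gte": rounded, "lte": rounded}}}}


print(escape_date_math("2020-01||/M"))
print(month_query_body("2020-01"))
```

The first result can go through the q parameter as before. The second would be sent as the body of es.search instead of q.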
One thing that could be nice is if both 2019-10-09 and 20191009 were recognized. This is possible by adding the formats we want to accept to the mapping.
Let's recreate the index with the new mapping.
mapping = {
"date": {
"type": "date",
"format": "strict_date_optional_time||yyyyMMdd||yyyyMM",
},
"subject": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
}
es.indices.delete(index="test-index", ignore_unavailable=True)
es.indices.create(index="test-index", body={"mappings": {"dynamic": "strict", "properties": mapping}})
for id_, message in enumerate(messages):
    es.index(index="test-index", id=id_, body=message, refresh=True)
search("20191009")
search("2019-10-09")
search("date:[202002 TO now]")
search("date:[2020-02 TO now]")
As seen above, both formats work now.
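If the list of accepted formats is likely to grow, the format string can be assembled from a list rather than edited by hand. This is a small sketch; the helper name date_field_mapping is hypothetical. It relies on the documented behaviour that formats separated by || are tried in turn.

```python
def date_field_mapping(formats):
    """Build a date field mapping accepting each format in `formats`, tried in order."""
    if not formats:
        raise ValueError("at least one format is required")
    return {"type": "date", "format": "||".join(formats)}


print(date_field_mapping(["strict_date_optional_time", "yyyyMMdd", "yyyyMM"]))
```

The returned dictionary has the same shape as the "date" entry of the mapping used above.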
Conclusion
- The mapping is used when indexing new documents. It's also used by the search. Define in the mapping all the date formats you want the search to support (not only the ones required to ingest documents).
- A year 2020 or month 2020-01 is converted to the first day of the year/month: 2020-01-01.
- To search by period, use either date math 2020-01\|\|\/M or a range query [2020-01-01 TO 2020-01-31].