Searching by date in Elasticsearch

I recently indexed some documents in Elasticsearch at work and had issues retrieving what I wanted by date. Googling didn't get me very useful results, except the official documentation. I thought it was worth sharing what wasn't obvious to me by reading the documentation.

Let's start a single-node Elasticsearch cluster for test:

In [1]:
!docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.0
b7c18b6079414f728d2dbacd8c913fbb212026bc513808e03e75e7a81eda0753

Indexing documents in Elasticsearch

Like in a previous blog post, I'll use the Python Elasticsearch client.

In [2]:
from datetime import datetime
from elasticsearch import Elasticsearch
es = Elasticsearch()

Let's first check the cluster is alive:

In [3]:
es.cat.health()
Out[3]:
'1583959014 20:36:54 docker-cluster green 1 1 0 0 0 0 0 17 1.2s 100.0%\n'

Here is the list of messages we want to index:

In [4]:
messages = [
    {"date": "Fri, 11 Oct 2019 10:30:00 +0200",
    "subject": "Beautiful is better than ugly"
    },
    {"date": "Wed, 09 Oct 2019 11:36:05 +0200",
    "subject": "Explicit is better than implicit"
    },
    {"date": "Thu, 10 Oct 2019 19:16:25 +0200",
    "subject": "Simple is better than complex"
    },
    {"date": "Fri, 01 Nov 2019 18:12:00 +0200",
    "subject": "Complex is better than complicated"
    },
    {"date": "Wed, 09 Oct 2019 21:30:10 +0200",
    "subject": "Flat is better than nested"
    },
    {"date": "Wed, 01 Jan 2020 09:23:00 +0200",
    "subject": "Sparse is better than dense"
    },
    {"date": "Wed, 15 Jan 2020 14:06:07 +0200",
    "subject": "Readability counts"
    },
    {"date": "Sat, 01 Feb 2020 12:00:00 +0200",
    "subject": "Now is better than never"
    },
]

Let's index those messages. Note that we delete the index first to make sure it doesn't exist when running this notebook several times.

In [5]:
es.indices.delete(index="test-index", ignore_unavailable=True)
for id_, message in enumerate(messages):
    es.index(index="test-index", id=id_, body=message, refresh=True)
In [6]:
es.indices.get_mapping(index="test-index")
Out[6]:
{'test-index': {'mappings': {'properties': {'date': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'subject': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}

Looking at the mapping, we see that the date field was indexed as text and not date datatype. Formatting the field to the isoformat should help.

In [7]:
for message in messages:
    message["date"] = datetime.strptime(message["date"], "%a, %d %b %Y %H:%M:%S %z").isoformat()
messages
Out[7]:
[{'date': '2019-10-11T10:30:00+02:00',
  'subject': 'Beautiful is better than ugly'},
 {'date': '2019-10-09T11:36:05+02:00',
  'subject': 'Explicit is better than implicit'},
 {'date': '2019-10-10T19:16:25+02:00',
  'subject': 'Simple is better than complex'},
 {'date': '2019-11-01T18:12:00+02:00',
  'subject': 'Complex is better than complicated'},
 {'date': '2019-10-09T21:30:10+02:00',
  'subject': 'Flat is better than nested'},
 {'date': '2020-01-01T09:23:00+02:00',
  'subject': 'Sparse is better than dense'},
 {'date': '2020-01-15T14:06:07+02:00', 'subject': 'Readability counts'},
 {'date': '2020-02-01T12:00:00+02:00', 'subject': 'Now is better than never'}]
In [8]:
es.indices.delete(index="test-index", ignore_unavailable=True)
for id_, message in enumerate(messages):
    es.index(index="test-index", id=id_, body=message, refresh=True)
es.indices.get_mapping(index="test-index")
Out[8]:
{'test-index': {'mappings': {'properties': {'date': {'type': 'date'},
    'subject': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}

This looks better. The date field was properly recognized thanks to the date detection that is enabled by default.

Searching

We can first check that simple queries work as expected. Note that I'll use the query string syntax. I find it more natural and easier to integrate in a web application search box.

In [9]:
es.search(index="test-index", q="complex")
Out[9]:
{'took': 140,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2, 'relation': 'eq'},
  'max_score': 1.2398099,
  'hits': [{'_index': 'test-index',
    '_type': '_doc',
    '_id': '2',
    '_score': 1.2398099,
    '_source': {'date': '2019-10-10T19:16:25+02:00',
     'subject': 'Simple is better than complex'}},
   {'_index': 'test-index',
    '_type': '_doc',
    '_id': '3',
    '_score': 1.2398099,
    '_source': {'date': '2019-11-01T18:12:00+02:00',
     'subject': 'Complex is better than complicated'}}]}}

Let's define a function that just returns the list of hits.

In [10]:
def search(query):
    return es.search(index="test-index", q=query)["hits"]["hits"]
In [11]:
search("complex")
Out[11]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '2',
  '_score': 1.2398099,
  '_source': {'date': '2019-10-10T19:16:25+02:00',
   'subject': 'Simple is better than complex'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '3',
  '_score': 1.2398099,
  '_source': {'date': '2019-11-01T18:12:00+02:00',
   'subject': 'Complex is better than complicated'}}]

Let's now try to search by date to retrieve the messages from the 9th of October 2019.

In [12]:
search("20191009")
Out[12]:
[]

Nothing... The date format is probably not recognized.

In [13]:
search("2019-10-09")
Out[13]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '1',
  '_score': 1.0,
  '_source': {'date': '2019-10-09T11:36:05+02:00',
   'subject': 'Explicit is better than implicit'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '4',
  '_score': 1.0,
  '_source': {'date': '2019-10-09T21:30:10+02:00',
   'subject': 'Flat is better than nested'}}]

So we have to use -. OK, let's try to retrieve all messages from January 2020.

In [14]:
search("2020-01")
Out[14]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}}]

That's not really what we expected. There is a message the 15th of January. This shows that 2020-01 is in fact equivalent to 2020-01-01. This would be the same with 2020.

In [15]:
search("date:2020")
Out[15]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}}]

To get the full month, we have to use a range query.

In [16]:
search("[2020-01-01 TO 2020-01-31]")
Out[16]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '6',
  '_score': 1.0,
  '_source': {'date': '2020-01-15T14:06:07+02:00',
   'subject': 'Readability counts'}}]

Which is equivalent to:

In [17]:
search("[2020-01 TO 2020-02}")
Out[17]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '6',
  '_score': 1.0,
  '_source': {'date': '2020-01-15T14:06:07+02:00',
   'subject': 'Readability counts'}}]

Note that }, in the range query, excludes the 1st of February. Using ] would give us an additional message:

In [18]:
search("[2020-01 TO 2020-02]")
Out[18]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '6',
  '_score': 1.0,
  '_source': {'date': '2020-01-15T14:06:07+02:00',
   'subject': 'Readability counts'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '7',
  '_score': 1.0,
  '_source': {'date': '2020-02-01T12:00:00+02:00',
   'subject': 'Now is better than never'}}]

Another way to retrieve messages from a specific period is to use date math:

In [19]:
search("2020-01\|\|\/M")
Out[19]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '6',
  '_score': 1.0,
  '_source': {'date': '2020-01-15T14:06:07+02:00',
   'subject': 'Readability counts'}}]
In [20]:
search("date:2020\|\|\/y")
Out[20]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '5',
  '_score': 1.0,
  '_source': {'date': '2020-01-01T09:23:00+02:00',
   'subject': 'Sparse is better than dense'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '6',
  '_score': 1.0,
  '_source': {'date': '2020-01-15T14:06:07+02:00',
   'subject': 'Readability counts'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '7',
  '_score': 1.0,
  '_source': {'date': '2020-02-01T12:00:00+02:00',
   'subject': 'Now is better than never'}}]

This is a nice solution but it's not super easy to make occasional users remember the syntax, especially the quoting of the | and / characters. Range queries are probably more natural.

One thing that could be nice is if both 2019-10-09 and 20191009 were recognized. This is possible by adding the format we want to accept in the mapping.

Let's recreate the index with the new mapping.

In [21]:
mapping = {
    "date": {
        "type": "date",
        "format": "strict_date_optional_time||yyyyMMdd||yyyyMM",
    },
    "subject": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
}
es.indices.delete(index="test-index", ignore_unavailable=True)
es.indices.create(index="test-index", body={"mappings": {"dynamic": "strict", "properties": mapping}})
for id_, message in enumerate(messages):
    es.index(index="test-index", id=id_, body=message, refresh=True)
In [22]:
search("20191009")
Out[22]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '1',
  '_score': 1.0,
  '_source': {'date': '2019-10-09T11:36:05+02:00',
   'subject': 'Explicit is better than implicit'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '4',
  '_score': 1.0,
  '_source': {'date': '2019-10-09T21:30:10+02:00',
   'subject': 'Flat is better than nested'}}]
In [23]:
search("2019-10-09")
Out[23]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '1',
  '_score': 1.0,
  '_source': {'date': '2019-10-09T11:36:05+02:00',
   'subject': 'Explicit is better than implicit'}},
 {'_index': 'test-index',
  '_type': '_doc',
  '_id': '4',
  '_score': 1.0,
  '_source': {'date': '2019-10-09T21:30:10+02:00',
   'subject': 'Flat is better than nested'}}]
In [24]:
search("date:[202002 TO now]")
Out[24]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '7',
  '_score': 1.0,
  '_source': {'date': '2020-02-01T12:00:00+02:00',
   'subject': 'Now is better than never'}}]
In [25]:
search("date:[2020-02 TO now]")
Out[25]:
[{'_index': 'test-index',
  '_type': '_doc',
  '_id': '7',
  '_score': 1.0,
  '_source': {'date': '2020-02-01T12:00:00+02:00',
   'subject': 'Now is better than never'}}]

As seen above, both formats work now.

Conclusion

  • The mapping is used when indexing new documents. It's also used by the search. Define in the mapping all the date formats you want the search to support (not only the ones required to ingest documents).
  • A year 2020 or month 2020-01 is converted to the first day of the year/month: 2020-01-01.
  • To search by period, use either date math 2020-01\|\|\/M or a range query [2020-01-01 TO 2020-01-31]
In [ ]:
 

Comments

Comments powered by Disqus