Covid-19 and its implications have been the number one topic for quite some time. The situation naturally provides a new dataset which any data analyst can process with different tools.
Although this is not a happy dataset, and not one I would like to be part of, it is also an opportunity to use tools, learn new applications, and hope that whatever we do will help other people around us in some way. Whether it is a scientist looking for a different approach to the problem or a student learning new tools on data that is relevant right now, everyone can benefit. Because I believe that we learn by doing things, I am presenting a complete hands-on example based on Slovakia's data. The same methodology can be applied to similar use cases or serve as a proof of concept when needed.
All over the world, many people, organizations, and companies provide different datasets about the Covid-19 outbreak. Accessibility, quality, and completeness vary from source to source. Because I live in Slovakia, I focus on local data. Unfortunately, we have no official, complete, machine-readable data source provided by government authorities.
Since the beginning of the outbreak, we had only a few sources available, which were technically merged into one over time: the government webpage dedicated to
covid-19. Data are provided by the
National health information centre. Access to a machine-readable, open data source is not possible, or at least I am not aware of one, which I consider unfortunate. Another source is the news media, which scrape the official data and possibly enhance it with reports from their own resources. This does not make for a good and reliable data source, especially if it is not published freely. As far as I know, everyone keeps the data to themselves (please correct me if I am wrong).
Because there is no official data source, I decided to put together my own and open it to the public. I believe that the data can be
reused, distributed, and enhanced if anyone is willing to do so. The data
file is available in the GitHub repository, and I am happy to accept Issues or PRs with data enhancements.
It is also worth mentioning other visualizations available in Slovakia, which you can check out.
For this project, I use the Elastic Stack: Logstash for data ingestion, Elasticsearch for storage and search, and Kibana for the visualization and geospatial data analysis.
For those who prefer a visual understanding, the data flow pipeline is very simple and straightforward.
Logstash reads the CSV file and indexes documents into Elasticsearch. The user then works with Kibana, connected to Elasticsearch, to view, analyze, and visualize the documents.
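To make the pipeline concrete, here is a minimal sketch of what such a Logstash pipeline could look like. It is not the exact ls.conf from the repository; the CSV column names and the Elasticsearch host below are assumptions for illustration, while the index name and container paths follow the ones used later in this post.

input {
  file {
    # CSV file(s) mounted into the Logstash container (path matches the volume mount used later)
    path => "/usr/share/logstash/covid/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    # column names are illustrative only
    columns => ["date", "region", "confirmed", "recovered", "deaths"]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "covid-19-sk"
    # index template providing the correct mappings
    template => "/tmp/template.json"
    template_name => "covid-19-sk"
  }
  # also print indexed documents to the console
  stdout { codec => rubydebug }
}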
I use Slovakia's geospatial data, and I published a separate post on how to import Slovakia's GIS data into Elasticsearch. That post dives deeper into the details of how to index geospatial documents. Please read it to get familiar with the setup, as it will help you with the following tutorial.
If you do not have the GDAL library installed yet, also read the post on how to set up the latest GDAL using a simple wrapper script for the Docker image.
Optional
A working Docker environment, in case you need to set up the GDAL wrapper and run the Elasticsearch cluster using Docker.
How to - tutorial
In the following section, I will describe the minimal steps required to build, on your local machine, the same live dashboard that you can see at
covid-19.radoondas.io.
Clone my
repository from GitHub to get all the necessary data and configuration files.
Make sure that you have your Elasticsearch cluster, together with Kibana, up and running. It can be any cluster, but a single Elasticsearch node with Kibana will serve the purpose perfectly fine. If you do not have a cluster at hand, use the prepared Docker configuration located in the repository root - docker-compose.yml.
To use the docker-compose environment, run the following command; it will spin up a single-node cluster with version 7.9.3.
Optionally, use
Elastic cloud to spin up the cluster.
$ docker-compose up -d
Verify that your cluster is up and running. Open your browser and go to the Kibana URL http://127.0.0.1:5601/. You should now see the home page of
Kibana.
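You can also check the Elasticsearch side directly from the command line; the cluster status should report green or yellow:

$ curl -s http://127.0.0.1:9200/_cluster/health?pretty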
Ingest the geospatial data into Elasticsearch. I generated GeoJSON source files for all Towns/Cities/Villages (Mestá) together with Regions (Kraje) and Districts (Okresy). You can find those files in the
data folder of the repository (obce.json, kraje.json, okresy.json). The JSON files are pre-generated from
Geoportal using the GDAL library for format conversion.
Read more in the section above dedicated to geospatial tools. After the import, you will have 3 indices in the cluster, named kraje, obce, and okresy.
For reference, the indices were created with the GDAL library (copy/paste of the exact commands might not work depending on your GDAL install method).
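A minimal sketch of what the imports could look like using GDAL's Elasticsearch driver; the index names follow the repository files, but the driver name and options may differ across GDAL versions, so check the linked GDAL post if these do not work for you:

$ ogr2ogr -f Elasticsearch "ES:http://localhost:9200" kraje.json  -nln kraje
$ ogr2ogr -f Elasticsearch "ES:http://localhost:9200" okresy.json -nln okresy
$ ogr2ogr -f Elasticsearch "ES:http://localhost:9200" obce.json   -nln obce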
# Imported indices
GET _cat/indices/obce,kraje,okresy?v

health status index  uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open   okresy uidA   1   1         79            0      5.1mb          5.1mb
yellow open   obce   uidB   1   1       2927            0     26.5mb         26.5mb
yellow open   kraje  uidC   1   1          8            0      1.8mb          1.8mb
Ingest the data into Elasticsearch and stop the Logstash Docker container once the documents are indexed. This is a one-time indexing job. Run the ingest command from the root of the repository. If all works as expected, the output will contain no errors related to the ingest process, and the documents currently being indexed will be printed on the screen.
There are a few essential pieces in the ingest command (a sketch assembling them follows the list).
we use a special template for correct mapping in Elasticsearch: -v $(pwd)/template.json:/tmp/template.json
we mount all files from the data folder as the local source: -v $(pwd)/data/:/usr/share/logstash/covid/
we use our custom Logstash configuration: -v $(pwd)/ls.conf:/usr/share/logstash/pipeline/logstash.conf
we disable monitoring, which would otherwise cause warning messages because we did not configure it: -e MONITORING_ENABLED=false
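Putting those pieces together, the ingest command looks roughly like the sketch below. The Logstash image tag is an assumption (pick one matching your cluster version), and if Elasticsearch runs in the docker-compose network you will also need a --network option so Logstash can reach it; check the repository for the exact command.

$ docker run --rm -it \
    -v $(pwd)/template.json:/tmp/template.json \
    -v $(pwd)/data/:/usr/share/logstash/covid/ \
    -v $(pwd)/ls.conf:/usr/share/logstash/pipeline/logstash.conf \
    -e MONITORING_ENABLED=false \
    docker.elastic.co/logstash/logstash:7.9.3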
# Check the index
GET _cat/indices/covid-19-sk?v

health status index       uuid   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   covid-19-sk uidddd   1   1       1751            0    557.5kb        557.5kb
Additionally, import the template and the actual data for the annotations used in the visualizations:
cd data
# Import index template
curl -s -H "Content-Type: application/x-ndjson" -XPUT "localhost:9200/_template/milestones" \
  --data-binary "@template_milestones.json"; echo
# You should see the message: {"acknowledged":true}

# Index actual data
curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary \
  "@milestones.bulk"; echo
# List template
GET _cat/templates/milestones?v

name       index_patterns order version composed_of
milestones [milestones]   0

# List index
GET _cat/indices/milestones?v

health status index      uuid   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   milestones uidddd   1   1         17            0      7.6kb          7.6kb
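To load the dashboard and its visualizations into Kibana, import the visualisations.ndjson file from the repository, either in the Kibana UI (Stack Management > Saved Objects > Import) or through the saved objects API; a sketch assuming the default Kibana address and the file in your current directory:

$ curl -s -X POST "http://127.0.0.1:5601/api/saved_objects/_import?overwrite=true" \
    -H "kbn-xsrf: true" \
    --form file=@visualisations.ndjson; echo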
Note: if you named the imported geospatial indices as in the example (kraje, okresy, obce), then there is no need to adjust the index patterns.
Navigate to Dashboards and find the one named [covid-19] Overview Dashboard; open it and check the content.
If all goes right, this is the dashboard you will see (or something similar, as the visualisations may evolve over time).
Covid-19 Dashboard
Covid-19 Dashboard in white
Note: Adjusting the saved objects should not be necessary if you used ogr2ogr (or the wrapper) to import the GeoJSON data; the mappings and index names will match those in the saved objects. If you used a different setup, you will need to adjust the index pattern IDs inside the visualisations.ndjson file.
Note: Tested with Elasticsearch v7.17.5 and Logstash v7.8.1
Let me know if you have learned something new, or if you were able to follow this hands-on tutorial successfully. Feel free to comment or suggest how I can improve the content so you can enjoy it more.