Covid-19 and its implications have been the number one topic for quite some time. The situation naturally provides a new dataset which any data analyst can process with different tools.
Although this is not a happy dataset, and not one I would like to be part of, it is also an opportunity to use tools, learn new applications, and hope that anything we do will help other people around us in some way. Whether it is a scientist looking for a different approach to the problem or a student learning new tools on data that is relevant right now, everyone can benefit. Because I believe that we learn by doing 'things', I am presenting a complete hands-on example based on Slovakia's data. The same methodology can be applied to similar use cases or serve as a proof of concept when needed.
The live dashboard of the setup is located at covid-19.radoondas.io and the source code in the GitHub repository.
The Data
All over the world, many people, organizations, and companies provide different datasets about the Covid-19 outbreak. Accessibility, quality, and completeness vary from source to source. Because I live in Slovakia, I focus on local data. Unfortunately, we have no official, complete, and machine-readable data source provided by government authorities.
Since the beginning of the outbreak, we have had only a few sources available, which were technically merged into one source over time: the government webpage dedicated to covid-19, with data provided by the National Health Information Centre. Access to a machine-readable, open data source is impossible, or at least I am not aware of one, which I consider unfortunate. Another source is news media that scrape the official data and possibly enhance it with reports from their own resources. This does not make for a good and reliable data source, especially if it is not published freely; as far as I know, everyone keeps the data for themselves (please correct me if I am wrong).
Because there is no official data source, I decided to put together my own and open it to the public. I believe the data can be reused, distributed, and enhanced if anyone is willing to do so. The data file is available in the GitHub repository, and I am happy to accept issues or PRs with data enhancements.
Other visualizations available in Slovakia are also worth checking:
- korona.gov.sk
- arcgis
- each major media outlet also offers a form of visualization for its readers.
Data set description
The data set consists of CSV-formatted rows with the following header.
date;city;infected;gender;note_1;note_2;healthy;died;region;age;district
Column descriptions as of the publishing of this post (they may change over time; please check the GitHub repository for the latest description):
Column name | Description |
---|---|
date | Date - the date of the record |
city | City - the location of the person infected by covid-19 |
infected | Infected - number of infected |
gender | Gender: M - male, Ž - female, D - children, X - unknown |
note_1 | Note 1 |
note_2 | Note 2 |
healthy | Healthy - number of people who recovered from the virus |
died | Dead - number of people who died |
region | Region |
age | Age |
district | District |
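To make the format concrete, here is a hypothetical example row. The values are illustrative only, not taken from the real data file, and the exact date format in the file may differ:
date;city;infected;gender;note_1;note_2;healthy;died;region;age;district
6.4.2020;Bratislava;1;M;;;0;0;Bratislavský;34;Bratislava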
Architecture and Tools
The application stack of choice is the Elastic Stack, used to manage the ingestion and analysis of the data:
- Logstash for data ingestion from the CSV file into
- Elasticsearch, and
- Kibana for the visualization and geospatial data analysis.
For those who prefer a visual overview, the data flow pipeline is simple and straightforward: Logstash reads the CSV file and indexes documents into Elasticsearch; the user then works in Kibana, connected to Elasticsearch, to view, analyze, and visualize the documents.
+----------+ +----------+ +---------------+ +--------+ +------+
| | | | | | | | | |
| CSV file +-----> Logstash +-----> Elasticsearch <-----> Kibana <-----+ User |
| | | | | | | | | |
+----------+ +----------+ +---------------+ +--------+ +------+
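Conceptually, the Logstash pipeline for this flow looks roughly like the sketch below. This is only an illustration; the authoritative configuration is ls.conf in the repository, and the input file name here is an assumption:
input {
  file {
    # hypothetical file name; the tutorial mounts the repository data folder at this path
    path => "/usr/share/logstash/covid/covid-19-slovakia.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  # split each row into fields matching the CSV header described above
  csv {
    separator => ";"
    columns => ["date","city","infected","gender","note_1","note_2","healthy","died","region","age","district"]
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "covid-19-sk"
  }
}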
Geospatial tools
I use Slovakia's geospatial data, and I published a separate post on how to import Slovakia's GIS data into Elasticsearch. That post dives deeper into the details of indexing geospatial documents. Please read it to get familiar with the setup, as it will help you with the following tutorial.
If you do not have the GDAL library installed yet, also read the post on how to set up the latest GDAL using a simple wrapper script for the Docker image.
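As a rough idea of what the wrapper does, you can call ogr2ogr through a GDAL Docker image directly. The image name osgeo/gdal and the exact invocation below are assumptions (whether the image ships with the Elasticsearch driver depends on the build), so follow the linked post for the tested setup:
$ docker run --rm -it --network=host -v "$(pwd)":/work osgeo/gdal \
    ogr2ogr -lco INDEX_NAME=kraje "ES:http://localhost:9200" /work/data/kraje.json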
Optional
A working Docker environment, in case you need to set up the GDAL wrapper and run the Elasticsearch cluster using Docker.
How-to tutorial
In the following section, I describe the minimal steps required to build, on your local machine, the same live dashboard you can see at covid-19.radoondas.io.
- Clone my repository from GitHub to get all the necessary data and configuration files.
$ git clone https://github.com/radoondas/covid-19-slovakia.git
- Navigate to your local copy of the repository.
$ cd covid-19-slovakia
- Make sure that you have your Elasticsearch cluster together with Kibana up and running. It can be any cluster, but a single Elasticsearch node with Kibana will serve the purpose perfectly fine. If you do not have a cluster at hand, use the prepared Docker configuration located in the repository root - docker-compose.yml. To use the docker-compose environment, run the following command; it will spin up a 1-node cluster with version 7.9.3 (a minimal sketch of such a compose file is shown after the command below). Optionally, use Elastic Cloud to spin up the cluster.
$ docker-compose up -d
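For reference, a minimal single-node setup in docker-compose looks roughly like this sketch; the repository's docker-compose.yml is the authoritative version, and the settings there may differ:
version: "2.2"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.3
    environment:
      # single node, no cluster formation needed for this demo
      - discovery.type=single-node
    ports:
      - "9200:9200"
  kibana:
    image: docker.elastic.co/kibana/kibana:7.9.3
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"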
- Verify that your cluster is up and running. Open your browser and go to the Kibana URL http://127.0.0.1:5601/. You should now see the Kibana Home page.
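If you prefer the command line, you can also check the cluster directly (assuming the default port mapping):
$ curl -s "http://127.0.0.1:9200/_cluster/health?pretty"
# "status" should be "green" or "yellow" for a single-node setup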
- Ingest geospatial data into Elasticsearch. I generated geojson source files for all towns/cities/villages (Mestá) together with regions (Kraje) and districts (Okresy). You can find those files in the data folder of the repository (obce.json, kraje.json, okresy.json). The JSON files are pre-generated from Geoportal using the GDAL library for format conversion. Read more in the section above dedicated to geospatial tools. After the import, you will have 3 indices in the cluster, named kraje, obce, and okresy. The GDAL commands are listed below for reference (copy/paste might not work depending on your GDAL install method):
$ ogr2ogr -lco INDEX_NAME=kraje "ES:http://localhost:9200" -lco NOT_ANALYZED_FIELDS={ALL} \
"$(pwd)/data/kraje.json"
$ ogr2ogr -lco INDEX_NAME=obce "ES:http://localhost:9200" -lco NOT_ANALYZED_FIELDS={ALL} \
"$(pwd)/data/obce.json"
$ ogr2ogr -lco INDEX_NAME=okresy "ES:http://localhost:9200" -lco NOT_ANALYZED_FIELDS={ALL} \
"$(pwd)/data/okresy.json"
# Imported indices
GET _cat/indices/obce,kraje,okresy?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open okresy uidA 1 1 79 0 5.1mb 5.1mb
yellow open obce uidB 1 1 2927 0 26.5mb 26.5mb
yellow open kraje uidC 1 1 8 0 1.8mb 1.8mb
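To confirm that the geometries were indexed as geospatial types, you can inspect the field mapping. The field name geometry is what ogr2ogr typically produces for GeoJSON input, but verify it against your own mapping:
GET kraje/_mapping/field/geometry
# expect a geo_shape type for the geometry field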
- Ingest the Covid-19 data into Elasticsearch and stop the Logstash Docker container once the documents are indexed. This is a one-time indexing job. Run the following command from the root of the repository. If all works as expected, the output will show no errors related to the ingest process, and the documents currently being indexed will be printed on the screen.
There are a few essential pieces in the ingest command:
- we use a special template for correct mapping in Elasticsearch (a sketch of what such a template might contain follows the command below): -v $(pwd)/template.json:/tmp/template.json
- we mount the local data folder so Logstash can read the source files: -v $(pwd)/data/:/usr/share/logstash/covid/
- we use our custom Logstash configuration: -v $(pwd)/ls.conf:/usr/share/logstash/pipeline/logstash.conf
- we disable monitoring, which would otherwise cause warning messages because we did not configure it: -e MONITORING_ENABLED=false
docker run --rm -it --network=host \
-v $(pwd)/template.json:/tmp/template.json \
-v $(pwd)/data/:/usr/share/logstash/covid/ \
-v $(pwd)/ls.conf:/usr/share/logstash/pipeline/logstash.conf \
-e MONITORING_ENABLED=false \
docker.elastic.co/logstash/logstash:7.8.1
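To give a rough idea of what template.json does, here is a sketch in the legacy template format the post uses elsewhere. The file in the repository is authoritative; the field types below are assumptions based on the column descriptions:
{
  "index_patterns": ["covid-19-sk"],
  "mappings": {
    "properties": {
      "date":     { "type": "date" },
      "infected": { "type": "integer" },
      "healthy":  { "type": "integer" },
      "died":     { "type": "integer" },
      "city":     { "type": "keyword" },
      "gender":   { "type": "keyword" },
      "region":   { "type": "keyword" },
      "district": { "type": "keyword" },
      "age":      { "type": "integer" }
    }
  }
}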
# Check the index
GET _cat/indices/covid-19-sk?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open covid-19-sk uidddd 1 1 1751 0 557.5kb 557.5kb
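Once the documents are indexed, you can already query them; for example, a quick aggregation of infections per region (assuming the template maps region as keyword and infected as a number; otherwise use region.keyword):
GET covid-19-sk/_search
{
  "size": 0,
  "aggs": {
    "per_region": {
      "terms": { "field": "region" },
      "aggs": {
        "infected_total": { "sum": { "field": "infected" } }
      }
    }
  }
}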
- Additionally, import the template and the actual data for the annotations used in visualizations.
cd data
# Import index template
curl -s -H "Content-Type: application/x-ndjson" -XPUT "localhost:9200/_template/milestones" \
--data-binary "@template_milestones.json"; echo
# You should see message: {"acknowledged":true}
# Index actual data
curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary \
"@milestones.bulk"; echo
# List template
GET _cat/templates/milestones?v
name index_patterns order version composed_of
milestones [milestones] 0
# List index
GET _cat/indices/milestones?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open milestones uidddd 1 1 17 0 7.6kb 7.6kb
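For reference, each document in milestones.bulk follows the standard Elasticsearch bulk format - an action line followed by the document source. A hypothetical example (the actual field names in the file may differ):
{ "index" : { "_index" : "milestones" } }
{ "date" : "2020-03-16", "title" : "Nationwide restrictions announced" }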
- Import Saved Objects from the provided visualisations file and adjust index patterns appropriately if needed. Note: if you named the imported geospatial indices as in the example (kraje, okresy, obce), then there is no need to adjust the index patterns.
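If you prefer the API over the Kibana UI, the Saved Objects import API can do the same. The file name and path below are assumptions, so adjust them to the actual file in the repository:
$ curl -s -X POST "http://127.0.0.1:5601/api/saved_objects/_import" \
    -H "kbn-xsrf: true" \
    --form file=@visualisations.ndjson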
- Navigate to Dashboards and find the one named [covid-19] Overview Dashboard; open it and check the content. If all goes right, this is the dashboard you will see (or a similar one, as the visualisations might evolve over time).
Note: Adjusting the saved objects should not be necessary if you used ogr2ogr (or the wrapper) to import the geojson data; the mappings and index names will match those in the Saved Objects. If you use a different setup, you will need to adjust the index pattern IDs inside the visualisations.ndjson file.
Additional documentation
- Logstash CSV filter plugin
- Elasticsearch templates
Note: Tested with Elasticsearch v7.17.5 and Logstash v7.8.1
Let me know if you learned something new, or if you were able to follow this hands-on tutorial successfully. Feel free to comment or suggest how I can improve the content so you can enjoy it more.