
Elasticsearch for analyzing the CORD-19 dataset

The COVID-19 Open Research Dataset (CORD-19) is a collection of approximately 47,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses, made freely available to the research community by a coalition of research groups. The articles are provided as JSON files so that the global research community can apply natural language processing to them.

Elasticsearch (ES) is a Lucene-based text search engine that works with schema-free JSON documents. It is fast and has client libraries available for most programming languages, including Python. Loading the CORD-19 data onto an ES instance makes search and analysis easy, all from the comfort of a Jupyter notebook. The availability of an Apache Spark (Spark) connector makes the exchange of data between ES and Spark straightforward. Below are the simple steps to load the files into an ES instance.

First, download and install ES and the ES-Spark connector (the elasticsearch-hadoop package), and start ES. Next, download and install Apache Spark from https://spark.apache.org/. Finally, download the CORD-19 dataset.
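To confirm the instance is up before loading anything, a quick check from Python (a minimal sketch, assuming ES is running on the default localhost:9200):

from elasticsearch import Elasticsearch

# Ping the local cluster; returns True when ES is reachable.
es = Elasticsearch()
print(es.ping())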

STEP 1: Create a Spark session (add the path to the connector jar):

from pyspark.sql import SparkSession

# Build a Spark session with the ES-Spark connector jar on the driver
# classpath, and point the connector at the default ES port.
spark = SparkSession \
    .builder \
    .appName("ElasticSpark-1") \
    .config("spark.driver.extraClassPath", "/path/elasticsearch-hadoop-7.6.2/dist/elasticsearch-spark-20_2.11-7.6.2.jar") \
    .config("spark.es.port", "9200") \
    .config("spark.driver.memory", "8G") \
    .config("spark.executor.memory", "12G") \
    .getOrCreate()

STEP 2: Load the JSON files:

path = "/path/data/biorxiv_medrxiv/biorxiv_medrxiv/"

# Each article is a single JSON object spread over multiple lines,
# so multiLine=True is needed for Spark to parse the files correctly.
df = spark.read.json(path, multiLine=True)
df.show(1)

+--------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+
|abstract|         back_matter|         bib_entries|           body_text|metadata|            paper_id|         ref_entries|
+--------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+
|      []|[[[[456,, 453, 8 ...|[[[[R, Zhang, [],...|[[[], [], , i c ,...|  [[], ]|28b10724357672324...|[[, Fumagalli M, ...|
+--------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+
only showing top 1 row
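The truncated columns above hide a deeply nested structure; printing the inferred schema makes it visible and explains the dotted paths used in the next step:

# Show the schema Spark inferred from the JSON; abstract and body_text
# are arrays of structs, and metadata is a struct holding title/authors.
df.printSchema()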

STEP 3: Select the required fields:

df3 = df.select(df.paper_id, df.metadata.title, df.metadata.authors, df.abstract.text, df.body_text.text)
df3.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            paper_id|      metadata.title|    metadata.authors|       abstract.text|      body_text.text|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|28b10724357672324...|                    |                  []|                  []|[i c , a n t i f ...|
|1aa3e788fc6b03c14...|Dark Proteome of ...|[[[Himachal Prade...|[Recently emerged...|[World health org...|
|558d318e1655da9f5...|Connectivity anal...|[[[University of ...|[We utilized a ce...|[Schizophrenia is...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
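Note that the dotted column names are carried into the index as-is, which is why the query in STEP 5 targets metadata.title. If you would rather index flat field names, the columns can be aliased first; a sketch (the names title, authors, abstract, and body are my own choices):

# Optional: flatten the dotted names before writing to ES.
# If you index df3_flat instead of df3, the STEP 5 query
# becomes q="title:(CD14 OR CD8)".
df3_flat = df.select(
    df.paper_id,
    df.metadata.title.alias("title"),
    df.metadata.authors.alias("authors"),
    df.abstract.text.alias("abstract"),
    df.body_text.text.alias("body"),
)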

STEP 4: Create an index and write the Spark DataFrame into ES:

from elasticsearch import Elasticsearch

# Create the target index with the Python client, then write the
# DataFrame into it through the connector.
es = Elasticsearch()
es.indices.create(index="covid")
df3.write.format("es").mode('overwrite').save("covid")
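Because the connector works in both directions, the indexed documents can also be read straight back into Spark, which is a handy way to check the write; a minimal sketch:

# Read the "covid" index back into a DataFrame through the connector.
df_es = spark.read.format("es").load("covid")
print(df_es.count())   # should match the number of articles written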

STEP 5: Do the search!

es.search(index="covid", q="metadata.title:(CD14 OR CD8)")
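The response is a plain Python dict, so the hits can be unpacked directly in the notebook; for example, to list the matching titles (assuming the dotted Spark column names were written to ES unchanged):

res = es.search(index="covid", q="metadata.title:(CD14 OR CD8)")
# Each hit carries the indexed document under "_source".
for hit in res["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("metadata.title"))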

That’s it! You can now search the index and run analytics on the returned records. Next, I will show you how to use QRMine on CORD-19!