The COVID-19 Open Research Dataset (CORD-19) is a dataset of approximately 47,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses, made freely available to the research community by a coalition of research groups. The articles are provided as JSON files so that researchers worldwide can apply natural language processing to them.
Elasticsearch (ES) is a Lucene-based text search engine that stores schema-free JSON documents. Elasticsearch is fast and has client libraries available for most programming languages, including Python. Loading the COVID-19 data into an ES instance makes search and analysis easy, all within the comfort of a Jupyter notebook. The availability of the ES connector for Apache Spark (Spark) makes exchanging data between ES and Spark easy. I have listed below the simple steps to load the files into an ES instance.
First, download and install ES and the ES-Spark connector from here, and start ES. Next, download and install Apache Spark from here: https://spark.apache.org/. The CORD-19 dataset is available here.
STEP 1: Create a Spark session (add the path to the connector jar):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("ElasticSpark-1") \
    .config("spark.driver.extraClassPath", "/path/elasticsearch-hadoop-7.6.2/dist/elasticsearch-spark-20_2.11-7.6.2.jar") \
    .config("spark.es.port", "9200") \
    .config("spark.driver.memory", "8G") \
    .config("spark.executor.memory", "12G") \
    .getOrCreate()
STEP 2: Load the JSON files:
path = "/path/data/biorxiv_medrxiv/biorxiv_medrxiv/"
df = spark.read.json(path, multiLine=True)
df.show(1)
|abstract| back_matter| bib_entries| body_text|metadata| paper_id| ref_entries|
| |[[[[456,, 453, 8 …|[[[[R, Zhang, ,…|[[, , , i c ,…| [, ]|28b10724357672324…|[[, Fumagalli M, …|
only showing top 1 row
STEP 3: Select the required fields:
df3 = df.select(
    df.paper_id,
    df.metadata.title,
    df.metadata.authors,
    df.abstract.text,
    df.body_text.text,
)
df3.show(3)
| paper_id| metadata.title| metadata.authors| abstract.text| body_text.text|
|28b10724357672324…| | | |[i c , a n t i f …|
|1aa3e788fc6b03c14…|Dark Proteome of …|[[[Himachal Prade…|[Recently emerged…|[World health org…|
|558d318e1655da9f5…|Connectivity anal…|[[[University of …|[We utilized a ce…|[Schizophrenia is…|
only showing top 3 rows
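To make the nested selection concrete, here is a plain-Python sketch of what Step 3 pulls out of a single CORD-19 JSON record. The sample record is made up for illustration (only the top-level field names come from the columns shown in Step 2; the author sub-fields and all values are placeholders), and `select_fields` is a hypothetical helper, not part of Spark or of the dataset tooling.

```python
import json

# A made-up record that mimics the nesting seen in the DataFrame columns.
# Real CORD-19 files contain more keys; the values here are placeholders.
sample = json.loads("""
{
  "paper_id": "example-id",
  "metadata": {"title": "An example title",
               "authors": [{"first": "A", "last": "Author"}]},
  "abstract": [{"text": "First abstract paragraph."},
               {"text": "Second abstract paragraph."}],
  "body_text": [{"text": "Body paragraph one."},
                {"text": "Body paragraph two."}]
}
""")

def select_fields(record):
    """Mirror the df.select(...) call from Step 3 for one record (hypothetical helper)."""
    return {
        "paper_id": record["paper_id"],
        "metadata.title": record["metadata"]["title"],
        "metadata.authors": record["metadata"]["authors"],
        # abstract and body_text are lists of paragraphs,
        # so selecting .text yields a list of strings
        "abstract.text": [p["text"] for p in record["abstract"]],
        "body_text.text": [p["text"] for p in record["body_text"]],
    }

row = select_fields(sample)
print(row["abstract.text"])
```

This is why the `abstract.text` and `body_text.text` columns in the output appear as bracketed lists: each holds one string per paragraph rather than a single block of text.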
STEP 4: Create an index and write the Spark DataFrame into ES:
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index="covid")
df3.write.format("es").mode('overwrite').save("covid")
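One optional refinement to this step, sketched under assumptions: the connector accepts an `es.mapping.id` setting that names the column to use as the ES document `_id`. Combined with append mode, re-running the load should then replace papers that share a `paper_id` instead of adding duplicates. The helper names `es_write_options` and `write_to_es` are hypothetical, and the host and port values are assumed defaults for a local ES instance; this differs from the overwrite mode used in the step itself.

```python
def es_write_options(id_field="paper_id", nodes="localhost", port="9200"):
    """Build connector options as a plain dict (hypothetical helper)."""
    return {
        "es.nodes": nodes,          # assumed: ES runs on the local machine
        "es.port": port,            # same port as in the Spark session config
        "es.mapping.id": id_field,  # use this column as the ES document _id
    }

def write_to_es(df, index, options, mode="append"):
    """Write a Spark DataFrame to ES with the given connector options."""
    df.write.format("es").options(**options).mode(mode).save(index)

options = es_write_options()
print(options)

# With the df3 from Step 3 and a running ES instance:
# write_to_es(df3, "covid", options)
```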
STEP 5: Do the search!
es.search(index="covid", q="metadata.title:(CD14 OR CD8)")
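The `q` parameter uses Lucene query string syntax. As a rough sketch, the same search can be written as a `query_string` query in the request body, and the matching records pulled out of the `hits.hits` list of the response. The helper names and the `fake_response` dict are made up for illustration; passing `body=` to `es.search` assumes a 7.x Python client, matching the connector version used in Step 1.

```python
def title_query(terms):
    """Build a query_string body matching any of the terms in metadata.title."""
    joined = " OR ".join(terms)
    return {"query": {"query_string": {"query": f"metadata.title:({joined})"}}}

def extract_sources(response):
    """Return the stored documents from an ES search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]

body = title_query(["CD14", "CD8"])
print(body)

# With the `es` client from Step 4 and a running ES instance:
# response = es.search(index="covid", body=body)
# records = extract_sources(response)

# A hand-written stand-in for a response, for illustration only.
fake_response = {"hits": {"hits": [{"_source": {"paper_id": "example-id"}}]}}
print(extract_sources(fake_response))
```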
That’s it! You can now search the index and run analytics on the returned records. Next, I will show you how to use QRMine on CORD-19!