Categories
Healthcare Analytics Research

Elasticsearch for analyzing CORD-19 dataset

The COVID-19 Open Research Dataset (CORD-19) is a dataset of approximately 47,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses, made freely available to the research community by a coalition of research groups. The articles are provided as JSON files so that the global research community can apply natural language processing.

Elasticsearch (ES) is a Lucene-based text search engine that works with schema-free JSON documents. Elasticsearch is fast and has client libraries available for most programming languages, including Python. Loading the COVID-19 data onto an ES instance makes search and analysis easy, all within the comfort of a Jupyter notebook. The availability of the Apache Spark (Spark) connector makes the exchange of data between ES and Spark easy. I have listed below the simple steps to load the files into an ES instance.

First, download and install ES and the ES-Spark connector from here, and start ES. Next, download and install Apache Spark from https://spark.apache.org/. The CORD-19 dataset is available here.

STEP 1: Create a Spark session (add the path to the connector jar):

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("ElasticSpark-1") \
    .config("spark.driver.extraClassPath", "/path/elasticsearch-hadoop-7.6.2/dist/elasticsearch-spark-20_2.11-7.6.2.jar") \
    .config("spark.es.port","9200") \
    .config("spark.driver.memory", "8G") \
    .config("spark.executor.memory", "12G") \
    .getOrCreate()

STEP 2: Load the JSON files:

path = "/path/data/biorxiv_medrxiv/biorxiv_medrxiv/"
df = spark.read.json(path, multiLine=True)
df.show(1)

+--------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+
|abstract|         back_matter|         bib_entries|           body_text|metadata|            paper_id|         ref_entries|
+--------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+
|      []|[[[[456,, 453, 8 ...|[[[[R, Zhang, [],...|[[[], [], , i c ,...|  [[], ]|28b10724357672324...|[[, Fumagalli M, ...|
+--------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+
only showing top 1 row

STEP 3: Select the required fields:

df3 = df.select(df.paper_id, df.metadata.title, df.metadata.authors, df.abstract.text, df.body_text.text)
df3.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            paper_id|      metadata.title|    metadata.authors|       abstract.text|      body_text.text|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|28b10724357672324...|                    |                  []|                  []|[i c , a n t i f ...|
|1aa3e788fc6b03c14...|Dark Proteome of ...|[[[Himachal Prade...|[Recently emerged...|[World health org...|
|558d318e1655da9f5...|Connectivity anal...|[[[University of ...|[We utilized a ce...|[Schizophrenia is...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

STEP 4: Create an index and write the Spark DataFrame into ES:

from elasticsearch import Elasticsearch
es = Elasticsearch()
es.indices.create(index="covid")
df3.write.format("es").mode('overwrite').save("covid")

STEP 5: Do the search!

es.search(index="covid", q="metadata.title:(CD14 OR CD8)")
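The hits can also be pulled into a pandas DataFrame for further analysis. Below is a minimal sketch that reuses the es client created in STEP 4; the size parameter and the flattening step are just one way to do it.

import pandas as pd

# reuse the es client from STEP 4; fetch up to 100 matching documents
res = es.search(index="covid", q="metadata.title:(CD14 OR CD8)", size=100)

# flatten the _source of each hit into a tabular form
hits = [hit["_source"] for hit in res["hits"]["hits"]]
papers = pd.json_normalize(hits)
print(papers.head())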

That’s it! You can now use it for search and do analytics on the returned records. Next, I will show you how to use QRMine on CORD-19!

Categories
HIS

Public Health Data Warehouse on FHIR

The Ontario government is building a connected health care system centred around patients, families and caregivers through the newly established Ontario Health Teams (OHT). As disparate healthcare and public health teams move towards a unified structure, there is a growing need to reconsider our information system strategy. Most off-the-shelf solutions are pricey, while open-source solutions such as DHIS2 are not popular in Canada. Some of the public health units have existing systems, and it would be too resource-intensive to switch to another system. The interoperability challenge needs an innovative solution, beyond finding the single, provincial EMR.


We have written about the theoretical aspects, especially the need to envision public health information systems (PHIS) separate from an EMR. In this working paper, we propose a maturity model for PHIS and offer some pragmatic recommendations for dealing with the common challenges faced by public health teams.

Below is a demo project on GitHub from the data-intel lab that showcases a potential solution for a scalable data warehouse for health information system integration. Public health databases are vital to the community for efficient planning, surveillance and effective interventions. Public health data needs to be integrated at various levels for effective policymaking. PHIS-DW adopts FHIR as the data model for storage, with the integrated Elasticsearch stack, and Kibana provides the visualization engine. PHIS-DW can support complex algorithms for disease surveillance, ranging from machine learning methods, hidden Markov models and Bayesian approaches to multivariate analytics. PHIS-DW is a work in progress and code contributions are welcome. We intend to use Bunsen to integrate PHIS-DW with Apache Spark for big data applications.
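As a rough sketch of the storage idea (not the actual PHIS-DW code), a FHIR Observation can be indexed as a plain JSON document in Elasticsearch and then explored in Kibana. The index name and the coding values below are illustrative only.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# an illustrative FHIR Observation; the codes and values are placeholders
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "94500-6"}]},
    "valueCodeableConcept": {"coding": [{"system": "http://snomed.info/sct",
                                         "code": "260373001", "display": "Detected"}]},
    "effectiveDateTime": "2020-04-01",
}

# index it as a JSON document; Kibana can then visualize the "phis-dw" index
es.index(index="phis-dw", body=observation)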

Public Health Data Warehouse Framework on FHIR

FHIR has some advantages as a data persistence schema for public health. Apart from its popularity, the FHIR bundle makes it possible to send observations to FHIR servers without the associated patient resource, thereby ensuring reasonable privacy. This is especially useful in the surveillance of pandemics such as COVID-19. Some useful yet complicated integrations with OSCAR EMR and DHIS2 are under consideration. If any of the OHTs find our approach interesting, give us a shout.
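To illustrate the bundle idea (a sketch under assumptions, not the PHIS-DW implementation), a transaction Bundle carrying only an Observation, with no Patient resource, can be posted to a FHIR endpoint. The base URL below is a placeholder.

import requests

# a transaction Bundle containing a single, de-identified Observation
bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [{
        "resource": {
            "resourceType": "Observation",
            "status": "final",
            "code": {"text": "COVID-19 test result"},
            "valueString": "positive",
        },
        "request": {"method": "POST", "url": "Observation"},
    }],
}

# placeholder FHIR base URL; replace with the actual server
response = requests.post("http://localhost:8080/fhir", json=bundle,
                         headers={"Content-Type": "application/fhir+json"})
print(response.status_code)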

BTW, have you seen Drishti, our framework for FHIR-based behavioural intervention?

Categories
Machine Learning

Machine Learning in population health: Creating conditions that ensure good health.

Machine Learning (ML) in healthcare has an affinity for patient-centred care and individual-level predictions. Population health deals with health outcomes in a group of individuals and the distribution of those outcomes within the group. Individual health and population health are not divergent, but they are not the same either and may require different approaches. ML in public health applications receives far less attention.

The skills available to public health organizations to transition towards integrated data analytics are limited. Hence, the latest advances in ML and artificial intelligence (AI) have made very little impact on public health analytics and decision making. The biggest barrier is the lack of expertise in conceiving and implementing data warehouse systems for public health that can integrate the health information systems currently in use.

The data in public health organizations are generally scattered across disparate information systems within the region or even within the same organization. Efficient and effective health data warehousing requires a common data model for integrated data analytics. The OHDSI – OMOP Common Data Model allows for the systematic analysis of disparate observational databases and EMRs. However, the emphasis is on patient-level prediction. Research on how patient-centred data models can be adapted into observation-centred population health data models is the need of the hour.

We are making a difficult yet important transition towards integrated health, with local health teams providing new ways of delivering services in local communities. The emphasis is clearly on digital health, and we need efficient and effective digital tools and techniques. Motivated by the Ontario Health Teams’ digital strategy, I have been working on tools to support this transition.

Hephestus is a software tool for ETL (Extract-Transform-Load) for open-source EMR systems such as OSCAR EMR and national datasets such as the Discharge Abstract Database (DAD). It is organized into modules to allow code reuse. Hephestus uses SQLAlchemy for database connection and auto-mapping tables to classes, and bonobo for managing ETL. Hephestus aims to support common machine learning workflows such as model building with Apache Spark and model deployment using serverless architecture. I am also working on FHIR-based standards for ML model deployments.

Hephestus is a work in progress, and any help will be highly appreciated. Hephestus is an open-source project on GitHub. If you are looking for an open-source project to contribute to for Hacktoberfest, consider Hephestus!

Categories
OpenSource Resources

Hephestus: Health data warehousing tool for public health and clinical research

Originally published by Bell Eapen at nuchange.ca on November 3, 2018. If you have some feedback, reach out to the author on Twitter, LinkedIn, or GitHub.

Health data warehousing is becoming an important requirement for deriving knowledge from the vast amount of health data that healthcare organizations collect. A data warehouse is vital for collaborative and predictive analytics. The first step in designing a data warehouse is to decide on a suitable data model. This is followed by the extract-transform-load (ETL) process that converts source data to the new data model, making it amenable to analytics.

The OHDSI – OMOP Common Data Model is one such data model that allows for the systematic analysis of disparate observational databases and EMRs. The data from diverse systems need to be extracted, transformed and loaded onto a CDM database. Once a database has been converted to the OMOP CDM, evidence can be generated using standardized analytics tools that are already available.
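As a small illustration of what "standardized analytics" means in practice (a sketch, assuming a CDM v5 database; the connection string is a placeholder), the same query runs unchanged on any conformant database:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/omop")  # placeholder connection

# count persons per condition concept; table and column names follow OMOP CDM v5
query = text("""
    SELECT condition_concept_id, COUNT(DISTINCT person_id) AS n_persons
    FROM condition_occurrence
    GROUP BY condition_concept_id
    ORDER BY n_persons DESC
    LIMIT 10
""")

with engine.connect() as conn:
    for row in conn.execute(query):
        print(row)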

Each data source requires customized ETL tools for this conversion from the source data to the CDM. The OHDSI ecosystem has made some tools available to help with the ETL process, such as White Rabbit and Rabbit In a Hat. However, the health data warehousing process is still challenging because of the variability of source databases in terms of structure and implementation.

Hephestus is an open-source Python tool for this ETL process, organized into modules to allow code reuse between the various ETL tools for open-source EMR systems and data sources. Hephestus uses SQLAlchemy for database connection and automapping tables to classes, and bonobo for managing ETL. The ultimate aim is to develop a tool that can translate the report from the OHDSI tools into an ETL script with minimal intervention. This is a good Python starter project for eHealth geeks.
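The pattern looks roughly like the sketch below. This is not the actual Hephestus code; the connection string, table name and column names are placeholders for an OSCAR-like source.

import bonobo
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("mysql+pymysql://user:password@localhost/oscar")  # placeholder

# automap the source schema to Python classes
Base = automap_base()
Base.prepare(engine, reflect=True)
Demographic = Base.classes.demographic  # assumed source table name

def extract():
    session = Session(engine)
    for row in session.query(Demographic):
        yield row

def transform(row):
    # map a source record to a simplified OMOP-style person dictionary
    yield {"person_id": row.demographic_no, "year_of_birth": row.year_of_birth}

def load(person):
    print(person)  # replace with an insert into the CDM person table

# bonobo chains the three steps into an ETL graph
graph = bonobo.Graph(extract, transform, load)

if __name__ == "__main__":
    bonobo.run(graph)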

Anyone anywhere in the world can build their own environment that can store patient-level observational health data, convert their data to OHDSI’s open community data standards (including the OMOP Common Data Model), run open-source analytics using the OHDSI toolkit, and collaborate in OHDSI research studies that advance our shared mission toward reliable evidence generation. Join the journey here!

Disclaimer: Hephestus is just my experiment and is not a part of the official OHDSI toolset.
