
OHDSI OMOP CDM ETL Tools in Python, .NET and Go

TL;DR Here are a few OHDSI OMOP CDM libraries that may save you time if you are developing ETL tools!

Originally published by Bell Eapen on June 11, 2020. If you have feedback, reach out to the author on Twitter, LinkedIn or GitHub.

Python: pyomop | pypi
.NET: omopcdmlib | NuGet
Golang: gocdm


The COVID-19 pandemic brought to light many of the vulnerabilities in our data collection and analytics workflows. The lack of uniform data models limits the analytical capabilities of public health organizations, and many of them have to reinvent the wheel even for basic analysis. While many other sectors embrace big data and machine learning, many healthcare analysts are still stuck with basic data wrangling in Excel.

The OHDSI OMOP CDM (Common Data Model) for observational data is a popular initiative for bringing data into a common format that allows for collaborative research, large-scale analytics, and sharing of sophisticated tools and methodologies. Though the OHDSI OMOP CDM is primarily intended for patient-centred observational analysis, mostly in clinical research, it can be used with minor tweaks for public health and epidemiologic data as well. We have written about some of the technical details here.

The OHDSI OMOP CDM is simpler and more intuitive for clinical teams than emerging standards such as FHIR. Though the relational database approach and some of the software tools associated with the OHDSI OMOP CDM are archaic, the data model is clinically motivated. There is an ecosystem of analytics tools built around the CDM that can be used out of the box. The Observational Medical Outcomes Partnership (OMOP) CDM, now in version 6.0, has simple but powerful vocabulary management. The OHDSI OMOP CDM is a good choice for healthcare organizations moving towards health data warehousing and OLAP.

One weakness of OHDSI is the lack of tools for efficient ETL from existing EHR and HIS. Converting existing EHR data to the CDM is still a complex task that requires technical expertise. With the additional “home time” during the COVID pandemic, I have created three software libraries for ETL tool developers. These libraries in Python, .NET and Golang encapsulate the V6.0 CDM and help in writing and reading data from a variety of databases with the V6.0 tables. The libraries also support creating the CDM tables for new databases and loading the vocabulary files.

Python: pyomop | pypi
.NET: omopcdmlib | NuGet
Golang: gocdm
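
To give a sense of what an ETL step into the CDM involves, here is a minimal sketch that uses only the Python standard library. It does not use the API of pyomop or of the other libraries above. The source rows, the helper names and the reduced `person` table (a small subset of the real CDM columns) are assumptions made for illustration; the gender concept IDs 8507 (male) and 8532 (female) are the standard OMOP concepts.

```python
import sqlite3

# Standard OMOP gender concept IDs; 0 is used when no concept matches.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

# Hypothetical source EHR rows: (patient number, sex code, date of birth).
SOURCE_PATIENTS = [
    ("P-001", "F", "1980-04-12"),
    ("P-002", "M", "1975-11-30"),
    ("P-003", "U", "1990-01-05"),
]


def create_person_table(conn):
    """Create a simplified subset of the CDM person table."""
    conn.execute(
        """
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            gender_concept_id INTEGER NOT NULL,
            year_of_birth INTEGER NOT NULL,
            month_of_birth INTEGER,
            day_of_birth INTEGER,
            person_source_value TEXT,
            gender_source_value TEXT
        )
        """
    )


def to_person_row(person_id, source_row):
    """Map one source patient row to a row of the simplified person table."""
    patient_no, sex, dob = source_row
    year, month, day = (int(part) for part in dob.split("-"))
    return (person_id, GENDER_CONCEPTS.get(sex, 0), year, month, day, patient_no, sex)


def load_persons(conn, source_rows):
    """Transform and insert all source rows; return the number loaded."""
    rows = [to_person_row(i, row) for i, row in enumerate(source_rows, start=1)]
    conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_person_table(conn)
loaded = load_persons(conn, SOURCE_PATIENTS)
```

The original source values are kept in the `*_source_value` columns so that the mapping to standard concepts can be audited later.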

These libraries might save you some time if you are building scripts for ETL to CDM. They are all open-source and free to use in your tools. Do give me a shout if you find these libraries useful and please star the repositories on GitHub.


DADpy: The Swiss Army knife for the Discharge Abstract Database

The Discharge Abstract Database (DAD) is a Canada-wide database of hospital admission and discharge data, excluding the province of Quebec, maintained by the Canadian Institute for Health Information (CIHI). The data points in DAD include patient demographics, comorbidities coded in the International Statistical Classification of Diseases and Related Health Problems (ICD), interventions encoded in the Canadian Classification of Health Interventions (CCI) and the length of stay. A de-identified 10% sample of DAD is available to academic researchers under the Data Liberation Initiative (DLI). DAD is arguably the most comprehensive country-wide discharge dataset in the world.


The Discharge Abstract Database is used for creating public reports for hospitals, researchers, and the general public. DAD data has also been used for disease-specific research and analysis, including public health, disease surveillance, and health services research. CIHI provides DAD in the SPSS (.sav) format, with each record having horizontal fields for 20 comorbidities and 25 interventions. This wide format is not ideal for slicing and dicing the data, or for building visualizations that give clinicians clinical insights.
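
To illustrate why the wide layout is awkward, the sketch below reshapes a few made-up records from one-column-per-diagnosis into one-row-per-diagnosis, which is much easier to group and count. It is not DADpy's API. The field names (`record_id`, `dx_1`, `los_days`) are hypothetical and only three diagnosis fields are shown instead of 20.

```python
from collections import Counter

# Hypothetical wide-format records; real DAD field names differ.
WIDE_RECORDS = [
    {"record_id": 1, "dx_1": "I10", "dx_2": "E119", "dx_3": "", "los_days": 4},
    {"record_id": 2, "dx_1": "I10", "dx_2": "", "dx_3": "", "los_days": 7},
]


def wide_to_long(records, prefix, id_field="record_id"):
    """Return one (id, position, code) tuple per non-empty coded field."""
    long_rows = []
    for record in records:
        for field, value in record.items():
            # Keep only the coded fields (e.g. dx_1, dx_2) that hold a value.
            if field.startswith(prefix) and value:
                position = int(field[len(prefix):])
                long_rows.append((record[id_field], position, value))
    return long_rows


diagnoses = wide_to_long(WIDE_RECORDS, "dx_")

# In long form, counting how often each code occurs is a one-liner.
code_counts = Counter(code for _, _, code in diagnoses)
```

The same function would work for the intervention fields by passing a different prefix.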

DADpy provides a set of functions for using the DAD dataset for machine learning and visualization. The package does not include the dataset. Academic researchers can request the DAD dataset from CIHI. This is an unofficial repo, and I’m not affiliated with CIHI. Please retain the disclaimer below in forks.

Installation: (will be added to PyPI soon)

We use Poetry for development. PRs are welcome. Please see the repo for details. Start by renaming .env.example to .env and adding the path the tests need in order to run. Add Jupyter notebooks to the notebook folder. Include the disclaimer below.

Disclaimer: Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2016-17). However the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.

Let us know if you use DADpy for creating interesting Jupyter notebooks.


Grounded Theory – QRMine: Qualitative Research support tools in Python.

Grounded theory (GT) emerged as a research methodology from medical sociology following the seminal work by Barney Glaser and Anselm Strauss. However, they later developed different views on their original contribution, and their supporters went on to establish a classical Glaserian GT and a pragmatic Straussian GT. Constant comparison is central to Classical Grounded Theory. It involves incident-to-incident comparison for identifying categories, incident-to-category comparison for refining the categories, and category-to-category comparison for the emergence of the theory.

Glaser’s Classical GT (1) provides guidelines for evaluation of the GT methodology. The evaluation should be based on whether the theory fits the data, whether the theory is understandable to non-professionals, whether the theory is generalizable to other situations, and whether the theory offers control over the structure and processes.

Strauss and Corbin (2) recommended a strict coding structure, elaborating on how to code and structure data. Their seminal article describes three stages of coding: open coding, axial coding, and selective coding. Classical Grounded Theory offers more flexibility than Straussian GT, while the latter may be easier to conduct, especially for new researchers.

Open coding is the first step, where data is broken down analytically and conceptually similar chunks are grouped together under categories and subcategories. Once the differences between the categories are established, the properties and dimensions of each are dissected. Coding in GT may be overwhelming, and scaling up categories from open coding may be difficult. This leads to the generation of low-level theories. With natural language processing, information systems can help young researchers make sense of the data that they have collected during the stage of open coding. QRMine is a software suite for supporting qualitative researchers using NLP. Gtdict is a module that identifies Categories, Properties, and Dimensions in the interview transcript.
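
As a very rough illustration of how text processing can surface candidate concepts during open coding, the sketch below counts the most frequent non-trivial words in a made-up transcript. This is a naive term-frequency stand-in, not how QRMine or Gtdict work. The transcript, the stop-word list and the function name are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
    "i", "was", "that", "my", "me", "with", "for", "on", "but",
}


def candidate_terms(transcript, top_n=2):
    """Return the top_n most frequent words that are not stop words."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(top_n)


# Made-up interview excerpt.
TRANSCRIPT = (
    "The nurse explained the medication. I trusted the nurse, "
    "but the medication schedule was confusing. The nurse was patient with me."
)

terms = candidate_terms(TRANSCRIPT)
```

Frequent terms like these are only starting points. The researcher still decides which of them are meaningful categories and how they relate to each other.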

QRMine is open source and is available here. Ideas, comments and pull requests are welcome.

1. Glaser BG. The Constant Comparative Method of Qualitative Analysis. Social Problems [Internet]. 1965 Apr;12(4):436–45.
2. Corbin JM, Strauss A. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology [Internet]. 1990;13(1):3–21.