Project Overview

Technologies and Tools Employed

  • Azure Data Factory (ADF): Serves as the backbone for orchestrating data workflows, facilitating the extraction, transformation, and loading (ETL) processes from various data sources.
  • Azure Data Lake Storage: Acts as a scalable storage solution for raw and processed data, accommodating both structured and unstructured datasets.
  • Azure SQL Database: Functions as the staging and transformation area where data undergoes cleansing and structuring before being loaded into the data warehouse.
  • Azure Key Vault: Ensures the secure management of secrets and credentials used across the data pipeline.
  • Azure Databricks: Used to implement the medallion architecture in notebooks for data processing, where all team members can collaborate and contribute.

What Do We Have to Do?

  • It starts with a patient visit: patient details and insurance details are collected
  • Services are provided
  • Billing happens: The hospital will create a bill
  • Claims are reviewed: the insurance company reviews the bill and may pay it in full, pay it partially, or decline it
  • Payments and follow-ups: these happen when the insurance company makes only a partial payment
  • Tracking and improvement: RCM ensures the hospital can provide quality care while also staying financially healthy

Project Dataset

  • EMR data (Azure SQL Database): Patients, Providers, Departments, Transactions, Encounters
  • Claims data: Represents insurance claims information. Provided as flat files (e.g., CSV) and placed in the landing zone of Azure Data Lake Storage Gen2.
  • NPI data: Contains information about healthcare providers. Fetched via public APIs and stored in the bronze layer of the data lake.
  • ICD codes: Standardized codes for diagnoses. Retrieved through APIs and stored similarly to the NPI data.
  • CPT codes: Standardized codes for medical procedures. Provided as flat files and processed into the data lake.
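
As an illustration of the flat-file path, the PySpark sketch below picks the claims CSVs up from the landing zone and lands them in the bronze layer as Parquet. The container layout, file pattern, and the `<storage-account>` placeholder are assumptions, not the project's exact paths.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the claims flat files dropped in the landing zone (CSV with a header row).
claims_raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("abfss://landing@<storage-account>.dfs.core.windows.net/claims/*.csv")
)

# Persist the data unchanged into the bronze layer in Parquet format.
(
    claims_raw.write
    .mode("overwrite")
    .parquet("abfss://bronze@<storage-account>.dfs.core.windows.net/claims/")
)
```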

Data Processing Layers

  • Bronze Layer:
    • Raw data ingestion.
    • Data is stored in Parquet format.
  • Silver Layer:
    • Data cleansing and transformation.
    • Implementation of Slowly Changing Dimensions (SCD) Type 2 for tracking historical changes.
    • Data stored in Delta Lake format.
  • Gold Layer:
    • Aggregated and business-ready data.
    • Creation of fact and dimension tables for reporting and analytics.
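
To make the layers concrete, a minimal notebook cell for the bronze-to-silver hop could look like the sketch below. The dataset (encounters), column names, and storage paths are illustrative assumptions, not the project's exact schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw encounters exactly as ingested (Parquet).
encounters_bronze = spark.read.parquet(
    "abfss://bronze@<storage-account>.dfs.core.windows.net/encounters/"
)

# Silver: basic cleansing -- drop fully empty rows, trim the business key,
# and standardize the date format (column names are assumed for illustration).
encounters_silver = (
    encounters_bronze
    .dropna(how="all")
    .withColumn("EncounterID", F.trim(F.col("EncounterID")))
    .withColumn("EncounterDate", F.to_date("EncounterDate", "yyyy-MM-dd"))
)

# Store the cleansed data in Delta Lake format for the silver layer.
(
    encounters_silver.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://silver@<storage-account>.dfs.core.windows.net/encounters/")
)
```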

Solution Architecture

  • Bronze Layer:
    • Source Data: From SQL DB, APIs (ICD, NPI), and CSV files (CPT, Claims)
    • Storage: Stored in Azure Data Lake Storage Gen2 in Parquet format
    • Purpose: Retain the raw, unprocessed data for traceability and auditability
    • Tools Used: ADF for copying raw data; data ingested into Databricks as external tables (see the sketch below)
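
To illustrate the external-table step, a Databricks cell along these lines can register the Parquet files copied by ADF as an external table; the `bronze` database, table name, and path are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expose the raw Parquet files copied by ADF as an external (unmanaged) table,
# so the bronze data can be queried in place without moving it.
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.claims
    USING PARQUET
    LOCATION 'abfss://bronze@<storage-account>.dfs.core.windows.net/claims/'
""")

# Quick sanity check on the ingested rows.
spark.sql("SELECT COUNT(*) AS claim_rows FROM bronze.claims").show()
```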
  • Silver Layer:
    • Transformation:
      • Clean nulls, fix schema issues, standardize formats
      • Join datasets (e.g., claims with patient or provider info)
      • Apply SCD Type 2 logic for dimensions to track changes (see the sketch below)
    • Storage: Stored as Delta tables (with ACID compliance)
    • Purpose: Ready for analytics but still granular; acts as a reliable “single source of truth”
    • Tools Used: Azure Databricks + PySpark; business logic written in notebooks
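
A hedged sketch of the SCD Type 2 logic on a Delta dimension is shown below. The dimension (dim_patient), business key (PatientID), and tracked columns (Address, InsuranceProvider) are illustrative assumptions, and the update schema is assumed to already match the dimension.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/dim_patient/"
dim_patient = DeltaTable.forPath(spark, silver_path)

# Latest patient snapshot from the bronze layer, stamped with a start date.
updates = (
    spark.read.parquet("abfss://bronze@<storage-account>.dfs.core.windows.net/patients/")
    .withColumn("start_date", F.current_date())
)

# Step 1: close out current rows whose tracked attributes changed.
(
    dim_patient.alias("t")
    .merge(updates.alias("s"), "t.PatientID = s.PatientID AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.Address <> s.Address OR t.InsuranceProvider <> s.InsuranceProvider",
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .execute()
)

# Step 2: append new versions for changed patients and rows for brand-new patients.
still_current = spark.read.format("delta").load(silver_path).filter("is_current = true")
new_rows = (
    updates.join(still_current.select("PatientID"), "PatientID", "left_anti")
    .withColumn("is_current", F.lit(True))
    .withColumn("end_date", F.lit(None).cast("date"))
)
new_rows.write.format("delta").mode("append").save(silver_path)
```

Splitting the work into an "expire" step and an "append" step keeps every historical version of a dimension row in place, which is exactly what makes the silver layer auditable.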
  • Gold Layer:
    • Aggregation:
      • Metrics like total claims per department, average cost per diagnosis, etc. (example below)
      • Data models built as fact and dimension tables
    • Purpose: Ready for consumption in Power BI dashboards; optimized for business queries and reports
    • Tools Used: Databricks for final aggregation; Power BI connected to Gold layer Delta tables or Synapse
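
As an example of a gold-layer metric, total claims per department could be derived roughly as follows; the table locations and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Silver inputs (paths and columns are illustrative).
claims = spark.read.format("delta").load(
    "abfss://silver@<storage-account>.dfs.core.windows.net/claims/"
)
encounters = spark.read.format("delta").load(
    "abfss://silver@<storage-account>.dfs.core.windows.net/encounters/"
)

# Gold metric: claim count and total claim amount per department.
claims_per_dept = (
    claims.join(encounters, "EncounterID")
    .groupBy("DepartmentID")
    .agg(
        F.count("ClaimID").alias("total_claims"),
        F.sum("ClaimAmount").alias("total_claim_amount"),
    )
)

(
    claims_per_dept.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://gold@<storage-account>.dfs.core.windows.net/claims_per_department/")
)
```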

Star Schema: Fact & Dimension Modeling

[Star schema diagram: fact table linked to dimension tables]
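
Once the fact and dimension tables are in place, reports simply join them along the star schema. A hypothetical query (table and column names invented for illustration) could look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Report-style query over assumed gold tables: average billed amount
# per diagnosis, broken down by provider specialty.
result = spark.sql("""
    SELECT d.icd_description,
           p.provider_specialty,
           AVG(f.billed_amount) AS avg_billed_amount
    FROM gold.fact_transactions f
    JOIN gold.dim_diagnosis d ON f.diagnosis_key = d.diagnosis_key
    JOIN gold.dim_provider  p ON f.provider_key  = p.provider_key
    GROUP BY d.icd_description, p.provider_specialty
""")
result.show(truncate=False)
```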

Project Data Ingestion Pipeline

  • Linked Service Name: AzureSqlDatabase1
    • Used in: Copy Activity (to bring Patients, Providers, Claims, etc.). This single linked service is used to connect to both EMR databases.
    • Authentication: SQL authentication
    • Purpose: Bring data from the SQL databases into the Storage Account container.
  • Linked Service Name: AzureDataLakeStorage1
    • Used in: Copy Activity (store raw and transformed data); sink in data movement
    • Authentication: Azure Key Vault / Managed Identity
    • Purpose: Destination for Bronze → Silver → Gold data files (Parquet/Delta)
  • Linked Service Name: AzureDatabricksDeltaLake1
    • Used in: Notebook Activity
    • Authentication: Access token or Managed Identity
    • Purpose: Trigger notebooks for the data transformation logic and for maintaining the audit table in Delta tables
  • Linked Service Name: LS_KeyVault
    • Used to: Store secrets such as:
      • SQL DB credentials
      • API keys
      • Databricks access tokens
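
Inside a Databricks notebook, those Key Vault secrets are read through a Key Vault-backed secret scope. The sketch below shows the idea; the scope, key, server, database, and table names are placeholders, not the project's actual values.

```python
# `dbutils` and `spark` are provided by the Databricks notebook runtime.
# Read a credential from a Key Vault-backed secret scope (names are placeholders).
sql_password = dbutils.secrets.get(scope="kv-rcm-scope", key="emr-sql-password")

# Example: use the retrieved credential for a JDBC read of one EMR database.
emr_patients = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<emr-db>")
    .option("dbtable", "dbo.patients")
    .option("user", "<sql-user>")
    .option("password", sql_password)
    .load()
)
emr_patients.show(5)
```

Keeping the credentials in Key Vault and referencing them through the secret scope (and through the ADF linked services above) means no password or token ever appears in notebook code or pipeline definitions.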

By Akshay Tekam

Software developer, data science enthusiast, content creator.
