About the Project

[Figure: high-level view of the data platform]

Design Goals and Best Practices

  • Design a secure Lakehouse platform with dev, test, and production environments
  • Decouple data ingestion from data processing
  • Support batch and streaming workflows
  • Automate integration testing
  • Automate deployment to the test and production environments
[Figure: two external source databases]

Storage Design

[Figure: lakehouse platform storage structure]
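
The storage figures are not reproduced here, so the snippet below is only a sketch of a typical medallion layout on ADLS Gen2, with a landing zone plus one container per layer. The storage account and container names are assumptions, not values from the project.

```python
# Sketch of a medallion storage layout on ADLS Gen2.
# The storage account name and container names are assumptions.
STORAGE_ACCOUNT = "lakehousesa"  # hypothetical storage account


def layer_path(container: str) -> str:
    """Return the abfss URI for one container in the storage account."""
    return f"abfss://{container}@{STORAGE_ACCOUNT}.dfs.core.windows.net"


LANDING_PATH = layer_path("landing")  # raw files dropped by ingestion tools
BRONZE_PATH = layer_path("bronze")    # data ingested as-is, with audit columns
SILVER_PATH = layer_path("silver")    # cleaned, deduplicated, conformed data
GOLD_PATH = layer_path("gold")        # aggregated, report-ready tables
```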

Decouple Data Ingestion

[Figure: data ingestion pipeline]

Design Bronze Layer

[Figure: bronze layer design]

Design Silver and Gold Layers

[Figure: silver and gold layer design]

Set Up the Environment

[Figure: lakehouse infrastructure]

Create Workspace for Azure Databricks

[Figure: workspace creation and the workspace home page]

Create Storage Layer

  • Create a storage account
  • Create your storage container
  • Set up the storage access connector for Databricks
[Figure: containers in the storage account]

Set Up Unity Catalog
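
The exact Unity Catalog commands are not shown in the article. As a hedged sketch, the snippet below creates external locations backed by a storage credential built from the access connector, a catalog for the dev environment, and one schema per layer; every name here (lakehouse_cred, dev_catalog, the container URLs) is an assumption.

```python
# Sketch of the Unity Catalog objects, run from a Databricks notebook
# (where `spark` is predefined). Assumes a storage credential named
# "lakehouse_cred" was already created from the access connector.
storage_account = "lakehousesa"  # hypothetical

for container in ("landing", "bronze", "silver", "gold"):
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {container}_location
        URL 'abfss://{container}@{storage_account}.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL lakehouse_cred)
    """)

spark.sql("CREATE CATALOG IF NOT EXISTS dev_catalog")
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS dev_catalog.{layer}")
```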

Source Control Setup

[Figure: Azure DevOps repository initialization]

Start Coding

[Figure: final setup]
  • Clone the project from source control
  • Create your feature branch
  • Write code, add unit tests, and commit
  • Raise a pull request to merge your feature branch into the main branch

Test the Code
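
The article does not reproduce the unit tests themselves, so here is a minimal pytest sketch against a small local SparkSession. The function add_ingestion_metadata is a hypothetical transformation used only for illustration, not code from the project.

```python
# Minimal pytest sketch for a transformation function, runnable locally.
# `add_ingestion_metadata` is a hypothetical example, not project code.
from pyspark.sql import SparkSession, functions as F


def add_ingestion_metadata(df):
    """Stamp each row with the time it was loaded."""
    return df.withColumn("load_time", F.current_timestamp())


def test_add_ingestion_metadata_adds_column():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    result = add_ingestion_metadata(df)
    assert "load_time" in result.columns
    assert result.count() == 2
```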

Load Historical Data
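
No code is given for the one-time historical load, so the sketch below only shows the usual pattern: a batch read of the exported history files appended into a bronze table. The parquet format, folder name, and table name are assumptions, and the paths reuse the layout sketched earlier.

```python
# One-off batch load of the exported history files into the bronze layer.
# The parquet format, folder name, and table name are assumptions.
from pyspark.sql import functions as F

history_df = (
    spark.read.format("parquet")
    .load(f"{LANDING_PATH}/history/")                 # exported historical extract
    .withColumn("load_time", F.current_timestamp())   # audit column
    .withColumn("source_file", F.input_file_name())   # audit column
)

history_df.write.mode("append").saveAsTable("dev_catalog.bronze.registrations")
```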

Ingest Into Bronze Layer
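
The ingestion code is not shown in the article. A common way to implement decoupled, incremental ingestion on Databricks is Auto Loader, as in the hedged sketch below; the JSON format, folder names, checkpoint location, and table name are assumptions.

```python
# Incremental ingestion from the landing zone into bronze with Auto Loader.
# Format, folder names, checkpoint location, and table name are assumptions.
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # landing files assumed to be JSON
    .option("cloudFiles.schemaLocation", f"{BRONZE_PATH}/_schemas/registrations")
    .load(f"{LANDING_PATH}/registrations/")
    .withColumn("load_time", F.current_timestamp())
    .withColumn("source_file", F.input_file_name())
)

(bronze_stream.writeStream
    .option("checkpointLocation", f"{BRONZE_PATH}/_checkpoints/registrations")
    .trigger(availableNow=True)  # process whatever is new, then stop
    .toTable("dev_catalog.bronze.registrations"))
```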

Process the Silver Layer
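
As a hedged sketch of the silver step, the code below keeps the latest version of each record from bronze and upserts it into the silver table with a Delta MERGE. The table names and the registration_id business key are assumptions.

```python
# Sketch of a silver step: keep the latest record per key, then upsert.
# Table names and the business key `registration_id` are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import Window, functions as F

bronze_df = spark.read.table("dev_catalog.bronze.registrations")

latest = Window.partitionBy("registration_id").orderBy(F.col("load_time").desc())
silver_updates = (
    bronze_df.withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .drop("rn")
)

silver = DeltaTable.forName(spark, "dev_catalog.silver.registrations")
(silver.alias("t")
    .merge(silver_updates.alias("s"), "t.registration_id = s.registration_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```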

Implement the Gold Layer
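
The gold layer holds the report-level tables. The expected-results file mentioned in the integration testing section suggests a gym summary table, so the sketch below builds one; the grouping columns and the measure are assumptions.

```python
# Sketch of a gold step: aggregate silver data into a report-level table.
# The gym_summary name follows the expected-results file mentioned later;
# the grouping columns and the measure are assumptions.
from pyspark.sql import functions as F

gym_summary = (
    spark.read.table("dev_catalog.silver.registrations")
    .groupBy("gym_id", F.to_date("load_time").alias("report_date"))
    .agg(F.countDistinct("registration_id").alias("registrations"))
)

gym_summary.write.mode("overwrite").saveAsTable("dev_catalog.gold.gym_summary")
```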

Create a Run Script
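
The run script itself is not reproduced, so this is only a sketch of one way to chain the layer notebooks from a single entry point on Databricks; the notebook names and the env parameter are assumptions.

```python
# Sketch of a run script chaining the layer notebooks in order.
# Intended to run on Databricks, where `dbutils` is predefined;
# the notebook names and the `env` parameter are assumptions.
def run_pipeline(environment: str = "dev") -> None:
    params = {"env": environment}
    for notebook in ("01_ingest_bronze", "02_process_silver", "03_build_gold"):
        # Run each layer notebook in sequence; timeout 0 means no timeout.
        dbutils.notebook.run(notebook, 0, params)


run_pipeline("dev")
```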

Prepare Integration Testing

[Figure: project test payloads]

We copy the first payload into the landing zone and trigger the workflow job, which processes all the records from that payload. We have also prepared two sets of expected gold layer reports to compare the outputs against. After the first run, we compare the gold layer table with the expected results in the 7-gym_summary_1 file. If the results match, the first iteration passes; we then copy the second payload into the landing zone, rerun the job with the second set of inputs, and verify the results again. That gives us complete end-to-end integration testing. The point is simple: we should test at least two iterations of input. Preparing the test data files is one of the most challenging data activities; once we have the test data, the next step is to automate the integration testing, as sketched below.
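
The sketch below mirrors those steps: copy a payload into the landing zone, run the pipeline, and compare the gold table with the expected report. Only the 7-gym_summary_1 file name comes from the text; the folder layout, the .csv extension, the second file name, and the helper functions are assumptions.

```python
# Sketch of the two-iteration integration test described above.
# Only the 7-gym_summary_1 file name comes from the text; the folder
# layout, .csv extension, second file name, and helpers are assumptions.
def assert_gold_matches(expected_path: str) -> None:
    # The expected file is assumed to have a schema matching the gold table.
    expected = spark.read.option("header", True).option("inferSchema", True).csv(expected_path)
    actual = spark.read.table("dev_catalog.gold.gym_summary")
    # The gold table and the expected report must contain the same rows.
    assert actual.exceptAll(expected).isEmpty()
    assert expected.exceptAll(actual).isEmpty()


def run_iteration(payload_dir: str, expected_path: str) -> None:
    # 1. Copy the payload into the landing zone.
    dbutils.fs.cp(payload_dir, f"{LANDING_PATH}/registrations/", recurse=True)
    # 2. Trigger the workflow (sketched earlier as run_pipeline).
    run_pipeline("test")
    # 3. Compare the gold layer table with the expected results.
    assert_gold_matches(expected_path)


run_iteration("/test_data/payload_1/", "/test_data/7-gym_summary_1.csv")
run_iteration("/test_data/payload_2/", "/test_data/7-gym_summary_2.csv")
```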

Implement the CI/CD Pipeline

[Figure: CI/CD pipeline architecture]

By Akshay Tekam

Software developer, data science enthusiast, content creator.
