Lineage of the project

Introduction

  • Data ingestion
  • Data cleaning and transformation
  • Data modeling
  • Data aggregation
  • Analytics-ready outputs

What Are Lakeflow Spark Declarative Pipelines?

  • dependency management
  • execution order
  • incremental processing
  • pipeline orchestration
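
To make dependency management and execution order concrete: in a declarative pipeline you state which datasets each table reads from, and the engine works out the run order. The sketch below is plain Python, not the Lakeflow API. The dataset names (`bronze_trips`, `silver_trips`, and so on) are hypothetical, chosen only to mirror the layers described later in this post.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets: each key maps to the set of datasets it reads from.
DEPENDENCIES = {
    "bronze_trips": set(),
    "bronze_city": set(),
    "silver_trips": {"bronze_trips"},
    "silver_city": {"bronze_city"},
    "silver_calendar": set(),
    "gold_trips": {"silver_trips", "silver_city", "silver_calendar"},
}


def execution_order(dependencies):
    """Return one valid run order in which every dataset follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

The exact order of independent datasets may vary, but every table always appears after the tables it depends on. That guarantee is what lets you declare tables in any order.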

Project Architecture Overview

Medallion flow of the project

Business Problem

  • city_id
  • city_name
  • trip_id
  • driver_id
  • customer_id
  • distance
  • fare amount
  • ratings
  • timestamps
  • total rides
  • total revenue
  • average customer rating
  • ride distribution by time
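
The fields and metrics above can be sketched as a small record type plus a summary function. This is an illustration in plain Python, not the pipeline's actual schema. The attribute names `fare_amount`, `rating` and `started_at`, the distance unit, and the sample values are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Trip:
    trip_id: str
    city_id: str
    driver_id: str
    customer_id: str
    distance: float        # unit is not stated in the source; assumed km
    fare_amount: float
    rating: int
    started_at: datetime   # assumed name for the trip timestamp


def summarize(trips):
    """Compute the four headline metrics over a list of Trip records."""
    total_rides = len(trips)
    total_revenue = sum(t.fare_amount for t in trips)
    average_rating = sum(t.rating for t in trips) / total_rides if total_rides else None
    rides_by_hour = Counter(t.started_at.hour for t in trips)
    return {
        "total_rides": total_rides,
        "total_revenue": total_revenue,
        "average_rating": average_rating,
        "rides_by_hour": dict(rides_by_hour),
    }


# Made-up sample rows, for illustration only.
sample = [
    Trip("T1", "C1", "D1", "U1", 4.5, 120.0, 8, datetime(2024, 1, 5, 9, 15)),
    Trip("T2", "C1", "D2", "U2", 10.0, 250.0, 10, datetime(2024, 1, 5, 9, 40)),
    Trip("T3", "C2", "D3", "U3", 2.0, 80.0, 6, datetime(2024, 1, 6, 18, 5)),
]
summary = summarize(sample)
print(summary)
```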

Databricks Environment Setup

Catalog folder

Data Ingestion (Bronze Layer)

  • CSV files stored in Amazon S3 / Cloud Storage
  • detects new files
  • processes them incrementally
  • prevents duplicate ingestion
  • raw data ingestion
  • minimal transformation
  • metadata columns added
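
The three ingestion behaviours listed above (detect new files, process them incrementally, never ingest a file twice) can be sketched without any Databricks dependency. On Databricks this job is typically handled by Auto Loader. The code below is only a stand-in that shows the idea. The function name, the `_source_file` and `_ingested_at` column names, and the in-memory "storage" are assumptions.

```python
from datetime import datetime, timezone


def ingest_new_files(available_files, processed, read_rows):
    """Ingest only files not seen before and tag each row with metadata.

    available_files: iterable of file paths currently in the landing location
    processed: set of paths already ingested (plays the role of a checkpoint)
    read_rows: callable that returns a list of dict rows for a given path
    """
    ingested = []
    for path in sorted(available_files):
        if path in processed:
            continue  # already ingested, skip to avoid duplicates
        loaded_at = datetime.now(timezone.utc).isoformat()
        for row in read_rows(path):
            # Raw values stay untouched; only metadata columns are added.
            ingested.append({**row, "_source_file": path, "_ingested_at": loaded_at})
        processed.add(path)
    return ingested


# Hypothetical landing location, simulated with a dict.
FAKE_STORAGE = {
    "trips/2024-01-05.csv": [{"trip_id": "T1"}, {"trip_id": "T2"}],
    "trips/2024-01-06.csv": [{"trip_id": "T3"}],
}
checkpoint = set()
first_run = ingest_new_files(["trips/2024-01-05.csv"], checkpoint, FAKE_STORAGE.__getitem__)
second_run = ingest_new_files(FAKE_STORAGE.keys(), checkpoint, FAKE_STORAGE.__getitem__)
print(len(first_run), len(second_run))
```

The second run sees both files but only reads the one that is not yet in the checkpoint, so rerunning the ingestion does not create duplicates.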

Silver Layer (Data Transformation)

  • remove null values
  • correct data types
  • standardize columns
  • ratings must be between 1 and 10
  • distance cannot be negative
  • processing timestamps
  • derived columns
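
The cleaning rules above can be read as a single row-level function. The sketch below is plain Python for illustration. In a real pipeline these checks would more likely be declared as data quality expectations on the Silver table. The column names, the list of required columns and the derived column `fare_per_distance` are assumptions.

```python
from datetime import datetime, timezone

# Assumed set of columns that must be present and non-empty.
REQUIRED = ("trip_id", "city_id", "distance", "fare_amount", "rating")


def clean_trip(raw):
    """Return a cleaned row, or None if the row fails a quality rule."""
    # Standardize column names: trim, lower-case, spaces to underscores.
    row = {k.strip().lower().replace(" ", "_"): v for k, v in raw.items()}
    # Remove rows with missing required values.
    if any(row.get(col) in (None, "") for col in REQUIRED):
        return None
    # Correct data types; rows that cannot be converted are rejected.
    try:
        distance = float(row["distance"])
        fare = float(row["fare_amount"])
        rating = int(row["rating"])
    except (TypeError, ValueError):
        return None
    # Business rules: rating between 1 and 10, distance not negative.
    if not 1 <= rating <= 10 or distance < 0:
        return None
    return {
        **row,
        "distance": distance,
        "fare_amount": fare,
        "rating": rating,
        # Hypothetical derived column, shown only as an example.
        "fare_per_distance": fare / distance if distance else None,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }


good = {"trip_id": "T1", "city_id": "C1", "Distance": "4.5", "Fare Amount": "120", "rating": "8"}
cleaned = clean_trip(good)
rejected = clean_trip({**good, "rating": "11"})
print(cleaned, rejected)
```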

Change Data Capture (CDC)

  • new records
  • updated records
  • deleted records
  • faster processing
  • reduced compute cost
  • scalable pipelines
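
A minimal sketch of how a batch of change events might be applied to a target table, written in plain Python rather than with the pipeline's own CDC feature. The event fields `op` and `seq` and the key column `trip_id` are assumptions. Only the rows named in the batch are touched, which is where the speed and cost benefits listed above come from.

```python
def apply_changes(target, changes, key_col="trip_id"):
    """Apply change events to `target`, a dict of rows keyed by `key_col`.

    Each event is a dict with an `op` field ("insert", "update" or "delete"),
    a `seq` field that orders the events, and the row columns.
    """
    for event in sorted(changes, key=lambda e: e["seq"]):
        row_key = event[key_col]
        if event["op"] == "delete":
            target.pop(row_key, None)
        else:
            # Inserts and updates are both handled as an upsert on the key.
            row = {k: v for k, v in event.items() if k not in ("op", "seq")}
            target[row_key] = {**target.get(row_key, {}), **row}
    return target


silver = {"T1": {"trip_id": "T1", "fare_amount": 100.0, "rating": 7}}
batch = [
    {"op": "update", "seq": 2, "trip_id": "T1", "rating": 9},
    {"op": "insert", "seq": 1, "trip_id": "T2", "fare_amount": 80.0, "rating": 6},
    {"op": "delete", "seq": 3, "trip_id": "T2"},
]
apply_changes(silver, batch)
print(silver)
```

Sorting by `seq` matters: events can arrive out of order, and applying them in sequence keeps the target consistent with the source.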

Calendar Dimension Table

  • date
  • day
  • month
  • quarter
  • weekday/weekend
  • holidays
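
A calendar dimension with the attributes above can be generated from a date range. This is a small standard-library sketch. The column names `day_type` and `holiday_name` are assumptions, and the holiday list holds a single placeholder entry. A real table would load holidays from a reference source.

```python
from datetime import date, timedelta

# Placeholder holiday list, for illustration only.
HOLIDAYS = {date(2024, 1, 1): "New Year's Day"}


def build_calendar(start, end):
    """Return one row per day from `start` to `end`, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date": day,
            "day": day.day,
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            # weekday() returns 5 for Saturday and 6 for Sunday.
            "day_type": "weekend" if day.weekday() >= 5 else "weekday",
            "holiday_name": HOLIDAYS.get(day),
        })
        day += timedelta(days=1)
    return rows


calendar = build_calendar(date(2024, 1, 1), date(2024, 1, 7))
for row in calendar:
    print(row)
```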

Gold Layer (Analytics Tables)

  • trip data
  • city data
  • calendar data
  • reporting
  • dashboards
  • business insights
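
The Gold layer combines the three inputs listed above into a table that reports and dashboards can read directly. A minimal sketch of that join and aggregation follows, in plain Python. The grouping (city name by weekday or weekend), the column names and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def build_gold(trips, cities, calendar):
    """Join trips with city and calendar lookups, then aggregate per city and day type."""
    city_names = {c["city_id"]: c["city_name"] for c in cities}
    day_types = {d["date"]: d["day_type"] for d in calendar}
    totals = defaultdict(lambda: {"total_rides": 0, "total_revenue": 0.0})
    for trip in trips:
        group = (
            city_names.get(trip["city_id"], "unknown"),
            day_types.get(trip["trip_date"], "unknown"),
        )
        totals[group]["total_rides"] += 1
        totals[group]["total_revenue"] += trip["fare_amount"]
    return dict(totals)


# Made-up sample inputs.
trips = [
    {"trip_id": "T1", "city_id": "C1", "trip_date": date(2024, 1, 5), "fare_amount": 120.0},
    {"trip_id": "T2", "city_id": "C1", "trip_date": date(2024, 1, 5), "fare_amount": 250.0},
    {"trip_id": "T3", "city_id": "C2", "trip_date": date(2024, 1, 6), "fare_amount": 80.0},
]
cities = [
    {"city_id": "C1", "city_name": "City A"},
    {"city_id": "C2", "city_name": "City B"},
]
calendar_rows = [
    {"date": date(2024, 1, 5), "day_type": "weekday"},
    {"date": date(2024, 1, 6), "day_type": "weekend"},
]
gold = build_gold(trips, cities, calendar_rows)
print(gold)
```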

Pipeline Execution and Automation

  • execution order
  • scheduling
  • retries
  • batch mode
  • streaming mode
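
Scheduling and retries are normally configured on the pipeline or job rather than written by hand. To show what a retry policy means in practice, here is a small stand-alone sketch. The function name `run_with_retries`, its parameters and the simulated failing task are all hypothetical.

```python
import time


def run_with_retries(task, max_attempts=3, delay_seconds=0.0):
    """Call `task` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay_seconds)


attempts = {"count": 0}


def flaky_update():
    """Simulated pipeline update that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky_update)
print(result, attempts["count"])
```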

Security and Governance

  • access control
  • data governance
  • data lineage
  • security policies

Conclusion

By Akshay Tekam

Software developer, data science enthusiast, content creator.
