Azure Databricks Series: Step-by-Step Guide to Creating an Iceberg Table from Azure PostgreSQL

Watch this on YouTube: https://www.youtube.com/watch?v=Yo0x4u6jc4M

🔗 If you’d like to learn Azure Databricks step by step, check out the full playlist here:
👉 https://www.youtube.com/playlist?list=PLNj2XeCNjFeosTuxZLjfYvnW4H1hsPH07

🌟 Introduction

In today’s data-driven world, organizations need a scalable, open, and flexible way to manage data across storage and compute platforms. Azure Databricks and Apache Iceberg together offer exactly that!

In this blog, we’ll explore how to connect Azure Databricks to Azure PostgreSQL and create an Apache Iceberg table using a simple, step-by-step approach. This approach helps you modernize your data lake and unlock new possibilities for analytics and machine learning. 🚀

💡 What is Apache Iceberg?

Apache Iceberg is an open table format designed for large-scale, analytic datasets stored in data lakes. It brings data warehouse-like reliability to the data lake by supporting:

  • ACID transactions
  • Schema evolution
  • Partition evolution
  • Time travel queries
  • Hidden partitioning

With Iceberg, you can build a true lakehouse architecture that combines the performance of a warehouse with the flexibility of a data lake.

🧩 Why Connect Azure Databricks with Azure PostgreSQL?

Azure PostgreSQL often stores transactional or operational data. But for large-scale analytics, it’s better to replicate or move that data to Iceberg tables in Azure Databricks. This gives you:

  • Faster query performance
  • 🧠 Seamless integration with Spark and ML workloads
  • 🧱 Data versioning and audit support
  • ☁️ Scalable, cost-efficient storage

⚙️ Prerequisites

Before we begin, ensure you have:

  1. ✅ Access to an Azure Databricks Workspace
  2. ✅ A running Azure PostgreSQL Flexible Server
  3. ✅ Correct JDBC connection details (hostname, port, username, password)
  4. ✅ A Databricks cluster with Iceberg support enabled

🪜 Step-by-Step: Creating an Iceberg Table from Azure PostgreSQL

Let’s go hands-on and build it! 👇


🔹 Step 1: Define Connection Details

In your Databricks notebook, start by specifying the PostgreSQL connection details.

jdbcHostname = "jbpos-sql-vnet.postgres.database.azure.com"
jdbcPort = 5432
jdbcDatabase = "postgres"

jdbcUrl = f"jdbc:postgresql://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"

connectionProperties = {
  "user": "jvivek2k1",
  "password": "xxxxxxxxx", 
  "driver": "org.postgresql.Driver"
}

Here:

  • jdbcHostname → your PostgreSQL server name
  • jdbcDatabase → the database you want to connect to
  • user and password → your login credentials
  • driver → PostgreSQL JDBC driver class

🔹 Step 2: Read Data from Azure PostgreSQL

Now, let’s pull data from your public.customer table in PostgreSQL into a Spark DataFrame.

df = spark.read.jdbc(
    url=jdbcUrl,
    table='public.customer',
    properties=connectionProperties
)

✅ This reads all rows and columns from your PostgreSQL table into Spark.
You can verify the data with:

display(df)
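
Before writing the data out, it can also help to confirm the schema Spark inferred from PostgreSQL. A quick check using the same DataFrame:

# Inspect the columns and data types Spark inferred from PostgreSQL
df.printSchema()

# Optionally confirm the row count matches the source table
print(df.count())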

🔹 Step 3: Write Data to an Iceberg Table

Once the data is in Databricks, we can save it as an Iceberg table in Unity Catalog or the Hive metastore.

df.write.format("iceberg") \
  .mode("overwrite") \
  .saveAsTable("finance.default.postgres_customer_iceberg")
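
If your customer table is large, you may also want to partition the Iceberg table as you write it. The sketch below assumes a hypothetical date column named signup_date; replace it with a column that actually exists in your table:

# Partitioned write (sketch) - "signup_date" is a placeholder column name
df.write.format("iceberg") \
  .mode("overwrite") \
  .partitionBy("signup_date") \
  .saveAsTable("finance.default.postgres_customer_iceberg")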

🔹 Step 4: Validate the Iceberg Table

After writing, you can run SQL queries in Databricks SQL or the notebook itself to validate the table:

SELECT * FROM finance.default.postgres_customer_iceberg;
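
You can also inspect the table's metadata and, since Iceberg keeps snapshots, try a time travel query. Treat the snippet below as a sketch; the exact time travel syntax depends on your Spark, Iceberg, and Databricks versions, and the timestamp must fall after your first write:

# Show table details, including format and location
display(spark.sql("DESCRIBE EXTENDED finance.default.postgres_customer_iceberg"))

# Time travel query (sketch) - adjust the timestamp to a point after your first write
display(spark.sql(
    "SELECT * FROM finance.default.postgres_customer_iceberg TIMESTAMP AS OF '2024-01-01 00:00:00'"
))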

🌍 Benefits of Using Iceberg Tables in Azure Databricks

1️⃣ High Performance Queries — Iceberg handles large datasets efficiently with advanced partition pruning and metadata optimization.

2️⃣ Schema Evolution — Add or modify columns without rewriting entire datasets.

3️⃣ Data Time Travel — Query data as it existed at any previous point in time.

4️⃣ Open Source & Interoperable — Works with multiple engines (Spark, Trino, Flink, Snowflake, etc.).

5️⃣ Cost-Effective Storage — Store data in open formats on low-cost cloud storage.


🏗️ Real-World Use Cases

  • Building a Data Lakehouse from operational systems
  • Creating auditable, version-controlled datasets
  • Simplifying ETL pipelines by standardizing on Iceberg tables
  • Enabling ML workloads with consistent and reliable data layers

🧠 Pro Tips

💬 Use Azure Key Vault integration to store your PostgreSQL credentials securely instead of embedding them in code (see the sketch after these tips).
⚙️ Use Incremental Loads instead of full overwrite for production pipelines.
📊 Consider using partition columns for large tables to improve query performance.
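
Here is a minimal sketch of the first two tips. It assumes you have created a Databricks secret scope backed by Azure Key Vault (the scope and key names below are placeholders) and that changed rows can be identified by a hypothetical last_updated column:

# Pull credentials from a secret scope instead of hard-coding them.
# "postgres-scope" and the key names are placeholders for your own setup.
connectionProperties = {
    "user": dbutils.secrets.get(scope="postgres-scope", key="pg-user"),
    "password": dbutils.secrets.get(scope="postgres-scope", key="pg-password"),
    "driver": "org.postgresql.Driver"
}

# Incremental load: read only rows changed since the last run by pushing
# a filter down to PostgreSQL, then append instead of overwriting.
incremental_df = spark.read.jdbc(
    url=jdbcUrl,
    table="(SELECT * FROM public.customer WHERE last_updated > '2024-01-01') AS src",
    properties=connectionProperties
)

incremental_df.write.format("iceberg") \
    .mode("append") \
    .saveAsTable("finance.default.postgres_customer_iceberg")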


🎯 Summary

In this blog, we:
✅ Connected Azure Databricks to Azure PostgreSQL
✅ Loaded data from a PostgreSQL table into Databricks
✅ Created an Apache Iceberg table for modern data analytics
✅ Validated the data through SQL queries

By combining Azure Databricks + Apache Iceberg + Azure PostgreSQL, you’re enabling a modern, open, and scalable data lakehouse architecture that’s built for performance and flexibility. 💪

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confer no rights.

Azure Databricks Series: The Hidden Way to Optimize Costs – No One Talks About!

Managing costs in Azure Databricks can be a real challenge. Clusters often stay idle, autoscaling isn’t always tuned properly, and over-provisioned resources can quickly blow up your bill 💸. In this blog, I’ll walk you through how you can analyze, monitor, and optimize costs in your own Databricks environment using Power BI and AI-powered recommendations.


Why Focus on Cost Optimization?

Azure Databricks is powerful, but without the right monitoring, it’s easy to:

  • Leave clusters running when not in use 🔄
  • Oversize driver and worker nodes 🖥️
  • Misconfigure autoscaling policies 📈
  • Miss out on spot instances or cluster pools

That’s why cost optimization is a must-have practice for anyone running Databricks in production or development.


What You’ll Learn in This Tutorial

Here’s the simple 3-step process we’ll follow:

1️⃣ Collect Cluster Configuration Data

In my previous videos, I showed how to use Azure Function Apps to export cluster configuration details.

These configurations will form the raw dataset for analysis.
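
If you don't have the Function App from the earlier videos handy, the same raw dataset can be produced with a short script against the Databricks Clusters API. A minimal sketch, assuming the workspace URL and a personal access token are supplied through environment variables:

import json
import os

import requests

# Workspace URL and PAT are assumed to be provided via environment variables.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-xxxx.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

# List all clusters and their configuration via the Clusters API.
resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
clusters = resp.json().get("clusters", [])

# Save the raw configuration as JSON so Power BI can consume it.
with open("cluster_configs.json", "w") as f:
    json.dump(clusters, f, indent=2)

print(f"Exported {len(clusters)} cluster configurations")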

2️⃣ Analyze with Power BI 📊

We’ll load the exported data into Power BI and use a ready-made Power BI template (download link below) to visualize:

  • Cluster usage
  • Node sizes
  • Autoscaling patterns
  • Idle vs active time

This gives you a clear picture of where money is being spent.

3️⃣ AI-Powered Recommendations 🤖

Finally, we’ll feed the Power BI output into an AI agent. The AI will provide actionable recommendations such as:

  • Resize underutilized clusters
  • Enable auto-termination for idle clusters
  • Use job clusters instead of all-purpose clusters
  • Consider spot instances to lower costs

Download the Power BI Template

To make this even easier, I’ve created a Power BI template file (.pbit) that you can use right away. Just download it, connect it with your exported cluster configuration data, and start analyzing your environment.

Pro Tips for Cost Savings

💡 Enable auto-termination for idle clusters (see the sketch after these tips)
💡 Use job clusters instead of always-on interactive clusters
💡 Configure autoscaling properly
💡 Try spot instances where workloads allow
💡 Regularly monitor usage with Power BI dashboards
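
As a rough illustration of the first few tips, this is roughly what those settings look like in a cluster definition for the Databricks Clusters API. Field names follow the public API, but the cluster ID, runtime, node type, and worker counts are placeholders:

# Example cluster settings reflecting the tips above (placeholder values).
cluster_config = {
    "cluster_id": "<your-cluster-id>",
    "cluster_name": "analytics-cluster",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autotermination_minutes": 30,                       # shut down idle clusters automatically
    "autoscale": {"min_workers": 1, "max_workers": 4},   # right-sized autoscaling
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE"       # use spot VMs where workloads allow
    },
}

# Apply it with a POST to {host}/api/2.0/clusters/edit (same auth as the export script above).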


Final Thoughts

With the combination of Power BI and AI, cost optimization in Azure Databricks becomes less of a guessing game and more of a data-driven process.

📺 If you prefer a video walkthrough, check out my detailed step-by-step YouTube tutorial here: Azure Databricks Series on YouTube

👉 Don’t forget to like, share, and subscribe to stay updated with more tutorials in this series!

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confer no rights.

Integrating Power BI with Databricks Model Serving in Secure Networks Using Logic Apps and Power Automate

Introduction

Watch this as a video on our YouTube channel, JBSWiki.

Many enterprises are rapidly adopting Azure Databricks for building machine learning models and serving real-time predictions. However, when strict network security measures are in place—like disabling public network access on Databricks workspaces—it can become incredibly challenging to integrate those models into tools like Power BI.

In this blog, we’ll explore how to securely call a Databricks Model Serving endpoint from Power BI under the scenario where:

  • The Databricks workspace has Allow Public Network Access = Disabled
  • Any direct call from Power Automate to Databricks fails with a 403 Unauthorized network access error

We’ll overcome this limitation using Logic Apps running inside a Virtual Network (VNet) and acting as a secure bridge between Power BI and Databricks.

Let’s dive in! 🔍


The Challenge: Network Restrictions and 403 Errors

By default, services like Power Automate send traffic over the public internet. If your Databricks workspace is configured with Allow Public Network Access disabled, any direct HTTP request to its REST APIs from Power Automate will fail.

The result is a 403 Unauthorized network access to workspace error.

This happens because Databricks:

  • Blocks all public network traffic
  • Only allows communication from services or VNets that are directly peered or integrated

In highly secure enterprise environments, keeping Databricks private is essential. But it poses a problem:

How can Power BI users trigger predictions from Databricks ML models if public access is disabled?


The Solution: Introducing Logic Apps as a Secure Proxy

Instead of connecting Power Automate directly to Databricks, we introduce Logic Apps running inside an Azure VNet.

Logic Apps can:

✅ Connect to Databricks Model Serving endpoints privately through peered VNets or private endpoints
✅ Expose an HTTP endpoint that Power Automate can call publicly
✅ Act as a secure proxy, handling all authentication and network routing

This architecture ensures:

  • Network security compliance
  • Seamless integration between Power BI and Databricks
  • Avoidance of 403 errors

Let’s walk through the full solution step by step. 🚀


Solution Architecture

Here’s how the integration flows:

  1. User clicks a button in Power BI ➡ triggers Power Automate.
  2. Power Automate ➡ sends an HTTP POST request to Logic Apps.
  3. Logic Apps ➡ securely calls the Databricks Model Serving endpoint within the VNet.
  4. Databricks Model Serving ➡ returns prediction results to Logic Apps.
  5. Logic Apps ➡ sends the response back to Power Automate.
  6. Power Automate ➡ updates Power BI visuals or datasets with prediction results.

This ensures Databricks never exposes its endpoints publicly, yet Power BI can still retrieve real-time predictions.


Step 1 — Create Databricks Model Serving Endpoint

First, make sure you’ve deployed your machine learning model to a Databricks Model Serving endpoint.

For this blog, let’s assume you’ve published an endpoint like:

https://adb-1311343844234579.11.azuredatabricks.net/serving-endpoints/HDFC_High_price_prediction/invocations

This endpoint:

  • Requires authentication via a Databricks PAT (Personal Access Token) or Azure AD token.
  • Accepts JSON requests.
  • Returns prediction results in JSON format.

Remember, because public network access is disabled, only resources inside your VNet—or peered VNets—can reach this endpoint.
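
Before wiring up Logic Apps, it is worth confirming that the endpoint responds from inside the network. The sketch below assumes you run it from a VNet-connected client (for example a Databricks notebook or a VM in a peered VNet) and that you substitute a real token; the URL and payload are the ones used in this example:

import requests

# Run this from a client that can reach the workspace privately (e.g. a peered VNet).
endpoint_url = (
    "https://adb-1311343844234579.11.azuredatabricks.net"
    "/serving-endpoints/HDFC_High_price_prediction/invocations"
)
token = "<databricks-pat-or-aad-token>"  # placeholder; store this in Key Vault in practice

payload = {
    "inputs": [
        {"Date": "2024-07-03", "OPEN": 2300, "HIGH": 2400, "LOW": 2298, "CLOSE": 2350}
    ]
}

resp = requests.post(
    endpoint_url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # prediction results returned by Model Serving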


Step 2 — Create Logic Apps in VNet

Next, deploy a Logic App Standard into a VNet.

Benefits:

  • Can communicate privately with Databricks.
  • Supports secure inbound and outbound traffic.
  • Scales to enterprise workloads.

Create an HTTP Trigger

Configure Logic Apps to start on an HTTP request.

Request Body JSON Schema for our scenario looks like this:

{
  "type": "object",
  "properties": {
    "inputs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "Date": {
            "type": "string"
          },
          "OPEN": {
            "type": "integer"
          },
          "HIGH": {
            "type": "integer"
          },
          "LOW": {
            "type": "integer"
          },
          "CLOSE": {
            "type": "integer"
          }
        },
        "required": [
          "Date",
          "OPEN",
          "HIGH",
          "LOW",
          "CLOSE"
        ]
      }
    }
  }
}

This defines the expected JSON payload your Logic App will receive from Power Automate.


Step 3 — Logic Apps: Call Databricks Model Serving

Inside Logic Apps, add an HTTP action to call your Databricks endpoint:

Method: POST
URI:

https://adb-1311343844234579.11.azuredatabricks.net/serving-endpoints/HDFC_High_price_prediction/invocations

Headers:

Authorization: Bearer dapi****************
Content-Type: application/json

Body:

{
  "inputs": [
    {
      "Date": "2024-07-03",
      "OPEN": 2300,
      "HIGH": 2400,
      "LOW": 2298,
      "CLOSE": 2350
    }
  ]
}

Logic Apps will securely send this payload over private networking to Databricks and wait for the response.


Step 4 — Deploy Power Automate Flow

Now, let’s connect Power Automate to Logic Apps.

Your Power Automate flow will:

  • Trigger from Power BI (e.g. a button click).
  • Call the Logic Apps HTTP endpoint.
  • Receive the ML prediction results.
  • Optionally, update Power BI visuals or datasets.

Power Automate HTTP Request

Configure your HTTP action:

Method: POST
URI: The URL from your Logic App’s HTTP trigger (Step 2).
Headers:

Content-Type: application/json

Body:

{
  "inputs": [
    {
      "Date": "2024-07-03",
      "OPEN": 2300,
      "HIGH": 2400,
      "LOW": 2298,
      "CLOSE": 2350
    }
  ]
}

Why Not Call Databricks Directly From Power Automate?

A natural question is: why can’t we skip Logic Apps and call Databricks directly from Power Automate?

Here’s why:

  • Power Automate sends HTTP requests from public endpoints.
  • Databricks rejects all public traffic if public access is disabled.
  • There’s no way for Power Automate to reach Databricks privately.

Logic Apps in a VNet acts as a secure intermediary:

  • Power Automate → Logic Apps → Databricks
  • Databricks → Logic Apps → Power Automate

This architecture bridges private and public networks securely.


Benefits of This Architecture

Implementing this solution provides:

Enterprise Security

  • Complies with strict network isolation policies.
  • Prevents exposing Databricks to the internet.

Seamless User Experience

  • Power BI users get real-time predictions without knowing about the backend complexity.

Scalable Architecture

  • Logic Apps can handle thousands of requests.
  • Easy to maintain and extend for other models or services.

Governance and Monitoring

  • Centralized logging in Logic Apps.
  • Easy to integrate with Azure Monitor for alerting.

Use Case: Predicting Stock Prices

Imagine you have a machine learning model predicting HDFC high prices.

  • Power BI user clicks a “Predict” button.
  • Power Automate triggers a flow.
  • Flow sends stock price inputs to Logic Apps.
  • Logic Apps calls Databricks Model Serving.
  • Databricks returns the predicted high price.
  • Power BI visual updates dynamically with the prediction!

All of this happens securely, without exposing Databricks to the public internet. 🔒


Conclusion

Integrating Power BI with Databricks Model Serving under strict network security constraints can seem daunting.

But with the help of Logic Apps deployed inside a VNet, you can:

  • Securely bridge public and private networks
  • Enable real-time ML predictions in Power BI
  • Maintain enterprise-level security and compliance

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confer no rights.