Azure Databricks Series: Step-by-Step Guide to Creating an Iceberg Table from Azure PostgreSQL

Watch this on YouTube: https://www.youtube.com/watch?v=Yo0x4u6jc4M

🔗 If you'd like to learn Azure Databricks step by step, check out the full playlist here:
👉 https://www.youtube.com/playlist?list=PLNj2XeCNjFeosTuxZLjfYvnW4H1hsPH07

🌟 Introduction

In today's data-driven world, organizations need a scalable, open, and flexible way to manage data across storage and compute platforms. Azure Databricks and Apache Iceberg together offer exactly that!

In this blog, we'll explore how to connect Azure Databricks to Azure PostgreSQL and create an Apache Iceberg table using a simple, step-by-step approach. This approach helps you modernize your data lake and unlock new possibilities for analytics and machine learning. 🚀

💡 What is Apache Iceberg?

Apache Iceberg is an open table format designed for large-scale analytic datasets stored in data lakes. It brings data warehouse-like reliability to the data lake by supporting:

  • ✅ ACID transactions
  • ✅ Schema evolution
  • ✅ Partition evolution
  • ✅ Time travel queries
  • ✅ Hidden partitioning

With Iceberg, you can build a true lakehouse architecture that combines the performance of a warehouse with the flexibility of a data lake.
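
To make two of these features concrete, here is a minimal sketch of schema evolution and time travel as they look in a notebook on recent Spark/Iceberg versions. It assumes the cluster has the Iceberg Spark SQL extensions enabled, and demo.db.events is a hypothetical Iceberg table used purely for illustration:

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Time travel: query the table as it looked at an earlier point in time
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()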

🧩 Why Connect Azure Databricks with Azure PostgreSQL?

Azure PostgreSQL often stores transactional or operational data. But for large-scale analytics, it's better to replicate or move that data to Iceberg tables in Azure Databricks. This gives you:

  • ⚡ Faster query performance
  • 🧠 Seamless integration with Spark and ML workloads
  • 🧱 Data versioning and audit support
  • ☁️ Scalable, cost-efficient storage

⚙️ Prerequisites

Before we begin, ensure you have:

  1. ✅ Access to an Azure Databricks Workspace
  2. ✅ A running Azure PostgreSQL Flexible Server
  3. ✅ Correct JDBC connection details (hostname, port, username, password)
  4. ✅ A Databricks cluster with Iceberg support enabled (see the configuration sketch below)
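
How you satisfy point 4 depends on your workspace. As one illustration only, a cluster with the Apache Iceberg Spark runtime library installed typically carries Spark configuration entries along these lines; the catalog name (finance), catalog type, and warehouse path are assumptions for this sketch, and a Unity Catalog based setup will look different:

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.finance org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.finance.type hadoop
spark.sql.catalog.finance.warehouse abfss://lake@<storage-account>.dfs.core.windows.net/iceberg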

🪜 Step-by-Step: Creating an Iceberg Table from Azure PostgreSQL

Let's go hands-on and build it! 👇


🔹 Step 1: Define Connection Details

In your Databricks notebook, start by specifying the PostgreSQL connection details.

jdbcHostname = "jbpos-sql-vnet.postgres.database.azure.com"
jdbcPort = 5432
jdbcDatabase = "postgres"

jdbcUrl = f"jdbc:postgresql://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"

connectionProperties = {
  "user": "jvivek2k1",
  "password": "xxxxxxxxx", 
  "driver": "org.postgresql.Driver"
}

Here:

  • jdbcHostname → your PostgreSQL server name
  • jdbcDatabase → the database you want to connect to
  • user and password → your login credentials
  • driver → PostgreSQL JDBC driver class
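
Hard-coding the password is fine for a quick demo, but in practice you would fetch it from a Databricks secret scope (this ties in with the Key Vault tip near the end of this post). A minimal sketch, where the scope name kv-scope and key pg-password are placeholders you would replace with your own:

connectionProperties = {
  "user": "jvivek2k1",
  # Retrieve the password from a secret scope instead of embedding it in the notebook
  "password": dbutils.secrets.get(scope="kv-scope", key="pg-password"),
  "driver": "org.postgresql.Driver"
}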

🔹 Step 2: Read Data from Azure PostgreSQL

Now, let's pull data from your public.customer table in PostgreSQL into a Spark DataFrame.

df = spark.read.jdbc(
    url=jdbcUrl,
    table='public.customer',
    properties=connectionProperties
)

✅ This reads all rows and columns from your PostgreSQL table into Spark.
You can verify the data with:

display(df)
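
If the source table is large, you may not want to pull every row across the network. One option is to push a query down to PostgreSQL and read only its result. The column names and filter below are purely illustrative, since the actual customer schema isn't shown in this post:

df_filtered = (
    spark.read.format("jdbc")
    .option("url", jdbcUrl)
    # Illustrative query; replace the columns and filter with ones that exist in your table
    .option("query", "SELECT customer_id, customer_name FROM public.customer WHERE customer_id < 1000")
    .option("user", connectionProperties["user"])
    .option("password", connectionProperties["password"])
    .option("driver", connectionProperties["driver"])
    .load()
)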

🔹 Step 3: Write Data to an Iceberg Table

Once the data is in Databricks, we can save it as an Iceberg Table in the Unity Catalog or Hive Metastore.

df.write.format("iceberg") \
  .mode("overwrite") \
  .saveAsTable("finance.default.postgres_customer_iceberg")
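
If you expect the table to grow, you can also partition it at write time using the DataFrameWriterV2 API, which pairs naturally with the partitioning tip later in this post. A sketch that assumes the customer data has a country column (illustrative, not taken from the original table definition):

from pyspark.sql.functions import col

# Write the same DataFrame as a partitioned Iceberg table
(
    df.writeTo("finance.default.postgres_customer_iceberg")
      .using("iceberg")
      .partitionedBy(col("country"))   # "country" is an assumed column, for illustration only
      .createOrReplace()
)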

🔹 Step 4: Validate the Iceberg Table

After writing, you can run SQL queries in Databricks SQL or the notebook itself to validate the table:

SELECT * FROM finance.default.postgres_customer_iceberg;
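
Beyond a simple SELECT, Iceberg also exposes metadata tables that are handy for validation. A small sketch using Spark SQL from the notebook, assuming your cluster's Iceberg integration exposes the standard snapshots and history metadata tables:

# Inspect the snapshots created by the write (one row per snapshot)
display(spark.sql("SELECT * FROM finance.default.postgres_customer_iceberg.snapshots"))

# See how the table's current snapshot has changed over time
display(spark.sql("SELECT * FROM finance.default.postgres_customer_iceberg.history"))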

🌐 Benefits of Using Iceberg Tables in Azure Databricks

1️⃣ High-Performance Queries: Iceberg handles large datasets efficiently with advanced partition pruning and metadata optimization.

2️⃣ Schema Evolution: Add or modify columns without rewriting entire datasets.

3️⃣ Data Time Travel: Query data as it existed at any previous point in time.

4️⃣ Open Source & Interoperable: Works with multiple engines (Spark, Trino, Flink, Snowflake, etc.).

5️⃣ Cost-Effective Storage: Store data in open formats on low-cost cloud storage.


🏗️ Real-World Use Cases

  • Building a Data Lakehouse from operational systems
  • Creating auditable, version-controlled datasets
  • Simplifying ETL pipelines by standardizing on Iceberg tables
  • Enabling ML workloads with consistent and reliable data layers

🧠 Pro Tips

💬 Use Azure Key Vault integration to securely store your PostgreSQL credentials instead of embedding them in code.
⚙️ Use incremental loads instead of a full overwrite for production pipelines; see the sketch after these tips.
📊 Consider using partition columns for large tables to improve query performance.
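
For the incremental-load tip, a common pattern is to read only new or changed rows from PostgreSQL (for example, filtered on a watermark column) and MERGE them into the Iceberg table instead of overwriting it. A rough sketch, where the last_updated watermark and the customer_id key are assumed columns for illustration:

# Read only rows changed since the last load (the watermark value would normally be tracked, not hard-coded)
incremental_df = (
    spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("query", "SELECT * FROM public.customer WHERE last_updated > '2024-01-01'")
    .option("user", connectionProperties["user"])
    .option("password", connectionProperties["password"])
    .option("driver", connectionProperties["driver"])
    .load()
)
incremental_df.createOrReplaceTempView("customer_updates")

# Upsert the changes into the Iceberg table
spark.sql("""
  MERGE INTO finance.default.postgres_customer_iceberg AS t
  USING customer_updates AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")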


🎯 Summary

In this blog, we:
✅ Connected Azure Databricks to Azure PostgreSQL
✅ Loaded data from a PostgreSQL table into Databricks
✅ Created an Apache Iceberg table for modern data analytics
✅ Validated the data through SQL queries

By combining Azure Databricks + Apache Iceberg + Azure PostgreSQL, you're enabling a modern, open, and scalable data lakehouse architecture that's built for performance and flexibility. 💪

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided "AS IS" with no warranties and confer no rights.
