JBs Wiki


Posted by Vivek Janakiraman on March 19, 2025 under Azure Databricks

Azure Databricks – Query to get Size and Parquet File Count for Delta Tables in a Catalog using PySpark

Managing and analyzing Delta tables in a Databricks environment requires insight into storage consumption and file distribution. In this blog, we will walk through a PySpark query that lists every Delta table under a specified catalog and retrieves its details, including table size and the number of Parquet files.

πŸš€ Why Is This Query Useful?

  • βœ… Comprehensive Overview: Lists all Delta tables under a given catalog.
  • βœ… Storage Insights: Displays table size in Bytes, MB, GB, and TB.
  • βœ… File Count: Helps identify partitioning and performance optimization opportunities.
  • βœ… Automation Ready: Easily adaptable for different catalogs and use cases.

πŸ” The Query: Fetching Delta Table Details

The following PySpark script extracts metadata using the DESCRIBE DETAIL command and consolidates the results into a single DataFrame:

πŸ“Œ Code Implementation

# Note: `spark` and `display` are provided by the Databricks notebook runtime.

# Fetch all DESCRIBE DETAIL queries
describe_queries = spark.sql("""
    SELECT 'DESCRIBE DETAIL ' || table_catalog || '.' || table_schema || '.' || table_name AS sql_query
    FROM system.information_schema.tables
    WHERE table_catalog = 'Your_Catalog_name_JBCatalog'  --CHANGE THE REQUIRED CATALOG
    AND data_source_format = 'DELTA'
""").collect()

# Execute each query and collect results
results = []
for row in describe_queries:
    query = row["sql_query"]
    detail_df = spark.sql(query)  # Run DESCRIBE DETAIL for each table
    results.append(detail_df)

# Combine all results into a single DataFrame
if results:
    final_df = results[0]
    for df in results[1:]:
        final_df = final_df.union(df)
    
    # Add sizeinMB, sizeinGB, sizeinTB columns
    final_df = final_df.withColumn("sizeinMB", final_df["sizeInBytes"] / (1024 * 1024))
    final_df = final_df.withColumn("sizeinGB", final_df["sizeInBytes"] / (1024 * 1024 * 1024))
    final_df = final_df.withColumn("sizeinTB", final_df["sizeInBytes"] / (1024 * 1024 * 1024 * 1024))
    
    display(final_df)  # Just display the output without storing it
else:
    print("No Delta tables found in the specified catalog.")

πŸ“Š Breaking Down the Query

1️⃣ Querying Delta Tables

  • Uses system.information_schema.tables to fetch all Delta tables under a specified catalog.
  • Constructs DESCRIBE DETAIL queries dynamically for each table.
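For clarity, here is a plain-Python sketch of the statement that the SQL `||` concatenation produces for one table. The catalog, schema, and table names below are made up for illustration:

```python
# Hypothetical name parts; the SQL above pulls these from
# system.information_schema.tables for each Delta table.
catalog, schema, table = "JBCatalog", "sales", "orders"

# Mirrors: 'DESCRIBE DETAIL ' || table_catalog || '.' || table_schema || '.' || table_name
sql_query = f"DESCRIBE DETAIL {catalog}.{schema}.{table}"
print(sql_query)  # DESCRIBE DETAIL JBCatalog.sales.orders
```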

2️⃣ Executing DESCRIBE DETAIL

  • Iterates over the collected queries and runs DESCRIBE DETAIL on each Delta table.
  • Stores results in a list of DataFrames.
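Each DESCRIBE DETAIL result includes, among other columns, the two fields this post relies on: `numFiles` (the Parquet file count) and `sizeInBytes` (total size of the table's data files). As a rough sketch, with invented values, one row looks like:

```python
# Invented sample values; real DESCRIBE DETAIL output carries more columns
# (format, location, createdAt, partitionColumns, ...).
detail_row = {
    "name": "jbcatalog.sales.orders",
    "numFiles": 12,            # Parquet file count for the table
    "sizeInBytes": 734003200,  # total size of the table's data files
}
print(detail_row["numFiles"], detail_row["sizeInBytes"])
```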

3️⃣ Combining Results

  • Unifies all DataFrames into a single DataFrame using union().
  • Computes additional columns for table size in MB, GB, and TB.
  • Displays the final result in Databricks.
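The size columns are straightforward powers-of-1024 conversions of `sizeInBytes`. A quick sanity check with a made-up byte count:

```python
# Made-up example: a table reported at 5 GiB.
size_in_bytes = 5 * 1024**3

# Same conversions as the withColumn calls above.
size_in_mb = size_in_bytes / (1024 * 1024)
size_in_gb = size_in_bytes / (1024 * 1024 * 1024)
size_in_tb = size_in_bytes / (1024 * 1024 * 1024 * 1024)

print(size_in_mb, size_in_gb)  # 5120.0 5.0
```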

πŸ“’ Key Takeaways

βœ”οΈ Quickly analyze Delta table metadata to understand storage and file distribution. βœ”οΈ Optimize table performance by evaluating the number of Parquet files. βœ”οΈ Scale storage efficiently by monitoring table sizes at different units. βœ”οΈ Adaptable across catalogs by modifying the table_catalog filter.

πŸ“Œ Let us know if you have any questions or need further enhancements! πŸš€

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided β€œAS IS” with no warranties and confer no rights.
