JBs Wiki


Posted by Vivek Janakiraman on March 19, 2025 under Azure Databricks

Azure Databricks – Query to get Size and Parquet File Count for Delta Tables in a Catalog using PySpark

Managing and analyzing Delta tables in a Databricks environment requires insight into storage consumption and file distribution. In this blog, we will walk through a PySpark query that lists every Delta table under a specified catalog and retrieves its details, including table size and the number of Parquet files.

πŸš€ Why Is This Query Useful?

  • βœ… Comprehensive Overview: Lists all Delta tables under a given catalog.
  • βœ… Storage Insights: Displays table size in Bytes, MB, GB, and TB.
  • βœ… File Count: Helps identify partitioning and performance optimization opportunities.
  • βœ… Automation Ready: Easily adaptable for different catalogs and use cases.

πŸ” The Query: Fetching Delta Table Details

The following PySpark script extracts metadata using the DESCRIBE DETAIL command and consolidates the results into a single DataFrame:

πŸ“Œ Code Implementation

# Note: `spark` and `display` are provided by the Databricks notebook runtime.

# Fetch all DESCRIBE DETAIL queries
describe_queries = spark.sql("""
    SELECT 'DESCRIBE DETAIL ' || table_catalog || '.' || table_schema || '.' || table_name AS sql_query
    FROM system.information_schema.tables
    WHERE table_catalog = 'Your_Catalog_name_JBCatalog'  --CHANGE THE REQUIRED CATALOG
    AND data_source_format = 'DELTA'
""").collect()

# Execute each query and collect results
results = []
for row in describe_queries:
    query = row["sql_query"]
    detail_df = spark.sql(query)  # Run DESCRIBE DETAIL for each table
    results.append(detail_df)

# Combine all results into a single DataFrame
if results:
    final_df = results[0]
    for df in results[1:]:
        final_df = final_df.union(df)
    
    # Add sizeinMB, sizeinGB, sizeinTB columns
    final_df = final_df.withColumn("sizeinMB", final_df["sizeInBytes"] / (1024 * 1024))
    final_df = final_df.withColumn("sizeinGB", final_df["sizeInBytes"] / (1024 * 1024 * 1024))
    final_df = final_df.withColumn("sizeinTB", final_df["sizeInBytes"] / (1024 * 1024 * 1024 * 1024))
    
    display(final_df)  # Just display the output without storing it
else:
    print("No Delta tables found in the specified catalog.")

πŸ“Š Breaking Down the Query

1️⃣ Querying Delta Tables

  • Uses system.information_schema.tables to fetch all Delta tables under a specified catalog.
  • Constructs DESCRIBE DETAIL queries dynamically for each table.
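For clarity, here is a plain-Python sketch of the statement that the SQL `||` concatenation produces for one table. The catalog, schema, and table names below are made up for illustration:

```python
# Hypothetical name parts; the SQL above pulls these from
# system.information_schema.tables for each Delta table.
catalog, schema, table = "JBCatalog", "sales", "orders"

# Mirrors: 'DESCRIBE DETAIL ' || table_catalog || '.' || table_schema || '.' || table_name
sql_query = f"DESCRIBE DETAIL {catalog}.{schema}.{table}"
print(sql_query)  # DESCRIBE DETAIL JBCatalog.sales.orders
```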

2️⃣ Executing DESCRIBE DETAIL

  • Iterates over the collected queries and runs DESCRIBE DETAIL on each Delta table.
  • Stores results in a list of DataFrames.
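Each DESCRIBE DETAIL result includes, among other columns, the two fields this post relies on: `numFiles` (the Parquet file count) and `sizeInBytes` (total size of the table's data files). As a rough sketch, with invented values, one row looks like:

```python
# Invented sample values; real DESCRIBE DETAIL output carries more columns
# (format, location, createdAt, partitionColumns, ...).
detail_row = {
    "name": "jbcatalog.sales.orders",
    "numFiles": 12,            # Parquet file count for the table
    "sizeInBytes": 734003200,  # total size of the table's data files
}
print(detail_row["numFiles"], detail_row["sizeInBytes"])
```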

3️⃣ Combining Results

  • Unifies all DataFrames into a single DataFrame using union().
  • Computes additional columns for table size in MB, GB, and TB.
  • Displays the final result in Databricks.
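The size columns are straightforward powers-of-1024 conversions of `sizeInBytes`. A quick sanity check with a made-up byte count:

```python
# Made-up example: a table reported at 5 GiB.
size_in_bytes = 5 * 1024**3

# Same conversions as the withColumn calls above.
size_in_mb = size_in_bytes / (1024 * 1024)
size_in_gb = size_in_bytes / (1024 * 1024 * 1024)
size_in_tb = size_in_bytes / (1024 * 1024 * 1024 * 1024)

print(size_in_mb, size_in_gb)  # 5120.0 5.0
```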

πŸ“’ Key Takeaways

βœ”οΈ Quickly analyze Delta table metadata to understand storage and file distribution. βœ”οΈ Optimize table performance by evaluating the number of Parquet files. βœ”οΈ Scale storage efficiently by monitoring table sizes at different units. βœ”οΈ Adaptable across catalogs by modifying the table_catalog filter.

πŸ“Œ Let us know if you have any questions or need further enhancements! πŸš€

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided β€œAS IS” with no warranties and confer no rights.
