Managing and analyzing Delta tables in a Databricks environment requires insight into storage consumption and file distribution. In this blog, we will explore a PySpark query that lists all Delta tables under a specified catalog and retrieves details for each, including table size and the number of Parquet files.
Why Is This Query Useful?
- Comprehensive Overview: Lists all Delta tables under a given catalog.
- Storage Insights: Displays table size in Bytes, MB, GB, and TB.
- File Count: Helps identify partitioning and performance optimization opportunities.
- Automation Ready: Easily adaptable for different catalogs and use cases.
The Query: Fetching Delta Table Details
The following PySpark script extracts metadata using the DESCRIBE DETAIL command and consolidates the results into a single DataFrame:
Code Implementation
from pyspark.sql import DataFrame

# Build a DESCRIBE DETAIL statement for every Delta table in the catalog
describe_queries = spark.sql("""
    SELECT 'DESCRIBE DETAIL ' || table_catalog || '.' || table_schema || '.' || table_name AS sql_query
    FROM system.information_schema.tables
    WHERE table_catalog = 'Your_Catalog_name_JBCatalog' -- CHANGE TO THE REQUIRED CATALOG
      AND data_source_format = 'DELTA'
""").collect()

# Execute each query and collect the results
results = []
for row in describe_queries:
    query = row["sql_query"]
    detail_df = spark.sql(query)  # Run DESCRIBE DETAIL for each table
    results.append(detail_df)

# Combine all results into a single DataFrame
if results:
    final_df = results[0]
    for df in results[1:]:
        final_df = final_df.union(df)

    # Add sizeinMB, sizeinGB, and sizeinTB columns derived from sizeInBytes
    final_df = final_df.withColumn("sizeinMB", final_df["sizeInBytes"] / (1024 * 1024))
    final_df = final_df.withColumn("sizeinGB", final_df["sizeInBytes"] / (1024 * 1024 * 1024))
    final_df = final_df.withColumn("sizeinTB", final_df["sizeInBytes"] / (1024 * 1024 * 1024 * 1024))

    display(final_df)  # Just display the output without storing it
else:
    print("No Delta tables found in the specified catalog.")
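The size columns are simple divisions of sizeInBytes by powers of 1024. As a quick sanity check, the same arithmetic can be expressed in plain Python (a minimal sketch; the helper name bytes_to_units is illustrative, not part of the script above):

```python
# Sketch of the byte-to-unit arithmetic behind the withColumn calls.
# The helper name `bytes_to_units` is hypothetical, for illustration only.

def bytes_to_units(size_in_bytes: int) -> dict:
    """Convert a raw byte count into MB, GB, and TB (binary units)."""
    return {
        "sizeinMB": size_in_bytes / (1024 * 1024),
        "sizeinGB": size_in_bytes / (1024 * 1024 * 1024),
        "sizeinTB": size_in_bytes / (1024 * 1024 * 1024 * 1024),
    }

# A 5 GiB table:
print(bytes_to_units(5 * 1024**3))
```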
Breaking Down the Query
1. Querying Delta Tables
- Uses system.information_schema.tables to fetch all Delta tables under a specified catalog.
- Constructs DESCRIBE DETAIL queries dynamically for each table.
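The SQL concatenation produces one statement per table, such as DESCRIBE DETAIL catalog.schema.table. One thing to note: if a schema or table name contains hyphens or other special characters, the unquoted form fails, so backtick-quoting each identifier is safer. A minimal sketch of that statement-building step in plain Python (the function name build_describe_query is hypothetical):

```python
# Illustrative sketch of the statement the SQL concatenation generates.
# Backtick-quoting each identifier (not done in the original query) guards
# against schema or table names that contain hyphens or other special characters.

def build_describe_query(catalog: str, schema: str, table: str) -> str:
    """Build a DESCRIBE DETAIL statement with backtick-quoted identifiers."""
    return f"DESCRIBE DETAIL `{catalog}`.`{schema}`.`{table}`"

print(build_describe_query("my_catalog", "sales", "orders"))
# DESCRIBE DETAIL `my_catalog`.`sales`.`orders`
```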
2. Executing DESCRIBE DETAIL
- Iterates over the collected queries and runs DESCRIBE DETAIL on each Delta table.
- Stores the results in a list of DataFrames.
3. Combining Results
- Unifies all DataFrames into a single DataFrame using union().
- Computes additional columns for table size in MB, GB, and TB.
- Displays the final result in Databricks.
Key Takeaways
- Quickly analyze Delta table metadata to understand storage and file distribution.
- Optimize table performance by evaluating the number of Parquet files.
- Scale storage efficiently by monitoring table sizes in different units.
- Adapt across catalogs by modifying the table_catalog filter.
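One practical way to act on the file count: DESCRIBE DETAIL returns both sizeInBytes and numFiles, so the average file size per table is a quick signal for tables that may benefit from compaction (for example, via OPTIMIZE). A minimal sketch in plain Python, where the helper names and the 32 MB threshold are illustrative choices, not Databricks defaults:

```python
# Sketch: flag tables whose average Parquet file size looks small enough to
# suggest a small-file problem. The 32 MB threshold is an illustrative choice.

SMALL_FILE_THRESHOLD_MB = 32

def avg_file_size_mb(size_in_bytes: int, num_files: int) -> float:
    """Average Parquet file size in MB; 0.0 for empty tables."""
    if num_files == 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)

def needs_compaction(size_in_bytes: int, num_files: int) -> bool:
    """True when the table has files and their average size is below the threshold."""
    return 0 < avg_file_size_mb(size_in_bytes, num_files) < SMALL_FILE_THRESHOLD_MB

# A 1 GiB table spread across 500 files averages about 2 MB per file:
print(needs_compaction(1024**3, 500))  # True
```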
Let us know if you have any questions or need further enhancements!
Thank You,
Vivek Janakiraman
Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided "AS IS" with no warranties and confer no rights.