Posted by

Posted on

March 19, 2025

Posted under

Comments

Azure Databricks – Query to get Size and Parquet File Count for Delta Tables in a Catalog using PySpark

Managing and analyzing Delta tables in a Databricks environment requires insights into storage consumption and file distribution. In this blog, we will explore a PySpark query that lists all Delta tables under a specified catalog, retrieving their details, including table size and the number of parquet files.

🚀 Why This Query is Useful?

✅ Comprehensive Overview: Lists all Delta tables under a given catalog.
✅ Storage Insights: Displays table size in Bytes, MB, GB, and TB.
✅ File Count: Helps identify partitioning and performance optimization opportunities.
✅ Automation Ready: Easily adaptable for different catalogs and use cases.

🔍 The Query: Fetching Delta Table Details

The following PySpark script extracts metadata using the DESCRIBE DETAIL command and consolidates the results into a single DataFrame:

📌 Code Implementation

from pyspark.sql import DataFrame

# Fetch all DESCRIBE DETAIL queries
describe_queries = spark.sql("""
    SELECT 'DESCRIBE DETAIL ' || table_catalog || '.' || table_schema || '.' || table_name AS sql_query
    FROM system.information_schema.tables
    WHERE table_catalog = 'Your_Catalog_name_JBCatalog'  --CHANGE THE REQUIRED CATALOG
    AND data_source_format = 'DELTA'
""").collect()

# Execute each query and collect results
results = []
for row in describe_queries:
    query = row["sql_query"]
    detail_df = spark.sql(query)  # Run DESCRIBE DETAIL for each table
    results.append(detail_df)

# Combine all results into a single DataFrame
if results:
    final_df = results[0]
    for df in results[1:]:
        final_df = final_df.union(df)
    
    # Add sizeinMB, sizeinGB, sizeinTB columns
    final_df = final_df.withColumn("sizeinMB", final_df["sizeInBytes"] / (1024 * 1024))
    final_df = final_df.withColumn("sizeinGB", final_df["sizeInBytes"] / (1024 * 1024 * 1024))
    final_df = final_df.withColumn("sizeinTB", final_df["sizeInBytes"] / (1024 * 1024 * 1024 * 1024))
    
    display(final_df)  # Just display the output without storing it
else:
    print("No Delta tables found in the fraud catalog.")

📊 Breaking Down the Query

1️⃣ Querying Delta Tables

Uses system.information_schema.tables to fetch all Delta tables under a specified catalog.
Constructs DESCRIBE DETAIL queries dynamically for each table.

2️⃣ Executing DESCRIBE DETAIL

Iterates over the collected queries and runs DESCRIBE DETAIL on each Delta table.
Stores results in a list of DataFrames.

3️⃣ Combining Results

Unifies all DataFrames into a single DataFrame using union().
Computes additional columns for table size in MB, GB, and TB.
Displays the final result in Databricks.

📢 Key Takeaways

✔️ Quickly analyze Delta table metadata to understand storage and file distribution. ✔️ Optimize table performance by evaluating the number of Parquet files. ✔️ Scale storage efficiently by monitoring table sizes at different units. ✔️ Adaptable across catalogs by modifying the table_catalog filter.

📌 Let us know if you have any questions or need further enhancements! 🚀

Thank You,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confers no rights.

Posted by

Vivek Janakiraman

Posted on

August 31, 2024

Posted under

SQL Server 2022

Comments

SQL Server 2022 Enhancements to Batch Mode Processing: A Comprehensive Guide

In the world of data analytics and processing, efficiency and speed are crucial. SQL Server 2022 brings significant enhancements to batch mode processing, making data operations faster and more efficient. In this blog, we’ll explore these enhancements using the JBDB database and demonstrate their benefits through a detailed business use case. Let’s dive in! 🚀

Business Use Case: Optimizing Financial Reporting

Imagine a financial institution, “FinanceCorp,” that handles large volumes of transactional data daily. The company’s data analysts often run complex queries to generate reports on various financial metrics, including daily transactions, average transaction amounts, and customer spending patterns. However, these queries often take a long time to execute due to the sheer volume of data.

With SQL Server 2022’s enhancements to batch mode processing, FinanceCorp aims to optimize query performance, reduce execution times, and provide near real-time insights. This improvement will enhance decision-making and provide a competitive edge in the financial industry.

Understanding Batch Mode Processing

Batch mode processing is a technique where rows of data are processed in batches, rather than one at a time. This method significantly reduces CPU usage and increases query performance, particularly for analytical workloads. SQL Server 2022 introduces several key enhancements to batch mode processing:

Batch Mode on Rowstore: Previously, batch mode processing was limited to columnstore indexes. SQL Server 2022 extends batch mode processing to rowstore tables, allowing a broader range of queries to benefit from this optimization.
Improved Parallelism: SQL Server 2022 improves parallelism in batch mode processing, allowing more efficient use of system resources and faster query execution.
Enhanced Memory Grant Feedback: The new version provides better memory grant feedback, reducing the risk of excessive memory allocation and improving overall query performance.

Demo: Batch Mode Processing Enhancements with JBDB Database

Let’s see these enhancements in action using the JBDB database. We’ll demonstrate how batch mode processing can optimize query performance.

Step 1: Setting Up the JBDB Database

First, ensure the JBDB database is set up with the necessary tables and data. Here’s a sample setup:

CREATE DATABASE JBDB;
GO

USE JBDB;
GO

CREATE TABLE Transactions (
    TransactionID INT PRIMARY KEY,
    CustomerID INT,
    TransactionDate DATE,
    TransactionAmount DECIMAL(18, 2)
);
GO

-- Insert sample data
INSERT INTO Transactions VALUES 
    (1, 101, '2024-07-01', 100.00), 
    (2, 102, '2024-07-02', 150.00), 
    (3, 103, '2024-07-03', 200.00), 
    (4, 101, '2024-07-04', 250.00),
    (5, 102, '2024-07-05', 300.00);
GO

Step 2: Enabling Batch Mode on Rowstore

SQL Server 2022 allows batch mode processing on rowstore tables without requiring columnstore indexes. Let’s see how this affects query performance:

-- Traditional row-by-row processing
SELECT 
    CustomerID,
    AVG(TransactionAmount) AS AverageAmount
FROM Transactions
GROUP BY CustomerID;
GO

-- Batch mode processing on rowstore
SELECT 
    CustomerID,
    AVG(TransactionAmount) AS AverageAmount
FROM Transactions
GROUP BY CustomerID
OPTION (USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'));
GO

The USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE') hint forces the query to use parallelism, demonstrating the enhanced parallelism in batch mode.

Step 3: Observing Improved Memory Grant Feedback

SQL Server 2022’s improved memory grant feedback optimizes memory allocation for queries. This feature helps prevent excessive memory allocation, which can slow down query performance.

-- Example query with potential memory grant feedback
SELECT 
    COUNT(*)
FROM Transactions
WHERE TransactionAmount > 100.00;
GO

Run this query multiple times and observe the memory grant adjustments in the query plan.

Additional Example Queries: Exploring Batch Mode Processing Enhancements

Let’s explore more scenarios where batch mode processing can significantly improve query performance:

Example 1: Calculating Total Transactions per Day

SELECT 
    TransactionDate,
    SUM(TransactionAmount) AS TotalAmount
FROM Transactions
GROUP BY TransactionDate
ORDER BY TransactionDate;
GO

This query calculates the total transaction amount per day, which can benefit from batch mode processing due to its grouping and aggregation operations.

Example 2: Identifying High-Value Transactions

SELECT 
    TransactionID,
    CustomerID,
    TransactionAmount
FROM Transactions
WHERE TransactionAmount > 200.00
OPTION (USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'));
GO

Batch mode processing can speed up the filtering of high-value transactions, providing quick insights into significant purchases.

Example 3: Analyzing Customer Spending Patterns

SELECT 
    CustomerID,
    COUNT(TransactionID) AS TotalTransactions,
    SUM(TransactionAmount) AS TotalSpent
FROM Transactions
GROUP BY CustomerID
ORDER BY TotalSpent DESC;
GO

This query analyzes customer spending patterns, which can be critical for targeted marketing and personalized services. Batch mode processing enhances performance by efficiently handling the aggregation of transaction data.

Example 4: Calculating Monthly Transaction Averages

SELECT 
    YEAR(TransactionDate) AS Year,
    MONTH(TransactionDate) AS Month,
    AVG(TransactionAmount) AS AverageMonthlyAmount
FROM Transactions
GROUP BY YEAR(TransactionDate), MONTH(TransactionDate)
ORDER BY Year, Month;
GO

Calculating monthly averages involves aggregating data over time periods, making it an ideal candidate for batch mode processing.

Example 5: Detecting Transaction Spikes

WITH DailyTotals AS (
    SELECT 
        TransactionDate,
        SUM(TransactionAmount) AS TotalAmount
    FROM Transactions
    GROUP BY TransactionDate
)
SELECT 
    TransactionDate,
    TotalAmount,
    LAG(TotalAmount) OVER (ORDER BY TransactionDate) AS PreviousDayAmount,
    (TotalAmount - LAG(TotalAmount) OVER (ORDER BY TransactionDate)) AS DayOverDayChange
FROM DailyTotals
ORDER BY TransactionDate;
GO

This query uses window functions to detect day-over-day changes in transaction amounts, helping identify spikes in transactions. Batch mode processing optimizes the handling of these calculations.

Business Impact of Batch Mode Processing Enhancements

For FinanceCorp, the enhancements to batch mode processing mean faster report generation, reduced CPU usage, and more efficient memory utilization. This improvement leads to:

Faster Insights: Financial analysts can generate reports in a fraction of the time, allowing for quicker decision-making.
Cost Savings: Improved efficiency reduces the need for expensive hardware upgrades and lowers operational costs.
Competitive Advantage: Near real-time insights provide a strategic advantage in the highly competitive financial sector.

Conclusion

SQL Server 2022’s enhancements to batch mode processing offer substantial benefits, particularly for businesses handling large volumes of data. By leveraging these improvements, organizations like FinanceCorp can achieve faster query performance, optimize resource usage, and gain a competitive edge. Whether you’re in finance, healthcare, or any data-driven industry, these enhancements can significantly impact your data processing capabilities. 🌟

Stay tuned for more insights and detailed technical guides on the latest features in SQL Server 2022! 🎉

For more tutorials and tips on SQL Server, including performance tuning and database management, be sure to check out our JBSWiki YouTube channel.

Thank You,
Vivek Janakiraman

Posted by

Vivek Janakiraman

Posted on

August 24, 2024

Posted under

SQL Server 2022

Comments

Exploring the APPROX_COUNT_DISTINCT Function in SQL Server 2022

With the release of SQL Server 2022, a range of powerful new functions has been introduced, including the APPROX_COUNT_DISTINCT function. This function provides a fast and memory-efficient way to estimate the number of unique values in a dataset, making it an invaluable tool for big data scenarios where traditional counting methods may be too slow or resource-intensive. In this blog, we will explore the APPROX_COUNT_DISTINCT function, using the JBDB database for practical demonstrations and providing a detailed business use case to illustrate its benefits. Let’s dive into the world of approximate distinct counts! 🎉

Business Use Case: E-commerce Customer Segmentation 📦

In an e-commerce business, understanding the diversity of customer behavior is crucial for personalized marketing and inventory management. The JBDB database contains customer transaction data, including CustomerID, ProductID, and PurchaseDate. The business aims to estimate the number of unique customers making purchases each month and the variety of products they are buying. Using the APPROX_COUNT_DISTINCT function, the company can quickly analyze this data to identify trends, optimize stock levels, and tailor marketing campaigns.

Understanding the `APPROX_COUNT_DISTINCT` Function 🧠

The APPROX_COUNT_DISTINCT function estimates the number of distinct values in a column, offering a performance-efficient alternative to the traditional COUNT(DISTINCT column) approach. It is particularly useful in large datasets where an exact count is less critical than performance and resource usage.

Syntax:

APPROX_COUNT_DISTINCT ( column_name )

column_name: The column from which distinct values are counted.

Example 1: Estimating Unique Customers per Month 📅

Let’s calculate the estimated number of unique customers making purchases each month in the JBDB database.

Setup:

USE JBDB;
GO

CREATE TABLE CustomerTransactions (
    TransactionID INT PRIMARY KEY,
    CustomerID INT,
    ProductID INT,
    PurchaseDate DATE
);

INSERT INTO CustomerTransactions (TransactionID, CustomerID, ProductID, PurchaseDate)
VALUES
(1, 101, 2001, '2023-01-05'),
(2, 102, 2002, '2023-01-10'),
(3, 101, 2003, '2023-01-15'),
(4, 103, 2001, '2023-02-05'),
(5, 104, 2002, '2023-02-10'),
(6, 102, 2004, '2023-02-15'),
(7, 105, 2005, '2023-03-05'),
(8, 106, 2001, '2023-03-10');
GO

Query to Estimate Unique Customers:

SELECT 
    FORMAT(PurchaseDate, 'yyyy-MM') AS Month,
    APPROX_COUNT_DISTINCT(CustomerID) AS EstimatedUniqueCustomers
FROM CustomerTransactions
GROUP BY FORMAT(PurchaseDate, 'yyyy-MM');

Output:

Month	EstimatedUniqueCustomers
2023-01	2
2023-02	3
2023-03	2

This output gives an approximate count of unique customers making purchases in each month, providing quick insights into customer engagement over time.

Example 2: Estimating Product Variety by Month 📊

Now, let’s estimate the variety of products purchased each month to understand product diversity and demand trends.

Query to Estimate Product Variety:

SELECT 
    FORMAT(PurchaseDate, 'yyyy-MM') AS Month,
    APPROX_COUNT_DISTINCT(ProductID) AS EstimatedUniqueProducts
FROM CustomerTransactions
GROUP BY FORMAT(PurchaseDate, 'yyyy-MM');

Output:

Month	EstimatedUniqueProducts
2023-01	3
2023-02	3
2023-03	2

This data helps the business understand which months had the highest product variety, aiding in inventory and supply chain management.

Example 3: Comparing Traditional and Approximate Counts 🔄

To illustrate the efficiency of APPROX_COUNT_DISTINCT, let’s compare it with the traditional COUNT(DISTINCT column) method.

Traditional COUNT(DISTINCT) Method:

SELECT 
    FORMAT(PurchaseDate, 'yyyy-MM') AS Month,
    COUNT(DISTINCT CustomerID) AS ExactUniqueCustomers
FROM CustomerTransactions
GROUP BY FORMAT(PurchaseDate, 'yyyy-MM');

Approximate COUNT(DISTINCT) Method:

SELECT 
    FORMAT(PurchaseDate, 'yyyy-MM') AS Month,
    APPROX_COUNT_DISTINCT(CustomerID) AS EstimatedUniqueCustomers
FROM CustomerTransactions
GROUP BY FORMAT(PurchaseDate, 'yyyy-MM');

Comparison:

Month	ExactUniqueCustomers	EstimatedUniqueCustomers
2023-01	2	2
2023-02	3	3
2023-03	2	2

The approximate method provides similar results with potentially significant performance improvements, especially in large datasets.

Estimating Unique Products by Customer:

Calculate the estimated number of unique products purchased by each customer:

SELECT 
    CustomerID,
    APPROX_COUNT_DISTINCT(ProductID) AS EstimatedUniqueProducts
FROM CustomerTransactions
GROUP BY CustomerID;

Estimating Unique Purchase Dates:

Estimate the number of unique purchase dates in the dataset:

SELECT 
    APPROX_COUNT_DISTINCT(PurchaseDate) AS EstimatedUniquePurchaseDates
FROM CustomerTransactions;

Regional Sales Analysis:

If the dataset includes a region column, estimate unique customers per region:

SELECT 
    Region,
    APPROX_COUNT_DISTINCT(CustomerID) AS EstimatedUniqueCustomers
FROM CustomerTransactions
GROUP BY Region;

Conclusion 🏁

The APPROX_COUNT_DISTINCT function in SQL Server 2022 is a powerful tool for quickly estimating the number of distinct values in large datasets. This function is particularly useful in big data scenarios where performance and resource efficiency are crucial. By leveraging APPROX_COUNT_DISTINCT, businesses can gain rapid insights into customer behavior, product diversity, and other key metrics, enabling more informed decision-making. Whether you’re analyzing e-commerce data, customer segmentation, or product sales, this function offers a robust solution for your data analysis needs. Happy querying! 🎉

For more tutorials and tips on SQL Server, including performance tuning and database management, be sure to check out our JBSWiki YouTube channel.

Thank You,
Vivek Janakiraman

JBs Wiki

Tag Archives: data insights

SQL Server 2022 Enhancements to Batch Mode Processing: A Comprehensive Guide

Business Use Case: Optimizing Financial Reporting

Understanding Batch Mode Processing

Demo: Batch Mode Processing Enhancements with JBDB Database

Step 1: Setting Up the JBDB Database

Step 2: Enabling Batch Mode on Rowstore

Step 3: Observing Improved Memory Grant Feedback

Additional Example Queries: Exploring Batch Mode Processing Enhancements

Business Impact of Batch Mode Processing Enhancements

Conclusion

Azure Databricks – Query to get Size and Parquet File Count for Delta Tables in a Catalog using PySpark

🚀 Why This Query is Useful?

🔍 The Query: Fetching Delta Table Details

📊 Breaking Down the Query

1️⃣ Querying Delta Tables

2️⃣ Executing DESCRIBE DETAIL

3️⃣ Combining Results

📢 Key Takeaways

SQL Server 2022 Enhancements to Batch Mode Processing: A Comprehensive Guide

Business Use Case: Optimizing Financial Reporting

Understanding Batch Mode Processing

Demo: Batch Mode Processing Enhancements with JBDB Database

Step 1: Setting Up the JBDB Database

Step 2: Enabling Batch Mode on Rowstore

Step 3: Observing Improved Memory Grant Feedback

Additional Example Queries: Exploring Batch Mode Processing Enhancements

Business Impact of Batch Mode Processing Enhancements

Conclusion

Exploring the APPROX_COUNT_DISTINCT Function in SQL Server 2022

Business Use Case: E-commerce Customer Segmentation 📦

Understanding the APPROX_COUNT_DISTINCT Function 🧠

Syntax:

Example 1: Estimating Unique Customers per Month 📅

Setup:

Query to Estimate Unique Customers:

Output:

Example 2: Estimating Product Variety by Month 📊

Query to Estimate Product Variety:

Example 3: Comparing Traditional and Approximate Counts 🔄

Traditional COUNT(DISTINCT) Method:

Approximate COUNT(DISTINCT) Method:

Comparison:

Estimating Unique Products by Customer:

Estimating Unique Purchase Dates:

Regional Sales Analysis:

Conclusion 🏁

Understanding the `APPROX_COUNT_DISTINCT` Function 🧠