Azure Databricks Series: Hands-On Machine Learning for Stock Prediction

Introduction

In today’s data-driven world, machine learning (ML) plays a crucial role in predictive analytics. One of the most popular use cases is stock price prediction, where ML algorithms analyze historical data to forecast future trends. In this blog, we will explore how to leverage Azure Databricks for stock price prediction using machine learning.

📺 Watch the full tutorial video here

📂 Download the code file from here


Why Use Azure Databricks for Machine Learning?

Azure Databricks provides a powerful environment for big data processing, machine learning, and real-time analytics. Here are some key reasons why it is ideal for ML-based stock prediction:

  • Scalability – Handles large volumes of historical stock data efficiently.
  • Integration – Seamlessly connects with Azure Storage, Delta Lake, and MLflow.
  • Collaborative Environment – Supports teamwork with shared notebooks and version control.
  • High Performance – Optimized for distributed computing and deep learning workloads.

Workflow for Stock Price Prediction

The machine learning workflow in Azure Databricks for stock prediction involves multiple steps:

Step 1: Data Collection

Stock market data is gathered from a financial data provider. Typically, historical stock prices include:

  • Date
  • Open price
  • High price
  • Low price
  • Close price
  • Volume traded
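The fields above can be sketched as a simple record type. The schema and the sample values here are illustrative only, not tied to any particular data provider:

```python
from dataclasses import dataclass

@dataclass
class OhlcvRecord:
    """One day of historical stock data (illustrative schema)."""
    date: str       # trading date, e.g. "2024-01-15"
    open: float     # opening price
    high: float     # highest price of the day
    low: float      # lowest price of the day
    close: float    # closing price
    volume: int     # number of shares traded

row = OhlcvRecord(date="2024-01-15", open=101.2, high=104.5,
                  low=100.8, close=103.9, volume=1_250_000)
print(row.close)   # 103.9
```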

Step 2: Data Preprocessing

To ensure accurate predictions, the raw data undergoes preprocessing:

  • Handling missing values
  • Normalizing stock prices
  • Converting date-time format
  • Feature engineering to extract trends and patterns
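Two of these steps, handling missing values and normalization, can be sketched in plain Python. The helper names below are my own, chosen for illustration:

```python
def forward_fill(values):
    """Replace None entries with the most recent non-missing value."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled

def min_max_normalize(values):
    """Scale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

closes = [100.0, None, 104.0, 102.0]
clean = forward_fill(closes)       # [100.0, 100.0, 104.0, 102.0]
scaled = min_max_normalize(clean)  # [0.0, 0.0, 1.0, 0.5]
```

In a real pipeline these operations would typically run as Spark DataFrame transformations, but the arithmetic is the same.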

Step 3: Storing Data in a Delta Table

Azure Databricks supports Delta Lake, an optimized storage layer for big data analytics. The cleaned dataset is stored in a Delta Table, which ensures:

  • ACID transactions for data integrity
  • Scalability to handle large datasets
  • Versioning for better data management
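As a sketch of this step in a Databricks notebook, a cleaned DataFrame `df` could be written out as a Delta table like so. The storage path and table name are assumptions for illustration, not the tutorial's actual locations:

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists; getOrCreate() returns it.
spark = SparkSession.builder.getOrCreate()

# `df` is the cleaned stock DataFrame from the preprocessing step.
# The abfss:// path below is an assumed example location.
(df.write
   .format("delta")
   .mode("overwrite")   # replace any previous version of the data
   .save("abfss://deltatables@jbadbstorageaccount.dfs.core.windows.net/stock_prices"))

# Register the files as a table so they can be queried with SQL.
spark.sql(
    "CREATE TABLE IF NOT EXISTS stock_prices USING DELTA "
    "LOCATION 'abfss://deltatables@jbadbstorageaccount.dfs.core.windows.net/stock_prices'"
)
```

Delta's transaction log is what provides the ACID guarantees and time-travel versioning mentioned above.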

Step 4: Model Selection & Training

For stock prediction, various machine learning models can be used, such as:

  • Linear Regression – Suitable for trend analysis.
  • Random Forest – Effective for capturing non-linear relationships.
  • LSTM (Long Short-Term Memory) – A deep learning model ideal for time-series forecasting.

The model is trained using historical data, and performance is evaluated using metrics like Mean Squared Error (MSE) and R-Squared.
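Both metrics are simple to compute by hand. Here is a small self-contained sketch with made-up actual and predicted values:

```python
def mse(actual, predicted):
    """Mean Squared Error: the average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """R-Squared: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [99.0, 103.0, 100.0, 104.0]
print(mse(actual, predicted))        # 1.0
print(r_squared(actual, predicted))  # about 0.714
```

In practice you would use the equivalent evaluators from scikit-learn or Spark MLlib rather than writing these by hand.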

Step 5: Predicting Future Stock Prices

Once the model is trained, it is used to predict next-day stock prices based on recent trends.
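As a toy stand-in for a trained model (not the tutorial's actual model), a one-step extrapolation of a least-squares trend line fitted over recent closing prices might look like this:

```python
def next_day_forecast(closes):
    """Fit a least-squares line through recent closes and
    extrapolate one step ahead."""
    n = len(closes)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(closes) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, closes))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * n   # predicted value at the next time step

recent = [100.0, 101.0, 102.0, 103.0]
print(next_day_forecast(recent))   # 104.0 for a perfectly linear trend
```

A real deployment would instead call `model.predict(...)` (or `model.transform(...)` in Spark MLlib) on a feature row built from the latest data.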

Step 6: Visualization & Insights

The predicted prices are visualized using interactive charts and graphs to compare with actual values. This helps in understanding the performance of the model and refining future predictions.


Challenges in Stock Price Prediction

While ML provides valuable insights, predicting stock prices has inherent challenges:

  • Market Volatility – Prices can fluctuate due to unforeseen events.
  • External Factors – News, political events, and investor sentiment impact prices.
  • Overfitting – Models may perform well on historical data but struggle with real-world scenarios.

Despite these challenges, machine learning helps traders and investors make informed decisions based on data-driven insights.


Conclusion

Azure Databricks provides a robust, scalable, and efficient platform for stock price prediction using machine learning. By leveraging its powerful data processing capabilities and ML frameworks, we can build accurate and insightful models to analyze stock trends.

📺 Watch the complete tutorial here

📂 Download the code file from here

Stay tuned for more Azure Databricks tutorials in this series! 🚀

Azure Databricks Series: Step-by-Step Guide to Integrating Azure Key Vault for Secure Access

Accessing a storage account from Azure Databricks using a storage account key

from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.getOrCreate()

# Define the storage account and container details
storage_account_name = 'jbadbstorageaccount'
storage_account_key = '<<Account_Key>>'  # placeholder; hard-coding keys is insecure
container_name = 'deltatables'

# Define the configuration for the Azure Storage account
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    storage_account_key
)

# Define the path to the CSV file
file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/customer/customers.csv"

# Read the CSV file into a DataFrame
df = spark.read.format('csv').option('header', 'true').load(file_path)

# Display the DataFrame
display(df)

Accessing a storage account from Azure Databricks with an access key stored in Azure Key Vault

from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.getOrCreate()

# Define the storage account and container details
storage_account_name = 'jbadbstorageaccount'
container_name = 'deltatables'

# Retrieve the storage account key from Azure Key Vault
key_vault_scope = 'jbswikidatabricksscopekeyvaultintegration'
key_vault_key = 'jbswikidatabrickssecret'

storage_account_key = dbutils.secrets.get(scope=key_vault_scope, key=key_vault_key)

# Define the configuration for the Azure Storage account
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    storage_account_key
)

# Define the path to the CSV file
file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/customer/customers.csv"

# Read the CSV file into a DataFrame
df = spark.read.format('csv').option('header', 'true').load(file_path)

# Display the DataFrame
display(df)

Regards,
Vivek Janakiraman

Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided "AS IS" with no warranties and confer no rights.