In the modern Business Intelligence landscape, handling large volumes of data efficiently is a constant challenge. Microsoft Fabric emerges as a unified platform that simplifies data architecture, and its integration with Apache Spark is at the heart of its large-scale data processing capability.
In this post, we will explore how this powerful combination works. We will start by understanding the role of Spark in distributed processing and its main advantages, and then we will show how easy it is to set up and optimize a Spark cluster within the Fabric environment.
1. Overview of Microsoft Fabric
Microsoft Fabric consolidates functionalities such as Data Warehousing, Data Engineering, real-time processing, Data Science, and Machine Learning into a single platform, making integrated information management easier.
In the realm of Data Engineering, the use of Spark is central. Spark is a distributed processing technology that executes tasks in parallel, optimizing performance in scenarios with large volumes of data. In Fabric, Spark comes pre-integrated—no additional installation is required—and clusters are automatically managed by the service, allowing dynamic scalability according to the workload.
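As a quick illustration of this pre-integration, a Fabric notebook attaches to the managed cluster automatically, so a first cell can go straight to work. A minimal sketch (the sample data is illustrative):

```python
from pyspark.sql import SparkSession

# In a Fabric notebook the Spark session is already provisioned by the service;
# getOrCreate() simply returns that managed session instead of building a new cluster.
spark = SparkSession.builder.getOrCreate()

# A small DataFrame processed on the Fabric-managed cluster, with no installation or setup.
df = spark.createDataFrame([(1, "sales"), (2, "finance")], ["id", "department"])
df.show()
```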
2. How Does Spark Work?
Apache Spark is a distributed processing engine that operates through a master-worker architecture with inherent parallelism, allowing the processing of large volumes of data across multiple machines in a coordinated manner.
2.1. Parallelism Architecture in Spark
Spark operates on a hierarchical architecture composed of two main types of nodes: the master node (also referred to as the driver) and the worker nodes (which host the executors). This distribution allows Spark to break complex jobs down into smaller tasks that are executed in parallel.
The driver node acts as the central coordinator of the cluster, being responsible for:
- Analyzing, distributing, and scheduling tasks among the executors
- Maintaining the SparkContext, which represents the connection to the Spark cluster
- Monitoring execution progress and ensuring fault tolerance
The worker nodes contain the executors, which are processes responsible for the actual execution of tasks. Each executor has two main responsibilities:
- Executing the code assigned to it by the driver node
- Reporting the progress of computations back to the driver node
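A few of these components can be inspected directly from a notebook through the SparkContext. A minimal sketch (the printed values depend on the attached pool):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext        # lives on the driver and represents the connection to the cluster

print(sc.applicationId)        # ID of the application the driver is coordinating
print(sc.defaultParallelism)   # total task slots the executors expose for parallel work
```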
2.2. How Parallelism Works
Parallelism in Spark is achieved by dividing the data into partitions distributed across the different nodes of the cluster. Each partition is processed independently by different threads, allowing simultaneous operations. For example, if a dataset is divided into multiple 128MB partitions, different executors can process these partitions in parallel, maximizing the use of computational resources.
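For instance, the number of partitions of a DataFrame can be inspected and adjusted directly from a notebook. A minimal sketch (the row and partition counts are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each partition becomes an independent task that an executor core can process in parallel.
df = spark.range(0, 10_000_000)                  # distributed dataset of 10 million rows
print(df.rdd.getNumPartitions())                 # partitions Spark chose for this data

df_repartitioned = df.repartition(48)            # spread the data across more partitions
print(df_repartitioned.rdd.getNumPartitions())   # now 48 partitions -> up to 48 parallel tasks
```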
Spark creates a DAG (Directed Acyclic Graph) to schedule tasks and orchestrate the worker nodes in the cluster. This mechanism allows for optimizing the sequence of operations and facilitates recovery in case of failures by replicating only the necessary operations on the data from a previous state.
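This planning step is observable: transformations only add nodes to the DAG and nothing executes until an action is called, while `explain()` prints the plan Spark derived. A minimal sketch (the sample transformation is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1_000_000)

# Transformations are lazy: they only extend the DAG, nothing runs yet.
grouped = (
    df.withColumn("bucket", F.col("id") % 10)
      .groupBy("bucket")
      .count()
)

grouped.explain()            # prints the physical plan Spark built from the DAG
result = grouped.collect()   # the action: Spark now schedules tasks on the executors
```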
3. Spark Configurations in Microsoft Fabric
3.1. Prerequisites
- Access to the Microsoft Fabric portal with the necessary permissions (admin, contributor, or member)
- A previously purchased Fabric SKU or an active Fabric Trial
3.2. Configuration Steps
Pre-warmed Cluster Configuration
Microsoft Fabric offers Starter Pools that use pre-warmed clusters running on Microsoft-managed virtual machines to significantly reduce startup times. These clusters are always active and ready for use, so Spark sessions typically initialize within 5 to 10 seconds, without the need for manual configuration.
Starter Pools use medium-sized nodes that scale dynamically based on the needs of Spark jobs. When there are no dependencies on custom libraries or custom Spark properties, sessions start almost instantly because the cluster is already running and requires no provisioning time.
However, there are scenarios where the startup time may be longer:
- Custom libraries: can add 30 seconds to 5 minutes for session customization
- High regional usage: When Starter Pools are saturated, it may take 2–5 minutes to create new clusters
- Network options: Private Links or Managed VNets disable Starter Pools, forcing on-demand creation
High Concurrency Mode Activation
It is recommended to enable High Concurrency Mode to allow multiple notebooks to share the same Spark session, optimizing resource usage and drastically reducing startup times. In custom pools with high concurrency enabled, users experience significantly faster session startup compared to standard Spark sessions.
To enable High Concurrency Mode:
- Access the Workspace Settings
- Navigate to Data Engineering/Science > Spark Settings > High Concurrency
- Enable the option For notebooks
Recommended Specifications for Pools
To illustrate the ideal configurations, let's consider an example with SKU F64:
Base Capacity:
- F64 = 64 Capacity Units = 128 Spark VCores
- With a burst factor of 3x = 384 maximum Spark VCores (the burst factor lets Spark jobs temporarily use more VCores than the base capacity to absorb processing peaks)
Recommended Configuration for Custom Pool:
Parameter | Recommended Value | Explanation |
---|---|---|
Node Family | Memory Optimized | Suitable for data processing workloads |
Node Size | Medium (8 VCores) | Balance between performance and concurrency |
Autoscale | Enabled (min: 2, max: 48) | 48 nodes × 8 VCores = 384 VCores (maximum burst) |
Dynamic Allocation | Enabled | Allows automatic adjustment of executors based on demand |
Dynamic Allocation Configuration:
- Min Executors: 2 (baseline for immediate availability)
- Max Executors: 46 (reserving 2 nodes for driver and overhead)
- Initial Executors: 4 (balance between startup time and resource waste)
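These executor settings map to standard Apache Spark dynamic allocation properties. In Fabric they are normally configured at the pool level in the portal, so the builder form below is only a sketch of the equivalent properties rather than how Fabric expects them to be set:

```python
from pyspark.sql import SparkSession

# Equivalent Apache Spark properties for the recommendation above.
# In Fabric these are usually managed through the pool settings; the builder is illustrative.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # baseline for immediate availability
    .config("spark.dynamicAllocation.maxExecutors", "46")   # 2 nodes reserved for driver and overhead
    .config("spark.dynamicAllocation.initialExecutors", "4")
    .getOrCreate()
)
```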
Scaling Based on SKU:
For different SKUs, the maximum configurations vary:
SKU | Capacity Units | Max Spark VCores | Recommended Node Size | Max Nodes |
---|---|---|---|---|
F2 | 2 | 12 | Small | 3 |
F8 | 8 | 48 | Medium | 6 |
F16 | 16 | 96 | Medium | 12 |
F64 | 64 | 384 | Medium/Large | 48/24 |
F128 | 128 | 768 | Large | 48 |
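As a sanity check, the "Max Spark VCores" column follows from the rule implied above (2 Spark VCores per Capacity Unit, multiplied by the 3x burst factor). A small sketch of the arithmetic:

```python
def max_spark_vcores(capacity_units: int, vcores_per_cu: int = 2, burst_factor: int = 3) -> int:
    """Maximum Spark VCores for a Fabric SKU: base VCores times the burst factor."""
    return capacity_units * vcores_per_cu * burst_factor

# Reproduces the "Max Spark VCores" column of the table above.
for sku, cu in [("F2", 2), ("F8", 8), ("F16", 16), ("F64", 64), ("F128", 128)]:
    print(sku, max_spark_vcores(cu))   # F2 -> 12, F8 -> 48, F16 -> 96, F64 -> 384, F128 -> 768
```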
Configuration Through the Portal
In the Microsoft Fabric portal:
- Go to the section Data Engineering > Spark Settings
- Select New Pool to create a custom cluster
- Set the Node Family and Node Size according to the requirements
- Configure Autoscale with a minimum number of nodes = 1 (Fabric ensures recoverable availability even with a single node)
- Enable Dynamic Executor Allocation for automatic resource optimization
Integration with Data Sources:
- Use native connectors to establish connections to Data Lakes or Data Warehouses
- Ensure that credentials and security settings are correctly configured
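As an example, once a Lakehouse is attached to the notebook, its tables can be read directly into Spark DataFrames. A minimal sketch (the `sales_lakehouse.orders` table name is hypothetical; authentication for an attached Lakehouse is handled by Fabric):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Lakehouse table: with the Lakehouse attached to the notebook,
# the name resolves without explicit connection strings or credentials in code.
orders = spark.read.table("sales_lakehouse.orders")

orders.printSchema()
print(orders.count())
```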
Notebook and Task Configuration:
- Configure notebooks (Python or Scala) for developing transformation scripts
- Schedule batch tasks or configure streaming processes according to requirements
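A typical transformation notebook then follows a read, transform, write pattern that can be scheduled as a batch job. A minimal sketch (table and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table in the attached Lakehouse.
orders = spark.read.table("sales_lakehouse.orders")

# Transformation: daily revenue per customer.
daily_revenue = (
    orders.groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"))
)

# Write the result back as a table; the notebook can then be scheduled
# from the Fabric portal or orchestrated in a Data Pipeline.
daily_revenue.write.mode("overwrite").saveAsTable("sales_lakehouse.daily_revenue")
```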
Custom pools have a default autopause of 2 minutes, after which sessions expire and clusters are deallocated; charges apply only for active usage time.
Of course, the ideal cluster settings vary with the workloads to be executed; each case should always be evaluated and validated with tests.
4. Best Practices and Technical Considerations
Sizing and Optimization
Properly sizing the cluster is essential. Consider:
Burst Factor: Determine how much instant scalability is needed to handle processing peaks, then size the pool so that node size × maximum node count reaches the SKU's burst limit. For example, for an F64 SKU (128 base VCores), configure the pool for up to 384 VCores by adjusting node size and node count (e.g., Medium nodes of 8 VCores each × 48 nodes = 384 VCores).
Number of Cores and Memory: For the driver node, select an appropriate number of cores and memory, as it orchestrates the processing and must support cluster management tasks. For worker nodes, base the choice on the total cores and memory required for parallel processing (resources multiply across nodes), and consider scaling these nodes to adjust performance to the workload.
Automation and Scheduling
Automate recurring processes through scripts and scheduling, ensuring consistency and minimizing errors.
Monitoring
Use Fabric’s native monitoring tools to identify potential issues and adjust cluster performance in real time.
Security
Ensure the implementation of robust security policies by configuring permissions and using secure connections for data access.
What are the next steps?
B2F has solid experience in developing solutions on Microsoft Fabric, as well as implementing Spark-based processes. If you need expert support to maximize the performance and efficiency of your analytics platform, our team is ready to collaborate with you and find the best solutions for your challenges.