Introduction
Microsoft Fabric is great when you’re getting started. You build a pipeline, connect a notebook, load some data, and everything feels smooth.
Then production hits.
- Pipelines that ran in minutes start taking much longer.
- Notebooks slow down.
- Tables grow.
- Costs creep up.
I ran into all of this while building a data pipeline in Fabric, and honestly, most of the performance issues were not obvious at the beginning.
This post is a collection of things that actually made a difference for me.
Where things usually go wrong
In my case, the main problems came from:
- Processing more data than needed
- Inefficient merge logic
- Too many small files
- Poor table design
- Not thinking about scale early
None of these show up in tutorials, but they hit quickly in production.
1. Stop doing full loads
This was my first mistake.
Initially, I was reprocessing entire tables every time. It worked fine when the data was small, so I didn’t question it.
But as tables like sales_order grew, things slowed down fast.
What I changed
Instead of loading everything, I switched to incremental files.
file_name = f"load_date_{yesterday}.parquet"
df = spark.read.parquet(f"{path}/{file_name}")
Now only new or changed data gets processed.
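As a sketch, the daily file name can be derived from the run date. This follows the `load_date_...` naming in the snippet above; the helper name itself is mine, not part of the pipeline:

```python
from datetime import date, timedelta

def incremental_file_name(run_date: date) -> str:
    # Each run picks up the previous day's drop.
    yesterday = run_date - timedelta(days=1)
    return f"load_date_{yesterday.isoformat()}.parquet"

print(incremental_file_name(date(2024, 5, 2)))  # load_date_2024-05-01.parquet
```

The point is that the pipeline's input is bounded by what arrived yesterday, not by the full history of the table.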
What changed after that
The difference was immediate.
- Pipelines that were taking a long time dropped to minutes.
- Compute usage also became much more stable.
Takeaway
If you are still doing full loads in Fabric, it will eventually become a problem. It is just a matter of time.
2. Your merge logic matters more than you think
At one point, my merge looked like this:
WHEN MATCHED THEN UPDATE SET *
It seemed fine, but Fabric was updating rows even when nothing had changed.
That adds up very quickly.
What I changed
I added a condition:
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
Why this helped
Now updates only happen when there is an actual change.
This reduced unnecessary writes and improved performance noticeably.
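Put together, the full merge looks roughly like this. A sketch in Delta SQL: the key column `order_id` and the staging table name `sales_order_updates` are assumptions for illustration, not taken from my actual pipeline:

```sql
MERGE INTO sales_order AS t
USING sales_order_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```

The extra `AND` condition is the part that matters: unchanged rows are skipped entirely instead of being rewritten.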
Takeaway
Merge is not just about correctness. It has a big impact on performance.
3. The small files problem is real
This one took a while to notice.
Because I was loading incrementally, Fabric started generating a lot of small Parquet files.
At first everything worked, but over time:
- Queries slowed down
- Reads became inefficient
What I did
I started controlling the number of files:
df = df.coalesce(10)
And also kept an eye on how data was being written.
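For Delta-backed tables (which Fabric lakehouse tables are), periodic compaction helps as well. A minimal sketch, run as Spark SQL:

```sql
-- Compact many small files into fewer, larger ones
OPTIMIZE sales_order;
```

Running this on a schedule keeps read performance stable even while small incremental batches keep landing.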
Takeaway
Too many small files quietly kill performance. You will not notice it immediately, but it builds up.
4. Partitioning is not optional for large tables
For tables like sales_order, I initially didn’t use partitioning.
That meant every query had to scan the full table.
What I changed
I started partitioning based on date columns like created_at.
Result
Queries became much faster, especially when filtering recent data.
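As a sketch, I partition on a date derived from `created_at` rather than the raw timestamp (which would create one partition per distinct value). The schema here is illustrative, not my real table:

```sql
CREATE TABLE sales_order (
  order_id     BIGINT,
  price        DECIMAL(20,6),
  created_at   TIMESTAMP,
  created_date DATE GENERATED ALWAYS AS (CAST(created_at AS DATE))
)
USING DELTA
PARTITIONED BY (created_date);
```

Queries that filter on `created_date` then prune partitions instead of scanning the whole table.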
Takeaway
If your table is growing and you are not partitioning, you are leaving performance on the table.
5. Data types matter more than expected
I underestimated this.
Some columns were coming in as floats, others as large strings, and I let Fabric handle it automatically.
That caused both performance issues and occasional failures.
What I changed
I started normalizing everything explicitly:
from pyspark.sql.functions import col, substring
df = df.withColumn("price", col("price").cast("decimal(20,6)"))
For large text fields, I capped the length:
df = df.withColumn(col_name, substring(col(col_name), 1, 8000))
Takeaway
Letting Fabric guess data types is convenient, but not reliable at scale.
6. Simpler transformations perform better
At one point, I had a notebook doing too many things in one go.
Multiple joins, transformations, and logic all together.
It worked, but it was slow and hard to debug.
What I changed
I broke it into smaller steps:
- Stage intermediate results
- Reduce heavy joins
- Keep transformations simple
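As an illustration, staging an intermediate result before the heavier logic looks like this (the table and column names here are made up):

```sql
CREATE OR REPLACE TABLE stg_orders_enriched AS
SELECT o.*, c.customer_segment
FROM sales_order o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id;
```

Later steps read from `stg_orders_enriched` instead of repeating the join, which also makes each step independently debuggable.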
Result
Execution became faster and much easier to manage.
Takeaway
Complex pipelines might look efficient, but simpler ones usually perform better.
7. Deduplicate early, not later
Incremental loads brought duplicates.
If you don’t handle them early, they affect everything downstream.
What I used
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# "key" is the business-key column (or list of columns) for the table
window = Window.partitionBy(key).orderBy(col("updated_at").desc())
df = (df.withColumn("rn", row_number().over(window))
        .filter(col("rn") == 1)
        .drop("rn"))
Takeaway
Cleaning data early reduces both data size and processing time.
8. Performance is not a one-time fix
This is probably the biggest realization.
Even after fixing things, performance changes over time as:
- Data grows
- Usage increases
- Patterns change
What helped
- Monitoring pipeline duration
- Watching table sizes
- Revisiting partitioning and merge logic
Final thoughts
Fabric is a strong platform, but performance does not come automatically. Most of the issues I faced were not bugs, they were design decisions that didn’t scale. The biggest shift for me was moving from:
“Get it working” to “Make it work efficiently at scale”
If you are working with Fabric
Start thinking about performance early. It is much easier to design it right than to fix it later.