Introduction
Microsoft Fabric is great when you’re getting started. You build a pipeline, connect a notebook, load some data, and everything feels smooth.
Then production hits.
- Pipelines that ran in minutes start taking much longer.
- Notebooks slow down.
- Tables grow.
- Costs creep up.
I ran into all of this while building a data pipeline in Fabric, and honestly, most of the performance issues were not obvious at the beginning.
This post is a collection of things that actually made a difference for me.
Where things usually go wrong
In my case, the main problems came from:
- Processing more data than needed
- Inefficient merge logic
- Too many small files
- Poor table design
- Not thinking about scale early
None of these show up in tutorials, but they hit quickly in production.
1. Stop doing full loads
This was my first mistake.
Initially, I was reprocessing entire tables every time. It worked fine when the data was small, so I didn’t question it.
But as tables like sales_order grew, things slowed down fast.
What I changed
Instead of loading everything, I switched to incremental files.
file_name = f"load_date_{yesterday}.parquet"
df = spark.read.parquet(f"{path}/{file_name}")
Now only new or changed data gets processed.
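As a sketch, the daily file name can be derived from the run date. This follows the `load_date_...` naming in the snippet above; the helper name itself is mine, not part of the pipeline:

```python
from datetime import date, timedelta

def incremental_file_name(run_date: date) -> str:
    # Each run picks up the previous day's drop.
    yesterday = run_date - timedelta(days=1)
    return f"load_date_{yesterday.isoformat()}.parquet"

print(incremental_file_name(date(2024, 5, 2)))  # load_date_2024-05-01.parquet
```

The point is that the pipeline's input is bounded by what arrived yesterday, not by the full history of the table.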
What changed after that
The difference was immediate.
- Pipelines that were taking a long time dropped to minutes.
- Compute usage also became much more stable.
Takeaway
If you are still doing full loads in Fabric, it will eventually become a problem. It is just a matter of time.
2. Your merge logic matters more than you think
At one point, my merge looked like this:
WHEN MATCHED THEN UPDATE SET *
It seemed fine, but Fabric was updating rows even when nothing had changed.
That adds up very quickly.
What I changed
I added a condition:
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
Why this helped
Now updates only happen when there is an actual change.
This reduced unnecessary writes and improved performance noticeably.
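Put together, the full merge looks roughly like this. A sketch in Delta SQL: the key column `order_id` and the staging table name `sales_order_updates` are assumptions for illustration, not taken from my actual pipeline:

```sql
MERGE INTO sales_order AS t
USING sales_order_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```

The extra `AND` condition is the part that matters: unchanged rows are skipped entirely instead of being rewritten.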
Takeaway
Merge is not just about correctness. It has a big impact on performance.
3. The small files problem is real
This one took a while to notice.
Because I was loading incrementally, Fabric started generating a lot of small Parquet files.
At first everything worked, but over time:
- Queries slowed down
- Reads became inefficient
What I did
I started controlling the number of files:
df = df.coalesce(10)
And also kept an eye on how data was being written.
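For Delta-backed tables (which Fabric lakehouse tables are), periodic compaction helps as well. A minimal sketch, run as Spark SQL:

```sql
-- Compact many small files into fewer, larger ones
OPTIMIZE sales_order;
```

Running this on a schedule keeps read performance stable even while small incremental batches keep landing.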
Takeaway
Too many small files quietly kill performance. You will not notice it immediately, but it builds up.
4. Partitioning is not optional for large tables
For tables like sales_order, I initially didn’t use partitioning.
That meant every query had to scan the full table.
What I changed
I started partitioning based on date columns like created_at.
Result
Queries became much faster, especially when filtering recent data.
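As a sketch, I partition on a date derived from `created_at` rather than the raw timestamp (which would create one partition per distinct value). The schema here is illustrative, not my real table:

```sql
CREATE TABLE sales_order (
  order_id     BIGINT,
  price        DECIMAL(20,6),
  created_at   TIMESTAMP,
  created_date DATE GENERATED ALWAYS AS (CAST(created_at AS DATE))
)
USING DELTA
PARTITIONED BY (created_date);
```

Queries that filter on `created_date` then prune partitions instead of scanning the whole table.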
Takeaway
If your table is growing and you are not partitioning, you are leaving performance on the table.
5. Data types matter more than expected
I underestimated this.
Some columns were coming in as floats, others as large strings, and I let Fabric handle it automatically.
That caused both performance issues and occasional failures.
What I changed
I started normalizing everything explicitly:
from pyspark.sql.functions import col, substring
df = df.withColumn("price", col("price").cast("decimal(20,6)"))
For large text fields, I capped the length:
df = df.withColumn(col_name, substring(col(col_name), 1, 8000))
Takeaway
Letting Fabric guess data types is convenient, but not reliable at scale.
6. Simpler transformations perform better
At one point, I had a notebook doing too many things in one go.
Multiple joins, transformations, and logic all together.
It worked, but it was slow and hard to debug.
What I changed
I broke it into smaller steps:
- Stage intermediate results
- Reduce heavy joins
- Keep transformations simple
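As an illustration, staging an intermediate result before the heavier logic looks like this (the table and column names here are made up):

```sql
CREATE OR REPLACE TABLE stg_orders_enriched AS
SELECT o.*, c.customer_segment
FROM sales_order o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id;
```

Later steps read from `stg_orders_enriched` instead of repeating the join, which also makes each step independently debuggable.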
Result
Execution became faster and much easier to manage.
Takeaway
Complex pipelines might look efficient, but simpler ones usually perform better.
7. Deduplicate early, not later
Incremental loads brought duplicates.
If you don’t handle them early, they affect everything downstream.
What I used
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# "key" is the business-key column (or list of columns) for the table
window = Window.partitionBy(key).orderBy(col("updated_at").desc())
df = (df.withColumn("rn", row_number().over(window))
        .filter(col("rn") == 1)
        .drop("rn"))
Takeaway
Cleaning data early reduces both data size and processing time.
8. Performance is not a one-time fix
This is probably the biggest realization.
Even after fixing things, performance changes over time as:
- Data grows
- Usage increases
- Patterns change
What helped
- Monitoring pipeline duration
- Watching table sizes
- Revisiting partitioning and merge logic
Final thoughts
Fabric is a strong platform, but performance does not come automatically. Most of the issues I faced were not bugs, they were design decisions that didn’t scale. The biggest shift for me was moving from:
“Get it working” to “Make it work efficiently at scale”
If you are working with Fabric
Start thinking about performance early. It is much easier to design it right than to fix it later.