Incremental Refresh in Power BI, Part 3: Best Practices for Large Semantic Models


In the two previous posts of the Incremental Refresh in Power BI series, we learned what incremental refresh is, how to implement it, and how to safely publish semantic model changes to Microsoft Fabric (aka Power BI Service). This post focuses on a couple more best practices for implementing incremental refresh on large semantic models in Power BI.

Note

Since Microsoft first announced Microsoft Fabric in May 2023, Power BI has been a part of Microsoft Fabric. Hence, we use the term Microsoft Fabric throughout this post to refer to Power BI or Power BI Service.

The Problem

Implementing incremental refresh in Power BI is usually straightforward if we carefully follow the implementation steps. However, in some real-world scenarios, following the implementation steps is not enough. In different parts of my latest book, Expert Data Modeling with Power BI, 2nd Edition, I emphasise the fact that understanding business requirements is the key to every single development project, and data modelling is no different. Let me explain it more in the context of incremental data refresh implementation.

Let’s say we followed all the required implementation steps and the deployment best practices, and everything runs pretty well in our development environment; the first data refresh takes longer, as we expected, all the partitions are created, and everything looks fine. So, we deploy the solution to the production environment and refresh the semantic model. Our production data source has substantially more data than the development data source, so the data refresh takes way too long. We wait a couple of hours and leave it to run overnight. The next day we find out that the first refresh failed. Some of the possible causes of the first data refresh failing are Timeout, Out of resources, or Out of memory errors. This can happen regardless of your licensing plan, even on Power BI Premium capacities.

Another issue you may face usually happens during development. Many development teams try to keep their development data source’s size as close as possible to their production data source. And… NO, I am NOT suggesting using the production data source for development. Anyway, you may be tempted to do so. You set one month’s worth of data using the RangeStart and RangeEnd parameters just to find out that the data source actually has hundreds of millions of rows in a month. Now, your PBIX file is way too large, so you cannot even save it on your local machine.

This post provides some best practices. Some of the practices this post focuses on require implementation. To keep this post at an optimal length, I save the implementations for future posts. With that in mind, let’s begin.

Best Practices

So far, we have scratched the surface of some common challenges that we may face if we do not pay attention to the requirements and the size of the data being loaded into the data model. The good news is that this post explores a couple of good practices to guarantee smoother and more controlled implementation avoiding the data refresh issues as much as possible. Indeed, there might still be cases where we follow all best practices and we still face challenges.

Note

While incremental refresh is available in Power BI Pro semantic models, the restrictions on parallelism and the lack of an XMLA endpoint might be a deal breaker in many scenarios. So many of the techniques and best practices discussed in this post require a premium semantic model backed by either Premium Per User (PPU), Power BI Capacity (P/A/EM) or Fabric Capacity.

The next few sections explain some best practices to mitigate the risks of facing difficult challenges down the road.

Practice 1: Investigate the data source in terms of its complexity and size

This one is easy; not really. It is necessary to know what kind of beast we are dealing with. If you have access to the pre-production or production data source, it is good to know how much data will be loaded into the semantic model. Let’s say the source table contains 400 million rows of data for the past 2 years. Quick math suggests that on average we will have more than 16 million rows per month (400 million rows divided by 24 months). While these are just hypothetical numbers, you may have even larger data sources. So having an estimate of the data source’s size and growth is always helpful for taking the next steps more thoroughly.
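The quick math above can be sketched in a few lines of Python. This is only an illustration using the hypothetical numbers from this post; the function name `average_rows_per_month` is made up for the example, and a real estimate should come from row counts queried at the data source.

```python
def average_rows_per_month(total_rows: int, years: float) -> float:
    """Return the average number of rows per month over the given period."""
    months = years * 12
    if months <= 0:
        raise ValueError("years must be greater than zero")
    return total_rows / months


# Hypothetical numbers from the post: 400 million rows over the past 2 years
avg = average_rows_per_month(400_000_000, 2)
print(f"{avg:,.0f} rows per month on average")
```

A number like this helps decide how narrow the RangeStart and RangeEnd window should be at development time.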

Practice 2: Keep the date range between the RangeStart and RangeEnd small

Continuing from the previous practice, if we deal with fairly large data sources, then waiting for millions of rows to be loaded into the data model at development time doesn’t make too much sense. So depending on the numbers you get from the previous point, select a date range that is small enough to let you easily continue with your development without needing to wait a long time to load the data into the model with every single change in the Power Query layer. Remember, the date range selected between the RangeStart and RangeEnd does NOT affect the creation of the partition on Microsoft Fabric after publishing. So there wouldn’t be any issues if you chose the values of the RangeStart and RangeEnd to be on the same day or even at the exact same time. One important point to remember is that we cannot change the values of the RangeStart and RangeEnd parameters after publishing the model to Microsoft Fabric.


Incremental Refresh in Power BI, Part 2; Best Practice; Do NOT Publish Data Model Changes from Power BI Desktop


In a previous post, I shared a comprehensive guide on implementing Incremental Data Refresh in Power BI Desktop. We covered essential concepts such as truncation and load versus incremental load, understanding historical and incremental ranges, and the significant benefits of adopting incremental refresh for large tables. If you missed that post, I highly recommend giving it a read to get a solid foundation on the topic.

Now, let’s dive into Part 2 of this series where we will explore tips and tricks for implementing Incremental Data Refresh in more complex scenarios. This blog follows up on the insights provided in the first part, offering a deeper understanding of how Incremental Data Refresh works in Power BI. Whether you’re a seasoned Power BI user or just getting started, this post will provide valuable information on optimising your data refresh strategies. So, let’s begin.

When we publish a Power BI solution from Power BI Desktop to Fabric Service, we upload the data model, queries, reports, and the data loaded into the data model to the cloud. In essence, the Power Query queries, the data model and the loaded data turn into the Semantic Model, and the report becomes a new report connected to that semantic model in Connect Live storage mode. If you are not sure what Connect Live means, then check out this post where I explain the differences between Connect Live and Direct Query storage modes.

The Publish process in Power BI Desktop makes absolute sense in the majority of Power BI developments. While Power BI Desktop is the predominant development tool to implement Power BI solutions, the publishing process is still not quite up to the task, especially in more complex scenarios such as having Incremental Data Refresh configured on one or more tables. Here is why.

As explained in this post, publishing the solution into the service for the first time does not create the partitions required for the incremental refresh. The partitions will be created after the first time we refresh the semantic model from the Fabric Service. Imagine the case where we successfully refreshed the semantic model, but we need to modify the solution in Power BI Desktop and republish the changes to the service. That’s where things get more complex than expected. Whenever we republish the new version from Power BI Desktop to Fabric Service, we get a warning that the semantic model already exists in the target workspace, asking whether we want to Overwrite it with the new one. In other words, Power BI Desktop currently does not offer to apply the semantic model changes without overwriting the entire model. This means that if we move forward, as the warning message suggests, we replace the existing semantic model and its partitions with the new one, which has no partitions. So the new semantic model is now in its very first stage and the partitions of the table(s) with incremental refresh are gone. Of course, the partitions will be created again during the next refresh, but this is not efficient and is realistically unacceptable in production environments. That’s why we MUST NOT use Power BI Desktop for republishing an already published semantic model, to avoid overwriting the tables’ existing partitions. Since Power BI Desktop does not support more advanced publishing scenarios such as detecting the existing partitions created by the incremental refresh process, let’s discuss our other options.

Alternatives to Power BI Desktop to Publish Changes to Fabric Service

While we should not publish the changes from Power BI Desktop to the Service, we can still use it as our development tool and publish the changes using third-party tools, thanks to the External Tools support feature. The following subsections explain using two tools that I believe are the best.


Microsoft Fabric: Capacity Cost Management Part 2, Automate Pause/Resume Capacity with Azure Logic Apps


In the previous blog post, I explained Microsoft Fabric capacities, shedding light on diverse capacity options and how they influence data projects. We delved into Capacity Units (CUs), pricing nuances, and practical cost control methods, including manually scaling and pausing Fabric capacity. Now, we’re taking the next step in our Microsoft Fabric journey by exploring the possibility of automating the pause and resume process. In this blog post, we’ll unlock the secrets to seamlessly managing your Fabric Capacity with automation that helps us save time and resources while optimising the usage of data and analytics workloads.

Right off the bat, this is a rather long blog, so I added a bonus section at the end for those who are reading from the beginning to the end. With that, let’s dive in!

The Problem

As we have learned in the previous blog post, one way to manage our Fabric capacity costs is to pause the capacity while not in use and resume it again when needed. While this can help with cost management, as it is a manual process, it is prone to human error, which makes it impractical in the long run.

The Solution

A more practical solution is to automate a daily process to pause and resume our Fabric capacity. This can be done by running Azure Management APIs. Depending on our expertise, there are several ways to achieve the goal, such as running the APIs via PowerShell (scheduling the runs separately), running the APIs via CloudShell, creating a flow in Power Automate, or creating the workflow in Azure Logic Apps. I prefer the latter, so it is the method that this blog post explains.

Automating Pause and Resume Fabric Capacity with Azure Logic Apps

Here is the scenario: we are going to create an Azure Logic Apps workflow that automatically does the following:

  • Check the time of the day
  • If it is between 8 am and 4 pm:
      • Check the status of the Fabric capacity
      • If the capacity is paused, then resume it; otherwise, do nothing
  • If it is after 4 pm or before 8 am:
      • Check the status of the Fabric capacity
      • If the capacity is running, then pause it; otherwise, do nothing
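The decision logic of the scenario above can be sketched in Python before building it in Logic Apps. This is a minimal sketch and not the Logic Apps workflow itself: the function name `decide_action` and the state strings `"Active"` and `"Paused"` are assumptions made for illustration, as is treating 8 am as inclusive and 4 pm as exclusive.

```python
from datetime import time
from typing import Optional

# Business hours from the scenario: 8 am to 4 pm
BUSINESS_START = time(8, 0)
BUSINESS_END = time(16, 0)


def decide_action(now: time, capacity_state: str) -> Optional[str]:
    """Return "resume", "pause", or None (do nothing).

    capacity_state is assumed to be either "Active" or "Paused";
    these labels are illustrative, not confirmed API values.
    """
    in_business_hours = BUSINESS_START <= now < BUSINESS_END
    if in_business_hours and capacity_state == "Paused":
        return "resume"
    if not in_business_hours and capacity_state == "Active":
        return "pause"
    return None


print(decide_action(time(9, 30), "Paused"))   # inside business hours, paused
print(decide_action(time(18, 0), "Active"))   # outside business hours, running
```

Keeping the decision separate from the API calls makes the schedule easy to test without touching a real capacity.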

Follow these steps to implement the scenario in Azure Logic Apps:

  1. Log in to Azure Portal and search for “Logic App”
  2. Click the Logic App service
Finding Logic Apps on Azure Portal

This navigates us to the Logic App service. If you currently have existing Logic Apps workflows, they will appear here.


Microsoft Fabric: Capacity Options and Cost Management, Part 1; The Basics


Microsoft Fabric is a SaaS platform that allows users to get, create, share, and visualise data using a wide set of tools. It provides a unified solution for all our data and analytics workloads, from data ingestion and transformation to data engineering, data science, data warehouse, real-time analytics, and data visualisation. In a previous blog post, I explained the basics of the Microsoft Fabric data platform. In a separate blog post, I explained some Microsoft Fabric terminologies and personas where I explained what Tenant and Capacities are.

In this blog post, we will explore the different types of Fabric capacities, how they affect the performance and cost of our Fabric projects, and how you can control the capacity costs by pausing the capacity in Azure when it is not in use.

Fabric capacity types

Fabric capacities are the compute resources that power all the experiences in Fabric. They are available in different sizes and prices, depending on our needs and budget. We can currently obtain Fabric capacities in one of the following options:

If we want to purchase Microsoft Fabric capacities on Azure, they come in SKUs (Stock Keeping Units) sized from F2 – F2048, representing 2 – 2048 CU (Capacity Units). A CU is a unit of measure representing the resource power available for a Fabric capacity. The higher the CU, the more resources we get on our Fabric projects. For example, an F8 capacity has 8 CUs, which means it is four times more powerful than an F2 capacity, which has 2 CUs.

When purchasing Azure SKUs with a pay-as-you-go subscription, we are billed for compute power (which is the size of the capacity we choose) and for OneLake storage, which is charged per gigabyte per month for the data stored in OneLake (approximately $0.043 (New Zealand Dollar) per GB). OneLake is the unified storage layer for all the Fabric workloads. It allows us to store and access our data in a secure, scalable and cost-effective way.

Azure Fabric capacities are priced differently across regions. The pay-as-you-go pricing for a Fabric capacity in the Australia East region is $0.3605 (NZD) per CU per hour, which translates to a monthly price of approximately $526 (NZD) for an F2 ($0.3605 * 2 * 730 hours).
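The monthly price formula above (rate per CU per hour, times CUs, times 730 hours) can be written as a small Python helper. This is only a sketch using the Australia East rate quoted in this post; the function name `monthly_capacity_cost` is made up for the example, and actual prices should be checked in the Azure portal for your region.

```python
HOURS_PER_MONTH = 730  # the monthly hour count used in the post's formula


def monthly_capacity_cost(rate_per_cu_hour: float,
                          capacity_units: int,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Pay-as-you-go compute cost: rate per CU per hour * CUs * hours."""
    return rate_per_cu_hour * capacity_units * hours


# F2 (2 CUs) at the quoted rate of $0.3605 (NZD) per CU per hour
f2_always_on = monthly_capacity_cost(0.3605, 2)
print(f"F2, always on: ${f2_always_on:,.2f} NZD per month")

# Same capacity running only a third of the time (for example 8 hours a day)
f2_part_time = monthly_capacity_cost(0.3605, 2, hours=HOURS_PER_MONTH / 3)
print(f"F2, one third of the time: ${f2_part_time:,.2f} NZD per month")
```

The second call shows why pausing the capacity matters: the compute cost scales directly with the hours the capacity is running. OneLake storage is billed separately and is not included here.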

Microsoft Fabric pricing overview

It is important to note that billing is per second with a one-minute minimum, and it applies for as long as the capacity is running. Therefore, we are billed even when the capacity is not in use. A full list of prices is available in the Azure portal by selecting our Fabric capacity region.

Now that we have an indication of the costs of owning Microsoft Fabric capacities, let’s explore the methods to control the cost.

Nuances of Fabric’s Cost of Ownership

It is important to note that all the math we have gone through in the previous section is just about the capacity itself. But are there any other costs that may apply? The answer is: it depends. If we obtain any SKU lower than F64, we must buy Power BI Pro licenses per user on top of the capacity costs. For F64 and larger, we get unlimited free users, BUT we still have to purchase Power BI Pro licenses for all developers on top of the cost of the capacity itself.

Another gotcha is that the Fabric experiences are unavailable to both Power BI Premium Per User (PPU) users and Power BI Embedded capacities. Just be mindful of that.

The good news for organisations owning Power BI Premium capacities is that you do not need to do anything to leverage Fabric capabilities. As a matter of fact, you already own a Fabric capacity; you just need to enable it on your tenant.
