SAP BW 7 and SQL Server 2008 Star Join Optimization
Table of contents :
1. Intro
2. Sample query using SQL Server 2005
3. Sample query with star join optimization using SQL Server 2008 R2
4. like 3. but with row compression
5. Sample query with non-optimized bitmaps using SQL Server 2008 R2
6. Could a query with star join optimization be slower than without ?
7. How to get star join optimization in BW ?
8. Summary
1. Intro
SQL Server 2008 and SQL Server 2008 R2 have a lot of improvements compared to SQL Server 2005. One of them is the so-called star join optimization. There is a lot of information out there. Here are four links where all the details can be found about this specific feature :
https://msdn.microsoft.com/en-us/library/bb522541.aspx
https://technet.microsoft.com/en-us/magazine/2008.04.dwperformance.aspx?pr=blog
https://msdn.microsoft.com/en-us/library/ee410012(SQL.100).aspx
https://technet.microsoft.com/en-us/library/cc278097(SQL.100).aspx
SAP BW is based on a relational snowflake schema which raises the question if it couldn’t benefit from the SQL Server 2008 star join optimization. SAP BW version 7 indeed offers a way to enable star join optimization when running on SQL Server 2008. It’s described in SAP note 1126568. The challenge though is to know that this note exists. You might wonder why it’s necessary at all to do something on the BW side because the optimizer will use star join optimization automatically when max degree of parallelism is active. But the latter aspect is exactly the reason. Right now star join optimization isn’t supported with serial query plans. It requires parallelism and hash joins. SAP BW doesn’t only send star join queries to the database. Besides queries going against the BW snowflake schemas you will see tons of queries which are not related to the BW “InfoCubes” but the SAP application server ( e.g. SAP work process handling ). Therefore a server wide setting of parallelism would have an impact on all statements coming from BW not only the star joins. While this isn’t really an issue in most cases there could be some special situations where it’s not desirable. Imagine a BW process where parallelism is already managed by the application and additional parallelism on the database level could be counterproductive. Section 7 of this blog explains what was implemented in SAP BW to restrict the star join optimization just to the queries going against the snowflake schemas.
The other sections in the blog will show sample query plan details to give a better idea about if and how SAP BW might benefit from star join optimization. The test tables were not taken from a real BW system but the design is so close to a standard BW InfoCube that the outcome can be used to draw a conclusion which is valid for SAP BW production systems.
A sample fact table was created including ten key figures ( like SSAS measures ) and a few key columns to join to the dimensions. The dimensions were loaded with 1000 rows each and the data was evenly distributed. The fact table was loaded with 20 million rows. A very simple query was defined joining the fact table with two of the dimensions and grouping the join result at the end. The idea was to check the runtime as well as the query plan using parallelism set to 1 ( no parallelism ) and 2 ( using two logical CPUs ). The tests were done within a VM without any tuning on the IO side. Therefore the absolute runtimes are not so important. It’s more interesting to see the difference between a “cold” run ( after cleaning the buffer cache via dbcc dropcleanbuffers ) and a “warm” run ( no disk IO at all ) and then compare these scenarios between SQL 2005 and SQL 2008. Both SQL Server versions were installed within the same VM. But the most important part of these tests was to look at the query plans to check if star join optimization was used.
2. Sample query using SQL Server 2005
Here is the sample query mentioned in the intro. The “set statistics” commands were used to get detailed info about the query plans, runtime and physical reads. The dbcc command cleaned the buffer cache and the option() clause was used to run with and without parallelism. The filter on the dimension was defined as a range which would return almost half of the fact table as the join result. The group by at the end should shrink the final result set to just a few rows.
The query plan shows the old 2005 format regarding table partitions. The test fact table had 101 partitions. This was reflected in the query plan via a nested loop join and a constant scan. This was always a little bit confusing because it didn’t mean that there was really a nested loop join on 20 million rows. This was then changed in 2008 as you can see further down on the screenshots of the 2008 tests. So as expected SQL Server read all 20 million rows from the fact table and did a hash join against a hash table built on the small dimension with the filter. This makes sense because with such a huge join result the hash join is much more efficient than a nested loop join.
The first test showed physical disk IO and a runtime of 76 seconds.
Running the query again there was no physical disk IO anymore and the runtime went down to 37 seconds – roughly twice as fast as before.
Turning parallelism on ( set MAXDOP to 2 ) the query plan looks different. We see in the “Executes” column that two threads were working in parallel. But we also see that dynamic bitmaps were created and repartitioning of the data streams was introduced. In addition the order of the tables changed. Now it joins with dimension 2 first and it moves over 16 million rows to the next level instead of 8 million in the serial plan. Somehow this doesn’t seem to be so great.
The runtime shows that despite MAXDOP=2 it took basically as long as without parallelism before. Now one could claim that this is probably related to the different table order. So I used the option clause to force the table order we had with the serial plan.
This time we see the 8 million rows again and the join with dimension 7 first like in the test without parallelism. But we also see that these 8 million rows show up in a hash join and two times in “Repartition Streams”. And the hash join at the end still has to process almost 7 million rows. Even without understanding how all this works in detail we can simply add the numbers and get a sum of roughly 31 million rows ( 3x8 +7 ). The plan before had almost 17 million rows in a hash join and then the same number in a “Repartition Streams”. Well – from a number of rows perspective this in fact doesn’t look so much different.
And the runtime using the “force order” option and MAXDOP=2 is even slightly worse than with the serial plan or not specifying any option clause. So we don’t have to further investigate this direction.
Running the query again without the force order option the response time improved quite a bit. But it is significantly slower than the test with the serial plan where it took only about 37 seconds without any disk IO. Because there were no physical reads and the number of logical reads on the fact table is again exactly 353000 like before without parallelism it’s obvious that all the additional CPU cycles related to the parallel plan made things worse. Setting MAXDOP to a value bigger than 2 would make things faster. But depending on the CPU and memory resources this would reduce the number of queries which can run at the same time using parallelism.
Anyway – let’s see how the same query running against the same tables looks like when using SQL Server 2008 R2.
3. Sample query with star join optimization using SQL Server 2008 R2
Now let’s repeat exactly the same test as described in section 2 with SQL Server 2008 R2.
The serial plan looks the same. It scans all 20 million rows from the fact table and joins with dimension table 7 and the join result is 8 million rows. The nested loop / constant scan construct due to the partitioning is gone in SQL 2008 which makes the query plan more readable.
The runtime with disk IO is pretty much the same as with 2005. As mentioned before – the disk IO in the VM wasn’t tuned at all. Therefore these numbers are not very precise and have to be seen to get a rough idea. 70.9 secs are pretty close to the 76 secs with 2005.
Without any disk IO we see 36 secs compared to 37 secs before with 2005. We cannot call this a significant improvement which would be outside of the error band width due to imprecise measurements. The error without any disk IO on exactly the same VM isn’t so big as there was no major disturbance on the underlying physical host. Now it’s becoming interesting – how well will SQL 2008 handle parallelism ?
Positive surprise ! We see “Opt_Bitmap” in the plan which is a proof that star join optimization was active. This bitmap was used for the clustered index scan on the fact table ( PROBE([Opt_Bitmap1018]…. ). In the 2005 query plan all 20 million rows from the fact table were joined with the dimension to figure out that only 8 million of them did match. And then the second join to the other dimension reduced it to 6.68 million rows ( example with force order ). With 2008 R2 it joined now in this order – first with dimension 7 and then with dimension 2.
But it created in fact bitmaps for both dimensions – Opt_Bitmap1018 and Opt_Bitmap1019. These were used to find the matching rows already during the clustered index scan which is much more efficient than doing it via a join. Therefore it returned the 6.68 million rows from the fact table scan and the following joins didn’t reduce this further.
Well done ! The “cold” test with disk IO improved from 7x secs to 33 secs !
And the “warm” test without any disk IO improved from 36 secs to 20 secs. That’s pretty cool.
4. like 3 but with row compression
Right now in SAP BW 7 “row compression” is the default when running on SQL Server 2008. The tests above were run without any compression feature to have a better comparison between 2005 and 2008. It’s interesting to check if and how the picture changes when using row compression for the tables and indexes.
In the first test there were roughly 600K pages for the fact table with its six secondary indexes.
Turning row compression on - the used pages went down to roughly 300K pages which is about half the size as before.
Row compression obviously had no impact on the serial query plan.
But as expected we see much less logical reads and the response time of the “cold” test went down from over 70 secs to 57 secs.
The “warm” test shows a degradation though from 36 secs before up to 45 secs. Data compression is especially great for reduced disk IO. On the other side it costs CPU cycles to uncompress for internal processing. Given the size of typical BW databases and the available hardware we can expect that usually the big fact tables and BW DSOs don’t fit into memory. So overall we should see a performance improvement in almost all cases due to reduced disk IO besides the big advantage of saving disk space costs.
Turning on parallelism showed again exactly the same query plan using star join optimization ( [Opt_Bitmap1018] + [Opt_Bitmap1019] ).
The response time went down from 57 secs with the serial plan to 30 secs with MAXDOP=2 ( 33 secs without row compression ).
The “warm” case improved the response time from 45 secs to 25 secs. Like with the serial plan it’s a little bit slower than without row compression ( 20 secs ) due to the overhead of uncompressing the data. As mentioned before the cache hit ratio usually isn’t so super optimal for huge data warehouses as long as you don’t have a giant main memory. Thus the reduced disk IO will very likely outweigh the CPU overhead and overall query response times will improve.
5. Sample query with non-optimized bitmaps using SQL Server 2008 R2
As you can read in the details about star join optimization / optimized bitmap ( see links in the intro section at the beginning of the blog ) there are some prerequisites. One of them is a single-column join. So let’s do one test where we use a two-column join between fact table and one dimension. This doesn’t occur with a BW InfoCube. It’s just to find a simple sample in order to get an idea how it looks like when non-optimized bitmaps are used. It could also happen in BW when the optimizer comes to the conclusion that an optimized bitmap doesn’t help.
The test data for the fact table was generated in a way to allow the addition of a second join column without changing the size of the result set. Every key column in BW is only used to join the fact table with one specific dimension. In this special sample the knowledge about the generated data was used to “misuse” the key column for dimension 5 as a second join column to dimension 4. This time only dimension 4 was joined with the fact table.
The serial query plan looks familiar. We see again a scan of all 20 million rows from the fact table and then a join result of 8 million rows.