Optimizing and balancing Unit of Work in SSIS with T-SQL and Biml

You can also find this article in German here.

Have you ever run into the situation where you could logically run all of your load tasks in SSIS in parallel, but the system (source or target) would not support it physically?
Then you probably ended up organizing your data loads into multiple groups, aiming for ideally equally sized containers.
Doing that manually is a time-consuming task – and it never ends, as the timings of your data loads will constantly change.

This post shows you how easily you can automate the assignment of your tables to containers using simple T-SQL, as well as how to automate the creation of the load package using Biml.

It is not about the difference in performance between load patterns, or about how to build staging packages with Biml – there are many other posts and tutorials out there covering those topics, so we’ll focus strictly on how we can optimize our SSIS workload.

First, create a new database (ours is called "Performance", but you can choose any name that suits your needs; just be sure to adjust the connection string in the Biml code later on).

Within that database, we’ll need some meta tables which we will store in a new schema:

CREATE SCHEMA meta
GO
CREATE TABLE [meta].[Container]
([ContainerID] [INT] NULL,
 [TotalCost]   [FLOAT] NULL,
 [Tables]      [INT] NULL
)
ON [PRIMARY]
GO
ALTER TABLE [meta].[Container]
ADD CONSTRAINT [DF_Container_TotalCost] DEFAULT((0)) FOR [TotalCost]
GO

ALTER TABLE [meta].[Container]
ADD CONSTRAINT [DF_Container_Tables] DEFAULT((0)) FOR [Tables]
GO
CREATE TABLE [meta].[Tables]
([TableName] [NVARCHAR](50) NULL,
 [Cost]      [FLOAT] NULL,
 [Container] [INT] NULL
)
ON [PRIMARY]

Let’s also populate our meta.Tables table with some sample data (this is just fictional data based on AdventureWorks):

INSERT INTO [meta].[Tables] ([TableName] ,[Cost] ,[Container]) VALUES ('Person.Person',1,0)
INSERT INTO [meta].[Tables] ([TableName] ,[Cost] ,[Container]) VALUES ('Person.PersonPhone',1,0)
INSERT INTO [meta].[Tables] ([TableName] ,[Cost] ,[Container]) VALUES ('Sales.SalesOrderHeader',10,0)
INSERT INTO [meta].[Tables] ([TableName] ,[Cost] ,[Container]) VALUES ('Sales.SalesOrderDetail',20,0)

The table meta.Container could technically also be a temporary table. We’ve decided against that so you can see what’s happening behind the scenes.

Let’s focus on the meta.tables table for now. It has three columns:

– TableName – guess what we’ll store in there
– Container – this one holds the container in which we want this table to be loaded; it will be automatically populated by our stored procedure
– Cost – this column holds the load cost of this specific table. In our opinion, this should ideally be the average time it took SSIS to load this table in the recent past. If you don’t have that information available, it might as well be something like the size of the table in either gigabytes or rows. The more accurate this column is, the better your results will be.
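If you log your SSIS load durations somewhere, that average is easy to derive. As a minimal sketch (in Python rather than T-SQL, purely for illustration – the `load_log` list of (table, duration) pairs is a hypothetical stand-in for your own execution logging):

```python
from collections import defaultdict

def average_costs(load_log):
    """Compute the average load duration per table from logged runs.

    load_log: iterable of (table_name, duration_seconds) tuples,
    e.g. collected from your own SSIS execution logging.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for table, duration in load_log:
        totals[table] += duration
        counts[table] += 1
    # one average cost value per table
    return {table: totals[table] / counts[table] for table in totals}

load_log = [("Sales.SalesOrderDetail", 18.0),
            ("Sales.SalesOrderDetail", 22.0),
            ("Person.Person", 1.0)]
print(average_costs(load_log))  # average duration per table
```

However you compute it, the result simply ends up in the Cost column of meta.Tables.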

As pointed out, we will have a stored procedure (or actually, multiple stored procedures) do the magic for us, so let’s also create these:

SP_CreateContainers will simply populate our meta.Container table with one row per container:

CREATE PROCEDURE [dbo].[SP_CreateContainers](@MaxContainers INT)
AS
	BEGIN
	    DECLARE @CurrentContainer INT= 1
	    TRUNCATE TABLE Meta.Container
	    WHILE @CurrentContainer <= @MaxContainers
		   BEGIN
			  INSERT INTO Meta.Container
			  (ContainerId)
			  VALUES
			  (@CurrentContainer)
			   SET @CurrentContainer = @CurrentContainer + 1
		   END
	END

SP_UpdateContainerStats will update the statistics of each container (how many tables it holds and what their total cost is):

CREATE PROCEDURE [dbo].[SP_UpdateContainerStats]
AS
BEGIN
    UPDATE a
    SET Tables = b.Cnt,
        TotalCost = b.Cost
    FROM meta.Container a
    INNER JOIN
        (SELECT container, COUNT(*) AS Cnt, SUM(Cost) AS Cost
         FROM meta.tables
         GROUP BY container) b
    ON a.ContainerID = b.container
END
GO

SP_SetContainers is the one doing the actual work. It will:

– Reset the container assignment currently stored in meta.Tables
– Call SP_CreateContainers and have it create the number of containers that was passed in through its only parameter, @MaxContainers
– Open a cursor that reads all tables from meta.Tables, sorted from highest to lowest cost
– For each table, determine which container currently has the lowest total cost, assign the table to that container, and update the container statistics

Once that is done (whether for 2 or for 2,000 tables), all our tables will have been assigned to a container and their distribution should be fairly equal:

CREATE PROCEDURE [dbo].[SP_SetContainers](@MaxContainers INT)
AS
BEGIN
    UPDATE meta.tables
    SET container = 0

    EXEC [SP_CreateContainers] @MaxContainers

    DECLARE @TableName NVARCHAR(50)
    DECLARE @Cost FLOAT
    DECLARE @NextContainer BIGINT

    DECLARE tbl_Cursor CURSOR FOR
        SELECT TableName, Cost
        FROM meta.tables
        ORDER BY Cost DESC

    OPEN tbl_Cursor
    FETCH NEXT FROM tbl_Cursor INTO @TableName, @Cost

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SELECT @NextContainer = MIN(ContainerID)
        FROM
            (SELECT [ContainerID],
                    [TotalCost],
                    [Tables],
                    MIN([TotalCost]) OVER () AS MinCost
             FROM [meta].[Container]) a
        WHERE [TotalCost] = MinCost

        UPDATE meta.tables
        SET container = @NextContainer
        WHERE TableName = @TableName

        EXEC SP_UpdateContainerStats

        FETCH NEXT FROM tbl_Cursor INTO @TableName, @Cost
    END

    CLOSE tbl_Cursor
    DEALLOCATE tbl_Cursor

    EXEC SP_UpdateContainerStats
END
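What the procedure implements is a greedy "biggest task first" balancing: always put the most expensive remaining table into the container that is currently cheapest. The same logic can be sketched outside of SQL like this (a Python illustration only, not part of the actual solution; the costs are the sample data from above):

```python
def assign_containers(costs, max_containers):
    """Greedy balancing: sort tables by cost descending and always
    assign the next table to the container with the lowest total cost
    (lowest container number wins ties, like MIN(ContainerID) above)."""
    totals = [0.0] * max_containers          # running TotalCost per container
    assignment = {}
    for table, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        target = totals.index(min(totals))   # cheapest container
        assignment[table] = target + 1       # ContainerIDs are 1-based
        totals[target] += cost
    return assignment, totals

costs = {"Person.Person": 1, "Person.PersonPhone": 1,
         "Sales.SalesOrderHeader": 10, "Sales.SalesOrderDetail": 20}
assignment, totals = assign_containers(costs, 2)
print(assignment)  # SalesOrderDetail alone in container 1, the rest in container 2
print(totals)      # [20.0, 12.0]
```

With our four sample tables and two containers, the expensive SalesOrderDetail load gets a container of its own while the three smaller loads share the other one – exactly what EXEC [dbo].[SP_SetContainers] 2 would write into meta.Tables.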

But we will still need our SSIS packages to populate those tables.
Of course, we want to automate the package creation with Biml, so let’s take a look at that.

In C#:

<#@ import namespace="System.Data.SqlClient" #>
<#@ import namespace="System.Data" #>
<# int MaxContainers = 3;
string ConnString = "Data source=localhost; Database=Performance; Integrated Security=SSPI";
SqlConnection conn = new SqlConnection(ConnString);
SqlCommand cmd = new SqlCommand("exec [dbo].[SP_SetContainers] "  + MaxContainers, conn);
conn.Open();
cmd.ExecuteNonQuery();
conn.Close(); #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
        <Package Name="Load_Containers" ConstraintMode="Parallel">
            <Tasks>
                <# for (int Container = 1; Container <= MaxContainers; Container++) { #>
                <Container Name="Load Container <#=Container #>" ConstraintMode="Linear"><Tasks>
                <# foreach (DataRow row in ExternalDataAccess.GetDataTable("Provider=SQLNCLI11;" + ConnString,"select * from meta.tables where container = " + Container).Rows) {  #>
                <Container Name="Load <#= row["TableName"] #>" ConstraintMode="Linear">
                    <Tasks>
                           <!-- Your dataload pattern goes here! -->
                     </Tasks>
                </Container>
                 <# } #>
                 </Tasks></Container>
                <# } #>
             </Tasks>
        </Package>
    </Packages>
</Biml>

Or in VB:

<#@ template language="VB" optionexplicit="False" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ import namespace="System.Data" #>
<# MaxContainers = 3
ConnString = "Data source=localhost; Database=Performance; Integrated Security=SSPI"
Dim conn as new SqlConnection(ConnString)
Dim cmd = new SqlCommand("exec [dbo].[SP_SetContainers] "  & maxcontainers, conn)
conn.Open()
cmd.ExecuteNonQuery()
conn.Close() #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Packages>
        <Package Name="Load_Containers" ConstraintMode="Parallel">
            <Tasks>
                <# for Container = 1 to MaxContainers #>
                <Container Name="Load Container <#=Container #>" ConstraintMode="Linear"><Tasks>
                <# for each row in ExternalDataAccess.GetDataTable("Provider=SQLNCLI11;" + ConnString,"select * from meta.tables where container = " & container).Rows  #>
                <Container Name="Load <#= Row("TableName")#>" ConstraintMode="Linear">
                    <Tasks>
                           <!-- Your dataload pattern goes here! -->
                     </Tasks>
                </Container>
                 <# next #>
                 </Tasks></Container>
                <# next #>
             </Tasks>
        </Package>
    </Packages>
</Biml>
 

As you can see, our Biml code holds a variable that defines how many containers we actually want, runs our stored procedure with that parameter, and eventually builds a package with the defined number of containers.

Each container holds the tables that were assigned to it. As mentioned before, this post is not about load patterns in general, so the actual code to populate the tables is missing.

That’s it!

Any questions or comments? We’d love to hear from you at biml@solisyon.de!

Further Information on Biml in English can be found on our English Biml Page.

Happy Biml’ing!
