What’s in my stage?

When you build an ETL solution, at some point you will most likely feel the need to compare the data between your source and your staging (or data warehouse) database. There may be various reasons for them to be out of sync, such as delta loads, aggregations or added business logic, but one day your phone will ring and whoever is on the other end will tell you that the numbers are wrong. The problem doesn’t necessarily have to be within your ETL process; it might just as well be the report itself. Either way, comparing the actual numbers is a good starting point for the investigation.

This article (originally published on the PASS Community Blog) focuses on SSIS as your orchestrator, but the same principles could obviously also be applied to Azure Data Factory, for example. Also, as we want to keep this task as lightweight as possible, we will, naturally, use Biml to implement it!

Collecting the Data to be Compared

As mentioned previously, we’re trying to keep this very lightweight, so our solution will only collect numerical data resulting from aggregation functions. Whenever we collect current values, we will collect them for all configured columns and tables.

If this is not applicable or sufficient for your environment, you may need to extend the code samples a bit.

We will need two tables: One to configure the data that we want to collect and one for the actual data we’ve collected. Just create those in a separate database (we’ll call it BimlDemo_Compare):

CREATE TABLE [dbo].[Compare_Config](
       [CompareID] [INT] IDENTITY(1,1) NOT NULL,
       [Source_ConnectionName] [NVARCHAR](50) NULL,
       [Source_SchemaName] [NVARCHAR](50) NULL,
       [Source_TableName] [NVARCHAR](50) NULL,
       [Source_ColumnName] [NVARCHAR](50) NULL,
       [Target_ConnectionName] [NVARCHAR](50) NULL,
       [Target_SchemaName] [NVARCHAR](50) NULL,
       [Target_TableName] [NVARCHAR](50) NULL,
       [Target_ColumnName] [NVARCHAR](50) NULL,
       [Aggregation] [NVARCHAR](50) NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Compare_Log](
       [CollectionID] [INT] IDENTITY(1,1) NOT NULL,
       [CompareID] [INT] NOT NULL,
       [Step] [NVARCHAR](50) NOT NULL,
       [ConnectionType] [NVARCHAR](50) NOT NULL,
       [TimeStamp] [DATETIME] NOT NULL,
       [ReturnValue] [NUMERIC](28, 12) NOT NULL
) ON [PRIMARY]
GO

Let’s add some sample data to our config table. If you don’t have the AdventureWorks2017 databases yet, you can download them at https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure.

INSERT INTO [dbo].[Compare_Config]
      ([Source_ConnectionName]
      ,[Source_SchemaName]
      ,[Source_TableName]
      ,[Source_ColumnName]
      ,[Target_ConnectionName]
      ,[Target_SchemaName]
      ,[Target_TableName]
      ,[Target_ColumnName]
      ,[Aggregation])
    VALUES
      ('AW2017_SRC'
      ,'Sales'
      ,'SalesOrderDetail'
      ,'LineTotal'
      ,'AW2017_DWH'
      ,'dbo'
      ,'FactInternetSales'
      ,'SalesAmount'
      ,'SUM')

INSERT INTO [dbo].[Compare_Config]
      ([Source_ConnectionName]
      ,[Source_SchemaName]
      ,[Source_TableName]
      ,[Source_ColumnName]
      ,[Target_ConnectionName]
      ,[Target_SchemaName]
      ,[Target_TableName]
      ,[Target_ColumnName]
      ,[Aggregation])
    VALUES
      ('AW2017_SRC'
      ,'Sales'
      ,'Customer'
      ,'AccountNumber'
      ,'AW2017_DWH'
      ,'dbo'
      ,'DimCustomer'
      ,'CustomerKey'
      ,'COUNT')

These records assume that you have already created three connections in your Biml code: Control (pointing to the database you’ve just created), AW2017_SRC (pointing to an AdventureWorks2017 database) and AW2017_DWH (pointing to an AdventureWorksDW2017 database).
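In case you need a reference, a minimal sketch of such a connections file could look like this (the provider, server and security settings below are just assumptions; adjust the connection strings to your environment):

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <!-- Hypothetical connection strings - replace server and security settings with your own -->
    <OleDbConnection Name="Control" ConnectionString="Provider=SQLNCLI11;Server=localhost;Initial Catalog=BimlDemo_Compare;Integrated Security=SSPI;" />
    <OleDbConnection Name="AW2017_SRC" ConnectionString="Provider=SQLNCLI11;Server=localhost;Initial Catalog=AdventureWorks2017;Integrated Security=SSPI;" />
    <OleDbConnection Name="AW2017_DWH" ConnectionString="Provider=SQLNCLI11;Server=localhost;Initial Catalog=AdventureWorksDW2017;Integrated Security=SSPI;" />
  </Connections>
</Biml>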

Now for the Biml part, we’ll just add a new file “CBS_GetLogContainer.biml” which can later be used with CallBimlScript:

<#@ property type="string" name="ct" #>
<#@ property type="string" name="StepName" #>
<#@ import namespace="System.Data" #>
<# // Read the comparison configuration for the requested connection type (Source or Target)
   var connMetaData = RootNode.Connections["Control"].RenderedConnectionString;
   var sqlStr = "SELECT [CompareID],[" + ct + "_ConnectionName] ConnectionName,"
              + "[" + ct + "_SchemaName] SchemaName,[" + ct + "_TableName] TableName,"
              + "[" + ct + "_ColumnName] ColumnName,[Aggregation] FROM [Compare_Config]";
   DataTable DT = ExternalDataAccess.GetDataTable(connMetaData, sqlStr); #>
<Container Name="Get Compare Data - <#= ct #> - <#= StepName #>" ConstraintMode="Linear">
  <Tasks>
    <# foreach (DataRow r in DT.Rows) { #>
    <Dataflow Name="<#= r["CompareID"] #> - <#= r["Aggregation"] #> - <#= r["ConnectionName"] #> - <#= r["SchemaName"] #> - <#= r["TableName"] #> - <#= r["ColumnName"] #>">
      <Transformations>
        <OleDbSource Name="SRC" ConnectionName="<#= r["ConnectionName"] #>">
          <DirectInput>
            SELECT <#= r["CompareID"] #> CompareID,
                   CAST('<#= StepName #>' AS NVARCHAR(50)) Step,
                   CAST('<#= ct #>' AS NVARCHAR(50)) ConnectionType,
                   GETDATE() TimeStamp,
                   <#= r["Aggregation"] #>([<#= r["ColumnName"] #>]) ReturnValue
            FROM [<#= r["SchemaName"] #>].[<#= r["TableName"] #>]
          </DirectInput>
        </OleDbSource>
        <OleDbDestination Name="Dest" ConnectionName="Control">
          <ExternalTableOutput Table="[dbo].[Compare_Log]" />
        </OleDbDestination>
      </Transformations>
    </Dataflow>
    <# } #>
  </Tasks>
</Container>

The file takes two parameters: a connection type (Source or Target) and a step name. It will then return a container with a dataflow for every value that should be compared, writing its current value, together with the name of the step, the connection type and a timestamp, to the log table.
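To illustrate, for the first sample configuration row (assuming it was assigned CompareID 1 and the file was called with "Source" and "Pre-Load"), the generated source query would look roughly like this:

SELECT 1 CompareID,
       CAST('Pre-Load' AS NVARCHAR(50)) Step,
       CAST('Source' AS NVARCHAR(50)) ConnectionType,
       GETDATE() TimeStamp,
       SUM([LineTotal]) ReturnValue
FROM [Sales].[SalesOrderDetail]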

How Can I Integrate This Into My Biml Solution?

As we’re using CallBimlScript, all you need to do is call this new file whenever you want to collect the aggregations:

<#= CallBimlScript("CBS_GetLogContainer.biml",ConnectionType,StepName) #>

A full Biml file could look like this:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package Name="MyPackage" ConstraintMode="Linear" 
ProtectionLevel="EncryptSensitiveWithUserKey">
      <Tasks>
        <!-- Inital data collection from source -->
        <#= CallBimlScript("CBS_GetLogContainer.biml","Source","Pre-Load") #>

        <!-- Your dataflow goes here -->

        <!-- Getting data from source again -->
        <#= CallBimlScript("CBS_GetLogContainer.biml","Source","Post-Load") #>
        
        <!-- Your business logic goes here-->

        <!-- Getting Data from DWH -->
        <#= CallBimlScript("CBS_GetLogContainer.biml","Target","Post-Business Logic") #>
      </Tasks>
    </Package>
  </Packages>
</Biml>

How Can I Get Insights From This?

Well, that depends on your needs. If you simply want to get the latest value for each combination of comparison, step and connection type, a simple query like this might be sufficient:

SELECT a.*
FROM [dbo].[Compare_Log] a
  INNER JOIN (SELECT MAX(CollectionID) CollectionID
              FROM [dbo].[Compare_Log]
              GROUP BY CompareID, Step, ConnectionType) b
    ON a.CollectionID = b.CollectionID
ORDER BY a.CompareID, a.CollectionID;

But of course, you can run any kind of report over your log table, which could also be used to illustrate how your data is growing over time.
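For example, to see the latest source and target values side by side per comparison, together with their difference, a sketch like the following could work (it simply picks the most recent row per comparison and connection type, regardless of step; adjust it to your needs):

-- Latest Source vs. Target value per comparison (a sketch, not part of the original solution)
WITH LatestValues AS (
    SELECT CompareID, ConnectionType, ReturnValue,
           ROW_NUMBER() OVER (PARTITION BY CompareID, ConnectionType
                              ORDER BY CollectionID DESC) RowNo
    FROM [dbo].[Compare_Log]
)
SELECT s.CompareID,
       s.ReturnValue SourceValue,
       t.ReturnValue TargetValue,
       s.ReturnValue - t.ReturnValue Delta
FROM LatestValues s
  INNER JOIN LatestValues t ON s.CompareID = t.CompareID
                           AND t.ConnectionType = 'Target'
                           AND t.RowNo = 1
WHERE s.ConnectionType = 'Source'
  AND s.RowNo = 1;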

I hope this example showed you, once again, the beauty of Biml: with just one line of code added to your files here and there, you gain functionality and value that would otherwise be so much manual work that you’d probably never even consider it!

If you have questions, please feel free to reach out to me on Twitter: @bweissman – my DMs are open.