Loading tables based on foreign key topology using Biml – Part 1: Linear loads

Loading tables based on foreign key topology using Biml – Part 1: Linear loads
Diesen Artikel finden Sie hier auch in deutscher Sprache.

All our previous posts were running data loads in parallel, ignoring potential foreign key constraints. But in real life scenarios, your datawarehouse may actually have tables refering to each other using such, meaning that it is crucial to create and populate them in the right order.

In this blog post, we’ll actually solve 2 issues at once: We’ll provide a list of tables, will then identify any tables that our listed tables rely on (recursively) and will then create and load them in the right order.

In this sample, we’ll use AdventureWorksDW2014 as our source and transfer the FactInternetSales-table as well as all tables it is connected to through foreign key constraints. Eventually, we will create all these tables including the constraints in a new database AdventureWorksDW2014_SalesOnly (sorting them so we get no foreign key violations) and eventually populate them with data.

Biml does not provide a topology sort out of the box so we will start with a little extension method called "TopoSort" which we’ll use to achieve this. It basically loops through a list of tables, checks for each table if there are tables that need to be created before and then adds them to a new list in this specific order:

Imports System.Collections.Generic
Imports System.Linq
Imports System.Linq.Expressions
Imports Varigence.Languages.Biml.Table
Imports System.Runtime.CompilerServices

Module TopologySort	
 	Public OutputPath As String = "C:\SSIS\TopologySort"  
	Public LoadedTables As New List (Of String) 
    Public LoadableTables As New List (Of AstTableNode)
    Public Level As Integer = 0
	<Extension()> 
	Public Function TopoSort (tables As ICollection (Of AstTableNode)) As ICollection (Of AstTableNode)
		Dim visitedList As  New List (Of AstTableNode)
		Dim outputList As  New List (Of AstTableNode)
		For Each tbl As asttablenode In tables 
			TopoVisit(tbl, outputList, visitedList)
		Next 
		Return outputList
	End Function

	Private Function TopoVisit( node As AstTableNode,  outputList As List (Of AstTableNode),  visitedList As List (Of AstTableNode))
		If Not visitedList.Contains(node)  Then
			visitedList.Add(node)
			For Each dependentTable As AstTableNode In node.Columns.OfType(Of AstTableColumnTableReferenceNode).Select(Function(c) c.ForeignTable)
				TopoVisit(dependentTable, outputList, visitedList)
	      	Next	
			For Each dependentTable As AstTableNode In node.Columns.OfType(Of AstMultipleColumnTableReferenceNode).Select(Function(c) c.ForeignTable)
				TopoVisit(dependentTable, outputList, visitedList)
	      	Next	
			outputList.Add(node)
	  	End If
	End Function	
End Module

To be able to actually access our data, we’ll need to define 2 connections, one for the source and one for our new target database. This will be done using a static Biml file:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <Connection Name="AW_DW" ConnectionString="Provider=SQLNCLI11;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2014" />
		<Connection Name="AW_DW_SalesOnly" ConnectionString="Provider=SQLNCLI11;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2014_SalesOnly" />
    </Connections>
	<Databases>
		<Database Name="AW_DW_SalesOnly" ConnectionName="AW_DW_SalesOnly"></Database>
    </Databases>
	<Schemas>
		<Schema Name="dbo" DatabaseName="AW_DW_SalesOnly"></Schema>
    </Schemas>
</Biml>

Now that we’ve got our connections defined, we can retrieve our metadata. As mentioned before, this example will filter on FactInternetSales but feel free to add other tables, change it to another table or don’t add any filter at all.

In general, the logic works as follows:

– Provide a list (which can be empty) of tables that you want to retrieve from your source
– Get the complete database schema of your source
– For each table in your list, check if it has a foreign key constraint – if it does, add the referenced table to your list of tables
– Keep repeating that step until there are no foreign key constraints left, whose referenced tables are not in your list
– Pass the final list to GetDatabaseSchema so you can simply loop through this result
– Add a table-tag for all tables in your import results, change the schema to your target schema

<#@ template language="VB" tier="2"#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Tables>
        <# dim tablelist as new list (of string)
	    dim tablelist_checked as new list (of string)
		tablelist.add ("FactInternetSales")	   	   			
		dim importResult as Importresults = rootnode.connections("AW_DW").GetDatabaseSchema()
		do while importresult.tablenodes.where(function(e) tablelist.Contains(e.name) and not tablelist_checked.Contains(e.name)).where(function(r) r.columns.OfType(Of AstTableColumnTableReferenceNode)().Any).Count > 0
			for each t as asttablenode in importresult.tablenodes.where(function(e) tablelist.Contains(e.name) and not tablelist_checked.Contains(e.name)).where(function(r) r.columns.OfType(Of AstTableColumnTableReferenceNode)().Any)
				for each r as asttablecolumntablereferencenode in t.columns.ofType(of AstTableColumnTableReferenceNode)() 
					tablelist.add(r.foreigntable.name)
				next
				tablelist_checked.add(t.name)		
			next 
		loop
		importresult = rootnode.connections("AW_DW").GetDatabaseSchema(nothing,tablelist,ImportOptions.ExcludeViews)
		for each t as asttablenode in importresult.tablenodes 
		t.schema = rootnode.schemas(0)#>
        <#= t.getbiml #>
        <# next #>
    </Tables>
</Biml>

Now, that we have set up our metadata, we can go ahead and create our tables in the target database. To make sure, we don’t run into any foreign key issues, we will drop them (if they exist) first (using TopoSort.Reverse), starting with those tables that are not reference by and foreign keys. Then we will loop through the list of tables again, this time starting with the tables holding the primary keys, and create them using the GetDropAndCreateDDL method:

<#@ template tier="3" language="VB"#> 
<#@ code file="../Code/TopologySort.vb" #>
<#@ import namespace="TopologySort" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<Packages>
		<Package Name="01_Create" ConstraintMode="Linear" PackageSubpath="<#= OutputPath #>">
			<Tasks>
				<ExecuteSQL Name="DROP Tables" ConnectionName="AW_DW_SalesOnly">
					<DirectInput>
					<# for each table as asttablenode in RootNode.Tables.TopoSort.reverse  #>
						IF EXISTS (SELECT * from sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[<#= table.name #>]') AND type IN (N'U'))
						DROP TABLE [dbo].[<#= table.name #>]
						GO
					<# next #>
                    </DirectInput>
				</ExecuteSQL>
				<ExecuteSQL Name="CREATE Tables" ConnectionName="AW_DW_SalesOnly">
					<DirectInput>
					<# for each table as asttablenode in RootNode.Tables.TopoSort  #>
						<#= table.GetDropAndCreateDDL() #>
					<# next #>
                    </DirectInput>
				</ExecuteSQL>
            </Tasks>
        </Package>
    </Packages>	
</Biml>

We will just recreate all tables every time before we populate them, so we don’t have to deal with slow DELETE statements – of course, DELETE would work using the same logic. After cleaning up our tables, we can just load them by creating a dataflow for each of them, using the same sorting that we’ve used to actually create them:

<#@ template tier="4" language="VB"#> 
<#@ code file="../Code/TopologySort.vb" #>
<#@ import namespace="TopologySort" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Packages>
        <Package Name="02_Load_Linear" ConstraintMode="Linear" PackageSubpath="<#= OutputPath #>">
            <Tasks>
                <ExecutePackage Name="Recreate Tables">
                    <ExternalFile ExternalFilePath="<#= OutputPath #>\01_Create.dtsx"/>
                </ExecutePackage>
                <# for each t as asttablenode in rootnode.tables.TopoSort #>
                <#= CallBimlScript("06_Dataflow.biml",t,nothing) #>
                <# next #>
            </Tasks>
        </Package>
    </Packages>
</Biml>
<#@ template language="VB" designerbimlpath="Biml/Packages/Package/Tasks" #>
<#@ property name="tbl" type="AstTableNode" #> 
<#@ property name="LoadedTables" type="List (of String)" #> 
<# if not loadedtables is nothing then LoadedTables.add (tbl.name) #>
<Dataflow Name="Load <#=tbl.Name #>">
    <Transformations>
        <OleDbSource Name="Get Rows" ConnectionName="AW_DW">
            <DirectInput>Select <#= tbl.GetColumnList() #> from <#=tbl.SchemaQualifiedName#></DirectInput>
        </OleDbSource>
        <OleDbDestination Name="Set Rows" ConnectionName="AW_DW_SalesOnly" KeepIdentity="true">
            <TableOutput TableName="<#=tbl.ScopedName#>" />
        </OleDbDestination>
    </Transformations>
</Dataflow>
 

This gives us the ability to define any list of tables and transfer them to another database including all it’s depending tables. Pretty cool already 🙂

Now the only downside to this is: The load is running sequential – this might potentially take much more time than it actually has to. In Part 2 of this post, we’ll explain how to actually load these tables in parallel!

Any questions or comments? We’d love to hear from you at biml@solisyon.de!

Further Information on Biml in English can be found on our English Biml Page.

Happy Biml’ing!

Facebook