Spoiler Alert: This post will have code in VB – and C#. :)
Have you heard about data virtualization using SQL Server 2019 Big Data Clusters? If not, you should definitely go check it out.
It is basically the next version of polybase and now supports data sources such as other SQL Servers and Oracle. I will not go into the details of the features itself here though.
If you have questions on Big Data Clusters (Polybase or Spark Cluster), feel free to reach out!
If you have started playing with polybase, you probably figured out by now, that – as awesome as it is – it’s still a bit of a pain to set it up and maintain external tables. There is a wizard in Azure Data Studio but it’s still under development, especially from a usability standpoint.
So what can be done about that? Well, we effectively looking for an easy way to read metadata from a relational database and automate T-SQL to mirror that metadata. HELLO?! Perfect usecase for Biml – which is NOT just for SSIS.
Let’s take a look at how that can be done…
To get started, we should probably have some kind of metadata repository to maintain our polybase sources.
This example will only work with SQL Server as your source, it will not encrypt your credentials and will always mirror a full schema.
While we probably agree, that these are limitations that you wouldn’t want to run into in production, it should still be sufficient to showcase how Biml is a perfect fit for this requirement.
If you know some basic Biml, it should be easy to actually tweak the code to add Oracle or Terradata support as well as a list of tables to be added (then again, since no actual data movement is happening, it may not even be a big deal to mirror a whole schema).
Just create a new database and run the following T-SQL script which will create and populate our simple metadata repository for this demo.
As you can see, I am pointing to an Azure SQL Database. You don’t have to do that – any SQL Server will do – but given how easy and cheap it is to set up an AdventureWorks there, it seemed like the obvious choice.
CREATE TABLE [dbo].[Datasources]( [DataSource] [nvarchar](50) NULL, [Server] [nvarchar](50) NULL, [UserID] [nvarchar](50) NULL, [Password] [nvarchar](50) NULL, [SRC_DB] [nvarchar](50) NULL, [SRC_Schema] [nvarchar](50) NULL, [DEST_Schema] [nvarchar](50) NULL ) ON [PRIMARY] GO INSERT INTO Datasources VALUES ('AW','<yourinstance>.database.windows.net','<yourUser>','<yourPassword>','AdventureWorks','SalesLT','dbo')
The solution also expects a master key to be configured already. As this is a one-time task, there is no good reason to automate it.
Since we’re using Biml to create T-SQL and everybody has it’s own methodology on how to actually run that, we’ll just use Biml to create on corresponding .sql file in c:\Temp\Polybase\ for each record in our metadata table.
To do so, we provide two connection strings (in this example, they are the same but this allows to separate metadata from the database holding the metadata): One for the metadata database and one for the database where the polybase tables should be created/maintained.
Using the metadata connection, we will just load the metadata into a data table, loop through it and pass the contents of each row as well as the connection string for the target database to another Biml file which we’ll adress using CallBimlScript:
<#@ template language="VB" optionexplicit="False"#> <# MetaString = "Data Source=.;Initial Catalog=BigDataFun;Provider=SQLNCLI11.1;Integrated Security=SSPI;" TargetString = "Data Source=.;Initial Catalog=BigDataFun;Provider=SQLNCLI11.1;Integrated Security=SSPI;" ExternalConnections = ExternalDataAccess.GetDataTable(MetaString,"SELECT * FROM Datasources") for each conn in ExternalConnections.Rows System.IO.File.WriteAllText("C:\Temp\Polybase\" + conn("datasource") + "_" + conn("SRC_Schema") + ".sql", CallBimlScript("02_PolybaseWriter_VB.biml",conn,TargetString)) next #> <Biml xmlns="http://schemas.varigence.com/biml.xsd"/>
That was easy – but not magic.
The interesting part is happening in the called Biml file “02_PolybaseWriter_VB.biml”.
Let’s go through that step by step:
First, we import some namespaces that we’ll use later and declare the two parameters, that will be passed through – easy:
<#@ template language="VB" optionexplicit="false" #> <#@ import namespace="Varigence.Biml.CoreLowerer.TSqlEmitter" #> <#@ import namespace="System.Data" #> <#@ property name="conn" type="DataRow" #> <#@ property name="TargetString" type="String" #> -- Syncing schema <#= conn("SRC_Schema") #> in <#= conn("SRC_DB")#> to <#= conn("DEST_Schema")#> -- This script assumes that a master key has been set
External tables require as master key (as mentioned before, this script expects that key to be set up before), a credential and a datasource that references a credential.
Basically all we do with regards to the credential is to take some T-SQL code that will either run a CREATE or an ALTER statement and inject some dynamic elements from our DataRow into it:
-- CREATE/ALTER CREDENTIAL IF NOT EXISTS(select * from sys.database_credentials WHERE NAME = '<#= conn("DataSource")#>') BEGIN CREATE DATABASE SCOPED CREDENTIAL [<#= conn("DataSource")#>] WITH IDENTITY = '<#= conn("UserID")#>', SECRET = '<#= conn("Password")#>'; END ELSE BEGIN ALTER DATABASE SCOPED CREDENTIAL [<#= conn("DataSource")#>] WITH IDENTITY = '<#= conn("UserID")#>', SECRET = '<#= conn("Password")#>'; END GO
For the datasource, we’re doing the same thing:
-- CREATE DATASOURCE IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE NAME = '<#= conn("DataSource")#>') BEGIN CREATE EXTERNAL DATA SOURCE [<#= conn("DataSource")#>] WITH (LOCATION = 'sqlserver://<#= conn("Server")#>', CREDENTIAL = [<#= conn("DataSource")#>]); END ELSE BEGIN ALTER EXTERNAL DATA SOURCE [<#= conn("DataSource")#>] SET LOCATION = N'sqlserver://<#= conn("Server")#>', CREDENTIAL = [<#= conn("DataSource")#>] END GO
As a next measure, we drop all existing external tables linked to this connection. That may seem a bit harsh but as external tables currently don’t support indexes (only statistics) or much else that would be worth preserving, it seemed like the easiest way to handle changes:
-- DROP EXISTING TABLES <# DropSQL = "SELECT schem.name SchemaName,tbl.Name TableName,object_id FROM sys.external_tables tbl INNER JOIN sys.schemas schem on tbl.schema_id = schem.schema_id INNER JOIN sys.external_data_sources ds on tbl.data_source_id = ds.data_source_id WHERE ds.name = '" & conn("DataSource") & "'" ExistingTables = ExternalDataAccess.GetDataTable(TargetString,DropSQL) for each tbl in ExistingTables.Rows #> IF EXISTS(select * from sys.external_tables WHERE object_id = <#= tbl("object_id")#>) BEGIN DROP EXTERNAL TABLE [<#= tbl("SchemaName")#>].[<#= tbl("TableName")#>] END GO <# next #>
Now for the actual fun part. To figure out which tables actually exist and which columns and datatypes they consist of, we use a standard Biml function: ImportDB. You could of course also use any other function like GetDatabaseSchema or come up with your very own way of extracting the metadata.
We will issue a CREATE EXTERNAL TABLE statement for every table but before we can actually do so, we need to do some cleaning.
EXTERNAL TABLES currently don’t support a couple of data types like for example XML or (max) (e.g. varchar(max)). To overcome this issue, we will just loop through all columns affected by such a restriction and change it’s data type in the Biml object.
Once we have all incompatibilities worked out, we can use a Biml method called TSqlTypeTranslator.Translate to come up with the T-SQL for each column. We will just union all these with a comma and a line break and voila: Fancy automated T-SQL!
-- CREATE TABLES <# Src_ConnStr= "Data Source=" + conn("Server") + ";Initial Catalog=" + conn("SRC_DB") + ";Provider=SQLNCLI11.1;user id=" + conn("UserID") + ";Password=" + conn("password") + ";" importResult = SchemaManager.CreateConnectionNode("Source", Src_ConnStr).ImportDB(conn("SRC_Schema"),nothing,nothing) for each tbl as AstTableNode in importResult.TableNodes for each col in tbl.Columns.Where(function(c) c.DataType = dbtype.Xml) col.datatype = dbtype.AnsiString col.length = 8000 next for each col in tbl.Columns.Where(function(c) (c.DataType = dbtype.string) and c.length= -1) col.length = 4000 next for each col in tbl.Columns.Where(function(c) (c.DataType = dbtype.AnsiString or c.DataType = dbtype.binary) and c.length= -1) col.length = 8000 next#> IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE NAME = '<#=tbl.Name#>') BEGIN CREATE EXTERNAL TABLE [<#= conn("DEST_Schema") #>].[<#=tbl.Name#>] ( <#=string.Join("," + vbCrLf,tbl.Columns.Select(function(i) i.Name + " " + TSqlTypeTranslator.Translate(i.DataType, i.Length, i.Precision, i.Scale, i.CustomType) + if(i.IsNullable," NULL "," NOT NULL "))) #> ) WITH (LOCATION = '[<#= conn("SRC_DB") #>].<#= tbl.SchemaQualifiedName#>', DATA_SOURCE = [<#= conn("DataSource") #>]); END GO <# next #>
And this is what the full file looks like:
<#@ template language="VB" optionexplicit="false" #> <#@ import namespace="Varigence.Biml.CoreLowerer.TSqlEmitter" #> <#@ import namespace="System.Data" #> <#@ property name="conn" type="DataRow" #> <#@ property name="TargetString" type="String" #> -- Syncing schema <#= conn("SRC_Schema") #> in <#= conn("SRC_DB")#> to <#= conn("DEST_Schema")#> -- This script assumes that a master key has been set -- CREATE/ALTER CREDENTIAL IF NOT EXISTS(select * from sys.database_credentials WHERE NAME = '<#= conn("DataSource")#>') BEGIN CREATE DATABASE SCOPED CREDENTIAL [<#= conn("DataSource")#>] WITH IDENTITY = '<#= conn("UserID")#>', SECRET = '<#= conn("Password")#>'; END ELSE BEGIN ALTER DATABASE SCOPED CREDENTIAL [<#= conn("DataSource")#>] WITH IDENTITY = '<#= conn("UserID")#>', SECRET = '<#= conn("Password")#>'; END GO -- CREATE DATASOURCE IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE NAME = '<#= conn("DataSource")#>') BEGIN CREATE EXTERNAL DATA SOURCE [<#= conn("DataSource")#>] WITH (LOCATION = 'sqlserver://<#= conn("Server")#>', CREDENTIAL = [<#= conn("DataSource")#>]); END ELSE BEGIN ALTER EXTERNAL DATA SOURCE [<#= conn("DataSource")#>] SET LOCATION = N'sqlserver://<#= conn("Server")#>', CREDENTIAL = [<#= conn("DataSource")#>] END GO -- DROP EXISTING TABLES <# DropSQL = "SELECT schem.name SchemaName,tbl.Name TableName,object_id FROM sys.external_tables tbl INNER JOIN sys.schemas schem on tbl.schema_id = schem.schema_id INNER JOIN sys.external_data_sources ds on tbl.data_source_id = ds.data_source_id WHERE ds.name = '" & conn("DataSource") & "'" ExistingTables = ExternalDataAccess.GetDataTable(TargetString,DropSQL) for each tbl in ExistingTables.Rows #> IF EXISTS(select * from sys.external_tables WHERE object_id = <#= tbl("object_id")#>) BEGIN DROP EXTERNAL TABLE [<#= tbl("SchemaName")#>].[<#= tbl("TableName")#>] END GO <# next #> -- CREATE TABLES <# Src_ConnStr= "Data Source=" + conn("Server") + ";Initial Catalog=" + conn("SRC_DB") + ";Provider=SQLNCLI11.1;user id=" + conn("UserID") + ";Password=" + conn("password") + ";" importResult = SchemaManager.CreateConnectionNode("Source", Src_ConnStr).ImportDB(conn("SRC_Schema"),nothing,nothing) for each tbl as AstTableNode in importResult.TableNodes for each col in tbl.Columns.Where(function(c) c.DataType = dbtype.Xml) col.datatype = dbtype.AnsiString col.length = 8000 next for each col in tbl.Columns.Where(function(c) (c.DataType = dbtype.string) and c.length= -1) col.length = 4000 next for each col in tbl.Columns.Where(function(c) (c.DataType = dbtype.AnsiString or c.DataType = dbtype.binary) and c.length= -1) col.length = 8000 next#> IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE NAME = '<#=tbl.Name#>') BEGIN CREATE EXTERNAL TABLE [<#= conn("DEST_Schema") #>].[<#=tbl.Name#>] ( <#=string.Join("," + vbCrLf,tbl.Columns.Select(function(i) i.Name + " " + TSqlTypeTranslator.Translate(i.DataType, i.Length, i.Precision, i.Scale, i.CustomType) + if(i.IsNullable," NULL "," NOT NULL "))) #> ) WITH (LOCATION = '[<#= conn("SRC_DB") #>].<#= tbl.SchemaQualifiedName#>', DATA_SOURCE = [<#= conn("DataSource") #>]); END GO <# next #>
That’s it – whenever you update your .sql files and run/deploy them, your external tables will be synced with the underlying source. As easy as that.
Oh, and as promised, since it’s the time of giving – here is the whole thing again. In C#. Please don’t tell anyone…
<#@ import namespace="System.Data"#> <# string MetaString = "Data Source=.;Initial Catalog=BigDataFun;Provider=SQLNCLI11.1;Integrated Security=SSPI;"; string TargetString = "Data Source=.;Initial Catalog=BigDataFun;Provider=SQLNCLI11.1;Integrated Security=SSPI;"; DataTable ExternalConnections = ExternalDataAccess.GetDataTable(MetaString,"SELECT * FROM Datasources"); foreach (DataRow conn in ExternalConnections.Rows) { System.IO.File.WriteAllText(@"C:\Temp\Polybase\" + conn ["datasource"] + "_" + conn["SRC_Schema"] + ".sql", CallBimlScript("12_PolybaseWriter_C.biml",conn,TargetString)); } #> <Biml xmlns="http://schemas.varigence.com/biml.xsd"/>
<#@ import namespace="Varigence.Biml.CoreLowerer.TSqlEmitter" #> <#@ import namespace="System.Data" #> <#@ property name="conn" type="DataRow" #> <#@ property name="TargetString" type="String" #> -- Syncing schema <#= conn["SRC_Schema"] #> in <#= conn["SRC_DB"]#> to <#= conn["DEST_Schema"]#> -- This script assumes that a master key has been set -- CREATE/ALTER CREDENTIAL IF NOT EXISTS(select * from sys.database_credentials WHERE NAME = '<#= conn["DataSource"]#>') BEGIN CREATE DATABASE SCOPED CREDENTIAL [<#= conn["DataSource"]#>] WITH IDENTITY = '<#= conn["UserID"]#>', SECRET = '<#= conn["Password"]#>'; END ELSE BEGIN ALTER DATABASE SCOPED CREDENTIAL [<#= conn["DataSource"]#>] WITH IDENTITY = '<#= conn["UserID"]#>', SECRET = '<#= conn["Password"]#>'; END GO -- CREATE DATASOURCE IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE NAME = '<#= conn["DataSource"]#>') BEGIN CREATE EXTERNAL DATA SOURCE [<#= conn["DataSource"]#>] WITH (LOCATION = 'sqlserver://<#= conn["Server"]#>', CREDENTIAL = [<#= conn["DataSource"]#>]); END ELSE BEGIN ALTER EXTERNAL DATA SOURCE [<#= conn["DataSource"]#>] SET LOCATION = N'sqlserver://<#= conn["Server"]#>', CREDENTIAL = [<#= conn["DataSource"]#>] END GO -- DROP EXISTING TABLES <# string DropSQL = "SELECT schem.name SchemaName,tbl.Name TableName,object_id FROM sys.external_tables tbl INNER JOIN sys.schemas schem on tbl.schema_id = schem.schema_id INNER JOIN sys.external_data_sources ds on tbl.data_source_id = ds.data_source_id WHERE ds.name = '" + conn["DataSource"] + "'";DataTable ExistingTables = ExternalDataAccess.GetDataTable(TargetString,DropSQL); foreach (DataRow tbl in ExistingTables.Rows) { #> IF EXISTS(select * from sys.external_tables WHERE object_id = <#= tbl["object_id"]#>) BEGIN DROP EXTERNAL TABLE [<#= tbl["SchemaName"]#>].[<#= tbl["TableName"]#>] END GO <# } #> -- CREATE TABLES <# string Src_ConnStr= "Data Source=" + conn["Server"] + ";Initial Catalog=" + conn["SRC_DB"] + ";Provider=SQLNCLI11.1;user id=" + conn["UserID"] + ";Password=" + conn["password"] + ";"; string SRC_Schema = conn["SRC_Schema"] + ""; var srcMeta = SchemaManager.CreateConnectionNode("Source", Src_ConnStr).ImportDB(SRC_Schema,null,ImportOptions.None); foreach (AstTableNode tbl in srcMeta.TableNodes) { foreach (AstTableColumnNode col in tbl.Columns.Where(c => c.DataType == DbType.Xml)) { col.DataType = DbType.AnsiString; col.Length = 8000; } foreach (AstTableColumnNode col in tbl.Columns.Where(c => (c.DataType == DbType.String) && (c.Length == -1))) { col.Length = 4000; } foreach (AstTableColumnNode col in tbl.Columns.Where(c => (c.DataType == DbType.AnsiString || c.DataType == DbType.Binary) & c.Length == -1)) { col.Length = 8000; } #> IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE NAME = '<#=tbl.Name#>') BEGIN CREATE EXTERNAL TABLE [<#= conn["DEST_Schema"] #>].[<#=tbl.Name#>] ( <#=string.Join(",\n",tbl.Columns.Select(i => i.Name + " " + TSqlTypeTranslator.Translate(i.DataType, i.Length, i.Precision, i.Scale, i.CustomType) + (i.IsNullable ? " NULL" : " NOT NULL")))#> ) WITH (LOCATION = '[<#= conn["SRC_DB"] #>].<#= tbl.SchemaQualifiedName#>', DATA_SOURCE = [<#= conn["DataSource"] #>]); END GO <# } #>
Any questions or comments? We’d love to hear from you at !
Further Information on Biml in English can be found on our English Biml Page.
Happy Biml’ing – and big data custering :)!
[…] Ben Weissman combines two of my favorite things: […]
[…] Using Biml to automagically keep your external polybase tables in sync! Ben (T) presents a way of reducing the pain of setting up and maintain external tables. […]