Deep dive on Data Discovery and Classification utility in SSMS 17.5

SQL Server Management Studio 17.5 was released recently with several really interesting fixes and functionalities (release notes are here). One of the things, included in this version is the Data Discovery and Classification – a tool in SQL Server Management Studio that can discover and label columns in your database. Basically, it looks like a single screen wizard, which scans a certain database and based on some logic, suggests labels for the columns (which you can change) and ultimately write them to the database for further reporting and assessment needs. I am not going to give you the basics of this utility – if you want to learn more on how to get started with Data Discovery and Classification in SSMS, you can do so here. My idea in this post is to try and get under the hood and try to find answers for the following.:

  • What is the algorithm for column classification
  • How is SSMS storing the classification
  • How can I summarize the data for all databases on my instance?

So, let’s get started!

Preparing the environment

For this you will need SSMS 17.5 and AdventureWorks Database

I will be using XEvent Profiler in SSMS to try and trace what are the queries, going against my DB, when I run the Data Discovery and Classification tool.

What lies behind the classification algorithm

So, let’s find out how Microsoft is deciding which column what kind of data it has.

I am starting a Standard Trace:

Then simply run the Classify Data utility:

Under the hood there are huge amount of metadata queries, but if you are patient enough () you will see a query that looks like this:

Cool! That looks like something that might help me with the task.

Complete query you can find here.

This is obviously the algorithm that evaluates and suggests what labels to assign to the columns in your database. The important part is the table @Dictionary, defined in the beginning of the query. For ease of use, I compiled a table, where you can find all predefined Information Types and Classification Labels by Microsoft. A short sample you can find below. If you need the complete set, please follow this link.

That is more than 350 patterns which can cover a lot of scenarios. However:

  • Those patterns are fixed. Yes, they cover several languages, such as English, Spanish, German, French, but still what should the rest of the world do? Like me, who live in Bulgaria and sometimes use Bulgarian for column names.
  • The label of a column can be changed, but you cannot define your own labels.

Well, yes, you can run the classification query on your own altering the @Dictionary table, but still you won’t get the ability to store the classification, which is kind of the main point of the whole things, isn’t it! So I decided to dig deeper and in the next sections I found how to store the information the same way SSMS is doing it and more importantly, I can create reports on those labels, that are not the out-of-the box ones.

Storing the classification labels

Now we know what the algorithm for determining the labels and information type for column in a database is. As we clarified, you do not have the ability to add your own Information Types or Sensitivity labels. But let’s see if we can implement something custom, reusing the technique of Microsoft for storing the information type and labels.

So, here’s how the labelling screen looks like:

On the left part you have all discovered sensitive columns, the Information Type (can be changed) and the Sensitivity Label (can be changed). On the right, you can define classification and labelling for columns, that were not suggested initially (this pane is accessible through the Add Classification button on top of the utility). Once you save it, the Discovery and Classification tool is storing the info in the database, using extended properties for the column. There are four extended properties being logged per column:

  • sys_information_type_id
  • sys_information_type_name
  • sys_sensitivity_label_id
  • sys_sensitivity_label_name

From this point forward, all reports and functionalities are based on those four extended properties. Below is an example of how labelling a single column looks like in the background:

exec
sp_addextendedproperty


    @name=N'sys_information_type_name',

    @level0type=N'schema',

    @level0name=N'Person',

    @level1type=N'table',

    @level1name=N'BusinessEntity',

    @level2type=N'column',

    @level2name=N'BusinessEntityID',

    @value=N'National ID'

exec
sp_addextendedproperty


    @name=N'sys_information_type_id',

    @level0type=N'schema',

    @level0name=N'Person',

    @level1type=N'table',

    @level1name=N'BusinessEntity',

    @level2type=N'column',

    @level2name=N'BusinessEntityID',

    @value=N'6F5A11A7-08B1-19C3-59E5-8C89CF4F8444'

exec
sp_addextendedproperty


    @name=N'sys_sensitivity_label_name',

    @level0type=N'schema',

    @level0name=N'Person',

    @level1type=N'table',

    @level1name=N'BusinessEntity',

    @level2type=N'column',

    @level2name=N'BusinessEntityID',

    @value=N'Highly Confidential'

exec
sp_addextendedproperty


    @name=N'sys_sensitivity_label_id',

    @level0type=N'schema',

    @level0name=N'Person',

    @level1type=N'table',

    @level1name=N'BusinessEntity',

    @level2type=N'column',

    @level2name=N'BusinessEntityID',

    @value=N'b82ce05b-60a9-4cf3-8a8a-d6a0bb76e903'


Reporting classification and label properties for single database

Once you have all your sensitive columns labeled using Extended Properties for a column, then all reports are easy. The query, that can give you the detailed information for labels in a database (the one that Microsoft is using):

exec sp_executesql @stmt=N'
            SELECT Schema_name(objects.schema_id) AS schema_name, 
                   objects.NAME                   AS table_name, 
                   columns.NAME                   AS column_name, 
                   ISNULL(EP.information_type_name,'''') AS  information_type_name,
                   ISNULL(EP.sensitivity_label_name,'''') AS  sensitivity_label_name
                   FROM (SELECT ISNULL(C1.major_id,C2.major_id) AS major_id, 
                                ISNULL(C1.minor_id,C2.minor_id) AS minor_id, 
                                C1.information_type_name, 
                                C2.sensitivity_label_name 
                                FROM (SELECT major_id, 
                                             minor_id,
                                             NULLIF(value,'''') AS information_type_name
                                             FROM sys.extended_properties 
                                             WHERE NAME = ''sys_information_type_name'') C1 
                                FULL OUTER JOIN (SELECT major_id, 
                                                        minor_id, 
                                                        NULLIF(value,'''') AS sensitivity_label_name
                                                        FROM sys.extended_properties 
                                                        WHERE  NAME = ''sys_sensitivity_label_name'') C2 
                                ON ( C2.major_id = C1.major_id AND C2.minor_id = C1.minor_id )) EP 
                   JOIN sys.objects objects 
                     ON EP.major_id = objects.object_id 
                   JOIN sys.columns columns 
                     ON ( EP.major_id = columns.object_id AND EP.minor_id = columns.column_id )
        ',@params=N''

The result looks like this:

Reporting classification and label properties for all databases

With slight modification of the above query you can produce a report for all your databases in a given instance:

DECLARE @ResultClassification table (DBName sysname,SchemaName sysname,TableName sysname,ColumnsName sysname,InformationTypeName sql_variant,SensitivityLabelName sql_variant)
INSERT INTO @ResultClassification
EXEC sp_MSforeachdb'
USE ?
SELECT DB_NAME(),Schema_name(objects.schema_id) AS schema_name, 
                   objects.NAME                   AS table_name, 
                   columns.NAME                   AS column_name, 
                   ISNULL(EP.information_type_name,'''') AS  information_type_name,
                   ISNULL(EP.sensitivity_label_name,'''') AS  sensitivity_label_name
                   FROM (SELECT ISNULL(C1.major_id,C2.major_id) AS major_id, 
                                ISNULL(C1.minor_id,C2.minor_id) AS minor_id, 
                                C1.information_type_name, 
                                C2.sensitivity_label_name 
                                FROM (SELECT major_id, 
                                             minor_id,
                                             NULLIF(value,'''') AS information_type_name
                                             FROM sys.extended_properties 
                                             WHERE NAME = ''sys_information_type_name'') C1 
                                FULL OUTER JOIN (SELECT major_id, 
                                                        minor_id, 
                                                        NULLIF(value,'''') AS sensitivity_label_name
                                                        FROM sys.extended_properties 
                                                        WHERE  NAME = ''sys_sensitivity_label_name'') C2 
                                ON ( C2.major_id = C1.major_id AND C2.minor_id = C1.minor_id )) EP 
                   JOIN sys.objects objects 
                     ON EP.major_id = objects.object_id 
                   JOIN sys.columns columns 
                     ON ( EP.major_id = columns.object_id AND EP.minor_id = columns.column_id )'

SELECT * FROM @ResultClassification

 

One Reply to “Deep dive on Data Discovery and Classification utility in SSMS 17.5”

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.