This article describes Dynamic Data Masking in Snowflake and indicates the best practices for implementing a data masking policy. Using the techniques described, you can deploy highly sensitive data and rest assured that your data is protected and visible on a need-to-know basis.
This article explains how to create a data masking policy and demonstrates the simple but highly effective best practices for deploying data masking on potentially massive deployments with hundreds of roles and thousands of users.
Experience with Snowflake Data Masking shows many Snowflake Administrators get tied up in complexity and produce difficult-to-understand and hard-to-maintain solutions that closely tie data security with Role Based Access Control. This article aims to create a system that is “As simple as possible, and no simpler”.
What is Dynamic Data Masking?
Dynamic Data masking is a Snowflake security feature that selectively masks plain-text data in tables or views based on a predefined masking policy. This can be used, for example, to mask (or hide) sensitive data and ensure columns containing sensitive data are only available on a need-to-know/need-to-use basis.
Take, for example, a system that records data about Products, Customers, and Sales records. This may include a range of data values, including:
Personally Identifying Data (PII). For example, a customer's date of birth or home address is subject to GDPR Regulations, which protect the use of PII.
Commercially Sensitive Data. For example, the current and historical product price lists or profit margins.
Financially Sensitive Data. For example, the quarterly sales revenue and profit and loss values are subject to financial compliance regulations.
Internal Data, such as the names of products on sale**,** is often a default classification for data accessible to all employees.
Using the above example, we may decide that only a limited number of people need access to Personally Identifiable data, only senior managers need access to commercially or financially sensitive data, and most of the remaining data will be classified as internal use. Based upon this categorization and the business roles, we may decide to dynamically mask data unless people need to see the data in plain text.
Data Classification and Business Roles
A common mistake solution architects make (and almost every other article on data masking) is to immediately dive into the technology details of Masking Policies and Role Hierarchies. This is a fundamental mistake.
Consider, instead, the critical points of the previous paragraph:
Data Classification: Including PII, Commercially Sensitive and Internal. This determines which data should be masked.
Business Roles Include System Administrators, Marketing Managers, and Finance Analysts. This determines who should see the data at all.
Remember that often, over 99% of data is not sensitive, and yet I often see architects get so tied up in knots about the solution they ignore the inherent simplicity of the problem. There is a need to understand how the data is classified and how users are grouped according to access needs, and then implement the links between the two. This leads to two important Best Practices:
Classify the Data: You don’t need to classify every data item and decide who needs access on a table-by-table or column-by-column basis. This is a potentially painful and error-prone approach. Instead, classify the data into (a maximum of about five) categories. You may find dependencies. For example, a customer's credit card could be classified as both personally identifying and financially sensitive data, but these relationships should be ignored immediately.
Design the RBAC Solution: Don’t confuse data secrecy (masking - the 1% of the data) with data access (99% of the problem). Having agreed upon a data classification approach, ignore this and design an RBAC Architecture to control who needs access to which data.
RBAC Architecture
Role-based access Control (RBAC) is a challenging subject in itself. I have discussed RBAC before in a sequence of articles, starting with An Introduction to RBAC. The diagram below illustrates a common RBAC architecture.
In the above diagram, three groups of users have access to the same SALES data. To simplify data access, they share a standard set of Security Roles that determine their data access.
However, when we consider data secrecy, we find additional requirements, including:
Finance Analysts: With the
PROD_FINANCE_ROLE
, who need financial and commercially sensitive data access but not personally identified data.Marketing Analysts: The
PROD_MARKETING_ANALYST
role requires access to Commercially Sensitive data but not Financially Sensitive data.Engineering Support: Those who have full access to create, modify, or alter tables but need no access to sensitive data.
Ideally, we need a simple, easily understood data classification solution based on the architecture of the RBAC mentioned above. The diagram below illustrates how this can be achieved by adding additional roles to indicate access to sensitive data.
Want to learn more from a Snowflake Expert? Click on the image below for training from Analytics Today.
Data Classification Architecture
The above diagram illustrates a simple but flexible way to implement access control on potentially sensitive data. There are a few points worth noting in this design:
Table Independence: Notice that, unlike the RBAC Architecture diagram above, tables or views do not grant access to the data classification roles. Using Snowflake dynamic data masking, we can add or remove masking on individual columns as needed. Unlike the Security Roles in the diagram above, granting access to these Data Classification roles is unnecessary.
Classification Hierarchies: As all users with access to Financially sensitive data also need plain text access to Commercially sensitive data, we can implement a simple hierarchy by granting one role to the other. This means any business role granted one role automatically inherits the other.
Business Functional Roles: Indicate the task undertaken by users and are therefore granted a combination of the Security Roles (which determine which data is accessible), and Data Classification roles which control sensitive data visibility.
Personally Sensitive Data Access: This is optional in this example. In the above diagram, the
PROD_ENGINEERING_SUPPORT
role has no access to sensitive data, and in this example, there are no users with access to Personally Identifiable sensitive data.
Potential Challenge: Functional Role Explosion
The diagram above illustrates a potential problem with the Data Classification solution. In this example, not every Finance or Marketing Analyst can access sensitive data. This leads to an explosion of functional roles, whereby every Functional Role is duplicated to produce access to either sensitive or non-sensitive data.
This is not ideal as it unnecessarily adds additional roles and, therefore, system complexity. Every functional role must be correctly granted access to the underlying Data Security roles. There’s a risk (for example) that PROD_MARKETING
users would be granted slightly different access to PROD_MARKETING_SENSITIVE
, which will cause confusion and frustration.
The diagram below illustrates an elegant and more flexible solution to this problem.
The Solution to Role Explosion
The diagram above illustrates a potential solution. Effectively, the Functional Role (e.g., PROD_FINANCE
) gives READ
access to the underlying tables, and the Data Classification Role (e.g., Financially Sensitive Data) grants additional access. In this case, the user can view the columns marked as Financially Sensitive in plain text.
This means (for example) that although both LEON
and RICK
can view the same data, only LEON can see the columns marked as Financially Sensitive in plain text. RICK can also see data there, which is automatically masked at execution time.
Snowflake Training by an Expert.
Click in the image below for more detail
Implementing Dynamic Data Masking
As indicated above, the key to maintaining simplicity is Data Classification.
We need to decide which data is potentially sensitive and classify it accordingly. In the above example, we classified the data as either commercially or financially sensitive.
Having identified and tagged the sensitive data, the Snowflake Administrator can deploy a data masking policy. For example:
-- Role gives access to Financially sensitive data
create or replace role FINANCIALLY_SENSITIVE_DATA;
-- The Policy Masks Financially sensitive numeric values
create or replace masking policy FINANCIALLY_SENSITIVE_NUMBER as
(val number) return number
case when contains(current_available_roles(),'FINANCIAL_SENSITIVE_DATA')
then val
else null
end;
-- Mask Financially sensitive string values
create or replace masking policy FINANCIALLY_SENSITIVE_STRING as
(val string) return string
case when contains(current_available_roles(),'FINANCIAL_SENSITIVE_DATA')
then val
else repeat('*', length(val)) -- Replace strings with a sequence of *******
end;
The Masking Policy statements above check whether the user has inherited access to the FINANCIALLY_SENSITIVE_DATA role. If not, the values are returned as either NULL or replaced by asterisks. Once deployed, even users with OWNERSHIP privilege on the table cannot view the data in plain text without this role.
The above solution's flexibility comes from the fact that it does not limit the user to a single CURRENT_ROLE()
to define both data access and data classification.
The CURRENT_AVAILABLE_ROLES
()
function returns a JSON list of all the roles granted to the current user, provided the user inherits or has been directly granted access to the FINANCALLY_SENSITIVE_DATA
role, they can view the data in plain text.
Equally, the user must be granted access to the underlying data through their primary functional role.
The solution also allows each user access to multiple Data Classification roles, making it remarkably powerful for such a simple solution. Once all the necessary masking policies are defined, the SQL below shows an example of how to mark data as potentially masked.
alter table if exists SALES_PROFIT
modify column TOTAL_SALES
set masking policy FINANCIALLY_SENSITIVE_NUMBER;
alter table if exists PRODUCTS
modify column PRODUCT_NAME
set masking policy FINANCIALLY_SENSITIVE_STRING;
Notice the mask FINANCIALLY_SENSITIVE_STRING uses the Snowflake replace
function to replace a varying length character string with asterisks. Alternatively we could write a User Defined Function or use a regular expression function to replace parts of the string. The screenshot below shows an example of a query against data, which includes dynamic data masking.
Equally, the solution means we can define a single masking policy but apply it to an unlimited number of columns. This means (for example) we can mask data items across multiple environments, including PROD, Test, and Development. Finally, as data is cloned or replicated, the masking policies are also cloned, promoting automatic, consistent data governance across the account.
Conclusion
Snowflake Dynamic Data Masking is a simple but powerful data governance feature that can automatically mask sensitive data items. However, it can be implemented in many sophisticated and overly complex ways, making it hard to understand and verify.
The key steps in deploying a data masking strategy are:
Classify the Data Types: Select a maximum of around five sensitive data classifications, including Personally-Identifying data, FinanciallySensitive, or even Departmentally Sensitive - for example, data only visible to Human Resources (HR) personnel.
Identify Business Roles and Deploy RBAC: After classifying the data, set them aside and devise a sensible Role-Based Access Architecture independent of handling sensitive data. Be sure to consider the RBAC Best Practices and perhaps the Introduction to Role-Based Access Control.
Implement Dynamic Data Masking: Use the Data Classifications from Step 1 to define the necessary masking policies and decide whether to grant the roles directly to users (for maximum flexibility) or to existing functional roles. Note: Avoid splitting roles into sensitive and non-sensitive data access, as this can lead to an explosion of roles. This requirement is better implemented by granting users direct access to the Data Classification roles.
The important takeaway here is to separate the problem of Data Access (the RBAC challenge of determining who needs access to tables) from Data Masking (whether specific columns need additional security using Automatic Data Masking).
Using the simple but powerful CURRENT_AVAILABLE_ROLES
()
function allows us to treat these two problems as separate challenges and deliver a simple solution.
“Everything should be as Simple as Possible and no simpler” — Albert Einstein
Snowflake Training from an Snowflake Superhero.
Click on the image below for more information.