Sunday, 20 May 2012

Managing Files with FileTables

FileTables are a new feature in SQL Server 2012. They are layered on the existing FILESTREAM functionality.

FILESTREAM is designed to allow large objects, such as images and documents, to be stored logically in a column of a database table but physically in a file in the operating system. This keeps large objects outside the database, which keeps the size of the database down, and also allows the files to be accessed directly by client applications, reducing overhead.

FileTables, in contrast, allow SQL Server to manage a whole folder in the Windows file system, including all the subfolders and files in that folder. The benefit of this is that, from a client application perspective, the files are exactly the same as those stored in the normal file system and can be directly accessed by standard programs such as Microsoft Word. From a SQL Server perspective, the files and folders can be accessed and manipulated using T-SQL just as if they were stored in a table. Other advantages include the ability to index the contents of the folder directly using the full-text indexing facility in SQL Server.

Enabling a SQL Server instance and database for FileTable use requires a few simple steps:
·         First, enable FILESTREAM on the SQL Server instance:
sp_configure 'filestream access level', 2
RECONFIGURE

·         Then create a database with a FILESTREAM filegroup. In this example the files will reside in the folder C:\FileTableExample\FT
CREATE DATABASE FileTableExample
ON PRIMARY ( NAME = FileTableExample,
       FILENAME = 'C:\FileTableExample\Data\FileTableExample.mdf'),
FILEGROUP FileTableExampleFS CONTAINS FILESTREAM( NAME = FileTableExampleFS,
       FILENAME = 'C:\FileTableExample\FT')
LOG ON ( NAME = FileTableExampleLog,
       FILENAME = 'C:\FileTableExample\Data\FileTableExample.ldf')

·         Then define the root folder to be used by FileTables. Note that setting NON_TRANSACTED_ACCESS to FULL exposes FileTables for access through the Windows file system; the other options are OFF and READ_ONLY. In this example the files will reside in the folder C:\FileTableExample\FT\RootFolder
ALTER DATABASE FileTableExample
       SET FILESTREAM ( NON_TRANSACTED_ACCESS = FULL,
                        DIRECTORY_NAME = N'RootFolder' );

Finally, to create a FileTable just use:
CREATE TABLE FileTableExample1 AS FileTable
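
If required, the directory name that the FileTable exposes (and the collation used for file names) can be set explicitly when the table is created; a minimal sketch (the table and directory names here are hypothetical):
CREATE TABLE FileTableExample2 AS FileTable
WITH (FILETABLE_DIRECTORY = 'FileTableExample2Files',
      FILETABLE_COLLATE_FILENAME = database_default)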

From a SQL Server perspective a FileTable is a special type of table. Unlike normal tables, FileTables have a fixed schema of pre-set columns that represent the metadata properties of the files and folders they contain, such as name, path_locator, file_type, cached_file_size, creation_time and is_directory. These columns can be retrieved and updated using T-SQL, allowing the file system data that the table represents to be managed. The columns of the FileTable created above can also be seen under the table in Object Explorer.
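
For example, a simple query retrieves a few of these pre-set columns:
SELECT name, file_type, cached_file_size, creation_time, is_directory
FROM FileTableExample1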

To manipulate files and folders using Windows Explorer, first obtain the file share path using:
SELECT FileTableRootPath('dbo.FileTableExample1')

In this example the UNC path is \\SERVERNAME\mssqlserver\RootFolder\FileTableExample1, which can be accessed in Windows Explorer in the normal way (just drag and drop files).
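
The full UNC path of each individual file can also be retrieved in T-SQL by combining FileTableRootPath with the GetFileNamespacePath method of the file_stream column, along these lines:
-- Root path of the database's FILESTREAM share plus each file's relative path
SELECT FileTableRootPath() + file_stream.GetFileNamespacePath() AS FullPath
FROM FileTableExample1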

Files can be manipulated using T-SQL INSERT, UPDATE and DELETE (or any other T-SQL).
To rename a file:
UPDATE FileTableExample1
SET name='textfile1a.txt'
WHERE name='textfile1.txt'

To select the contents of a file (note that the file_stream column is varbinary(max), hence the CONVERT):
SELECT CONVERT(varchar(max), file_stream)
FROM FileTableExample1
WHERE name='textfile1a.txt'

To update the contents of a file:
UPDATE FileTableExample1
SET file_stream = CONVERT(varbinary(max),
CONVERT(varchar(max), file_stream) + 'new')
WHERE name='textfile1a.txt'
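
Creating and deleting files works in the same way; for example (the file name here is hypothetical):
-- Create a new file with some initial content
INSERT INTO FileTableExample1 (name, file_stream)
VALUES ('textfile2.txt', CONVERT(varbinary(max), 'Some sample content'))

-- Remove the file again
DELETE FROM FileTableExample1
WHERE name='textfile2.txt'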

Folders are dealt with in a similar way.
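
For example, inserting a row with is_directory set to 1 creates a new subfolder in the Windows share (a minimal sketch; the folder name is hypothetical):
-- Creates subfolder 'NewFolder' under the FileTable root
INSERT INTO FileTableExample1 (name, is_directory)
VALUES ('NewFolder', 1)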

Files and folders can be managed both with T-SQL and through the normal Windows file system. The best of both worlds.

Monday, 7 May 2012

FILESTREAM and Partitioning

There is a requirement for a large number of large PDF documents to be managed by a SQL Server database without the contents of the documents actually being stored in the database, for size reasons. FILESTREAM meets this requirement.

There is an additional requirement to store documents of different types in different folders, so as to take advantage of multiple disk drives. Partitioning allows a single table to be spread across multiple filegroups according to the values in certain columns, so again this meets the requirement.

Combining FILESTREAM and partitioning, however, turned out not to be as simple as expected.

I started by enabling FILESTREAM on SQL Server using:
sp_configure 'filestream access level', 2
RECONFIGURE

I then created a database with two FILESTREAM file groups using:
CREATE DATABASE [FileStreamExample] ON PRIMARY
( NAME = N'FileStreamTest', FILENAME = N'c:\FileStreamExample\FileStreamExample.mdf'),
FILEGROUP [FStream1] CONTAINS FILESTREAM
( NAME = N'FStream1', FILENAME = N'c:\FileStreamExample\FStream1' ),
FILEGROUP [FStream2] CONTAINS FILESTREAM
( NAME = N'FStream2', FILENAME = N'c:\FileStreamExample\FStream2' )
LOG ON
( NAME = N'FileStreamTest_log', FILENAME = N'c:\FileStreamExample\FileStreamExample_log.ldf')

Note that each FILESTREAM filegroup can only have one folder associated with it, hence the need for multiple filegroups.

I then proceeded to set up partitioning by creating a partition function. The document type will be either 1 or 2, and each of these should be assigned to a different filegroup:
CREATE PARTITION FUNCTION FSPartFunc ( int )
AS RANGE LEFT
FOR VALUES (1)
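
The built-in $PARTITION function gives a quick sanity check of which partition each document type will map to (here 1 and 2 respectively):
-- Returns the partition number that each value falls into
SELECT $PARTITION.FSPartFunc(1) AS Type1Partition,
       $PARTITION.FSPartFunc(2) AS Type2Partition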

Next I created two partition schemes, one for the FILESTREAM content and one for the other columns in the table. I need both because the table and the FILESTREAM data must be partitioned in step, so both schemes are based on the same partition function:
CREATE PARTITION SCHEME FSPartScheme
AS PARTITION FSPartFunc
TO (FStream1, FStream2)

CREATE PARTITION SCHEME PartScheme
AS PARTITION FSPartFunc
TO ([PRIMARY], [PRIMARY])

Then I created the table:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL,
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL)
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme
But this gave the error: A table with FILESTREAM column(s) must have a non-NULL unique ROWGUID column.

So I added a PRIMARY KEY on the ID column as below:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL PRIMARY KEY,
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL)
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme
But then I got the error: Partition columns for a unique index must be a subset of the index key (the partition column here being DataType). I tried again with a UNIQUE constraint instead and got the same error.

It seemed like there were two conflicting requirements here: firstly, the ID column must be unique on its own; secondly, the unique index must contain the partitioning column. So I tried to create a primary key based on both ID and DataType:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL,
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL,
PRIMARY KEY (ID, DataType ))
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme
But then I was back to the first error: A table with FILESTREAM column(s) must have a non-NULL unique ROWGUID column.

So FILESTREAM is pretty insistent on having a unique index on the ID column alone, whereas partitioning wants any index that follows the partition scheme to include the partitioning column. That's when I realised what the issue was: the unique index was being associated with the partition scheme of the table, and I do not actually need that to be the case. So I changed my primary key to be non-clustered, making it separate from the table data, and associated it with a different filegroup ([PRIMARY] in this example), and the issue was solved:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL
            PRIMARY KEY NONCLUSTERED ON [PRIMARY],
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL)
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme

Now the FILESTREAM data for DataType 1 and DataType 2 is stored in different folders, as required:
INSERT INTO FSPartExampleTable
VALUES (NEWID(), 1, CAST ('Part 1 ...data' AS varbinary(max)))
INSERT INTO FSPartExampleTable
VALUES (NEWID(), 2, CAST ('Part 2 ...data' AS varbinary(max)))
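
As a quick check that the rows landed in the expected partitions:
-- Shows which partition (and hence which FILESTREAM folder) each row belongs to
SELECT DataType, $PARTITION.FSPartFunc(DataType) AS PartitionNumber
FROM FSPartExampleTable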

Wednesday, 2 May 2012

The Quality of Data

With the introduction of SQL Server 2012 we now have a dedicated facility for the management of data quality, aptly named Data Quality Services. So what exactly does this entail?

The first question to ask is “what exactly is data quality?”. SQL Server is a relational database: by its very design it maintains relationships between tables through foreign key constraints, and it ensures the uniqueness of data columns using unique constraints, primary keys and indexes. We have data types to ensure that the correct type of value is entered into each column, and we can add check constraints and triggers to apply logical validations. So what could possibly go wrong with data quality?

Well, interestingly, quite a lot can still go wrong. Let’s take the user registration database for a typical ecommerce site as an example. Users register when they wish to make a purchase; the system records the email address, client name, and delivery and billing addresses. So far so good: SQL Server will ensure that this data is all stored correctly in the database. Then six months later the user returns to the site to buy something else, forgets that they have already registered, and registers again, with a different email address, a middle initial added to the name, and the address formatted slightly differently with Street instead of St and no postcode. SQL Server stores the user and address and all looks good, but all is not good. The user’s details have been stored twice, the address has been duplicated, and the formatting of the fields is inconsistent. This example may sound trivial, but it is a big issue for online retailers and just about every other business. The database server is rigorous about storing structured data, but it cannot handle the inconsistencies of the users who enter it. Having many thousands of duplicates in a database is not uncommon.
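
To illustrate the difficulty, a naive duplicate check in plain T-SQL, such as the sketch below against a hypothetical Users table, relies on crude similarity functions like DIFFERENCE (a SOUNDEX-based score from 0 to 4) and comes nowhere near the configurable fuzzy matching a dedicated tool provides:
-- Flag pairs of users whose delivery addresses merely sound alike
SELECT u1.UserID, u2.UserID
FROM Users AS u1
JOIN Users AS u2
    ON u1.UserID < u2.UserID
    AND DIFFERENCE(u1.DeliveryAddress, u2.DeliveryAddress) = 4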

Data Quality Services is designed to address these issues. It comes as a server component that manages the data and a client program used to manage the data quality process.

The stages of use are:
·         Firstly, rules are established in the knowledge base. Rules can be imported from a standard set of rules, derived from sample data, or set up by the user. Rules include things such as:
o   Spell checking data
o   Validating or standardising data
o   Defining matching rules that test the similarity of data to check for duplicates
·         Secondly, a Data Quality project is created to apply the rules in the knowledge base to a particular set of data in a database. The project applies the rules to cleanse the data and then provides an interactive facility for the user to approve or reject its changes. It can also apply matching rules to the data to detect and resolve duplicates
·         Finally, Data Quality Services will report on data completeness and quality through its Profiling and Notification facilities

Data cleaning is not a one-time activity. Once the Knowledge Base and Data Projects have been created they are repeatedly used and evolved over time to keep the data quality levels high.

Data Quality Services comes as part of SQL Server 2012 Enterprise and Business Intelligence Editions.