Monday, 4 June 2012

Server Management with Utility Control Points

Utility Control Points (UCPs) were first introduced in SQL Server 2008 R2 and are designed for managing instances of SQL Server 2008 R2 or later.

The bad news is that if you have older versions of SQL Server then you will have to upgrade to take advantage of this facility for those servers.

The rest is pretty much all good news because UCPs are a very easy way to get an overview of the state of all servers and to drill down to view the state of specific servers.

To create a UCP in Management Studio, first display the Utility Explorer using the View > Utility Explorer menu, then select the third icon in the toolbar and run through the wizard to create the UCP:

Then select the Enroll Instance menu, as shown below, to register each server that you wish to monitor:

Then wait 15 minutes for the data collectors to update the database, and the system is operational (as shown by the green ticks in the screen below):


The top level UCP dashboard then gives an overview of the state of health of all servers:

Information on individual servers can be investigated by selecting the individual server.

Policies on what constitutes over-utilisation, under-utilisation and so on can be set globally or per server.

The only part that I found marginally confusing during the configuration was when I received the error message “The collection sets on the specified instance of SQL Server must be stopped”. This error concerned me because initially I thought it meant that it was not possible to use the Management Data Warehouse facilities for monitoring query performance at the same time as using UCPs. However, it turns out that UCPs just use the same Management Data Warehouse infrastructure, and so require that any Management Data Warehouse data collectors on any of the servers involved use the same database. So you just have to turn off any data collectors you have running and then restart them once the UCP is in place, and all will work together nicely. To stop and start data collectors just right click on each one and use the Stop or Start Data Collection Set menu as shown below:
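The same thing can also be scripted; a minimal T-SQL sketch using the data collector procedures in msdb (the collection set name shown is the standard Query Statistics set, substitute whichever collection sets you have running):

-- Stop a running data collection set before configuring the UCP
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = N'Query Statistics'

-- ...create the UCP and enrol the instances...

-- Restart the data collection set once the UCP is in place
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = N'Query Statistics'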

Overall, the Utility Explorer and UCPs represent a simple and efficient way of viewing the overall health of a number of SQL Server instances.

Sunday, 20 May 2012

Managing Files with FileTables

FileTables are a new feature in SQL Server 2012. They are layered on the existing FILESTREAM functionality.

FILESTREAM is designed to allow large objects, such as images and documents, to be stored logically in a column in a database table, but physically in a file in the operating system. This keeps large objects outside the database, keeping the size of the database down, and also allows the files to be accessed directly by client applications, reducing overhead.

FileTables, in contrast, allow SQL Server to manage a whole folder in the Windows file system, including all the sub folders and files in that folder. The benefit of this is that from a client application perspective the files are exactly the same as those stored in the normal file system and can be directly accessed by standard programs such as Microsoft Word. From a SQL Server perspective the files and folders can be accessed and manipulated using T-SQL just as if they were stored in a table. Other advantages include the ability to directly index the contents of the folder using the full-text indexing facility in SQL Server (a sketch of this is shown at the end of this post).

Enabling a SQL Server instance and database for FileTable use requires a few simple steps:
·         First enable FILESTREAM on the SQL Server instance
sp_configure 'filestream access level', 2
RECONFIGURE

·         Then create a database with a FILESTREAM filegroup. In this example the files will reside in folder C:\FileTableExample\FT
CREATE DATABASE FileTableExample
ON PRIMARY
    ( NAME = FileTableExample,
      FILENAME = 'C:\FileTableExample\Data\FileTableExample.mdf'),
FILEGROUP FileTableExampleFS CONTAINS FILESTREAM
    ( NAME = FileTableExampleFS,
      FILENAME = 'C:\FileTableExample\FT')
LOG ON
    ( NAME = FileTableExampleLog,
      FILENAME = 'C:\FileTableExample\Data\FileTableExample.ldf')

·         Then define the root folder to be used by FileTables. Note that setting NON_TRANSACTED_ACCESS to FULL exposes FileTables for use via the Windows file system; the other options are OFF and READ_ONLY. In this example the files will reside in folder C:\FileTableExample\FT\RootFolder
ALTER DATABASE FileTableExample
    SET FILESTREAM ( NON_TRANSACTED_ACCESS = FULL,
                     DIRECTORY_NAME = N'RootFolder' );

Finally to create a FileTable just use:
CREATE TABLE FileTableExample1 AS FileTable
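Optionally, the FileTable can also be given its own directory name beneath the database root folder. A quick sketch using the FILETABLE_DIRECTORY and FILETABLE_COLLATE_FILENAME options (the table and directory names here are just examples):

CREATE TABLE FileTableExample2 AS FileTable
WITH ( FILETABLE_DIRECTORY = 'Example2Files',
       FILETABLE_COLLATE_FILENAME = database_default )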

From a SQL Server perspective a FileTable is a special type of table. Unlike normal tables, FileTables have a fixed set of columns that represent the metadata properties of the files and folders they contain. These columns can be retrieved and updated using T-SQL, allowing the file system data that the table represents to be managed. The columns of the FileTable created above are shown below in Object Explorer:

To manipulate files and folders using Windows Explorer first obtain the file share name using:
SELECT FileTableRootPath('dbo.FileTableExample1')

In this example the UNC path is \\SERVERNAME\mssqlserver\RootFolder\FileTableExample1, which can be accessed in Windows Explorer in the normal way (just drag and drop files):
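The full UNC path of each individual file can also be retrieved in T-SQL; a small sketch combining FileTableRootPath() with the file_stream column's GetFileNamespacePath() method (which returns the path relative to the FileTable root):

SELECT name,
       FileTableRootPath('dbo.FileTableExample1') + file_stream.GetFileNamespacePath() AS UNCPath
FROM FileTableExample1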

Files can be manipulated using T-SQL INSERT, UPDATE and DELETE (or any other T-SQL).
To rename a file:
UPDATE FileTableExample1
SET name='textfile1a.txt'
WHERE name='textfile1.txt'

To select the contents of a file (note that the file_stream column is varbinary(max), hence the CONVERT):
SELECT CONVERT(varchar(max), file_stream)
FROM FileTableExample1
WHERE name='textfile1a.txt'

To update the contents of a file:
UPDATE FileTableExample1
SET file_stream = CONVERT(varbinary(max),
CONVERT(varchar(max), file_stream) + 'new')
WHERE name='textfile1a.txt'
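New files can also be created entirely from T-SQL by inserting a row and letting the remaining columns take their defaults; a minimal sketch (the file name and contents are made up):

INSERT INTO FileTableExample1 (name, file_stream)
VALUES ('textfile2.txt', CONVERT(varbinary(max), 'some new content'))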

Folders are dealt with in a similar way.
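For example, a sub folder can be created by inserting a row flagged as a directory, leaving file_stream NULL (the folder name is made up):

INSERT INTO FileTableExample1 (name, is_directory)
VALUES ('NewFolder', 1)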

Files and folders can be managed both in T-SQL and through the normal Windows file system. The best of both worlds.
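Coming back to the full-text indexing mentioned earlier, below is a rough sketch of indexing the file contents. The catalog name is made up, and the KEY INDEX must be the FileTable's system-generated unique index on stream_id, which can be looked up first:

-- Find the name of the FileTable's unique index on stream_id
SELECT i.name
FROM sys.indexes i
JOIN sys.index_columns ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.FileTableExample1')
  AND i.is_unique = 1 AND c.name = 'stream_id'

-- Create a catalog and a full-text index over the file contents;
-- the file_type column tells full-text search which filter to use for each file
CREATE FULLTEXT CATALOG FileTableCatalog;
CREATE FULLTEXT INDEX ON dbo.FileTableExample1
    (file_stream TYPE COLUMN file_type LANGUAGE 1033)
    KEY INDEX UQ_FileTableExample1_stream_id -- hypothetical name, substitute the index name found above
    ON FileTableCatalog
    WITH CHANGE_TRACKING AUTO;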

Monday, 7 May 2012

FILESTREAM and Partitioning

There is a requirement for a large number of large PDF documents to be managed by a SQL Server database without the contents of the documents actually being stored in the database, for size reasons. FILESTREAM meets these requirements.

There is an additional requirement to store documents of different types in different folders so as to be able to take advantage of multiple disk drives. Partitioning allows a single table to be allocated to multiple file groups according to values in certain columns and so again this meets the requirements.

Combining FILESTREAM and partitioning, however, turned out not to be as simple as expected.

I started by enabling FILESTREAM on SQL Server using:
sp_configure 'filestream access level', 2
RECONFIGURE

I then created a database with two FILESTREAM file groups using:
CREATE DATABASE [FileStreamExample] ON PRIMARY
( NAME = N'FileStreamTest', FILENAME = N'c:\FileStreamExample\FileStreamExample.mdf'),
FILEGROUP [FStream1] CONTAINS FILESTREAM
( NAME = N'FStream1', FILENAME = N'c:\FileStreamExample\FStream1' ),
FILEGROUP [FStream2] CONTAINS FILESTREAM
( NAME = N'FStream2', FILENAME = N'c:\FileStreamExample\FStream2' )
LOG ON
( NAME = N'FileStreamTest_log', FILENAME = N'c:\FileStreamExample\FileStreamExample_log.ldf')
Note that each FILESTREAM file group can only have one folder associated with it, hence the need for multiple file groups.

I then proceeded to set up partitioning by creating a partition function. The document type will be either 1 or 2, and each of these should be assigned to a different file group:
CREATE PARTITION FUNCTION FSPartFunc ( int )
AS RANGE LEFT
FOR VALUES (1)

Next I created two partition schemes, one for the FILESTREAM content and one for the other columns in the table. I need both because the table and the FILESTREAM data must be partitioned by the same partition function but placed on different file groups:
CREATE PARTITION SCHEME FSPartScheme
AS PARTITION FSPartFunc
TO (FStream1, FStream2)

CREATE PARTITION SCHEME PartScheme
AS PARTITION FSPartFunc
TO ([PRIMARY], [PRIMARY])

Then I created the table:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL,
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL)
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme
But I got the error: A table with FILESTREAM column(s) must have a non-NULL unique ROWGUID column.

So I added a PRIMARY KEY on the ID column as below:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL PRIMARY KEY,
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL)
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme
But then I got the error: Partition columns, DataType in this case, for a unique index must be a subset of the index key. I tried again with a UNIQUE constraint and got the same error.

It seemed like there were two conflicting requirements here: firstly, the ID column must be unique, and secondly, any unique index must include the partitioning column. So I tried to create a primary key based on both ID and DataType:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL,
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL,
PRIMARY KEY (ID, DataType ))
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme
But then I was back to the first error: A table with FILESTREAM column(s) must have a non-NULL unique ROWGUID column.

So FILESTREAM is pretty insistent on having a unique index on the ID column alone, whereas partitioning wants a composite index if the index is to be partitioned. That’s when I realised what the issue was: the unique index was being associated with the partition scheme of the table, and I did not actually need that to be the case. So I changed my primary key to be non-clustered, so that it is stored separately from the table data, and associated it with a different file group ([PRIMARY] in this example), and the issue was solved:
CREATE TABLE FSPartExampleTable (
      ID uniqueidentifier not null ROWGUIDCOL
            PRIMARY KEY NONCLUSTERED ON [PRIMARY],
      DataType int NOT NULL,
      TheData varbinary(max) filestream NOT NULL)
ON PartScheme(DataType)
FILESTREAM_ON FSPartScheme

Now the FILESTREAM data for DataType 1 and 2 are stored in different folders as required:
INSERT INTO FSPartExampleTable
VALUES (NEWID(), 1, CAST ('Part 1 ...data' AS varbinary(max)))
INSERT INTO FSPartExampleTable
VALUES (NEWID(), 2, CAST ('Part 2 ...data' AS varbinary(max)))
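To double check which partition (and therefore which FILESTREAM folder) each row ends up in, the $PARTITION function can be queried against the partition function; a quick sketch:

SELECT ID, DataType,
       $PARTITION.FSPartFunc(DataType) AS PartitionNumber
FROM FSPartExampleTable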

Wednesday, 2 May 2012

The Quality of Data

With the introduction of SQL Server 2012 we now have a dedicated facility for the management of data quality, aptly named Data Quality Services. So what exactly does this entail?

The first question to ask is “what exactly is data quality?”. SQL Server is a relational database and by its very design it maintains relationships between tables through foreign key constraints, and it ensures uniqueness of data columns using unique constraints, primary keys and indexes. We have data types to ensure that the correct type of value is entered into columns, and we can add check constraints and triggers to apply logical validations. So what could possibly go wrong with the data quality?

Well, interestingly, quite a lot can still go wrong. Let’s take the user registration database for a typical ecommerce site as an example. Users register when they wish to make a purchase, and the system records the email address, client name and delivery and billing addresses. So far so good: SQL Server will ensure that this data is all stored correctly in the database. Then six months later the user returns to the site to buy something else, forgets that they already registered, and registers again with a different email address, a middle initial added to the name and the address formatted slightly differently, with Street instead of St and no postcode. SQL Server stores the user and address and all looks good, but all is not good. The user’s details have been stored twice, the address has been duplicated and the formatting of the fields is inconsistent.

The above example may sound trivial but it is a big issue for online retailers and just about every other business. The database server is rigorous about storing structured data, but it can’t handle the inconsistencies of the users that enter the data. Having many thousands of duplications in a database is not uncommon.

Data Quality Services is designed to address the above issues. It comes as a server program that manages the data and a client program used to manage the data quality process. Part of the client program user interface is shown below.

The stages of use are:
·         Firstly, rules are established in the knowledge base. Rules can be imported from a standard set of rules, derived from sample data, or set up by the user. Rules include things such as:
o   Spell checking data
o   Validating or standardising data
o   Defining matching rules that test the similarity of data to check for duplicates
·         Secondly a Data Quality project is created to apply the rules in the knowledge base to a particular set of data in a database. The project applies the rules to cleanse the data and then provides an interactive facility for the user to approve or reject its changes. It can also apply matching rules to the data to detect and resolve duplicates
·         Finally Data Quality Services will report on data completeness and quality through its Profiling and Notification facilities

Data cleaning is not a one-time activity. Once the Knowledge Base and Data Projects have been created they are repeatedly used and evolved over time to keep the data quality levels high.

Data Quality Services comes as part of SQL Server 2012 Enterprise and Business Intelligence Editions.

Monday, 30 January 2012

Cloud Data and SQL Azure

Applications on laptops that travel outside the office, such as the order application used by the company sales representative, need to store data, and historically the solution has been to have a local database server on the laptop. When the sales representative returns to the office at the end of the week, the laptop is plugged into the network and is synchronised with the main database, typically using SQL Server Merge Replication.
The main issue with this approach is that changes made to the head office database and to the order on the laptop database are only synchronised once per week. This means price and stock changes could be missed and new orders are received late.
So why not have the order application on the laptop connect directly to the database server in head office? The reason is usually connectivity and security. The head office firewall would need to allow external connections, which in some organisations is not desirable.

There are a couple of cloud based solutions to this issue:
1.       Store the database in the cloud and access it from the laptop and head office. That way the laptop can access it directly and so can head office. This will work for some organisations but others will still want the main database to be on the premises in the head office

2.       Store a copy of the part of the database that the order application needs in the cloud and keep it synchronised with the main database in the head office
SQL Azure, Microsoft’s SQL in the Cloud offering, can support both of these options, but option 2 is the most likely solution.
In addition, SQL Azure offers a fully managed service, including backups, maintenance and automatic failover to an alternate server in the event of failure. This means that there is no need to buy a new server or to install and manage it.

From a development perspective the order application connects to the database on SQL Azure in exactly the same way as it does with an on-premise database, and the programmer uses exactly the same SQL. SQL Azure uses a login to gain access to the server and also supports firewall rules to restrict where connections can come from.
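As a rough sketch of that last point, server-level firewall rules can also be managed in T-SQL against the master database of the SQL Azure server (the rule name and IP range here are made up):

-- Allow connections from the head office IP range
EXEC sp_set_firewall_rule
     @name = N'HeadOffice',
     @start_ip_address = '203.0.113.1',
     @end_ip_address = '203.0.113.10'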
The synchronisation requires the installation of the “Microsoft SQL Azure Data Sync Agent” on the laptop and the head office server. The agent connects to Azure to carry out the synchronisation and can be configured to do this at a designated frequency (such as every 30 minutes).
All will remain reasonably up to date, with no need to allow connections in through the head office firewall.

Tuesday, 13 December 2011

Sequential Unique Identifiers

Using Globally Unique Identifiers as the primary key for SQL Server database tables has become commonplace in physical database design. It replaces the use of int identifiers and the IDENTITY clause, which date back to the first versions of SQL Server. So what are the pros and cons?

Using int and IDENTITY is efficient because an int is only 4 bytes long and so does not take up much space in the index associated with the primary key. IDENTITY also generates sequential numbers, so new records are appended to the end of the index and the table grows in a controlled manner. This is particularly important when the primary key is based on a clustered index, which it usually is, because the clustered index requires that the keys are sorted in order. An example table is:
CREATE TABLE table1 (
      ID int IDENTITY(1,1) NOT NULL
PRIMARY KEY CLUSTERED,
      OtherData varchar(1000)
)            
Using a uniqueidentifier ID and the NEWID() function requires 16 bytes, and the values are not sequential, so when used as a clustered primary key the associated index will quickly become fragmented as new records are inserted between existing records. This just sounds like disadvantages so far, but the big advantage is that the keys are unique and they remain unique even if the record is copied to another table in the database, to a table in another database, or even to a table on another server altogether. This means that uniqueidentifiers are very good for any data that needs to be replicated in any way, and this cannot be said for int IDENTITY columns. The extra 12 bytes is not a big deal either because storage is relatively cheap. An example table is:
CREATE TABLE table2 (
      ID uniqueidentifier DEFAULT NEWID() NOT NULL
PRIMARY KEY CLUSTERED,
      OtherData varchar(1000)
)
The non-sequential nature of the uniqueidentifier and the associated index / table fragmentation remained an issue until it was resolved in SQL Server 2005 with the introduction of the NEWSEQUENTIALID() function. NEWSEQUENTIALID() generates sequential identifiers, thereby solving the fragmentation problems associated with NEWID(). It can only be used in the DEFAULT clause of a column in a table, but given its purpose this is hardly a big limitation.
An example of sequential identifiers in action is shown below:
CREATE TABLE table3 (
      ID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL
            PRIMARY KEY CLUSTERED,
      OtherData varchar(1000)
)
GO

INSERT INTO table3 (OtherData)
VALUES ('Test1'), ('Test2'), ('Test3')
GO

SELECT ID, OtherData
FROM table3
ORDER BY ID
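To see the fragmentation difference for yourself, after inserting a reasonable number of rows into table2 and table3 the index physical stats DMV can be compared; a quick sketch:

-- Compare clustered index fragmentation of the NEWID() and NEWSEQUENTIALID() tables
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('table2'), NULL, NULL, 'LIMITED') ips
UNION ALL
SELECT OBJECT_NAME(ips.object_id),
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('table3'), NULL, NULL, 'LIMITED') ips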