Tuesday, 13 December 2011

Sequential Unique Identifiers

Using Globally Unique Identifiers (GUIDs) as the primary key for SQL Server database tables has become commonplace in physical database design. They replace the int identifiers and the IDENTITY clause that date back to the first versions of SQL Server. So what are the pros and cons?

Using int and IDENTITY is efficient because an int is only 4 bytes long and so takes up little space in the index associated with the primary key. IDENTITY also generates sequential numbers, so new records are appended to the end of the index and the table grows in a controlled manner. This is particularly important when the primary key is based on a clustered index, as it usually is, because a clustered index keeps the keys sorted in order. An example table is:
CREATE TABLE table1 (
      ID int IDENTITY(1,1) NOT NULL
            PRIMARY KEY CLUSTERED,
      OtherData varchar(1000)
)            
Using a uniqueidentifier ID and the NEWID() function requires 16 bytes, and the values are not sequential, so when used as a clustered primary key the associated index will quickly become fragmented as new records are inserted between existing records. This sounds like nothing but disadvantages so far, but the big advantage is that the keys are unique and remain unique even if the record is copied to another table in the database, to a table in another database, or to a table on another server altogether. This means that uniqueidentifiers are very good for any data that needs to be replicated in any way, which cannot be said for int IDENTITY columns. The extra 12 bytes is not a big deal either because storage is relatively cheap. An example table is:
CREATE TABLE table2 (
      ID uniqueidentifier DEFAULT NEWID() NOT NULL
            PRIMARY KEY CLUSTERED,
      OtherData varchar(1000)
)
The non-sequential nature of the uniqueidentifier and the associated index/table fragmentation remained an issue until it was resolved in SQL Server 2005 with the introduction of the NEWSEQUENTIALID() function. NEWSEQUENTIALID() generates sequential identifiers, thereby solving the fragmentation problems associated with NEWID(). It can only be used in the DEFAULT clause of a table column, but given its purpose this is hardly a big limitation.
An example of sequential identifiers in action is shown below:
CREATE TABLE table3 (
      ID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL
            PRIMARY KEY CLUSTERED,
      OtherData varchar(1000)
)
GO

INSERT INTO table3 (OtherData)
VALUES ('Test1'), ('Test2'), ('Test3')
GO

SELECT ID, OtherData
FROM table3
ORDER BY ID
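
To see the fragmentation difference in practice, the clustered indexes on table2 (NEWID) and table3 (NEWSEQUENTIALID) can be compared after a batch of inserts. The sketch below is illustrative only; the row count is arbitrary and sys.all_columns is simply used as a convenient source of rows.
-- Load a batch of rows into both tables (the defaults generate the keys)
INSERT INTO table2 (OtherData)
SELECT TOP (10000) 'Test' FROM sys.all_columns a CROSS JOIN sys.all_columns b

INSERT INTO table3 (OtherData)
SELECT TOP (10000) 'Test' FROM sys.all_columns a CROSS JOIN sys.all_columns b
GO

-- Compare the fragmentation of the two clustered indexes
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
WHERE ips.object_id IN (OBJECT_ID('table2'), OBJECT_ID('table3'))
  AND ips.index_id = 1
The table keyed on NEWID() would typically show far higher fragmentation because each new key lands at an essentially random position in the index.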



Sunday, 30 October 2011

DDL and Transactions

When I started in database management many years ago I worked exclusively on Oracle, and I got used to the fact that transactions worked with the DML statements SELECT, INSERT, UPDATE and DELETE, but that things were different when it came to DDL statements such as CREATE TABLE and DROP TABLE. In short, DDL was not transactional.
I then started working with SQL Server and for a long time I assumed that the same was true. It came as quite a surprise when I discovered that it was not true and that SQL Server supports transactions for both DML and DDL.
Consider the following T-SQL statements:
CREATE TABLE TestTable1 (ID int NOT NULL)
INSERT INTO TestTable1 VALUES (1), (2), (3)

BEGIN TRAN
DROP TABLE TestTable1
ROLLBACK

SELECT COUNT(*) FROM TestTable1
The answer is 3. After the rollback the table still exists and still has 3 rows in it. The ROLLBACK has successfully undone the DROP TABLE.
The same applies to a rollback of CREATE TABLE. Even TRUNCATE TABLE can be rolled back.
Transactions work by storing before and after images of rows in the transaction log. This log data is then used when undoing transactions in the database.
If I have a table with 10 million rows and I delete all of them using DELETE it takes quite a while as the rows are effectively copied into the transaction log file.
If on the other hand I start a transaction and delete all the rows using TRUNCATE TABLE it happens in an instant, yet the rollback still works! It clearly is not copying all the deleted rows to the log, so how does it work?
The answer is that it still logs data in the transaction log, but at a much coarser granularity. As TRUNCATE TABLE removes rows, whole 8KB data pages are de-allocated, and it is these page de-allocations that are logged rather than the individual rows. A very effective and efficient solution.
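Here is a minimal sketch showing TRUNCATE TABLE being rolled back; the table name is just an example:
CREATE TABLE TestTable2 (ID int NOT NULL)
INSERT INTO TestTable2 VALUES (1), (2), (3)

BEGIN TRAN
TRUNCATE TABLE TestTable2
SELECT COUNT(*) FROM TestTable2   -- returns 0 inside the transaction
ROLLBACK

SELECT COUNT(*) FROM TestTable2   -- returns 3: the TRUNCATE has been undone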

Sunday, 2 October 2011

Parsing Dates can be Trying

SQL Server and dates are often troublesome. In fact every database server I have ever worked on has had trouble with dates, and with the parsing of dates in particular. The issue is that different countries have different date formats, so when a date is expressed as a string it is not always obvious exactly which date it represents. The classic example is that Americans start dates with the month, whereas Europeans usually start them with the day. So for example:
·         ‘01/02/2011’ to a European means 1 February 2011
·         ‘01/02/2011’ to an American means 2 January 2011
Quite often this results in either run time errors or misinterpreted dates. On my SQL Server I just tried:
·         SELECT CAST('12/16/2011' AS datetime) and it returned 16 December 2011
·         SELECT CAST('16/12/2011' AS datetime) and it returned a run-time error
Clearly my SQL Server is configured for US dates.
The SET DATEFORMAT dmy command offers an option to override this behaviour and use European dates, in which case the results of the two SELECT statements above would be reversed.
But what if I wanted to check whether a given date was a valid European date? I could run:
SET DATEFORMAT dmy
DECLARE @d datetime
SET @d = CAST('12/16/2011' AS datetime)
This would fail with a run-time error because the date is invalid. But getting a run-time error is hardly a convenient way to process the date, so I would need to add error handling to recover from it. One way to do this is to use a TRY...CATCH block, which would look something like this:
SET DATEFORMAT dmy
DECLARE @d datetime
BEGIN TRY
      SET @d = CAST('12/16/2011' AS datetime)
END TRY
BEGIN CATCH
END CATCH
IF @d IS NULL PRINT 'Date error'
This is a lot of trouble just to check a date format, and so SQL Server Denali (I am using the CTP3 release) has extended T-SQL to make the parsing of dates a lot easier. In particular, a new function called TRY_PARSE has been added. For example, to explicitly parse a date in UK format use:
DECLARE @d datetime
SET @d = TRY_PARSE ('12/16/2011' AS datetime2 USING 'en-GB')
IF @d IS NULL PRINT 'Date error'
Note that TRY_PARSE returns the datetime value if the format is good and NULL otherwise. TRY_PARSE is not just for dates; it can also be used for other country/language specific formats such as currencies.
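As a rough illustration of the non-date usage, the same pattern works for a culture-specific currency string (the value below is just an example):
DECLARE @amount money
SET @amount = TRY_PARSE('€345,98' AS money USING 'de-DE')   -- German formatting: comma as decimal separator
IF @amount IS NULL PRINT 'Currency error'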
There is also a new function called PARSE (without the TRY). This does the same as TRY_PARSE but raises a run-time error if the format is invalid rather than returning NULL.
Finally, we also have TRY_CONVERT, which is a version of the standard CONVERT function. It too returns NULL if the conversion fails.
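For example, using the British/French date style 103 (dd/mm/yyyy):
SELECT TRY_CONVERT(datetime, '16/12/2011', 103) AS GoodDate,  -- 16 December 2011
       TRY_CONVERT(datetime, '12/16/2011', 103) AS BadDate    -- NULL, there is no month 16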
All in all we have 3 new T-SQL functions that make the parsing and conversion of data just that little bit easier.

Friday, 30 September 2011

SEQUENCE vs. IDENTITY

SQL Server Denali contains a new database object called SEQUENCE that delivers sequential numbers. It complements the existing IDENTITY construct that has a similar objective.
IDENTITY has been around as long as I can remember and comes from SQL Server's Sybase heritage. It is simple to use because all that is needed is an IDENTITY clause on the column:
CREATE TABLE Table1 (
    Column1 int NOT NULL IDENTITY(1,1),
    Column2 varchar(50) NULL)
After this, INSERTs into the table are automatically assigned the next number in the sequence.
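For example, a quick sketch against Table1 above; SCOPE_IDENTITY() returns the number that was just assigned:
INSERT INTO Table1 (Column2) VALUES ('First row')
INSERT INTO Table1 (Column2) VALUES ('Second row')

SELECT SCOPE_IDENTITY()   -- returns 2, the value assigned to Column1 by the last INSERT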
IDENTITY is safe and easy to use. So why introduce a new and similar construct now?
The sceptical amongst us may be thinking that the reason SQL Server is getting SEQUENCE objects is that Oracle has them and Microsoft does not want to be seen as second best. Another reason could be compliance with the ANSI/ISO SQL standards. A third is that IDENTITY has some disadvantages that SEQUENCE objects do not. Some examples are:
·         A column that uses IDENTITY cannot be changed with an UPDATE statement (SET IDENTITY_INSERT only allows explicit values to be supplied on an INSERT)
·         IDENTITY cannot be added to an existing column using ALTER TABLE
·         IDENTITY allocates numbers for use in a single table and numbers cannot be shared across tables
·         IDENTITY specifically relates to new records and INSERT and cannot be used in other circumstances
An example SEQUENCE object is defined as follows:
CREATE SEQUENCE dbo.MySequence
    AS int
    START WITH 1
    INCREMENT BY 1
    MINVALUE 1
    MAXVALUE 100000
    CYCLE
    CACHE 3

It is used to obtain values with the following clause:
NEXT VALUE FOR dbo.MySequence
This can be used in all kinds of circumstances and some of these are as follows:
·         Assigning the next value to a variable:
DECLARE @i int
SELECT @i = NEXT VALUE FOR dbo.MySequence
·         Inserting a value into two tables from the same sequence:
INSERT INTO Table2 (Column1, Column2)
VALUES (NEXT VALUE FOR dbo.MySequence, 'Data')
INSERT INTO Table3 (Column1, Column2)
VALUES (NEXT VALUE FOR dbo.MySequence, 'Data')
·         Defining a sequence as a default for a column, so that new values are assigned automatically on each INSERT:
CREATE TABLE Table4 (
    Column1 int NOT NULL DEFAULT NEXT VALUE FOR dbo.MySequence,
    Column2 varchar(50) NULL)
·         Using a sequence in a SELECT clause to number the rows returned:
SELECT NEXT VALUE FOR dbo.MySequence OVER (ORDER BY Name), Name
FROM sys.objects
Pretty powerful stuff!
Certainly SEQUENCE offers more scope than IDENTITY alone. But it's worth remembering that IDENTITY is not going to go away any time soon, so for simplicity and backwards compatibility we can keep on using it. When we need a little more, we have SEQUENCE as well.
The best of both worlds

Sunday, 25 September 2011

Querying Data Streams in Real-Time

Processing and acting on high volume real-time data is a requirement for many organisations today including:
·         Financial: Processing real time trading information
·         Utilities: Monitoring power consumption, tracking supply vs. demand
·         Communications: Monitoring network usage, tracking failure rates, processing banner ad clicks on web sites
However, the traditional approach to data analysis is to store the data in a database, run queries to extract and calculate the required data, and then produce output in the form of a report. With 1,000 new data records arriving per second this is clearly not practical; that is, until StreamInsight came along.

StreamInsight is part of SQL Server 2008 R2 and is designed to accept and query real-time data streams with volumes of thousands of records per second, all based on familiar .NET and LINQ technology. There are two editions:
·         Standard edition for data volumes of less than 5,000 events per second
·         Premium edition for data volumes of up to 100,000 events per second
You need SQL Server 2008 R2 Datacenter Edition to get the Premium edition, while the Standard edition comes with SQL Server 2008 R2 Standard and Enterprise Editions. So how does it work?
·         Firstly, the data stream is read from the sources by an adaptor. StreamInsight comes with a set of standard adaptors for typical data sources, but custom ones are typically created from these
·         The data is then passed to the queries, which are written using the LINQ query language. These queries operate on the data as it arrives
·         The results of the queries are sent as a data stream to output adaptors which in turn route the results to the system that needs them. This may be a reporting solution or a trading application and so on
Architecturally, the input adaptors feed the data into the StreamInsight engine, which runs the LINQ queries and passes the results on to the output adaptors. StreamInsight typically runs as a Windows Service on a server, but can be embedded directly into a program if required.

All in all this is a pretty cool extension to SQL Server, and real-time decision making on high-volume data can now become part of any organisation's analysis and reporting capabilities.