ProfileDeferred ProcessingBlog Tools Help

Blog


    September 29

    Blog moving

    The good people at www.pluralsight.com have offered to host my blog so I am moving it there. It gives me more control over the content I post and, I think, is a more apropos place for the sort of things I blog about. The new location is:

     

    www.pluralsight.com/blogs/dan

     

    The feed is at:

     

    www.pluralsight.com/blogs/dan/rss.aspx

     

     

    August 09

    Psudo Median Aggregate

    /*
    Psuedo Median Aggregrate
     
    In an item I previously blogged, YAMC, I showed how to use the ROW_NUMBER() function in SQL Server 2005 to do a pretty efficient median calculation. Being able to calculate the median in an expression can be useful, but a median aggregate function would be much more useful.
     
    Well T-SQL, even in SQL Server 2005, doesn't provide a median aggregate function and the is no way to write one using user defined aggregates written in a CLR language either.
     
    Remember that an aggregate is set in, scalar out. YAMC was an aggregate calculation, but not an aggregate function. One of the useful things about aggregate function, like SUM(), is that they can be used in conjunction with GROUP BY phrases, so you can get an aggregate per some kind of value.
     
    We are going to look at how to extend the YAMC to use a GROUP BY phrase so that it can be used, efficiently, to produce the same results as a MEDIAN() aggregate function would.
     
    We will use the sales table that follows as example data. A script that will initialize it with some sample data is at the end of this blog item.
    */
     
    Create
    table Sales
    (
    PONumber int Primary Key,
    Value
    Money,
    State NCHAR(2)
    )
    /*
    A simple aggregate calculation could be the SUM of the values by state.
    */
     
    SELECT
    State, SUM(value) FROM Sales GROUP BY State
     
    /*
    Here is a little bit of the output...
     
    NB 891760.7953
    WI 1001671.6347
    PA 962849.0728
    IL 1055766.3988
     
    What we would like to do thougth is:
     
    SELECT State, MEDIAN(Value) FROM Sales GROUP BY State
     
    but that won't work because T-SQL has no  MEDIAN aggregate function.
     
    We could calculate the median of the whole table or a specific state using YAMC by modifying it a bit to just use rows from the state we are interest in.
     
    */
     
    with
    positions
    as
    (
    select
    (1 + Count(*)) / 2 as mid,
    1-(count(*) % 2) as even
    from
    Sales where State = 'NB'
    ),
    rows
    as
    (
    select
    value,
    ROW_NUMBER() over (order by value) as rn
    from
    Sales where State = 'NB'
    )
    select
    'NB' as State, AVG(value) from rows JOIN positions on
    rn in (positions.mid, positions.mid + positions.even)
    /*
     
    And we get...
     
    NB 4218.464
     
    It turns out ROW_NUMBER() has a friend named PARTITION BY, whose purpose in spirit is the same as GROUP BY. In fact by using PARTITION  BY and GROUP BY together we can do an aggregate calculation grouped by a value, just as you can using the built in aggregate functions.
     
    Here is a "psuedo" median aggregate that calculates the median sales value by state.
     
    */
     
    -- psuedo median aggregate
    WITH
    positions
    AS
    (
    SELECT
    (1 + count(*)) / 2 AS mid,
    1-(count(*) % 2) AS even,
    state
    FROM
    Sales GROUP BY state
    ),
    rows AS
    (
    SELECT
    value, state, row_number() OVER
    (PARTITION BY state order by value) AS rn FROM sales
    )
    SELECT
    rows.state, AVG(rows.value) AS median
    FROM
    rows JOIN positions
    ON
    rows.state = positions.state
    AND
    rows.rn IN
    (
    positions.mid, positions.mid + positions.even)
    GROUP
    BY rows.state
    /*
    This explaination assumes you have already read the YAMC blog item.
     
    Similar as in YAMC, positions calculates the mid and even values, but this time groups the values by state.
     
    Again, similar to YAMC, rows associates a row number with each value, but this time partitions the row numbers by state. This means that each state will have its own set of row numbers starting with one.
     
    The SELECT that follows the CTE's is a bit more complex than the one in YAMC. It joins together positions and rows just as YAMC did. However it adds to the predicate a clause that makes the row and positions have the same state. And at the end it groups the AVG() aggregate by state.
     
    Here is a little bit of what it produces...
     
    AK 5158.6132
    AL 5222.3433
    AR 5347.351
    AZ 5190.8084
    CA 5015.9836
    NB 4218.464
     
    It would be much easier, and a lot more flexible, if T-SQL just gave us a MEDIAN() aggregate function, but for now at least we can do a real Median grouped by value if we want to.
     
     
    Since this idiom accomplises pretty much what an aggregate function does, I wonder if the SQL team could cook up a syntax that used PARTITION BY and GROUP BY so that we could write our own user defined aggregates in T-SQL? One can dream...
     
    */
     
    -- genrates test data
    DECLARE
    @index int
    SET
    @index = 10000
    WHILE
    @index > 0
    BEGIN
    DECLARE
    @state NVARCHAR(2)
    DECLARE
    @rand float
    SET
    @rand =RAND() * 50
    SET
    @state = CASE
    WHEN
    @rand < 1 THEN N'AL'
    WHEN
    @rand < 2 THEN N'AK'
    WHEN
    @rand < 3 THEN N'AZ'
    WHEN
    @rand < 4 THEN N'CA'
    WHEN
    @rand < 5 THEN N'CO'
    WHEN
    @rand < 6 THEN N'CT'
    WHEN
    @rand < 7 THEN N'DE'
    WHEN
    @rand < 8 THEN N'FL'
    WHEN
    @rand < 9 THEN N'GA'
    WHEN
    @rand < 10 THEN N'HI'
    WHEN
    @rand < 11 THEN N'ID'
    WHEN
    @rand < 12 THEN N'IL'
    WHEN
    @rand < 13 THEN N'IA'
    WHEN
    @rand < 14 THEN N'IN'
    WHEN
    @rand < 15 THEN N'KS'
    WHEN
    @rand < 16 THEN N'KY'
    WHEN
    @rand < 17 THEN N'LA'
    WHEN
    @rand < 18 THEN N'MA'
    WHEN
    @rand < 19 THEN N'MD'
    WHEN
    @rand < 20 THEN N'MN'
    WHEN
    @rand < 21 THEN N'MI'
    WHEN
    @rand < 22 THEN N'MS'
    WHEN
    @rand < 23 THEN N'MO'
    WHEN
    @rand < 24 THEN N'MT'
    WHEN
    @rand < 25 THEN N'NB'
    WHEN
    @rand < 26 THEN N'NV'
    WHEN
    @rand < 27 THEN N'NH'
    WHEN
    @rand < 28 THEN N'NJ'
    WHEN
    @rand < 29 THEN N'NM'
    WHEN
    @rand < 30 THEN N'NY'
    WHEN
    @rand < 31 THEN N'NC'
    WHEN
    @rand < 32 THEN N'ND'
    WHEN
    @rand < 33 THEN N'OH'
    WHEN
    @rand < 34 THEN N'OK'
    WHEN
    @rand < 35 THEN N'OR'
    WHEN
    @rand < 36 THEN N'PA'
    WHEN
    @rand < 37 THEN N'RI'
    WHEN
    @rand < 38 THEN N'SC'
    WHEN
    @rand < 39 THEN N'SD'
    WHEN
    @rand < 40 THEN N'TN'
    WHEN
    @rand < 41 THEN N'TX'
    WHEN
    @rand < 42 THEN N'UT'
    WHEN
    @rand < 43 THEN N'VT'
    WHEN
    @rand < 44 THEN N'VA'
    WHEN
    @rand < 45 THEN N'WA'
    WHEN
    @rand < 46 THEN N'WV'
    WHEN
    @rand < 47 THEN N'WI'
    WHEN
    @rand < 48 THEN N'WY'
    WHEN
    @rand < 49 THEN N'AR'
    WHEN
    @rand < 50 THEN N'AL'
    END
    DECLARE
    @value MONEY
    SET
    @value = RAND() * 10000
    INSERT
    INTO SALES VALUES (@index, @value, @state)
    SET
    @index = @index - 1
    END
    August 05

    YAMC

    /*

    YAMC

     

    Yet Another Median Calculation

     

    SQL Server 2005 gives us yet another way to calculate a median. The secret is the ROW_NUMBER() function, it can be used to find the "middle" row. ROW_NUMBER() is a bit special because an alias for it cannot be used in a predicate in the same query. Either a CTE or sub-query can be used to get around this.

    The basic idea is to order the data on the value for which you need to calculate a median, then select the row whose ROW_NUMBER() is half the size of the result of the query. There is a problem though, if the number of rows is even, there is no single "middle" row.

    The options here are to pick a value between the "middle" rows or just pick one of the rows. The value between the two "middle" values is called the financial median. If the first of the two "middle" rows is used it is  called the statistical median.

    */

    -- this is a very simple example table

    -- we will use as a source of data for

    -- median calculations

    create table data

    (

    id int Primary Key,

    value

    float

    )

    -- see then end of this note for a helper

    -- script that will fill the table with

    -- some random data.

     

    /*

    Now we look at some solutions using CTE's

    */

    -- this calculates the financial median

    with

    positions

    as

    (

    select

    (1 + Count(*)) / 2 as mid,

    1-(count(*) % 2) as even

    from

    data

    ),

    rows

    as

    (

    select value,

    row_number() over (order by value) as rn

    from

    data

    )

    select

    AVG(value) from rows JOIN positions on

    rn in (positions.mid, positions.mid + positions.even)

     

    /*

    The first CTE, positions, calculates the a couple of aggregates on our sample data.

    position.mid is the number of the middle row if the number of rows is odd, otherwise it is the number of the first middle row.

    position.even is a marker that is one if the number of rows is even and zero otherwise.

    The second CTE, rows, calculates the row number of each value in our sample data when ordered by value. The ROW_NUMBER() is aliased as rows.rn. You might be tempted use this alias in a predicate of this CTE to shortcut the calculation of the median, but an alias for the ROW_NUMBER() function cannot be used in a predicate of the expression that defines it.

    The final expression selects the average value of the values from the positions CTE whose row number is either position.mid or position.mid + position.even. Note that when the number of rows in data is odd both position.mid and position.mid + position.even will be the same value.

    The result of averaging one or two rows will be the financial median for the values in the data table.

    Another way to calculate the median, which can be used in versions of SQL Server prior to 2005, is to move the sorted values into a temporary table that has an identity column. The identity column is, in effect, the same as the ROW_NUMBER(). Then, as the ROW_NUMBER based solution does, pick the "middle" row

    */

    -- temp table median calculation of median

    declare

    @positions TABLE

    (id int identity, value float)

    insert

    into @positions

    select value from data order by value

    DECLARE @mid int

    DECLARE @even int

    SELECT @mid = (1 + count(*))/2,

    @even = 1 - (count(*) % 2)

    from @positions

    select

    AVG(value) from @positions

    where id in (@mid, @mid+@even)

     

    /*

    Is there any advantage to using the ROW_NUMBER based solution? If you compare the actual execution plan of the two calculations run in a single batch, the ROW_NUMBER based one has an insignificant advantage, 48% vs. 52% of the overall plan on my test system.

    The ROW_NUMBER solution does have one nice feature though, it is a SQL expression and the temporary table solution is a batch, not an expression. You can use this fact to improve the relative performance of the ROW_NUMBER based calculation. If you know you are going to do this median calculation very often you can add an index to the value column of the data table. This, of course, is not for free, you are making a space/time tradeoff by spending space to cut down time.

    */

    -- index to speed up ROW_NUMBER

    -- based median calculation

    create

    index value_indx on data(value)

    /*

    With this index in place you will find a notable difference between the execution plans for the ROW_NUMBER and temporary table calculations, the ROW_NUMBER calculation takes up only 37% of the overall plan on my test system.

    The reason for this is that when there is no index that sorts the values the ROW_NUMBER calculation must build a temporary table itself and sort it, just like temporary table calculation does. But the temporary table solution always builds the temporary table, even if there is a index available to sort it. The ROW_NUMBER solution mearly has to walk over the index to find the middle value.

    Calculating the statistical mean, the one that always uses one of the values from the table, takes a bit more work, but not much.

    */

    -- calculation of statistical mean

    with

    positions

    as

    (

    select (1 + Count(*)) / 2 as mid,

    1-(count(*) % 2) as even

    from

    data

    ),

    rows

    as

    (

    select

    value,

    row_number() over (order by value) as rn

    from

    data

    )

    select

    TOP 1 value from rows JOIN positions on

    rn in (positions.mid, positions.mid + positions.even)

    ORDER

    BY rn

     

    /*

    For the statistical mean you must sort the results by row number and pick the first one. If you want the last one then do an DESC sort.

    This calculation isn't a real aggregate, but you can get some of the effect of an aggregate by wrapping it inside of an inline table valued user defined function. Note that because the ROW_NUMBER calculation is a SQL expression you can do this. The temporary table solution can only be wrapped into a multi-line user defined function.

    */

    -- median TVF

    -- note that this function

    -- always returns a table

    -- with a single row

    CREATE

    FUNCTION dataMedian()

    returns

    table

    return

    with

    positions

    as

    (

    select (1 + Count(*)) / 2 as mid,

    1-(count(*) % 2) as even

    from

    data

    ),

    rows

    as

    (

    select

    value,

    row_number() over (order by value) as rn

    from

    data

    )

    select

    AVG(value) as median from rows JOIN positions on

    rn in (positions.mid, positions.mid + positions.even)

     

    /*

    So if you now want to get all the values in the data table that are greater than the median you can do something like:

    */

    Select id, value from data JOIN dbo.dataMedian()

    ON

    value > median

    /*

    So that is yet another median calculation for SQL Server. In some cases it will may provide better performance than using a temporary table, but at the cost of space. It not worse that using a temporary table and gives you the option of making the space/time tradeoff whenever you feel it is necessary.

     

    */

     

    -- helper script to fill sample TABLE data

    set

    nocount on

    declare

    @index int

    set

    @index = 20001

    while

    @index > 0

    BEGIN

    Insert

    into data values (@index, RAND() * 1000)

    set

    @index = @index - 1

    END

     

    May 30

    Well-formed XML AUTO

     

    I've put together a short paper at http://www.sqlservicebroker.com/samples/wellformedxmlauto.zip about how to deal with the fact that in SQL Server 2000 FOR XML AUTO returns a document fragment, not a well formed xml doucment. The paper finishes up by discussing a way to, in fact, get SQL Server 2000 to stream back a well-formed XML document.

     

    http://www.sqlservicebroker.com

    SQL Server 2005 for Developers http://www.sqlservicebroker.com/commingsoon.html

     

    Impersonation on an ADO.NET SqlConnection

     

    ADO.NET lets you create a database connection to SQL Server using either a SQL login or a Windows login with the credentials your program is running under. What if you want to impersonate another Windows user?
     
    Given that you have the users' password and appropriate privleges and some effort, at runtime you can slip another users credentials into an ADO.NET connection.
     
    ...Or you can let SMO (SQL Server Managemement objects) do that work for you.
     
    The ServerConnection object from the Microsoft.SqlServer.Management.Common namespace gives you a bit more flexibility in how to make your connection. A SeverConnection  will login to SQL Server with Windows credentials if you set the ConnectAsUser property to true.
     
    An ADO.NET SqlCommand cannot use a ServerConnection directly, but ServerConnection does have a property named SqlConnectionObject that returns a SqlConnection.
     
    Below is some example console application that shows using a ServerConnection to execute an ADO.NET SqlCommand with an arbitrary Windows identity.

    Note that ServerConnection.Connect() does not send the name and password to SQL Server for authentication, the name and password are authenticated on the machine that executes the Connect() method.This limits the utility of this technique somewhat, but in a later post I will look at a use case where this technique is pretty handy.
     
    Of course this code is doing no error checking and glosses over some of the differences between a ServerConnection and a SqlConnection that is constructed directly.
     
    To use this example you will have to add a reference  to Microsoft.SqlServer.ConnectionInfo.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using Microsoft.SqlServer.Management.Common;
    using System.Data.SqlClient;
    namespace AutoConnect
    {
    class Program
    {
    static void Main(string[] args)
    {
    ServerConnection serverConnection = new ServerConnection();
    SqlCommand cmd =
      new SqlCommand("SELECT SUSER_NAME()", // get login name 
      serverConnection.SqlConnectionObject);
    serverConnection.ConnectAsUser =
    true;
    serverConnection.ConnectAsUserName = args[0]
    ;
    serverConnection.ConnectAsUserPassword = args[1]
    ;
    serverConnection.Connect();
    Console.WriteLine(cmd.ExecuteScalar());
    }
    }
    }

    Program run in a command shell...

    C:>Connect "MiniDan", "P@ssw0rd"

    CANOPUS5\MiniDan

     

    http://www.SqlServiceBroker.com

     

     

    May 18

    XML Shredding in SQL Server 2005

    XML Shredding in SQL Server 2005

    I've written a short tutorial on shedding xml with SQL Server 2005.

    SQL Server 2005 does a pretty good job at mixing and matching relational and xml data. One of the things you sometimes want to do with an xml document is to shred it into tables. Shredding into tables is the process of extracting the data from an xml document and inserting it into one or more tables, and using FOREIGN KEY's to represent any parent/child relationships that exist in the xml document

    This paper is about the new shredding capability of T-SQL. The sources for all of the examples shown accompany this paper. It's available at via the link on the title at beginning of this entry.

    www.sqlservicebroker.com

     

     

    May 13

    Primary key on xml column

    The xml data has a few limitations that other datatypes in SQL Server 2005 do not have, for example you cannot compare two xml values. This has to with the nature of xml, but as a result it prevents you from making an xml column a primary key, sort of... 

    With little help from XQuery and a persisted computed column you can make a primary key on an xml column. For example suppose you have an invoice table and you want to make sure that the same invoice isn't inserted twice. An invoice looks like:

    <Invoice>
      <ID>1200</ID>
      <LineItems>
       <LineItem>stuff</LineItem>
      </LineItems>
    </Invoice>

    You table for invoices would be something like:

    CREATE TABLE INVOICES
    (
    ID ??? PRIMARY KEY,
    Invoice xml
    )

    The question is what should replace ???. The answer is, of course, the content of the <ID> element in the invoice. A user defined function can be used to pull out the ID.

    create function InvId9(@i xml)
    returns varchar(10)
    with schemabinding
    as
    BEGIN
    DECLARE @inv int
    select @inv =
    @i.value('string(/Invoice[1]/ID[1])',
    'varchar(10)')
    return @inv
    END


    Now we can create the table as:

    CREATE TABLE INVOICES
    (
    ID as dbo.InvId9(Invoice) PERSISTED PRIMARY KEY,
    Invoice xml
    )


    There are a couple of gotcha's here. First of all the function used to extract the ID must be schema bound even though it is not use anything from the database schema. The second is that it may not return a MAX type, for example VARCHAR(MAX), because you cannot create an index on a MAX type.

    So there you go, a way to use an xml column as a primary key.

    Dan

    April 29

    User Defined Types

    One of the new features of SQL Server 2005 is User Defined Types. A user defined type is a new scalar type added to the SQL Server type system and is implemented using .Net. A sample of a user defined type for time durations and an article that explains the basics of creating user defined types is at:

    http://www.sqlservicebroker.com/samples/duration.zip