Data Normalizer – the SQL

In Data Normalizer I showed you how I normalize worksheet data using arrays and For/Next loops. I’ve been doing a fair amount of SQL in VBA lately, and thought I’d rewrite the code using that approach.

abnormal data

normalized data

A little searching revealed the T-SQL/SQL Server “Unpivot” command, which normalizes your data and sets the new field names all in one swell foop. It’s not available in Access/Jet SQL though, so can’t be used on Excel. Instead, the preferred method is to use a series of Selects that pick one normalizing column at a time (along with the repeating columns) and Unions them together.

I tried ADO first, using the method of SaveCopyAs’ing the workbook-to-be-normalized in order to avoid the ADO memory leak. ADO was way slower than DAO, something like four times slower with 3000 records of 16 columns. So I went with DAO, which still takes about twice as long as the array method. Turning the ADO to DAO was easy, especially with this concise sample from XL-Dennis.

As Jeff Weir pointed out in a comment, this does require a reference (Tools>References) to the Microsoft DAO 3.5 Object Library. I’ve been switching some code over to late binding, but there doesn’t seem to be much enthusiasm for this with DAO. I’m not sure if that’s because it’s so pervasive and well-established, or for some other reason.

The core logic of the routine is pretty simple. In pseudo-English:

For each column in the columns to be normalized
Select all repeating columns
and Select (create) a new column, giving it the same name each time ("Team" in this example)
and Select the column with that team's data, giving it the same name each time ("Home Runs" in this example
and Union it to the Select statement created in the next loop iteration

Without further ado (heh heh) here’s the routine.

'Requires a reference to DAO 3.5 or later
'Arguments
'List: The range to be normalized.
'RepeatingColsCount: The number of columns, starting with the leftmost,
'   whose headings remain the same.
'NormalizedColHeader: The column header for the rolled-up category.
'DataColHeader: The column header for the normalized data.
'NewWorkbook: Put the sheet with the data in a new workbook?
'
'NOTE: The data must be in a contiguous range and the
'rows that will be repeated must be to the left,
'with the rows to be normalized to the right.

Sub NormalizeList_SQL_DAO(List As Excel.Range, RepeatingColsCount As Long, _
                          NormalizedColHeader As String, DataColHeader As String, _
                          Optional NewWorkbook As Boolean = False)

Dim FirstNormalizingCol As Long, NormalizingColsCount As Long
Dim RepeatingColsHeaders As Variant, NormalizingColsHeaders As Variant
Dim RepeatingColsIndex As Long, NormalizingColsIndex As Long
Dim wbSource As Excel.Workbook, wbTarget As Excel.Workbook
Dim wsTarget As Excel.Worksheet

Dim daoWorkSpace As DAO.Workspace
Dim daoWorkbook As DAO.Database
Dim daoRecordset As DAO.Recordset
Dim strSql As String
Dim strExtendedProperties As String

With List
    'If the normalized list won't fit, you must quit.
    If .Rows.Count * (.Columns.Count - RepeatingColsCount) > .Parent.Rows.Count Then
        MsgBox "The normalized list will be too many rows.", _
               vbExclamation + vbOKOnly, "Sorry"
        Exit Sub
    End If
    'List.Parent.Parent is the lists Workbook
    Set wbSource = List.Parent.Parent
    'The columns to normalize must be to the right of the columns that will repeat
    FirstNormalizingCol = RepeatingColsCount + 1
    NormalizingColsCount = .Columns.Count - RepeatingColsCount
    'Get the header names of the repeating columns
    RepeatingColsHeaders = List.Cells(1).Resize(1, RepeatingColsCount).Value
    'Get the header names of the normalizing columns
    NormalizingColsHeaders = List.Cells(FirstNormalizingCol).Resize(1, NormalizingColsCount).Value

    strSql = vbNullString
    'loop through each normalizing column
    For NormalizingColsIndex = 1 To NormalizingColsCount
        'Create an individual Select for the normalizing column
        strSql = strSql & " SELECT "
        'Select all the repeating columns
        For RepeatingColsIndex = 1 To RepeatingColsCount
            strSql = strSql & RepeatingColsHeaders(1, RepeatingColsIndex) & ", "
        Next RepeatingColsIndex
        'Select the normalizing column and assign the NormalizedColHeader field name
        'and select the data being counted and assign it the DataColHeader field name
        strSql = strSql & "'" & NormalizingColsHeaders(1, NormalizingColsIndex) & "'" & " AS " & NormalizedColHeader & _
                 ", " & NormalizingColsHeaders(1, NormalizingColsIndex) & " AS " & DataColHeader
        strSql = strSql & " FROM [" & List.Parent.Name & _
                 "$" & List.Address(rowabsolute:=False, columnabsolute:=False) & "]"
        If NormalizingColsIndex < NormalizingColsCount Then
            'Union the Select statements created for the normalizing columns
            strSql = strSql & " UNION ALL"
        End If
    Next NormalizingColsIndex
End With
'Set up the DAO connection
strExtendedProperties = "Excel 8.0;HDR=Yes;IMEX=1"
Set daoWorkSpace = DBEngine.Workspaces(0)
Set daoWorkbook = daoWorkSpace.OpenDatabase(wbSource.FullName, False, True, strExtendedProperties)
Set daoRecordset = daoWorkbook.OpenRecordset(strSql, dbOpenForwardOnly)

'Put the normal data in the same workbook, or a new one.
If NewWorkbook Then
    Set wbTarget = Workbooks.Add
    Set wsTarget = wbTarget.Worksheets(1)
Else
    Set wbSource = List.Parent.Parent
    With wbSource.Worksheets
        Set wsTarget = .Add(after:=.Item(.Count))
    End With
End If

'copy the headers and DAO recordset to the new worksheet
With wsTarget
    .Cells(1, 1).Resize(1, RepeatingColsCount).Value = RepeatingColsHeaders
    .Cells(1, RepeatingColsCount + 1) = NormalizedColHeader
    .Cells(1, RepeatingColsCount + 2) = DataColHeader
    .Cells(2, 1).CopyFromRecordset daoRecordset
End With

'clean up
daoRecordset.Close
daoWorkbook.Close
daoWorkSpace.Close
Set daoRecordset = Nothing
Set daoWorkbook = Nothing
Set daoWorkSpace = Nothing
End Sub

If you break after strSql is created, it looks like this:

SELECT League, Year, 'ATL' AS Team, ATL AS HomeRuns FROM [HR-NL$A1:R110]
 UNION ALL
 SELECT League, Year, 'CHC' AS Team, CHC AS HomeRuns FROM [HR-NL$A1:R110]
 UNION ALL
 ...
 SELECT League, Year, 'MIL' AS Team, MIL AS HomeRuns FROM [HR-NL$A1:R110]

Call it like this:

NormalizeList_SQL_DAO ActiveSheet.UsedRange, 2, "Team", "HomeRuns", False

One pitfall of this SQL version is the mixed data-type issue with the Excel ISAM driver. In this example it converts all the home run counts from numbers to text because of the blanks in the data.

All in all, I think the array approach is better than SQL for this use. The core skill of creating SQL in VBA is a valuable one though, and one I’m glad to be developing.

I updated the Data Normalizer .xls to include this code.

6 thoughts on “Data Normalizer – the SQL

  1. Hi Doug. Great post. Couple of thoughts:
    1. It might pay to explicitly set out in your post and code comments that a reference to Microsoft DAO 3.5 Object Library (or later version) is required (although hopefully they’ll see this from the XL-Dennis post you link to
    2. I believe the memory leak issue is fixed in 2010 (and possibly in 2007 also). A while ago I did some testing by querying open workbooks with DAO and ADO. When I queried an open workbook using MSQuery ODBC, in excel 2010 I get no memory leak altogether, but in excel 2003 I got a sizable memory leak. When I queried an open workbook using DAO, in excel 2010 and in excel 2003 I got small (and tolerable) memory leaks. I haven’t tested this with your spreadsheet.
    3. I didn’t know you could reference a range directly in SQL with the [HR-NL$A1:R110] syntax. Cool!
    4. Do you have any other approaches to data normalisation? I’ve got one up my sleeve that you haven’t touched on, and I have a friend that has another great approach, but I won’t detail them here just yet because I don’t want to steal your thunder.

      • Hi again. I’m wondering if it’s possible to instead pull an entire crosstab from the worksheet into an ADO disconnected recordset, then effectively unpivot it by looping through that recordset once for each crosstab column and creating new ‘Description’ and ‘Amount’ columns.

        Any idea if this is possible?

        The reason why I’m interested in this approach is that:
        a) Doing all those UNION ALL queries can be tedious
        b) you may strike Excel’s hard limit whereby it errors out if you have more than 49 UNION ALL clauses (although you can get around that…see http://www.mrexcel.com/forum/excel-questions/315768-creating-pivot-table-multiple-sheets-6.html – particularly comments 65 to 68 )
        c) depending on the complexity of your query, you may get an ‘Out of resources’ error.

        So I thought I’d explore options for manipulating the disconnected recordset. This would require me to add two calculated columns to the recordset: one to record the type of expense (e.g. ‘Depn, Capital Charge’ etc) and the other to record the amount.

        Any idea if that is possible?

        • Thinking about this some more, here’s a simple example and also an alternate approach.

          Say I have a Recordset with 3 fields:
          • Net_Income
          • Tax
          • Gross_Income

          I’d like to unpivot this by:
          1. creating new ‘Description’ and ‘Amount’ fields in the recordset (or a new recordset)
          2. Populate the ‘Description’ field of that recordset with the string “Net_Income” any time that there is an amount in the Net_Income field, and populate the ‘Amount’ field with the amount
          3. Populate the ‘Description’ field with the string “Tax” any time that there is an amount in the Tax field, and populate the ‘Amount’ field with the amount

          I guess that if it’s not possible to do all this within a recordset object itself, I could instead:
          1. Filter the recordset on Net_Income > 0, copy the amount from the Net_Income field to the spreadsheet, create a new column “Description”, then populate that Description column with the word “Net_Income”.
          2. Filter the recordset on Tax > 0, copy the amounts from the Tax field to the spreadsheet, then populate that Description column with the word “Tax”.

          But it would still be informative to know if all this can be achieved within a recordset object itself…that way I’d only be writing to the spreadsheet once.

          • Some more reading on the www makes me think that I’ll have to loop through each crosstab field in the rs one row at a time, copy each value in that field to a ‘Value’ column in the spreadsheet, then populate the ‘Description’ field with the current rs record’s field name.

            That is, I don’t believe I can do this operation on an entire field basis in one go.

            Either that, or I’ll have to execute seperate SQL statements – one for each calculated field – rather than try and do everything in one go, given the complexity of this query.

            Or I’ll just unpivot the data AFTER it’s been returned as a crosstab. Which will probably be slowest.

Speak Your Mind

Your email address will not be published. Required fields are marked *

To post code, do this: <code> your vba here </code>