Get Unique Per-Row Values With RemoveDuplicates

This post covers two topics, both involving VBA’s RemoveDuplicates method. First I discuss an error and a shortcoming in the RemoveDuplicates documentation that have confounded me and others. After those are cleared up, I’ll show how to use RemoveDuplicates to return a two-dimensional array of unique items from a range containing one or more columns.

RemoveDuplicates was added to VBA (along with its front-end counterpart Data>Remove Duplicates) in Excel 2007. It’s very handy, allowing you to select one or more columns in a range and winnow it down to the rows where the combined values of those columns are unique. It does this “in-place,” leaving other areas of the sheet intact.

RemoveDuplicates’ Columns Argument Isn’t Really Optional

The Excel 2013 MSDN documentation describes the variant Columns argument as an optional “Array of indexes of the columns that contain the duplicate information. If nothing is passed then it assumes all columns contain duplicate information.” (my emphasis)

Later it gives an example of leaving out the Columns argument to remove duplicates from the entire range:

“The following code sample removes duplicates with all columns from a range.”

ActiveSheet.Range("A1:C100").RemoveDuplicates

However, testing reveals that the above code does nothing. This is confirmed by the perplexed comments of various forum visitors. In truth, you have to specify the Columns variant array in all cases. The help example should look like this:

ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=Array(1,2,3)

If you want to base the removal on just some of the columns in a range, for example columns A and C, you do it like this:

ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=Array(1,3)

Using a Dynamic Columns Array

If you use RemoveDuplicates in code, there’s a good chance you’ll want to determine the array of column numbers at runtime and not hard-code them as above. As a simple test, I tried something like:

Dim ColumnsToCheck as Variant
ColumnsToCheck = Array(1,3)
ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=ColumnsToCheck

However this results in a Runtime Error ‘5’ – Invalid Procedure Call or Argument.

After looking around the web a while I found that the answer is to wrap ColumnsToCheck in parentheses, so the code becomes:

Dim ColumnsToCheck as Variant
ColumnsToCheck = Array(1,3)
ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=(ColumnsToCheck)

I know, from one of my all-time favorite DDOE posts, that parentheses force evaluation of what’s inside them. Others on the web said that these parens “coerce” the evaluation of the array. I don’t really see how that applies here, so if you can explain why this works, please leave a comment.
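
A quick way to see that evaluation behavior is to pass an array to a procedure with and without the extra parentheses (my own sketch; the Sub names are made up):

```vba
Sub ParensDemo()
    Dim arr As Variant
    arr = Array(1, 3)
    ChangeFirst arr        'passed ByRef: the Sub changes our array
    Debug.Print arr(0)     'prints 99
    arr = Array(1, 3)
    ChangeFirst (arr)      'parens force evaluation, so a copy is passed
    Debug.Print arr(0)     'still prints 1
End Sub

Sub ChangeFirst(v As Variant)
    v(0) = 99
End Sub
```

Whatever the underlying mechanism, the parentheses appear to hand RemoveDuplicates an evaluated copy rather than a reference to the variable.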

One other note about the Columns variant array – it must be zero-based.
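
A runtime-built version of the earlier example might therefore look like this (a sketch; the range and column count are just examples):

```vba
Dim ColNums As Variant
Dim n As Long, i As Long
n = 3                          'number of columns to check
ReDim ColNums(0 To n - 1)      'the array itself must be zero-based...
For i = 0 To n - 1
    ColNums(i) = i + 1         '...but the column numbers it holds are 1-based
Next i
ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=(ColNums)
```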

A Unique Per-Row Values Array Function

Once I’d sorted out the above points, I was able to use RemoveDuplicates as the basis of a function that returns a two-dimensional variant array containing the unique rows in a range. The range can consist of multiple columns that aren’t necessarily adjacent, as in this table of hometowns and sports of 2008 US Olympic athletes:

Olympics table

In the example above I want to return a variant array of all the unique combinations of the State and Sport fields. The specific use is for a worksheet splitter, where I prompt the user for one or two columns and then cycle through the array of unique combinations, creating a separate workbook for each combination.

The code is simple enough. It takes a range as its argument, for example the State and Sport fields above, and copies it to a worksheet in a newly added workbook. We then use RemoveDuplicates on the entire UsedRange of that new sheet. Because the columns are now contiguous, the code can easily assign the modified range back to a variant array. The temporary workbook is then closed without saving.

Function GetUniqueRowValues(rng As Excel.Range) As Variant
Dim wbTemp As Excel.Workbook
Dim wsTemp As Excel.Worksheet
Dim ColNums As Variant
Dim i As Long

'if only one cell selected then can't pass to two-dimensional array
If rng.Cells.Count = 1 Then
    'if the single cell is blank
    If rng.Value = vbEmpty Then
        Exit Function
    Else
        GetUniqueRowValues = rng.Value
        Exit Function
    End If
End If

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
'create a temp workbook
'and copy range to it
Set wbTemp = Workbooks.Add
Set wsTemp = wbTemp.Worksheets(1)
With wsTemp
    rng.Copy
    'rng can consist of non-adjacent columns
    'pasting to a new workbook results in a
    'new range with only those columns
    .Range("A1").PasteSpecial xlPasteValues
    ReDim ColNums(0 To wsTemp.UsedRange.Columns.Count - 1)
    'Create the array of column numbers for the
    'RemoveDuplicates' Columns parameter
    For i = LBound(ColNums) To UBound(ColNums)
        ColNums(i) = i + 1
    Next i
    .UsedRange.RemoveDuplicates Columns:=(ColNums), Header:=xlGuess
    'assign the values of the remaining,
    'non-duplicated rows to the function result
    GetUniqueRowValues = .UsedRange
End With
wbTemp.Close False
Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic
End Function

Note that if the original range was just one cell, the returned variant will not be an array. But you’d test for that before running this in the first place, wouldn’t you?

This code is fast for big ranges, handling over one million rows in less than five seconds on my laptop. On the other hand, it has to open and close a workbook, which takes part of a second and causes screen flicker. At the least, it’s a lot more flexible than my attempts to modify the Collection technique to handle large ranges with non-adjacent columns.
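
For example, the function might be called like this (hypothetical ranges, just to show the shape of the result):

```vba
Sub TestGetUniqueRowValues()
    Dim Uniques As Variant
    Dim i As Long
    'non-adjacent columns, e.g. State in B and Sport in D
    Uniques = GetUniqueRowValues(Union(ActiveSheet.Range("B1:B100"), _
                                       ActiveSheet.Range("D1:D100")))
    If IsArray(Uniques) Then
        For i = LBound(Uniques, 1) To UBound(Uniques, 1)
            Debug.Print Uniques(i, 1), Uniques(i, 2)
        Next i
    End If
End Sub
```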

25 thoughts on “Get Unique Per-Row Values With RemoveDuplicates”

  1. Enclosing an argument in parentheses causes the argument to be passed “by value” instead of “by reference.” In VBA, arrays need to be passed “by reference,” but apparently RemoveDuplicates needs a “by value” value. So my best guess is that this is why you get an error when you try to pass the array “by reference” instead of “by value.”

    Some links to look at:

    Under the header: Converting A ByRef Parameter To ByVal
    http://www.cpearson.com/excel/byrefbyval.aspx

    Under the header: Passing Arrays To Procedures
    http://www.cpearson.com/excel/passingandreturningarrays.htm

    • Jon, thanks for the references. (And for your really nice comment on a previous post!)

      So if arrays have to be passed ByRef, then how can it require ByVal here?

      I’m trying to think of any other function arguments that act this way. Can you think of any?

      • It would be interesting to look at any functions introduced in Excel 2007 and newer that take array arguments and see if they behave similarly. Isn’t passing arrays by value a relatively “new” thing in programming, at least in VB?

        That is my best guess, and of course, it is only a guess right now as to why it might work this way.

  2. Thank you for the thoughtful post and comments.

    The insistence on 0-based arrays is what confuses/disappoints me the most. Having learned the coding craft (FORTRAN/COBOL/others) too many years ago, the 0-base doesn’t bother me, but I don’t like the extra code associated with subtracting 1 from a counter when loading the array. Furthermore, given that someone who may not be 0-based conversant will one day have to modify my code, I always, as a rule, use OPTION BASE 1 to make my code more “readable.”

    I love VBA and its implementation, but I do think MS should take a serious look at the REMOVEDUPLICATES method and clean it up. There are too many aspects of the COLUMNS parameter that are confusing and counter-intuitive.

  3. This might be faster:

    Sub M_snb()
        sn = Sheet1.Cells(1).CurrentRegion
       
        With CreateObject("scripting.dictionary")
            For j = 1 To UBound(sn)
              .Item(sn(j, 2) & "_" & sn(j, 4)) = Array(sn(j, 2), sn(j, 4))
            Next
            Sheet1.Cells(1, 10).Resize(.Count, 2) = Application.Index(.items, 0, 0)
        End With
    End Sub
    • Nice code. I like the use of Index to dump it back into the sheet. I really need to read more stuff on your website, as it seems to be one of the best, if not the best, sources for array handling code.

      It seems to be about the same speed for a million rows.

      Mine does an extra step of dumping the results back into an array, but I think often it would be just as good, or better, to work directly with the dictionary created by your code.
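
      For instance, something like this reads the unique pairs straight from the dictionary without dumping them anywhere (a sketch based on snb’s code; the column numbers are from his example):

      ```vba
      Sub UseDictionaryDirectly()
          Dim sn As Variant, j As Long, k As Variant
          sn = Sheet1.Cells(1).CurrentRegion
          With CreateObject("Scripting.Dictionary")
              For j = 1 To UBound(sn)
                  .Item(sn(j, 2) & "_" & sn(j, 4)) = Array(sn(j, 2), sn(j, 4))
              Next
              For Each k In .Keys
                  Debug.Print .Item(k)(0), .Item(k)(1)   'e.g. State, Sport
              Next k
          End With
      End Sub
      ```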

      I notice that neither of them distinguishes between a cell containing ‘1 (text) and 1 (a number). They both distinguish between ‘01 and 1.

      • Note that using INDEX is potentially dangerous: you’ll get an error if the array is over 65536 rows, as outlined at http://dailydoseofexcel.com/archives/2013/10/11/worksheetfunction-index-limitations/

        And getting an array over that size out of a dictionary and into Excel is also problematic, as per my post at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/

        In order to transfer the Dictionary from VBA to Excel, we’ve got to first Transpose the array to a Column Vector. And we use Application.Transpose to do that. Which fails if you end up with more than 65536 non-duplicate values in the dictionary.

        I found that it was fastest to simply use the dictionary as a means to identify non-duplicates, and then write those non-duplicates to an appropriately oriented variant array one by one as they were identified, so you could then transfer that array to Excel without having to transpose it.
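
        Something along these lines, I believe (my own sketch of that approach; the sheet and ranges are examples):

        ```vba
        Sub UniquesWithoutTranspose()
            Dim dic As Object, v As Variant, out As Variant
            Dim i As Long, n As Long
            Set dic = CreateObject("Scripting.Dictionary")
            v = Sheet1.Range("A1:A100000").Value
            'size the output with rows as the first dimension,
            'so no Application.Transpose is needed at the end
            ReDim out(1 To UBound(v, 1), 1 To 1)
            For i = 1 To UBound(v, 1)
                If Not dic.Exists(v(i, 1)) Then
                    dic.Add v(i, 1), Empty
                    n = n + 1
                    out(n, 1) = v(i, 1)   'write each unique as it's identified
                End If
            Next i
            'only the first n rows of out get written back
            Sheet1.Range("C1").Resize(n, 1).Value = out
        End Sub
        ```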

  4. Thanks !

    In Excel, as far as I know, ‘1 and 1 are intended to be identical, while “01” and 1 are not.
    To treat all those as identical you could use Val().
    It all depends on the criteria you want to use for ‘identicalness’.
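
    For example (a sketch; note that a cell entered as ‘1 comes back to VBA as the text "1"):

    ```vba
    Sub ValDemo()
        Debug.Print Val("1") = Val(1)    'True: text "1" treated like the number 1
        Debug.Print Val("01") = Val(1)   'True: the leading zero is ignored
        Debug.Print ("01" = "1")         'False: compared as strings they differ
    End Sub
    ```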

  5. An alternative that could also explain the ByVal character of the reference to the columns that have to be taken into account in the comparison.

    Sub M_snb()
        sn = Sheet1.Cells(1).CurrentRegion
       
        With CreateObject("scripting.dictionary")
            For j = 1 To UBound(sn)
              .Item(sn(j, 2) & "_" & sn(j, 4)) = j
            Next

            Sheet1.Cells(1, 10).Resize(.Count, 2) = Application.Index(sn, Application.Transpose(.items), Array(2, 4))
        End With
    End Sub

    NB. Now I use the dictionary only as the row-filtering argument.
    The array ‘sn’ will be filtered according to a 2-dimensional array (because the 1-dimensional Dictionary array ‘.items’ has been converted to a 2-dimensional one by ‘transpose’) and according to the array ‘array(2,4)’ that indicates that columns 2 and 4 have to be filtered by the ‘Application.Index’ function.
    When I compare the previous method I posted to this one they both run at the same speed.

  6. Nice post. Another approach would be to create a PivotTable out of the sucker, and use the Show Report Filter Pages to spit out a separate Pivot containing just the data for each unique combination.

  7. Doug, I just saw this today and wondered how the following stacks up:

    Sub ExcelHero_Test()

        v = GetUniqueRowValues([a1:z1000000], Array(1, 5, 2))
        MsgBox UBound(v)

    End Sub

    Function GetUniqueRowValues(rData As Range, Columns)
        Dim j&, c
        Application.DisplayAlerts = False
        Application.ScreenUpdating = False
        Application.Calculation = xlCalculationManual
            With Sheets.Add
                For Each c In Columns
                    j = j + 1:  .Cells(, j) = rData(, c)
                Next
                rData.AdvancedFilter 2, , .[a1].Resize(, j), 1
                GetUniqueRowValues = .UsedRange
                .Delete
            End With
        Application.Calculation = xlCalculationAutomatic
        Application.ScreenUpdating = True
        Application.DisplayAlerts = True
    End Function
    • Hey Daniel, thanks for dropping by!

      It seems to take many minutes. A test on three entire columns, getting the dupes from columns 1 and 3, takes about 50 seconds with the code in this post, and is still going after four minutes with yours (at which point it looks like it’s about 20% through).

      If you compare the two what do you get?

      • That’s odd. I tested the code in this post and snb’s dictionary code pulling four columns of unique rows from 26 columns of one million rows.

        Both took approx 20 seconds.

        My code took approx 1 second.

        • This sounds like a conversation I had with Craig Hatmaker at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/ where we were getting wildly different results, and we finally tracked down the issue: very different datasets.

          While I was looking at de-duping just one column in that post, I found that:
          - Remove Duplicates is very fast on datasets with lots of unique values if they are sorted.
          - Remove Duplicates is very slow on datasets with lots of unique values if they are not sorted.
          - Dictionaries are fastest on datasets sorted A to Z, slower on datasets sorted Z to A, and slowest on unsorted datasets. (That’s my surname with a ‘d’ on the end. Weird.)
          - The Advanced Filter simply can’t handle big arrays with numerous non-duplicates quickly.

          • Yup, it looks like it’s the same issue: the Advanced Filter doesn’t like lots of variability.

            Fill 5 columns with =RANDBETWEEN(1,100), then convert to values. You’ll find your code runs in seconds.

            Now fill 5 columns with =RANDBETWEEN(1,100000) then convert to numbers, and you’ll find that you won’t be able to use Excel for a long, long time.

          • Hey Jeff, thanks for the info. FYI, when I ran it last night I was using RANDBETWEEN 1 and 999. Also FYI I didn’t convert to values. Today when I run it with RB 1 and 100,000 it’s about 130 seconds, with about 5,500 duplicates. For 1,000,000 it’s also about 130 seconds (and no dupes anyways).

            Also, Daniel, if you’re still with us, the Advanced Filter didn’t like that I just had numbers in the header row (or maybe it was that they were formulas).

          • Today when I run it with RB 1 and 100,000 it’s about 130 seconds, with about 5,500 duplicates.

            Doug: when you say ‘duplicates’ I take it you mean ‘unique values’?

            Those numbers don’t sound right to me. I suspect we’re talking at cross purposes on something, because with three columns of =RANDBETWEEN(1,100000) there are 100000^3 possible combinations that each line could take. So the odds of getting a duplicate of a particular given combination in Excel’s grid are 1048576/100000^3, i.e. incredibly remote. Given this, every single row should be identified as unique. So run that sub on this data, and you should get out as many rows as you fed in.

            Let’s look at a smaller subset: using =RANDBETWEEN(1,10) across three columns. There are 10^3 = 1000 possible combinations. For 1048576 rows, any one combination can be expected to occur around 1048576/10^3 ≈ 1048 times. At that rate, when we dedupe we are almost certainly going to get out all 1000 combinations.

            For =RANDBETWEEN(1,100) across three columns, there are 100^3 = 1,000,000 possible combinations. So we would expect to see each particular combination come up on average 1048576/1,000,000 times, pretty much one to one. In other words, it’s unlikely that many combinations would appear more than once, meaning we’re probably not going to see that many duplicates.

            For =RANDBETWEEN(1,1000) across three columns, there are 1000^3 = 1,000,000,000 possible combinations. We’re almost certainly not going to get duplicates. In other words, given a particular combination, you could expect to press F9 1000 times before that line comes up in any of those 1048576 rows.
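
            To put rough numbers on that (my own sketch), the expected number of distinct combinations among R rows drawn uniformly from C possibilities is C*(1-(1-1/C)^R), so the expected duplicate count is:

            ```vba
            'Expected number of rows RemoveDuplicates would report as duplicates,
            'for R random rows drawn uniformly from C possible combinations
            Function ExpectedDuplicates(R As Double, C As Double) As Double
                ExpectedDuplicates = R - C * (1 - (1 - 1 / C) ^ R)
            End Function
            ```

            For 1048576 rows and 10^3 combinations that gives roughly 1,047,576 duplicates; for 10^15 combinations it is effectively zero.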

          • Jeff, it’s actually only two columns – column A and C with column B not selected. By “duplicates” I mean the same thing Excel means when it says “5,500 duplicates removed.” I tried it again just now and it was only 38. Perhaps my lab assistant miscounted the zeroes last night, or maybe I should go to Las Vegas.

      • It is interesting that it really depends on the data. Just to prove my point, please have a look at this workbook:

        http://www.excelhero.com/samples/uniques_advancedfilter.xlsm

        It’s 112 MB and has over 900k rows. But it is highly repetitive, which goes to support Jeff’s assertion. To phrase it in reverse, the AdvancedFilter really likes repetitive data. But to be fair we should add a caveat: that assertion has been demonstrated only when exclusively using the optional Unique parameter with no Criteria. I’m not sure it holds once criteria are thrown into the mix.

        I have your code and mine in Module 1. Run them and you’ll see a dramatic difference. On my machine, my code completes in 1 second, while the RemoveDuplicates routine requires 19 seconds.

        This very bizarre data set came from this challenge:

        http://stackoverflow.com/questions/30959315/excel-vba-performance-1-million-rows-delete-rows-containing-a-value-in-less/33192437#33192437

        …where the AdvancedFilter proved nimble.

        But I see that changing the data to include fewer duplicates very quickly smacks the AdvancedFilter into virtual uselessness.

        …So I went Old School and wrote this:

        Sub GetArrayUniqueRowsOldSchool_Test()
           
            Dim t: t = Now: Debug.Print: Debug.Print t: Beep
           
            v = GetArrayUniqueRowsOldSchool([a1:e1000001], 1)
           
            Debug.Print (Now - t) * 86400: Debug.Print UBound(v): Beep
               
        End Sub


        Function GetArrayUniqueRowsOldSchool(Data As Range, Optional Header As Long = xlNo)
            Dim c&, i&, j&, k&, cols&, v, w, bMatch As Boolean, r As Range
            Dim a: Set a = Application
            a.ScreenUpdating = 0
            a.Calculation = xlCalculationManual
                Workbooks.Add
                Data.Copy
                With ActiveSheet
                    .[a1].PasteSpecial xlPasteValues
                    Set r = .UsedRange
                    Set r = r.Resize(r.Rows.Count + 1)
                    cols = r.Columns.Count
                    With .Sort
                        .SortFields.Clear
                        For j = 1 To cols
                            .SortFields.Add r.Columns(j)
                        Next
                        .SetRange r
                        .Header = Header
                        .Apply
                    End With
                    v = r
                    Workbooks(Workbooks.Count).Close 0
                    For i = 2 To UBound(v)
                        k = i - 1
                        bMatch = True
                        For j = 1 To cols
                            If v(i, j) <> v(k, j) Then
                                bMatch = False: Exit For
                            End If
                        Next
                        If Not bMatch Then
                            c = c + 1
                            For j = 1 To cols
                                v(c, j) = v(k, j)
                            Next
                        End If
                    Next
                    ReDim w(1 To c, 1 To cols)
                    For i = 1 To c
                        For j = 1 To cols
                            w(i, j) = v(i, j)
                        Next
                    Next
                    GetArrayUniqueRowsOldSchool = w
                End With
            a.Calculation = xlCalculationAutomatic
            a.ScreenUpdating = 1
        End Function

        This code has the exact opposite characteristic. It is very fast under every circumstance I’ve thrown at it. For example, in the workbook below, on one million rows with nearly a million uniques it takes just 6 seconds. The RemoveDuplicates routine requires 179 seconds to return the same array:

        http://www.excelhero.com/samples/unique_showdown.xlsm

        • Daniel, Very nice! I tried it with two almost-full non-contiguous columns with values from 1 to 100,000 and it completed, as you said, in the six second range.

          It’s a nice simple concept too. Just sort and compare.

          My only quibble is that it breaks if I select full columns, because it needs to resize by one more row.

          And thinking some more, I wondered what would happen if I added a Sort to my routine. I tried it first just in Excel and it was quick. Sure enough, adding a sort to my original routine makes it also run in six seconds. I’ll be re-doing this post and thanking you.

          • Daniel was unable to post his response to my quibble about not being able to select full columns, so asked me to post this code for him. Thanks Daniel:

            Sub UniqueRowsOldSchool_Test()

                Dim t: t = Now: Debug.Print: Debug.Print t: Beep

                v = UniqueRowsOldSchool([a:e], 1)

                Debug.Print (Now - t) * 86400: Debug.Print UBound(v): Beep

            End Sub


            Function UniqueRowsOldSchool(Data As Range, Optional Header As Long = xlNo)
                Dim c&, i&, j&, k&, cols&, v, w, bMatch As Boolean, bDone As Boolean, r As Range
                Dim a: Set a = Application
                a.ScreenUpdating = 0
                a.Calculation = xlCalculationManual
                    Workbooks.Add
                    Data.Copy
                    With ActiveSheet
                        .[a1].PasteSpecial xlPasteValues
                        Set r = .UsedRange
                        Set r = r.Resize(r.Rows.Count)
                        cols = r.Columns.Count
                        With .Sort
                            For j = 1 To cols
                                .SortFields.Add r.Columns(j)
                            Next
                            .SetRange r
                            .Header = Header
                            .Apply
                        End With
                        v = r
                        Workbooks(Workbooks.Count).Close 0
                        For i = 2 To UBound(v)
                            If bDone Then Exit For
                            k = i - 1
            Compare:
                            bMatch = True
                            For j = 1 To cols
                                If v(i, j) <> v(k, j) Then
                                    bMatch = False: Exit For
                                End If
                            Next
                            If Not bMatch Then
                                c = c + 1
                                For j = 1 To cols
                                    v(c, j) = v(k, j)
                                Next
                            End If
                        Next
                        If Not bDone Then bDone = 1: i = k: k = UBound(v): GoTo Compare
                        ReDim w(1 To c, 1 To cols)
                        For i = 1 To c
                            For j = 1 To cols
                                w(i, j) = v(i, j)
                            Next
                        Next
                        UniqueRowsOldSchool = w
                    End With
                a.Calculation = xlCalculationAutomatic
                a.ScreenUpdating = 1
            End Function
        • Nice approach, Daniel. I love the idea to sort on all 5 columns in order to group duplicates together, so that they could be easily identified by virtue of being identical to the row below, and then discarded. Or rather, included as soon as one element doesn’t match the other (which is even faster).

          Previously I’d used sorting to speed things up such as the time it takes for a Dictionary or Remove Duplicates to process things. I hadn’t thought about using sort to separate the wheat from the chaff right there in the field.

          Thanks for the insight.
