Get Unique Per-Row Values With RemoveDuplicates

This post covers two topics, both involving VBA’s RemoveDuplicates method. First I discuss an error and a shortcoming in the RemoveDuplicates documentation that have confounded me and others. After those are cleared up, I’ll show how to use RemoveDuplicates to return a two-dimensional array of unique items from a range containing one or more columns.

RemoveDuplicates was added to VBA (along with its front-end counterpart Data>Remove Duplicates) in Excel 2007. It’s very handy, allowing you to select one or more columns in a range and winnow it down to the rows where the combined values of those columns are unique. It does this “in-place,” leaving other areas of the sheet intact.

RemoveDuplicates’ Columns Argument Isn’t Really Optional

The Excel 2013 MSDN documentation describes the variant Columns argument as an optional “Array of indexes of the columns that contain the duplicate information. If nothing is passed then it assumes all columns contain duplicate information.” (my emphasis)

Later it gives an example of leaving out the Columns argument to remove duplicates from the entire range:

“The following code sample removes duplicates with all columns from a range.”

ActiveSheet.Range("A1:C100").RemoveDuplicates

However, testing reveals that the above code does nothing. This is confirmed by the perplexed comments of various forum visitors. In truth, you have to specify the Columns variant array in all cases. The help example should look like this:

ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=Array(1,2,3)

If you want to base the removal on just some of the columns in a range, for example columns A and C, you do it like this:

ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=Array(1,3)

Using a Dynamic Columns Array

If you use RemoveDuplicates in code, there’s a good chance you’ll want to determine the array of column numbers at runtime and not hard-code them as above. As a simple test, I tried something like:

Dim ColumnsToCheck as Variant
ColumnsToCheck = Array(1,3)
ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=ColumnsToCheck

However this results in a Runtime Error ‘5’ – Invalid Procedure Call or Argument.

After looking around the web a while I found that the answer is to wrap ColumnsToCheck in parentheses, so the code becomes:

Dim ColumnsToCheck as Variant
ColumnsToCheck = Array(1,3)
ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=(ColumnsToCheck)

I know, from one of my all-time favorite DDOE posts, that parentheses force evaluation of what’s inside them. Others on the web said that these parens “coerce” the evaluation of the array. I don’t really see how that applies here, so if you can explain why this works, please leave a comment.
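
A quick way to see that evaluation behavior is to pass an array to a procedure with and without the extra parentheses (my own sketch; the Sub names are made up):

```vba
Sub ParensDemo()
    Dim arr As Variant
    arr = Array(1, 3)
    ChangeFirst arr        'passed ByRef: the Sub changes our array
    Debug.Print arr(0)     'prints 99
    arr = Array(1, 3)
    ChangeFirst (arr)      'parens force evaluation, so a copy is passed
    Debug.Print arr(0)     'still prints 1
End Sub

Sub ChangeFirst(v As Variant)
    v(0) = 99
End Sub
```

Whatever the underlying mechanism, the parentheses appear to hand RemoveDuplicates an evaluated copy rather than a reference to the variable.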

One other note about the Columns variant array – it must be zero-based.
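
A runtime-built version of the earlier example might therefore look like this (a sketch; the range and column count are just examples):

```vba
Dim ColNums As Variant
Dim n As Long, i As Long
n = 3                          'number of columns to check
ReDim ColNums(0 To n - 1)      'the array itself must be zero-based...
For i = 0 To n - 1
    ColNums(i) = i + 1         '...but the column numbers it holds are 1-based
Next i
ActiveSheet.Range("A1:C100").RemoveDuplicates Columns:=(ColNums)
```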

A Unique Per-Row Values Array Function

Once I’d sorted out the above points, I was able to use RemoveDuplicates as the basis of a function that returns a two-dimensional variant array containing the unique rows in a range. The range can consist of multiple columns that aren’t necessarily adjacent, as in this table of hometowns and sports of 2008 US Olympic athletes:

Olympics table

In the example above I want to return a variant array of all the unique combinations of the State and Sport fields. The specific use is for a worksheet splitter, where I prompt the user for one or two columns and then cycle through the array of unique combinations, creating a separate workbook for each combination.

The code is simple enough. It takes a range as its argument, for example the State and Sport fields above, and copies it to a worksheet in a newly added workbook. We then use RemoveDuplicates on the entire UsedRange of that new sheet. Because the columns are now contiguous, the code can easily assign the modified range back to a variant array. The temporary workbook is then closed without saving.

Function GetUniqueRowValues(rng As Excel.Range) As Variant
Dim wbTemp As Excel.Workbook
Dim wsTemp As Excel.Worksheet
Dim ColNums As Variant
Dim i As Long

'if only one cell selected then can't pass to two-dimensional array
If rng.Cells.Count = 1 Then
    'if the single cell is blank
    If rng.Value = vbEmpty Then
        Exit Function
    Else
        GetUniqueRowValues = rng.Value
        Exit Function
    End If
End If

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
'create a temp workbook
'and copy range to it
Set wbTemp = Workbooks.Add
Set wsTemp = wbTemp.Worksheets(1)
With wsTemp
    rng.Copy
    'rng can consist of non-adjacent columns
    'pasting to a new workbook results in a
    'new range with only those columns
    .Range("A1").PasteSpecial xlPasteValues
    ReDim ColNums(0 To wsTemp.UsedRange.Columns.Count - 1)
    'Create the array of column numbers for the
    'RemoveDuplicates' Columns parameter
    For i = LBound(ColNums) To UBound(ColNums)
        ColNums(i) = i + 1
    Next i
    .UsedRange.RemoveDuplicates Columns:=(ColNums), Header:=xlGuess
    'assign the values of the remaining,
    'non-duplicated rows to the function result
    GetUniqueRowValues = .UsedRange
End With
wbTemp.Close False
Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic
End Function

Note that if the original range was just one cell, the returned variant will not be an array. But you’d test for that before running this in the first place, wouldn’t you?

This code is fast for big ranges, handling over one million rows in less than five seconds on my laptop. On the other hand, it has to open and close a workbook, which takes part of a second and causes screen flicker. At the least, it’s a lot more flexible than my attempts to modify the Collection technique to handle large ranges with non-adjacent columns.
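
For example, the function might be called like this (hypothetical ranges, just to show the shape of the result):

```vba
Sub TestGetUniqueRowValues()
    Dim Uniques As Variant
    Dim i As Long
    'non-adjacent columns, e.g. State in B and Sport in D
    Uniques = GetUniqueRowValues(Union(ActiveSheet.Range("B1:B100"), _
                                       ActiveSheet.Range("D1:D100")))
    If IsArray(Uniques) Then
        For i = LBound(Uniques, 1) To UBound(Uniques, 1)
            Debug.Print Uniques(i, 1), Uniques(i, 2)
        Next i
    End If
End Sub
```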

25 thoughts on “Get Unique Per-Row Values With RemoveDuplicates”

  1. Enclosing an argument in parentheses causes the argument to be passed “by value” instead of “by reference.” In VBA, arrays need to be passed “by reference,” but apparently RemoveDuplicates needs a “by value” value. So my best guess is that this is why you get an error when you try to pass the array “by reference” instead of “by value.”

    Some links to look at:

    Under the header: Converting A ByRef Parameter To ByVal
    http://www.cpearson.com/excel/byrefbyval.aspx

    Under the header: Passing Arrays To Procedures
    http://www.cpearson.com/excel/passingandreturningarrays.htm

    • Jon, thanks for the references. (And for your really nice comment on a previous post!)

      So if arrays have to be passed ByRef, then how can it require ByVal here?

      I’m trying to think of any other function arguments that act this way. Can you think of any?

      • It would be interesting to look at any functions introduced in Excel 2007 and newer that take array arguments and see if they behave similarly. Isn’t passing arrays by value a relatively “new” thing in programming, at least in VB?

        That is my best guess, and of course, it is only a guess right now as to why it might work this way.

  2. Thank you for the thoughtful post and comments.

    The insistence on 0-based arrays is what confuses/disappoints me the most. Having learned the coding craft (FORTRAN/COBOL/others) too many years ago, the 0-base doesn’t bother me, but I don’t like the extra code associated with subtracting 1 from a counter when loading the array. Furthermore, given that someone who may not be 0-based conversant will one day have to modify my code, I always, as a rule, use OPTION BASE 1 to make my code more “readable.”

    I love VBA and its implementation, but I do think MS should take a serious look at the REMOVEDUPLICATES method and clean it up. There are too many aspects of the COLUMNS parameter that are confusing and counter-intuitive.

  3. This might be faster:

    Sub M_snb()
        sn = Sheet1.Cells(1).CurrentRegion
       
        With CreateObject("scripting.dictionary")
            For j = 1 To UBound(sn)
              .Item(sn(j, 2) & "_" & sn(j, 4)) = Array(sn(j, 2), sn(j, 4))
            Next
            Sheet1.Cells(1, 10).Resize(.Count, 2) = Application.Index(.items, 0, 0)
        End With
    End Sub
    • Nice code. I like the use of Index to dump it back into the sheet. I really need to read more stuff on your website, as it seems to be one of the best, if not the best, sources for array handling code.

      It seems to be about the same speed for a million rows.

      Mine does an extra step of dumping the results back into an array, but I think often it would be just as good, or better, to work directly with the dictionary created by your code.
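
      For instance, something like this reads the unique pairs straight from the dictionary without dumping them anywhere (a sketch based on snb’s code; the column numbers are from his example):

      ```vba
      Sub UseDictionaryDirectly()
          Dim sn As Variant, j As Long, k As Variant
          sn = Sheet1.Cells(1).CurrentRegion
          With CreateObject("Scripting.Dictionary")
              For j = 1 To UBound(sn)
                  .Item(sn(j, 2) & "_" & sn(j, 4)) = Array(sn(j, 2), sn(j, 4))
              Next
              For Each k In .Keys
                  Debug.Print .Item(k)(0), .Item(k)(1)   'e.g. State, Sport
              Next k
          End With
      End Sub
      ```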

      I notice that neither of them distinguishes between a cell containing ‘1 (text) and 1 (a number). They both distinguish between ‘01 and 1.

      • Note that using INDEX is potentially dangerous: you’ll get an error if the array is over 65536 rows, as outlined at http://dailydoseofexcel.com/archives/2013/10/11/worksheetfunction-index-limitations/

        And getting an array over that size out of a dictionary and into Excel is also problematic, as per my post at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/

        In order to transfer the Dictionary from VBA to Excel, we’ve got to first Transpose the array to a Column Vector. And we use Application.Transpose to do that. Which fails if you end up with more than 65536 non-duplicate values in the dictionary.

        I found that it was fastest to simply use the dictionary as a means to identify non-duplicates, and then write those non-duplicates to an appropriately oriented variant array one by one as they were identified, so you could then transfer that array to Excel without having to transpose it.
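
        Something along these lines, I believe (my own sketch of that approach; the sheet and ranges are examples):

        ```vba
        Sub UniquesWithoutTranspose()
            Dim dic As Object, v As Variant, out As Variant
            Dim i As Long, n As Long
            Set dic = CreateObject("Scripting.Dictionary")
            v = Sheet1.Range("A1:A100000").Value
            'size the output with rows as the first dimension,
            'so no Application.Transpose is needed at the end
            ReDim out(1 To UBound(v, 1), 1 To 1)
            For i = 1 To UBound(v, 1)
                If Not dic.Exists(v(i, 1)) Then
                    dic.Add v(i, 1), Empty
                    n = n + 1
                    out(n, 1) = v(i, 1)   'write each unique as it's identified
                End If
            Next i
            'only the first n rows of out get written back
            Sheet1.Range("C1").Resize(n, 1).Value = out
        End Sub
        ```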

  4. Thanks !

    In Excel, as far as I know, ‘1 and 1 are intended to be identical, while “01” and 1 are not.
    To treat all those as identical you could use Val().
    It all depends on the criteria you want to use for ‘identicalness’.
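
    For example (a sketch; note that a cell entered as ‘1 comes back to VBA as the text "1"):

    ```vba
    Sub ValDemo()
        Debug.Print Val("1") = Val(1)    'True: text "1" treated like the number 1
        Debug.Print Val("01") = Val(1)   'True: the leading zero is ignored
        Debug.Print ("01" = "1")         'False: compared as strings they differ
    End Sub
    ```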

  5. An alternative that could also explain the ByVal character of the reference to the columns that have to be taken into account in the comparison.

    Sub M_snb()
        sn = Sheet1.Cells(1).CurrentRegion
       
        With CreateObject("scripting.dictionary")
            For j = 1 To UBound(sn)
              .Item(sn(j, 2) & "_" & sn(j, 4)) = j
            Next

            Sheet1.Cells(1, 10).Resize(.Count, 2) = Application.Index(sn, Application.Transpose(.items), Array(2, 4))
        End With
    End Sub

    NB. Now I use the dictionary only as the row-filtering argument.
    The array ‘sn’ will be filtered according to a 2-dimensional array (because the 1-dimensional Dictionary array ‘.items’ has been converted to a 2-dimensional one by ‘transpose’) and according to the array ‘array(2,4)’ that indicates that columns 2 and 4 have to be filtered by the ‘Application.Index’ function.
    When I compare the previous method I posted to this one they both run at the same speed.

  6. Nice post. Another approach would be to create a PivotTable out of the sucker, and use the Show Report Filter Pages to spit out a separate Pivot containing just the data for each unique combination.

  7. Doug, I just saw this today and wondered how the following stacks up:

    Sub ExcelHero_Test()

        v = GetUniqueRowValues([a1:z1000000], Array(1, 5, 2))
        MsgBox UBound(v)

    End Sub

    Function GetUniqueRowValues(rData As Range, Columns)
        Dim j&, c
        Application.DisplayAlerts = False
        Application.ScreenUpdating = False
        Application.Calculation = xlCalculationManual
            With Sheets.Add
                For Each c In Columns
                    j = j + 1:  .Cells(, j) = rData(, c)
                Next
                rData.AdvancedFilter 2, , .[a1].Resize(, j), 1
                GetUniqueRowValues = .UsedRange
                .Delete
            End With
        Application.Calculation = xlCalculationAutomatic
        Application.ScreenUpdating = True
        Application.DisplayAlerts = True
    End Function
    • Hey Daniel, thanks for dropping by!

      It seems to take many minutes. A test on three entire columns, getting the dupes from columns 1 and 3, takes about 50 seconds with the code in this post, and is still going after four minutes with yours (at which point it looks like it’s about 20% through).

      If you compare the two what do you get?

      • That’s odd. I tested the code in this post and snb’s dictionary code pulling four columns of unique rows from 26 columns of one million rows.

        Both took approx 20 seconds.

        My code took approx 1 second.

        • This sounds like a conversation I had with Craig Hatmaker at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/ where we were getting wildly different results, and we finally tracked down the issue: very different datasets.

          While I was looking at de-duping just one column in that post, I found that:
          - Remove Duplicates is very fast on datasets with lots of unique values if they are sorted.
          - Remove Duplicates is very slow on datasets with lots of unique values if they are not sorted.
          - Dictionaries are fastest on datasets sorted A to Z, slower on datasets sorted Z to A, and slowest on unsorted datasets. (That’s my surname with a ‘d’ on the end. Weird.)
          - The Advanced Filter simply can’t handle big arrays with numerous non-duplicates quickly.

          • Yup, it looks like it’s the same issue: the Advanced Filter doesn’t like lots of variability.

            Fill 5 columns with =RANDBETWEEN(1,100), then convert to values. You’ll find your code runs in seconds.

            Now fill 5 columns with =RANDBETWEEN(1,100000) then convert to numbers, and you’ll find that you won’t be able to use Excel for a long, long time.

          • Hey Jeff, thanks for the info. FYI, when I ran it last night I was using RANDBETWEEN 1 and 999. Also FYI I didn’t convert to values. Today when I run it with RB 1 and 100,000 it’s about 130 seconds, with about 5,500 duplicates. For 1,000,000 it’s also about 130 seconds (and no dupes anyways).

            Also, Daniel, if you’re still with us, the Advanced Filter didn’t like that I just had numbers in the header row (or maybe it was that they were formulas).

          • Today when I run it with RB 1 and 100,000 it’s about 130 seconds, with about 5,500 duplicates.

            Doug: when you say ‘duplicates’ I take it you mean ‘unique values’?

            Those numbers don’t sound right to me. I suspect we’re talking at cross purposes on something, because with three columns of =RANDBETWEEN(1,100000) there are 100000^3 possible combinations that each line could take. So the odds of getting a duplicate of a particular given combination in Excel’s grid are 1048576/100000^3, i.e. incredibly remote. Given this, every single row should be identified as unique. So run that sub on this data, and you should get out as many rows as you fed in.

            Let’s look at a smaller subset: using =RANDBETWEEN(1,10) across three columns. There are 10^3 = 1000 possible combinations. For 1048576 rows, any one combination can be expected to occur around 1048576/10^3 ≈ 1048 times. At that rate, when we dedupe we are almost certainly going to get out all 1000 combinations.

            For =RANDBETWEEN(1,100) across three columns, there are 100^3 = 1,000,000 possible combinations. So we would expect to see each particular combination come up on average 1048576/1,000,000 times, pretty much one to one. In other words, it’s unlikely that many combinations would appear more than once, meaning we’re probably not going to see that many duplicates.

            For =RANDBETWEEN(1,1000) across three columns, there are 1000^3 = 1,000,000,000 possible combinations. We’re almost certainly not going to get duplicates. In other words, given a particular combination, you could expect to press F9 1000 times before that line comes up in any of those 1048576 rows.
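
            To put rough numbers on that (my own sketch), the expected number of distinct combinations among R rows drawn uniformly from C possibilities is C*(1-(1-1/C)^R), so the expected duplicate count is:

            ```vba
            'Expected number of rows RemoveDuplicates would report as duplicates,
            'for R random rows drawn uniformly from C possible combinations
            Function ExpectedDuplicates(R As Double, C As Double) As Double
                ExpectedDuplicates = R - C * (1 - (1 - 1 / C) ^ R)
            End Function
            ```

            For 1048576 rows and 10^3 combinations that gives roughly 1,047,576 duplicates; for 10^15 combinations it is effectively zero.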

          • Jeff, it’s actually only two columns – column A and C with column B not selected. By “duplicates” I mean the same thing Excel means when it says “5,500 duplicates removed.” I tried it again just now and it was only 38. Perhaps my lab assistant miscounted the zeroes last night, or maybe I should go to Las Vegas.

      • It is interesting that it really depends on the data. Just to prove my point, please have a look at this workbook:

        http://www.excelhero.com/samples/uniques_advancedfilter.xlsm

        It’s 112 MB and has over 900k rows. But it is highly repetitive, which goes to support Jeff’s assertion. To phrase it in reverse, the AdvancedFilter really likes repetitive data. But to be fair we should add a caveat: that assertion has been demonstrated only when exclusively using the optional Unique parameter with no Criteria. I’m not sure it holds once criteria are thrown into the mix.

        I have your code and mine in Module 1. Run them and you’ll see a dramatic difference. On my machine, my code completes in 1 second, while the RemoveDuplicates routine requires 19 seconds.

        This very bizarre data set came from this challenge:

        http://stackoverflow.com/questions/30959315/excel-vba-performance-1-million-rows-delete-rows-containing-a-value-in-less/33192437#33192437

        …where the AdvancedFilter proved nimble.

        But I see that changing the data to include fewer duplicates very quickly smacks the AdvancedFilter into virtual uselessness.

        …So I went Old School and wrote this:

        Sub GetArrayUniqueRowsOldSchool_Test()
           
            Dim t: t = Now: Debug.Print: Debug.Print t: Beep
           
            v = GetArrayUniqueRowsOldSchool([a1:e1000001], 1)
           
            Debug.Print (Now - t) * 86400: Debug.Print UBound(v): Beep
               
        End Sub


        Function GetArrayUniqueRowsOldSchool(Data As Range, Optional Header As Long = xlNo)
            Dim c&, i&, j&, k&, cols&, v, w, bMatch As Boolean, r As Range
            Dim a: Set a = Application
            a.ScreenUpdating = 0
            a.Calculation = xlCalculationManual
                Workbooks.Add
                Data.Copy
                With ActiveSheet
                    .[a1].PasteSpecial xlPasteValues
                    Set r = .UsedRange
                    Set r = r.Resize(r.Rows.Count + 1)
                    cols = r.Columns.Count
                    With .Sort
                        .SortFields.Clear
                        For j = 1 To cols
                            .SortFields.Add r.Columns(j)
                        Next
                        .SetRange r
                        .Header = Header
                        .Apply
                    End With
                    v = r
                    Workbooks(Workbooks.Count).Close 0
                    For i = 2 To UBound(v)
                        k = i - 1
                        bMatch = True
                        For j = 1 To cols
                            If v(i, j) <> v(k, j) Then
                                bMatch = False: Exit For
                            End If
                        Next
                        If Not bMatch Then
                            c = c + 1
                            For j = 1 To cols
                                v(c, j) = v(k, j)
                            Next
                        End If
                    Next
                    ReDim w(1 To c, 1 To cols)
                    For i = 1 To c
                        For j = 1 To cols
                            w(i, j) = v(i, j)
                        Next
                    Next
                    GetArrayUniqueRowsOldSchool = w
                End With
            a.Calculation = xlCalculationAutomatic
            a.ScreenUpdating = 1
        End Function

        This code has the exact opposite characteristic. It is very fast under every circumstance I’ve thrown at it. For example, in the workbook below, on one million rows with nearly a million uniques it takes just 6 seconds. The RemoveDuplicates routine requires 179 seconds to return the same array:

        http://www.excelhero.com/samples/unique_showdown.xlsm

        • Daniel, Very nice! I tried it with two almost-full non-contiguous columns with values from 1 to 100,000 and it completed, as you said, in the six second range.

          It’s a nice simple concept too. Just sort and compare.

          My only quibble is that it breaks if I select full columns, because it needs to resize by one more row.

          And thinking some more, I wondered what would happen if I added a Sort to my routine. I tried it first just in Excel and it was quick. Sure enough, adding a sort to my original routine makes it also run in six seconds. I’ll be re-doing this post and thanking you.

          • Daniel was unable to post his response to my quibble about not being able to select full columns, so asked me to post this code for him. Thanks Daniel:

            Sub UniqueRowsOldSchool_Test()

                Dim t: t = Now: Debug.Print: Debug.Print t: Beep

                v = UniqueRowsOldSchool([a:e], 1)

                Debug.Print (Now - t) * 86400: Debug.Print UBound(v): Beep

            End Sub


            Function UniqueRowsOldSchool(Data As Range, Optional Header As Long = xlNo)
                Dim c&, i&, j&, k&, cols&, v, w, bMatch As Boolean, bDone As Boolean, r As Range
                Dim a: Set a = Application
                a.ScreenUpdating = 0
                a.Calculation = xlCalculationManual
                    Workbooks.Add
                    Data.Copy
                    With ActiveSheet
                        .[a1].PasteSpecial xlPasteValues
                        Set r = .UsedRange
                        Set r = r.Resize(r.Rows.Count)
                        cols = r.Columns.Count
                        With .Sort
                            For j = 1 To cols
                                .SortFields.Add r.Columns(j)
                            Next
                            .SetRange r
                            .Header = Header
                            .Apply
                        End With
                        v = r
                        Workbooks(Workbooks.Count).Close 0
                        For i = 2 To UBound(v)
                            If bDone Then Exit For
                            k = i - 1
            Compare:
                            bMatch = True
                            For j = 1 To cols
                                If v(i, j) <> v(k, j) Then
                                    bMatch = False: Exit For
                                End If
                            Next
                            If Not bMatch Then
                                c = c + 1
                                For j = 1 To cols
                                    v(c, j) = v(k, j)
                                Next
                            End If
                        Next
                        If Not bDone Then bDone = 1: i = k: k = UBound(v): GoTo Compare
                        ReDim w(1 To c, 1 To cols)
                        For i = 1 To c
                            For j = 1 To cols
                                w(i, j) = v(i, j)
                            Next
                        Next
                        UniqueRowsOldSchool = w
                    End With
                a.Calculation = xlCalculationAutomatic
                a.ScreenUpdating = 1
            End Function
        • Nice approach, Daniel. I love the idea to sort on all 5 columns in order to group duplicates together, so that they could be easily identified by virtue of being identical to the row below, and then discarded. Or rather, included as soon as one element doesn’t match the other (which is even faster).

          Previously I’d used sorting to speed things up such as the time it takes for a Dictionary or Remove Duplicates to process things. I hadn’t thought about using sort to separate the wheat from the chaff right there in the field.

          Thanks for the insight.
