Regex Function to Sum Numbers in String

Regex Function to Sum Numbers in String

I recently needed to sum the numeric parts of strings in cells. For example, a cell with “4 calling birds, 3 dog night” would equal seven. So I came up with a regex function to sum numbers in strings. Actually, the regex identified the numbers, and the rest was easy.

The original version worked for things like the following: positive integers with no commas:

easy regex with just positive integers

For those of you not familiar with the basics of regex matching, here’s a short sample that takes a string like those above, applies a simple regex pattern for positive integers and sums the matches. It uses early binding, so you need to set a reference to “Microsoft VBScript Regular Expressions 5.5” in the VBE.

Sub BasicRegexFind()
'Set a reference to "Microsoft VBScript Regular Expressions 5.5"
Dim regex As VBScript_RegExp_55.RegExp
Dim rgxMatch As VBScript_RegExp_55.Match
Dim rgxMatches As VBScript_RegExp_55.MatchCollection
Dim StringToSearch As String

StringToSearch = "4 calling birds, 76 Trombones"

Set regex = New VBScript_RegExp_55.RegExp
With regex
    'Find all matches, not just the first
   .Global = True
    'search for any integer matches
   '"\d+" is the same as "[0-9]+"
   .Pattern = "\d+"
    'built-in test for matches!
   If .Test(StringToSearch) Then
        'if matches, create a collection of them
       Set rgxMatches = .Execute(StringToSearch)
        For Each rgxMatch In rgxMatches
          Debug.Print rgxMatch
        Next rgxMatch
    End If
End With

End Sub

The VBScript regex object is easy to work with, complete with a “Test” method that tells you if there’s any matches, and a collection of matches you can loop through with For/Next.

While the object is pretty straightforward, the concepts are confusing, and the syntax is nuts! I don’t know how much I’ll ever memorize. So before moving on to my voyage of discovery in developing a better function, here’s a couple of resources that helped me. The first site I turn to is regular-expressions.info. This link deals specifically with VBScript regex engine, but there’s many pages of tutorials on syntax and concepts. This tutorial by Patrick Matthews on Experts Exchange focuses on VBA and also contains a bunch of powerful regex-based “wrapper” functions you can use to match and replace text, without having to know how they work. Finally, you can take a look at the many masterful VBA/regex solutions provided by brettdj to real-world questions asked on stackoverflow.

The match pattern used above – “\d+” – worked for my original task, as I was adding only positive integers in strings with no other numbers of any sort. But what about…

  • sub-strings with numbers that shouldn’t be counted, like “Catch-22” or “7-Eleven”
  • decimal number
  • negative numbers
  • numbers with commas
  • non-numeric strings with nothing but numerals and periods, like IP addresses

In other words, we’re looking for substrings containing only numerals, periods, commas, or plus or minus signs. Further, after the optional plus or minus sign, any legitimate match must start with either a single digit or a decimal point followed by a digit.

Regular expressions includes a zero-length match construct – “\b” – that matches a “word” boundary. I thought something like “\b\d+\b” would match a positive integer bounded by a space. But it turns out that a period is a “non-word” character, so “192” and all the other numeric sections of “192.168.0.1” are seen as words and matched. Also, it would split decimal numbers into their integer and decimal parts, so 3.14159 would yield two matches of 3 and 14159, without the decimal. Alas, “\b” was no use to me. In addition, I realized that I was going to have to capture entire strings, such as IP addresses, because otherwise I’d generate a bunch of false positives from their parts. They’d need to be deleted from the real positives with an IsNumeric test after the regex matching was done.

Then I figured I could just check for a space preceding and following the match. That almost works, but since the space is part of the match, it only works for the first occurrence. With a string like “22 33” the space between 22 and 33 only gets counted as the space after 22. The regex doesn’t recognize that 33 has a space in front of it because it’s already moved on down the road.

What ended up working was a “Lookahead.” This is another zero-length match construct that checks the character following the pattern to be matched, without including it in the match. It sees the space at the end of 22, and it’s still available to be matched as the space at the beginning of 33. This is key, since the VbScript regex engine, unlike others, has no Lookbehind construct. So the pattern includes a Lookahead for a space. The positive Lookahead pattern for a space is (?= ).

Lookaheads also help ensure that commas only appear in reasonable places. One states that commas match only if they’re followed by three digits, the second only allows decimal points that aren’t followed by a comma. The negative Lookahead pattern for a comma is (!=,). (All of my comma and period usage here is US-centric and would need to be adjusted for those using different decimal marks or thousands separators).

Finally, the pattern only matches if it starts with a space or the zero-length begin-of-string construct, “^”. And, if the positive Lookahead for a space failed, it must end at the end-of-string ($). Here’s the full pattern:

(^| )[-+]?(\d|\.\d)(\d+|\.(?![.,])|(,(?=\d{3})))*((?= )|$)"

Broken down it says:

(^| )

– must begin with a space or be at the beginning of the string

[-+]?

– followed by zero or one occurrences of either a plus or minus sign

(\d|\.\d)

– followed by a single digit, or
a decimal point that’s followed by a single digit

(\d+|\.(?![.,])|(,(?=\d{3})))*

– followed by zero or more instances of
one or more integers, or
a decimal point that’s not followed by a decimal point or comma, or
a comma that’s followed by three integers

((?= )|$)

– match only if all of the above is followed by a space,
or if it’s at the end of the string

Since a valid match can have a space at the beginning the code includes a trim statement. It also strips out commas, which are allowed in the regex, but won’t pass IsNumeric. Here’s the complete function. It’s late-bound:

Function SumNumsInString(StringToSearch As String) As Double
'Finds numbers within a string and sums them
'Late-binding, so no reference needed

Dim regex As Object
Dim rgxMatch As Object
Dim rgxMatches As Object
Dim NumSum As Double

Set regex = CreateObject("vbScript.RegExp")
With regex
    .Global = True
.Pattern = "(^| )[-+]?(\d|\.\d)(\d+|\.(?![.,])|(,(?=\d{3})))*((?= )|$)"
'non-submatch-capuring version
'"(?:^| )[-+]?(?:\d|\.\d)(?:\d+|\.(?![.,])|(?:,(?=\d{3})))*(?:(?= )|$)"
If .Test(StringToSearch) Then
        Set rgxMatches = .Execute(StringToSearch)
        For Each rgxMatch In rgxMatches
            If IsNumeric(Replace(rgxMatch, ",", "")) Then
                NumSum = NumSum + Replace(rgxMatch, ",", "")
            End If
        Next rgxMatch
    End If
End With
SumNumsInString = NumSum
End Function

There’s also a Regexp.Submatch property – a collection that contains every submatch in the match, where a submatch is a piece of the pattern inside parentheses. So I could have checked if the first submatch was a space and only used the succeeding submatches. This would have eliminated the need to Trim the string, but seems more complex.

Since I didn’t use the submatches I could have included regex characters that tell the engine not the store them, speeding up the regex. Then the pattern would look like the commented one in the code, where the “?:” after each opening paren performs that function:

(?:^| )[-+]?(?:\d|\.\d)(?:\d+|\.(?![.,])|(?:,(?=\d{3})))*(?:(?= )|$)

That means there’s three types of question marks in one pattern: the ones just mentioned, the ones that follow “[-+}” and means to match it zero or one times, and the one that’s part of the Lookahead “?=” pattern. Whew!

Anyways, here’s a more complex version of the first table, with the intended numbers being found:

Clearly this isn’t a foolproof function (D’oh!). My intent was firstly to learn about regexes while having fun, and also to outline my trial-and-error process in a way that may help others. So, although I researched concepts and syntax on the web, I didn’t look at anybody’s actual solutions for this type of function, as I wanted to just hack away on my own.

As always, I’m sure there’s a better way, and I’d love to hear yours!

Speak Your Mind

Your email address will not be published. Required fields are marked *

To post code, do this: <code> your vba here </code>