REGEXTRACT Function in Excel


Introduction


Early in 2024, Microsoft announced the REGEX suite of functions that they had put into the Insiders' Channel. I have chosen not to be a member of that channel because new functions showing there can be changed or even abandoned before ordinary channel members learn about them.

Anyway, I found out two days ago that I now have access to the three REGEX functions and this page discusses REGEXTRACT() and REGEXTEST() and how they have solved a rather tricky problem for me ... and, therefore, for you!

REGEX = Regular Expressions

Regular expressions are patterns of characters that match, or fail to match, sequences of characters in text. 

Regular expressions, at the most basic level, allow computer users and developers to find desired pieces of text and, often, to replace those pieces of text with something that is preferred. At other times, regular expressions are used to test whether a sequence of characters that might be intended to be a credit card number or a National Insurance number has an allowed pattern of characters. Whether it’s finding existing sequences of characters or testing sequences of characters for their suitability (or not) for storage, the key aspect of regular expressions is matching a pattern against a sequence of characters.

Source: Andrew Watt (2005) Beginning Regular Expressions Wrox page 2

Note: that book is OLD but rest assured, I have worked on a series of examples that Andrew Watt presented and they still work in Excel. I am sure there are a lot more things that have been said since 2005 but for what I am presenting here, Andrew's book works!

Three New REGEX Functions

  • REGEXTRACT()
  • REGEXREPLACE()
  • REGEXTEST()

I have spent about a whole working day on these functions now and my initial conclusions are that, at first sight, one might wonder why we need them. I tested out REGEXTEST() for a while and got the hang of it to some extent and I started to warm to the idea.

I learned a while ago that there are entire books devoted to Regular Expressions and they have been a feature of the lives of many database and data analysis wallahs for a long time!

Having perused one or two of those books, I found that a knowledge of regular expressions and how we might use these new functions could prove really worthwhile: hence this page!

What I did

REGEXTEST(text,pattern,[case_sensitivity])

What I was doing with REGEXTEST() was to carry out some pattern matching in some data ... simple examples, as you will see


The crucial aspect of this example is to understand what is meant by the pattern. In this example, I was testing the codes in the range A5:A9 to see if they matched this code:

"DOR[0-9][0-9][0-9]"
That is, did they begin with DOR?
and, if yes, was that text followed by three digits, 0 to 9?

Two out of five of those examples were reported to be TRUE, they matched the pattern.

Two countries divided by a common language! Here is an example to help with that. In the UK, we spell colour, in the USA they spell color. Imagine I am working on a file and I need to be able to identify where colour OR color is being used. Take a look:



Notice, REGEXTEST has found two cases that are TRUE and it fails with colours (D13).

From a different source, this time, consider these examples:


And, finally, for a bit more variety, also from exceljet:


Work with these examples and see what you think!

REGEXTRACT(text,pattern,[return_mode],[case_sensitivity])

return_mode

A number that specifies what strings you want to extract. By default, return mode is 0. The possible values are:

  • 0: Return the first string that matches the pattern
  • 1: Return all strings that match the pattern as an array
  • 2: Return capturing groups from the first match as an array

Note: Capturing groups are parts of a regex pattern surrounded by parentheses "(...)". They allow you to return separate parts of a single match individually

Here is a problem I have been wrestling with for several years.


Those entries come from a now defunct web site but their meaning is real: in 2019, they were the fastest growing companies in the UK and the problem I wanted to solve was this.
In the column headed Company, it gives us the name of a company and, with no separator or delimiter, it then tells us the Objective of that company ... what it is doing and/or offering.

We could solve this problem manually since there are only 100 entries per year for this example. But why do that when, surely, Excel can do it for us. Hmm, I thought that but it proved both simple and very difficult. Here's why

Power Query


Using Power Query, we can easily separate name and objective for companies 1 - 3 because it has a function that will split a cell where two words are conjoined by ... a lower case letter and then an upper case letter.

Power Query will also split a cell in which a number precedes an upper case letter ...

BUT, it will not do these things at the same time, so it becomes messy.

In this example names and objectives are linked in this way

lower case to upper case
digit to upper case
. to uppercase
multiple upper cases to upper case, eg, FXForeign ... for the 98th company in my file, not shown here.

Solution: company names


REGEXTRACT() can do all of this for us, like this:

in cell C5==REGEXEXTRACT(B5:B104,".*(?:[a-z0-9. ]|[A-Z]{2,})(?=[A-Z])",0)

You can download my file to see that in action but there are two things to notice:

1 this is a dynamic array function in that I have told REGEXTRACT() that the text is in the range B5:B105 and it finds the answers for us dynamically, all at once
2 we have to think very carefully about the patterns that we are looking for but I need to point out here that we are looking for all of the patterns I have mentioned [a-z0-9. ] tells Excel to find a case where a lower case letter is followed by a digit ...and the | character, the pipe character, is an OR statement to Excel so, either lower case letter followed by a digit or two upper case letters followed by an uppercase letter ...

The output of that function is the list of company names for all 100 companies in the data set

Solution: company objectives

I took a really simple way of finding the company objectives for this example. I could have created another REGEXTRACT() function ... but I did it simpler and, dare I say it, smarter, than that!


cell D5=TEXTAFTER(B5:B104,C5:C104) ... again, a dynamic array formula that spills automatically to provide all of the answers all at once!

For You to do

Finally, here is your challenge: much simpler than the challenge above. You can see from the screenshot that follows and in the Excel file itself, that many of the values in the Latest Sales column contain some non printing characters that we have to get rid of.

You could easily use the Regular Expression functionality that you already know about, Find and Replace (yes, it's REGEX!) and solve that problem. However, for future reference, to provide a useful template, why not create a REGEXTRACT() formula to solve the problem?

You know you want to!


Conclusions

Regular Expressions are not new to you but these three new functions might be a game saver and a life saver for you. It is certainly true that I have solved for the first time, the example I have demonstrated here of the fastest growing companies. That's the removal of an annoyance from my work, now.

Since there are entire books written on the subject of REGEX, it should be clear that this page represented the smallest tip of a very large iceberg but it's your spur to move on to greater things!


Download my Excel file from here


Duncan Williamson
14th January 2025
 





No comments: