Wednesday 3 March 2010

RegEx and Alteryx (Part 3)


For those of you who haven't read the comments on part 2, Ned Harding has posted some extra information on how the Tokenise Method works here.  This shows how you can use the tool to split a field on a RegEx matched delimiter as the help file describes.

On with part 3...

Replace Output Method

The last and I think possibly the most powerful of the four methods in the RegEx tool.  For this method I am going to leave the e-mail example and take a look at something I seem to be needing to do more and more recently: xml processing.  So many applications today are outputting xml files or requiring xml input, that being able to process it efficiently needs to be a staple tool of any data developer.  

Before I began to learn about RegEx I struggled to process xml using Alteryx as from what I have seen, there are no specific tools for reading or writing it.

(this the stage that somebody says to me: "why didn't you just use the xxx tool?" and I say because I didn't know it existed. Oh the time I could have saved if I'd only known about that multi-field formula tool a year earlier...) 

I was using complex formulas to search for parts of xml strings and then more formulas to extract the piece of data I wanted.  With RegEx finding and extracting parts of xml suddenly becomes rather easy.

For this example I will use a piece of xml which I hope that anyone who has used Alteryx much will have at least seen, if not manipulated themselves: an Alteryx module.  Yes, for those of you that weren't aware, when you save an Alteryx module it is saved as an xml document which, with due care, can be modified in any text editor.  Great for those times when you need to rename about 100 fields using a file layout you have been supplied in Excel.

This example can be followed in the module RegEx_Example_Part3.yxmd downloadable from here:


The example module uses its own xml as an input and outputs a list of all of the tool ids in the module; along with their types and their x, y display positions.  Run the module and take a look at the output browse, then add a new tool, save the module and notice how the new tool you added is now included in the output.

So how does it work?  Well there are 3 stages to the module:
  1. Get the xml into a single field and single row.
  2. Use our RegEx replace to extract the required information from the xml.
  3. Use Text to Columns tools and a select to format the output.
Now as this article is focusing on the RegEx tools I will leave you to work out how stages 1 and 3 work for yourselves and only explain stage 2.  The module actually uses the RegEx tool with the replace output method 3 times:

The first use is to remove all of the carriage return line feeds from the data:


So \r\n is the Regular Expression for a carriage return line feed combination, we have specified that we will replace it with nothing and we have also specified that we will pass all the other unmatched text through to the output as well.  I think this could be done using the standard formula REPLACE function, but I think you would need to use the CHARFROMINT function and know the unicodes for a carriage return and line feed, which isn't really any easier.

The second use is to remove all white space from between the xml tags:


This time we are searching for one or more white space between two > < characters and replacing the whole lot with ><.  This is another example of something which I believe would be very difficult without RegEx; you could remove all spaces from the data easily enough, but that would remove spaces inside the tags which we wish to keep.  You could also remove a fixed number of spaces between the ends of tags, but to remove an unknown number of spaces I think you would struggle.

The final use, having prepared the xml, is to extract the data elements from the xml corresponding to the Alteryx tools.  


From looking at the data I identified that the block of data I was interested in always looked like this:

<Node ToolID="52"><GuiSettings Plugin="AlteryxBasePluginsGui.BrowseV2.BrowseV2"><Position x="99" y="779" />

which I can quite easily write a regular expression to match like so:

<Node ToolID="(\d+)"><GuiSettings Plugin="([^"]+)"><Position x="(\d+)" y="(\d+)"

notice that I have four marked groups (the parts of the regular expression in parenthesis) representing the parts of the xml data I wish to extract.  The power of a RegEx replace is that I can reference these named groups in the replacement text.  

For example if I made my replacement text $1 then the whole expression would be replaced by 52 (the first marked group) in the above example.

What I choose for my replacement in this example is 

$1¬$2¬\($3,$4\)|

Which for the above example results in 

52 ¬AlteryxBasePluginsGui.BrowseV2.BrowseV2¬\(99, 779\)|

This pulls out the 4 pieces of data I want in a nicely delimited format which I can then split into cells by using the text to columns tool, which is what happens in stage 3 of the module.

I think that the RegEx replace tool is possibly the most useful of the RegEx tools available in Alteryx and definitely worth spending the time getting to understand.

Other RegEx Functions

The last thing to complete our overview of the RegEx capabilities of Alteryx is to briefly mention the remaining two functions available in the formula tool.

REGEX_COUNTMATCHES takes a string and a regular expression and returns the number of times the regular expression matches the string.

REGEX_REPLACE works in a similar manor to the replace method of the RegEx tool, but within the formula tool.  One exciting use of this is in the actions of macros and wizards, an example of which I plan to cover in a future post.

Conclusion

That concludes my overview of the RegEx capabilities of Alteryx, though my examples are just some starting ideas of what is possible.  I hope I have inspired you to learn some basic Regex and next time you are struggling with a complex string manipulation think about using the RegEx toolset.  

If you have any specific questions or comments on anything I have posted about then feel free to e-mail me or post a comment here.

No comments:

Post a Comment