Guest Contributor: Curtis Seare cohosts the Data Crunch podcast, edits the AI & ML Biweekly Beat newsletter, and co-founded Vault Analytics, an and research and reporting company. He’s also the Director of Analytics at Shelfbucks, a retail analytics startup located in Austin, Texas. He’s worked with data and solved problems in analytics spanning IoT, retail, , sales, competitive intelligence, nonprofit donations, and product development.

The Volcanic Dataset

There are 871 volcanoes in the Smithsonian’s “confirmed eruptions” volcano data set, spanning 9,812 eruptions – including the Kilauea volcano on the big island of Hawaii that’s currently captivating the .

In this tutorial, I’m going to share a few functions for how we can cleanse and prep with this dataset using Trifacta. Working with messy text is one of the hardest challenges in data cleanup. This will hopefully make it easier.

By the end of this tutorial, you should have a good idea how to work with Trifacta. For the purpose of this article, I’m going to be working with Trifacta Wrangler, the free edition of Trifacta, but the tutorial could apply to any edition.

There is a lot you might not know about volcanoes, but if you download the dataset, you could figure out some interesting things. To get it, head over to the Smithsonian Global Volcanism page, select “confirmed eruptions”, and “all”.

 

You’ll notice if you look at the date columns in this data that we have information on confirmed eruptions spanning back over a hundred years. An interesting note to this is that, if you go to the ‘References’ tab on the download, you’ll see that they have 6,950 papers cited as references for the eruptions in the data set. True to form, the Smithsonian is well documented.

The ‘References’ tab is where we’re going to focus in this tutorial. It gives us all of the papers discussing specific volcanoes, the authors of these papers, and where to find them. It is a very text heavy data set, and working with text is exactly the topic of this article.

What you will Learn

Our goal in this article is to work specifically with the Author column in this data set. I would like to reduce the length of the author column so it can more easily be displayed with less space, while keeping the fidelity of the data. I would like to keep the Author column at 50 characters long or less. For example, one row’s Author column contains this of authors:

Juvigne E, Bastin B, Beaulieu J L de, Etlicher M, Gewelt M, Gilot E, Goeury C, Janssen C R, Milcamps V, Van Leeuwen J

In order to accomplish this, we’ll go over several functions and transforms in Trifacta, including the following.

  • The Header( ) Function
  • The Len( ) Function and Sort command
  • The Split Transform
  • The Replace Transform
  • Trifacta String Matching Patterns

I’ll take you through this use case step by step, showing you how to work with messy text along the way.

Header( ) Function

When you load this data into Trifacta, it doesn’t quite get the header right, so go ahead and add a header transform to get it trued up.

For our example use case, we’re going to pretend that we’ve been given the task of preparing this volcanic data for presentation in a UI. One of our constraints is that we have a 50 character limit to display the authors of the paper.

Len( ) Function

Have a look at the author column, where we will find this information. At a glance, it seems we’re going to have several fields that are too long to fit. To make sure, let’s have a look at the actual character counts. We can do a Derive transform to create a new column, with a len( ) function to figure out how long each of the rows are. The len( ) function returns the number of characters in a cell, or the length of that cell.

 

And to jump right to the longest cells we can drop a sort on our new column.

It turns out we have a lot of cutting to do. Our longest column of 282 characters is well beyond the our limit of 50.

 

There are a few approaches we could take to fit within the character limit. For our purposes, we’re going to try to just keep the first three authors of the paper, and add an et. al. for any authors omitted. That will likely put us under the 50 limit.

With that in mind, our task now is to transform this author column into a new column, with just the names of the first three authors. It’s good practice, before starting to manipulate text data, to create a duplicate column of the one you are going to change. Even though you can always get back to your original data in Trifacta, this gives you a good reference point to look back at the original text and make sure your changes are keeping the integrity of the text, especially when you will be taking the field through several transformation steps (we will). Manipulating text can get complicated. Having a reference will save you time and headache.

To do this, we’ll create an author_original column by using the derive transform on the author column

Now we can get to work. How can we eliminate all but the first three authors from this text across all of the cells?

Split Transform

In text manipulation, patterns are important. We can use common patterns between cells to our advantage. One common pattern we notice when looking at the author column is that, after every author name, there appears to be a comma. Let’s see if we can exploit that pattern. One method we could use is the Split transform which, true to its name, splits the target column into new columns.

You will notice when you do the Split transform, there is a place to input the pattern you want to split on, and it gives you three options. You can use a Trifacta Pattern, a Regular Expression, or a String. More on this later, but for now we are going to use the simplest of the three, the String pattern. This simply means that whatever character or characters you tell Trifacta to split on, it will do that exactly. You tell Trifacta you are doing this pattern type by using single quotes around your characters.

It would makes sense to try and split this column by commas. And since we only want the first three authors, we can tell Trifacta to only do this split the first three times it finds a comma in the text.

Let’s have a look. Trifacta did indeed split the author column three times. In most of the first three new columns, we notice that we have achieved getting the first three names of the authors.

But it’s not always true. Have a look at the top row in Author2. We just have the letter “S”.

What happened?

Looking back at the original text data, we notice that the pattern we thought was true does not always hold. In the first row, not only are names split by commas, but also last names and initials.

So in some columns the comma is a good separator of names, and in others it is not. This is a typical problem when working with text—patterns don’t always hold across all cells. We always need to be careful of this and make sure our assumptions hold for all of the data.

It looks like our simple comma strategy is not going to work. Luckily in Trifacta, it’s easy to step backwards and try something different. Let’s delete our Split transform.

 

One approach we could now take is to try and remove all of the patterns that are breaking our simple comma pattern, and after that, try the simple comma pattern again to get to our three-author result.

To move forward with this approach, we are going to make the strategic decision to remove all of the initials from the names in the paper, and keep only the last names. This will be a simplification strategy to keep this tutorial to a reasonable length.

Replace Transform

In order to remove the initials, we are going to use the replace transform, which search a piece of text for a pattern, and then replaces that with whatever text you specify.

When you type in the replace function, you’ll notice there is a place to match on pattern, just like we saw on the Split transform. As a reminder, if you just use a string match, Trifacta will find one or all of the commas in the field, depending on what you want.

If we tell Trifacta to match on a simple comma, we notice it finds all of the instances if we check the ‘match all occurrences’ flag.

We could match commas, letters, or words. But we want to match patterns so we can better target exactly what we want to remove. For this, you could use regular expressions, which are powerful but can quickly get confusing. Unless you have a lot of experience in regular expressions, it’s a better bet to go with the more friendly Trifacta Patterns.

Trifacta Patterns

We can use Trifacta Pattern matching by clicking ‘Trifacta Pattern’, or by using a backwards quote (`) to wrap our pattern text

For Trifacta pattern matching we put the patterns we want scan for in curly brackets. Some of the most basic examples are the following, along with the pattern it will match to.

As an example, here is what it looks like when you search for {alpha}. You’ll notice that every single character that is a letter is highlighted.

 

Because Trifacta visually shows us what it’s matching to, it’s easy to play around with the patterns until it highlights the characters you would like to remove, but none of the characters you want to keep. To start, we might try ` {alpha},` in order to target the pattern any alphabetic character that is immediately followed by a comma, and immediately preceded by an empty space. (Notice we added an empty space character before the {alpha})

That looks pretty good! It highlights most of the characters that we would like to remove. But it doesn’t get all of them. You will notice some patterns like “J-C” and “III” that doesn’t capture. It also doesn’t work at the end of the string, as you can see below.

We can get back to matching other patterns like “J-C” in a bit, but for now we can at least modify our pattern to pick up initials at the end of the string. They don’t currently show up because there is no comma after the alphabetic character. To capture it, we could get rid of the comma and use the {end} pattern instead. We would use ` {alpha}{end}`.

We could do both transforms one after the other to get both of these patterns. We could first to a replace with ` {alpha},` and then do another replace with ` {alpha}{end}`.

But there is an easier way.

We can combine these two patterns into one pattern by using an OR operator, which tells Trifacta to search any one of a number of patterns we specify. We do this be first enclosing the group of patterns with parenthesis, and then separate them with a pipe (|) character. It would look like this: `(,|{end})`

This tells Trifacta to look for a comma, OR the end character. If it finds either, highlight it.

And to combined the logic, we can put this OR statement on the end of our empty-space-alphabet character statement, like so: ` {alpha}(,|{end})`

Now we are telling Trifacta to find the pattern: empty space, followed by an alphabetic character, followed by either a comma or the end of the string.

We also need to tell Trifacta what to replace the pattern with. In this case, we want to replace it with a “,” so we don’t lose the comma separating the author names, as follows.

We have successfully removed some of the initials! Notice however that names that had more than one initial were not removed. This was because they did not have a comma or end of the string after them, so our pattern passed them over.

Summary

In this article, I demonstrated how you might begin to approach a problem in Trifacta. There are sometimes multiple to the same problem, and sometimes you need to test and play around with them to find the right one.

Knowing what you know now in Trifacta, you can get started on difficult text wrangling problems in your own data set. Stay tuned for part 2 of this article, where we will finish reducing the column to 50 characters using some additional transforms and functions from Trifacta, and we’ll append et. al. to the columns where we have 3+ authors.

Got your own data prep you would like to share? Get paid to write tutorials like this one. Learn more.



Source link
Bigdata and data center
thanks you RSS link
( https://www.trifacta.com/blog/tutorial-trifacta-string-manipulation/)

LEAVE A REPLY

Please enter your comment!
Please enter your name here