Data onboarding—the preparation of unfamiliar data from disparate sources, both internal and external to the organization—is a complex process. Whether it’s combining multiple streams of marketing data into a singular dashboard or augmenting customer data with third-party information, data onboarding involves huge volumes of data and repeats every time new information and sources are found—in some cases, daily.
Data onboarding begins with understanding the source data’s purpose and structure, continues with the assessment of the data’s quality and standardization across sources, and concludes with the combination of multiple data sources into a consistent view for analysis, or for import into your business applications. For most analysts, this process consumes the bulk of their time—on average, analysts report that up to 80% of their time is spent preparing data for analysis.
Taming Unruly Data for Efficient Data Onboarding
Why is data onboarding such a laborious, time-consuming process? In a word: unruly data. Gathering multiple types of data and combining them into one standardized format is difficult for several reasons:
Data silos create duplicates.
Whether due to departmental differences in application investments or to mergers and acquisitions, data typically lives in silos. Data is often duplicated across distributed data marts to serve specific user demands, which makes it challenging to bring that data back together.
Siloed datasets have varying formats and standards.
Because of these data silos, each dataset often follows a different format and standard, which makes the datasets difficult to combine. The challenge grows when the data comes from third-party vendors, customers, or public sources; in those cases, analysts have no control over the structure and conventions used, and must decipher the data element by element to make it consistent across sources before it can finally be combined.
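As a minimal sketch of this kind of standardization work, consider two hypothetical sources that record the same customers with different field names, casing, and date formats. (The field names and sample values below are invented for illustration, not taken from any real system.)

```python
from datetime import datetime

# Two hypothetical sources describing the same customer, each with
# its own field names, text casing, and date convention.
source_a = [{"customer": "ACME Corp", "signup": "2023-01-15"}]
source_b = [{"CUSTOMER_NAME": "acme corp", "SIGNUP_DT": "01/15/2023"}]

def normalize_a(row):
    # Source A: lowercase the name, parse ISO dates.
    return {"customer": row["customer"].strip().lower(),
            "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date()}

def normalize_b(row):
    # Source B: different column names and US-style dates.
    return {"customer": row["CUSTOMER_NAME"].strip().lower(),
            "signup": datetime.strptime(row["SIGNUP_DT"], "%m/%d/%Y").date()}

# After per-source normalization, the rows share one schema and can be combined.
combined = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]
```

The point of the sketch is that every new source needs its own mapping into the shared schema, which is exactly the repetitive work that consumes analysts' time.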
Outliers and errors are difficult to spot in large, unknown datasets.
Buried within all datasets are inconsistencies, such as an age value of 250 years, and errors, such as an invalid zip code or SKU format, but these are even more difficult to spot in large data volumes. If these anomalies aren’t surfaced until analysis, they cause huge delays to accurate insights.
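A simple rule-based validation pass illustrates how anomalies like these can be surfaced before analysis rather than during it. The thresholds and the zip-code pattern below are assumptions chosen for the example, not universal rules:

```python
import re

# Hypothetical records; rows 1 and 2 contain the kinds of anomalies
# described above: an implausible age and a malformed zip code.
records = [
    {"age": 34, "zip": "90210"},
    {"age": 250, "zip": "90210"},   # age out of any plausible range
    {"age": 28, "zip": "9021"},     # zip too short to be valid
]

# US five-digit zip, optionally with a four-digit extension (assumed format).
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def anomalies(row):
    """Return a list of human-readable issues found in one record."""
    issues = []
    if not 0 <= row["age"] <= 120:
        issues.append("age out of range")
    if not ZIP_RE.match(row["zip"]):
        issues.append("invalid zip format")
    return issues

# Keep only the rows that tripped at least one rule, with their index.
flagged = [(i, anomalies(r)) for i, r in enumerate(records) if anomalies(r)]
```

Run across millions of rows, checks like these turn needle-in-a-haystack errors into a reviewable exception list.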
Data volume and formats are growing—fast.
The variety of data available in most organizations has exploded, and with more data comes greater demand to manipulate and process it. Adding to the complexity, these new formats are usually not tabular but hierarchical (key/value pairs or free-form structures), making it difficult for analysts to get an immediate understanding of the data.
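One common way to make hierarchical key/value data usable in a tabular tool is to flatten the nesting into dotted column names. Here is a small sketch with an invented event record; the structure and field names are illustrative only:

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into one level of dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Descend into the nested object, extending the column name.
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# A hypothetical nested event, as it might arrive from a JSON feed.
event = {"user": {"id": 7, "geo": {"country": "US"}}, "action": "click"}
row = flatten(event)
# row: {"user.id": 7, "user.geo.country": "US", "action": "click"}
```

Once flattened, records like this can be inspected and combined with ordinary tabular techniques, though real feeds also bring lists and optional fields that complicate the picture.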
Unfortunately, overcoming these data challenges isn't the finish line, but the beginning of a marathon. In data onboarding scenarios, organizations receive new versions of the same data each month, week, or even day, with slight changes from the previous batch that introduce new hazards to the onboarding process.
When onboarding data, analysts face these challenges again and again, under high pressure from internal stakeholders or customers to get the data right as quickly as possible. To their credit, the fault lies not with their skills, but with the common tools and processes used to do this work.
Why Common Data Onboarding Tools Can’t Help
Data onboarding is typically handled either by an army of highly skilled data engineers delivering a robust but rigid solution, or by hacking Excel, SAS, or an equivalent to produce a result that may contain errors and may not scale. No matter which path a business analyst team has traditionally taken, the risks are high: data issues may surface only as the data is being integrated, delaying the entire process. This can cause the project to be delivered late, leading to lost business opportunity. Worse still, the analysis may simply be wrong, a nightmare scenario for any analyst team.
To get insights faster, there has to be a way to standardize and automate the process of cleaning, transforming, and combining this data quickly and accurately, without investing in expensive engineering labor. For the business, data onboarding bottlenecks are not only a threat to profitability; they're demoralizing for analysts, too. Even the Excel and SAS power users on your analyst team would rather spend their time analyzing data than acting as engineers preparing it. As Dr. McCoy in Star Trek might say: "Dammit Jim, I'm an analyst, not a developer."
Trifacta: Data Onboarding In Record Time
If you’re tired of devoting so much time to combining data sources and want to spend more time discovering transformative insights, Trifacta can help. Using Trifacta, you can now expedite your data onboarding process with an intuitive, scalable, collaborative solution—with no loss in data accuracy.
To learn more about wrangling data for data onboarding, read our brief, Data Onboarding: A Survivor’s Guide To Combining Unfamiliar, Disparate Data; or download the free Principles of Data Wrangling eBook here.