Tips on building a data analytics warehouse

Published on 12 September 2018, in #data-engineering, #design-process, #system-design

By a data analytics warehouse we mean a solution that combines and cleans data from various internal and external sources and offers reporting and querying capabilities to business users. Sometimes also referred to as a big data platform.

Based on failures and hard-won lessons, here's 21 tips for teams and individuals who are responsible for making a company more data-driven.

Solve the right problem #

Understand that building an analytical warehouse is not about the technical tooling.
Get an in-depth understanding of the root problem to minimize the risk of delivering a solution that doesn't solve a tangible business problem.

Choose the right tool for the job #

Consider the most straightforward solutions to resolve the root problem. Perhaps, an excel would suffice.
Do not choose a solution according to personal preference.
Be aware of maintenance costs. Consider off-the-shelf solutions before building your own. If a cloud platform is an option, consider no-ops and SaaS offerings. There is a plethora of solutions out there for ETL, storage and BI tools. Be pragmatic about Hadoop.
Start with delivering a narrow vertical slice of functionality to discover possible problems early and to validate the chosen approach.
Validate new deliverables as frequently as possible through an intimate collaboration with business.

Provide correct and consistent data #

Check for data correctness: integrity violations (foreign keys, unique values), invalid value types, errorneus ranges (values out of bound, text too long or too short), missing data, etc.
Check for suspicious data: values out of expected bounds, the number of new records too large/low, etc.
Treat incorrect and suspicious data as exceptions and warnings, respectively. Report on them automatically and periodically.
Use consistent prefixes and suffices per field type to limit guesswork. For example, use is_ for booleans and _at or _time for time.
Ensure consistency in values. For example, use null for missing data and ISO codes for countries among all tables.
Convert values to native types as early as possible. For example, convert date strings to dates, and binary integers to booleans.

Bring data closer to business #

Add as much semantics to field names as possible. Where applicable, include units, timeframes, and aggregation functions. For example, use avg_kwh_consumed_per_year instead of energy.
Establish and use common terminology with business. Rely on ubiquitous language for table names, field names, and values. Reconsider using abbreviations.
If possible, encode domain knowledge and undocumented business rules into data to increase value for data users.
If relevant, teach SQL to data users to enable advanced self-service querying.
Understand that data is an abstract model of reality. Identify and communicate data limitations.

Develop efficiently #

Learn window functions and with queries to simplify complex SQL queries.
Create views for frequently reused SQL queries to ensure users don’t repeat themselves.
Follow software engineering best practices, incl. version control, clean code, testing, peer reviews and boring and predictable deployments.

← Previous post: Taking things personally
→ Next post: Drawings for technical explanations: my first steps

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter