
How To Rapidly Transition Your Career To Big Data

You have decided to make a career switch to big data, but you don't know how.

Here's what to do:

  1. Choose your role
  2. Build a sexy demo and job hunt guerrilla-style

Step One: Choose your role

There are four roles in this big-data world:

  • Finding insights in the past
  • Setting up & delivering a data warehouse
  • Predicting future from past data
  • Developing data-driven product features

Role 1: Finding insights in the past

This is a key role in jobs titled Data Analyst and Marketing Analyst. It's also a key prerequisite for Growth Hackers.

You answer questions about the past to help steer the business forward. For example:

  • What do the cohorts in our customer base look like?
  • Which airport gates lead to the most late passengers?
  • Which customer micro-segment is driving the most revenue?

You use BI tools (like Tableau, Looker) to interactively explore and visualize data. If you are more technical, you reach for SQL and/or R.
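If you go the technical route, cohort analysis is a classic first exercise. Here is a minimal sketch in Pandas; the orders table is made up for illustration:

```python
# Hypothetical sketch: monthly cohort retention with Pandas.
# Assumes an orders table with customer_id and order_date columns.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-01", "2024-02-15"]),
})

# A customer's cohort is the month of their first order.
orders["order_month"] = orders["order_date"].dt.to_period("M")
orders["cohort"] = orders.groupby("customer_id")["order_month"].transform("min")

# Count distinct active customers per cohort per month.
cohorts = (orders.groupby(["cohort", "order_month"])["customer_id"]
                 .nunique()
                 .unstack(fill_value=0))
print(cohorts)
```

Each row is a cohort, each column a calendar month, and each cell the number of customers from that cohort who were still active — exactly the shape a BI tool would render as a retention heatmap.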

After a while, you start impacting the metrics you report on. You use growth hacking and A/B testing. And you provably improve the company's profit.
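The statistical core of an A/B test fits in a few lines. A minimal sketch of a two-proportion z-test, with made-up conversion numbers:

```python
# Hypothetical A/B test: did variant B convert better than variant A?
# All numbers below are invented for illustration.
from math import sqrt, erf

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided standard-normal tail probability, via the error function.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p = ab_test_p_value(conv_a=200, n_a=4000, conv_b=260, n_b=4000)
print(round(p, 4))  # a small p-value suggests B genuinely converts better
```

In practice you would also fix the sample size up front and avoid peeking, but this is the calculation underneath every A/B testing dashboard.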

To learn the first steps:

  • Read the first 50 pages of The Pyramid Principle to get acquainted with top-down reasoning
  • Search for online courses on data analysis (for example, on edX or Coursera)

Role 2: Setting up & delivering a data warehouse

This is a key role in jobs titled Data Engineer, ETL engineer and Hadoop developer.

Typically, your responsibility is to combine company data from various sources.

For example, you combine customer data from Salesforce, support data from Intercom, product usage from a few production Postgres databases, real-time customer events from Kafka and you sprinkle it with historical weather information.

Then you create an abstraction on top of this data to provide a single source of truth. You operate between two departments:

  • the business: to align on metric definitions and
  • the IT: to get the integrations right.

Having all data together in this manner has HUGE benefits - all of a sudden, the time-to-answer drops from days or weeks to near zero. Connect a self-service BI tool on top of it (e.g. Looker, Metabase) and you are there.

Note that there are cloud solutions out there which provide the warehousing and data shoveling for you. Check out Redshift and BigQuery for warehousing, and Alooma, Stitch and Fivetran for shoveling.

You need to do the data merging yourself - that's the company domain. You do this either with SQL or a data processing tool.
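To make the merging concrete, here is a sketch in Pandas, one such data processing tool. The CRM export and usage table are invented for illustration:

```python
# Hypothetical sketch: join customers from a CRM export with
# product-usage events. Both tables below are made up.
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "plan": ["free", "pro", "pro"],
})
usage = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "events": [3, 5, 9],
})

# Aggregate usage per customer, then left-join so that customers
# without any events are kept (with a count of zero).
per_customer = usage.groupby("customer_id", as_index=False)["events"].sum()
merged = (crm.merge(per_customer, on="customer_id", how="left")
             .fillna({"events": 0}))
print(merged)
```

The real work in this role is agreeing with the business on what "events" and "customer" mean before you write the join - the code itself is the easy part.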

Notes:

  • Software engineering skills are expected. Learn principles of clean-coding and learn Python or Scala.
  • Scala is hot. Get the certificate from this legendary course and put it on your LinkedIn.
  • You don't need to learn Hadoop. Most companies don't need Hadoop.

Role 3: Predicting future from past data

This is a key role in jobs titled Data Scientist and Quant.

Your role is to predict behavior by learning from past data:

  • What are the products that customer X is most interested in?
  • How much is product X going to be traded on the exchange tomorrow?
  • How to optimally coordinate airplane routes to minimize fuel costs?
  • How utilized will the Tesla super-chargers be tomorrow at location X?

You are the math person in the company. You know machine learning, simulations or numerical computing. Typically, machine learning is required. This includes:

  • supervised/unsupervised machine-learning models
  • ability to prevent over-fitting
  • ability to engineer and select the right features

Your daily bread is Pandas, Scikit-learn (my favourite), Matlab, R or SparkML. That said, I know a guy at Booking.com who uses C++ for his machine-learning experiments.
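A minimal Scikit-learn workflow shows the essentials, including the held-out test set you use to catch over-fitting. It uses the bundled iris toy dataset, so it needs no external data:

```python
# Sketch: supervised model with a held-out test set.
# Uses scikit-learn's bundled toy dataset for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A large gap between train and test accuracy signals over-fitting.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f}")
```

The evaluation discipline - never score yourself on data the model has seen - matters far more than which model you pick.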

To learn the first steps:

  • Follow Andrew Ng's legendary lecture on machine learning
  • Go to Kaggle to practice your skills
  • The Kaggle forums are legendary; you will learn a LOT of practical information there

Notes:

  • The focus is on understanding the business and the underlying data so you can come up with the right data features.
  • Once the features are there, a simple model will do.
  • Fewer software engineering skills are expected. Clean code is not a priority; there is another role for translating the model into production.

Role 4: Developing data-driven product features

This is a common role in jobs titled Data Engineer.

You put into production features driven by prediction and machine-learning models:

  • compute product recommendations and expose them via an API, which is consumed by team members building the mobile app
  • implement a stock prediction algorithm within a production trading platform

If there is a dedicated data scientist, you work closely with them to translate their model into production code. If there is a growth hacker, you work closely with them to implement their ideas.

  • You are a software engineer by heart: you know a few languages, DRY/YAGNI, clean-code, continuous delivery, unit/functional tests, monitoring - the whole shebang.
  • On top of it, you understand the big-data ecosystem used by the team (e.g. Kafka, Spark, ELK Stack)
  • On top of it, you understand machine-learning and can translate a data scientist's model into production.
  • Sometimes, data scientists are also engineers who put their models into production
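The recommendation bullet above can be sketched in a few lines. This is a toy co-purchase recommender with invented baskets; in production, an API layer (Flask, FastAPI and friends) would simply serve recommend()'s output as JSON:

```python
# Hypothetical sketch: recommend products by co-purchase counts.
# The baskets below are invented for illustration.
from collections import Counter
from itertools import permutations

baskets = [
    ["shoes", "socks"],
    ["shoes", "socks", "laces"],
    ["laces", "shoes"],
]

# Count how often each ordered pair of products shares a basket.
co_counts = Counter(
    pair for basket in baskets for pair in permutations(basket, 2))

def recommend(product, k=2):
    """Top-k products most often bought together with `product`."""
    scores = {b: n for (a, b), n in co_counts.items() if a == product}
    return [p for p, _ in sorted(scores.items(), key=lambda x: -x[1])][:k]

print(recommend("shoes"))
```

Real systems replace the counting with a trained model, but the engineering concerns - serving, latency, monitoring - stay the same, and that is what this role owns.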

To learn the first steps:

  • If you are not a coder, get an engineering job first. I recommend Scala; try to grasp its functional approach. Scala is used heavily in the big-data community. Another option is Python. Once you are good at engineering, go to the next step:
  • If you are a coder, learn Spark and a few other technologies. It's the tech stack you know that will get you this role.
  • Although companies think they need Hadoop, they most probably don't. But it's good to learn it, so you understand the situations when Hadoop is useful. Check out the major Hadoop distributions (e.g. Cloudera, Hortonworks, MapR), install one locally and try to get a basic ETL pipeline up and running. Or check out cloud solutions like AWS EMR and Google Dataproc.

Choose your role by your current strengths:

  • If you have an economics background and no engineering skills yet, go for Data Analyst.
  • If you have system administration background, go for setting up a warehouse.
  • If you have software engineering background, go for data engineering.
  • If you have a strong mathematics, modeling or statistics background, go for Data Scientist.
  • If you are a student, go learn machine-learning, clean coding, Scala and Clojure.

Don't worry too much about the initial choice. Especially in smaller companies, the roles overlap and you have the freedom to move between them. I know a Data Scientist who has been doing warehousing for two years, and a Data Engineer doing growth hacking.

Still not sure which role? [Drop me](//m@marcel.is?subject=Which role to choose) an email.

Step Two: Build a sexy demo and go guerrilla

Do this:

  • Read this article and this article and choose one of these two approaches.
  • Choose your target company.
  • Build a (targeted) demo

For choosing the target company:

  • aim for a smaller one, ideally a startup. There is much more freedom to experiment and switch roles.
  • aim for a team where only 1-2 people do what you do. This gives you a lot of space to experiment. Plus, you get the chance to take over full responsibility once they decide to move on to something else.

For the demo, here are a few examples:

Example 1 (for data analysts): Data insights and recommended next-steps

Example 2 (for data analysts): Tailored growth-hacking recommendations

  • Read more about growth-hacking and lean startup
  • Guess problems the target company is facing by studying their products
  • Deliver a list of 5-10 actions you would advise, plus metrics to evaluate progress

Example 3: (for data engineers): Spark Streaming with real-time visualizations

  • Spark is hot, Spark Streaming especially.
  • Play with a combination of Twitter and Spark Streaming, like here
  • Or play locally on your computer, example here.

Example 4: (for data engineers): Click-stream processing with Kafka & Alooma

  • Build a simple web app, collect clicks via Kafka
  • From Kafka, send the clicks to Alooma. Sessionize the clicks in Alooma and pipe it to BigQuery.
  • Connect Metabase to BigQuery to show reports on basic usage
  • You have just built a simple version of Google Analytics!
  • Bonus: Find which SaaS product your target company is using (e.g. for customer support) and merge that data with the clickstream.
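The sessionizing step is the only real logic in that pipeline, and it is worth understanding before you configure it in a tool. A sketch in plain Python, with made-up epoch-second timestamps in place of the Kafka clickstream:

```python
# Hypothetical sketch: group one user's clicks into sessions whenever
# the gap between consecutive clicks exceeds 30 minutes.
SESSION_GAP = 30 * 60  # seconds

def sessionize(timestamps, gap=SESSION_GAP):
    """Split click timestamps (epoch seconds) into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # gap small enough: same session
        else:
            sessions.append([ts])     # gap too large: start a new session
    return sessions

clicks = [0, 60, 120, 4000, 4100, 10000]
print(sessionize(clicks))  # three sessions for this user
```

This is exactly the "session" definition web-analytics products use under the hood; only the gap threshold is a product decision.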

Example 5: (for data scientists): Open-source your Kaggle solution

  • Compete on Kaggle
  • Once you have a decent model, put it on Github and LinkedIn and blog about it.

Example 6: (for data scientists): Deliver tailored prediction

  • Get hold of data which might be interesting for your target company, for example here
  • Find a problem which can be framed as a machine-learning problem.
  • Solve it, package the result as a top-down story in a Jupyter notebook and share it online. Put it on LinkedIn and blog about it.

FAQ:

Should I work pro bono?

Sure. If you don't feel confident about your current skills, do a pro-bono project. Reach out to your network of friends, family and colleagues. Try to find a contact who will tell you about their company's problems and is willing to share their data. In exchange for your work (if done well), ask for a referral that you can post on your LinkedIn/blog.


Enjoy the ride and [ping me](mailto:m@marcel.is?subject=Big data transition) with your transition story. I'm very much interested.

I'll leave the last words to the marketing guru Seth Godin:

Make something happen

If I had to pick one piece of marketing advice to give you, that would be it.

Now.

Make something happen today, before you go home, before the end of the week. Launch that idea, post that post, run that ad, call that customer. Go the edge, that edge you've been holding back from... and do it today. Without waiting for the committee or your boss or the market. Just go.