If you have been in the field for a while, you’ll have realized by now that every single job you’ve had as a “data scientist” comes with different responsibilities, and the expectations placed on you are also vastly different. Here I’ve listed some of the things I have learned along the way that have helped me keep my sanity.
the tech-y side of things
holism* is overrated
Holism on an individual level is overrated*. To become what I would consider a base-level data scientist you must know how to code, have a strong statistical background and have some business knowledge. It is unrealistic to expect one single data scientist to be a subject matter expert in all of the models, to know all the developing tools and to know everything there is to know about a business. If you go to Twitter, Kaggle, Medium, etc, you will read a gazillion tweets and articles telling you that if you don’t want to be a scrub, you NEED to know ALL the things. At this point, your anxiety has sky-rocketed because you suddenly feel that you need to be a statistician, a front-end developer, a back-end developer and a data architect. You don’t. There are base skills you must know and others that can wait. For example, you should know Git whether you want it or not, but CSS can wait.
Focus on an area of data science you like, expand your knowledge of that area, become a subject matter expert in it and get a role that fits those needs. If a role wants you to wear 40 different hats, that role is most likely not for you. If you see a job application asking you to know 15 different technologies and to do 10 unrelated tasks, you RUN. Don’t try to be a holistic data scientist.
Ideally, you do want to work in a holistic team. A team where you have different types of data scientists collaborating together on a shared goal. When you interview at a new place, ask questions. Ask them about the culture, ask them about how projects are assigned, what the expected outputs are, ask them to let you spend a day in the office. You do not need to say yes to every offer you get. It is completely fine to reject a role if it is not a good match for YOU.
If you’re new to the field, it is completely okay to try out different areas of data science and roles, but once you know what you’re looking for, stick to it. It will help you keep your sanity and you can focus on growing the skills you actually need.
just switch tools already
Look, I get it, we all prefer one tool over the other, it’s human nature. But, for god’s sake, if there’s a better tool, just use the better tool. If you accept a new role and said role requires you to use another technology, then SWITCH. This is not a hill worth dying on.
I began programming almost 10 years ago using FORTRAN. My background is in astrophysics and when I started my undergrad degree FORTRAN was still the preferred language in the field. Fast-forward a couple of years, I discovered Python and it became my language of choice (if I had a choice). A few years later, I worked for an R-shop team and the team preferred that most, if not all, of the work be done in R. Did I know R beforehand? No. Did I learn it? Yes. Is Python better than R? No. Are there things that Python can do that R can’t and vice-versa? Absolutely, yes.
Look, it is okay to prefer or, heck, to be better at one language than another, BUT if you are refusing projects or refusing a role for the petty reason of not wanting to work in another programming language, then you’re doing yourself a disservice. If you feel tool A is better than tool B, and if your project timeline allows it, build the project in both tools and prove whether tool A is indeed better. Comparing tools that way goes down a LOT smoother than just plainly saying “oh, I just hate Python.” This is a much better hill to die on.
learn company data
Honestly, this may be one of the hardest parts of any job. It is the one aspect of every job I wish were easier. The reality is that most companies have messy, messy data. Even worse, most companies don’t have proper on-boarding tools or documentation to really train a new employee on their data, so it’s really up to you to come up with a plan. Sometimes, it can take 4-6 months just to get familiar with the data and THAT IS OKAY. Sometimes, there’s no one to show you the way.
I am not going to lie, it’s going to suck…for a while. When you join a new team, make time to meet with people who can teach you about the data; if you’re lucky, you’ll meet someone who is willing to help you out. More often than not, there isn’t anyone who really knows the data either, or they just don’t have time for you (true story). Regardless of the situation, start looking at the tables yourself and making simple pulls to get yourself acquainted with them. After all, you can’t do data science without data.
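A first “simple pull” can be as modest as the sketch below. The table and column names here are made up (a toy in-memory SQLite stand-in); in real life you would point the connection at your company’s warehouse instead.

```python
import sqlite3
import pandas as pd

# Toy stand-in for a company table ("sales" is a hypothetical name);
# replace the connection with your actual warehouse in practice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Mon", "A", 120.0), ("Tue", "A", 95.5), ("Tue", "B", None)],
)

# Simple pulls: how big is the table, what are the columns,
# what do a few rows look like, and where are the holes?
df = pd.read_sql("SELECT * FROM sales", conn)
print(df.shape)          # (rows, columns)
print(df.dtypes)         # column names and types
print(df.head())         # eyeball a few records
print(df.isna().mean())  # fraction of missing values per column
```

A handful of pulls like this, one table at a time, is usually how the 4-6 months of getting familiar with the data actually get spent.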
learn to make quick plots and use descriptive statistics
Every time you get a new data set, the first thing you should do is plot it out and apply some descriptive statistics to it. One of my favorite things to do, if it’s applicable, is run and plot a PCA (principal component analysis), as it immediately tells me how similar or how different my populations are and it helps me with model selection. I want to emphasize that these plots do not have to be complicated: quick line plots, scatter plots, bar plots, histograms, etc. All you want is to get an idea of what is going on with your variables.
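The whole routine fits in a few lines. The data below is synthetic (a made-up frame with one deliberately correlated pair of columns), just to show the shape of the workflow:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data set standing in for whatever you just received.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["b"] = df["a"] * 2 + rng.normal(scale=0.1, size=200)  # correlated pair

# Descriptive statistics first: counts, means, spreads, min/max.
print(df.describe())

# Then a quick PCA on standardized data; the explained-variance ratios
# hint at how much redundancy or structure the variables carry.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
print(pca.explained_variance_ratio_)
```

From here, a scatter plot of the first two components (e.g. with matplotlib) is the quick visual check of how similar or different your populations are.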
run multiple models
Choosing a model is hard. While fun, Kaggle data sets are not real world data sets, so get in the habit of running multiple models. If the data allows it, I like to start simple: maybe a regression? Then, select a few more applicable models and take a look at the accuracy. Select the model that most accurately fits your data. I see a lot of folks focusing on fancy models and that is fine, but there’s no shame in a regression. I personally believe all data science is based on regressions. If a regression is best, then a regression is best.
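A minimal sketch of the habit, on synthetic data where the target really is (nearly) linear, so the humble regression should hold its own against something fancier:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real problem: y is nearly linear in X.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Start simple, try something fancier, and compare on held-out data.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, r2_score(y_test, model.predict(X_test)))
```

Whichever model scores best on the held-out data wins; no shame if that turns out to be the regression.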
A common and straightforward way of checking the performance of a model is to calculate the r² value. Do keep in mind that there are several types of models and the r² value may not be appropriate for all of them, e.g. neural networks. If you are indeed comparing multiple models that are vastly different from one another, cross-validation would probably be your best bet.
learn to read documentation
Y’all…this is probably the most important thing you will do in your career. I mean, I know it sounds absolutely basic, but the sheer number of people that do not know how to find or read documentation is shocking. Every function of every package carries some sort of documentation.
It is easy to google an answer to a bug you have and copy it directly from Stack Overflow, BUT if you don’t understand what those lines of code do or how those functions work, you will inevitably run into the same issue again. This is particularly important when you’re running a model: you must understand what parameters you are allowed to change and how said parameters will affect your model. You can’t assume the default parameters set by your package will work for your data.
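In Python, the documentation is never more than a call away. A quick sketch using scikit-learn’s random forest as an example of checking what the defaults actually are before trusting them:

```python
from sklearn.ensemble import RandomForestRegressor

# Don't assume the defaults fit your data: every estimator documents
# its parameters, and you can list the current settings directly.
model = RandomForestRegressor()
print(model.get_params())  # n_estimators, max_depth, etc.

# The full docstring (the same text as the online docs) is one call away:
help(RandomForestRegressor)
```

The same habit applies in R (`?function_name`) and at the shell (`man command`): read what the parameters mean before you change them, or decide to keep them.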
the business-y side of things
please, learn about the business
I always tell people that the hard part about doing data science is not the coding or the math, but knowing how those skills can help the business/client. If you work for a corporate company, or really any company, spend time with people outside of your line of work and learn about how the business works. Talk to HR, Risk, E-commerce, etc, try to get a 360 degree view of how the company works.
The last thing you, as the data expert, want to do is miss an insight, or report an insight that may not be an insight, because you don’t understand the basics of how the company operates. If you work with sales data, do your analysis and conclude that sales drop on Sundays, but you are unaware that your store is closed on Sundays, then your analysis provides no insight and you’ll have wasted your time and your client’s.
consider the fact that your co-workers may not be data people
It took me quite a bit of time to get better at this. As data people, we tend to go deep into the weeds about what we did, the models we chose, our reasoning, showing the client our Jupyter notebooks. Stop: they don’t care and they’re losing interest.
Most data science teams work on “islands”, completely separated from the rest of the business. Furthermore, a lot of people who do not work in data science simply don’t know how to use it or understand it. Hence, it is crucial that you, as a data scientist, are able to distill the complexity of your job and successfully explain to your clients what you did, how you did it and what the results mean. It will save YOU and them a whole lot of frustration.
How you choose to do that is up to you. Personally, I am a fan of infographics, if the project permits it. Sometimes, the client will want a dashboard or a powerpoint. Get good at explaining your work with several tools.
understand the business question and offer solutions
As I mentioned in the previous blurb, a lot of the people you may work with do not know about data, data science, etc. They will, however, have heard the keywords…AI, machine learning, NLP, themes, etc. They will come to you and, whether the data permits it or not, ask you to run random models to provide answers to business questions they have.
As the subject matter expert, it is your responsibility to guide the conversation. If they are asking you to predict turnover, but they don’t have nearly enough data, make a recommendation as to what they can do instead. If they want an LDA on some text data, but the text data is trash, recommend another type of analysis. Explain to them why you can’t run models without data or why AI or ML is not the solution to all their problems.
More often than not, they are willing to change course if you present them with another option that can get an answer to their question. I find that a lot of the time it is a matter of asking the question in a different way.