Over the past year, I have continued to refine my career as a data scientist and expand my technical skillset. One part of that growth was joining Game Hive as its first Data Scientist and helping a hardworking team jumpstart its internal data science and business intelligence processes. This role introduced me to digital marketing, where I work with rich mobile gaming user-level and device-level data through various digital marketing platform APIs alongside game app data. Along the way, I have taken on some genuinely interesting data pipelining work and pushed myself to learn more and apply those lessons to further my growth in data analytics.
From gathering digital marketing data to building out a data warehouse to building dashboards and performing predictive modelling, here are 5 things I learned from building data pipelines from the ground up.
1. Finding the right tools and technologies for the pipeline is a real balancing act
Piecing together how a data pipeline will fit together requires some pre-existing knowledge of the technologies available. I needed to weigh the value of time, human resources, and technology costs when selecting the appropriate tools for the job and preparing an execution plan. These were the questions worth asking when I prepared to build out a pipeline:
- Is a data warehouse a viable solution to store your data?
- Are there going to be multiple databases or a big data warehouse or both?
- Who is going to be responsible and accountable for the data warehouse?
- What are the costs for this data warehouse? Do we know our data ingestion and storage needs enough to propose a customizable plan? Is it within budget?
- What skills are needed to build the data warehouse? Is this a one person job or do we need a team? What human resources currently exist to support this project?
- Who are the end users of this data? How much of a learning curve is there if we adopt new analytical technologies?
- Do end users require fast data insights? Do they need the flexibility to dive into the data themselves?
- How much will it cost to adopt a new analytics platform?
- How long will the pipeline take to build? What is a reasonable timeline to drive insights as soon as possible?
- How long will it take to test and QA the pipeline?
By taking ownership of building out the data pipeline, I had to make decisions based on my current technical skills and my understanding of the organization's needs. Buying tools and technologies that suited the organization's workflows and processes while staying within budget was another important consideration.
2. It is important to have a deeper understanding of the components of a data file, the best practices for file storage, and how these affect data loading
To most professionals who digest or analyze data, this may seem trivial or obvious, but I definitely learned not to take these things for granted (I appreciate the little data things!). I learned to pay closer attention to the components of a CSV file that distinguish one data point from another and define where each data point sits in the document. CSV files sometimes contain messy data, such as columns with unpredictable strings. These strings might contain the same character as the delimiter that tells us how columns are separated from one another. We either have to handle these cases throughout the entire document, or check whether a data ingestion tool can tell a real delimiter apart from the same character appearing inside an ordinary string. The most popular delimiter is the comma (","), and countless times I have seen a comma act as a delimiter while also living inside a string.
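A minimal sketch of the problem in Python (the row below is made up): a naive split on "," tears a quoted field apart, while the standard csv module parses it correctly.

```python
import csv
import io

# An illustrative row: the comma is the delimiter, but it also appears
# inside the quoted third field.
line = '"Samsung Galaxy Note","2019-01-05","bought on sale, refurbished"'

# Naive splitting on "," breaks the quoted field into two pieces.
naive = line.split(",")
print(len(naive))  # 4 pieces instead of 3 columns

# The csv module respects the quoting and keeps the field whole.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['Samsung Galaxy Note', '2019-01-05', 'bought on sale, refurbished']
```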
Best practices for file storage include knowing how to organize and retrieve data depending on its use. If your data depends on time and will most likely be used for a time series analysis, or if your historical data needs to be represented visually, then it may be beneficial to include the date in the file name. This becomes extremely useful as your data scales with the organization. For example, a database of sales probably won't contain data from 3 years ago, but you might want data from that specific time range for future analyses. Loading and retrieving that data becomes easier when it is easily identifiable by file name.
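As a sketch, one way to embed dates in file names; the "sales" prefix and directory layout are illustrative, not a prescribed scheme:

```python
from datetime import date, timedelta

def daily_export_path(prefix: str, day: date) -> str:
    """Build a file name with the export date embedded in it."""
    return f"{prefix}/{prefix}_{day.isoformat()}.csv"

def paths_for_range(prefix: str, start: date, end: date):
    """Yield the file names covering a historical date range."""
    day = start
    while day <= end:
        yield daily_export_path(prefix, day)
        day += timedelta(days=1)

# Pulling a specific time range reduces to generating the names to load.
print(list(paths_for_range("sales", date(2019, 6, 1), date(2019, 6, 3))))
# ['sales/sales_2019-06-01.csv', 'sales/sales_2019-06-02.csv', 'sales/sales_2019-06-03.csv']
```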
3. It is necessary to have deeper knowledge about metadata
When ingesting data into a database, it is important to know not only the contents of your data, but also the data that describes your data! For example, a table I created contained different brand names of phones. I assumed phone names could not possibly be longer than maybe 15 to 20 characters. My own personal bias led me to believe nothing could be longer than names like "iPhone X" or "Samsung Galaxy Note", and I was wrong: I encountered a device name that was 72 characters long! I learned that best practice for building out tables within a database requires a deeper understanding of the contents of your data columns, their data types, and the appropriate amount of space needed to store that data.
When ingesting data coming from other data sources or APIs, your best bet is to read the documentation! Familiarizing yourself with the details of their metadata sets you up for success in translating the appropriate data types and schemas into your own tables. The next paragraph describes the opposite situation: practices I would not normally endorse, but have resorted to during my learning experiences because I had not thought of a better solution.
I have learned to prepare for new metadata I have not seen before, where pre-existing data types could not be determined. For example, I once attempted to store a string of length 90 in a column defined as varchar(75), so the insert failed. This happened simply because I had no way of determining the maximum size of the strings being ingested. It may be wise to over-provision for unpredictable data due to a lack of documentation or questionable data architecture: if the longest string I had ever seen was 100 characters, I would probably set the column data type to varchar(200). You might be wondering how a business with this data strategy survived without any proper data architecture, and my only response is that some businesses thrive on their core service and product. Sometimes it takes years to redevelop or redesign an internal data infrastructure, and some may not see the benefit in doing so or shy away from treating it as urgent. Why fix something when it isn't broken?
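A small sketch of that sizing heuristic; the doubling factor mirrors the "seen 100, set varchar(200)" rule of thumb above, and the device names are made up. It is a guess against the unknown, not a guarantee:

```python
import math

def suggest_varchar_size(observed_max_len: int, headroom: float = 2.0) -> int:
    """Size a varchar column with headroom over the longest value seen so far."""
    return math.ceil(observed_max_len * headroom)

# Profile the data you actually have before picking a type...
device_names = ["iPhone X", "Samsung Galaxy Note"]
observed = max(len(name) for name in device_names)
print(observed)  # 19 -- and then a 72-character name shows up anyway

# ...and leave room for the names you have not seen yet.
print(suggest_varchar_size(100))  # 200
```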
4. It is possible to seamlessly handle errors while monitoring data ingestion and data manipulation
One of the cooler parts of the data pipelining process is the ability to monitor how successful your data ingestion runs were. One important thing I learned is to manage the pipeline so that if any errors occurred during processing, I would be notified right away and could fix the problem with ease. For example, I worked on a data pipeline that downloaded data using an API and pushed it into a table inside a database. Sometimes the Python script would fail to communicate with the third-party data provider's server, resulting in a server error. One way to handle future occurrences of this error was to set up error handling within the script so that I could easily re-run the API call without having to touch the code.
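A minimal sketch of that kind of retry wrapper; the function names, backoff scheme, and the fake flaky download are my own illustration, not the actual script:

```python
import time

def call_with_retries(fn, attempts=3, backoff_seconds=2.0,
                      retry_on=(ConnectionError, TimeoutError)):
    """Re-run a flaky call (e.g. an API download) instead of killing the pipeline.

    After the final failed attempt, the error is re-raised so monitoring sees it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # wait longer before each retry

# Usage sketch: a download that fails twice, then succeeds on the third try.
calls = {"n": 0}

def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server error")
    return b"payload"

print(call_with_retries(flaky_download, backoff_seconds=0))  # b'payload'
```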
Another thing I learned is that the job of monitoring and validating your data can fit seamlessly into your existing workflows. At Game Hive, most of my communication with the team happens through Slack, and I realized that Slack could also be my main form of communication with myself! Within my data ingestion Python scripts, I included a little messaging snippet so that whenever an error occurred, I would receive a Slack notification indicating the source of the error, the reason behind it, and how to address it quickly.
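As a sketch, a notification like that can be a few lines around Slack's incoming-webhook API; the webhook URL and message layout here are placeholders, not the actual setup:

```python
import json
import urllib.request

# Placeholder: a real incoming-webhook URL comes from your Slack workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def format_error_message(source: str, reason: str, fix_hint: str) -> dict:
    """Build the Slack payload: where it failed, why, and how to address it."""
    return {"text": f"Pipeline error in {source}\nReason: {reason}\nFix: {fix_hint}"}

def notify_slack(source: str, reason: str, fix_hint: str) -> None:
    """POST the error summary to the webhook so it lands in a Slack channel."""
    payload = json.dumps(format_error_message(source, reason, fix_hint)).encode("utf-8")
    req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# In an ingestion script, this wraps the risky step:
# try:
#     download_and_load()
# except Exception as err:
#     notify_slack("daily device import", str(err), "re-run the API call")
```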
5. The best kind of data validation is a team effort, not a solo endeavour
There are many points along the data pipeline where stakeholders can really maximize the integrity of the data being used. Building on point 4, on the programming and engineering front, error notifications can also verify that the correct number of rows and columns from the data set was uploaded to the database. Any upload errors in that process can likewise be sent as notifications to ensure all of the correct data made it in.
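One way to sketch that check; in practice the expected counts would come from the parsed file and the loaded counts from something like a SELECT COUNT(*), but the function itself is illustrative:

```python
def validate_upload(expected_rows: int, expected_cols: int,
                    loaded_rows: int, loaded_cols: int) -> list:
    """Compare what was in the file against what the database reports.

    Returns a list of human-readable problems; an empty list means the
    load checks out, a non-empty one becomes the error notification.
    """
    problems = []
    if loaded_rows != expected_rows:
        problems.append(f"row count mismatch: expected {expected_rows}, got {loaded_rows}")
    if loaded_cols != expected_cols:
        problems.append(f"column count mismatch: expected {expected_cols}, got {loaded_cols}")
    return problems

print(validate_upload(1000, 12, 1000, 12))  # [] -> upload looks complete
print(validate_upload(1000, 12, 997, 12))   # flags the missing rows
```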
During end users' analytical processes, it is also crucial to have experienced, number-savvy members of your team look at the data being surfaced through dashboards and analytics. An experienced professional can easily detect even slight numerical anomalies, and it is good practice to communicate those internal gut feelings about whether the data looks wrong.
Obviously, the fewer manual checkpoints there are in the data pipeline, the more time and effort can go toward actually using the data to derive valuable insights. It does not hurt to have different stages of data validation throughout the pipeline, and there are great ways to maximize accuracy, efficiency, and credibility for both your data and your peers.
A special thanks to the lovely Michael Ma for inspiring me to write about my experiences in data pipelining.