If you have chosen to click on this blog, I bet you have heard the terms ‘BigData’, ‘Data Lake’ and ‘Data Warehouse’.
Even if you are not a Data professional (a Data Scientist, Data Engineer or similar), BigData isn’t too difficult to understand: it refers to extremely ‘big’ sets of ‘data’ that are treated differently to normal sets of data due to their sheer size.
For many mature businesses, making a data-driven decision means implementing BigData. Quite often the term ‘BigData’ will be accompanied by the terms ‘Data Warehouse’ and ‘Data Lake’. These terms are not quite as transparent, so today I am going to dive deep into them and give you an understanding of the key differences between them.
Humans tend to use things in ways they were not designed for – this is what makes us human. By design, the primary purpose of a computer was to compute (or process), not to store. Storage came as a secondary, temporary or even optional component in many designs. And this is how mathematicians and engineers used computers bigger than your bedroom to solve their tasks.
But just a decade later we started to realise that computers can locate a necessary piece of information much faster than humans browsing through copious folders of paper. Nowadays, we process mountains of information that need to be collected, stored and made available for analysis. Both the Data Warehouse and the Data Lake are designed to solve this task. So what actually are they?
A Data Warehouse is similar to a traditional warehouse – perfectly organised with full control of what is inside and where all items in the inventory are located.
Data Warehousing has a long history in the enterprise sector of storing, managing and analysing structured datasets. Usually, the data stored in a Data Warehouse is cleaned, pre-aggregated and organised for specific business purposes.
This data is then made directly accessible to different BI tools. As in a physical warehouse, the inventory (the data) is organised and mapped in a way that makes complete sense and serves a specific, predefined purpose. In many cases, a Data Warehouse is used in “write once, read multiple times” mode.
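To make this a little more concrete, here is a minimal Python sketch of the warehouse idea, using the standard library's sqlite3 module as a small stand-in for a real Data Warehouse. The table name daily_sales, its columns, the load_row helper and the sample values are all made up for illustration. The point is only that the structure is fixed up front, so data that does not fit is rejected at write time.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")

# The schema is defined before any data arrives.
conn.execute(
    """
    CREATE TABLE daily_sales (
        sale_date TEXT NOT NULL,
        region    TEXT NOT NULL,
        revenue   REAL NOT NULL CHECK (revenue >= 0),
        PRIMARY KEY (sale_date, region)
    )
    """
)


def load_row(sale_date, region, revenue):
    """Write one cleaned row; return False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO daily_sales VALUES (?, ?, ?)",
                (sale_date, region, revenue),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = load_row("2024-01-01", "APAC", 1200.0)
rejected = load_row("2024-01-01", "EMEA", None)  # missing revenue breaks NOT NULL

# Written once, then read many times by reporting queries.
total = conn.execute("SELECT SUM(revenue) FROM daily_sales").fetchone()[0]
print(accepted, rejected, total)
```

A BI tool would sit where the final SELECT is: it can trust every row in the table because the checks already happened when the data was written.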
A Data Lake includes multiple streams of data that all flow together to produce a ‘lake’ of different data types.
Data Lakes are a newer technology that is usually built with an open-source ecosystem such as Hadoop. Data Lakes allow the aggregation of structured, unstructured, or even raw data sets without any pre-processing. The Data Schema definition, processing, and filtering are usually done if, and when, data is read.
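As a rough illustration of schema on read, the Python sketch below keeps a handful of raw text lines exactly as they arrived and only applies a structure when a particular question is asked. The record shapes, field names and the read_web_latency function are hypothetical and not taken from any specific Data Lake product.

```python
import json

# Raw records land in the "lake" untouched; shapes differ between sources.
raw_lines = [
    '{"source": "web", "user": "a1", "ms": 120}',
    '{"source": "sensor", "device": "t-7", "temp_c": 21.5}',
    'not even json',
    '{"source": "web", "user": "b2", "ms": "95"}',
]


def read_web_latency(lines):
    """Apply a schema at read time: keep web records and coerce ms to int."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines stay in the lake but are skipped here
        if not isinstance(record, dict) or record.get("source") != "web":
            continue
        try:
            yield {"user": str(record["user"]), "ms": int(record["ms"])}
        except (KeyError, TypeError, ValueError):
            continue  # record does not fit the schema this reader expects


web_rows = list(read_web_latency(raw_lines))
print(web_rows)
```

Nothing was cleaned or rejected when the lines were stored. The filtering and type conversion happen only inside the reader, and the raw lines remain available for other readers with different needs.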
You should already be able to tell there are some key differences between the two key terms ‘Data Lake’ and ‘Data Warehouse’. Let’s break these down in a bit more detail:
Data Warehouse:
Data: structured and processed.
Schema: enforced on write.
Cost: more expensive for large data volumes.
Agility: less agile – fixed configuration; modernising the schema can be difficult.
Maturity: a more “traditional” approach with mature security.
Users: best suited for business professionals.
Typical data: CRM, financial transactions, ERP.
Data Lake:
Data: structured, unstructured or raw (like log files).
Schema: applied on read.
Cost: designed for low-cost storage.
Agility: highly agile – configure and reconfigure as needed.
Maturity: overall still under development, with less granular security control.
Users: best suited for data scientists and data engineers.
Typical data: social media, web server logs, sensor data, documents, media files.
Why Would I Consider A Data Warehouse?
Data Warehousing is perfect for mature but evolving businesses that have a determined set of data sources, each presenting prepared and structured data.
A traditional Data Warehouse is an expensive resource but is highly beneficial for enterprise organisations. Processing in a Data Warehouse should be completed before or as data is written. A well-known cloud example of a Data Warehouse is Google BigQuery.
Why Would I Consider A Data Lake?
A Data Lake is a data system that supports innovation and agile insights, and is prepared for what the future has to offer. Data storage and retention is much easier and cheaper than in a Data Warehouse.
Processing in Data Lakes is completed when the data is read, and hence Data Lakes can dynamically adapt to the analysis at hand. Data Lakes are becoming more commonly included in enterprise data strategies due to the insights and flexibility that they offer.
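The "adapt to the analysis at hand" point can be sketched in a few lines of Python: the same raw events are read twice, each time through a different schema chosen by the analyst. The events, the field names and the read_with_schema helper are assumptions made up for this example.

```python
# Made-up raw events, stored exactly as they arrived.
raw_events = [
    {"type": "order", "user": "a1", "amount": "19.90", "country": "AU"},
    {"type": "order", "user": "b2", "amount": "5.00", "country": "NZ"},
    {"type": "click", "user": "a1", "page": "/pricing"},
]


def read_with_schema(events, schema):
    """Project each event onto schema (field -> converter); skip events that do not fit."""
    rows = []
    for event in events:
        try:
            rows.append(
                {field: convert(event[field]) for field, convert in schema.items()}
            )
        except (KeyError, TypeError, ValueError):
            continue
    return rows


# Two analyses, two schemas, one unchanged set of raw data.
revenue_schema = {"country": str, "amount": float}
activity_schema = {"user": str, "type": str}

revenue_rows = read_with_schema(raw_events, revenue_schema)
activity_rows = read_with_schema(raw_events, activity_schema)
print(len(revenue_rows), len(activity_rows))
```

A revenue question only sees the two order events, while an activity question sees all three. Neither required the stored data to be reshaped first.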
Want to Know More?
For more information on cloud technology solutions, get in touch with XPON today. XPON is a Premier Google Cloud Partner that helps businesses better leverage their First-Party customer data to unlock exponential growth.