What is a Data Lake?

Big data has been around for a long time. However, to extract maximum value from the existing ecosystem, a flexible data architecture is always in demand. This is where the Data Lake concept comes in, bringing a new paradigm shift to big data architecture. Why is it called a Data Lake and not a data store or data warehouse?

Just as a lake caters to all kinds of users with its natural water, a data lake serves different purposes with its unrefined data. Using the stored data, a data scientist can test a hypothesis, an analyst can analyze the data to find patterns, and a business user can explore reports built on that data.

Nowadays, cloud platforms provide an end-to-end solution for implementing a data lake architecture in an economical and scalable way. This is commonly known as Data Lake as a Service.


Why is a Data Lake important?


Flexibility and faster analysis

As data volume has grown rapidly, data complexity has increased at the same rate. Undoubtedly, analyzing such vast amounts of data within a given timeframe is a rising challenge for companies' operational efficiency. Flexibility and scalability are thus the two essential parameters in this regard. It has been observed that analyzing data in its native format supports both flexibility and scalability. Though the analysis may be less accurate than one performed on data in a standardized format, it preserves timeliness and user satisfaction. No doubt, these are essential factors for any business.

Supports different forms of data analytics

A wide variety of data analytics can be performed, including:

-Dashboards

-Real-time analytics

-Visualizations

-Machine learning

-Big data processing


Helps to explore diverse data types

As a Data Lake supports all data types in their native format, companies can leverage the potential of their data even when it arrives in a new format. Besides, it allows legacy systems to offload diverse data types.


Increased ROI

With enhanced analytical ability, companies gain better business agility. Better analytical activities translate to substantial ROI, bringing increased growth along with profit.


360-degree analysis

By implementing a Data Lake, companies get a 360-degree view of their customers instead of a siloed data structure. This makes the analysis more robust.




What are the key concepts of a Data Lake?


Before moving into the architecture, it is essential to know the key concepts of a data lake, as they are closely related to the architecture.


Data Ingestion


This is the process through which data flows from its origin to the data lake. The key functions that happen during ingestion are –

-Collecting data from the source

-Parsing the data

-Routing to one or more data stores

The data could be in any of these formats –

-Structured

-Semi-Structured

-Unstructured

Also, the ingestion can be in the form of

-Batch

-Real-Time

-One-time load

Data ingestion is the most critical step because if it does not happen correctly, data quality in the Data Lake degrades and, consequently, the downstream analysis will be wrong.
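
To make the collect–parse–route flow concrete, below is a minimal Python sketch of a batch ingestion step. The landing-zone path, partitioning scheme, and function names are illustrative assumptions, not part of any specific product; a real pipeline would write to HDFS or object storage rather than a local path.

```python
import json
from pathlib import Path
from datetime import datetime, timezone

# Hypothetical landing-zone root; in practice this would be HDFS, S3, etc.
LANDING_ZONE = Path("/data-lake/landing")

def parse_record(raw_line: str, fmt: str) -> dict:
    """Parse one raw line; unstructured data is wrapped as-is."""
    if fmt == "json":                       # semi-structured source
        return json.loads(raw_line)
    return {"raw": raw_line}                # unstructured: keep untouched

def ingest_batch(source_file: str, source_name: str, fmt: str = "json") -> Path:
    """Collect from a source, parse each record, and route it to the landing zone."""
    # Route: partition raw data by source name and ingestion date
    ingest_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = LANDING_ZONE / source_name / ingest_date
    target_dir.mkdir(parents=True, exist_ok=True)

    # Collect and parse
    with open(source_file, encoding="utf-8") as src:
        records = [parse_record(line.strip(), fmt) for line in src if line.strip()]

    # Route: store the parsed batch alongside other raw data
    target = target_dir / f"{Path(source_file).stem}.json"
    target.write_text(json.dumps(records), encoding="utf-8")
    return target
```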


Data Sources


The origins from which data is ingested are known as data sources. They could be –

-Clickstreams

-Data center logs

-Sensors

-APIs

-Databases

-Web servers

-Emails


Key attributes


Shaun Connolly, Vice President of Corporate Strategy for Hortonworks, defines a Data Lake in his blog post, Enterprise Hadoop and the Journey to a Data Lake:

“A Data Lake is characterized by three key attributes:

Collect everything. A Data Lake contains all data, both raw sources over extended periods as well as any processed data.

Dive in anywhere. A Data Lake enables users across multiple business units to refine, explore, and enrich data on their terms.

Flexible access. A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory, and other processing engines.”


Data Storage

As per Chris Campbell, the BlueGranite blogger and Cloud Data Solutions Architect for Microsoft, “The Data Lake retains ALL data. Not just data that is in use today but data that may be used, and even data that may never be used just because it MIGHT be used someday. Data is also kept for all time so that we can go back in time to any point to do analysis.”


Landing zone/Transient zone

This is the contact point, or ingestion layer, where data is sourced from external data sources into the Data Lake.


Standardization zone/Curated Zone

Data cleaning and the creation of the data versions required by downstream systems happen in this zone.


Analytics Sandbox/Production zone

Analytic models are created from the refined data in this zone.


Data Governance

Data governance is a process to manage the availability, usability, integrity, and security of data used in an organization.


Data Management

Data management needs to be done at all stages of data operations in the Data Lake.


Data operation

It relates to both the standardization and analytic activities performed on the data.


Data Lake architecture


A Data Lake architecture can be divided into three zones –

-Landing Zone

-Staging/Cleansing zone

-Analytics Sandbox/Transformation zone


**For data science purposes, raw data is often consumed directly from the Landing zone into the Production layer to gain more insights. In that case, no standardization is required.**


Landing Zone: In a Data Lake, the Landing Zone is the integration point between the data sources and the Data Lake. The Landing Zone must be integrated with MDM and ETL platforms for data ingestion. Data ingestion at this point follows the Lambda architecture, which means it takes two processing paths to ingest data into the Data Lake in raw format (see the sketch after this list) –

1. Batch processing – data is stored in its most native format through this path.

2. Real-time processing – data is processed in near real time.
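
As a rough illustration of the two Lambda paths, the sketch below routes raw records through either a batch writer or a near-real-time writer into the same hypothetical landing zone. The paths and function names are assumptions for illustration, not a reference implementation.

```python
import json
import time
from pathlib import Path

# Hypothetical raw landing-zone root shared by both paths
LANDING = Path("/data-lake/landing")

def batch_path(records: list[dict], source: str) -> None:
    """Batch processing path: write a whole set of raw records at once."""
    out = LANDING / source / f"batch_{int(time.time())}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records), encoding="utf-8")

def speed_path(record: dict, source: str) -> None:
    """Real-time (speed) path: append each raw record as it arrives."""
    out = LANDING / source / "stream.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Note that both paths deliberately store the data unmodified; cleansing and modeling are deferred to later zones, as described next.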


What is not allowed in the Landing zone?

-No data cleansing or logical operation happens in this zone. One of the main reasons is that the raw data is used for predictive analysis to get more insights about the data.

-Also, no data modeling is performed at this stage


What is allowed in the Landing zone?

-Data governance

-Data access

-Integration with MDM and ETL to know about the data sources.


Standardization Zone: Data exploration and discovery are performed in this zone. Not all data needs to be pre-processed before it enters the Analytical zone. However, data cleansing is crucial for machine learning and natural language processing.

As part of data standardization, defining proper data types for the schema, data transformation, and data cleansing are performed using various tools. Data is prepared at this stage for the transition into downstream or production systems.
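
Here is a minimal pandas sketch of these standardization steps, assuming a hypothetical clickstream data set with user_id, event_time, and amount columns; the file paths are placeholders for the landing and curated zones.

```python
import pandas as pd

# Read raw data from the landing zone (path and columns are assumptions)
raw = pd.read_json("/data-lake/landing/clickstream/2020-01-01/events.json")

# Define proper data types for the schema
raw["user_id"] = raw["user_id"].astype("string")
raw["event_time"] = pd.to_datetime(raw["event_time"], errors="coerce")
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# Data cleansing: drop duplicates and records with unparseable timestamps
clean = raw.drop_duplicates().dropna(subset=["event_time"])

# Data transformation: derive a field the downstream system expects
clean["event_date"] = clean["event_time"].dt.date

# Hand off the prepared version to the curated/standardization zone
clean.to_parquet("/data-lake/curated/clickstream/events.parquet")
```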

Analytical zone: This is the exploratory zone for data scientists and analysts, where they can experiment with different use cases, create prototypes, test different hypotheses, etc. Discovering data at this stage helps transform the business in many cases.


Data cataloging – an essential part of Data Lake implementation


Data cataloging is a disruptive trend today. A data catalog is a glossary of business metadata that describes shared terms, data usage policies, data definitions, etc. Data catalogs automate the process of mapping the data inventory, which in turn helps data engineers find correct descriptions of the data.

Data cataloging is an important first task before implementing a Data Lake. As a Data Lake stores a massive volume and variety of raw data, it is essential to keep track of all of it, with versions, throughout transformation, processing, and analysis.
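
As a rough sketch, a single catalog entry can be modeled as a small record of business metadata. The fields below are an assumption of what a minimal entry might track, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal business-metadata record for one data set in the lake."""
    name: str                      # shared business term for the data set
    definition: str                # data definition in business language
    location: str                  # where the data lives in the lake
    owner: str                     # steward accountable for the data set
    usage_policy: str              # who may use the data, and how
    version: int = 1               # bumped on transformation/processing
    tags: list[str] = field(default_factory=list)

# Example entry for the hypothetical clickstream data set used above
entry = CatalogEntry(
    name="clickstream_events",
    definition="Raw page-view and click events from the web servers",
    location="/data-lake/landing/clickstream/",
    owner="data-engineering",
    usage_policy="internal-analytics-only",
    tags=["raw", "web"],
)
```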


Technology Stack for Data Lake


The Data Lake technology stack is, first and foremost, about storage. In this regard, the storage can be of two types –


On-premises


Hadoop Distributed File System (HDFS)

The Hadoop Data Lake solution, HDFS, is the primary on-premises option for Data Lake storage. HDFS is typically maintained with Cloudera Manager or Hortonworks Ambari.


Cloud-based


The three major technology giants – Amazon, Microsoft, and Google – provide Data Lake as a Service offerings through their respective cloud platforms –

1. AWS S3

2. Azure Data Lake

3. Google Cloud Storage
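
For example, with AWS, landing a raw file in S3 can be a one-call operation using boto3, as the sketch below shows. The bucket name and key layout are placeholders, and AWS credentials are assumed to be configured.

```python
import boto3

# Assumes AWS credentials are configured (environment, profile, or IAM role)
s3 = boto3.client("s3")

# Hypothetical bucket acting as the Data Lake landing zone
BUCKET = "my-company-data-lake"

# Store the file in raw form, partitioned by source and ingestion date
s3.upload_file(
    Filename="events.json",
    Bucket=BUCKET,
    Key="landing/clickstream/2020-01-01/events.json",
)
```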


Conclusion:


With a proper Data Lake set up, companies can achieve faster growth with high-quality, deliverable insights. It also assures on-time delivery and frictionless data movement. However, a Data Lake brings not only advantages but also risks. Data governance is a critical factor here, and security and access control are the two biggest risk areas.

Please share your valuable inputs in the comments area to make the article more informative.
