According to the trend and technology publisher Visual Capitalist, 2.5 exabytes of data are generated every day, roughly 12.5 billion pages of text. The volume is hard to grasp at the scale our brains handle: every minute, 500 hours of video are uploaded to YouTube, 350,000 Instagram stories are created, and more than 41 million WhatsApp messages are sent.
This data is not generated only on social networks and the internet: companies also contribute to the tide through their transactional systems, production management software, customer experience applications, and Internet of Things sensors, to cite just a few examples.
Those who are able to make sense of such data, that is, to identify what is useful, compare it, analyze it, and draw valid conclusions from it, hold the ability to take their business to the next level.
The concept of a data lake exists to handle this complexity: managing many types of data from a wide variety of sources and storing it all, whether structured, semi-structured, or unstructured, in a centralized repository.
Schema on read
A data lake is a more agile and flexible storage and analysis solution than traditional repositories. It is characterized by preserving raw data in a flat architecture, unlike data warehouses, which organize data into a hierarchical structure of files and folders.
The data is transformed only at the moment it is going to be used, an approach known as schema on read. There is no predefined schema into which the data must first fit: it is analyzed and adapted to the most convenient format at read time. A data warehouse, by contrast, uses the schema on write model: schematization happens during writing.
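The contrast can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API: raw records (hypothetical ones, invented here for the example) land in the lake untouched, and a schema is imposed only when a query reads them.

```python
import json

# Raw events land in the lake exactly as produced, with no predefined
# schema (hypothetical records, for illustration only).
raw_events = [
    '{"user": "ana", "amount": "19.90", "ts": "2023-05-01"}',
    '{"user": "luis", "amount": "5", "extra_field": true}',
]

def read_with_schema(raw_lines):
    """Schema on read: parse and coerce fields only at query time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            "user": str(record.get("user", "")),
            "amount": float(record.get("amount", 0)),  # coerced on read
        }

rows = list(read_with_schema(raw_events))
```

Under schema on write, the coercion in `read_with_schema` would instead happen before storage, and records that do not fit the schema would be rejected or reshaped at ingestion time.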
Compared to a data warehouse, a data lake holds all data, even data that is useless today but might become useful at some point. This means a huge saving of effort in data profiling and decision making. In a warehouse, by contrast, unused data must be excluded to save storage costs, an extra effort that is unnecessary when working with a data lake.
Unique identifiers
Each element in the data lake has a unique identifier and is tagged with a set of extended metadata. Therefore, each time a business problem needs to be solved, all related data can be retrieved from the data lake for focused analysis on that subset.
For example, if the company needs to analyze customer sentiment on social networks, or to assess the credit risk of a person applying for a bank loan, the data lake retrieves only the data tagged as unequivocally related to that request.
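A toy sketch of this identifier-plus-tags catalog, assuming a simple in-memory structure (the object IDs, tags, and `query_by_tags` helper are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class LakeObject:
    # Each element carries a unique identifier plus extended metadata tags.
    object_id: str
    payload: bytes
    tags: set = field(default_factory=set)

# Hypothetical catalog of objects stored in the lake.
catalog = [
    LakeObject("obj-001", b"...", {"social", "sentiment", "customer"}),
    LakeObject("obj-002", b"...", {"credit", "risk", "customer"}),
    LakeObject("obj-003", b"...", {"iot", "telemetry"}),
]

def query_by_tags(objects, required):
    """Return only the objects whose tags cover the requested set."""
    return [o for o in objects if required <= o.tags]

# A sentiment analysis request touches only the matching subset.
sentiment_data = query_by_tags(catalog, {"social", "sentiment"})
```

A credit risk assessment would issue `query_by_tags(catalog, {"credit", "risk"})` instead, reaching a different subset of the same lake without scanning everything else.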
Data lake benefits
The benefits of a data lake include the ability to combine and process disparate data sources and to put essential data, exactly when it is needed, in the hands of those who need it, while maintaining very high security standards. Another distinctive advantage is speed: the data lake architecture enables immediate access to the data.
When implementing a data lake, it is important to first define a strategic vision so that it is fully aligned with the needs of the business.
It will also be necessary to define the architecture and the technology platform: in general, clusters of low-cost commodity hardware with high levels of scalability are used, so that data can be dumped without worrying about storage capacity. In this regard, solutions such as Plug & Play Data Lake stand out, providing everything from the storage layer to the centralized query console to ease the implementation and use of the data lake.
Ultimately, a data lake is the ideal solution for finding the data that is truly relevant to the business, sharing it collaboratively, and reusing it as many times as necessary. In other words, it is the key to understanding, from a data perspective, the world we live in.