Handling semi-structured and unstructured data in an enterprise data catalog

Large enterprises today are realizing the importance of a data catalog. A well-organized enterprise data catalog helps organizations manage, discover, and utilize their data more effectively. More specifically, a data catalog can assist with:

[Figure: Data catalog diagram]

Data discovery – A data catalog enables users to quickly locate relevant data assets across the organization. It provides a centralized repository with metadata, descriptions, and other information that makes it easier to find the right datasets for specific business needs.

Data governance – A data catalog supports data governance initiatives by providing visibility into data lineage, data quality, and data ownership. This enables organizations to maintain data consistency, compliance, and adherence to data policies and regulations.

Collaboration – By offering a centralized platform, an enterprise data catalog encourages collaboration among different teams and departments. Users can share knowledge, insights, and best practices around data usage, fostering a data-driven culture.

Data understanding – A data catalog enriches data with metadata, annotations, and other contextual information. This helps users better understand the data’s purpose, structure, and relationships, making it easier to use the data for analysis and decision-making.

Reducing data silos – A data catalog helps break down data silos by providing a unified view of data across the organization. This facilitates data sharing and integration, leading to better analytics and insights.

Improved productivity – With a data catalog, users can quickly find and access the data they need, reducing the time spent searching for data and increasing productivity.

Data security and privacy – A data catalog helps organizations maintain data access controls and track sensitive data, ensuring that only authorized users can access and utilize specific datasets.

User empowerment and data democratization – A data catalog promotes data democratization by making data more accessible to non-technical users. By providing a user-friendly interface and rich metadata, users can easily understand and work with data, empowering them to make data-driven decisions.

Data quality – A data catalog can help identify data quality issues and inconsistencies, enabling organizations to address them proactively and maintain a high level of data integrity.

Cost savings – By reducing the time spent searching for data, eliminating redundancies, and ensuring data quality, a data catalog can lead to significant cost savings for an organization.

Metadata in a data catalog

Data catalogs store metadata, which is data about the data. The metadata stored in a data catalog can vary depending on the organization’s needs and the type of data catalog used. However, some common types of metadata stored in data catalogs include:

Technical metadata – The names of schemas, tables, and columns in traditional relational databases, or the equivalent technical information for non-traditional databases such as graph and document databases.

Dataset descriptions – Basic information about datasets, such as their names, descriptions, source systems, and creation dates.

Data lineage – Information about how data has been processed, transformed, or moved from its origin to its current state, helping users understand the data’s history and dependencies.

Data schema – The structure of the datasets, including column names, data types, constraints, and relationships between tables or entities.

Data owner and steward information – Details about the individuals or teams responsible for maintaining, updating, and ensuring the quality of datasets.

Data quality information – Metrics and indicators that provide insights into the accuracy, consistency, completeness, and timeliness of the data.

Data classification and sensitivity – Information about the categorization of data based on its sensitivity, criticality, or compliance requirements (e.g., personal data, confidential, public).

Data usage – Information about how datasets are used within the organization, including popular queries, reports, or applications that rely on the data.

Tags and keywords – Descriptive terms or phrases associated with datasets, making it easier for users to find relevant data based on their search criteria.

Access and security information – Details about the permissions, access controls, and data privacy policies associated with the datasets.

User-generated content – Annotations, comments, or ratings added by users to enrich the metadata and provide additional context or insights.

By storing this metadata, data catalogs enable users to discover, understand, and manage their organization’s data assets more effectively. It’s important to note that the actual data itself is not typically stored in the data catalog; instead, the catalog provides a means to locate and access the data stored elsewhere (e.g., in databases, data lakes, or data warehouses).

Handling non-traditional data in a data catalog

Many of the most popular data catalogs were designed to handle traditional structured data sources such as relational databases and comma-separated files. But enterprises are increasingly storing and managing semi-structured files and other sources such as:

  • JSON (JavaScript Object Notation)
  • XML (eXtensible Markup Language)
  • YAML (YAML Ain’t Markup Language)
  • Log files
  • NoSQL databases – documents stored as BSON (Binary JSON) or JSON

As well as unstructured data like:

  • Text documents – Word documents, PDF files, and plain text files.
  • Emails and instant messages
  • Social media posts, comments, and other user-generated content on social media platforms
  • Web content including HTML, CSS, and JavaScript code, as well as textual and multimedia elements
  • Digital images in formats like JPEG, PNG, or GIF
  • Audio recordings in formats like MP3, WAV, or FLAC
  • Digital video files in formats like MP4, AVI, or MOV
  • Human speech – Conversational data from interviews, phone calls, or meetings

How can data catalogs store metadata from semi-structured sources?

When data catalogs came onto the market, they were designed to ingest and manage metadata from traditional structured sources, and they are only now starting to catch up and offer functionality for semi-structured and unstructured data. So, what can we do with these new types of data? One possibility is to not catalog them at all. Obviously, if we take this approach, we will end up with a big blind spot in our catalog. What else can we do? Some ideas are:

Adopt new data parsers and connectors – Some new data parsers are beginning to hit the market that are specifically designed to extract metadata from semi-structured sources such as JSON, XML, or CSV files. These tools can help identify and extract relevant metadata elements, such as keys, tags, or attributes, and incorporate them into the data catalog.
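
Even the standard library of a general-purpose language can pull useful metadata elements out of JSON and XML files. The sketch below is illustrative: the file name and the shape of the resulting record are assumptions, and a purpose-built connector would normally handle this step for you.

```python
import json
import xml.etree.ElementTree as ET

def json_keys(obj, prefix=""):
    """Recursively collect dotted key paths from a parsed JSON document."""
    keys = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            keys.add(path)
            keys |= json_keys(value, path)
    elif isinstance(obj, list):
        for item in obj:
            keys |= json_keys(item, prefix)
    return keys

def xml_structure(path):
    """Collect element tags and attribute names from an XML file."""
    tags, attributes = set(), set()
    for element in ET.parse(path).iter():
        tags.add(element.tag)
        attributes.update(element.attrib)
    return tags, attributes

# Illustrative usage: turn the extracted keys into a catalog-ready record.
with open("orders.json") as f:  # hypothetical input file
    asset = {"source": "orders.json", "keys": sorted(json_keys(json.load(f)))}
print(asset)
```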

Adopt custom metadata extraction – Where off-the-shelf parsers and connectors fall short, we can write custom extraction code for those sources and push the resulting metadata into the catalog ourselves.
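
A minimal sketch of what publishing custom-extracted metadata might look like. The endpoint, payload shape, and authentication below are purely hypothetical; most commercial catalogs expose some form of REST or SDK ingestion, but the details vary by product.

```python
import json
import os
import urllib.request

CATALOG_URL = "https://catalog.example.com/api/v1/assets"  # hypothetical endpoint
API_TOKEN = os.environ.get("CATALOG_TOKEN", "")            # supplied by the catalog admin

def publish_metadata(asset: dict) -> None:
    """POST a custom-built metadata record to the catalog's ingestion endpoint."""
    request = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(asset).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()

# Illustrative record built by our own extraction code.
publish_metadata({
    "name": "clickstream_logs",
    "format": "json",
    "keys": ["user_id", "event", "timestamp"],
    "owner": "web-analytics-team",
})
```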

Use schema inference – For semi-structured sources that lack a predefined schema, we can use schema inference techniques to automatically detect and generate a schema based on the data’s structure and patterns. This inferred schema can serve as metadata in the data catalog.
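
Here is a minimal sketch of type-level schema inference over a sample of JSON records, assuming the records are flat; real inference tools also handle nesting, date formats, and nullability. The sample data is illustrative.

```python
def infer_schema(records):
    """Infer a flat field -> type mapping from a sample of JSON records.

    When a field holds different types across records, all observed types
    are kept so the conflict is visible in the catalog.
    """
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

sample = [
    {"user_id": 17, "event": "click", "duration": 3.2},
    {"user_id": 18, "event": "scroll", "duration": None},
]
print(infer_schema(sample))
# {'user_id': ['int'], 'event': ['str'], 'duration': ['NoneType', 'float']}
```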

Encourage user-generated metadata – We can encourage users to contribute metadata to the data catalog by adding descriptions, tags, or annotations to semi-structured data sources. This user-generated metadata can help enrich the catalog and improve data discovery and understanding.

Standardize metadata – We can standardize the metadata extracted from semi-structured sources to ensure consistency and compatibility with the data catalog’s structure and schema. This may involve mapping metadata elements to predefined fields, using standardized vocabularies, or normalizing data formats.
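
One way to approach this is a simple mapping layer between source-specific keys and the catalog's standard fields, plus light normalization of free-text values. The field names and mapping below are illustrative.

```python
# Hypothetical mapping from source-specific keys to the catalog's standard fields.
FIELD_MAP = {
    "createdAt": "creation_date",
    "created_ts": "creation_date",
    "desc": "description",
    "owner_email": "data_owner",
}

def standardize(raw: dict) -> dict:
    """Rename extracted metadata keys to the catalog's vocabulary and
    normalize free-text tags so search behaves consistently."""
    result = {FIELD_MAP.get(key, key): value for key, value in raw.items()}
    if "tags" in result:
        result["tags"] = sorted({tag.strip().lower() for tag in result["tags"]})
    return result

print(standardize({"desc": "Clickstream events", "createdAt": "2024-01-05",
                   "tags": ["Web ", "clickstream"]}))
# {'description': 'Clickstream events', 'creation_date': '2024-01-05',
#  'tags': ['clickstream', 'web']}
```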

Leverage metadata storage and indexing – We can customize the data catalog’s storage and indexing systems to make sure they can accommodate the unique characteristics of metadata from semi-structured sources. This may involve using flexible data storage solutions (e.g., NoSQL databases) and indexing techniques that can efficiently handle diverse metadata types and structures.
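
As an illustration, assuming MongoDB is used as the flexible metadata store, each asset can be stored as a document carrying whatever fields its source actually provides, while a text index keeps keyword search fast. The collection and field names here are illustrative, not part of any particular catalog product.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local metadata store
assets = client["catalog"]["assets"]

# A text index over name/description/tags supports keyword search across
# metadata records whose shapes differ from asset to asset.
assets.create_index([("name", "text"), ("description", "text"), ("tags", "text")])

assets.insert_one({
    "name": "clickstream_logs",
    "format": "json",
    "schema": {"user_id": "int", "event": "str"},  # inferred, may be partial
    "tags": ["clickstream", "web"],
})

for asset in assets.find({"$text": {"$search": "clickstream"}}):
    print(asset["name"])
```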

Integrate with data processing pipelines – We can integrate the data catalog population with data processing pipelines to automatically update metadata from semi-structured sources as the data is ingested, transformed, or analyzed.
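
A sketch of what that integration might look like as a single pipeline step; the file path, helper functions, and field names are stand-ins for whatever the real pipeline and catalog ingestion call would be.

```python
import json

def load_batch(path):
    """Stand-in for the pipeline's existing load step (assumed JSON Lines input)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def update_catalog(asset):
    """Stand-in for the catalog ingestion call (see the earlier sketches)."""
    print("refreshing catalog entry:", asset["name"])

def ingest(batch_path, table_name):
    """One pipeline step: land the batch, then refresh its metadata in the same
    run so the catalog never lags behind the data it describes."""
    records = load_batch(batch_path)
    # ... existing transform / load-to-lake logic would run here ...
    update_catalog({
        "name": table_name,
        "source": batch_path,
        "row_count": len(records),
        "fields": sorted({key for record in records for key in record}),
    })

ingest("events_2024-06-01.jsonl", "events")  # hypothetical batch file
```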

Conclusion

By adopting these methods and tools, data catalogs can effectively store and manage metadata from semi-structured sources, enabling organizations to better discover, understand, and utilize all of their diverse data assets, not just the data coming from traditional structured sources.

Since data catalog providers are just beginning to create the necessary tooling to handle semi-structured and unstructured data, it will not be trivial to bring these new data sources into the catalog. In some ways, we are trying to fit a round peg into a square hole. The alternative (not inventorying and cataloguing this data) is not a palatable option, so we need to find ways to handle these new sources of data. As data catalog providers mature and provide more tooling, the problem will be alleviated and addressed more appropriately.