Topics in Data Science: Big Metadata
Data Science is Metadata Science
Data Science is one of those topics I can never get enough of… even the definition of Data Science given by the Data Science Association resonates with me: "the scientific study of the creation, validation and transformation of data to create meaning."
Data Science is a compelling topic because of the immense potential hidden in data sets; unlocking that insight can help address our most significant societal challenges. The promise of Data Science is to contribute more fully to the greater good by advancing our knowledge and leading to impactful discoveries. The impact of Big Metadata in the Data Science framework is the topic I address in the following blog, the introductory post in a new series called "Topics in Data Science". Large data repositories generate massive amounts of metadata, enabling big data analytics to leverage technological and methodological advances in data science for the quantitative study of science. This blog post introduces a definition of Big Metadata in the context of data science and discusses the challenges and possibilities of Big Metadata analytics.
What is Metadata?
Metadata can most universally be thought of as value-added language that serves as an integrated layer in an information system. Metadata is structured data supporting functions associated with an object, an object being any "entity, form, or mode". Metadata serves as a connection to the lifecycle of the digital object being represented or tracked. While Big Data offers undreamed-of possibilities for finding new data-driven solutions, Big Metadata can be understood as data that encompasses information about the relationships among data, creating a structure in which those relationships can be explained.
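To make the idea concrete, here is a minimal sketch, in Python, of metadata as a structured layer describing a digital object. The field names loosely follow Dublin Core, and the dataset, identifiers, and relations are hypothetical.

```python
# A minimal sketch of metadata as structured data about a digital object.
# Field names loosely follow Dublin Core; the dataset itself is hypothetical.
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    identifier: str          # stable ID for the digital object
    title: str               # human-readable name
    creator: str             # who produced the object
    created: str             # ISO 8601 creation date
    format: str              # serialization of the underlying data
    relations: list = field(default_factory=list)  # links to related objects


meta = DatasetMetadata(
    identifier="doi:10.0000/example.0001",   # hypothetical identifier
    title="City Air Quality Readings",
    creator="Example Environmental Agency",
    created="2020-06-01",
    format="text/csv",
    relations=["doi:10.0000/example.0000"],  # e.g., the raw-sensor predecessor
)

# The metadata travels with the object through its lifecycle: it can be
# queried, validated, and linked without ever opening the data file itself.
print(meta.identifier, "->", meta.relations)
```

Note how the relations field is what lets Big Metadata describe the relationships among data, not just the data itself.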
Smart Data
Data Scientists work with large unstructured data sets, and these data sets are inherently messy, lacking the structures that make them suitable for analytics. To realize maximum value from a data lake, you must be able to ensure data quality and reliability, and make that data smart. Metadata is inherently Smart Data because it provides context and meaning for data and enables actions that draw on the metadata to enhance the connections that have been made. Smart Data is high-quality, trusted data that is accessible across the enterprise. Smart Data is actionable and can be ingested and understood by humans and/or machines.
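As a rough illustration of what makes data actionable, the sketch below uses hypothetical column metadata (units and valid ranges) to let a machine validate raw values without human interpretation.

```python
# A minimal sketch of metadata making raw data "smart": the metadata supplies
# units and valid ranges, so a machine can act on the values automatically.
# Column names and ranges here are illustrative assumptions.
readings = {"temp_c": 19.5, "humidity_pct": 134.0}  # raw, context-free values

column_meta = {
    "temp_c":       {"unit": "degrees Celsius", "min": -50.0, "max": 60.0},
    "humidity_pct": {"unit": "percent",         "min": 0.0,   "max": 100.0},
}

for name, value in readings.items():
    rule = column_meta[name]
    ok = rule["min"] <= value <= rule["max"]
    status = "ok" if ok else "OUT OF RANGE"
    print(f"{name} = {value} {rule['unit']}: {status}")

# humidity_pct fails the range check -- the metadata, not the raw number,
# is what makes that determination possible.
```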
Structure is everything
Data Science endeavors rely not only on data, but also on accurate description of that data: hence, metadata.
In the practice of data science, much of the attention is focused on the beautiful visualizations or amazing discoveries made from analyzing large data sets. Little attention is given to the process the data scientist uses to get those results, specifically the time-consuming work of preparing data. Data preparation accounts for anywhere from 80–90% of a data scientist's work: data scientists spend about 60% of their time cleaning and organizing data and another 19% collecting data sets, meaning roughly a whopping 80% of their time goes to preparing and cleaning data for analysis. Detecting data anomalies and ameliorating data entry errors generally involves writing code and is an integral part of the data exploration and confirmation process.
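For a flavor of that work, here is a hedged sketch of typical preparation code in pandas; the column names, plausibility threshold, and imputation choice are illustrative assumptions, not a prescribed recipe.

```python
# A sketch of the kind of cleaning code data scientists write during
# preparation. Columns and thresholds are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age":  [34, 29, 290, 41, None],        # 290 is a likely entry error
    "city": ["Boston", "boston", " Boston", "Chicago", "Chicago"],
})

# Normalize inconsistent text entries (case, stray whitespace).
df["city"] = df["city"].str.strip().str.title()

# Flag implausible values instead of silently dropping them.
df["age_suspect"] = ~df["age"].between(0, 120)

# Impute missing ages with the median of the plausible values.
plausible = df.loc[~df["age_suspect"], "age"]
df["age"] = df["age"].fillna(plausible.median())

print(df)
```

Even this toy example shows why preparation eats so much time: every column needs its own judgment calls about what counts as an anomaly and what to do about one.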
David Lyle, VP of Business Transformation Services at Informatica, wrote that "the difference between success and failure is proportional to the investment an organization makes in its metadata management system".
Big Metadata Management
Large data sets used in data science pose unique data management challenges. Big metadata analytics requires careful design of datasets, with close attention to data structure. A misstep in data preparation may cause a stalled server, never-ending loops, datasets that explode out of proportion, or network disconnections after long periods of inactivity. There are also instances where merging datasets causes the data to explode out of proportion due to matching issues. Successfully linking or merging data fields pivots on the extent to which metadata conforms to a standardized structure. The largest advantage of aligning a standardized metadata structure on top of a metadata source is the creation of a formalized conceptual definition, which allows a defined metadata interchange across datasets and fosters stable, reproducible results. Big metadata analytics performed on an ad hoc basis, without standardized metadata exchange formats, is more likely to suffer from mistakes in both conceptual and computational workflows. Elevating the metadata from the underlying storage allows its use in the meta-mining process.
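To illustrate how a merge can explode when join keys are not as unique as assumed, and how declaring the expected relationship (a small piece of metadata about cardinality) catches the problem early, here is a minimal pandas sketch with hypothetical data.

```python
# A minimal sketch of a merge "exploding out of proportion" when join keys
# are not as unique as assumed, and of how declaring the expected
# relationship catches the problem early. Data here is hypothetical.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "order": ["A", "B", "C"]})
# The profile table was assumed to have one row per customer, but does not:
profiles = pd.DataFrame({"customer_id": [1, 1, 2],
                         "tier": ["gold", "gold", "basic"]})

# Silent many-to-many match: 3 orders become 5 rows.
exploded = orders.merge(profiles, on="customer_id")
print(len(exploded))  # 5, not 3

# Encoding the expected cardinality turns the silent explosion
# into an immediate error.
try:
    orders.merge(profiles, on="customer_id", validate="many_to_one")
except pd.errors.MergeError as exc:
    print("merge rejected:", exc)
```

The validate parameter is exactly the kind of formalized, machine-checkable expectation that a standardized metadata structure makes possible at scale: the expected relationship between datasets is stated up front, so violations surface as errors rather than as quietly corrupted results.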