Data asset curation: The guidelines to determine information usability

Data catalog fundamentals: The dimensions of usability

The policies and processes for data curation are managed using a data catalog, in which a cache of attributes about each data asset provides critical information about data usability. We may be accustomed to capturing data quality expectations about data sets, and data quality is important for data usability, but the most important attributes of curated data assets are those associated with the use and oversight of shared data sets. Examples of data usability dimensions include:

Discoverability: Are data consumers and system designers able to get information about the data assets within the environment? Are the data assets properly classified in a consistent manner?

Searchability: Is the information about the data asset searchable? Can one find those data assets using references to business terms, phrases and concepts?

Accessibility: What are the different methods for accessing the data asset within the data environment?

Protection: With increased enterprise visibility intended to foster data reuse, protected data will be potentially vulnerable to exposure unless proper controls are in place. What mechanisms are in place for asset protection, including role-based access control, encryption and data masking?

Versioning: The curated data environment will likely maintain versions of data sets. What mechanisms are in place for retrieval of modified (or even deleted) objects, and are there allowances for rolling back to previous versions?

Provenance and lineage: When data consumers apply transformations and enhancements to existing data assets, new data assets will be created. The catalog should document the flow of data sets from their acquisition through the environment, and track the sequence of actions applied to them as the derived data assets are created. The data catalog must provide a means for logging and managing data provenance and for tracking the lineage of downstream data use.

Data quality: What are the ways that quality expectations are captured for each data asset, and what are the methods for ensuring that these expectations are met in different usage scenarios?

Data currency: How up to date is a particular version of a data asset? How are data currency requirements managed and demonstrated?
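
To make these dimensions concrete, here is a minimal sketch of how a catalog entry might record them as attributes. It is an illustration only, not the design of any particular catalog product: the `CatalogEntry` and `Catalog` classes, their field names, and the sample asset names are all hypothetical, and a real catalog would cover data quality rules, encryption, masking and version rollback in far more depth.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; every field name is an assumption."""
    name: str
    classification: str          # discoverability: consistent classification
    business_terms: set[str]     # searchability: business terms and concepts
    access_methods: list[str]    # accessibility: how the asset can be reached
    allowed_roles: set[str]      # protection: role-based access control
    version: int                 # versioning
    last_refreshed: datetime     # data currency
    derived_from: list[str] = field(default_factory=list)  # lineage


class Catalog:
    """Hypothetical in-memory catalog keyed by asset name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of assets tagged with the business term (case-insensitive)."""
        wanted = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if wanted in {t.lower() for t in e.business_terms}
        )

    def can_access(self, name: str, role: str) -> bool:
        """Check whether a role is allowed to read the asset."""
        return role in self._entries[name].allowed_roles

    def lineage(self, name: str) -> list[str]:
        """Return upstream asset names, nearest first (breadth-first walk)."""
        seen: list[str] = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].derived_from)
        return seen

    def is_current(self, name: str, max_age: timedelta, now: datetime) -> bool:
        """Check whether the asset was refreshed within max_age of now."""
        return now - self._entries[name].last_refreshed <= max_age


# Sample data: three made-up assets forming a simple derivation chain.
NOW = datetime(2024, 1, 2, tzinfo=timezone.utc)
catalog = Catalog()
catalog.register(CatalogEntry(
    "raw_orders", "transactional", {"Orders"}, ["sql"], {"engineer"},
    version=3, last_refreshed=datetime(2024, 1, 1, 12, tzinfo=timezone.utc)))
catalog.register(CatalogEntry(
    "clean_orders", "transactional", {"Orders", "Customers"}, ["sql", "api"],
    {"engineer", "analyst"}, version=2,
    last_refreshed=datetime(2024, 1, 1, 13, tzinfo=timezone.utc),
    derived_from=["raw_orders"]))
catalog.register(CatalogEntry(
    "monthly_revenue", "financial", {"Revenue"}, ["api"], {"analyst"},
    version=1, last_refreshed=datetime(2023, 12, 25, tzinfo=timezone.utc),
    derived_from=["clean_orders"]))

print(catalog.search("revenue"))
print(catalog.lineage("monthly_revenue"))
print(catalog.can_access("raw_orders", "analyst"))
print(catalog.is_current("monthly_revenue", timedelta(days=1), NOW))
```

Keeping the usability attributes on the entry itself means a single lookup can answer who may use an asset, where it came from and whether it is fresh enough, which is the kind of oversight question the dimensions above are meant to surface.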

Surveying the organizational expectations for data usability is the first step in establishing the policies and procedures for data curation. In essence, curation embraces data stewardship and governance in a way that operationalizes the ability to expose corporate data assets and promote their reuse, while limiting the risk that data consumers reinterpret shared data because of inconsistent semantics.

As greater data volumes and broader varieties of digital objects are ingested into the growing data lake, instituting data curation processes will help protect that data lake from “digital pollution” that diminishes the organization’s ability to take advantage of its corporate information inventory.