What we want to achieve
For a business user, what is relevant are the business concepts and connections between them. What we want to achieve is to have a consistent and meaningful representation of data resources, along with policies to handle them. We want the end user to have access only to the governed, rich metadata.
How to obtain it?
We will use CocoPharmaceuticals database as resource sample. As a result the end user will be able to see a business view on top of the underlying resource and similarly the technical view.
By assigning a glossary term to a relational column, we will trigger the creation of two views on top of the parent table:
- business view with all the columns having business terms associated and using the business term name as column name
- technical view with all the columns having business terms associated and using the technical name as column name
Solution Overview Components
- Oracle database containing Coco Pharmaceuticals sample data. We will use table EMPSALARYANALYSIS as an example. Link: https://github.com/odpi/egeria/tree/master/open-metadata-resources/open-metadata-deployment/sample-data/coco- pharmaceuticals
- IBM Governance Catalog (IGC) as the repository containing glossary terms and the metadata for the above database. This repository is integrated into OpenMetadata world using the proxy
pattern: https://github.com/odpi/egeria/blob/master/open-metadata-publication/website/open-metadata-integration-patterns/adapter-integration-pattern.md. The proxy is used to translate from proprietary formats, events, APIs to Open Metadata standards (and the other way around). Link: https://github.com/odpi/egeria-connector-ibm-information-server
- Kafka as event bus
- Open Metadata server configured to have Open Metadata Access Services (OMAS) enabled.
Open Metadata Access Services are sets of APIS targeting certain type of consumers and use cases. For more general details check link: https://github.com/odpi/egeria/blob/master/open-metadata-implementation/access-services/README.md
In current example we will use two OMASs:
- Information View OMAS is responsible for integrating with virtualization and BI tools. For this flow we are using it because we need the structure and context (location) of the table.
- Data Platform OMAS is responsible for integration with tools that want to create new data assets. In this case its responsibility is to integrate with the data virtualization tool used in this setup, gaianDB, and model the views created as new data assets.
- Atlas repository as a repository integrated with Open Metadata natively, since it is implementing the OpenMetadata APIs, protocols and types. Using Atlas built-in UI we will explore the entities created for representing the views.
More details about native pattern can be found here: https://github.com/odpi/egeria/blob/master/open-metadata-publication/website/open-metadata-integration-patterns/native-integration-pattern.md
- Virtualizer: component designed to integrate with data virtualization tool, therefore responsible for running the actual queries and statements for creating the views. It also produces events representing the views structure, along with details about the host and platform creating the views (gaianDb in this case)
- Gaian DB as data virtualization tool. Gaian DB is a leightweight federated database over multiple sources, as a result providing a single centralized view over multiple, heterogeneous back-end data sources. More details can be found here: https://github.com/gaiandb/gaiandb.
Our setup includes a front-end Gaian node to which virtualizer will connect and also another back-end Gaian node connected to the Oracle database. Views will be created in the front-end Gaian Node and will also be linked to the real database tables from back-end Gaian node.
All OpenMetadata servers (IGC proxy, OMAS server and Atlas) are configured to be part of the same cohort, therefore they will receive and publish events on the same cohort topic.
Configuring the environment
- Setting up the cohort
At the centre of the setup is the concept of cohort. A cohort is a collection of metadata repositories sharing metadata using the Open Metadata Repository Services (OMRS).
Below is a picture describing the cohort defined for this setup. We’ve configured the three members as part of the same cohort. This implies that they listen and publish OMRS events to the same cohort topic and therefore they have in their local registry store registration details about all the other members in the cohort. As a result, this enables the repositories to act as one single virtual repository.
- Virtualization layer: Virtualizer, data virtualization tool and underlying database. Virtualizer is the component designed to interact with GaianDB. Hence its responsibility is to connect to front-end node (Gaian Node 2) and create the views in this node. The data for these views is located in the Oracle database, but available through backend node( Gaian Node 1) to the other connected nodes (Gaian Node 2 in this scenario).
- First step triggering the creation of views: business user assigns business term to column in IGC
- As a result, a new event is published on InfosphereEvent (internal IGC topic)
- IGC Event is consumed by the proxy and translated to OMRS event and types. A new OMRS event containing details about the business term and column is published on the cohort topic, thus reaching all the members of the cohort.
The event will include:
- Details about the type of event. In this case the event is a NEW_RELATIONSHIP_EVENT with type SemanticAssignment
- Mandatory properties for the 2 entities between which the relationship is created. These are necessary to be able to identify the entities uniquely in Open Metadata(OM) world. This includes the type of each entity (RelationalColumn and GlossaryTerm), the guid of each entity as unique identifiers in the repository and also qualifiedName as unique identifier at entity type level.
- Other details such as event provenance, originator, auditing and versioning info are also defined in the event.
4. The event is picked up by both Atlas and OMAS server because they are cohort members and not originators of the initial event. As a result of processing this event, Atlas cohort member will create a SEMANTIC_ASSIGNMENT relationship between the entities representing the column and the glossary term.
The OMASs server will also pick up the event. Because it is configured to have access-services enabled, all enabled access services listeners will receive this event and either process it or discard it, based on the logic and use cases for it. Starting from column guid as unique identifier in OM world, Information View OMAS will retrieve all entities describing the table, thus building the full context. This includes host details, connector type, database name, schema name, table name, columns and business terms linked to the columns, columns constraints such as primary key and foreign key.
This event containing the full context is published to Information View Out topic as input for virtualizer component.
5. Virtualizer is processing the event and creates the 2 views in front-end node (Gaian Node 2). In the screenshot below is the business view created in gaianDB:
As seen in the picture, the view includes the columns that have business terms assigned. Column names are business term names because a business user is not interested in technical name (like FNAME). As a result, what is relevant is the actual meaning: the business term name First Name.
6. Virtualizer is publishing to DataPlatform IN topic the events describing the views.
Sample event can be found here: https://github.com/odpi/egeria/blob/master/open-metadata-publication/website/java-events/information-view-EMPSALANALYSIS.json
Please note that this event contains also details about the host and platform where the view was created, along with the view columns, business terms associated to these column and underlying database columns.
7. DataPlatform OMAS consumes the events from DataPlatform IN topic
As a result, the DataPlatform OMAS will issue calls to enterpriseConnector for creating the entities and relationships modelling the views. The enterpriseConnector is responsible to trigger a federated request by calling the connectors stored in server’s registry store. Therefore it is creating the entities and relationships modelling the view.
Because the current integration for IGC doesn’t support creation of entities/relationships, the entities and relationships defining the view are created only in Atlas. The business and technical views are represented in Open Metadata as RelationalTable entities and all the columns have a relationship to the business term and also to the real database column. The data virtualization tool (gaianDB) and database asset (Gaian database) are also modelled by the entities SoftwareServer, Endpoint, Connection, ConnectorType and Database.
In picture below, all view columns are linked to RelationalTableType
Screenshot below displays the relationship meaning (or semantic assignment) between the column and glossary term
Similarly, picture below displays the connection between the view column and the actual database source column through relationship queryTarget