In this new part of the DataHub tutorial series, we work on connecting to the platform through its API. As data engineers, our goal is to incorporate DataHub into our ecosystem as a Data Governance tool. Doing so often requires custom integrations, built with the development stack that DataHub offers.
The first thing we need is a running DataHub service. The article Tutorial DataHub 2 – Quickstart and Deployment explains how to set one up.
DataHub Metadata Service ‘GMS’
In the first part of the series, DataHub Tutorial – Architecture, we introduced this component, which we described as the heart of DataHub.
When deploying DataHub with Docker, we will see a service called ‘acryldata/datahub-gms:head’, which maps port 8080: ‘0.0.0.0:8080->8080/tcp, :::8080->8080/tcp’.
The DataHub metadata service contains two different APIs, GraphQL and Rest.li. Let’s start by exploring the first one, GraphQL API.
If we go to the URL ‘http://localhost:9002/api/graphiql’, we can access GraphiQL, the in-browser GraphQL tool. From this interface, we can run queries against the API and browse the documentation for the schemas registered in the framework.
It is important to note that this website works with a cookie system that allows access to the service as long as we have previously logged in to DataHub. Otherwise, it will return a 401 error.
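Outside the browser there is no session cookie, so a script has to authenticate another way. The following is a minimal sketch, assuming that a personal access token has been generated in DataHub and that the GraphQL endpoint is reachable at ‘http://localhost:8080/api/graphql’. Both are assumptions about the deployment, and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Assumed endpoint: the GraphQL route of GMS on the port mapped above.
GRAPHQL_URL = "http://localhost:8080/api/graphql"


def build_graphql_request(query, token, variables=None, url=GRAPHQL_URL):
    """Build a POST request carrying a GraphQL query and a bearer token."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the token is a DataHub personal access token.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_graphql(request):
    """Send the request and decode the JSON reply; turn a 401 into a clear error."""
    try:
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 401:
            raise PermissionError("DataHub rejected the credentials (401)") from err
        raise
```

Calling run_graphql(build_graphql_request(query, token)) sends the query. Building the request separately from sending it makes it possible to inspect the headers and body without a running service.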
GraphQL
When using GraphQL with DataHub, we have three options: Query, Search and Mutation.
Query
We want to obtain the data of an entity whose URN (Uniform Resource Name) we already know. The URN is the identifier that uniquely defines any resource in DataHub. A query of this kind must look the entity up by URN; it cannot be done by any other property.
I have previously created a domain through the web interface, which also shows its URN. We will use that URN for the query: ‘urn:li:domain:0e1ca480-3163-4212-9119-c6cd8ebec259’.
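To make the structure of a URN concrete, here is a small sketch that splits one into its parts. It assumes the ‘urn:<namespace>:<entity type>:<id>’ layout seen in the example above. The Urn class and parse_urn function are hypothetical helpers, not part of DataHub.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    namespace: str    # "li" in the example above
    entity_type: str  # e.g. "domain"
    entity_id: str    # everything after the entity type


def parse_urn(urn: str) -> Urn:
    """Split 'urn:<namespace>:<entity type>:<id>' into its parts."""
    # maxsplit=3 keeps any colons inside the id untouched.
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or not all(parts[1:]):
        raise ValueError(f"Not a valid URN: {urn!r}")
    return Urn(*parts[1:])
```

For the domain created above, parse_urn returns the namespace ‘li’, the entity type ‘domain’ and the generated identifier as the id.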
Note how the query is constructed:
query {
  domain(urn: "urn:li:domain:0e1ca480-3163-4212-9119-c6cd8ebec259") {
    id
    ownership {
      owners {
        type
      }
      lastModified {
        actor
      }
    }
    properties {
      name
    }
  }
}
First, we have the action we want to perform. On this occasion, a query. Next, we indicate the schema we wish to consult, which, in our case, will be domain.
The schema can be consulted in the left-hand side documentation bar, where the fields that exist in each of the schemas will also appear.
Next, we must indicate the value of the URN field, which is mandatory. Finally, we list the fields we want returned. In this example, these are the internal fields of the Domain entity: id, ownership and properties.
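The same query can be sent from code. The sketch below passes the URN as a GraphQL variable rather than pasting it into the query text, and reads the name back out of the reply. It assumes that the domain field takes a String! argument and that the reply follows the usual GraphQL ‘data’ envelope. The helper names are hypothetical.

```python
import json
from typing import Optional

DOMAIN_QUERY = """
query getDomain($urn: String!) {
  domain(urn: $urn) {
    id
    properties {
      name
    }
  }
}
"""


def domain_payload(urn: str) -> bytes:
    """Serialise the request body: the query text plus its variables."""
    return json.dumps({"query": DOMAIN_QUERY, "variables": {"urn": urn}}).encode("utf-8")


def domain_name(response: dict) -> Optional[str]:
    """Pull properties.name out of a reply, or None if the domain is missing."""
    domain = (response.get("data") or {}).get("domain") or {}
    return (domain.get("properties") or {}).get("name")
```

The bytes returned by domain_payload are what would be posted to the GraphQL endpoint. Using a variable avoids building query strings by hand for each URN.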
Search
The search command is used when we don’t know the URN and we want to find entities within DataHub.
{
  search(input: {type: DOMAIN, query: "M*", start: 0, count: 10}) {
    searchResults {
      entity {
        urn
        type
        ... on Domain {
          properties {
            name
          }
        }
      }
    }
  }
}
In the case of search, we must include an input field. It holds the filters we want to apply: in this example, the DOMAIN type and every name that begins with the letter ‘M’. The input also supports paging, so we ask for the first 10 elements by means of the parameters start and count.
In the body of the request, we specify the fields we want to retrieve in the response. In this example, we ask for the URN and type fields (which are common to all entities) and, more specifically, from Domain we ask for the name property.
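The start and count parameters can be used to walk through all the results page by page. The following is a sketch under stated assumptions: the input type is named SearchInput, and execute is a hypothetical callable that sends a query with its variables and returns the decoded reply.

```python
from typing import Callable, Iterator

SEARCH_QUERY = """
query searchDomains($input: SearchInput!) {
  search(input: $input) {
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def iter_search(execute: Callable[[str, dict], dict], query: str = "M*",
                page_size: int = 10) -> Iterator[dict]:
    """Yield entities page by page until a page comes back short."""
    start = 0
    while True:
        variables = {"input": {"type": "DOMAIN", "query": query,
                               "start": start, "count": page_size}}
        response = execute(SEARCH_QUERY, variables)
        results = response["data"]["search"]["searchResults"]
        for result in results:
            yield result["entity"]
        # A page shorter than page_size means there is nothing left to fetch.
        if len(results) < page_size:
            break
        start += page_size
```

Passing the transport in as a function keeps the paging logic independent of how requests are sent, so it can be tried against a stub before pointing it at a real service.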
Mutations
The system provides us with a series of functions to create different entities through the API. The DataHub GraphQL documentation lists all the existing mutations.
mutation createTag {
  createTag(input: {
    name: "Deprecated",
    description: "Having this tag means this column or table is deprecated."
  })
}
In this example, we create a tag, following the documentation. The most important things to note are the arguments and their types, some of which are optional.
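When the mutation is built from code, optional arguments can simply be left out of the input. The sketch below assumes that the input type is named CreateTagInput and that description is optional; both should be checked in the documentation sidebar. The helper name is hypothetical.

```python
from typing import Optional

CREATE_TAG_MUTATION = """
mutation createTag($input: CreateTagInput!) {
  createTag(input: $input)
}
"""


def create_tag_payload(name: str, description: Optional[str] = None) -> dict:
    """Build the request body, leaving out optional arguments that were not given."""
    tag_input = {"name": name}
    if description is not None:
        tag_input["description"] = description
    return {"query": CREATE_TAG_MUTATION, "variables": {"input": tag_input}}
```

For example, create_tag_payload("Deprecated") produces an input that contains only the name, while passing a second argument adds the description.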
Rest.li
You can find the Rest.li documentation at the URL ‘http://localhost:8080/restli/docs’.
Conclusion
In this post we have seen how to use the DataHub API. Both GraphQL and Rest.li provide data engineers with flexible and customisable tools to manage data governance in their ecosystem.
The ability to perform specific queries using GraphQL, search for entities without knowing their URN, and perform mutations to create or update entities, allows for deep and custom integration of DataHub into any data infrastructure. Through this API, DataHub’s potential as a Data Governance system can be fully exploited, facilitating centralised and secure access to organisational information.