This document explains how to use the Graph Data Interface (GDI) to onboard relational data into Graph Studio.
It covers writing GDI queries, securing credentials, optimizing performance, and leveraging ontology-driven mappings to transform tabular data into a structured knowledge graph.
Contents
- Developer Checklist
- Understanding the Dataset
- GDI Query Structure
- Best Practices for Developers
- Troubleshooting Tips
- Leveraging the Ontology for Query Optimization
- Further Examples
_______________________________________________________________________________________
Developer Checklist
- Connect to the database: Use
s:DbSource, s:url, s:username, and s:password to define the connection. - Select data: Use
s:table for a full ingestion or s:query for a more targeted load. - Generate the graph: Ensure your query includes both
s:RdfGenerator and s:OntologyGenerator to create data triples and the model automatically. - Use partitioning: To optimize for large tables, use
s:partitionBy to enable parallel ingestion. - Secure credentials: Avoid hardcoding credentials. Use a secure Query Context to store and reference sensitive information.
_______________________________________________________________________________________
Understanding the Dataset
Suppose our database contains a table called public.Movies with fields such as: MovieID, Title, Year, Genre, Director & IMDBRating.
Our goal is to extract this data and transform it into RDF triples for use in our knowledge graph.
________________________________________________________________________________________
GDI Query Structure
1. Define the Data Source and Generator
GDI uses the s:DbSource class to define a relational database source.
The generator line (?generator a s:RdfGenerator, s:OntologyGenerator ; ...) is essential for invoking the RDF and Ontology Generators, which automate the mapping of tabular data to RDF triples.
The s:RdfGenerator and s:OntologyGenerator are key components of a GDI query:
s:RdfGenerator: This is responsible for the core task of converting our source data into RDF triples, which are the fundamental building blocks of a knowledge graph. It automatically transforms rows and columns from our relational database into a subject, predicate, and object structure. This is the "process of adding" data that we want to perform.s:OntologyGenerator: This is responsible for defining the semantic structure of our data. It automatically creates a data model (or ontology) that represents the schema of our source. This includes defining classes for our tables (e.g., a Movie class) and properties for our fields (e.g., title, year, rating). This is why the ontology generator is "essential" to the process.
2. Example GDI Transformation Query
Below is a sample GDI transformation query for onboarding data from a public.Movies table.
PREFIX public: <http://altair.com/DataSource/your-datasource-id/Top200Movies/public/>
PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>
INSERT {
GRAPH ${targetGraph} {
?subject ?predicate ?object .
}
}
WHERE {
SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
?generator a s:RdfGenerator, s:OntologyGenerator ;
s:as (?subject ?predicate ?object) ;
s:base ${targetGraph} .
public:Movies a s:DbSource ;
s:url "{{db.your-datasource-id.url}}" ;
s:username "{{db.your-datasource-id.user}}" ;
s:password "{{db.your-datasource-id.password}}" ;
s:database "{{db.your-datasource-id.database}}" ;
s:table "public.Movies" ;
s:model "Movies" ;
.
}
}
Key Elements
?generator a s:RdfGenerator, s:OntologyGenerator ; s:as (?subject ?predicate ?object) ; s:base ${targetGraph} .This line is required to invoke the GDI RDF and Ontology Generators, which automatically generate the graph and ontology for your data source.- s:DbSource:
Defines the database connection and extraction details, using context variables for sensitive information. - s:query:
Custom SQL query for data extraction.
________________________________________________________________________________________
Best Practices for Developers
a. Start Simple, Then Expand
Begin with a basic query to ensure connectivity and correct data extraction. For example, start by selecting just a few columns or limiting the number of rows.
b. Use Query Contexts
Never hardcode credentials. Always use context variables for URLs, usernames, and passwords for security and reusability.
c. Optimize for Performance
To effectively handle large-scale data ingestion, especially with tables that have millions of rows, we should optimize our queries to leverage GDI's parallel processing capabilities.
- Primary Optimization: Use
s:partitionBy: s:partitionBy is the most effective method for speeding up ingestion from large tables. - When we use this property, GDI automatically splits a single large query into multiple smaller queries ("buckets") based on a specified column (like a primary key).
- This prevents the source database from being overwhelmed and allows Graph Studio to process data in parallel, which is much more efficient.
- Secondary Tuning (
s:limit, s:offset): For development, testing, and pagination, we can still use s:limit to restrict the number of rows. Thes:offset property is used in combination with s:limit to page through results.
d. Advanced: Filtering and Aggregation
Suppose you want only movies with an IMDB rating above 8.0:
s:query '''
SELECT MovieID, Title, Year, Genre, Director, IMDB_Rating
FROM public.Movies
WHERE IMDB_Rating > 8.0
'''
________________________________________________________________________________________
Troubleshooting Tips
- Connection errors: Double-check your context variables and network access.
- Query errors: Validate your SQL directly in the database before using it in GDI.
- Performance issues: Use
partitionBy and limit clauses. - Mapping issues: Ensure your RDF mapping matches the column names and types.
________________________________________________________________________________________
Leveraging the Ontology for Query Optimization
Ontology awareness helps us leverage the GDI’s ability to push down filters to the source.
Here are the steps to explore the Ontology Structure .
- From the main menu, navigate to the Model section.
- In the Model section, click the filter icon.
- In the filter options, select the checkbox to show only system data.
- In the filtered list of ontologies, we will see the
Data Toolkit ontology. This is the ontology used by GDI.
- Click on the
DataToolkit ontology to inspect the classes, properties, and relationships that GDI has inferred.
- Expand the relevant options on the left, helps us understand the structure and supported parameters in our ontology.
_____________________________________________________________________________________
Further Examples
1. Using s:partitionBy for Large Datasets
Suppose you want to extract the Top 200 Movies, but your table is very large and you want to avoid timeouts and memory issues.
PREFIX movies: <http://altair.com/DataSource/your-datasource-id/Top200Movies/public/>
PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>
INSERT {
GRAPH ${targetGraph} {
?movie a movies:Movie ;
movies:title ?Title ;
movies:year ?Year .
}
}
WHERE {
SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
?generator a s:RdfGenerator, s:OntologyGenerator ;
s:as (?subject ?predicate ?object) ;
s:base ${targetGraph} .
?movie a s:DbSource ;
s:url "{{db.movies.url}}" ;
s:username "{{db.movies.user}}" ;
s:password "{{db.movies.password}}" ;
s:database "{{db.movies.database}}" ;
s:table "public.Top200Movies" ;
s:model "Top200Movies" ;
s:partitionBy ("Year")) ; # Partition extraction by Year
s:limit 200 ;
s:query '''
SELECT MovieID, Title, Year
FROM public.Movies
ORDER BY IMDB_Rating DESC
LIMIT 200
''' .
}
}
Why use these?
s:partitionBy splits the extraction into separate partitions based on the values in the specified column (e.g., Year).
2. Using s:concurrency for Parallel Data Loads
To speed up ingestion by running multiple queries in parallel:
...
s:concurrency 4 ;
...
To speed up ingestion from very large tables, we can run multiple queries in parallel. This is especially useful when our database can handle multiple concurrent connections. For this to work effectively, we should always pair s:concurrency with s:partitionBy.
s:concurrency: This property sets the maximum number of parallel cores (also referred to as slices in the documentation) that GDI can use to execute our query.
3. Locale-Specific Parsing with s:locale
If your data contains locale-specific formats (e.g., dates, numbers):
...
s:locale "en_US" ; # Use US English locale for parsing
...
This ensures that numbers and dates are interpreted correctly.
4. Model and Key for Entity Mapping
If you want to specify a model and a unique key for entity resolution:
...
s:model "Movies" ;
s:key ("MovieID") ;
...
The s:model Property: Defining Classes
s:model maps a table or query to an RDF class in the ontology (e.g., s:model "Movie" creates a Movie class). This adds semantic structure so the data isn’t just triples but a queryable schema.
The s:key Property: Unique Identities
s:key defines the primary key column to generate unique URIs (e.g., s:key ("MovieID")). This guarantees each row is a distinct entity and can be linked across sources.
5. Example: Data Normalization with s:normalize
This lets us control how names (labels and URIs) are created for classes and properties when we bring data into Graph Studio from different sources.
This means we can set rules to make sure the names in our data model are consistent, clean, and follow preferred style or standards
...
s:normalize true ;
...
Or specify rules:
...
s:normalize [ s:trim true ; s:lowercase true ] ;
...
6. Using s:formats for Data Type Formatting
If you want to control how certain datatypes are formatted:
...
s:formats [
s:dateFormat "yyyy-MM-dd" ;
s:numberFormat "#,##0.##" ;
s:delimiter "," ;
s:headerRow true
] ;
...
s:formats block is especially important when your data uses non-default formats or when you want to avoid parsing errors during integration
7. Reference and Foreign Key Joins
Suppose you want to join with a related table (e.g., Directors):
...
s:reference [
s:model "Directors" ;
s:using ("DirectorID")
] ;
...
This tells GDI how to resolve relationships between tables.
_______________________________________________________________________________________
Further Reading:
____________________________________________________________________________________
Related Discussions
Pre-Requisites for Database Connectivity
Connecting Graph Studio to External Databases