This discussion introduces the basics of Graph Data Interface (GDI) queries: how to connect to a database, define generators, and follow best practices for secure and efficient data onboarding.
Contents
- Developer Checklist
- Understanding the Dataset
- GDI Query Structure
- Best Practices for Developers
- Troubleshooting Tips
- References
________________________________________________________________________________________
Developer Checklist
- Connect to the database: Define s:DbSource with s:url, s:username, and s:password.
- Select data: Use s:table for full ingestion or s:query for targeted extraction.
- Generate triples + ontology: Always include both s:RdfGenerator and s:OntologyGenerator.
- Scale for performance: Use s:partitionBy (and optionally s:concurrency) for large tables.
- Secure credentials: Store sensitive info in Query Contexts, not in queries.
______________________________________________________________________________________
Understanding the Dataset
Suppose our database contains a table called public.Movies with the fields MovieID, Title, Year, Genre, Director, and IMDB_Rating.
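For illustration, a single row (hypothetical values) might look like:

MovieID = 1, Title = 'The Shawshank Redemption', Year = 1994, Genre = 'Drama', Director = 'Frank Darabont', IMDB_Rating = 9.3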
Our goal is to extract this data and transform it into RDF triples for use in the knowledge graph.
______________________________________________________________________________________
GDI Query Structure
1. Define the Data Source and Generator
A GDI query starts with s:DbSource, which specifies the database connection (JDBC URL, user, password, table/query, etc.).
To transform relational data into a knowledge graph, you must include a generator line with both s:RdfGenerator and s:OntologyGenerator:
- s:RdfGenerator – Converts each row and column from the source into RDF triples (subject–predicate–object). This is how raw data becomes graph entities.
- s:OntologyGenerator – Builds the semantic model for the data by creating classes for tables (e.g., Movie) and properties for fields (e.g., title, year).
Together, these generators automate both the data ingestion and the schema creation, eliminating the need to manually code mappings for every table.
2. Example GDI Transformation Query
Below is a sample GDI transformation query for onboarding data from the public.Movies table.
PREFIX public: <http://altair.com/DataSource/your-datasource-id/Top200Movies/public/>
PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>

INSERT {
  GRAPH ${targetGraph} {
    ?subject ?predicate ?object .
  }
}
WHERE {
  SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
    # Invoke the GDI RDF and Ontology Generators
    ?generator a s:RdfGenerator, s:OntologyGenerator ;
      s:as (?subject ?predicate ?object) ;
      s:base ${targetGraph} .

    # Database connection and extraction details (credentials via Query Context)
    public:Movies a s:DbSource ;
      s:url "{{db.your-datasource-id.url}}" ;
      s:username "{{db.your-datasource-id.user}}" ;
      s:password "{{db.your-datasource-id.password}}" ;
      s:database "{{db.your-datasource-id.database}}" ;
      s:table "public.Movies" ;
      s:model "Movies" .
  }
}
Key Elements
- ?generator a s:RdfGenerator, s:OntologyGenerator ; s:as (?subject ?predicate ?object) ; s:base ${targetGraph} . – This line is required to invoke the GDI RDF and Ontology Generators, which automatically generate the graph and ontology for your data source.
- s:DbSource – Defines the database connection and extraction details, using context variables for sensitive information.
- s:query – A custom SQL query for targeted data extraction, used in place of s:table (see the sketch below).
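For targeted extraction, s:query replaces s:table inside the s:DbSource. A minimal sketch, reusing the connection settings from the example above:

public:Movies a s:DbSource ;
  s:url "{{db.your-datasource-id.url}}" ;
  s:username "{{db.your-datasource-id.user}}" ;
  s:password "{{db.your-datasource-id.password}}" ;
  s:database "{{db.your-datasource-id.database}}" ;
  s:query '''
    SELECT MovieID, Title, Year
    FROM public.Movies
  ''' ;
  s:model "Movies" .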
_______________________________________________________________________________________
Best Practices for Developers
a. Start Simple, Then Expand
Begin with a basic query to ensure connectivity and correct data extraction. For example, start by selecting just a few columns or limiting the number of rows.
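For instance, a first pass might reuse the s:DbSource from the example above with s:limit added to cap the row count (s:limit is covered under performance tuning below):

public:Movies a s:DbSource ;
  s:url "{{db.your-datasource-id.url}}" ;
  s:username "{{db.your-datasource-id.user}}" ;
  s:password "{{db.your-datasource-id.password}}" ;
  s:database "{{db.your-datasource-id.database}}" ;
  s:table "public.Movies" ;
  # Keep the first run small while verifying connectivity and mappings
  s:limit 10 ;
  s:model "Movies" .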
b. Use Query Contexts
Never hardcode credentials. Always use context variables for URLs, usernames, and passwords for security and reusability.
Query contexts are a core feature for managing sensitive information like credentials and connection details. They store these as reusable key-value pairs, which we can reference in our GDI queries to avoid hardcoding secrets. This practice makes our queries more secure, portable, and easier to maintain.
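Within the s:DbSource definition, the contrast looks like this:

# Avoid: a credential hardcoded in the query text
#   s:password "s3cret!" ;
# Prefer: a Query Context reference, resolved at run time
s:password "{{db.your-datasource-id.password}}" ;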
c. Optimize for Performance
To effectively handle large-scale data ingestion, especially with tables that have millions of rows, we should optimize our queries to leverage GDI's parallel processing capabilities.
- Primary Optimization: Use s:partitionBy. This is the most effective method for speeding up ingestion from large tables (see the sketch after this list).
- When we use this property, GDI automatically splits a single large query into multiple smaller queries ("buckets") based on a specified column (like a primary key).
- This prevents the source database from being overwhelmed and allows Graph Studio to process data in parallel, which is much more efficient.
- Secondary Tuning (s:limit, s:offset): For development, testing, and pagination, we can still use s:limit to restrict the number of rows.
- The s:offset property is used in combination with s:limit to page through results.
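Here is a sketch of these tuning properties applied to the s:DbSource from the example above. It assumes MovieID is the numeric primary key; the exact value forms accepted by s:partitionBy and s:concurrency can vary by release, so check the GDI reference for your version:

public:Movies a s:DbSource ;
  s:url "{{db.your-datasource-id.url}}" ;
  s:username "{{db.your-datasource-id.user}}" ;
  s:password "{{db.your-datasource-id.password}}" ;
  s:database "{{db.your-datasource-id.database}}" ;
  s:table "public.Movies" ;
  # Split the extraction into parallel "buckets" keyed on the primary key
  s:partitionBy ?MovieID ;
  # Cap how many partition queries run against the source at once (assumed integer form)
  s:concurrency 4 ;
  # For development/testing: page through a bounded slice of rows
  s:limit 1000 ;
  s:offset 0 ;
  s:model "Movies" .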
d. Advanced: Filtering and Aggregation
Suppose you want only movies with an IMDB rating above 8.0; pass the SQL to s:query in place of s:table inside the s:DbSource:
s:query '''
SELECT MovieID, Title, Year, Genre, Director, IMDB_Rating
FROM public.Movies
WHERE IMDB_Rating > 8.0
'''
Or, to get the highest-rated movie per year:
s:query '''
SELECT Year, MAX(IMDB_Rating) as MaxRating
FROM public.Movies
GROUP BY Year
'''
_____________________________________________________________________________________
Troubleshooting Tips
- Connection errors: Check JDBC URL, Query Context variables, and firewall/DNS access.
- Query errors: Validate SQL directly in the source database first.
- Performance issues: Apply s:partitionBy, s:limit, and s:concurrency (see the tuning sketch under Best Practices).
- Mapping issues: Confirm RDF mapping aligns with column names and types.
________________________________________________________________________________________
Next Steps
Please check Part 2 for extended examples, advanced optimization techniques, and guidance on leveraging ontology for query optimization.