Using a Graph Database when dealing with connected data
Relational databases deal poorly with relationships
Do you deal with complex queries that involve many joins (four or more)?
Do you try to find unknown paths through the data?
Do you have a model that evolves frequently?
At Vadis Technologies, our data scientists often face these kinds of challenges.
In partnership with Intys Data, we have been investigating graph databases to:
- gain in performance
- work agile
- get a richer view of the 300M companies and 200M people we analyze
If you work with highly connected data (social networks), recommendations (e-commerce), or pathfinding (how do I know you?), then you will like our findings.
Do not get me wrong. Relational databases are great. But only for a limited number of tables, and when a rigid structure is not an issue. Here is an illustration from Max De Marzi.
A graph is a set of nodes, with relationships that connect them
A graphDB is a database with:
- An explicit graph structure;
- Nodes that know each of their adjacent nodes;
- An index for lookups;
- Local steps (hops) whose cost remains constant as the number of nodes increases.
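To make these four properties concrete, here is a minimal Python sketch of a toy in-memory graph. It is not how Neo4j or TigerGraph are implemented; the class name `ToyGraph`, its methods, and the sample people and company are all assumptions made up for illustration.

```python
from collections import defaultdict


class ToyGraph:
    """Toy in-memory graph (illustrative only, not a real graphDB)."""

    def __init__(self):
        self.index = {}                    # index for lookups: node id -> properties
        self.adjacent = defaultdict(list)  # node id -> [(relationship type, neighbour id)]

    def add_node(self, node_id, **properties):
        self.index[node_id] = properties

    def add_relationship(self, source, rel_type, target):
        # Store the link on both endpoints so each node knows its adjacent nodes.
        self.adjacent[source].append((rel_type, target))
        self.adjacent[target].append((rel_type, source))

    def hop(self, node_id, rel_type=None):
        # One local step: only the adjacency list of node_id is read,
        # so the work depends on that node's degree, not on the graph size.
        return [other for kind, other in self.adjacent.get(node_id, [])
                if rel_type is None or kind == rel_type]


# Hypothetical sample data.
graph = ToyGraph()
graph.add_node("alice", kind="Person")
graph.add_node("bob", kind="Person")
graph.add_node("acme", kind="Company")
graph.add_relationship("alice", "WORKS_FOR", "acme")
graph.add_relationship("bob", "WORKS_FOR", "acme")

print(graph.hop("acme", "WORKS_FOR"))
```

The point of the sketch is the `hop` method: it never scans the full set of nodes, which is the intuition behind hops whose cost stays constant as the graph grows.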
If you are used to SQL:
- Rows in tables become nodes;
- Foreign keys become relationships;
- Link tables become relationships (possibly with properties);
- Artificial constructs (extra primary and foreign keys for example) are no longer necessary.
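As a rough illustration of that mapping, the Python sketch below turns a few relational rows into nodes and relationships. The tables (`people`, `companies`, `mandates`), their columns, and the relationship names are hypothetical, and the structures are plain dictionaries and tuples rather than any graphDB API.

```python
# Hypothetical relational rows.
people = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
companies = [
    {"id": 10, "name": "Acme", "parent_id": None},
    {"id": 11, "name": "Acme Labs", "parent_id": 10},  # foreign key to companies.id
]
mandates = [  # link table between people and companies, with one extra column
    {"person_id": 1, "company_id": 10, "role": "director"},
    {"person_id": 2, "company_id": 11, "role": "auditor"},
]

nodes = {}          # (label, id) -> properties
relationships = []  # (source, type, target, properties)

# Rows in tables become nodes.
for row in people:
    nodes[("Person", row["id"])] = {"name": row["name"]}
for row in companies:
    nodes[("Company", row["id"])] = {"name": row["name"]}
    # Foreign keys become relationships.
    if row["parent_id"] is not None:
        relationships.append(
            (("Company", row["id"]), "SUBSIDIARY_OF", ("Company", row["parent_id"]), {})
        )

# Link tables become relationships, here with a property.
for row in mandates:
    relationships.append(
        (("Person", row["person_id"]), "HAS_MANDATE_AT",
         ("Company", row["company_id"]), {"role": row["role"]})
    )

for relationship in relationships:
    print(relationship)
```

The ids survive here only as lookup keys; the `parent_id`, `person_id` and `company_id` columns disappear once the relationships exist.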
Fraud Detection at Vadis Technologies, a Top 100 #RegTech Company
At Vadis Technologies, we do fraud detection for accounts such as public institutions and big banks.
In order to do that, our team harvests and enriches complex business data to offer risk scoring and 360° third-party monitoring to our customers.
The graphDBs we have been investigating are Neo4j and TigerGraph.
Gain in performance
As our data becomes more connected, our queries need more joins.
With relational databases, the bigger the dataset, the slower our join-intensive queries become. With graphDBs, performance tends to remain constant.
As a benchmark, we created 2M nodes and 4M relationships in 40 seconds in TigerGraph using an SSD and 16 GB of memory. With Neo4j, we built these in about 80 seconds.
Note that you can speed up loading by paying attention to indexing and to how transactions are organized.
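One common way to organize transactions during a bulk load is to write rows in batches rather than one at a time. The Python sketch below shows only the batching logic. The `write_batch` callback is a hypothetical stand-in for whatever call commits one transaction in your database driver, and the batch size is an arbitrary assumption, not a setting we benchmarked.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load(rows, write_batch, batch_size=10_000):
    """Send rows to write_batch one batch at a time and return the batch count."""
    count = 0
    for batch in batched(rows, batch_size):
        write_batch(batch)  # hypothetical: one transaction per call
        count += 1
    return count


# Usage with a fake writer that just records what it receives.
received = []
n_batches = load(range(25), received.append, batch_size=10)
print(n_batches, [len(batch) for batch in received])
```

The right batch size depends on the database and the hardware, so treat it as something to measure rather than a fixed rule.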
If you work with a large amount of data and need to scale, you will most likely not be able to hold the whole graph in memory.
I suggest you check this analysis comparing the performance of TigerGraph and Neo4j on a 500GB dataset. It presents metrics on loading time, query performance, and storage size after loading.
Work agile
The cost of change in graphDBs is low, so you can stay agile throughout your workflow.
1. Derive the question
There is no need to grasp the whole problem domain in one go, or to turn that knowledge into a big model.
Pick one concrete question that needs to be solved and that adds value.
The more concrete, the better!
2. Obtain the data
What data is needed to answer your question? Get that data and only that data. If an ID is sufficient to solve the question, then do not get the name and description.
3. Develop a model and ingest the data
There are no fixed rules for creating a good graph. Be creative.
Still, here are a few performance-driven tips and tricks:
- Use Nodes for Entities, Relationships for Structure;
- Represent Facts as nodes. A fact emerges when two or more domain entities interact for a period of time;
- In general, use fine-grained relationships instead of generic relationships;
- Represent complex value types as nodes.
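Here is a small Python sketch of what these tips can look like in practice. Every node, property, and relationship name below is a made-up example (a person holding a mandate at a company), not our production model.

```python
# Nodes for entities, plus one fact node and one complex value node.
nodes = {
    "alice": {"label": "Person"},
    "acme": {"label": "Company"},
    # Fact as a node: Alice and Acme interact for a period of time.
    "mandate-1": {"label": "Mandate", "role": "director", "start": 2015, "end": 2019},
    # Complex value type as a node: an address has several fields.
    "addr-1": {"label": "Address", "street": "1 Main Street", "city": "Brussels"},
}

# Relationships for structure, with fine-grained types
# (REGISTERED_AT and LIVES_AT rather than one generic HAS_ADDRESS).
relationships = [
    ("alice", "HOLDS", "mandate-1"),
    ("mandate-1", "AT", "acme"),
    ("acme", "REGISTERED_AT", "addr-1"),
    ("alice", "LIVES_AT", "addr-1"),
]


def targets(source, rel_type):
    """Follow relationships of one type from a node, without inspecting properties."""
    return [t for s, r, t in relationships if s == source and r == rel_type]


def companies_of(person):
    """Person -> Mandate -> Company, going through the fact node."""
    return [company
            for mandate in targets(person, "HOLDS")
            for company in targets(mandate, "AT")]


print(companies_of("alice"))
print(targets("acme", "REGISTERED_AT"))
```

Because the relationship type already says what kind of address link it is, a traversal can filter on the type alone instead of reading a property on every relationship.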
4. Query/prove your model
Write the query that answers your question.
Does it perform within expectations?
- YES — Excellent, you are ready for the next iteration
- NO — Backtrack to steps 2 and 3 and rethink the model
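A simple way to make "within expectations" measurable is to time the query against a budget. The sketch below is an assumption about how you might do that in Python: `within_expectations` and the budget value are hypothetical, and `run_query` stands for any callable that executes your real query.

```python
import time


def within_expectations(run_query, budget_seconds):
    """Run a query once and report whether it finished within the time budget."""
    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


# Stand-in for a real query: any zero-argument callable works.
result, elapsed, ok = within_expectations(lambda: sum(range(1000)), budget_seconds=5.0)
print("next iteration" if ok else "rethink the model")
```

A single run is noisy, so in practice you would probably repeat the measurement on realistic data volumes before deciding to backtrack.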
Richer picture of the data
GraphDBs make it easier to visualize data and see the links between different entities.
Both Neo4j and TigerGraph have a tool that enables you to navigate through the data and visualize your model.
Here is a graph built with TigerGraph on a dataset containing 280K users (anonymized but with demographic information) providing 1M ratings about 250K books.
You can easily query a user and see which books that user has rated. Then you can expand on a rated book, and see what other users think of it.
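The same two steps (user to books, then book to other users) can be sketched in plain Python. The ratings below are invented toy data, not rows from the dataset above, and the function names are assumptions for illustration rather than TigerGraph queries.

```python
from collections import defaultdict

# Hypothetical (user, book, score) ratings.
ratings = [
    ("user_1", "book_a", 8),
    ("user_1", "book_b", 5),
    ("user_2", "book_a", 9),
    ("user_3", "book_a", 3),
    ("user_3", "book_c", 7),
]

# Keep the RATED relationship reachable from both ends.
rated_by_user = defaultdict(dict)
rated_by_book = defaultdict(dict)
for user, book, score in ratings:
    rated_by_user[user][book] = score
    rated_by_book[book][user] = score


def books_rated_by(user):
    """First hop: from a user to the books that user rated."""
    return dict(rated_by_user.get(user, {}))


def other_opinions(book, excluding_user):
    """Second hop: from a book to the other users who rated it."""
    return {u: s for u, s in rated_by_book.get(book, {}).items() if u != excluding_user}


print(books_rated_by("user_1"))
print(other_opinions("book_a", excluding_user="user_1"))
```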
It gets even more interesting as the number of types of relationships increases.
GraphDBs are used in many other cases
Here is a non-exhaustive list of graphDB use cases:
- Recommendation engine
- Network and IT Operations
- Search engine
- Master Data Management
- Identity and access management (internal and external)
- Machine learning and analytics
- Social networks
- Privacy and risk compliance
- Email targeting
- Knowledge Graph (for asset management, content management, inventory, workflows, cataloging…)