Graph Databases: Leveraging Cutting-Edge Technology for Fraud Detection

  • Alan Lee

August 06, 2015

In my previous post, I discussed how analytics can help identify insurance fraud, but how can this be applied in the real world? A relatively new class of technology has emerged that is built specifically to store and analyze relationships, or graphs, with blazing speed and flexibility. A graph is simply networked data: anyone who has seen a picture of people or objects connected by lines indicating how they relate to one another has seen a graph. A list of friends with lines drawn to show friendship connections, or a map of cities and the roads connecting them, are both examples of graph data. The graph database is a data store built specifically to process this type of data. It is a class of NoSQL database that has, over the last decade, seen rapid development and successful deployment in production environments where relationships are the key to unlocking value in data. LinkedIn, Facebook, and Google are some of the highest-profile companies that use this type of data store to mine relationship data.
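To make that picture concrete, here is a minimal sketch in plain Python of how such a network might be represented as an adjacency list. The names and connections are entirely made up for illustration; real graph databases use far more sophisticated storage, but the underlying idea of nodes linked by edges is the same.

```python
# A tiny, hypothetical friends list: each person maps to the set of
# people they are directly connected to (an adjacency list).
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Dave"},
    "Carol": {"Alice"},
    "Dave": {"Bob"},
}

# Two people are connected by an edge if each appears in the other's set.
print("Alice" in friends["Bob"])   # True: Alice and Bob are friends
print("Dave" in friends["Carol"])  # False: no direct connection
```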

So what is so special about graph databases? Relationships in data can be analyzed in traditional SQL databases simply by performing the right joins between tables across all the relationships you want to examine, using primary and foreign keys. You could join the lawyer table to the claims table, then join those to the claimants table, and so on, until you have explicitly mapped out every relationship you need to analyze. Herein lies the problem: the relational database paradigm was not designed to store relationship data for quick consumption. Expensive join operations must be performed for relationships that may span ten or more tables, and even with indexing to optimize performance, any SQL developer will tell you that mining and traversing graph data is the wrong use case for a traditional database, because the queries slow to a crawl under that many joins. Traditional relational databases rely on indexes and balanced search-tree structures to join data quickly. Though these operations are optimized for very large datasets, finding relationships three degrees away (“a friend of a friend of a friend”) can bottleneck any kind of relationship analysis. A commercial graph database can search a network three degrees out in roughly a tenth of a second; the same query can take 30 seconds on a typical relational database. That may not sound like an issue until you have thousands, or hundreds of thousands, of people to search through, which is still small by today’s data standards. At Guidewire, we routinely develop algorithms that search five to six degrees out, which can take literally months on a traditional database versus about two seconds on a graph database. Other NoSQL databases, such as columnar or key-value stores, may outperform a traditional database at analyzing these connections, but they still pale in comparison to the speed and efficiency of graph-devoted data stores. The simple reason is that graph databases were optimized for exactly this use case through a combination of technologies, such as index-free adjacency, intelligent caching, and data structures built to store, traverse, and retrieve relationship data quickly. Less-efficient approaches such as MapReduce can reach similar speeds, but unless you are a government agency or a company that can afford more than a thousand clustered computers devoted solely to graph processing, a graph database is significantly more cost-effective.
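The intuition behind index-free adjacency is that each node holds direct references to its neighbors, so a search “k degrees out” is a breadth-first walk whose cost grows with the size of the neighborhood, not with the number of rows in the whole database and not with the number of joins. The sketch below illustrates that idea in plain Python over a small, hypothetical claims network; the function name, node names, and data are inventions for this post, not the internals of any particular graph database.

```python
from collections import deque

def within_degrees(graph, start, max_degree):
    """Return all nodes reachable from `start` in at most `max_degree` hops.

    `graph` is an adjacency list: {node: iterable of neighboring nodes}.
    Each hop is a direct neighbor lookup (the idea behind index-free
    adjacency), so the work done depends on the size of the local
    neighborhood rather than the total size of the dataset.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    reachable = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_degree:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                reachable.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reachable

# Hypothetical claims network: claimants, lawyers, and claims as nodes.
network = {
    "claimant_1": {"claim_A", "lawyer_X"},
    "claim_A": {"claimant_1", "lawyer_X", "claimant_2"},
    "lawyer_X": {"claimant_1", "claim_A", "claim_B"},
    "claim_B": {"lawyer_X", "claimant_3"},
    "claimant_2": {"claim_A"},
    "claimant_3": {"claim_B"},
}

# Everything within three hops of claimant_1, including the second
# claim handled by the same lawyer.
print(within_degrees(network, "claimant_1", 3))
```

The equivalent SQL query would need one self-join (or join through a link table) per hop, which is exactly where relational engines begin to struggle as the degree grows.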

Our results from marrying graph databases and insurance data have been overwhelming, to say the least. We have uncovered many signals directly related to insurance fraud simply because we could analyze these relationships with blazing speed, powering through entire databases full of claims data. None of this would be possible if we could not traverse the graph with extreme efficiency, since we are dealing with graphs in which millions of people and processes are connected in many different ways. Not only does this let us explore complex relationships in an insurer’s SIU data, it opens the way to sophisticated predictive analytics, allowing us to use customer data to build cutting-edge models that catch fraud. None of this is trivial: leveraging the technology requires extreme discipline in data cleaning and algorithm development, along with constant tuning of the graph relationships themselves. There is an essentially unlimited number of ways to translate a customer’s relational database into graph form, and only a tiny subset of them produce meaningful insight. This type of work is best suited to a team with prior expertise in graph analysis across large datasets, as well as an extensive history of fraud analytics. With the results we are getting, I am convinced that we can continue to leverage this technology to help revolutionize the tools and data available to those with a vested interest in catching and preventing insurance fraud.
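To make the point about translation choices concrete, here is one hypothetical way a couple of relational claim rows could be turned into nodes and edges. The table layout, IDs, and edge labels are invented for illustration; a different modeling (for example, promoting addresses or phone numbers to nodes) would surface different relationships, which is why this step takes so much tuning.

```python
# Hypothetical relational rows, as they might come out of a claims system.
claims = [
    {"claim_id": "C-100", "claimant_id": "P-1", "lawyer_id": "L-9"},
    {"claim_id": "C-101", "claimant_id": "P-2", "lawyer_id": "L-9"},
]

# One possible graph modeling: every ID becomes a node, and each row
# contributes FILED and REPRESENTED_BY edges.
nodes, edges = set(), []
for row in claims:
    nodes.update(row.values())
    edges.append((row["claimant_id"], "FILED", row["claim_id"]))
    edges.append((row["claim_id"], "REPRESENTED_BY", row["lawyer_id"]))

print(edges)
# Both claims share lawyer L-9, a link that becomes a one-hop traversal
# in graph form instead of a multi-table join.
```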