Building a KG on the Cheap

Nathan Wiegand, Sizhu Cheng, Max Allen, and Brandon Chong on 04/05/21

Search engines stopped being just about 10 blue links a long time ago. One of the core features that people have come to depend on is rich knowledge cards that show information about people, places, and things. Last August, we decided it was time for Neeva to build our own rich knowledge cards, and we gave ourselves only 4 weeks to do it. We had to source the correct ground truth, process it all into both a useful artifact and a presentation layer, and serve it in milliseconds to our customers.

Wikipedia is an invaluable resource to the internet. Not only is it globally accessible and editable, but it’s freely downloadable and under a permissive license (CC BY-SA 3.0). It was a logical place to start building our own knowledge graph.

When starting a project I like to break features down into priorities: P0 is a must have, P1 is important, and P2 is… well, maybe someone will get to it :).

For us, a KG had to have these features:

  • P0: Low latency at serve time
  • P0: Customizable templates(weight matters for boxers and football players)
  • P0: Smart units(metric for some, SI for the rest)
  • P1: Pretty pictures
  • P1: Maps for Cities/States
  • P2: Real time updates
  • P2: Complex categorical pivots (e.g. List of notable people from some city)

Most folks are familiar with the CPU vs Space trade-off of many projects. I like to think there’s another tradeoff that can be made as well: effort. We’re building a search engine from the ground up. Time is a luxury. So we set ourselves a goal of being done with this project in 4 weeks.

The Wikimedia foundation publishes Wikipedia in numerous formats. You can get the raw Wikimedia markup, pre-processed HTML files in ZIM archives, or even a pre-baked knowledge graph called Wikidata. Writing parsers is a ton of fun, but not the most valuable use of our time. Given that, we went with Wikidata.

Wikidata is structured as a graph. Each entity is a node and there are labeled edges between nodes. So you might say that (Albert Einstein, hasOccupation, Scientist). Graphs like this are a fantastic way for representing knowledge, because they allow for very complex schemas. However, they can present a challenge when you want to process the data.

Wikidata is published in two main formats: a file of triples, and a flattened JSON blob per entity. Each row in the triple file is an edge in the graph. Processing this requires either loading the data into a graph database or implementing your own graph query processor (I gave it a shot!). We looked into graph databases, and while they provide great versatility in querying, we’d still need a query per type of entity we’d want to display and would want to materialize them for fast serving. Given this, we decided to batch process the Wikidata dump, and so in the end, the JSON data was better suited for the vast majority of our use cases.

A flattened entity, from a graph perspective, is a node with all of its out-links. From our Einstein example, there would be one entry in our dataset for Albert Einstein and the JSON blob would have all out edges in its graph neighborhood, including occupation, spouses, date of birth, and notable works. It even contains references to foreign keys in other popular datasets like IMDb.

Even with this convenient format we are left with the challenge of building a presentation layer. How can we extract the attributes we want to show, and in the correct formats?

Our solution was to build a domain specific language for specifying fields. Rather than rolling our own parser (time constraint!), we represent our rules in a tree form in a YAML file (I’m pretty sure if LISP were defined today, McCarthy would have used YAML not parentheses).

This looks a little like this:

This was not designed to be compact or human readable, but rather to be trivial for a tool to read and write.

We built a scrappy tool which lets the operator specify a query that selects which entities to apply rules to (the predicate) along with enumerating the field. In the example below, you can see that we extract types, along with single values.

The predicates are simple selectors like “Instance-Of” (P31) or “Occupation” (P106) The fields have a display name, the corresponding Wikidata edge, and a type-transformation.

Here’s the final product.

Unfortunately, the normalized representation in JSON wasn’t always enough. The first time it bit us is when we wanted to include US Senators for states. Since the Wikidata graph contains directed edges, there’s not an edge from California to Alex Padilla. However, there is an edge from Senator Padilla to California via a long path that includes sub attributes on the edge itself:

Alex Padilla (position held) → United States senator → (electoral district) → California Class 3 senate seat (In territory) → California

For this case we create an“Invert” verb. Here’s the rule:

To process these rules, we first loaded the keyed JSON entity data into a system built on Scylla. Turns out that Scylla is perfect for entity joins on the fly!

For each Invert rule, we preprocess the dataset by first applying the IsInstance predicate and then recursively evaluating each edge in the EdgePath by looking up entities in Scylla. In our California example the Invert operator would process the Alex Padilla entity and emit a new(inverted) edge as <California, Alex Padilla>. This gets joined back back with the California entity to make for simple final lookups (who needs a fancy graph database?).

With our tool complete, it was time to put it to work. We used the Wikipedia page popularity to rank instance types to prioritize which rules we wrote. A small engineering team powered through using this tool to build transformations for the top 90% of entity types by popularity in just a few days.

We use PySpark to process the Wikidata (about 100GB compressed) JSON dump a row at a time. The job runs our interpreter over the specified rules we defined above to build a simple flat display representation. Using PySpark gives us good horizontal scaling and also allows us to parallelize more expensive tasks like the Invert rule above or calling other federated data providers like OpenStreetMaps.

After the job completes, we generate a diff-report with sampled popular entities and have an engineer vet the output before loading the new version of the data into serving Scylla. Using Scylla backed by flash storage for serving means that, at serve time, we can look up an entity to show in milliseconds. We also have spare time to do any last minute rendering that needs to be done (e.g. unit conversions for the query locale)!

We accomplished our goal of shipping our knowledge graph ahead of schedule. But that doesn’t mean we’re done. The world is a fast changing place. While it made sense at the time to start with the monthly Wikidata dumps, we’re now beginning to augment our KG with incremental updates. There will soon be an “Update Now” button on the KG Card for Neeva Community users to fetch the freshest versions of all of the data and publish it back to Scylla. Our Neeva Community members are a small, trusted group of real customers who have the power to affect the search results. Our customers are helping us build a better search!

The moral of this story is that there’s tons of value in spending a day or two thinking about how a scrappy tool can be used to write domain specific programs to help you get a large task done. The combination of gaining a little insight, building the right tooling, some good old elbow grease, and a scalable run-time platform allowed us to build a pretty amazing feature on time without a ton of investment.

P.S. Very recently, we learned that the Wikimedia Foundation is working on a new paid service with one of its explicit goals to help power knowledge graphs. We definitely support their efforts since we value content and the publishers who make it and provide it. We look forward to revisiting our approach when the new API is available, along with contributing back to the community.