Engineering Blog

How Quid is improving its search with Word2Vec and Wikipedia

Ben Bowles

02.14.2017

At a high level, Quid is about offering insights that answer nuanced, strategic questions.


One critical component of this involves a user’s experience when he or she types in a search query. For example, in a news search, Quid can return articles that have a keyword near another keyword, or those matching particular Boolean logic built with our own in-house operators. These are just some of the tools that help users specify their information need precisely.


Sometimes users have a generic interest, like "artificial intelligence" or "Brexit." Often the key to getting the best search results is knowing all the different ways to refer to these concepts, and knowing all related terms to the topic at hand. To help with this, we are now providing suggestions below the search bar as users start to type in a query.

In this post, we will share some tricks from our ongoing work on making these suggestions perform well across a large number of phrases, entities, and common English words, using the freely available Wikipedia corpus combined with the Word2Vec algorithm.

Search in action

With only the help of Wikipedia and Word2Vec, below are the related terms we were able to provide for queries comprising a variety of concepts, phrases, entities, and common words.

Word2Vec implementation

Word2Vec models are a well-known family of algorithms for encoding individual words into a common vector space. This common vector space can then be used to find words that occur in similar contexts, and which are also similar in meaning. Such models gained widespread attention when it was discovered that they could naturally represent analogies (king is to queen as man is to woman), which seemed to suggest the presence of human-like reasoning or semantic awareness.
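As a rough illustration of what this looks like in code, here is a minimal sketch of training such a model with the gensim library (gensim 4.x API; the tiny corpus is purely illustrative, so its neighbors will not be meaningful):

from gensim.models import Word2Vec

# A stand-in for an iterable of tokenized sentences from a real corpus.
sentences = [
    ["artificial", "intelligence", "research", "is", "advancing"],
    ["machine", "learning", "and", "artificial", "intelligence", "overlap"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the common vector space
    window=5,         # context window that defines "similar contexts"
    min_count=1,      # keep rare tokens for this toy example
    workers=4,
)

# Words that appear in similar contexts end up close together in the space.
print(model.wv.most_similar("intelligence", topn=5))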


A free pre-trained Word2Vec model can be downloaded from Google. Anyone who has played with this particular pre-trained model, or with Word2Vec models they have trained themselves, knows that the quality of results can vary tremendously across different input terms. The algorithms are also subject to many unwanted biases that make them unsuitable for a live production system in their raw form.
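For reference, the pre-trained vectors (commonly distributed as GoogleNews-vectors-negative300.bin) can be loaded and queried with gensim's KeyedVectors. This is only a usage sketch, not part of our pipeline:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Nearest neighbors of a single word in the vector space.
print(wv.most_similar("intelligence", topn=10))

# The classic analogy: king - man + woman is closest to queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))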


Further, if the goal for the algorithm is to work well on phrases, entities, or concepts that involve more than one word or token, the tokenization strategy will need to identify these phrases from the raw text before the data is fed into the Word2Vec algorithm.
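One simple way to picture this (a simplified illustration, not our production tokenizer) is to merge known multi-word phrases into single underscore-joined tokens before training, so that Word2Vec learns one vector per phrase:

# Hypothetical phrase vocabulary; in practice this comes from phrase detection.
KNOWN_PHRASES = {("artificial", "intelligence"), ("new", "york", "times")}
MAX_PHRASE_LEN = 3

def merge_phrases(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in range(MAX_PHRASE_LEN, 1, -1):  # greedy, longest match first
            chunk = tuple(tokens[i:i + n])
            if len(chunk) == n and chunk in KNOWN_PHRASES:
                out.append("_".join(chunk))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases(["the", "new", "york", "times", "covers", "artificial", "intelligence"]))
# ['the', 'new_york_times', 'covers', 'artificial_intelligence']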


Indeed, improving the quality of a Word2Vec model comes down to a few simple things that can be surprisingly difficult in practice: how to tokenize the raw text you feed to the model, how to identify phrases that carry meaning a human would recognize, and how to find exactly the right dataset to capture your problem domain.


Given that entities like "Hillary Clinton" or "The New York Times" may be referred to in multiple different ways, we also need some form of entity disambiguation. It is highly desirable to have only one vector for each entity, and we definitely do not want a separate vector for every spelling permutation or word rearrangement of a particular entity. In addition, when someone enters a particular permutation of an entity, we should be able to disambiguate it and offer other similar entities, phrases, or words.

This also allows us to ensure that when Bill Clinton is entered as an input, we don’t get five different spellings or permutations of George W. Bush as output.
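Conceptually, the disambiguation step can be sketched like this (the lookup table and helper names are hypothetical, for illustration only): every surface form maps to a single canonical entity, and suggestions that collapse to the same canonical entity are deduplicated.

# Hypothetical mapping from surface forms to canonical entity tokens.
SURFACE_TO_CANONICAL = {
    "bill clinton": "Bill_Clinton",
    "william jefferson clinton": "Bill_Clinton",
    "george w bush": "George_W._Bush",
    "g. w. bush": "George_W._Bush",
}

def canonicalize(term):
    return SURFACE_TO_CANONICAL.get(term.lower().strip(), term)

def dedupe_suggestions(scored_terms):
    # Keep only the first (highest-scoring) hit for each canonical entity.
    seen, unique = set(), []
    for term, score in scored_terms:
        canonical = canonicalize(term)
        if canonical not in seen:
            seen.add(canonical)
            unique.append((canonical, score))
    return unique

print(dedupe_suggestions([("George W Bush", 0.81), ("G. W. Bush", 0.79), ("Al_Gore", 0.70)]))
# [('George_W._Bush', 0.81), ('Al_Gore', 0.7)]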

Phrase and entity detection

To solve the phrase-detection piece of the puzzle, we first used the phrase-detection algorithm popularized by Google and implemented as part of the gensim Python package. But it was immediately obvious that the algorithm was not powerful enough in isolation.
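For context, the gensim phrases model is applied roughly like this (gensim 4.x API; tokenized_sentences stands in for the tokenized corpus):

from gensim.models.phrases import Phrases

# min_count and threshold control how aggressively co-occurring pairs are merged.
phrase_model = Phrases(tokenized_sentences, min_count=5, threshold=10.0)

# Frequently co-occurring pairs are joined into single underscore-separated tokens.
print(phrase_model[["the", "new", "york", "times", "reported", "on", "brexit"]])
# e.g. ['the', 'new_york', 'times', 'reported', 'on', 'brexit']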


The algorithm essentially looks at how frequently words co-occur across a large corpus to decide which sets of co-occurring words correspond to phrases. But there are problems with this simple approach. For example, we might like to detect less well-known entities that occur only a relatively small number of times in the training data as a whole.
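To make that concrete, the default scoring in this family of phrase detectors (following Mikolov et al., and gensim's default scorer) boils down to something like the function below; a pair of words is promoted to a phrase only when its score exceeds a threshold.

def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Co-occurrence count, discounted by min_count, relative to how common
    # the individual words are; scaled by the vocabulary size.
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# A pair (a, b) is merged into a phrase when phrase_score(...) > threshold,
# which is why rarely mentioned entities can fall below the cutoff.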


If it’s a less famous person we are trying to recognize, for example, the phrases model might decide that the first and last name don't co-occur frequently enough and decline to label them as an entity. We could experiment with different parameters for the phrases model, but the best parameters might differ across categories of entities, which would make that approach untenable. What we needed was a better, smarter phrase-detection model.


To make a better phrase-detection model, we exploited the page-to-page hyperlinking structure in the Wikipedia database to perform an extremely reliable form of named entity detection, similar to what has been performed by previous researchers.


Essentially, during the tokenization process for each Wikipedia page, we isolated links to other Wikipedia pages, replacing the text containing the hyperlink with the canonical title of the linked page. This means that regardless of whether the link to the George W. Bush page reads "George", "George W", etc., we always replace the reference with the canonical title of the corresponding Wikipedia page. This approach works nicely because Wikipedia is already an excellent corpus for our purpose, given that it is generally high quality compared to, say, a Common Crawl or news dataset.
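A simplified sketch of that link-replacement step over raw wikitext (a regex-based illustration, not our exact implementation):

import re

# Wiki links look like [[Canonical title]] or [[Canonical title|anchor text]];
# both are replaced with a single token built from the canonical title.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def replace_links(wikitext):
    return WIKILINK.sub(lambda m: m.group(1).strip().replace(" ", "_"), wikitext)

print(replace_links("In 2001, [[George W. Bush|Bush]] addressed [[Congress of the United States|Congress]]."))
# In 2001, George_W._Bush addressed Congress_of_the_United_States.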


An interesting problem concerns the use of redirects as hyperlinks. Often when someone links to George W. Bush's page, the link actually points to a redirect page rather than the page's canonical title. For example, this hyperlink [https://en.wikipedia.org/wiki/G.%20W.%20Bush] and this one [https://en.wikipedia.org/wiki/George_W._Bush] land on the same page.


To handle all these cases appropriately, and also to disambiguate them, we needed to scrape the redirects for more than 5 million Wikipedia pages and compile them into an extremely fast lookup table. The redirects for the George W. Bush Wikipedia page can be found here.
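A sketch of what that lookup can look like, assuming the (source title, target title) redirect pairs have already been extracted from the Wikipedia dumps; the sample rows below are hypothetical:

def build_redirect_table(redirect_pairs):
    table = dict(redirect_pairs)
    # Resolve chains of redirects (A -> B -> C) so every lookup is one hop.
    resolved = {}
    for source in table:
        target, seen = table[source], {source}
        while target in table and target not in seen:
            seen.add(target)
            target = table[target]
        resolved[source] = target
    return resolved

REDIRECTS = build_redirect_table([
    ("G. W. Bush", "George W. Bush"),   # hypothetical sample rows
    ("Bush Jr.", "George W. Bush"),
])

def canonical_title(title):
    return REDIRECTS.get(title, title)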

Wikipedia and Word2Vec: a perfect pair

The Wikipedia corpus is of very high quality, and the Word2Vec algorithm is extremely powerful. Together, they can provide the perfect solution for a related terms model that includes both words and entities.


Another solution might be to build TF-IDF (term frequency–inverse document frequency) vectors for Wikipedia entities specifically, and then return similar entities in response to any given entity input. But, critically, this would only create vectors for entities that exist in Wikipedia. Because our Word2Vec model places words, phrases, and entities in the same vector space, we are powerfully equipped to handle suggestions that mix all three, in addition to cases involving entities or words in isolation.
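As a usage sketch (illustrative only; the neighbors returned are not real output), a single query against the combined model can mix plain words and canonical entity tokens:

# model is a Word2Vec model trained on the link-annotated Wikipedia corpus,
# so entity tokens like "Brexit" live in the same space as ordinary words.
suggestions = model.wv.most_similar(
    positive=["Brexit", "referendum"],  # an entity token plus a plain word
    topn=10,
)
for term, score in suggestions:
    print(term, round(score, 3))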
