One aspect of the Quid product that differentiates it from other platforms is its incredibly diverse set of company descriptions.
When you start a search in Quid, you can choose a dataset -- Companies -- that includes some 50,000 high-growth companies with venture capital funding since 2012, along with a broader global dataset of over 1.4 million companies. Each company description is a piece of text describing that organization’s product offerings, along with other relevant information. Some of this text may come directly from the company’s website; other parts come from online sources such as news, patents, or Wikipedia.
A priority for Quid is identifying company descriptions of poor quality and improving them before they even get into our index. With such a large number of descriptions, it becomes imperative for us to actively monitor their quality. But measuring the quality of company descriptions is difficult because many factors determine whether a description passes muster. A short description, for example, might only be acceptable if there simply isn't any more information available, since the likelihood of misrepresenting a company is higher with little text.
There are also a number of reasons why the text itself may not be informative. One significant factor that hurts company descriptions is marketing language, i.e., generic text meant to evoke some feeling of excitement about the company that actually conveys relatively little information.
Here are some examples:
- "Our patent-pending support system is engineered and designed to bring comfort and style".
- "Launch quickly, with configuration in as little as a week."
- "Provides solutions to maximize digital customer experiences."
Contrast this with examples of informative sentences:
- "It has developed lens-less imaging optic, the Natural Eye Optic (N.E.O). The N.E.O. replicates the human vision system."
- "Spatial Scanning software for mobile devices."
- "It offers meals, such as chicken fingers, and macaroni and cheese."
At Quid, we set a strategic goal to use machine learning to identify marketing language that conveys very little information. We combined this goal with recent advances in deep learning and a bit of an experiment: to see whether deep learning really requires "big data," or whether small data can suffice.
Machine learning with 'small data'
In this post we will explore the machine learning problem of identifying "bad" text in company descriptions. For various reasons, it was difficult to develop a large set of training examples, so the project became a fun exercise in how far one can push modern deep learning solutions with small data and a scrappy attitude.
The task at hand was to develop a classifier that outputs the likelihood of any given sentence being "bad" or "good." For this, we collected a few hundred examples of good and bad sentences and explored several approaches to building a classifier that discriminates between them.
We compared three approaches:
1. a baseline logistic regression classifier based on TFIDF (using tokens)
2. an ensemble model combining the predictions of several different types of classifiers over manually engineered features: TFIDF over word and character n-grams, plus a large number of hand-crafted features based on word2vec, punctuation, digits, and word frequency
3. a one-dimensional convolutional neural network (CNN) at the word level, similar to the model reported by Kim (2014)
The dominant mantra is that deep learning requires "big data," yet a few cutting-edge studies suggest one may not even need large datasets for some problems. It seemed useful (and fun!) to see how far we could get with deep learning versus a more typical feature-engineering approach on such a tiny dataset.
Model 1 (baseline)
Model 1 uses TFIDF features with a regularized logistic regression model over multiple n-gram features, with feature selection. This model's representational capacity is very limited, but we also have a very small amount of data, so there is perhaps an argument to be made that a more complex model would just be overfitting. Finally, there is a real advantage to this model: the features, each corresponding to a particular n-gram, are interpretable. This makes it feasible to find justification for each individual decision the model makes (see the open-source package LIME, https://github.com/marcotcr/lime, for a relevant tool).
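A minimal sketch of this baseline in scikit-learn might look like the following. The toy sentences, labels, and hyperparameters (n-gram range, number of selected features, regularization strength) are all illustrative assumptions, not the values used in the actual Quid model.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Toy labeled sentences: 1 = "bad" (marketing fluff), 0 = "good" (informative).
sentences = [
    "Our solutions maximize digital customer experiences.",
    "Engineered and designed to bring comfort and style.",
    "It offers meals, such as chicken fingers, and macaroni and cheese.",
    "Spatial scanning software for mobile devices.",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    # Word unigrams and bigrams, weighted by TFIDF.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # Keep only the most class-discriminative n-grams.
    ("select", SelectKBest(chi2, k=10)),
    # L2-regularized logistic regression (C controls regularization strength).
    ("clf", LogisticRegression(C=1.0)),
])
model.fit(sentences, labels)

# Probability that a new sentence is "bad" marketing language.
p_bad = model.predict_proba(["Launch quickly and delight your customers."])[0, 1]
```

Because each retained feature is a specific n-gram with a single logistic regression weight, inspecting the fitted coefficients directly shows which phrases push a sentence toward "bad" or "good."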
Model 2 (feature-engineering ensemble)
Model 2 uses hand-crafted features and a predictive modeling framework with much greater representational capacity. Because we are now ensembling multiple, powerful non-linear classifiers, we have a much greater opportunity to make better predictions. At the same time, the major downside is that many of the features are less interpretable, particularly the features derived from averaged word embeddings (using a pre-trained word2vec model).
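The ensembling step can be sketched with scikit-learn's soft-voting ensemble, which averages the predicted probabilities of several classifiers. The feature matrix here is random placeholder data standing in for the hand-crafted features; the specific base classifiers are an assumption for illustration, not necessarily the ones used in the actual model.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder for the hand-crafted feature matrix: 200 sentences, 10 features
# each (TFIDF stats, embedding averages, punctuation counts, etc.).
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities rather than hard votes
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:5])  # one (P(good), P(bad)) row per sentence
```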
Quite a few of the hand-designed features were based on the distribution of token-level variables in a sentence. For word frequency, for example, the model receives the mean, median, standard deviation, and so on, of the word-frequency distribution within a sentence. From these, it can learn whether bad or good sentences tend to be characterized by high or low mean word frequency, by particularly variable word frequency, or by some combination.
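One such feature group might be computed as follows. The word-frequency table here is a tiny hypothetical stand-in for real corpus frequencies, and `frequency_features` is an illustrative helper, not a function from our codebase.

```python
import numpy as np

# Hypothetical corpus word frequencies (counts per million; illustrative only).
word_freq = {"the": 50000, "and": 48000, "provides": 800, "solutions": 600,
             "macaroni": 5, "gyroscope": 3}

def frequency_features(sentence, freq=word_freq, default=1):
    """Summary statistics of per-word corpus frequency for one sentence."""
    freqs = np.array([freq.get(w, default) for w in sentence.lower().split()],
                     dtype=float)
    return {
        "mean_freq": freqs.mean(),      # high mean -> mostly common words
        "median_freq": np.median(freqs),
        "std_freq": freqs.std(),        # high std -> mix of rare and common words
    }

feats = frequency_features("Provides solutions and the gyroscope")
```

The intuition is that fluffy marketing sentences lean on common, generic vocabulary, while informative sentences contain rarer, more specific terms, and these distributional summaries let the classifiers pick up on that.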
Model 3 (convolutional neural network)
Whereas Model 2 required manual feature engineering, Model 3 uses a deep learning approach: one-dimensional convolutional neural networks (CNNs) with pre-trained word embeddings. In the domain of 2D convolutional neural nets and image processing, there is a notion that "filters," or "feature detectors," come to represent the presence or absence of particular visual patterns found anywhere in the image. For 1D convolutional neural networks on text, we instead slide a 1D window of some length (e.g., 1, 2, or 3 words) over sequences of words. Here, each filter can be thought of as a particularly flexible detector for the presence of particular n-grams. This isn't strictly akin to a count- or TFIDF-based n-gram approach, because a single filter might be activated strongly by multiple n-grams. It is also much more memory-efficient: we aren't representing every single n-gram explicitly in our model, and we don't need a separate feature selection step where uninformative n-grams are discarded. If they aren't useful, the model won't use them.
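The sliding-filter idea can be sketched in plain numpy: a single filter is a small weight matrix that is dotted against every n-word window of a sentence's embedding matrix, and max-over-time pooling keeps only the strongest response. All shapes here (embedding dimension, filter width, sentence length) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim = 8      # dimensionality of each word vector
filter_width = 2   # this filter spans bigrams
sentence_len = 6   # number of words in the (already embedded) sentence

# A sentence as a (length, embed_dim) matrix of word embeddings.
sentence = rng.normal(size=(sentence_len, embed_dim))
# One filter: a (filter_width, embed_dim) weight matrix.
filt = rng.normal(size=(filter_width, embed_dim))

# Slide the filter over every bigram window; each position yields one activation.
activations = np.array([
    np.sum(sentence[i:i + filter_width] * filt)
    for i in range(sentence_len - filter_width + 1)
])

# Max-over-time pooling: keep the strongest response anywhere in the sentence.
pooled = activations.max()
```

Any bigram whose embeddings align well with the filter's weights produces a large activation, which is why one filter can respond to many related n-grams rather than memorizing a single one.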
There are a few important things we can tweak in this approach: the number of filters, and the sizes the filters can be (i.e., how many words each window spans). We can also insert pre-trained word embeddings directly into the network, as we do here. As nicely described in the Keras blog (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html), this allows us to inject outside information about the semantic meaning of words into the model.
We used the helpful high-level deep learning package keras to design a fairly straightforward architecture with three filter sizes (each size roughly corresponding to an n-gram length).
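A hedged sketch of such an architecture, using the tf.keras functional API: the vocabulary size, embedding dimension, sequence length, number of filters, and the filter sizes (3, 4, 5 -- a common choice following Kim, 2014) are all assumptions for illustration, not the exact hyperparameters of our model.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, max_len = 5000, 100, 50  # illustrative hyperparameters

inp = layers.Input(shape=(max_len,), dtype="int32")
# In practice this embedding layer would be initialized from pre-trained
# word2vec vectors (and optionally frozen during training).
emb = layers.Embedding(vocab_size, embed_dim)(inp)

# One Conv1D branch per filter size; each branch detects n-gram-like patterns
# of that width, then max-over-time pooling keeps the strongest activation.
branches = []
for size in (3, 4, 5):
    conv = layers.Conv1D(filters=64, kernel_size=size, activation="relu")(emb)
    branches.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(branches)
merged = layers.Dropout(0.5)(merged)
out = layers.Dense(1, activation="sigmoid")(merged)  # P(sentence is "bad")

model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```

The sigmoid output gives the probability-like "badness" score per sentence directly, which is exactly what we want to use downstream rather than a hard label.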
Inspiration for the implementation can be taken from Kim (2014); see the figure in that paper for a basic illustration of what’s going on.
We then looked at model performance for the three different models by training on 2/3 of the data and testing on the remaining 1/3.
We chose to look at ROC (Receiver Operating Characteristic) curves along with the AUC (Area Under the Curve) metric. The F1 score requires setting some arbitrary threshold (0.5 by default) and then assigning test examples to one class or the other based on the model's probability that a given item is bad or good. Since we intended to use the probabilities directly as a "score" for each sentence, we had no need to make a definite prediction for each sentence using this strategy.
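The threshold-free nature of AUC is easy to see with scikit-learn: it scores how well the model ranks sentences, not any particular hard cutoff. The labels and scores below are made-up toy values.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical held-out labels (1 = "bad") and model probability scores.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.7, 0.3, 0.2, 0.4, 0.6]

# AUC = probability that a random "bad" sentence is scored above a random
# "good" one; no threshold needs to be chosen.
auc = roc_auc_score(y_true, y_score)

# The ROC curve itself sweeps over every possible threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```

Here 8 of the 9 bad/good pairs are ranked correctly, so the AUC comes out to 8/9.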
One main promise of deep learning is that we can drop the requirement to hand-craft features and let the high-level features we would otherwise design come to be represented naturally in the model during training. The idea is that, by designing the neural network architecture in the right way, the model naturally learns to represent "bits" of information at increasingly high levels over the course of training.
The interesting take-home is that we managed to achieve nearly the same level of performance with a deep learning approach as we did with an approach that relied on fancy manual feature engineering, even for very small data. This to some extent drives home the important point that one does not need "big data" to benefit from fairly complex neural network architectures.
To understand the model's performance, it’s always useful to look at where the model makes its most egregious mispredictions (here we look in detail at the CNN model).
To start, let’s look at some sentences the model predicted to have a high likelihood of being good, but that were actually labeled bad:
- “Download Vstory Free App from the App Store or Google Play Store.”
- “The firm also keeps its clients current by delivering a true SaaS service with all clients on the same version with each update of the platform.”
The former is easy to understand: the model wasn’t given enough data to know that many products can be downloaded in some way, so this piece of information carries little or no differentiating information. The second is similar in that many companies offer SaaS services, but the sentence also makes relatively little sense, and perceiving that requires more intelligence (whether artificial or human) than the model really has.
Let’s look at an example of where the model predicted a high likelihood of being bad, but the sentence was actually labeled good:
- "Its product has a headset, natural eye optics, OLEDs, a camera, a microphone, an integrated inertial measuring unit with accelerometer/magnetometer and gyroscope, an audio jack, 60\xc2\xb0 field of view, optical see-through, software, and organic user interface/user experience, and a wrist controller; tethered computing modules, such as CPU/GPU, Wi-Fi, Bluetooth, HDMI, USB, and battery pack module; and accessories, such as protective case and headphones."
Interestingly, this sentence was extremely long (3.46 standard deviations above the mean), which may have thrown off the model. A downfall of CNNs for text is that, unlike images, the input sequences vary in size (i.e., sentences have different lengths), which means most text inputs must be “padded” with some number of 0’s so that all inputs are the same size. A particularly long sentence will therefore have many fewer zeros than most sentences, and the model may act unpredictably in this context.
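The padding step looks like this in a minimal sketch (real pipelines would typically use a library utility such as keras's pad_sequences; `pad_sequences_simple` here is an illustrative stand-in):

```python
def pad_sequences_simple(seqs, max_len):
    """Truncate or right-pad token-id sequences to a fixed length with 0s."""
    padded = []
    for seq in seqs:
        seq = seq[:max_len]                       # truncate overly long sentences
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded

# A short sentence gets trailing zeros; a long one fills (or exceeds) max_len.
batch = pad_sequences_simple([[4, 8, 15], [16, 23, 42, 7, 9]], max_len=4)
```

Since most training sentences carry a tail of zeros, a near-full-length input sits outside the input distribution the model saw during training, which is one plausible reason it misfires on very long sentences.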
Of course, there is much more we could do here: data augmentation, collecting more data, fine-tuning regularization parameters, maybe even trying character-based CNNs or some type of recurrent neural network.
The take-away here, though, is that you can do deep learning with a very small number of training examples and still get tangible benefits in model performance and representational efficiency over manual feature engineering. At test time, deep learning can also be cheaper: it's often computationally faster to do a bunch of matrix multiplies than to compute features from scratch for each example.
The great thing about the current wave of deep learning: you don't have to be an expert to create extremely powerful models with high-level packages like keras. It also seems clear that the basics of deep learning will increasingly be used across all technology, even for startups where large, clean-labeled datasets are in short supply.
See more Quid engineering projects here.