Handling Rare Words in Machine Learning Models

The Problem

Large-scale natural language processing (NLP) problems almost always involve a large corpus, which in turn contains an effectively unbounded number of unique words. Here, a word is defined as a string of characters separated by spaces, punctuation, or other meaningful boundaries.

This problem is also ubiquitous at reddit, my current employer, where we deal with Internet-scale, user-generated text. In fact, we assume that every NLP model at reddit will encounter rare words at some point, no matter how big the vocabulary is.

1. A large vocabulary can reasonably cover the representation space

The idea is rather simple. We build a reasonably large vocabulary (say, up to 10 million words) based on word usage frequency, and discard words outside the vocabulary during training and inference. For instance, if `covfefe` is not a word in the vocabulary, we simply ignore it. The vocabulary can be rebuilt periodically to capture language evolution.
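
As a rough illustration, here is a minimal Python sketch of this strategy: count word frequencies, keep only the most common words, and silently drop everything else. The toy corpus, the whitespace tokenizer, and the cutoff are placeholders, not the actual pipeline.

```python
from collections import Counter

def build_vocabulary(corpus, max_size=10_000_000):
    """Count word frequencies and keep only the most common words."""
    counts = Counter(word for document in corpus for word in document.split())
    return {word for word, _ in counts.most_common(max_size)}

def filter_oov(tokens, vocabulary):
    """Strategy 1: silently drop any word outside the vocabulary."""
    return [word for word in tokens if word in vocabulary]

corpus = ["the cake is a lie", "the covfefe is strong"]
vocab = build_vocabulary(corpus, max_size=5)
# 'covfefe' falls outside the toy vocabulary and is simply ignored
print(filter_oov("the covfefe is a lie".split(), vocab))
```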

This strategy is sufficient for applications where the useful signals are not carried by rare words. From the representation learning perspective, we assume rare words do not affect the learned representations. Sentiment analysis is a good example: people tend to express sentiment using a known set of words and structures, and rare words rarely change the sentiment of a whole sentence or paragraph.

There is a variant of this method where a special token, UNK, is used to denote OOV (out-of-vocabulary) words. This is usually only useful when the distribution of rare words plays a role in the model. For instance, in a Bayesian model, it might be useful to explicitly model the words that fall outside the vocabulary.
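
A hedged sketch of the UNK variant, assuming the usual integer-id lookup that most models use: every out-of-vocabulary word collapses onto a single reserved id instead of being dropped.

```python
UNK = "<unk>"

def build_word_ids(vocabulary):
    """Reserve id 0 for UNK, then assign ids to in-vocabulary words."""
    word_ids = {UNK: 0}
    for word in sorted(vocabulary):
        word_ids[word] = len(word_ids)
    return word_ids

def encode(tokens, word_ids):
    """Map every out-of-vocabulary word to the shared UNK id."""
    return [word_ids.get(word, word_ids[UNK]) for word in tokens]

word_ids = build_word_ids({"the", "cake", "is", "a", "lie"})
print(encode("the covfefe is a lie".split(), word_ids))
```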

2. Hashing words to a finite range of integers

A one-way hashing function maps words like “schadenfreude” to integers like 845432.

The idea is to hash every word into a fixed range of integers and use the hashed value as the feature, so even rare or unseen words get a representation. Collisions are inevitable, where unrelated words map to the same integer, but if the range is large enough they are rare, and the model can learn to tolerate the resulting noise. The above argument, of course, is a probabilistic one. No machine learning model is perfect at capturing all the information in its features, and neither is the hashing scheme. Admittedly, the “how I learned to stop worrying and love the collision” mindset is not easy to grok. The following two papers provide detailed analyses of the hashing trick.
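
A minimal sketch of the hashing trick, assuming Python's built-in hashlib; any stable hash function would do, and the bucket count of 2^20 is an arbitrary illustrative choice.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # size of the hashed feature space; a tunable trade-off

def hash_word(word, num_buckets=NUM_BUCKETS):
    """Map any word, however rare, to a stable integer in [0, num_buckets)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(hash_word("schadenfreude"))   # always the same bucket for this word
print(hash_word("covfefe"))         # rare words get a bucket too
# Unrelated words occasionally collide into one bucket; with a large enough
# num_buckets this happens rarely, and the model can absorb the noise.
```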

3. Character and subword models

The idea is to split words into characters or subword units. The cardinality of characters or subword units is low (roughly 100 printable characters for English and roughly 200 for Latin-script languages). A natural extension is to operate on the raw bytes of the Unicode encoding, with a cardinality of 256 (2⁸ per byte). Subword models merge frequently co-occurring character sequences into single units. For English, sequences like “er” and “co” can reduce model complexity without blowing up the vocabulary.
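
For illustration, here is a small sketch of the three building blocks: characters, UTF-8 bytes, and a toy greedy subword segmentation. The hand-picked subword inventory is purely hypothetical; real systems learn it from data (for example with byte-pair encoding).

```python
def to_characters(word):
    """Character-level view: cardinality ~100 printable chars for English."""
    return list(word)

def to_bytes(word):
    """Byte-level view: cardinality fixed at 256, works for any Unicode text."""
    return list(word.encode("utf-8"))

def to_subwords(word, subword_vocab):
    """Greedy longest-match segmentation against a (learned) subword inventory."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

toy_vocab = {"frequ", "ent", "er", "co"}   # hypothetical, hand-picked units
print(to_characters("frequenter"))
print(to_bytes("schadenfreude"))
print(to_subwords("frequenter", toy_vocab))
```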

This approach may be surprising to traditional ML practitioners. How would character and subword features even work for tasks like text classification? At the end of the day, a word is not a simple combination of its character or subword features. Indeed, models with more capacity, such as deep neural networks, are required to work with these features. The idea is to use a deep neural network (with sizable layers and parameters) to learn the correct representation of a word from its spelling. The idea has been successfully applied in machine translation. The following two papers illustrate the idea.

Another closely related technique in the context of deep learning is to use a convolutional neural network (as opposed to an LSTM) to effectively read words as character n-grams (with padding). A few notable papers:
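
A hedged sketch of a character-CNN word encoder, assuming PyTorch; the layer sizes, byte-level character inventory, and padding length are illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Embed each character (byte), run a 1-D convolution over the sequence,
    and max-pool to a fixed-size word vector (a common char-CNN recipe)."""

    def __init__(self, num_chars=256, char_dim=16, num_filters=64, kernel_size=3):
        super().__init__()
        self.char_embedding = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, max_word_length) integer byte ids, 0 = padding
        x = self.char_embedding(char_ids)        # (batch, length, char_dim)
        x = x.transpose(1, 2)                    # (batch, char_dim, length)
        x = torch.relu(self.conv(x))             # (batch, num_filters, length)
        return x.max(dim=2).values               # (batch, num_filters) word vector

def word_to_byte_ids(word, max_len=16):
    """Pad/truncate the UTF-8 bytes of a word to a fixed length (0 = padding)."""
    ids = list(word.encode("utf-8"))[:max_len]
    return ids + [0] * (max_len - len(ids))

encoder = CharCNNWordEncoder()
batch = torch.tensor([word_to_byte_ids("frequenter"), word_to_byte_ids("covfefe")])
print(encoder(batch).shape)   # torch.Size([2, 64])
```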

4. Char n-gram

The assumption here is that we can represent a (rare) word by its substrings, provided the substrings are reasonably long. While the character- and subword-level models above use a deep neural network to map building blocks to word representations, a longer sequence of characters may already capture enough information about a word on its own. For instance, the character sequence frequ is distinctive enough to capture the meaning of words derived from frequent or frequency, so taking it as a feature gives us a way to represent other derived, rare words, such as frequenter.
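
A minimal sketch of character n-gram features, in the spirit of fastText-style subword features; the boundary markers and the n-gram range of 3 to 5 are common choices, not requirements.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Extract character n-grams, with '<' and '>' marking word boundaries."""
    marked = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# A rare derived word shares many n-grams (e.g. 'freq', 'requ') with its
# frequent relatives, so it can inherit a useful representation from them.
print(sorted(char_ngrams("frequenter") & char_ngrams("frequent")))
```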

A caveat is that this trick works for languages with natural “stems”. The technique may not work for languages such as Japanese and Chinese, where there is no clear notion of a character n-gram within a word, or where a word cannot be approximated by its substrings.

Tying everything discussed together, here I briefly describe how AI.codes’ system understands the meaning of classes and methods when doing code auto-completion.

For the first version, I picked Strategy 1: I constructed a huge vocabulary of class and method names, covering the top 1,000 popular Java frameworks on GitHub. It has about 10M unique words. A language model over this vocabulary is reasonable, but the model has no idea about unknown words.

I did not use the hashing trick, because I do care about the meaning of rare words when doing code completion. Instead, I picked Strategy 3, where I trained a Siamese network to map words to their embeddings. The beautiful thing about this approach is that the network picked up a bunch of useful n-grams that are highly indicative when predicting code. For instance, the CNN picked up Builder$ and build( and mapped them very close together, indicating that when you see a Builder class in Java, you’d likely autocomplete it with a build() invocation.
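
Purely as an illustrative sketch, and not the actual AI.codes model, here is the shape of such a Siamese setup in PyTorch: both identifiers pass through one shared character-level encoder (a simple byte-averaging encoder stands in here; the post describes a CNN), and a cosine-embedding loss pulls related pairs like Builder$ and build( together while pushing unrelated pairs apart.

```python
import torch
import torch.nn as nn

class IdentifierEncoder(nn.Module):
    """Shared encoder: embed the bytes of an identifier and average-pool them."""
    def __init__(self, num_bytes=256, dim=64):
        super().__init__()
        self.embedding = nn.Embedding(num_bytes, dim, padding_idx=0)

    def forward(self, byte_ids):
        mask = (byte_ids != 0).unsqueeze(-1).float()
        embedded = self.embedding(byte_ids) * mask
        return embedded.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def encode_identifier(name, max_len=24):
    """Pad/truncate the UTF-8 bytes of an identifier (0 = padding)."""
    ids = list(name.encode("utf-8"))[:max_len]
    return ids + [0] * (max_len - len(ids))

encoder = IdentifierEncoder()                      # one set of weights...
loss_fn = nn.CosineEmbeddingLoss()                 # ...shared across both branches

left = torch.tensor([encode_identifier("Builder$")])
right = torch.tensor([encode_identifier("build(")])
target = torch.tensor([1.0])                       # 1 = related pair, -1 = unrelated

loss = loss_fn(encoder(left), encoder(right), target)
loss.backward()                                    # gradients flow into the shared encoder
```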
