
Product-first Research

Ilya Eckstein — Mon, 26 Sep 2016 00:00:00 GMT

I work at (and on) Robin Labs, where I have the good fortune to be developing our Robin assistant and to see it converse with - and fulfill tasks for - some 2 million users. This work has been incredibly rewarding and it feels important: I strongly believe that, by making technology accessible via a more natural interface, we can help people remain more human. In the process, I’ve come to appreciate the great power of the conversational agent medium and the extent to which users are hungry for technology that can understand them, naturally. But for all the potential, and despite the recent advances in machine learning, our progress in this field is still hampered by the relative crudeness of our tools. It is obvious that Natural Language Understanding is in its infancy and there is much work to be done. Our goal is, therefore, to keep making impactful contributions to the field, bridging the Human-Machine gap.

The way to advance is through research, but I believe that Language and Dialogue call specifically for product-first research, i.e., rapid iteration of hypotheses in a real product setting, with live users. To product managers, that may sound like a risky proposition, but without risk, there is no innovation.

This product-first perspective makes research priorities really clear. We’ve come to have some decent tools for text classification, but these are not nearly sufficient to create agents that can communicate intelligently. Here are some areas that are particularly in need of disruption right now:

Better tools to benchmark conversational agents. Quality of dialogue is notoriously difficult to measure, and the Turing test is deeply flawed. The lack of adequate tools is hampering the entire field’s progress. One idea is to offer an open “chatbot playground” to the community, where agents are exposed to live users and can get relative scores in terms of retention and engagement. See a more detailed proposal here.

Better dialogue models. Strong language models are necessary but not sufficient for the creation of powerful dialogue models (intuition: dialogue is a protocol to communicate information, so it’s not just a statistical problem). For instance, sequence-to-sequence networks have been shown to work well in machine translation but not in dialogue. In my view, a more promising approach is to learn higher-level, language-independent discourse models ^[1] ^[2] (again, from real interactions with real users), which can also be combined with language models in differentiable end-to-end architectures. Specifically, since some discriminative discourse models are already available, we may be able to train rich generative dialogue models by using GANs (Generative Adversarial Networks) and/or Reinforcement Learning.

Reinforcement Learning (RL) of dialogue. Rather than learning from prior conversations, a potentially more potent paradigm is to learn while conversing with users. Beyond computational aspects, this approach also requires devising a user experience where users are incentivized to make the agent understand them, and the agent is rewarded when it is successful. In other words, dialogue RL involves machine learning and product challenges that may seem quite daunting. Still, it is well worth the effort: imagine talking to a bot that actually becomes smarter in the course of - and thanks to - the conversation! In summary, RL has the potential to radically transform language & dialogue learning with the users’ help, while keeping them motivated and rewarded in the process!

Now that the goals are clear, let’s get to work!

1. Stolcke, Andreas; Ries, Klaus; Coccaro, Noah; Shriberg, Elizabeth; Bates, Rebecca; Jurafsky, Daniel; Taylor, Paul; Martin, Rachel; et al. (2000), "Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech" (PDF), Computational Linguistics, 26 (3): 339, doi:10.1162/089120100561737

2. Galitsky, Boris A., and Sergei O. Kuznetsov. "Learning communicative actions of conflicting human agents." Journal of Experimental & Theoretical Artificial Intelligence 20.4 (2008): 277-317

Natural language understanding: How deep is too deep?

Ilya Eckstein — Mon, 12 Sep 2016 00:00:00 GMT

TL;DR Is Deep Learning always the best tool for Natural Language Understanding tasks? Not necessarily!

Recently, a paper from Facebook AI Research (FAIR) appeared on arXiv, under the intriguing title Bag of Tricks for Efficient Text Classification ^[1], promptly catching the NLP community’s attention. Even more intriguingly, FAIR soon followed up with an open source implementation a.k.a. fastText ^[2]. In this work (essentially a simple modification of the unsupervised Word2Vec algorithm to deal with supervised learning tasks), the authors made a convincing case for the frequent superiority of shallow networks - as opposed to deep ones - for common text understanding tasks such as sentence classification, dispelling the myth that "deeper is always better". Furthermore, this paper is but one example in the recent slew of results casting the "silver bullet" efficacy of complex neural architectures into question for text-related tasks (see some examples here ^[3], here ^[4] and here ^[5]).

Surprised? After all, isn’t Deep Learning a new disruptive force in AI, shown beyond doubt to be clearly superior to prior "shallow" learning approaches? Well, it depends on who you ask. Ask a Computer Vision or a Speech Recognition expert, and you’ll get an enthusiastic Yes! In Computer Vision, novel DL architectures (such as VGG, GoogLeNet, Inception, etc.) have delivered extremely impressive results on benchmarks such as ImageNet and CIFAR, even defying expectations of some DL champions ^[6]. In Speech Recognition, commercial heavyweights such as Google and Baidu have long since switched to DL architectures. It should only be natural to expect, then, to see the same trend in NLP/NLU, shouldn’t it?

On the face of it, yes - in a way. The tide of enthusiasm in Deep Learning has of course spilled over to the NLU community, triggering a massive conversion of both academics and industry practitioners to the newfound DL religion. Impressive results from other fields, helped by the success of the seminal Word2Vec (followed by GloVe and the like), were too much to resist. RNNs and LSTMs have since become mainstream techniques, offered by popular DL libraries such as TensorFlow, Keras, DL4J, etc. Among other big companies, Google has been at the forefront of both open-sourcing DL techniques (with TensorFlow) and adopting DL architectures in production (in the NLU domain, SmartReply is one recent example).

So - you ask - isn’t that enough? Where do I sign up?! Well, dear friend: if you are reading this, chances are, you are not Google and you likely have neither similarly massive amounts of training data nor their virtually unlimited computational resources. For the rest of us, it is important to understand the performance/computation tradeoffs that come with DL - which is what this post is really about. So let us look at some key problems one by one. In this overly high-level post, we will focus on text classification and a bit of language modeling.

Text Classification

Clearly, text classification is a common task with plenty of applications: search, user intent determination, sentiment analysis, topic modeling (slightly different, but close), sequence labeling (related), etc. For text classification, we typically have three options:

1. "Good old" classifiers such as Random Forests (RF) or Logistic Regression (LR). In this case, we’ll typically use a Bag of Words (BoW) representation such as TF-IDF.
- Pros: well understood models, relatively intuitive features you can control, speed.
- Cons: the features are "semantically blind": we’ll miss any words or synonyms that are not in the training set. Also, to push performance, manual feature engineering may be required.
2. Shallow Neural Networks (NNs) such as Facebook’s fastText.
- Pros: still fast, no feature engineering, possibly inherits Word2Vec’s nice semantic properties (more on that below).
- Cons: relatively black box (as all NNs are, whether deep or shallow): little intuition or control over the features.
- Insights: whether deep or shallow, there is one important distinction about NNs. Many traditional classification methods aim to learn a feature space partitioning function (as a means to separate samples of different classes), leaving feature engineering to the application developer. Conversely, NNs actually do more than that: they seek to learn the optimal representation (read: N-dimensional vector encoding of your data points: words, sentences, paragraphs, what have you…) so as to minimize some loss function on the training set. That learned representation is a byproduct that can sometimes be more valuable than the main task! For instance, in Word2Vec, the task is predicting a word based on the words around it (or vice-versa), which admittedly is not a very common problem in reality. However, as a byproduct, we get word vectors with some interesting semantic properties (e.g., analogies such as the famous king - man + woman = queen example) that come in handy in text classification and other applications.
3. Deep neural networks such as RNNs/LSTMs or ConvNets. Let us discuss this option in more detail below.
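
To make the "semantically blind" point about Bag of Words features concrete, here is a minimal sketch in plain Python. The three-document training set, the smoothed IDF formula and the helper names (`fit_idf`, `tfidf`, `cosine`) are assumptions made for illustration only; a real pipeline would normally rely on an existing library rather than hand-rolled code.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training word."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def tfidf(text, idf):
    """Map text to a sparse TF-IDF vector; words unseen in training are dropped."""
    counts = Counter(w for w in text.split() if w in idf)
    return {w: c * idf[w] for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training set (an assumption for illustration only).
train_docs = ["great movie", "boring movie", "great acting"]
idf = fit_idf(train_docs)

seen = tfidf("great movie", idf)
synonym = tfidf("great film", idf)  # "film" never appeared in training

print(sorted(synonym))              # the vector only keeps words known from training
print(cosine(seen, synonym))        # similarity drops although the meaning is the same
```

To a human reader the two reviews say the same thing, but the vectorizer simply discards "film", so the classifier only ever sees half of the second review. Learned word vectors are one way around this, since a synonym can end up close to the words it replaces.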

Recurrent Neural Networks (RNNs) are a category of NN-based models designed specifically for sequences of arbitrary length. The basic RNN unit works by ingesting and outputting one symbol at a time, where the last time step’s output is also part of the new time step’s input - see the illustration below. In other words, RNN units have short-term memory, which makes them a particularly attractive tool for modeling text.

Figure 1. An unrolled recurrent neural network, image courtesy of Chris Olah
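
The recurrence can also be sketched in a few lines of plain Python. This is a simplified, untrained unit of the classic form h_t = tanh(W_xh x_t + W_hh h_(t-1) + b); the weights, the one-hot symbols and the function names are hand-picked assumptions for illustration, not taken from any library.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_(t-1) + bias)."""
    h_new = []
    for i in range(len(bias)):
        total = bias[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, w_xh, w_hh, bias):
    """Feed a sequence one symbol at a time, carrying the hidden state forward."""
    h = [0.0] * len(bias)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, bias)
        states.append(h)
    return states

# Hypothetical hand-picked weights: 2-dimensional inputs, 2 hidden units.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.9, 0.0], [0.0, 0.9]]
bias = [0.0, 0.0]

# Two one-hot encoded symbols.
sym_a, sym_b = [1.0, 0.0], [0.0, 1.0]
states_ab = run_rnn([sym_a, sym_b], w_xh, w_hh, bias)
states_bb = run_rnn([sym_b, sym_b], w_xh, w_hh, bias)

# Both sequences end with the same symbol, yet the final states differ,
# because each state also depends on what came before.
print(states_ab[-1] != states_bb[-1])
```

The same loop handles a sequence of any length, which is exactly what makes the unit a natural fit for text. Training the weights (backpropagation through time) is left out of this sketch.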

However, in practice, RNNs can be hard to train, and for small to medium-sized training datasets, "good old" methods can often deliver similar or even superior performance at a lower computational cost. Even in the Deep Learning category, RNNs have a strong competitor in Convolutional Neural Nets (a.k.a. ConvNets or CNNs) - just as long as your text can be treated as fixed-length sequences, which makes ConvNets a suitable approach to represent and classify tweets, text messages, short user reviews, etc. Still, it’s too early to dismiss RNNs and their variants entirely. Where these networks (and particularly their more advanced variant called Long Short-Term Memory networks, or LSTMs) begin to shine is in other NLU tasks that often involve prediction (i.e., are generative in nature) rather than "just" classification, a fundamentally discriminative task. So, let us look beyond classification.

Language modeling

Language modeling is a fundamental NLP task, central to important problems such as speech recognition and machine translation and instrumental in many other settings. In simple terms, the goal is, given a sequence of symbols (words or characters), to predict the next symbol (word or character), sometimes more than one. Ever experienced Google’s search query autocomplete? There you go!
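
For a feel of what "predict the next symbol" means, here is about the simplest possible language model: a bigram counter in plain Python. The toy query log and the function names are assumptions for illustration; real autocomplete systems are of course far more elaborate.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, word, k=1):
    """Return up to k of the most frequent continuations of `word`."""
    counts = follows.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]

# Toy "query log" (an assumption for illustration only).
queries = [
    "how to train a model",
    "how to train a dog",
    "how to bake bread",
    "how to train a model fast",
]
model = train_bigram(queries)

print(predict_next(model, "to"))      # most frequent word after "to"
print(predict_next(model, "a"))       # most frequent word after "a"
print(predict_next(model, "unseen"))  # nothing to suggest for an unknown word
```

A bigram model only looks one word back. Longer n-grams look further, and the recurrent models discussed next replace the fixed window with a learned state.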

It turns out that RNN/LSTM networks by now have a clear edge over n-gram baseline methods when it comes to predicting sequences. Karpathy’s now famous post, The Unreasonable Effectiveness of Recurrent Neural Networks, has given us a glimpse into the power of character-based RNN language models, showcasing their "magic" in generating Shakespeare-like text and beyond - although certain prior techniques are in fact capable of similar magic (to an extent). Still, it has already been shown ^[7] that RNNs are capable of dramatically better performance, improving the language model perplexity by 2x over prior baselines - even on large one-billion-word datasets. The caveat is expensive computation, but as GPUs keep getting cheaper, that’s a fair trade-off.
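
In case perplexity is unfamiliar: it is the exponential of the average negative log-probability that a model assigns to the words that actually came next, and it can be read as the number of equally likely choices the model is effectively hesitating between. A minimal sketch follows; the probabilities are made-up numbers chosen to show what a 2x improvement means, not figures from the paper.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability given to the true next tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical models scored on the same 4-word test text.
baseline = perplexity([1 / 50] * 4)  # as unsure as a uniform pick among 50 words
improved = perplexity([1 / 25] * 4)  # as unsure as a uniform pick among 25 words

print(baseline, improved, baseline / improved)
```

Lower is better, and halving perplexity means the model has, on average, narrowed its guess down to half as many candidates per word.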

This leads us to an interesting discussion on the applications of neural language models, such as Machine Translation, Natural Language Inference and of course Chatbots(!), as well as their limitations - namely the lack of adequate attention and memory mechanisms - and the recent attempts to address them. But that is a separate topic, and by now I have likely already exhausted your attention budget for this blog. No biggie, I am about to follow up in a separate post. As for the original question, Do I need Deep Learning for my NLU/NLP problem?, here is a quick rule of thumb:

For text prediction, a.k.a. generative tasks, give Deep Learning a good look. For classification, a.k.a. discriminative tasks, your mileage may vary.

In any case, remember, machine learning is an empirical discipline and no two datasets are alike. So you’ll never know the answer for certain until you try!

References

1. Bag of Tricks for Efficient Text Classification, A. Joulin, E. Grave, P. Bojanowski, T. Mikolov

2. Facebook’s fastText

3. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks, Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg

4. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task, D. Chen, J. Bolton, C. D. Manning

5. A Decomposable Attention Model for Natural Language Inference, A. P. Parikh, O. Täckström, D. Das, J. Uszkoreit

6. Andrej Karpathy on human vs. machine image classification accuracy

7. Exploring the limits of language modeling, R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, Yonghui Wu

Deep Language Modeling, Part II: Applications

Ilya Eckstein — Fri, 09 Sep 2016 00:00:00 GMT

Thanks for coming, but this post is not yet ready. …Being written as we speak, come back again!

NLI and machine translation.

Why is NLI (Natural Language Inference) important? It’s a key problem in natural language understanding. Many other NLU problems can be reduced to NLI, such as summarization (given a piece of text and a suggested summary, does the former entail the latter?), information extraction (does the text entail the extracted fact?), question answering (does the data source entail a given question and answer pair?), as well as machine translation (does a phrase in language A entail its given translation in language B, and vice versa?). Why is it mentioned in the same category as machine translation? Both can be cast as an alignment problem.
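
The reductions above all boil down to building a (premise, hypothesis) pair and asking an entailment model about it. Here is a rough sketch of that framing in Python. The `NLIPair` type, the helper names and the sentence template used for question answering are hypothetical, and the entailment model itself is deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NLIPair:
    """An entailment question: does `premise` entail `hypothesis`?"""
    premise: str
    hypothesis: str

def from_summarization(text, summary):
    """Does the text entail its suggested summary?"""
    return NLIPair(premise=text, hypothesis=summary)

def from_information_extraction(text, fact):
    """Does the text entail the extracted fact?"""
    return NLIPair(premise=text, hypothesis=fact)

def from_question_answering(source, question, answer):
    """Phrase the question/answer pair as one statement (template is an assumption)."""
    return NLIPair(premise=source, hypothesis=f"The answer to '{question}' is '{answer}'.")

def from_translation(phrase_a, phrase_b):
    """Translation is checked in both directions, hence two pairs."""
    return [NLIPair(phrase_a, phrase_b), NLIPair(phrase_b, phrase_a)]

pair = from_question_answering(
    "Robin is a voice assistant for drivers.",
    "What is Robin?",
    "a voice assistant",
)
print(pair.hypothesis)
print(len(from_translation("good morning", "bonjour")))
```

Once every task is expressed this way, a single entailment model (and a single alignment mechanism) could in principle serve all of them.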

Question Answering

Reading comprehension, memory and attention.

Dialogue and chatbots!

So why has DL been more successful in ASR and vision than in NLP/NLU? That’s a topic for another post!