<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[IE ]]></title><description><![CDATA[Ilya Eckstein's blog]]></description><link>https://ilyaeck.github.io</link><generator>RSS for Node</generator><lastBuildDate>Fri, 21 Oct 2016 21:12:46 GMT</lastBuildDate><atom:link href="https://ilyaeck.github.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Product-first Research]]></title><description><![CDATA[<div class="paragraph">
<p>I work at (and on) Robin Labs, where I the good fortune to be developing our Robin assistant and to see it converse with - and fulfill tasks for - some 2 million users. This work has been incredibly rewarding and it feels important: I strongly believe that, by making technology accessible via a more natural interface, we can help people remain more human. In the process, I’ve come to appreciate the great power of the conversational agent medium and the extent to which users are hungry for technology that can understand them, naturally. But for all the potential, and despite the recent progress in machine learning, our progress in this field is still hampered by the relative crudeness of our tools. It is obvious that Natural Language Understanding is in its infancy and there is much work to be done. Our goal is, therefore, to keep making impactful contributions to the field, bridging the Human-Machine gap.</p>
</div>
<div class="paragraph">
<p>The way to advance is through research, but I believe that Language and Dialogue call specifically for product-first research, i.e., rapid iteration of hypotheses in a real product setting, with live users. To product managers, that may sound like a risky proposition, but without risk, there is no innovation.</p>
</div>
<div class="paragraph">
<p>This product-first perspective make research priorities really clear. We&#8217;ve come to have some decent tools for text classification, but these are not nearly suffucient to create agents that can communicate intelligently. Here are some areas that are particularly in need of disruption right now:</p>
</div>
<div class="paragraph">
<p><strong>Better tools to benchmark conversational agents</strong>	Quality of dialogue is notoriously different to measure and the Turing test is deeply flawed. The lack of adequate tools is hampering the entire field’s progress.  One idea is to offer an open “chatbot playground” to the community, where agents are exposed to live users and can get relative scores in terms of retention and engagement. See a more detailed proposal <a href="https://docs.google.com/document/d/15F0rIqBYxmv-vM1z_6cIvz4RAeb0bVNhPmZQ7KjEGi8/edit?usp=sharing">here</a>.</p>
</div>
<div class="paragraph">
<p><strong>Better dialogue  models</strong>		Strong language models are necessary but not sufficient for the creation of  powerful dialogue models (intuition: dialogue is a protocol to communicate information, so it&#8217;s not just a statistical problem). For instance, sequence-to-sequence networks have been shown to work well machine translation but not in dialogue. In my view, a more promising approach is to learn higher level, language-independent discourse models <sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup> <sup class="footnote">[<a id="_footnoteref_2" class="footnote" href="#_footnote_2" title="View footnote.">2</a>]</sup>  (again, from real interactions with real users), which can also be combined with language models in differentiable end-to-end architectures. Specifically, since some discriminative discourse models are already available, we may be able to train rich generative dialogue models by using GANs (Generative Adversarial Networks) and/or Reinforcement Learning.</p>
</div>
<div class="paragraph">
<p><strong>Reinforcement Learning (RL) of dialogue</strong>	Rather than learning from prior conversations, a potentially more potent paradigm is to learn <em>while</em> conversing with users. Beyond computational aspects, this approach also requires devising a user experience where users are incentivized to make the agent understand them, and the agent is rewarded when it is successful. In other words, dialogue RL involves machine learning and product challenges that may seem quite daunting. Still, it is well worth the effort: imagine a talking to a bot that actually becomes smarter in the course of - and thanks to - the conversation! In summary, RL has the potential to radically transform language &amp; dialogue learning with the users’ help, while keeping them motivated and rewarded in the process!</p>
</div>
<div class="paragraph">
<p>Now that the goals are clear, let&#8217;s get to work!</p>
</div>
<div id="footnotes">
<hr>
<div class="footnote" id="_footnote_1">
<a href="#_footnoteref_1">1</a>. Stolcke, Andreas; Ries, Klaus; Coccaro, Noah; Shriberg, Elizabeth; Bates, Rebecca; Jurafsky, Daniel; Taylor, Paul; Martin, Rachel; et al. (2000), "Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech" (PDF), Computational Linguistics, 26 (3): 339, doi:10.1162/089120100561737
</div>
<div class="footnote" id="_footnote_2">
<a href="#_footnoteref_2">2</a>. Galitsky, Boris A., and Sergei O. Kuznetsov. "Learning communicative actions of conflicting human agents." Journal of Experimental &amp; Theoretical Artificial Intelligence 20.4 (2008): 277-317
</div>
</div>]]></description><link>https://ilyaeck.github.io/2016/09/26/Product-first-Research.html</link><guid isPermaLink="true">https://ilyaeck.github.io/2016/09/26/Product-first-Research.html</guid><dc:creator><![CDATA[Ilya Eckstein ]]></dc:creator><pubDate>Mon, 26 Sep 2016 00:00:00 GMT</pubDate></item><item><title><![CDATA[Natural language understanding: How deep is too deep?]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p><strong>TL;DR</strong> Is Deep Learning always the best tool for Natural Language Understanding tasks? Not necessarily!</p>
</div>
<div class="paragraph">
<p>Recently, a paper from Facebook AI Research (FAIR) appeared on arXiv, under the intriguing title <em>Bag of Tricks for Efficient Text Classification</em> <sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup>, promptly catching the NLP community&#8217;s attention. Even more intriguingly, FAIR soon followed up with an open source implementation a.k.a. <em>fastText</em> <sup class="footnote">[<a id="_footnoteref_2" class="footnote" href="#_footnote_2" title="View footnote.">2</a>]</sup>. In this work (essentially a simple modification of the unsupervised <a href="https://en.wikipedia.org/wiki/Word2vec">Word2Vec</a> algorithm to deal with supervised learning tasks), the authors made a convincing case for the frequent superiority of <em>shallow</em>
networks - as opposed to deep ones - for common text understanding tasks such as sentence classification, dispelling the myth that "deeper is always better".
Furthermore, this paper is but one example in the recent slew of results casting the "silver bullet" efficacy of complex neural architectures into question for text-related tasks (see some examples here <sup class="footnote">[<a id="_footnoteref_3" class="footnote" href="#_footnote_3" title="View footnote.">3</a>]</sup>, here <sup class="footnote">[<a id="_footnoteref_4" class="footnote" href="#_footnote_4" title="View footnote.">4</a>]</sup> and here <sup class="footnote">[<a id="_footnoteref_5" class="footnote" href="#_footnote_5" title="View footnote.">5</a>]</sup>).</p>
</div>
<div class="paragraph">
<p>Surprised? After all, isn&#8217;t Deep Learning a new disruptive force in AI, shown beyond doubt to be clearly superior to prior "shallow"
learning approaches? Well, it depends who you ask. Ask a Computer Vision or a Speech Recognition expert, and you&#8217;ll get and enthusiastic Yes!
In Computer Vision, novel DL architectures (such as <a href="https://arxiv.org/abs/1409.1556">VGG</a>, <a href="https://arxiv.org/abs/1409.4842">GoogleNet</a>, <a href="https://arxiv.org/abs/1512.00567">Inception</a>, etc.) have delivered extremely impressive
results on benchmarks such as <a href="http://image-net.org/">ImageNet</a> and <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR</a>, even defying expectations of some DL champions <sup class="footnote">[<a id="_footnoteref_6" class="footnote" href="#_footnote_6" title="View footnote.">6</a>]</sup>. In Speech Recognition, commercial heavyweights such as Google and Baidu have longs since switch to DL architectures. It should only be natural to expect, then, to see the same trend in NLP/NLU, shouldn&#8217;t it?</p>
</div>
<div class="paragraph">
<p>On the face of it, yes - in a way. The tide of enthusiasm in Deep Learning has of course spilled over to the NLU community, triggering a massive conversion of both
academics and industry practitioners to the newfound DL religion. Impressive results from other fields,
helped by the success of the seminal Word2Vec (followed by <a href="http://nlp.stanford.edu/projects/glove/">Glove</a> and the like) were too much to resist. RNN and LSTM have since become mainstream techniques, offered by popular DL libraries such as <a href="https://www.tensorflow.org/">TensorFlow</a>, <a href="https://keras.io/">Keras</a>, <a href="http://deeplearning4j.org/">DL4J</a>, etc. Among other big companies, Google has been at the forefront of both open-sourcing DL techniques (with <a href="https://www.tensorflow.org/">TensorFlow</a>) and adopting DL architectures in production (in the NLU domain, <a href="https://gmail.googleblog.com/2015/11/computer-respond-to-this-email.html">SmartReply</a> is one recent example).</p>
</div>
<div class="paragraph">
<p>So - you ask - isn&#8217;t that enough? Where do I sign up?! Well, dear friend: if you are reading this, chances are, you are not Google and you likely have neither similarly massive amounts of training data nor their virtually unlimited computational resources. For the rest of us, it is important to understand the performance/computation tradeoffs that come with DL - which is what this post is really about. So let us look at some key problems one by one. In this overly high-level post, we will focus on text classification and a bit of language modeling.</p>
</div>
</div>
</div>
<div class="sect2">
<h3 id="__ext_classification">Тext Classification</h3>
<div class="paragraph">
<p>Clearly, text classification is a common task with plenty of applications: search, user intent determination, sentiment analysis, topic modeling
(slightly different, but close), sequence labeling (related), etc.
For text classification, we typically have 3options:</p>
</div>
<div class="ulist disc">
<ul class="disc">
<li>
<p>"Good old" classifiers such as Random Forests (RF) or Logistic Regression (LR). In this case, we&#8217;ll typically use a Bag of Word (BoW) representation such as TF/IDF.</p>
<div class="ulist">
<ul>
<li>
<p>Pros: well understood models, relatively intuititive features you can control, speed.</p>
</li>
<li>
<p>Cons: the features are "semantically blind": we&#8217;ll miss any words or synonyms that are not in the training set. Also, to push performance, manual feature engineering may be required.</p>
</li>
</ul>
</div>
</li>
<li>
<p>Shallow Neural Networks (NNs) such as Facebook&#8217;s <a href="https://github.com/facebookresearch/fastText">fastText</a>.</p>
<div class="ulist">
<ul>
<li>
<p>Pros: still fast, no feature engineering, possibly inherits Word2Vec&#8217;s nice semantic properties (more on that below).</p>
</li>
<li>
<p>Cons: relatively black box (as all NNs are, whether deep or shallow): little intuition or control over the features.</p>
</li>
<li>
<p>Insights:  whether deep or shallow, there is one important distinction about NNs. Many traditional classification methods aim to learn a feature space partitioning function
(as a means to separate samples of different classes), leaving feature engineering to the application developer. Conversely, NNs
actually do more than that: they seek to <em>learn the optimal representation</em> (read: N-dimensional vector encoding of your data points:
words, sentences, paragraphs, what have you&#8230;&#8203;) so as to minimize some loss function on the training set. That learned representation is a byproduct that can sometimes be more valuable than the main task! For instance, in <a href="https://en.wikipedia.org/wiki/Word2vec">Word2Vec</a>, the task is predicting a word based on the words around it (or vice-versa), which admittedly is not a very common problem in reality. However, as a byproduct, we get word vectors with some interesting semantic properties (e.g., analogies such as the famous <em>king - man + woman = queen</em> <a href="https://www.technologyreview.com/s/541356/king-man-woman-queen-the-marvelous-mathematics-of-computational-linguistics/">example</a>) that become handy in text classification and other application.</p>
</li>
</ul>
</div>
</li>
<li>
<p>Deep neural networks such as RNNs/LSTMs or ConvNets. Let us discuss this option in more detail below.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Recurrent Neural Networks (RNNs) are category of NN-based models designed specifically for <em>sequences of arbitrary length</em>. The basic RNN unit works by injesting and outputting one symbol at a time, where last time step&#8217;s output is also a new time step&#8217;s input - see the illustration below. In other words, RNN units have memory short-term memory! Which makes particularly attractive tool for modeling text.</p>
</div>
<div class="imageblock center">
<div class="content">
<img src="https://ilyaeck.github.io/images/rnn-unrolled-colah.png" alt="rnn" width="600">
</div>
<div class="title">Figure 1. An unrolled recurrent neural network, image courtesy of Chris Olah</div>
</div>
<div class="paragraph">
<p>However, in practice, RNNs can be hard to train and for small to medium-sized training datasets, "good old" methods can often deliver similar or even superior
performance at a lower computational cost. Even in the Deep Learning category, RNNs have a strong competitor in Convolutional Neural Nets
(a.k.a. ConvNets or CNNs) - just as long as your text can be treated as fixed length sequences, making them a suitable approach to represent and classify tweets, text messages, short user reviews, etc. Still, it&#8217;s too early to dismiss RNNs and their variants entirely. Where these networks (and particularly their more advanced variant called <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Long-Short Memory Networks or LSTMs</a>) begin to shine are other NLU tasks that often involve prediction (i.e., generative in nature) rather than "just" classification, a fundamentally discriminative task. So, let us look beyond classification.</p>
</div>
</div>
<div class="sect2">
<h3 id="_language_modeling">Language modeling</h3>
<div class="paragraph">
<p>Language modeling is a fundamental NLP task, central to important problems such as speech recognition and machine translation and instrumental in many other settings. In simple terms, the goal is, given a sequence of symbols (words or characters), to predict the next symbol (word or character), sometimes more than one. Ever experienced Google&#8217;s search query autocomplete? There you go!</p>
</div>
<div class="paragraph">
<p>It turns out that RNN/LSTM networks by now have a clear edge over n-gram baseline methods when it comes to predicting sequences. Karpathy&#8217;s now famous post, <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The unreasonable effectiveness of recurrent neural networks</a> has given us a glimpse into the power of character-based RNN language models, showcasing their "magic" in generating Shakespeare-like text and beyond - although certain prior techniques <a href="http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139">are in fact capable of similar magic</a> (to an extent). Still, it as already been shown <sup class="footnote">[<a id="_footnoteref_7" class="footnote" href="#_footnote_7" title="View footnote.">7</a>]</sup> that RNNs are capable of dramatically better performance, improving the language model <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a> by <strong>2x</strong> over prior baselines -  even on large one-billion-words datasets. The caveat is expensive computation, but as GPUs keep getting cheaper, that&#8217;s a fair trade-off.</p>
</div>
<div class="paragraph">
<p>This leads us to an interesting discussion on the applications of neural language models,  such as Machine Translation, Natural Language Inference and of course Chatbots(!), as well as their limitations - namely the lack of adequate <em>attention</em> and <em>memory</em> mechanism, and the recent attempts to address them. But that is a separate topic, and by now I have likely already exhausted your attention budget for this blog. No biggie, I am about to follow up in a separate post. As for the original question, <em>Do I need Deep Learning for my NLU/NLP problem?</em>, here is a quick rule of thumb:</p>
</div>
<div class="admonitionblock tip">
<table>
<tr>
<td class="icon">
<i class="fa icon-tip" title="Tip"></i>
</td>
<td class="content">
For text prediction, a.k.a generative tasks, give Deep Learning a good look. For classification, a.k.a discriminative tasks, your mileage may vary.
</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>In any case, remember, machine learning is an empirical discipline and no two datasets are alike. So you&#8217;ll never know the answer for certain until you try!</p>
</div>
<div class="sect3">
<h4 id="_references">References</h4>

</div>
</div>
<div id="footnotes">
<hr>
<div class="footnote" id="_footnote_1">
<a href="#_footnoteref_1">1</a>. <a href="https://arxiv.org/pdf/1607.01759v2.pdf">Bag of Tricks for Efficient Text Classification</a>, , A. Joulin, E. Grave, P. Bojanowski, T. Mikolov
</div>
<div class="footnote" id="_footnote_2">
<a href="#_footnoteref_2">2</a>. <a href="https://github.com/facebookresearch/fastText">Facebook&#8217;s fastText</a>
</div>
<div class="footnote" id="_footnote_3">
<a href="#_footnoteref_3">3</a>. <a href="http://arxiv.org/abs/1608.04207v1">Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks</a>, Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg
</div>
<div class="footnote" id="_footnote_4">
<a href="#_footnoteref_4">4</a>. <a href="http://arxiv.org/pdf/1606.02858v2.pdf">A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task</a>, D. Chen, J. Bolton, C. D. Manning
</div>
<div class="footnote" id="_footnote_5">
<a href="#_footnoteref_5">5</a>. <a href="http://arxiv.org/pdf/1606.01933v1.pdf">A Decomposable Attention Model for Natural Language Inference</a>, A. P. Parikh, O. Täckström, D. Das, J. Uszkoreit
</div>
<div class="footnote" id="_footnote_6">
<a href="#_footnoteref_6">6</a>. <a href="https://plus.google.com/+AndrejKarpathy/posts/dwDNcBuWTWf">Andrej Karpathy on human vs. machine image classification accuracy</a>
</div>
<div class="footnote" id="_footnote_7">
<a href="#_footnoteref_7">7</a>. <a href="https://arxiv.org/abs/1602.02410v2">Exploring the limits of language modeling</a>, R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, Yonghui Wu
</div>
</div>]]></description><link>https://ilyaeck.github.io/2016/09/12/Natural-language-understanding-How-deep-is-too-deep.html</link><guid isPermaLink="true">https://ilyaeck.github.io/2016/09/12/Natural-language-understanding-How-deep-is-too-deep.html</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[ NLP]]></category><dc:creator><![CDATA[Ilya Eckstein ]]></dc:creator><pubDate>Mon, 12 Sep 2016 00:00:00 GMT</pubDate></item><item><title><![CDATA[Deep Language Modeling, Part II: Applications]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p><strong>Thanks for coming, but this post is not yet ready.</strong>
&#8230;&#8203;Being written as we speak, come back again!</p>
</div>
</div>
</div>
<div class="sect4">
<h5 id="_nli_and_machine_translation">NLI and machine translation.</h5>
<div class="paragraph">
<p>Why is NLI important ? It&#8217;s a key problem in natural language understanding. Many other NLU problems can be reduced to NLI: such as summarization
(given a piece of text and a suggested summary, does the former entail the latter), information extraction (does the text entail the extracted fact),
question answering (does the data source entail a given question and answer pair) as well as machine machine translation
(does a phrase in language A entail its given translation in language B and vice versa).
Why is it mentioned in the same category as machine translation? Both can be cast as an alignment problem.</p>
</div>
</div>
<div class="sect4">
<h5 id="_question_answering">Question Answering</h5>

</div>
<div class="sect4">
<h5 id="_reading_comprehension_memory_and_attention">Reading comprehension, memory and attention.</h5>

</div>
<div class="sect4">
<h5 id="_dialogue_and_chatbots">Dialogue and chatbots!</h5>
<div class="paragraph">
<p>So why hs DL been more successful in ASR and vision than NLP/NLU? That&#8217;s a topic for another post!</p>
</div>
</div>]]></description><link>https://ilyaeck.github.io/2016/09/09/Deep-Language-Modeling-Part-II-Applications.html</link><guid isPermaLink="true">https://ilyaeck.github.io/2016/09/09/Deep-Language-Modeling-Part-II-Applications.html</guid><dc:creator><![CDATA[Ilya Eckstein ]]></dc:creator><pubDate>Fri, 09 Sep 2016 00:00:00 GMT</pubDate></item></channel></rss>