Limor Gultchin ’17
Computational Humor: What’s At Stake
With recent advances in machine learning and data science, more and more fields that were once sealed off from computational approaches are opening up to the touch of artificial intelligence. Humor is one such field. Notoriously difficult to analyze, assess and rate even for human agents, it poses a challenge for the adventurous computer scientists who dare to take it on. Literature analyzing the roots and causes of humor has been circulating at least since ancient Greece, with the following major theories presented to date [10]:
1. Superiority theory. The superiority theory of humor states that the feeling of self-containment associated with a sense of superiority in status or well-being over others is what makes us laugh. In other words, we laugh when we are better off than a fellow human and assert this fact. Notable philosophers proposed and bolstered this assumption, which did not help the status of comedians and humor: they were deemed agents of evil mockery and pride. Among the famous supporters of this view were Plato, Descartes, Hobbes and the Bible.
2. Incongruity theory. A more modern and positive approach to humor has been endorsed by scholars mostly since the Renaissance. This reading of humor claims that what creates a comedic effect is a “benign violation of expectation” [8]. Recent research has shown that many types of laugh outbursts can be explained by their surprising nature: whenever an unexpected event happened that turned out to be harmless, amusement was afoot. Two conditions had to be met to create humor:
(a) an incongruity, which first took an audience by surprise, perhaps even a tense, fearful one.
(b) a resolution of the contradiction of expectation, followed by relief, which has a vocal manifestation in the form of “Ha Ha”, accompanied by an upward twitch of the lips.
Among the supporters of this understanding are James Beattie, Immanuel Kant, Arthur Schopenhauer and Søren Kierkegaard.
3. Stress release theory. Along the lines of the relief theme proposed by the “incongruists”, Sigmund Freud proposed a similar reading of humor, with a typical Freudian twist. To Freud, humor was a result of pent-up stresses, many of which relate to societally taboo subjects, such as sexuality and violence, that are released when a joke teller refers to them directly or as an innuendo. By “opening for a conversation” a subject the listener has been putting active effort into suppressing, a joke can signify a promise of relief, an appeasing statement, an “it’s OK” signal. It tells the listener that others are contemplating and suppressing the same topics, and offers an opportunity to vent some of these accumulated suppressions. Lord Shaftesbury, Herbert Spencer and John Dewey also adopted a biological-psychological reading of humor as a stress reliever.
4. Mock-aggression play theory. From a biological-evolutionary point of view, laughter can be explained by its predecessors in primates. Observations by evolutionary biologists revealed that primates laugh too, usually during mock-aggression activity within a family of chimpanzees. Mock-aggression activity can be seen when members of the same family of monkeys are engaged in playing that, to an outside observer, might look like a violent confrontation. When two or more such primates are “fighting” in this way, like children engaged in a mock-fight, they seem to be training for potential future conflicts, yet need to signal to each other that they are not being truly aggressive. The chosen signals are usually a quick, fast breath of air coming from the diaphragm and a movement of the mouth in which the front teeth are exposed and the lips are drawn back and upwards. The sound this mock-aggressive signal makes is “ah ah”, which resembles human laughter, only in a backward direction: primates’ diaphragms are set up a little differently than those of humans (or rather the other way around?), to allow for appropriate locomotion. Thus, primates breathe in instead of out when poking fun at their fellow primates.
A computational approach to humor is a much more recent development, yet it has already produced its own modest history. Beginning in the 1990s, computer scientists attempted to understand, and moreover to produce, humor automatically. Developments in machine learning made it possible to imagine an algorithm that could take in examples of humor and try to uncover their inner pattern. HAHAcronym [14] was one of the first examples of a humor-generation focused system, which aimed to produce humorous acronyms. In the mid 2000s, Twitter proved to be a useful platform for such investigations, due to its ease of access and its inclusion of ready-made initial classifications (e.g. hashtags). Barbieri and Saggion utilized the popular social network to detect irony [3], while Yishay Raz focused on automatic classification of types of humor [12], such as anecdotes, fantasy, insult, irony, etc. Some of these computational approaches referenced the voluminous traditional thinking on humor presented above [2]. There has also been much work on the understanding and assessment of visual humor. In 2015, Shahaf et al. joined Bob Mankoff, cartoon editor of the New Yorker, to build a system that would be able to predict which one of the 5,000+ weekly submissions to the newspaper’s cartoon caption competition is the funniest [13]. These attempts and more have all been interesting and illuminating, but achieved only limited success. There is clearly much more to be done to achieve a more accurate understanding of what makes things funny and how we can “teach” humor to computers.
Increments to our knowledge of humor are gradual and modest, as it is a difficult subject, and yet there are more and more indications of the potential of such investigations. In an attempt to add to the existing corpus of computational humor literature and experimentation, this thesis focuses on generating literal humor, through an examination of Google’s word2vec word embedding. In particular, we will examine humorous 4-word analogies generated based on the embedding’s representation of words.
For this literal approach to the composition of humorous 4-word analogies, we took advantage of the existing and relatively new technology of neural word embeddings, and in particular Google’s word2vec, which opened to public usage in 2013 [9, 11]. Using a neural net trained on instances of Google News articles containing 3 billion running words, Google’s machine learning engineers were able to create an embedding of 3 million words and phrases mapped into vectors of 300 dimensions. The vectors are learned representations of the words in the training text corpus, crafted using continuous bag-of-words and skip-gram architectures. A neural net is trained to predict a word from the words that appear nearby in the text, and the parameters it learns are used as the word representations. In effect, a semantic field is created, such that words that tend to appear together across the training texts appear closer to each other in this multidimensional space as well. The resulting vectors thus capture relations between words in the underlying training data, such as which words are similar to each other (and thus are “neighbors” in the semantic space), and allow us to complete analogies that capture the relations between pairs of words (such as Paris:France::Rome:Italy). These resulting vectors can therefore be used as features when training models in various natural language processing and machine learning applications. For example, through word2vec, Bolukbasi et al. uncovered gender biases in the underlying embedding [6]. In this study, analogies such as
Man : Computer programmer :: Woman : homemaker
were automatically created after taking the cross product and Euclidean distance measures of vector representations of single words, pairs and quadruples. We decided to utilize this approach to study the humorous nature of word associations. If word embeddings can uncover gender biases, why can they not uncover the funniness of words and combinations of phrases?
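The analogy-completion arithmetic that underlies this approach can be illustrated with a toy example. The three-dimensional vectors below are made up purely for illustration (real word2vec vectors have 300 dimensions and are learned, not hand-crafted); the mechanism, however, is the standard one: add the difference vector of the first pair to the third word, then pick the nearest remaining word by cosine similarity.

```python
import numpy as np

# Hypothetical toy "embedding" -- hand-made 3-d vectors standing in for
# learned 300-dimensional word2vec vectors.
emb = {
    "paris":   np.array([1.0, 0.0, 1.0]),
    "france":  np.array([1.0, 1.0, 1.0]),
    "rome":    np.array([0.0, 0.0, 1.0]),
    "italy":   np.array([0.0, 1.0, 1.0]),
    "berlin":  np.array([0.5, 0.0, 1.0]),
    "germany": np.array([0.5, 1.0, 1.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Complete "paris : france :: rome : ?" by vector arithmetic:
target = emb["france"] - emb["paris"] + emb["rome"]
candidates = (w for w in emb if w not in ("paris", "france", "rome"))
best = max(candidates, key=lambda w: cosine(emb[w], target))
```

With these toy vectors the difference between a capital and its country is the same direction for every pair, so the query lands on “italy”; in a learned embedding the match is only approximate, which is exactly the slack our humor-generation pipeline exploits.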
Linear regression and SVM classification
This paper had three main goals: to generate humorous analogies, to predict ratings of humorous analogies and to perform a Turing test to assess our results. In order to achieve them, it focuses on binary classification of words and analogies into “funny” and “not funny” categories, and on linear regression to generate predictions of “funniness” scores, based on ratings given by Amazon Mechanical Turk users. Following is a brief explanation of these two methods, meant to dispel the magical nature of machine learning as a “buzzword.”
Once we had our data organized as numerical values that represent features of different objects (provided by word2vec, in our case), we could train a classification model. Our approach required 4 different classifiers, which created somewhat of a “cascading” effect. SVM classifiers fit a linear separator between groups of data points that carry certain labels. In most binary classification tasks, they are used to separate positive from negative examples. Our case was no different: we tried to separate positive, funny examples from unfunny examples. The separator is fitted such that the fewest data points are misclassified (assigned the opposite of the desired label). SVMs, or support vector machines, are unique in their definition of a decision boundary with a desired margin. In a classification problem treated with an SVM, we look at the points that are closest to the hyperplane as support vectors (hence the name). The certainty of a classification of a data point can be determined by its distance from the hyperplane (the farther it is, the more certain the classification) [5]. Thus, the best possible hyperplane we could fit is the one that maximizes the distance of the closest points (the support vectors) to the hyperplane. A basic hyperplane can be defined as

h(x; w, w0) = w^T x + w0

The goal is to fit the classifier with the lowest loss rate, where the labels of the classification are described as 1 if h(x; w, w0) > 0 or -1 otherwise, and in our case as funny if h(x; w, w0) > 0 or unfunny if h(x; w, w0) < 0. When determining the weight vector w we try to maximize the distance of the closest points, the support vectors, from the hyperplane, on both sides, while still maintaining a correct classification. To find the optimal hyperplane we need to solve the following minimization:

min over w, w0 of (1/2) ||w||^2, subject to y_i (w^T x_i + w0) >= 1 for all i
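The classification step above can be sketched with scikit-learn, the library used for the implementation in this thesis. The data below is a hypothetical two-dimensional stand-in for the word2vec-derived features (the real features are 300-dimensional or larger), chosen only so that the separating hyperplane is easy to picture; this is a minimal sketch, not the thesis’s actual training code.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical toy features: label 1 stands for "funny", -1 for "unfunny".
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],        # positive examples
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])  # negative examples
y = np.array([1, 1, 1, -1, -1, -1])

# Fit a linear max-margin separator; C trades margin width against violations.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

# The fitted hyperplane h(x; w, w0) = w^T x + w0:
w, w0 = clf.coef_[0], clf.intercept_[0]

# A new point is labeled funny if h > 0, unfunny if h < 0.
label = clf.predict([[2.2, 2.8]])[0]
```

The distance of a point from the hyperplane, |h(x; w, w0)| / ||w||, can serve as the certainty measure described above.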
Linear Regression (Ridge)
The regression task was done with ridge regression, where a linear function was fitted to predict the score (y axis) matching each data point, represented by the numerical values associated with a 4-tuple. The following is the definition of the regression [1]:

h(x; w, w0) = w^T x + w0

And the loss function we aim to minimize is defined as:

L(w, w0) = ||Xw + w0 - y||^2 + λ||w||^2
Where X is our feature matrix (i.e. the featurized data), w is the weight vector chosen for the production of the regression line, w0 is the included bias term and λ is the ridge regularization parameter; h is therefore the function for generating a prediction. The generation and rating process discussed in this thesis uses these two ideas from machine learning theory [5]. The implementation was made possible through the Python library scikit-learn [7], which offers great support in putting machine learning theory into practice, and into actual prediction and classification models.
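The ridge regression step can likewise be sketched in scikit-learn. The one-dimensional data below is a made-up stand-in for the featurized 4-tuples and their “funniness” scores, used only to show how the regularization parameter λ (called alpha in scikit-learn) enters the fit; it is not the thesis’s actual data or code.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical featurized data: scores roughly following y = 2x plus noise.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 1.9, 4.1, 5.9, 8.1])

# alpha plays the role of the ridge regularization parameter lambda:
# it shrinks the weight vector w, trading a little bias for lower variance.
model = Ridge(alpha=1.0)
model.fit(X, y)

# h(x; w, w0) = w^T x + w0, used here to predict the score of a new point.
pred = model.predict([[5.0]])
```

Because of the λ||w||^2 penalty, the fitted slope is pulled slightly below the unregularized least-squares slope, so the prediction at x = 5 lands a bit under the noise-free value of 10.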
In this project we furthered the understanding of humor and the capabilities of producing it “on demand”. There are various benefits to the development of artificial humor capabilities:
1. Allowing the creation of smoother, more fun interfaces, which will surely play an even greater role in our lives in the coming years. Systems which include humorous components could be more congenial: making queries, tasks and warnings less repetitive, statements of ignorance more acceptable and error messages less patronizing [4].
2. Facilitating a better understanding of humor itself, by asserting or disproving notions of what makes things funny.
3. Overcoming yet another interesting artificial intelligence challenge posed to computer science researchers.
To the best of our knowledge, there has been no attempt to use our newly gained understanding of word embeddings in the field of computational humor. Furthermore, previous attempts at humor generation tasks have had limited success. This is yet another attempt at providing a proof of concept for future research. This paper shows that computers can not only construct a humorous structure, but also recognize humorous themes relatively well, suggesting implications for a fuller understanding of what makes things funny.
Joking Around, or, Learning to Generate Funny Analogies
Our main goal was to show that humorous analogies can be generated based on the word embedding word2vec. We started from random combinations of 4 words, and later tried to generate funny 4-tuples. In the process, we trained 3 different classifiers, which match the 3 phases of generation:
1. 4 random words → 4 funny words. To start building our data set we asked Amazon Mechanical Turk users to come up with up to 5 humorous analogies on any topic, and created an initial pool of about 1000 human-written analogies. We then asked other users to rate those analogies in the following manner: each participant was asked to indicate, for each analogy in a batch of 25, whether it was funny or not (such that they could provide a single up-vote for each joke they found funny). Each batch of analogies was rated by 10 different participants, for a total of about 1200 analogies rated (1000 of them human-written; around 200 were analogies we found funny, presented as a check, to make sure raters would not tire of repetitive analogies which might not be funny, and thus affect the quality of their ratings). Since the score range of an analogy is 0-10, and the average rating for an analogy was around 2.5, we concluded an analogy had to gain 4 or more votes to be considered funny. Next, we trained a classifier which treated as positive examples all the words that were used by AMT users in analogies they had written that were later rated highly by their colleagues. The word embedding was used to obtain representations of the positive and negative examples, and to draw new words on which we applied the fitted classifier. Then, we could pull new words that were classified as funny, to create our collection of funny words.
We decided to treat each of the words from funny-rated analogies as “funny” in itself for our initial classification, as we knew we needed a starting point for this demanding task. The negative examples were any words from the embedding, including verbs, generic names or prepositions, which tend to be less likely to be part of a joke. We needed to create an initial collection of words that had a significant likelihood of appearing in a funny analogy. Thus, in a liberal yet effective manner, we treated all words already mentioned in funny analogies as having a higher likelihood of appearing in jokes, and thus as generally “funny words”.
2. 4 funny words → pairs of words. We made a new classifier to generate potentially funny pairs of words from the pool of funny words. We trained another SVM model, and used the length of each word and the angle and distance between the vector representations of the words as features. After tuning the hyper-parameters of the model, we managed to classify quality pairs, using pairs from the Turkers’ analogies as positive examples, and random pairings of funny words as negative examples.
3. Generated pairs → classified matching of pairs. As a final stage in this cascading process, we trained a final SVM to tell the difference between random pairings of the 2-tuples and good pairings, which have an appropriate affinity between their first and second halves. The pairs used to create these full 4-word analogies were the generated pairs of the previous phase, and the features used were a combined 1200-dimension vector, made up of each word’s word2vec representation, as well as the distance and angle between the pairs. The positive examples for training were, as expected, the full 4-tuple analogies rated funny by the AMT users, and the negative examples were random pairings of pairs generated in the previous round.
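The feature construction for the final stage can be sketched as follows, under one plausible reading of the description above: the four words’ 300-dimensional word2vec vectors concatenated (giving the 1200 dimensions), plus the Euclidean distance and angle within each pair. The random vectors below are hypothetical stand-ins for real word2vec representations.

```python
import numpy as np

# Hypothetical stand-ins for the four words' 300-d word2vec vectors.
rng = np.random.default_rng(0)
w1, w2, w3, w4 = (rng.normal(size=300) for _ in range(4))

def distance(u, v):
    """Euclidean distance between two word vectors."""
    return np.linalg.norm(u - v)

def angle(u, v):
    """Angle between two word vectors, via the clipped cosine."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Feature vector for the candidate analogy (w1 : w2 :: w3 : w4):
# 4 * 300 = 1200 concatenated embedding dimensions, plus the
# distance/angle features for each of the two pairs.
features = np.concatenate([w1, w2, w3, w4,
                           [distance(w1, w2), angle(w1, w2),
                            distance(w3, w4), angle(w3, w4)]])
```

A vector of this form would then be fed to the final SVM as a single training or test example.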
Through this iterative yet evolving process, we managed to generate new analogies that could now be put to the test of AMT users. The task presented to them was identical to the original rating task described above, with the sole difference that the analogies presented were now computer generated. We decided to use the following baselines (each consisting of 300 analogies tested) for our analysis, so that we could assess the progression of this method one step at a time:
• 4-tuples of completely random words from the embedding.
• 4-tuples of funny words from the pool generated by classifier 1.
• 2-tuples of randomly matched pairs generated by classifier 2.
• AMT-made analogies.