What is the most common dad joke? What pun is the most worthy of posting?
Join me on this useless and odd journey!
I really love dad jokes and similar puns.
I also love analysing big chunks of data.
…soooo let’s go!
Where’s the data?
There is a subreddit called /r/dadjokes that has been a bastion of bad jokes since 2011. Most of the jokes posted there have the initial bit of the joke in the title and then the punchline in the body (or selftext as reddit calls it).
Oof that’s a groaner.
Then I had this idea.
Let’s just download all of them. Easy right? Right.
Again, where’s the data?
There’s a Python library called psaw, which is a wrapper around the pushshift.io API.
With that library I could set up something like this.
from psaw import PushshiftAPI
import datetime as dt
import time

api = PushshiftAPI()

# Start here and go backwards
start_epoch = int(dt.datetime(2019, 8, 18).timestamp())

for i in range(500):
    # Generator that will only give you 2000 at a time
    joke_list = api.search_submissions(before=start_epoch,
                                       subreddit='dadjokes',
                                       filter=['id', 'title', 'selftext', 'score'],
                                       limit=2000)
    for joke in joke_list:
        # Some posts don't have .selftext
        selftext = ""
        if hasattr(joke, "selftext"):
            selftext = joke.selftext.replace("\n", " ")
        # Dump that out to be piped into a file
        # ||| for quick simple parsing later
        print(joke.created_utc, "|||", joke.id, "|||", joke.score, "|||", joke.title, "|||", selftext)
        # The next search starts where the last entry left off
        start_epoch = joke.created_utc
    # be nice to rate limits
    time.sleep(2)
I ran this, saw it was giving me sensible results, then I set it to fetch a million jokes and went to get dinner.
I arrived back to a script that had finished running.
How many jokes? 157,266, all the way back to the first joke posted in 2011.
Awesome, that’s a nice little data set!
Obviously the first thing I had to do was to run all the jokes through a Markov chain generator. I used markovify for this because it’s the quickest way. I even used their “Basic Usage” example almost verbatim.
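If you're wondering what a Markov chain generator actually does, here's a rough pure-Python sketch of the idea. To be clear, this is not markovify's implementation and not the script used for this post: `build_chain` and `generate` are made-up helper names, the `<s>` and `</s>` markers are my own convention, and the three-joke corpus is only there so the example runs.

```python
import random
from collections import defaultdict

def build_chain(sentences, state_size=2):
    """Map each run of `state_size` words to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with start markers and close with an end marker (hypothetical tokens).
        words = ["<s>"] * state_size + sentence.split() + ["</s>"]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, max_words=30, rng=random):
    """Walk the chain from the start state until the end marker or max_words."""
    state = ("<s>",) * state_size
    output = []
    while len(output) < max_words:
        next_word = rng.choice(chain[state])
        if next_word == "</s>":
            break
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)

# Tiny illustrative corpus; the real run used the whole jokes file.
corpus = [
    "How do you find Will Smith in the snow? You look for fresh prints!",
    "What do you call a belt made from a watch? A waist of time",
    "How do you make holy water? By boiling the hell out of it",
]
chain = build_chain(corpus)
print(generate(chain))
```

With only three jokes the output will mostly be one of the originals, or a splice of two that share a word pair such as "do you". The nonsense only gets properly good once the corpus is big.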
Ready to read some Markov chain generated dad jokes? Most of them are bad. The ones I picked here are the ones that made sense, and none of them have a punchline.
Why washing machine instead of giving me cannons. Things didnt work out.
Knock, Knock. Our local TV weatherman broke both her arms?
What the difference between a piano, and an empty plate in the knees, and naturally, he was Finnish.
Any guy who invented the knock knock joke, shouldve been a post earlier about the wheels.
Did you hear about the hard-working mechanic who specializes in small groups
Dad: Sure, wheres the punchline?
Ok enough nonsense
Right. Here’s my idea.
Most jokes have a similar structure: setup and punchline.
Let’s look at a common joke.
How do you find Will Smith in the snow? You look for fresh prints!
So this joke starts with a common setup, “How do you…”, and the punchline is the last couple of words, here “fresh prints”.
We might get somewhere if we only look at the first three words of the joke. But first we need to clean up the text.
I created a simple class to give the joke some structure.
class Joke:
    def __init__(self, aTimestamp, aId, aScore, aJoke):
        self.myTimestamp = aTimestamp
        self.myId = aId
        self.myScore = aScore
        self.myJoke = aJoke.strip()
And then I read in the jokes.txt file generated before to create an array of jokes.
jokes = []
with open("jokes.txt") as myFile:
    for line in myFile:
        s = line.split(" ||| ")
        # a couple of jokes parsed badly, let's just ignore them
        # (we need all five fields: timestamp, id, score, title, selftext)
        if len(s) < 5:
            continue
        j = Joke(s[0], s[1], s[2], s[3] + " " + s[4])
        jokes.append(j)
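To see what that split does to a single line, here's a small standalone sketch. `parse_line` is a hypothetical helper, not part of the script above, and the sample post ids (`abc123`, `abc124`) are invented. It assumes the five-field layout printed by the download script: timestamp, id, score, title, selftext.

```python
def parse_line(line, sep=" ||| "):
    """Split one dumped line into (timestamp, id, score, joke text), or None if malformed."""
    # Only strip the newline: a post with no selftext ends in "||| ",
    # and stripping that trailing space would break the split.
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 5:
        return None
    # Title and selftext are joined into one string, as in the post.
    text = " ".join(fields[3:]).strip()
    return fields[0], fields[1], fields[2], text

full = "1566086400 ||| abc123 ||| 42 ||| How do you find Will Smith in the snow? ||| You look for fresh prints!\n"
title_only = "1566086400 ||| abc124 ||| 7 ||| Title only ||| \n"

print(parse_line(full))
print(parse_line(title_only))
print(parse_line("garbage\n"))
```

All fields stay as strings here; convert the score or timestamp only if you actually need to sort by them.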
You’ll see here that I treat the setup and punchline as the same string. The split between title and selftext isn’t a very reliable separation, so I combine them.
To clean up the text I use a few methods.
I tokenize the joke using nltk’s word_tokenize.
from nltk.tokenize import word_tokenize
words = word_tokenize(self.myJoke)
Then I use string.punctuation to translate any punctuation to an empty string.
import string
punktTable = str.maketrans('', '', string.punctuation)
strippedWords = list(filter(None, [w.translate(punktTable) for w in words]))
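Here's a quick standalone look at what that translate-and-filter step does. The token list is hand-written for illustration (it is not real tokenizer output), and `strip_punctuation` is a hypothetical helper name.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens that end up empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]

tokens = ["I", "dont", "have", "20/20", "vision", "."]
print(strip_punctuation(tokens))
# ['I', 'dont', 'have', '2020', 'vision']
```

Tokens that are pure punctuation turn into empty strings, which is why the filter is needed. Punctuation inside a token is removed too, so "20/20" becomes "2020". Note that `string.punctuation` only covers ASCII, so curly quotes would survive this step.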
And lastly I use nltk.stem.snowball.SnowballStemmer to stem each word. This helps immensely when you want to group together similar sentences.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmedJoke = []
for word in strippedWords:
    stemmedJoke.append(stemmer.stem(word))
When this is done, the first three items in the list stemmedJoke are the cleaned up first three words of the joke.
Let’s join those words together, print them out and count the occurrences of each triplet.
Here is the top 10.
7898 what_do_you
2632 what_did_the
2629 did_you_hear
2159 whi_did_the
1667 how_do_you
806 what_the_differ
725 what_kind_of
663 what_is_the
561 did_you_know
543 what_doe_a
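The join-and-count step can be sketched with `collections.Counter`. This is my guess at the shape of it rather than the exact script: `first_triplet` is a hypothetical helper, and the word lists are hand-written stand-ins for the stemmed output.

```python
from collections import Counter

def first_triplet(stemmed_words):
    """Join the first three stemmed words with underscores (fewer if the joke is shorter)."""
    return "_".join(stemmed_words[:3])

# Hand-written examples standing in for the stemmed jokes.
stemmed_jokes = [
    ["how", "do", "you", "find", "will", "smith"],
    ["how", "do", "you", "make", "holi", "water"],
    ["did", "you", "hear", "about", "the", "kidnap"],
    ["knock", "knock"],
]

counts = Counter(first_triplet(joke) for joke in stemmed_jokes)
for triplet, count in counts.most_common(10):
    print(count, triplet)
```

`most_common(10)` returns the entries sorted by count, which is all a top 10 needs.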
So 7898 jokes start with “What do you…”, followed pretty far behind by “What did the…” and “Did you hear…”.
This is the stemmed version, which explains why one of them says “differ” and not “difference”.
What if we do the exact same thing but for the last 3 words of the joke? Here is the top 20 most common with a version of the joke/jokes below each entry.
172 sticki_a_stick
    Whats brown and sticky? A stick.
168 out_of_it
    How do you make holy water? By boiling the hell out of it
    I used to be a very small kid But i grew out of it
150 in_his_field
    Why did the scarecrow win an award? For being outstanding in his field
142 grow_on_me
    Ive gotten pretty attached to my beard. Its really starting to grow on me
120 all_of_them
    Hey dad, did you get a haircut? No son, they cut all of them
    How many apples grow on a tree? all of them
    How many dead people do you think are in the cemetary? Hopefully all of them
115 medium_at_larg
    What do you call a midget psychic on the run from the law? A small medium at large
114 see_that_well
    Why did the blind man fall down the well? He couldnt see that well
113 get_in_there
    Thats a nice cemetery, I hear people are dying to get in there
106 have_2020_vision
    I dont know where I see myself in a year. I dont have 20/20 vision
94 waist_of_time
    What do you call a belt made from a watch? A waist of time
93 he_woke_up
    Did you hear about the kidnapping at school? Its ok he woke up
93 a_dad_joke
    * This isn't a joke, it's mostly people ending the joke by saying
    * something like "I made a dad joke"
87 to_get_in
    Thats a nice cemetery, I hear people are dying to get in
    * Same as the one above, just skipping "there"
85 trip_all_day
    I bought some shoes from a drug dealer.. I dont really know what he laced them up with, but I was tripping all day
85 do_he_laugh
    I dont always tell dad jokes... ...but when I do, he laughs
85 a_littl_lighter
    Whats the difference between a hippo and a Zippo? Ones heavy, ones a little lighter
83 them_all_cut
    Hey dad, did you get a haircut? No son, I got them all cut
    * Same as above, but the ending is worded differently
81 make_up_everyth
    Never trust an atom. They make up everything
76 a_chicken_sedan
    Why does a chicken coup only have 2 doors? Because, if it had 4 doors it would be a chicken sedan
74 food_no_atmospher
    Have you heard about the restaurant on the moon? Great food, no atmosphere
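Grouping jokes by their ending, with example jokes kept under each key, could look roughly like this. It's a simplified sketch, not the script behind the list above: `simple_words` is a hypothetical stand-in that only lowercases and strips punctuation (no nltk, no stemming), so its keys read "sticky" where the stemmed version reads "sticki".

```python
import string
from collections import defaultdict

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_words(joke):
    """Stand-in for the tokenize/strip/stem pipeline: lowercase, drop punctuation, split."""
    return joke.lower().translate(PUNCT_TABLE).split()

def group_by_ending(jokes, n=3):
    """Group jokes by their last n words, most common ending first."""
    groups = defaultdict(list)
    for joke in jokes:
        key = "_".join(simple_words(joke)[-n:])
        groups[key].append(joke)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)

jokes = [
    "How do you make holy water? By boiling the hell out of it",
    "I used to be a very small kid But i grew out of it",
    "Whats brown and sticky? A stick.",
]

for ending, examples in group_by_ending(jokes):
    print(len(examples), ending)
    for example in examples:
        print("   ", example)
```

Keeping the original joke text next to each key is what shows that one ending, like "out_of_it", can belong to several different jokes.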
Some of the jokes are impressively copy pasted, to the letter.
"I dont always tell dad jokes... ...but when I do, he laughs" but I’m only searching for
Yeah, it has been posted before, sorry.
I don’t think there is one, this was mostly for me. If you want the data set, yell at me on twitter and we’ll figure something out.