Monday 29 May 2017

The onomatopoeia of laughter

When the folks at the New Yorker take a break from publishing stories about how spending thousands of dollars on watches is a legitimate form of therapy for Trump-election-induced depression, they feature some good stuff. Case in point, a short piece on "Hahaha vs. Hehehe", in which the author is baffled by, and trying to explore, the emergence of "hehe" as an alternative to "haha".

I will offer my personal interpretation of the different representations of laughter in text in a minute. But first: consider something so obvious that the author of the article does not even question it: why do all representations of laughter (besides acronyms) include the letter 'h'? And not just in Germanic languages, but also in Greek, Chinese and, according to my mother-in-law, Arabic (where laughter is apparently represented as just "hhhhhhhh", without any vowels)? (Though different languages obviously use different letters to represent "h"; for example, the Spanish write "jajaja!" - which to me sounds like an overexcited German Shepherd responding to "do you want a cookie?")

The answer is obvious, but fascinating nevertheless: all humans use the same sounds when laughing, and the predominant sound in laughter is that of drawing a breath - which we all instinctively code as an "h". But although all languages (that I know of) have tried to standardise the orthography of laughter in a strikingly similar manner, laughter in real life varies by individual: a professor at Tulane claims that he could transcribe and then read back his friends' laughter with such accuracy that everyone could guess whose laughter he was reciting. That is not to say that individuals' laughter is invariable: it does change, to convey different levels of amusement or other emotions. Nevertheless, it seems that an individual's laughing patterns are quite unique. This observation in itself is moderately interesting, but hardly a breakthrough: upon reflection, we'd all realise that we can recognise our friends by the sound they make when they laugh. But I find it amazing that linguists can actually transcribe laughter in such a way that it becomes recognisable.

Unfortunately, it is too much to expect every person to be able to accurately transcribe their laughter, which is why we resort to a common laughter-vocabulary consisting of the exclamations ha, haha (with "ha" potentially repeated a few more times; the reason I called out the single "ha" separately is that it can have a markedly different meaning, as I will discuss below), hehe, heh, hihi, hee hee, tee hee, ho ho, the prefixes mwah, bwah &c, the acronyms lol, rofl, lmao &c and various emoticons.

Of these, "ha ha" is the undisputed king, followed by heh and lol, according to Google Ngrams. All others are less popular:

Do note, however, that Google Ngrams tracks the appearances of a word in books, whereas emoticons and acronyms are far more popular online. Furthermore, the results above are not perfectly accurate, as some writers do not inject spaces between consecutive "ha"s or "he"s; I chose to ignore this in order to eliminate noise from "haha" or "hehe" being used in different contexts (I discovered that the Hehes are a tribe in Tanzania, for instance). Finally, some noise still remains - for example, "lol" also stands for "lots of love".

I therefore tried to validate these results by searching for these various words in my whatsapp chat history. Sadly, the search function doesn't give the number of hits, but eyeballing the data, I'd say that "haha" is indeed the most common, followed by "heh" - although the latter seems to be used almost exclusively by me. There are a few "lol"s, "hehe"s and various emoticons, but no "hee"s, "lmao"s or "rofl"s. Let's now look into these exclamations/acronyms/emoticons in more detail:

Ha
The author of the New Yorker article writes that "ha" is the basic unit of written laughter; that's true, inasmuch as the syllable "ha" is a constituent of "haha", but it's misleading: "ha" on its own is not actually used to express laughter or even amusement so much as surprise, suspicion, triumph &c (according to Oxford Dictionary). The word has its origins in Middle English apparently, though the dictionary doesn't give any sources.

That said, it's true that "ha" is also used for laughter. Well, sort of. I think most people use it as an acknowledgment that something intended to be amusing was said. The prankster who has so far tricked the CEO of Barclays and the Governor of the Bank of England uses it (in character) to applaud his own humour/cunning:

"If you ask for the crystal glasses you’ll be able to admire [the waitresses'] enchanting dexterity. I keep those glasses low down, ha! You don’t reach my age without knowing all the tricks." and

"What else would Mack the Knife do but support those he can trust in!
Begs the question, who should we seek to silence next!?
Onward, to bigger, and better things.
I may have already had a stiff one, it’s been a long day, and I get no younger.
Clapton has nothing on me ha!"

(To me, this usage seems so idiosyncratic that I am surprised that it by itself didn't give away the fact that the emails were a prank.)

Haha (& so on)
"Haha"'s preeminence is confirmed by Oxford Dictionary, which redirects queries for "hehe" to it. According to the dictionary, "haha" is first recorded in Old English, though again, there are no examples (apparently it appears in Hamlet, though it seems that it's actually meant to express surprise instead of laughter). Also, note that Garner says "ha-ha" is the correct spelling.

In accordance with both the Ngrams results and the OED, most of the people who responded to my question on facebook on what words they use to express laughter said that "haha" or "hahaha" is their preferred choice. In their words, "hahaha is more expressive, whereas the other ones are too descriptive", and "the number of consecutive 'ha's is proportional to the volume of laughter".

(One exception: one friend said he uses "hahaha" when laughing at people, and generally seems to attribute a mocking tone to "haha".)

Personally, I use "haha" when only mildly amused, and "hahaha" when I find something genuinely funny. A quick search of the term in my family's whatsapp group (aptly named "Bantercats") reveals that "hahaha" is the most common variant, followed by the standard "haha". There was also a "hahahahaha" from my brother in response to this picture of a hotel room in Tianjin:


(I was put up at that hotel after my flight was cancelled due to bad pollution (that's a thing that happens here) - I pulled the curtains to look outside and found out the room had no window. I don't think they understand the function of curtains in that hotel.)

Heh
Urban Dictionary defines "heh" as "half-laugh, semi-cynical connotation, used on IRC by those too cool to say lol or roflmao". (Here's the word's self-proclaimed inventor.)

I use "heh" to indicate a drier sort of amusement than "haha"; a more detached kind, perhaps, or even slightly patronising (for example, I used "heh" in response to being told that the Chinese call Plato "Bolatu").

At other times it serves as a neutral response. In these cases, it serves to let my interlocutor know that I've seen their message, and that I am well-mannered enough to respond, without actually putting any effort into my response.

Interestingly, the New Yorker writer says that "heh" may have some down-home vulgarity undertones. I had never interpreted it this way, but David Foster Wallace does use "heh" in Brief Interviews with Hideous Men to transcribe the conspiratorial and sleazy banter between two men, one of whom has taken advantage of a vulnerable woman's grief to sleep with her. So I suppose there is something in this view.

Finally, "heh" is used to represent sinister laughter in some comic books, e.g.:


Hehe
Again according to UD, "hehe" refers to muffled laughter, with a sneaky aspect to it. Indeed, everyone seems to attribute similar characteristics to it - my facebook friends assigned "tongue-in-cheek" or "snigger" tones to it, and this academic article says that "hehehe" is a "cynical variant of hahaha".

(Searching for "hehe" in my whatsapp messages yields 13 hits. Of these, 11 "hehe"s come from the same person. I wonder what that says about him.)

Unlike the New Yorker writer then, I do not see "hehe" as an alternative to "haha": they serve different purposes. One interesting question though: do the people who use this variant do it consciously every single time? Do they intend to communicate this "nudge-nudge", "wink-wink" attitude? What I mean is, do people think "okay, I want to show I am being cheeky, so I will specifically write 'hehe'", or do they subconsciously write "hehe" because they are feeling cheeky?

It'd be quite something if the latter - it would show that our intuitive understanding of the various undertones in laughter exclamations has been ingrained in us.

Hihi, hee hee, tee hee, ho ho
All of these are far less common variants. I agree with the New Yorker writer that the first three are cutesy, possibly too cute, as she puts it. To me, they look a bit artificial - they give the impression one specifically chose to use them instead of their flowing naturally, and so they can be a bit jarring. In fact, I don't think I've ever seen "tee hee" anywhere except this famous bash.org quote.

(Note that "hihi" is increasingly being used as a greeting instead of as a stand-in for laughter - and that's the definition given by the Urban Dictionary.)

As for "ho ho", its most notable variant is Santa Claus's "ho ho ho". It, along with all previous variants, also appears in comic books when the artists want to show that different characters have different laughter:

Prefixes: bwah, mwah
The former of these is used to express explosive or hysterical laughter, but it's not very common in text speak.

The latter is interesting though: it is universally recognised as evil laughter. Up to this point, all variants have been abstractions, or the greatest common denominators, of individuals' laughter. Though simplified, they have been representations of the ways people actually laugh - earnest laughter, cheeky laughter, cool and detached laughter, mocking laughter - these are all real.

But evil laughter? This is the only one that is an invention. No-one is evil in their own narrative; no-one actually says "mwahaha, I will destroy earth and kill billions of people!". Granted, some people are deranged and may laugh while plotting atrocities - but the thing is, "mwah haha" does not have a mad quality to it. So out of all the exclamations we have discussed up to now, this one is the only one that is purely fictitious.

Acronyms
I am going to put this out there: I consider "lol" and its kin to be signs of intellect deficiency. Though I have at times used them myself, they seem to me a less refined version of "heh" (a view shared by the Urban Dictionary, as mentioned earlier).

I dislike them for several reasons. First, they have lost their meaning: "lol" is supposed to mean "laugh out loud", but no-one interprets it this way - as a frequent user of "lol" wrote on facebook, "[if I actually laugh out loud] I say 'literal laugh out loud'". And no-one literally rolls on the floor laughing, and I am not sure what "laughing my ass off" is even supposed to mean. So it seems to me that using "heh" is more sincere.

Second, "heh" is easier to read, process and interpret than "lmao", "rofl" and even "lol", with which older people are unfamiliar:

Third, "lol" especially is often used as a neutral filler - for example, "I am not sure what to have for dinner lol". Fillers such as "like" are bad enough in speech; we shouldn't be introducing them into prose. They serve no function that cannot be served more elegantly, and more considerately for the reader, by other means.

For one thing, "lol" makes the writer come across like a prepubescent boy; for another, sentences like the one above (which are sadly quite common) are hard to parse: the reader expects the sentence to end at "dinner" but is then confronted by an unexpected word, without any punctuation. This is taxing and unnecessary. Writing badly unintentionally may be excusable; doing so on purpose is not.

Emoticons/Emojis
First of all, I must confess I did not know the difference between emoticons and emoji. Apparently, the former are textual representations of smiley faces (or other things) - e.g. :). Emojis are actual pictures.

Given the plethora of different emoticons/emoji, it is easy to find one that perfectly fits the writer's intended purpose, and so there is less ambiguity than there is in, say, using "haha" vs "hehe".

My personal take on this increasingly popular form of prose is that emoticons are, generally speaking, good and emojis bad. Emoticons are useful: they are a quirky way of efficient, speedy communication. If you have dinner plans with a friend, and they cancel because they caught a cold, you can send a sad face - :(. This expresses your sadness that they are ill and that you won't see them more efficiently, and perhaps more kindly, than a "oh man, bummer" - which may come across as having slightly accusing undertones.

They also slightly enrich language. When someone at work thanks you for something, or compliments you, you can respond with :D, instead of writing "no problem" (boring), "you're making me blush" (silly) or "oh I'm not that great" (falsely modest).

Emojis serve the exact same purposes, but they are, in their vast majority, unbearably ugly. Look at gtalk's latest crop of smiling faces:


Look at them! They look like melted blobs of cheddar. They are horrible. Facebook's aren't much better either. Furthermore, they are jarring when inserted in text. Emoticons are way less invasive and annoying. What really bugs me is that most apps now automatically translate emoticons into emoji. Meh.

Usage guide
Anyways, to recap, use:

  • "ha!" for surprise or for pranking British bank CEOs;
  • "haha" for laughter, and add more "ha"s in proportion to the funniness of the thing you are responding to. If what you are responding to is unexpectedly hilarious, you can add a "bwah" at the beginning;
  • "hehe" when you're being cheeky;
  • "heh" when you want to acknowledge receipt of a slightly amusing message or if you're a comic book villain;
  • "hihi"/"hee hee"/&c when you're intentionally being cute, but bear in mind that it may come across as artificial and overdone.
  • "ho ho ho" if you're Santa Claus/writing a Santa Claus story (and are not feeling adventurous enough to change his laughter);
  • "mwah ha ha" if you're drawing a comic book villain. But bear in mind this'd imply a simplistic view of good and evil.
  • A combination of the above if you're a comic book artist drawing several characters laughing at the same time;
  • emoticons to enrich your prose - but do not overuse. It's good to be able to express yourself in a variety of ways; over-relying on emoticons may cause a decline in your ability to do so.
Do not use:
  • "lol" and other acronyms;
  • Emojis.
I hope this helps!



Friday 12 May 2017

Ithaca

Cavafy's Ithaca is one of the, if not the, most celebrated and well-known Greek poems. Like the rest of his poems, part of its universal appeal is, I think, that it is straightforward and accessible without being trite or overly simplistic.

The poem's main message is that the journey is more important than the destination. But the secondary message, explicitly stated by the poem's narrator, is that Ithaca serves a paramount role as a destination: "without her you would not have set out". This second message is often overlooked: because one of modern society's most common afflictions is focusing too much on an arbitrary goal only to find out that achieving it doesn't really bring happiness, most people refer to the poem to remind themselves that the adventures on the way to Ithaca are more rewarding than the arrival on the island. The problem of not having an Ithaca to long to reach is less prevalent.

But I think that with the progress in AI research, we need to start paying more attention to the poem's penultimate verse. Thinkers and journalists are (rightly) highlighting the dangers of increasing AI-led automation, but important though this discussion is, it pales in comparison to the questions that face us if we manage to build super-intelligent computers.

Imagine a world where AI solves the problem of scarcity. It is so intelligent that it not only fully automates all jobs, it not only fully optimises production of goods, but it can even invent new raw materials if the ones available in nature are not sufficient. All goods are free - food, clothes, yachts, you name it. No-one needs to work anymore. Money stops existing. (As an aside, this world does pose a serious challenge: what do we do with unique goods such as land or works of art? Since no-one can offer their labour in exchange for such goods, how do those who did not own any possessions at the advent of this world acquire such unique goods? If you thought today's world is unfair to the poor, let me tell you, there might come a time when you will wish capitalists could exploit your labour.)

Moreover, this AI can not only cure diseases, but also reverse aging. It can connect humans' brains to the internet - no need to look up anything on your phone any longer, it's all there in your head. No need to study: you can just download all information you need, a la Matrix. It can modify our genes at conception so we are all born with genius-level intellects and Apollonian/Aphrodisian bodies.

No more heated political debate: political division has two factors, imperfect knowledge and scarcity. Since these two are now overcome, we can finally live in peace. No war, no famine, no poverty (assuming we find a solution to the aforementioned unique asset scarcity problem). No suffering, no misery.

But also, no Ithaca: no opportunities for heroism, self-sacrifice, hard work, toil, perseverance. Virtues such as honesty, charity and humility become irrelevant. Our motivators - desire for wealth and legacy, fear of death, competition - become obsolete. We won't be able to relate to our past art - regardless of how much the world has changed, the themes in ancient tragedies, myths and epics are as relevant now as they were at the time these tragedies &c were written. This will be a world where Odysseus never leaves Ithaca in the first place; worse, a world where he not only never leaves Ithaca, but starts importing lotuses (somehow convincing the lotus eaters to stop munching their fruit for a minute and trade).

What will a world like this look, or rather, feel, like? It is a world that is for all intents and purposes unimaginable to us. For thousands of years, we humans have had Ithacas to go to. What happens when these Ionian islands are taken away from us? I don't know - but I find it scary as hell.

EDIT: my brother pointed out that it looks like I am arguing there will be no journey, not that there won't be an Ithaca. It's true that the metaphor isn't perfect - I guess what I am trying to say is that there will be no journey because there won't be a need to strive for anything.

Wednesday 3 May 2017

My problems with Hollywood - Pt I

I started writing this post intending to moan about Hollywood's recent lack of originality, and its studios' increasing proclivity to produce sequels and spin-offs instead of bringing new scripts to our screens. But while writing the first sentence, it occurred to me that I may just be acting like a grumpy old man: you know, looking at the past with rose-tinted glasses, going on about the "good old days", even though these "good old days" may actually have been worse than today.

So I did what a decent data geek does: I built a database of the top 10 box office hits for each year from 2017 going back to the 1920s, and crunched the data. You can read about how I built the database, and access it yourself, here. My conclusions:

Hollywood is relying more on sequels...
Here is the % of top 10 films that are original scripts by year (note that "original" here excludes not only sequels and spinoffs but also remakes and adaptations - refer to the link above for details on how I classified the films in my database).
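For anyone who wants to recompute this percentage from the database, here is a minimal sketch in Python. It assumes each film is a record with a year and one of the five types described in the technical details (Original, Adaptation, Sequel, Spinoff, Remake); the field names and the sample rows are made up for illustration and are not taken from the actual spreadsheet.

```python
from collections import defaultdict


def original_share_by_year(films):
    """Return {year: % of that year's films whose type is 'Original'}."""
    totals = defaultdict(int)
    originals = defaultdict(int)
    for film in films:
        totals[film["year"]] += 1
        if film["type"] == "Original":
            originals[film["year"]] += 1
    return {year: 100 * originals[year] / totals[year] for year in sorted(totals)}


# Hypothetical rows, invented for illustration only.
sample_films = [
    {"year": 2016, "type": "Original"},
    {"year": 2016, "type": "Sequel"},
    {"year": 2017, "type": "Remake"},
]
shares = original_share_by_year(sample_films)
print(shares)
```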

And most original scripts in the 2010s are animations. In the 00s, 37% of original films were cartoons; in the current decade, the percentage is 67% (the only original live-action top 10 films of this decade have been Gravity, Inception, Maleficent, Ted and Get Out, though the latter is only included because it is in 2017's top 10 so far). Not that there is anything wrong with cartoons - many are exceptionally good; but it would be nice to have some original live action films every now and then.

(Of course, it could be that if we looked at the entire set of films produced instead of only the box office top 10, the % of original scripts would be much higher. In fact, given that it is becoming increasingly easy for wanna-be director hipsters to shoot indies, I'd expect this to be so; it may also be that big studios themselves are producing more original films, and it's just consumer tastes that have changed, propelling sequels to the top 10. But I doubt this; I think the studios put the bulk of their resources, both in terms of production and marketing, behind sequels.)

For the record, here's how the other types of films (remakes and adaptations) have fared:


... and it's not really making much sense to me
When I thought of writing this post, and before doing the data analysis, I was planning to make some wishy-washy statement about how, yes, sequels are more profitable and all, but oughtn't Hollywood care about artistic merit too? (Which, wishy-washy or not, is something I genuinely believe.)

But since I got the data together anyway, I decided to look into how much more profitable sequels are, compared to original scripts. The answer is, actually, they are less profitable:


(ROI here is calculated as (Revenue/Budget - 1). Both budgets and revenues have been adjusted for inflation to reflect 2017 prices.)
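As a minimal sketch of that calculation in Python (the function names are made up for illustration, and the price-index arguments stand in for whatever inflation table is used):

```python
def adjust_to_2017(amount, price_index_year, price_index_2017):
    """Scale a nominal dollar amount to 2017 prices using a price index."""
    return amount * price_index_2017 / price_index_year


def roi(revenue, budget):
    """Return on investment as defined above: Revenue/Budget - 1."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return revenue / budget - 1


# Rough Blair Witch Project figures quoted in this post:
# a budget of about $100K and a box office of about $240 million.
blair_witch_roi = roi(240_000_000, 100_000)
print(blair_witch_roi)  # 2399.0
```

Note that if a film's budget and revenue are both scaled by the same year's factor, the adjustment cancels out in the ROI of that film; it matters when comparing budgets and revenues across years.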

As you can see, the average original film has a far lower budget than sequels, but roughly the same revenue - and a higher IMDB rating. Now, it's true that these numbers are a bit skewed by one-off massive successes - in particular, the reason the non-weighted average is so high for original films is the Blair Witch Project, which had a tiny budget (~$100K) but proved to be a massive box office success (~$240 million). So I included the variance here: this is a measure of the extent of divergence in the data; the higher it is, the more diverse the values in the set - hence, variance is used as a measure of risk.
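To see how a single outlier moves both the average and the variance, here is a small Python sketch. The ROI values are invented for illustration, and it uses the population variance from the standard library; the post does not say whether the table uses the population or the sample version, so treat that choice as an assumption.

```python
from statistics import mean, pvariance

# Hypothetical ROI values, invented for illustration only.
steady = [1.0, 2.0, 3.0]
with_outlier = [1.0, 2.0, 3.0, 2399.0]  # one Blair-Witch-sized hit added

steady_stats = (mean(steady), pvariance(steady))
outlier_stats = (mean(with_outlier), pvariance(with_outlier))

print("without outlier: mean=%.2f variance=%.2f" % steady_stats)
print("with outlier:    mean=%.2f variance=%.2f" % outlier_stats)
```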

As the table shows, though sequels are less profitable than originals, they have much lower variance - in other words, they are a safer bet. Or so it looks at first glance: I also ran the data excluding the top 20% of box office hits - hence taking out of the picture once in 100 years outliers like the Blair Witch. Yet here again original scripts beat sequels:

This analysis is incomplete, however, because it suffers from survivor bias: what it says is that amongst films that made it to the top 10, original scripts outperform sequels. What about those that didn't? Like before, it may be that looking at the entire set of produced films would yield different results. Still, this indicates that if Hollywood executives placed more emphasis on discovering good original scripts, and marketed them well (not like they did, say, with the Nice Guys, which was an excellent film that no-one watched), they might do a service to their shareholders as well as to their consumers.
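The "exclude the top 20% of box office hits" step mentioned earlier could be sketched in Python as below. This assumes the cut is made on revenue (the post does not say whether it was revenue or ROI), and the function name, field names and sample films are all hypothetical.

```python
def drop_top_share(films, key, share=0.2):
    """Return the films with the highest `share` fraction (ranked by `key`) removed."""
    ranked = sorted(films, key=key)
    n_dropped = int(len(ranked) * share)
    return ranked[:len(ranked) - n_dropped]


# Hypothetical films, invented for illustration only.
films = [
    {"title": "A", "revenue": 100, "roi": 1.0},
    {"title": "B", "revenue": 150, "roi": 0.5},
    {"title": "C", "revenue": 200, "roi": 2.0},
    {"title": "D", "revenue": 250, "roi": 1.5},
    {"title": "E", "revenue": 9000, "roi": 500.0},  # the outlier
]
trimmed = drop_top_share(films, key=lambda film: film["revenue"])
print([film["title"] for film in trimmed])
```

The averages and variances can then be recomputed on the trimmed list to check whether the conclusion survives without the outliers.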

Higher budgets do not translate to better films
First I looked into the correlation between a film's budget and its IMDB rating. It's non-existent. Then at the correlation between a film's budget and its revenue - after all this is what an executive will be interested in. Again, non-existent.
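For reference, a correlation of this kind can be computed as in the Python sketch below. It uses the Pearson coefficient, which is an assumption: the post does not say which correlation measure was used. The function name and the toy inputs are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Toy data: a perfectly linear relationship and a perfectly inverse one.
rising = pearson([1, 2, 3], [2, 4, 6])
falling = pearson([1, 2, 3], [3, 2, 1])
print(rising, falling)
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0, as reported here for budget against rating and budget against revenue, indicates none.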

Looking at the same correlations but only for sequels and spinoffs, I find that they are slightly stronger, but still not significant. So then I looked at a few recent franchises to see how their budgets, ratings and box office performance have evolved, all with the objective of answering the question, is a higher budget worth it?

The answer varied depending on the franchise, from kinda...

F&F films have maintained a remarkably consistent rating, but have been performing increasingly well at the box office (the reason for the dip on the chart on the right is that the latest installment is still in theatres).


The Dark Knight films have also maintained a high rating and have performed well at the box office - though the significant budget increase for the third installment did not translate to higher revenue, nor to a better score.

... to the no:



All these franchises have increased budgets vs the original, yet seen consistent or declining scores and revenues.

... to the God no:


With the exception of Jurassic World, these franchises have seen significant decline in ratings and box office performance in spite of budget increases vs the original.

The one interesting outlier: Mission Impossible has had lower production budgets for the last two installments vs the first and the previous ones in the series, but these have been better received by the viewers. Weirdly, however, they have not performed as well at the box office:

So looking at this data, if I were a studio executive, I'd really challenge my directors to explain why they want more money, when earlier productions had both higher artistic merit and better financial performance.

Having established that Hollywood is failing to produce original films, and that they are not particularly good at managing their money either, and given that I had already gone through the trouble of building a film database, I thought to have a look at a few other trends. Here are a few resulting observations:

Films are worse now than in the 60s
It is not just I who is looking at the past through rose-tinted glasses: older box office successes have a higher average rating than more recent ones:

Unfortunately, my database has a very low number of films from before the 60s, which is why these early decades have such a high average. But still, the 60s have a decent sample size, and a higher rating than recent decades.

Looking at the top 3 films for each year, it looks like earlier decades had higher average earnings too:

I imagine this is due to consumers' having more choices these days.

The highest grossing genre is animation; the highest rated is drama


Horror/Thriller appears to be the most profitable, even excluding the Blair Witch Project. Romantic films are the least popular (probably because Twilight falls into this category), followed by comedies. These probably explain the decrease in the % of comedies, and the increase in animation films:


Strangely, though, dramas are also decreasing, even though they are well liked and the second most profitable (though they do not make big box office hits).

The average film runtime has fluctuated widely through time
A few years ago, I watched a film with a friend, after which he commented: "man, that was too long. I swear there was a time you could watch a flick in an hour. Now everything seems to be two hours long". It turns out he's right: films are longer than they were in the 80s, but shorter than they were in the 50s:


As it turns out, longer films are perceived as being better, though this may be because old films were longer, and as we saw earlier, older films have a higher rating:

In summary: Hollywood is pretty broken; it may be that it needs a Moneyball solution - which would help bring some better stuff to cinemas. It would be interesting to compare films to TV by the way; it seems to me that the quality of TV (and Netflix/Amazon/&c) series is getting better and better. 

A few decades ago, excellent programmes like Freaks and Geeks and Firefly were getting cancelled due to low ratings; but I think we will see fewer cult series getting cancelled in the future. A TV station can only broadcast 24 hrs of content per day, and even fewer profitable hrs; so a niche show with few viewers will get axed in favour of a (perhaps blander) one that would appeal to more people. Imagine that viewers' preferences are represented by the following Venn diagram:
The big circle in the middle represents the number of people who watch a mainstream series, like Friends. The smaller circles represent more niche shows. These shows may be profitable - i.e. their ad revenue may exceed their cost of production; but a TV channel will just rank them by profitability, and will fit in as many of the ones at the top of the list as it can in its schedule.

Netflix by contrast will produce any series that's profitable (more or less, it also needs to take into account its cost of capital &c, but let's keep things simple here). It may still be the case that some excellent shows will never get produced, because they appeal to very few viewers; but it's certainly a better deal for the consumer than TV and cinema.
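The difference between the two selection rules can be sketched in a few lines of Python. Everything here is hypothetical: the show names, the profit figures and the number of slots are invented, and, as noted above, the streaming rule ignores cost of capital.

```python
def tv_schedule(shows, slots):
    """A channel with limited airtime keeps only the most profitable shows."""
    profitable = [show for show in shows if show["profit"] > 0]
    ranked = sorted(profitable, key=lambda show: show["profit"], reverse=True)
    return [show["name"] for show in ranked[:slots]]


def streaming_catalogue(shows):
    """A streaming service with no airtime limit keeps every profitable show."""
    return [show["name"] for show in shows if show["profit"] > 0]


# Hypothetical shows and profits, invented for illustration only.
shows = [
    {"name": "Mainstream sitcom", "profit": 100},
    {"name": "Niche sci-fi", "profit": 10},
    {"name": "Cult comedy", "profit": 5},
    {"name": "Very niche drama", "profit": -2},
]
on_tv = tv_schedule(shows, slots=1)
on_streaming = streaming_catalogue(shows)
print(on_tv)
print(on_streaming)
```

With one slot, the channel airs only the mainstream sitcom; the streaming service also keeps the two profitable niche shows, and only the loss-making one is dropped.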

So perhaps we can (have to?) place our hopes for art on what was once seen as the lowest of the lowbrow entertainment media. At least until a Hollywood exec reads this and says "you know what? Screw sequels".

Film Database - Technical Details

You can find my database here. It consists of a raw data tab (DB), a pivot table that makes it easy to look at the data, and some back up tabs.

To build it, I first went to boxofficemojo.com and copy-pasted the top 10 film details for every year from 2017 to 1980 onto the Excel file (e.g. here's the link for 2017). Since this only goes back to 1980, I then went to the All Time Domestic box office site, sorted the data by year, and copy-pasted all films going up to 1980.

(This creates a few issues, because it becomes harder to do like-for-like comparisons before the 80s. For instance, 1972 only had 7 films that made it to the all time box office records, whereas 1974 had 17; this means that it's harder to compare the 70s to the 80s, which only feature top 10 films).

Thankfully, copy-pasting from boxofficemojo.com made it easy to extract the URL for each film's page - this I put in column B. A formula gets the search link for the film's page on IMDB.com (columns E&F). This is where things get tricky: I needed to be able to extract some basic information about the films. To do this, I used importXML formulas on Google Sheets. What these formulas do is go to the specified part of a webpage and get the data that's located there. The database at the link above has the results copy-pasted as values, but you can see the formulas in the tab "Import Formulas", in case you want to reapply them. The same formulas are used to get IMDB ratings and a few other bits of data.
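For readers who have not used it, IMPORTXML takes a URL and an XPath query and returns whatever the query matches on that page. The Python sketch below illustrates only the XPath half of that idea on a made-up, well-formed page fragment; it does not fetch anything, the markup and class names are invented, and real IMDB pages are structured differently.

```python
import xml.etree.ElementTree as ET

# Made-up page fragment, invented for illustration only.
page = """
<html>
  <body>
    <div class="film">
      <span class="title">Example Film</span>
      <span class="rating">7.5</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)

# The XPath-style queries say "find the span whose class is ..., anywhere in the page".
title = root.find(".//span[@class='title']").text
rating = float(root.find(".//span[@class='rating']").text)
print(title, rating)
```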

A few more formulas convert the information thus extracted to an analysis-friendly format. These are relatively simple if you're already familiar with Excel, so I won't go through them.

I adjusted all financials to 2017 $ values to account for inflation. To do this, I used the adjustment table found here. This was missing a few years, so I added those myself by either keeping the same price as the previous year or taking the average between the previous and subsequent year (I did this when these two prices differed significantly).
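The gap-filling rule described above can be sketched in Python as follows. The 5% threshold for "differed significantly" is an assumption (the post gives no number), and the function name and price levels are made up for illustration.

```python
def fill_missing_years(index, threshold=0.05):
    """Fill gaps in a {year: price_level} table.

    If the previous and next known prices differ by more than `threshold`
    (relative to the previous one), use their average; otherwise carry the
    previous year's price forward.
    """
    filled = dict(index)
    for year in range(min(index), max(index) + 1):
        if year in filled:
            continue
        previous = filled[year - 1]  # always present: years are filled in order
        following = index.get(year + 1)
        if following is not None and abs(following - previous) / previous > threshold:
            filled[year] = (previous + following) / 2
        else:
            filled[year] = previous
    return filled


# Hypothetical price levels, invented for illustration only.
price_index = {1970: 100.0, 1972: 101.0, 1973: 120.0, 1975: 150.0}
filled_index = fill_missing_years(price_index)
print(filled_index[1971], filled_index[1974])
```

Here 1971 keeps the 1970 price because its neighbours are close, while 1974 gets the average of 1973 and 1975 because they are far apart.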

boxofficemojo.com had 99 genre classification combinations, because some films belong to more than one genre (e.g. Action Thriller, Action Comedy &c). I mapped these to 8 genres for analysis purposes. In general, I tried to do the mapping by deciding which single genre best defines a multi-genre film. For instance, Lethal Weapon is correctly classified as an Action Comedy by boxofficemojo.com; but if I had to choose one genre only, I'd go for Action. You can find the mappings in the "Genres" tab, though note that I did manually adjust the classification of a few films (so, for instance, Alvin and the Chipmunks' full genre is Family Comedy, which would normally map to Comedy, but I manually changed it to Animation). Feel free to download the db and change the mapping if you disagree with it.
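In code, this kind of mapping with manual overrides could look like the Python sketch below. It contains only the two mappings and the one override mentioned above; the names of the dictionaries and of the function are made up, and the real "Genres" tab has many more entries.

```python
# Full genre -> analysis genre (only the examples mentioned in this post).
GENRE_MAP = {
    "Action Comedy": "Action",
    "Family Comedy": "Comedy",
}

# Per-film manual adjustments, which take priority over the mapping.
MANUAL_OVERRIDES = {
    "Alvin and the Chipmunks": "Animation",
}


def analysis_genre(title, full_genre):
    """Return the single genre used for analysis, or 'Unmapped' if unknown."""
    if title in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[title]
    return GENRE_MAP.get(full_genre, "Unmapped")


print(analysis_genre("Lethal Weapon", "Action Comedy"))
print(analysis_genre("Alvin and the Chipmunks", "Family Comedy"))
```

Keeping the overrides separate from the mapping makes it easy to change one without touching the other.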

I also had to classify films into one of 5 types - Original, Adaptation, Sequel, Spinoff or Remake. I did this manually (this was the one part I did not manage to automate). Some films could be considered to be more than one of these (e.g. is the Dark Knight Rises a sequel or an adaptation?). In general, the first of a sequence of films adapted from a novel or play (or any other source) was marked as an adaptation, and subsequent films in the series as sequels; if a film could be considered either an adaptation or a remake, the classification depended on how similar the new film was to the previous adaptation. For instance, the Amazing Spiderman was tagged as an adaptation, whereas Ben Hur was tagged as a remake (though to be fair, I've only ever watched the Charlton Heston version - but since the film is usually called a remake, I went with that).

I think this pretty much covers it - but feel free to ask any questions in the comments.