A Corpus-Based Study of Lexical Items in Five Screenplays

In Comparison with a Spoken and a Written Corpus

by Xiaofen Wang-Gempp

(Master’s Research Project in Applied Linguistics)


It is well known that there is little in the twentieth century that has influenced modern life as greatly as the computer, and the field of linguistics has not escaped this influence. Since the 1960s, linguists have been taking advantage of the computer’s processing power to study language use and language structures by analyzing textual data stored in computer databases. The study of language on the basis of computer-stored data is a characteristic of “modern” corpus linguistics. According to Kennedy, there are four major areas in this burgeoning field:

…making and compiling the corpus; developing corpus-based analysis tools; corpus-based description of linguistic phenomena and application of the corpus-based description in language learning and teaching, and natural language processing by machine (Kennedy, 1998: 9).

Although corpus linguistics did not begin with the development of computers, it is inextricably linked to the computer, for the computer has “incredible speed, total accountability, accurate replicability, statistical reliability and the ability to handle huge amounts of data” (Kennedy, 1998:5). Therefore, computers give a huge boost to corpus-based studies since they facilitate analyses that are more complete and reliable. With the increasing availability of public machine-readable texts and user-friendly computer-assisted language analysis tools, more linguists are undertaking corpus-based studies, which cover “a wide variety of topics within linguistics” (Biber, Conrad & Reppen, 1998:11).

Yet even with machine-readable corpora and computer-gathered data, the amount of raw data is still overwhelming, so researchers’ intuitions or assumptions are required to turn the raw data into a linguistically meaningful analysis. Meaningful corpus-based studies can only be carried out when first-hand textual data supplied by the corpora are combined with the researchers’ knowledge of and about the language; so a corpus-based study is “a question of corpus plus intuition, rather than corpus or intuition” (Leech, 1991:74).

The study presented in this paper focuses on the corpus-based description of linguistic phenomena. As a linguistics major, an English learner, and an English teacher, the researcher has always been interested in English movies, since they were the first medium to introduce “real” spoken English to the researcher during her college years in China, and the English used in the movies is noticeably different from “text-book English” (written English).

Linguistic knowledge tells us that language, in a broad sense, can be classified into two genres[1]: spoken genre and written genre. The screenplay is a distinctive type of text, a literary genre, but its function is to replicate or mirror spontaneous speech, which is a subcategory of the spoken genre. In this study, the researcher will gather data from these three types of texts: screenplays, a spoken corpus, and a written corpus, using a computer text-analysis tool, Text Analysis Computing Tool (TACT), a text retrieval program, and Linux’s text-processing tools.   Biber et al. state,

…quantitative techniques are essential for corpus-based studies…[but] a crucial part of the corpus-based approach is going beyond the quantitative patterns to propose functional interpretations explaining why the patterns exist (1998:9).

This study will combine both quantitative and qualitative approaches. The researcher will start with frequency lists, as frequency counts are the most straightforward approach to a quantitative analysis. This will allow the researcher to view the screenplays from “some nonlinear but critically neutral perspective” (Lancashire, 1993:293). In addition, the frequency count lists will enable the researcher to identify the shared lexical items[2] and investigate how these words are used across screenplays and among the three types of text. A qualitative approach will enable the researcher to study how genre influences the patterns and variations across screenplays, and among the three types of text.

To summarize, the researcher will compare the shared lexical items of screenplays with those extracted from spontaneous speeches and written texts in order to reach a better understanding of the language used in screenplays.

Literature Review

The strength of the corpus-based approach, as Stubbs states, is that 

Much of this deep patterning is beyond human observation and memory. It is observable only indirectly in the probabilities associated with lexical and grammatical choices across long texts and corpora. This therefore leads in turn to a methodological focus on computer-assisted and quantitative methods, particularly in cases where native speaker intuitions are very limited, and where description can proceed only on the basis of attested corpus data (1996:21).

Therefore, the major reason for compiling linguistic corpora is to “provide the basis for more accurate and reliable descriptions of how languages are structured and used” (Kennedy, 1998:88).

Historical Overview

The history of studying language structure and use based on large, systematically compiled text corpora is only about 40 years old. Starting in 1961, W. Nelson Francis and Henry Kucera compiled the Brown Corpus, the first electronic, machine-readable corpus for linguistic research. The Brown Corpus is “a synchronic corpus of approximately one million words representative of the written English printed in the United States in the year 1961” (Kennedy, 1998:24). Since then, the number of machine-readable texts has increased enormously. Between 1970 and 1978, a corpus of written British English, the Lancaster-Oslo/Bergen (LOB) Corpus, was compiled. It is a British English counterpart to the Brown Corpus. In 1975, the London-Lund Corpus (LLC), which contains about half a million words, was completed. Between 1991 and 1995, the British National Corpus (BNC) was compiled; it was designed to be representative of British English as a whole (Kennedy, 1998; Aijmer, 1991; Biber, 1998).

These corpora and more, like Association of Computational Linguistics/Data Collection Initiative (ACL/DCI), The Corpus of Spoken American English (IBM-Lancaster CSAE) and COBUILD/Birmingham Corpus are designed for general linguistic purposes, that is, they have been designed so that they can be “examined or trawled to answer questions at various linguistic levels on the prosody, lexis, grammar, discourse patterns or pragmatics of the language” (Kennedy, 1998:3-4).

Partington gave an overview of the main areas of linguistic analysis that use corpus-based techniques: style and authorship studies; lexis; syntax, text, spoken language; translation studies; register studies and lexicography (1998:2-4). Biber et al. (1998), Kennedy (1998) and Partington (1998) used both publicly available corpora and specifically designed corpora to investigate language use empirically in almost all these areas. The results of their findings show that there are strong, systematic patterns in the way language is used.

Studies related to the present paper

The researcher’s interest in studying screenplays derives from her observation that the lexical items used in “textbook English” (written English) and in the English of screenplays (spoken English) are distinctive. Many studies of lexical choice have drawn on word frequency counts from different texts. This section will present an overview of what other researchers have done in this area.

Whissell (1999) used the computer’s ability to count large numbers of words speedily and accurately in a stylometric analysis of 155 songs composed by the Beatles Paul McCartney and John Lennon between 1962 and 1970. Using TEXT.NLZ, a computer program written by the author, Whissell compared the most commonly used words in Lennon’s and McCartney’s lyrics by counting individual words. Each word or punctuation mark in a song was treated as an entry unit. The results of the comparison showed that Lennon used fewer pleasant words, and more nasty, soft and sad ones, thus confirming the observations made of his less than upbeat attitude. He also used more second person pronouns, proportionally more forms of the verb to be, more forms of the word girl, and more forms of the word dead. McCartney, by comparison, used more words repeatedly, used more punctuation marks, used the conjunction and more frequently, used forms of the word love more often, and included more whoops and nonsense words in his lyrics (Whissell, 1999:260).

Lennon was “the less pleasant and the sadder lyricist” (Whissell, 1999:257) and Lennon-McCartney lyrics became less pleasant, less active, and less cheerful over time, which is congruent with the critics’ observations of Beatles’ songs.

Burrows and Craig (1994) used two computer programs, SPSS and MINITAB, to compare ten English Romantic tragedies statistically with ten Renaissance tragedies, providing strong evidence against the critics’ condemnation of English Romantic tragedies as a series of poor imitations of Renaissance tragedy. The study is based on the frequencies of the ninety-nine most common words in the dialogue of the twenty plays. These ninety-nine words owe least to the context; that is, they are primarily function words: articles, prepositions, conjunctions, pronouns and so on. The central notion of their analysis is concomitance, which Burrows and Craig explain as expressive of the fact that, in patterns of frequency drawn from any set of written texts, many word-types will behave like each other in the sense that they will consistently occur either more frequently or less so, in rough unison with each other … Thus, in the prose of impersonal description, the, of and a will usually show much higher frequencies than they do in dialogue. I and you, meanwhile, will move together on an opposite cycle, occurring comparatively seldom in prose of the former kind but very often in the latter (1994:67).

Although there is a resemblance between Renaissance tragedy and English Romantic tragedies in theme, diction, situation and characters, Burrows and Craig’s research shows that the former is more expository, while the latter includes more commonplace interactions between characters.

Craig (1999) uses Principal Components Analysis to study the idiolects of Ben Jonson’s characters. Principal Components Analysis is a technique used to determine whether it is possible to “reduce a large number of variables to one or more values that will still let us reproduce the information found in the original variables” (Hatch & Lazaraton, 1991:490). The analysis is based on the spoken parts of Jonson’s 214 characters and the variable frequencies of more than 500 of the most common words in his plays. The patterns from the data show that the, of and and are frequent in the same texts, and whenever they are frequent, I, you and my are not, and vice versa. Craig’s explanation is that

high frequencies of the definite article indicate a concentrate of nouns, characteristic in turn of description or narration. A concentration of of indicates description in particular… [a]nd suggests longer chains of nouns and clauses, and thus exposition. Dialogue with a high proportion of the pronouns…is likely to include a good deal of personal interaction, and to be marked by reflexiveness (1999:223).

Therefore, Craig concludes, the characters from Jonson’s early plays tend to be self-oriented, characterized by “myopic egotism: simpletons, gulls and a self-intoxicated courtesan” (1999:228), as shown by the high frequencies of the pronouns I, you and my. Characters from his later plays and the Roman tragedies tend to be world-oriented, as the, of, and and have high frequencies. Characters from the middle-period plays tend to be widely dispersed between these two groups.

Baker did a comparative study of Marlowe’s and Shakespeare’s works by exploiting vocabulary richness as a measure of authorship pace. Pace is “the measurement of an author’s ability to generate new words as the length of his manuscript increases. It is the ratio of Types to Tokens” (Baker, 1988:34). The results show that Marlowe’s Pace is superior to Shakespeare’s, which challenges the belief in Shakespeare’s superiority in vocabulary richness and suggests that Marlowe, had he survived, would have held an unchallenged position among the Elizabethans.

Kennedy compared the rank ordering of the 50 most frequent words in six corpora: Birmingham Corpus, Brown Corpus, LOB Corpus, Wellington Corpus, American Heritage Corpus, and London-Lund Corpus.  These corpora differ not only in size, but also in “geographical diversity, diversity in the period of text production, [and] variation across written as well as spoken texts…” (Kennedy, 1998:7). Yet, there is a striking consistency in the content of the lists: all the words except said are function words, although there are some striking differences in the rank ordering of the 50 most frequent words. This finding reflects the nature of different corpora (for example, LOB is written whereas LLC is spoken).

Biber et al. studied the differences between spoken and written registers. Their study focused on the use of three types of dependent clauses: relative clauses, adverbial clauses, and complement clauses in two written registers from the LOB Corpus (80 academic prose texts and 14 official documents) and two spoken registers from the London-Lund Corpus (44 conversations, and 14 prepared speeches). The results show that “the different types of dependent clause are distributed in very different ways across the [two] registers…. The different functions of the different types of subordinate clauses correspond to their distribution across the registers” (Biber et al., 1998:141).

Seegmiller, Fitzpatrick and Call (1999) used Linux’s text-analysis tools to process the compositions of 23 ESL students at Montclair State University. From these compositions, they obtained the mean sentence length, total words used, clauses per sentence, and errors contained in two timed essays, one written at the beginning of the semester and the other at the end, as a way to assess the language development of these students. The data clearly show “the kind of change [that] occurs in the students writing during a rigorous ESOL course” (unpublished).

Sotillo (2000) did a comparative study of six threads of cyber discourse from five New Jersey towns over a three-month period in order to analyze “the participants’ use of lexicon and syntax”. She used TACT to gather the frequency counts of the following lexical items: nominalizations, Wh-questions, pronouns, use of because as subordinator, and types of verb denoting activities and mental states. The results of this analysis show that working class postings and middle class postings have their own distinctive patterns.

All of the corpora used in the above corpus-based linguistics studies, whether general, publicly available or specialized, are “the actual language used in naturally occurring texts” (Biber et al., 1998:1).

From the literature review, we can see that corpus-based analysis is descriptive, not prescriptive. It focuses on “performance rather than competence and on observation of language in use leading to theory rather than vice versa” (Leech, 1992:107), so as an approach to linguistics, corpus-based analysis provides a new perspective on the study of language use and language structures. It is based not on linguists’ intuitions or small amounts of language data, but on large databases of real language use.

Research Questions

According to Biber et al., there are two major reasons for studying language use: “assessing the extent to which a pattern is found, and analyzing the contextual factors that influence variability” (1998:1).

It is known that the “screenplay” is a unique literary genre. Yet, like written and spoken genres, the “screenplay” is also used as a cover term and can be further broken down and classified as comedy, romance, thriller, etc. As part of the same literary genre, screenplays have the following characteristics in common: they are pre-scripted and written to be spoken, with the goal of replicating actual spontaneous speech, while at the same time they differ from real speech in that they lack its spontaneity. In addition, actual speech exists in spoken form first, though it can be transcribed into a written form. Screenplays are an inversion of speech; they go from written form to spoken form. Therefore, we can see that there is a complex relationship between screenplays and the written and spoken genres, though we know that the screenplay is much closer to the spoken genre than to the written genre. Accordingly, the best way to study the patterns and variations is to make comparisons not only across screenplays, but also against representative spoken and written genres. Based on this knowledge, it is hypothesized that a non-linguistic factor, genre, can influence the lexical patterns and variations across screenplays and among screenplays, spontaneous speech and writing. Thus, the following research questions are proposed:

1.      What lexical patterns emerge among screenplays that suggest they belong to the same genre?

2.      Are there any lexical variations across screenplays and what may account for these variations?

3.      What are the lexical differences and similarities between a corpus of screenplays and a spoken corpus?

4.      How does a corpus of screenplays differ from a written corpus from a general lexical perspective?

The study will be approached by comparing the counts gathered from the word frequency count lists[3], as the frequency count of lexical items is a straightforward approach to analyzing quantitative data. Sorting and counting, the basic functions of all computer-assisted linguistic tools, will give the word frequency count lists and the number of word types and word tokens. The frequency count lists will allow the researcher to view the screenplays from “some nonlinear but critically neutral perspective” (Lancashire, 1993:293), identify the most frequently used lexical items, and explain how different genres and subgenres influence their usage. The type/token ratio can be used to measure the vocabulary richness of a text: a type is a unique or different word in a text; a token is each running word in a text. For example, in American Beauty, the total number of types is 1660, while the total number of tokens is 8913.
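The type/token ratio mentioned above can be computed directly from a list of running words. The following is a minimal Python sketch, not the procedure TACTStat itself uses; the function name type_token_ratio and the toy sentence are illustrative assumptions, while the American Beauty figures are those quoted above.

```python
def type_token_ratio(tokens):
    """Return the number of distinct word forms divided by the number of running words."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)

# Toy sentence (an assumption for illustration): 6 tokens, 4 types.
sample = "the cat saw the other cat".split()
print(type_token_ratio(sample))

# Figures reported for American Beauty: 1660 types, 8913 tokens.
american_beauty_ttr = 1660 / 8913
print(round(american_beauty_ttr, 3))
```

For American Beauty the ratio is roughly 0.186, that is, a new word form appears on average about once in every five or six running words. Because the ratio tends to fall as a text grows longer, it is most informative when the texts compared are of similar length.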

Two levels of comparison will be done. One level is a horizontal comparison, that is, screenplays will be compared against each other. The other is a vertical comparison: screenplays will be compared against a standard spoken corpus and a standard written corpus.


According to the definition given by Kennedy, a corpus is “a body of written text or transcribed speech which can serve as a basis for linguistic analysis” (1998:1). The present study will focus on the influence of a non-linguistic factor, genre, on the patterns and variations across screenplays and among different text types. This determines the design and nature of the corpora. Both general corpora and specialized corpora are used for this study.

Two general corpora are the Brown Corpus, representing written American English, and the CSAE (the Corpus of Spoken American English), representing spoken American English. The Brown Corpus, which is “made up of 500 texts from material published in 1961, … including newspaper articles, religious writing, scientific publications, fiction of various kinds and so on”, (Butler, 1985:34), is a widely used corpus of American English. It is a written genre with “75% factual writing and 25% fiction” (http://www.linguistics.ucsb.edu/research/sbcorpus/faqs.html). Researchers can utilize the index and concordance search engine at the website http://vlc.polyu.edu.hk/ConcordanceIndex/Brown/Default.htm.  The CSAE is composed of “a large number of recordings of people talking – people from all over the United States, in all walks of life, talking about and doing all sorts of things… in fact just about anything you can think of that people do with language” (ibid). The CSAE is commercially available as a computer database on CD-ROM disks[4]. Both of these corpora are systematic, representative and recognized as de facto standard samples of American English among linguists.

The specialized corpus of screenplays compiled for this study is the focus of this analysis. Since so many movies are produced each year, analyzing them all would be an impossible task. Rather than selecting movies arbitrarily or randomly, the researcher applied explicit criteria to narrow down the screenplays used for this study. The movies selected to form the specialized corpus were released during the last decade of the twentieth century and won Oscars for both Best Picture and Best Screenplay. The justification for using these criteria is that winning Best Picture increases a movie’s viewership, and winning Best Screenplay suggests that the language used tends to make the characters real, which means that the language is close enough to real life to be believed by most people. With these criteria in mind, five screenplays were chosen for the linguistic analysis: American Beauty, Shakespeare in Love, Forrest Gump, Schindler’s List and The Silence of the Lambs.

Table 1 below presents general information about the five screenplays, which can be accessed at the Internet Movie Database website: http://www.imdb.com.

American Beauty

1) Year winning Best Picture and Best Screenplay: 1999 (the 72nd Academy Awards)

2) Plot Outline: A deceased man tells his tale of how he turned his miserable life around and turned everyone else’s upside down as a result.

3) Tagline: … Look closer

4) Genre: Drama/ Comedy /Family

Shakespeare in Love

1) Year winning Best Picture and Best Screenplay: 1998 (the 71st Academy Awards)

2) Plot Outline: A young Shakespeare, out of ideas and short of cash, meets his ideal woman and is inspired to write one of his most famous plays.

3) Tagline: …A comedy about the greatest love story.  Almost never told… Love is the only inspiration.

4) Genre: Romance / Comedy

Forrest Gump

1) Year winning Best Picture and Best Screenplay: 1994 (the 67th Academy Awards)

2) Plot Outline: Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny, eludes him.

3) Tagline: The world will never be the same once you’ve seen it through the eyes of Forrest Gump. Life is like a box of chocolates… you never know what you’re gonna…

4) Genre: Adventure / Drama/ Comedy

Schindler’s List

1) Year winning Best Picture and Best Screenplay: 1993 (the 66th Academy Awards)

2) Plot Outline: Oskar Schindler uses Jews to start a factory in Poland during the war. He witnesses the horrors endured by the Jews, and starts to save them.

3) Tagline: Whoever saves one life, saves the world entire. The list is life.

4) Genre: Drama / War

The Silence of the Lambs

1) Year winning Best Picture and Best Screenplay: 1991 (the 64th Academy Awards)

2) Plot Outline: Clarice Starling, a young FBI agent, is assigned to help find a missing woman, and save her from a psychopathic killer.

3) Tagline: Dr. Hannibal Lecter, brilliant, cunning, psychotic. In this mind lies the clue to a ruthless killer. Clarice Starling, FBI. Brilliant, vulnerable, alone. She must trust him to stop the killer. Prepare yourself for the most exciting, mesmerizing and terrifying two hours of your life. The only way to stop a killer is by going into the mind of a madman.

4) Genre: Thriller / Crime

Table 1:  general information about the five screenplays

The plot outline shows the theme of each screenplay, the tagline highlights that theme, and the genre indicates the subgenre to which each screenplay belongs. All five screenplays are freely available in electronic format on the Internet and have been downloaded for this study.

The characteristics of these corpora are that they are all “attested, actual, authentic data” (Stubbs, 1996:xv), that is, all the corpora used in this study occur naturally without the intervention of the researcher.

A screenplay has its own special written format: dialogues alternate with stage directions in the published text. Because the stage directions are not relevant to this study, they were carefully removed. In addition, all the characters’ names were placed in angled brackets so that TACT would not include them in its word counts. The corpus of the five screenplays is thus made up of the entire dialogue among the characters in the five screenplays, and it consists of five subcorpora, one for each screenplay.

In total, there are three corpora (the Brown Corpus, the CSAE and the collected five screenplays), one of which consists of five subcorpora (American Beauty, Shakespeare in Love, Forrest Gump, Schindler’s List and The Silence of the Lambs). They are used in this linguistic analysis.

Tools and Methodology

There are many computer software programs publicly and commercially available to assist linguistic researchers with data analysis; some of the common ones are CorpusBench, ConcApp, Micro Concord, TACT, and Word Cruncher. All of these text-analysis programs can perform the basic tasks of counting, searching, and sorting linguistic items.

In this paper, TACT is used to analyze the five subcorpora, and Linux’s text-processing tools are used to analyze the CSAE. TACT, developed by John Bradley and Lidio Presutti at the University of Toronto Computing Services, runs under MS-DOS and can be downloaded for free from the Internet. In order to use TACT, a plain text file needs to be converted into a text database using an ancillary program called MakeBase. All the other 15 programs in the TACT suite can be employed to extract data from the resulting database files. The most often used programs in TACT are UseBase, CollGen, TACTStat, TACTFreq, and Fcompare. TACTStat, TACTFreq and UseBase are the programs used by the researcher to analyze the selected screenplays. TACTStat “produces type-token statistics for word-length and word-frequency”, and TACTFreq “produces alphabetical, reverse alphabetical, and descending-frequency word-lists” (http://www.chass.utoronto.ca/epc/chwp/bradley/index.html). The last one, UseBase, is interactive, specializing in “quickly answering questions related to a work’s vocabulary” (ibid). After words are selected from a frequency list, the information in the selected list can be displayed in five different ways: text display, variable context, KWIC (key word in context), distribution display and collocate display. UseBase can sort and extract all instances of words or phrases in context.

The researcher starts her study with the horizontal comparison of the five subcorpora, as they are the focus of the study. TACTFreq[5] and TACTStat were run against the five subcorpora to obtain word frequency count lists and type-token statistics.
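A descending-frequency word list of the kind TACTFreq produces can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not TACT's own algorithm: speaker names are assumed to be marked up in angle brackets, words are folded to lower case, and forms with apostrophes are kept as single words. The names frequency_list, TAG and WORD and the sample dialogue are hypothetical.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]*>")               # speaker names marked up as <NAME>
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")   # i'll and it's stay single words

def frequency_list(text):
    """Return (word, count) pairs in descending order of frequency."""
    text = TAG.sub(" ", text)                    # drop bracketed speaker names
    text = text.replace("\u2019", "'").lower()   # straighten apostrophes, fold case
    counts = Counter(WORD.findall(text))
    # Ties are broken alphabetically so the output is reproducible.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# Invented dialogue, used only to exercise the function.
sample = "<SPEAKER A> I'm fine. I said I'm fine. <SPEAKER B> You are not fine."
for word, count in frequency_list(sample)[:3]:
    print(count, word)
```

The number of types is then the length of the returned list, and the number of tokens is the sum of its counts.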

TACT and Linux text-analysis tools use spaces and punctuation marks between words as word delimiters. Therefore, the frequency counts retrieved for the five subcorpora are raw counts[6], since forms with apostrophes are considered single words and variants of the same form are considered different words. For example, I and I’ll have different graphic forms, so I’ll is counted separately from I, while buy, buying, buys, and bought are considered different types. In this study, the researcher is interested in the frequency of particular lexical items, that is, the headword or lemma[7], which can be represented by a number of separate word forms. Therefore, lemmatization has to be performed in order to give accurate frequency counts of each lexical item. In addition, the text lengths of these five subcorpora are different. In order to make the frequency comparison accurate and consistent, the data need to be normalized, that is, the raw counts from corpora of different text lengths are adjusted to a fixed amount of text so that they can be compared (Biber et al., 1998:263). For the comparison across the five subcorpora, the median text length is used as the baseline to which the counts are normed. These two steps, lemmatization and normalization, standardize the raw frequency counts and prevent the research results from becoming distorted. The lemmatization is done manually and the normalization is done using a specially created Excel spreadsheet[8]. Also, ranks will be assigned to the words after lemmatization and normalization, so that a clear picture can be obtained of the frequency of the same word in different corpora and subcorpora.
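The norming step can be illustrated with a short Python sketch: a raw count is divided by the length of its text and multiplied by a fixed baseline length, here the median of the text lengths. The study performs this step in an Excel spreadsheet; the function name normalize and all token totals and counts below are hypothetical values chosen for illustration, not the study's figures.

```python
from statistics import median

def normalize(raw_count, text_length, basis):
    """Scale a raw count from a text of text_length tokens to a text of basis tokens."""
    return raw_count * basis / text_length

# Hypothetical token totals for five subcorpora (not the study's figures).
lengths = {"A": 8000, "B": 9000, "C": 10000, "D": 11000, "E": 12000}
basis = median(lengths.values())   # the middle value, used as the baseline length

# A word occurring 400 times in subcorpus A and 480 times in subcorpus E.
print(normalize(400, lengths["A"], basis))
print(normalize(480, lengths["E"], basis))
```

In this invented case the word is more frequent in E in raw terms (480 against 400) but less frequent once both counts are normed to the same text length, which is the kind of distortion the norming step is meant to remove.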


The frequency count lists of the five subcorpora are the starting point of this analysis. Table 2 displays the most commonly used 15 lexical items in the five subcorpora.

| American Beauty | Shakespeare in Love | Forrest Gump | Schindler’s List | The Silence of the Lambs |
|---|---|---|---|---|
| 378 you | 348 the | 398 i | 310 you | 415 you |
| 340 i | 315 i | 392 you | 245 the | 333 the |
| 265 to | 291 a | 313 the | 230 to | 283 to |
| 187 a | 268 you | 311 to | 219 i | 241 i |
| 165 the | 208 and | 281 a | 183 a | 229 a |
| 155 and | 194 is | 277 and | 151 it | 167 and |
| 126 me | 187 my | 163 was | 91 this | 149 of |
| 111 my | 167 to | 149 that | 89 of | 144 in |
| 107 it | 164 it | 148 of | 83 is | 125 he |
| 100 i'm | 154 will | 139 in | 79 what | 117 it |
| 99 of | 145 of | 139 it | 72 have | 115 me |
| 91 in | 121 in | 137 me | 67 and | 101 is |
| 89 is | 102 for | 131 forrest | 65 it's | 95 for |
| 87 what | 96 me | 119 on | 65 me | 93 your |
| 84 just | 94 have | 95 this | 64 do | 91 that |

Table 2: The raw count of the most commonly used 15 words in the five subcorpora

In Table 2, all the words are in lower case, even those that would normally be capitalized, such as I and Forrest. This behavior is designed into TACT so that, for example, you will not be counted as two different words when it occurs at the beginning of a sentence and within a sentence. Sometimes, this can cause distorted counts. For example, the count of will in Shakespeare in Love includes the count of Will (short for William, the main character), Will as an auxiliary at the beginning of a sentence, and will as an auxiliary within a sentence. In cases like this, the counts need to be adjusted manually.

As previously mentioned, the raw counts of the top 15 words need to be lemmatized, so that accurate counts of each word can be obtained. As the raw counts are based on the graphic word forms, the lemma of the pronoun you[9] will include you, you’ll, you’d, and you’ve, and the lemma of have will include have, I’ve, you’ve, we’ve, and similarly for I, he, it, is, am, was. The indefinite article has two variants a and an.
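The lemmatization in this study is done manually, but the bookkeeping it involves can be sketched in Python as a lookup from graphic word forms to lemmas. The sketch below is an illustration under stated assumptions rather than the researcher's procedure: the table FORM_TO_LEMMAS lists only forms that the text spells out, a contraction such as you've is credited to every lemma that is said to include it, and the raw counts are invented.

```python
from collections import Counter

# Graphic form -> lemma(s) it is counted under. Only forms named in the
# text are listed; other lemmas (I, he, it, is, ...) would be added likewise.
FORM_TO_LEMMAS = {
    "you": ("you",), "you'll": ("you",), "you'd": ("you",),
    "you've": ("you", "have"),
    "have": ("have",), "i've": ("have",), "we've": ("have",),
    "a": ("a(an)",), "an": ("a(an)",),
}

def lemmatize_counts(raw_counts):
    """Add up raw counts of graphic forms under their lemmas."""
    lemma_counts = Counter()
    for form, n in raw_counts.items():
        # Forms without an entry are treated as their own lemma.
        for lemma in FORM_TO_LEMMAS.get(form, (form,)):
            lemma_counts[lemma] += n
    return lemma_counts

# Invented raw counts, for illustration only.
raw = Counter({"you": 10, "you've": 2, "you'll": 1, "have": 3, "a": 4, "an": 1})
print(lemmatize_counts(raw))
```

Crediting a contraction to two lemmas means that the lemmatized counts no longer sum to the number of tokens, so token totals for norming should still be taken from the raw counts.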

Of these 15 words, nine are shared across all five screenplays and will be used as the basis for the horizontal study. Table 3 displays the counts of the nine shared words after lemmatization, in descending ranking order. A close look at the frequency count lists displayed in Table 3 shows an interesting trend: the words owe very little to the context[10] in which they are used. This means that they are all function words, not content words[11].
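The set of shared words can be checked mechanically by intersecting the five top-15 lists of Table 2. The Python sketch below copies the word forms from Table 2; the variable names top15 and shared are illustrative assumptions.

```python
# The 15 most frequent raw word forms per screenplay, as listed in Table 2.
top15 = {
    "American Beauty": ["you", "i", "to", "a", "the", "and", "me", "my",
                        "it", "i'm", "of", "in", "is", "what", "just"],
    "Shakespeare in Love": ["the", "i", "a", "you", "and", "is", "my", "to",
                            "it", "will", "of", "in", "for", "me", "have"],
    "Forrest Gump": ["i", "you", "the", "to", "a", "and", "was", "that",
                     "of", "in", "it", "me", "forrest", "on", "this"],
    "Schindler's List": ["you", "the", "to", "i", "a", "it", "this", "of",
                         "is", "what", "have", "and", "it's", "me", "do"],
    "The Silence of the Lambs": ["you", "the", "to", "i", "a", "and", "of",
                                 "in", "he", "it", "me", "is", "for", "your",
                                 "that"],
}

# Words that appear in every one of the five lists.
shared = set.intersection(*(set(words) for words in top15.values()))
print(sorted(shared))
```

The intersection contains nine forms (a, and, i, it, me, of, the, to, you), all of them function words; in and is each miss one list and are therefore excluded.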

| Ranking Order | American Beauty | Shakespeare in Love | Forrest Gump | Schindler’s List | The Silence of the Lambs |
|---|---|---|---|---|---|
| 1 | 489 I | 348 the | 502 I | 374 you | 504 you |
| 2 | 449 you | 333 I | 428 you | 320 I | 333 the |
| 3 | 265 to | 309 you | 313 the | 245 the | 326 I |
| 4 | 204 a(an) | 301 a(an) | 311 to | 230 to | 283 to |
| 5 | 165 the | 208 and | 300 a(an) | 194 a(an) | 251 a(an) |
| 6 | 155 and | 167 to | 277 and | 151 it | 167 and |
| 7 | 126 me | 164 it | 148 of | 89 of | 149 of |
| 8 | 107 it | 145 of | 139 it | 67 and | 117 it |
| 9 | 99 of | 96 me | 137 me | 65 me | 115 me |

Table 3: lemmatized counts of the shared nine words in the top 15 words with descending ranking order in the five subcorpora

Table 3 presents the data that will be used in the horizontal comparison. As for the vertical comparison, these nine lexical items[12] will be retrieved from the Brown Corpus, the corpus of the five screenplays, and the CSAE. Table 4 displays the lemmatized counts of the nine lexical items in the three types of texts.

Ranking Order   The Brown Corpus   The Corpus of the five screenplays   The corpus of CSAE
1               38274 a(an)        2064 you                             3724 you
2               36410 of           1970 I                               3365 I
3               28854 and          1404 the                             2508 and
4               26156 to           1256 to                              2493 the
5               9083 it            1250 a(an)                           2149 of
6               5589 I             874 and                              1773 me
7               4432 the           678 it                               1760 a(an)
8               3652 you           630 of                               1473 it
9               1183 me            539 me                               1037 to

Table 4: lemmatized counts of the nine lexical items in the Brown Corpus, the corpus of the five screenplays, and the CSAE

Table 3 and Table 4 show the lemmatized counts of the nine lexical items in the five subcorpora and the three corpora. In addition, the lemmatized counts need to be normalized for the purpose of comparison. When norming frequency counts, the total number of words (tokens)[13] in each subcorpus must be taken into consideration. Table 5 shows the tokens in each of the five screenplays (subcorpora).


American Beauty

Shakespeare in Love

Forrest Gump

Schindler's List

The Silence of the Lambs

Number of Tokens






Table 5: total tokens in each of the five subcorpora

The normed counts are obtained by dividing the count of each of the nine lexical items by the total number of tokens in the subcorpus, and then multiplying by 10,000, the approximate median text length of the five screenplays. Table 6 gives the normed counts of the nine shared lexical items. An additional column for the corpus of the five screenplays, normed to the same text length as the five subcorpora, is added for reference.
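The norming procedure amounts to a one-line calculation, sketched here in Python. The function name normed_count and the figures in the example are hypothetical; they are not the token totals of the screenplays.

```python
def normed_count(count, total_tokens, basis=10_000):
    """Norm a frequency count to a common text length.

    count        -- occurrences of the lexical item in the text
    total_tokens -- total number of word tokens in the same text
    basis        -- text length to norm to (10,000 words in this study)
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count * basis / total_tokens

# Hypothetical figures: 50 occurrences in a 2,000-token text.
example = normed_count(50, 2_000)
```

The same function covers the comparison across the three corpora if the basis is set to 50,000 instead of 10,000.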

Ranking order   American Beauty   Shakespeare in Love   Forrest Gump   Schindler’s List   The Silence of the Lambs   The Corpus of the five screenplays
1               549 I             366 the               411 I          501 you            437 you                    419 you
2               504 you           350 I                 350 you        429 I              289 the                    400 I
3               297 to            325 you               256 the        328 the            283 I                      285 the
4               229 a(an)         317 a(an)             255 to         308 to             245 to                     255 to
5               185 the           219 and               246 a(an)      260 a(an)          218 a(an)                  254 a(an)
6               174 and           176 to                227 and        202 it             145 and                    178 and
7               141 me            173 it                121 of         119 of             129 of                     138 it
8               120 it            153 of                114 it         90 and             101 it                     128 of
9               111 of            101 me                112 me         87 me              100 me                     109 me

Table 6: normed and lemmatized counts of the nine lexical items to a basis per 10,000 words of text among the five subcorpora and the corpus of the five screenplays

 Table 7 shows the total number of tokens in these three different types of texts.


The Brown Corpus

The five screenplays


Total number




Table 7: total tokens of the three corpora

Since the text lengths are dramatically different among the three corpora, the Brown Corpus and the CSAE were normed to a basis of 50,000 words of text, which is the approximate size of the corpus of the five screenplays. Because the lexical items analyzed in this study are all function words, and the corpus of the five screenplays is already large enough to include all the possible function words in the English language, the size of the corpora will not influence the comparison of function words in the way it influences the comparison of whole vocabulary, which will be done in the last section. Table 8 gives the normalized counts of the nine lexical items in descending rank order.

Rank order of the nine lexical items   The Brown Corpus   The Corpus of the five screenplays   The corpus of CSAE
1                                      1393 a(an)         2080 you                             2308 you
2                                      1326 of            1985 I                               2082 I
3                                      1059 and           1415 the                             1542 the
4                                      952 to             1265 to                              1552 and
5                                      331 it             1259 a(an)                           1330 of
6                                      203 I              881 and                              1097 me
7                                      161 the            683 it                               1089 a(an)
8                                      133 you            635 of                               911 it
9                                      43 me              543 me                               642 to

Table 8: normalized counts of the nine lexical items to a basis per 50,000 words of text among the three corpora in descending ranking order

Tables 6 and 8 display the comparable counts of the nine lexical items in the five subcorpora and three corpora. These comparable lexical item counts will be the source data used to answer the four research questions, which will be discussed in the next section.

A close look at the nine lexical items allows them to be placed into three categories, as shown in Table 9.

The lexical items   Grammatical categories                     Non-Grammatical Functions
I, you, me, it                                                 I, you, me: interpersonal interaction; it: the event or topic, etc., talked about
the, a(an)                                                     the: given (old) information; a(an): new information
of, and, to[14]     of: preposition; and: conjunction;         sentence or phrase lengths
                    to[15]: preposition or infinitive marker

Table 9:  grammatical categories and non-grammatical functions of the nine lexical items

Another look at the nine lexical items shows that although they are all function words, I, you, me, and it function as “noun phrases”. In addition, I functions as a subject and me as an object, while you and it can function both as subject and as object. When I and me are added up, the frequency of the first person singular used as subject and object stands at the top of the frequency count list (with the exception of The Silence of the Lambs), as displayed in Table 10.


American Beauty

Shakespeare in Love

Forrest Gump

Schindler's List

The Silence of the Lambs

The Corpus of the five screenplays

I + me





















Table 10: first person singular, second person, and third person singular inanimate, function  both as a subject and an object

The, a(an), of, and, to, on the other hand, perform purely grammatical functions. According to their grammatical functions, they are used with different types of content words to form grammatically correct phrases[16]. These words can be classified into three categories according to the grammatical functions performed, as shown in Table 11.

Lexical item(s)                                Grammatical function
the, a(an), of and to (used as preposition)    Used together with noun phrases
to (as infinitive marker)                      Used together with verb phrases
and                                            Connecting parallel structures

Table 11: the three categories of these nine lexical items according to their grammatical functions

Turning now to the distributions of these nine lexical items in the five screenplays, a brief inspection of Table 6 shows a pattern across each of the five screenplays and in the combined corpus of the five screenplays: I, you, the, and a(an) are within the top five most frequently used words. In order to give a clear visual view of how these nine shared lexical items are distributed, the data from Table 6 are displayed in Chart I.

Chart I

In Chart I, a pattern emerges. I, you, the, and a(an) fall within the top five ranks and are used more frequently than me, it, of, and and in the five subcorpora and in the corpus of the screenplays. This indicates that in screenplays, as a literary genre, the interpersonal communication between the speaker and the hearer(s) predominates. Similar amounts of old and new information are given, since all the stories happen in specific settings and new information needs to be supplied frequently by the screenplay writers to keep the story interesting and entertaining. The much higher frequency of I as subject, and the much lower frequency of me as object, strongly suggest that the speaker is in control of the scene. The much lower frequency of it compared to I and you implies that the content of what I and you are talking about is secondary to the interpersonal interaction between the speaker and the hearer(s).

It is tempting to conclude that the variations in the use of the nine lexical items are directly related to the portraying of the characters and the themes of the movies (illustrated in Table 2). The variations of the use of I (the speaker) and you (the hearer(s)) depend very much on how often the I wants to express “myself” and how often the I talks about you or gives you an opportunity to express “yourself”. As for to, and, and of, although they perform different grammatical functions, they all tend to increase the grammatical complexity of the sentences, which, at the same time, increases the sentence length.

There are variations in the way the screenplay writers exploit these function words to portray their characters. For example, an examination of Shakespeare in Love in Chart I shows the highest frequencies of the, a(an) and of, and the lowest frequencies of I and you. Knowledge of English grammar tells us that these high frequencies imply a greater incidence of noun phrases, which, in turn, indicates that the characters spend considerably more time on expository matter than on interpersonal communication.

As for the frequency of the word to, TACT’s UseBase is used to extract the concordance lists in KWIC form across the five screenplays. The instances of to used as a preposition are manually separated from the instances of to used as an infinitive marker. Table 12 displays the raw counts of to used as a preposition and to used as an infinitive marker. Table 13 converts these raw counts into percentages for the sake of comparison.
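A KWIC (key word in context) list of the kind described above can be approximated with a short Python sketch. This is not TACT's UseBase; the function name kwic, the context width, and the sample sentences are assumptions for illustration. The sketch only gathers the contexts. Deciding whether each to is a preposition or an infinitive marker remains a manual step, as in this study.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, key word, right context) for every hit.

    Assumption: words are runs of ASCII letters, optionally containing
    straight apostrophes; matching ignores case.
    """
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text with three instances of "to".
sample = "I want to go to the park. She gave it to me."
hits = kwic(sample, "to", width=2)
for left, key, right in hits:
    print(f"{left:>10}  {key}  {right}")
```

Once each line has been labeled by hand, the percentages follow from dividing the count for each use by the total number of instances.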


American Beauty

Shakespeare in Love

Forrest Gump

Schindler's List

The Silence of the Lambs

The five screenplays

Total number of TO







TO used as preposition







TO used as infinitive marker







Table 12: raw counts of to as preposition and infinitive marker



American Beauty

Shakespeare in Love

Forrest Gump

Schindler's List

The Silence of the Lambs

The five screenplays

TO used as preposition







TO used as infinitive marker







Table 13: raw counts converted into percentages

There is a sharp discrepancy between the use of to as a preposition and to as an infinitive marker across the five screenplays. The high frequency of to as a preposition in Shakespeare in Love supports the earlier result about the intensive use of noun phrases. This suggests that the plot is static and expository in nature. The high frequency of to as an infinitive marker in American Beauty indicates a greater use of verbs, adding to the dynamic nature of the plot and suggesting that the screenplay is more action-based.

Next, the researcher will examine the nine lexical items in the three main corpora. The data in Table 8 is displayed in Chart II to see if there are any emerging patterns or variations among the three corpora.

Chart II

Chart II shows a pattern among you, I and the: they are at the top of the lists for both the corpus of the screenplays and the CSAE. This conforms with the common-sense knowledge that spontaneous speech and screenplay dialogue are both face-to-face interpersonal communication occurring in a specific physical setting. At the same time, the lower frequency of you, I and the in the corpus of the screenplays, when compared to the CSAE, indicates that although the screenplays attempt quite successfully to replicate real speech, they are still different from real speech. The discrepancy in the frequencies of the and a(an) in these two corpora suggests that spontaneous speech is based more on ‘old’, mutually understood information than on supplying new information, when compared to the screenplays. Screenplays need to give new information frequently to entertain the audience and maintain the plot development. In spontaneous speech, the reaction of the hearer(s) is crucial to keep the thread of conversation going. This explains why you and me are used more often in the CSAE than in the screenplays.

In the Brown Corpus (75% factual writing and 25% fiction; http://www.linguistics.ucsb.edu/research/sbcorpus/faqs.html), supplying new information (the use of a(an)) is essential, and personal opinion is of little importance, so there is a vast difference in the frequency of I, you and me between the Brown Corpus and the corpus of the five screenplays. Also, because the Brown Corpus represents the written genre and needs to use language to create a context, the is used vastly less often. The introduction of new information is more often used to build a given context.

Grep[17] is used to obtain the concordance list of to in the CSAE, and the Brown Corpus on-line search engine makes it easy to build a concordance list and gives 2001 instances of the word to. Table 14 displays the raw counts of to used as a preposition and to used as an infinitive marker. Table 15 converts these raw counts into percentages.


The Brown Corpus

The Corpus of the five screenplays






TO used as preposition




TO used as infinitive marker




Table 14: raw counts of to as preposition and infinitive marker


The Brown Corpus

The Corpus of the five screenplays






TO used as preposition




TO used as infinitive marker




Table 15: raw counts of to converted into percentages

A comparison of the uses of to among the three text types shows that in all cases to used as an infinitive marker has a higher frequency than to used as a preposition. This result shows that to is favored as an infinitive marker rather than as a preposition in these three types of texts. At the same time, the greater use of to as an infinitive marker in the CSAE and the screenplays indicates that in spoken English, verb phrases are used more often than noun phrases to lengthen the sentences and, at the same time, to keep the texts more dynamic. Thus, spoken English is more likely to use verb phrases and written English is more likely to use noun phrases to develop the texts, which confirms previous findings by Biber et al.


This paper is both a horizontal comparative analysis of the nine shared lexical items across the five screenplays, American Beauty, Shakespeare in Love, Forrest Gump, Schindler’s List and The Silence of the Lambs, and a vertical comparative analysis of these nine lexical items across three text types: screenplays, the written genre, and the spoken genre.

The frequency counts generated by TACT and Linux text-analysis tools allow the researcher to investigate the nine lexical items in the five screenplays and in the three text types. The high frequencies of I, you, the and a(an) in the five screenplays show what is shared in this literary genre: the predominance of face-to-face communication between characters in specific settings, with a large amount of new information continuously given to keep the story interesting. The higher frequencies of I, you, and the in the five screenplays as well as in the CSAE establish a correlation and indicate that screenplays mirror real speech. This is in sharp contrast with the frequency of I, you, and the in the Brown Corpus (shown in Chart II): because the written genre always has to build up the context with words, it has the highest frequency of a(an). The frequency differences of a(an) and me in the screenplays when compared with the CSAE show some unique characteristics of screenplays: there is more new information and the speaker is in control of the scene, which agrees with the character building of this genre.

In this study, the researcher made use of two advantages that computer tools bring to linguistic analysis: first, computer tools can count the occurrences of linguistic items in texts with tremendous speed and accuracy; second, they permit the researcher to work with collections of data too large to handle manually, and to search readily for patterns in order to arrive at generalizations about language use that go beyond mere intuitions. Therefore, corpus-based analysis not only constitutes an extremely useful technological tool, but can also be looked at as a type of approach that makes it possible to “do new types of investigations and conduct research on scope previously unfeasible” (Biber et al., 1998:105). Without the computer-based corpora and computer programs, the researcher would not have been able to do this lexical investigation objectively, accurately and efficiently, or to answer the research questions successfully.

The only thing that the computer is not yet capable of doing is creating the corpora themselves. This study used laboriously constructed corpora: the five subcorpora of the screenplays and the CSAE. In addition, the researcher has gained an appreciation of the amount of human effort that is required to conduct such a study: creating the five subcorpora was enormously labor intensive, and separating the actual spoken texts from the stage directions was something that could only be accomplished through many painstaking hours of work.

Through analysis of the first-hand textual data of these nine lexical items, we have already seen that there are strong, systematic patterns in the ways these are used within the same text type and across text types. Therefore, a better understanding has been gained of the co-occurring lexical features and patterned differences. All these facts allow the researcher to conclude that there is a definite correlation between the distribution of the nine function words and the text types in which these function words appear.

Topics for Further Study

A preliminary look at the type/token ratios of the five subcorpora, provided by TACT’s TACTStat, opens up some interesting study opportunities in the area of vocabulary richness. Table 16 shows a type/token ratio comparison across the five subcorpora, and the data are graphed in Chart III. An additional row for the corpus of the five screenplays is added for reference.
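The type/token ratio is the number of distinct word forms (types) divided by the number of running words (tokens). A minimal Python sketch follows. It is not TACTStat; the function name type_token_ratio, the tokenizing pattern, and the sample sentence are assumptions for illustration.

```python
import re

def type_token_ratio(text):
    """Return (number of types, number of tokens, type/token ratio).

    Assumption: words are lower-cased runs of ASCII letters, optionally
    containing straight apostrophes.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    types = set(tokens)
    return len(types), len(tokens), len(types) / len(tokens)

# Invented sample: 11 running words built from 5 distinct forms.
n_types, n_tokens, ratio = type_token_ratio(
    "The cat saw the dog and the dog saw the cat"
)
```

A higher ratio means that fewer words are repeated, which is the sense in which the ratio is used here as a measure of vocabulary richness.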

Names of the screenplays

Number of Types

Number of Tokens

Type/Token ratio

American Beauty




Shakespeare in Love




Forrest Gump




Schindler’s List




The Silence of the Lambs




The corpus of the five screenplays




Table 16:  vocabulary richness measured by the type/token ratio


Chart III

There is not much variation in vocabulary richness among the five subcorpora. Of these five screenplays, Forrest Gump has the lowest vocabulary richness and The Silence of the Lambs the highest. This vocabulary richness correlates with the portrayal of the characters. Forrest Gump is a story told by Forrest himself, whose IQ is 75, while The Silence of the Lambs is a contest of intelligence between two brilliant individuals: Dr. Lecter and FBI agent Clarice Starling. Yet it is not surprising that the difference in vocabulary richness is so small, since the language used must be accessible to the movie-going public, and the targeted audience makes the writers keep their vocabulary richness fairly consistent. That is to say, if the richness is too high, the audience may lose interest in the plot or the overall story; if it is too low, the audience will become bored.

Studying the vocabulary richness across the five screenplays appears to open up other interesting areas of study. One interesting follow-up to this study would be to compare the vocabulary richness across all three types of texts. As the researcher had no access to the full Brown Corpus, there was no way to break the corpus down into a number of tokens comparable to that of the other two corpora. Therefore, the type and token counts in the Brown Corpus exhibit a vast discrepancy with those of the other two corpora: the number of types in the Brown Corpus is as high as written English can go without getting into very rare words, and the corpus is so vast that the number of types is diluted by the large number of tokens. This explains why the type/token ratio is extremely low in the Brown Corpus, as displayed in Table 17.


Number of Types

Number of Tokens

Type/Token Ratio

The corpus of the five screenplays




The Brown Corpus








Table 17: vocabulary richness comparison among the three text types
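The dilution effect noted above, in which the ratio falls as a text grows, is the reason corpora of very different sizes are hard to compare directly. One common workaround, which was not used in this study, is to compute the ratio over consecutive samples of equal length and average the results, often called a standardized type/token ratio. The Python sketch below illustrates the idea; the function name mean_chunk_ttr and the toy token lists are hypothetical.

```python
def mean_chunk_ttr(tokens, chunk_size):
    """Average the type/token ratio over consecutive equal-sized chunks.

    Any leftover tokens that do not fill a whole chunk are ignored.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy data: the same four-word vocabulary in a short and a long text.
short_text = ["a", "b", "c", "d"]
long_text = short_text * 25  # 100 tokens, still only 4 types

plain_short = len(set(short_text)) / len(short_text)  # whole-text ratio
plain_long = len(set(long_text)) / len(long_text)     # diluted by length
chunked_long = mean_chunk_ttr(long_text, chunk_size=4)
```

In the toy data the whole-text ratio of the long text is far lower than that of the short one even though both use the same vocabulary, while the chunked measure is the same for both.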

In addition, it would be fruitful to compare the vocabulary richness of the main character(s) in each screenplay with the rest of the characters, since the screenplay is usually focused on the development of the one or two main character(s), and limits the other characters to supporting roles. This too would make for an interesting comparison.

This paper has attempted to sort out the characteristics of screenplays and what makes them a unique linguistic genre. But in conducting this study, the researcher has come to the conclusion that she has only scratched the surface, and that there remain many avenues for additional research on this fascinating topic.

[1] The word genre comes from the French word for “kind”, “class”. The term refers to “a distinctive type of text” (www.aber.ac.uk/media/Documents/intgenre/intgenre1.html).

[2] In this paper, the term “lexical items” refers to both content words and function words.

[3] Word frequency count list is a list of words accompanied by a number indicating how many times that word occurs.

[4] The CD-ROM disks of the CSAE contain both written transcriptions and recordings of the speech digitized into computer files.

[5] TACTFreq can produce three kinds of word frequency count lists, two of which are used in this study: one in alphabetical order, and the other in descending-frequency order.

[6] Raw counts are counts of the total number of occurrences of the graphic word form.

[7] A lemma is “a set of lexical forms having the same stem and belonging to the same major word class, differing only in inflection and/or spelling” (Francis and Kucera, 1982:1, as quoted in Kennedy, 1998:7).

[8] A spreadsheet is a computer program that organizes data into rows and columns. Formulas may be created and applied to the data to perform many different complex functions.

[9] The count of you in Shakespeare in Love includes the occurrences of thou, because thou is a variant of you in Shakespearean English.

[10] The context here is the non-grammatical context, that is, the situation in which the dialogue takes place.

[11] In this paper, open classes and closed classes are used to define these two terms. Function words consist of relatively few words, and new words are not usually added to them, so they are called closed classes. They include conjunctions, prepositions and pronouns. Content words, on the other hand, contain an unlimited number of items. Nouns, verbs, adjectives, and adverbs are open-class words, as new words can be added to these classes.

[12] The frequency counts of the nine lexical items and the number of word types and word tokens in the CSAE were obtained by running two shell scripts (written by Dr. Eileen Fitzpatrick) in Linux. The Brown Corpus has an index and concordance search on line (http://vlc.polyu.edu.hk/ConcordanceIndex/Brown/Default.htm), and all the counts can be obtained by using the online interface.

[13] TACTStat in the TACT suite produces the total counts of types and tokens in a text.

[14] These three lexical items are put into the same category because when each is used, the sentence is lengthened.

[15] To is a homonym, as it functions both as a preposition and as an infinitive marker. It will be discussed separately later.

[16] And, as a conjunction, is the only lexical item among these nine that connects not only phrases but also sentences.

[17] Grep is a very powerful Unix text search utility used to search a file(s) for a particular pattern of characters.


Baker, J. C. 1988. ‘Pace: A Test of Authorship Based on the Rate at which New Words Enter the Author’s Text’. Literary and Linguistic Computing 3 (1): 136-139.

Baayen, R.H., H. van Halteren, and F.J. Tweedie. 1996. ‘Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution’. Literary and Linguistic Computing 11 (3): 121-131.

Biber, D., S. Conrad and R. Reppen.  1998.   Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.

Biber, D. 1993. ‘Representativeness in Corpus Design’. Literary and Linguistic Computing 8: 243-257.

Bradley, J. TACT Design.  Retrieved April 12, 2001 from The Electronic Publishing Facility (EPC) on the World Wide Web http://www.chass.utoronto.ca/epc/chwp/bradley/index.html    

Brown Text Corpus. Retrieved July 6, 2001 from the Virtual Language Center on the World Wide Web http://vlc.polyu.edu.hk/ConcordanceIndex/Brown/Default.htm

Burrows, J. F., and D. H. Craig.  1994.  ‘Lyrical Drama and the “Turbid Mountebanks”: Renaissance and Romantic Tragedy’.  Computers and the Humanities 28: 63-86.

Butler, C.  1985.   Statistics in Linguistics.   New York: Basil Blackwell.

Butler, C.  1985.  Computers in Linguistics.  New York:  Basil Blackwell.  

Chambers, E. (ed.), 2000.  ‘Computers in Humanities Teaching and Research: Dispatches from the Disciplines’.   Special Issue of Computers and the Humanities 43 (3).

Craig, H. 1999. ‘Contrast and Change in the Idiolects of Ben Jonson Characters’. Computers and the Humanities 33: 221-240.

Hatch, E. and A. Lazaraton.  1991.  The Research Manual: Design and Statistics for Applied Linguistics.   Heinle & Heinle Publishers. 

Information of the five screenplays.  Retrieved June 22, 2001 from the Internet Movie Database on the World Wide Web: http://www.imdb.com/

Kennedy, G.  1998.   An Introduction to Corpus Linguistics.   New York: Addison Wesley Longman Inc.

Landow, P. and P. Delany. (eds),  1993.  The Digital Word: Text-Based Computing in the Humanities.  The MIT Press Cambridge, Massachusetts.

Lancashire, I. Computer-Assisted Critical Analysis: A Case Study of Margaret Atwood’s Handmaid’s Tale in Lawler, J. and H. A. Dry. (eds) 1998. Using Computers in Linguistics: A Practical Guide. London and New York: Routledge.

Leech, G. 1992. ‘Corpora and Theories of Linguistic Performance’ in Svartvik: 105-122.

Richards, J., J. Platt and H. Weber. 1985. Longman Dictionary of Applied Linguistics. Longman Group Limited.

Stubbs, M.  1996.   Text and Corpus Analysis.  Cambridge: Blackwell Publishers.

Thomas, J. and M. Short (eds). 1996.    Using Corpora for Language Research.  New York:  Longman Publishing.

Whissell C.  1996.   ‘Traditional and Stylometric Analysis of the Songs of Beatles Paul McCartney and John Lennon’.  Computers and The Humanities 30 (3): 257-265