FAQ
Please read the FAQ carefully before seeking advice in the office hour for corpus-linguistic projects - many of your questions will be answered on this page.
"How can I access the corpus material/introductory material provided on the homepage?"
"What is a corpus? What is a concordancer, etc.?"
"Can you send me the ... corpus?"
"I want to use WordSmith Tools - Do I have to buy a licence?"
"I have no time to come to the corpus-linguistics consultation (korpuslinguistische Sprechstunde) during the office hours. What can I do?"
"I want to write my thesis about a very specific feature of language and I haven’t found a corpus on the English linguistics homepage which reflects it. What can I do?"
"Some frequently used corpora (FROWN, FLOB, BNC) are from the 1990s; do they still reflect current language use?"
"I want to work with the BNCweb. How can I access it?"
"I have decided to create my own corpus. How big does it need to be?"
That depends on what you want to analyze. As a rule of thumb, the more specific your analysis is, the bigger your corpus needs to be. It also depends on the sources for your corpus: depending on your topic, it may be very hard to gain access to texts. Many standard linguistic corpora contain about 1 million words. However, sometimes it is simply not possible to create a large corpus because sources are limited or copyright restrictions apply, and in other cases a much smaller corpus is enough to support your point. In any case, as long as you always state the size of your corpus and describe how it was compiled, you should be on the safe side.
You might also want to consider first compiling a pilot corpus, so that you can check beforehand how many instances you can expect to find in your corpus. Then you can decide on the size of your final corpus.
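The pilot-corpus check can be turned into a rough size estimate. The sketch below is only an illustration: the function name `estimate_required_size` and all figures are hypothetical, and it assumes that the feature occurs at roughly the same rate in the final corpus as in the pilot corpus, which will not always hold.

```python
def estimate_required_size(pilot_hits, pilot_size, target_hits):
    """Estimate how many words are needed to expect `target_hits` instances.

    Assumes the feature occurs at the same rate as in the pilot corpus
    (a simplifying assumption, not a guarantee).
    """
    if pilot_hits <= 0 or pilot_size <= 0:
        raise ValueError("pilot_hits and pilot_size must be positive")
    # Integer ceiling division: round up to the next whole word.
    return (target_hits * pilot_size + pilot_hits - 1) // pilot_hits


# Hypothetical pilot study: 12 instances found in 50,000 words.
# How many words would be needed to expect about 200 instances?
needed = estimate_required_size(pilot_hits=12, pilot_size=50_000, target_hits=200)
print(needed)  # 833334
```

If the estimate is far beyond what your sources allow, that is a sign to broaden the feature under investigation or to lower the number of instances you aim for.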
"Can I discuss theoretical/conceptual aspects of my thesis with you? Is my research question acceptable? Can I write my thesis about XYZ etc.?"
"Do I always have to use raw AND normalized frequencies in my project?"
Usually it is enough to report your raw frequencies once in a data table. In most cases, there is no need to visualize them in a chart. After that, focus on the normalized frequencies throughout because, unlike raw frequencies, they are comparable across corpora of different sizes and should therefore form the basis of your analysis.
You also don't need to compare or discuss differences between your raw and normalized frequencies. As a general rule of thumb, always try to avoid redundancy.
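To illustrate why normalized frequencies are the comparable ones, here is a minimal sketch that converts raw counts to frequencies per million words. The function name `normalized_frequency`, the base of 1,000,000 and the counts are assumptions chosen for illustration. Other bases, such as per 10,000 words, work the same way as long as you use one base consistently.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Return the frequency per `base` words (default: per million words)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size


# Hypothetical figures: the raw count in corpus B is twice as high,
# but corpus B is four times as large as corpus A.
freq_a = normalized_frequency(raw_count=150, corpus_size=1_000_000)
freq_b = normalized_frequency(raw_count=300, corpus_size=4_000_000)

print(freq_a)  # 150.0 per million words
print(freq_b)  # 75.0 per million words
```

In this made-up example the raw counts suggest that the feature is more common in corpus B, while the normalized frequencies show that it is in fact only half as frequent there.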
"How do I cite corpora and corpus software in my list of references?"
There are different ways to cite corpora in your list of references, depending on the style sheet you are using. Below, you can find some examples. Please note that you should use the full name of the corpus and not only an abbreviation such as BNC or GloWbE.
In your list of references, the corpus should appear under the name of its editor. Only if no editor is named should the corpus be listed under its own name.
- Davies, Mark (2013): Corpus of Global Web-based English: 1.9 billion words from speakers in 20 countries. Available online at: <URL>.
- The British National Corpus, version 2 (BNC World), 2001. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. Available online at: <URL>.
Corpus software can be indicated as in the following example:
- Anthony, L. (YEAR OF RELEASE). AntConc (Version VERSION NUMBER) [Computer software]. Tokyo, Japan: Waseda University. Available from <URL>.