Week: 21/03 – 25/03
Lecture Room: Building 1 – Room 1.38
Schedule: 09:30 – 12:30 | 14:00 – 17:00
Teachers:
Contents:
Topics covered in this module include:
1. Foundations of corpus linguistics
- principles and methods of corpus analysis
- applications of corpus data in lexicography
- types of corpora, overview of existing corpora
- corpus design, representativeness, data sources, metadata
2. Corpus compilation
- building corpora from online data: web scraping etc.
- boilerplate removal, normalization, metadata extraction
- representation and exchange formats
- online and stand-alone tools for web corpus compilation
- automatic linguistic annotation (POS, lemma, NER, parsing, …)
- online and stand-alone tools for linguistic annotation
3. Searching corpora
- regular expressions
- character encodings and the Unicode standard
- CQP query language for lexico-grammatical patterns
- practical exercises with Sketch Engine and CQPweb
4. Quantitative analysis
- frequency lists and metadata distribution
- collocations and word sketches
- keyword analysis
- lexicographic interpretation of results
- foundations of statistical inference
5. Reproducibility
- research methodology and documentation
- data management, sustainability of corpus resources
Please see the module description for further information.