alignment
linking (associating) corresponding elements of the source and target texts in a parallel corpus; can be done automatically or semi-automatically
tag
a code attached to a word in a text to represent some feature of that word, or to mark up a physical element of the text
concordance
a list of occurrences of a word or set of words shown in context
KWIC (key word in context)
a type of concordance where the key word is shown with x words of context on either side, centred on the page
KWAL (key word and line)
a type of concordance which allows one or more lines of context either side of the key word
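A KWIC display can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `kwic`, the whitespace tokenisation, and the fixed 30-character left column are assumptions made for the example, not features of any particular concordancer.

```python
def kwic(tokens, keyword, span=4):
    """Return one line per hit, with the key word placed between its co-text."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `span` words on each side of the key word.
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            # Right-align the left context so the key words line up in a column.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()  # naive whitespace tokenisation, for illustration only
for line in kwic(tokens, "sat", span=3):
    print(line)
```

A KWAL display would differ only in what is returned for each hit: whole lines of the source text rather than a fixed number of words.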
span
measurement (in words) of the co-text appearing on either side of the word selected for study (e.g. -4, +4)
chunking
roughly dividing sentences into non-overlapping segments
disambiguation
eliminating ambiguity by choosing a specific tag (code) from the available options
encoding
representing textual and linguistic data (corpus annotations, tags) in a certain format, usually standardized
SGML (Standard Generalized Markup Language)
an internationally recognized text encoding standard widely used in corpus processing
parsing
assigning syntactic structure to a text; a common form of corpus annotation
treebank
a parsed corpus
full parsing
a type of parsing that tries to provide the most detailed sentence structure possible
skeleton parsing
a less detailed approach to parsing that ignores finer points of structure
corpus (pl. corpora)
any body of text(s), especially machine readable ones
parallel corpus
(a.k.a. aligned corpus, translation corpus) a corpus containing different language versions of the same texts
comparable corpus
a set of corpora in different languages, each following the same compositional pattern
monitor corpus
a growing and consistently structured collection of texts used mainly in lexicography to reflect language change
monolingual corpus
a corpus of texts in a single language
multilingual corpus
a collection of individual monolingual corpora in different languages
unannotated corpus
a corpus that exists as raw plain text
opportunistic corpus
a corpus that may be in many ways deficient (raw, unannotated, incomplete) but is otherwise cheap and easily available
frequency list
a list of lexical items ordered by frequency count
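As a rough illustration, a frequency list can be built with Python's standard `collections.Counter`. The function name `frequency_list` and the very simple tokenisation rule (lowercased runs of letters and apostrophes) are assumptions for this sketch; real corpus tools tokenise far more carefully.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs ordered from most to least frequent."""
    # Assumed tokenisation: lowercase the text, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()


sample = "The cat sat on the mat. The dog sat too."
for word, count in frequency_list(sample)[:3]:
    print(word, count)
```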
dictionary: a collection of words and related information
CAT (computer-aided translation)
computer systems and software that help human translators work more effectively and reliably
authenticity
the quality that characterizes naturally occurring corpus data
collocation
the characteristic co-occurrence of patterns of words
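Co-occurrence within a window of co-text (such as -4, +4) can be counted as in the sketch below. The function name `collocates` and the toy sentence are assumptions for the example. Note that raw counts are only a first step: deciding whether a co-occurrence is "characteristic" normally requires a statistical association measure, which this sketch does not attempt.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words occurring within the given window around each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts


tokens = "strong tea and strong coffee but powerful computers and strong tea".split()
print(collocates(tokens, "strong", left=1, right=1).most_common(3))
```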
error tagging
assigning codes to indicate the types of errors occurring in a learner corpus
metadata
data about data, typically contextual information about corpus samples (e.g. where they came from)
sorting
arranging items in a given order
wildcard
a special character used in searching or matching to stand for other characters (often * for any sequence of characters and ? for any single character)
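Wildcard matching can be tried out with Python's standard `fnmatch` module, which follows the shell convention in which * stands for any run of characters (including none) and ? for exactly one character. The helper name `wildcard_search` and the word list are assumptions for this sketch; individual corpus tools may define their wildcards differently.

```python
from fnmatch import fnmatchcase


def wildcard_search(words, pattern):
    """Return the words that match a shell-style wildcard pattern (case-sensitive)."""
    return [w for w in words if fnmatchcase(w, pattern)]


words = ["corpus", "corpora", "concordance", "cat", "cot", "coat"]
print(wildcard_search(words, "corp*"))  # words beginning with "corp"
print(wildcard_search(words, "c?t"))    # three-letter words: c, any one character, t
```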
corpus analysis
statistical probing, manipulating and generalizing from the corpus dataset
keyword
a significantly frequent (or infrequent) word selected for study
annotation
tagging texts with various forms of information (phonetic, prosodic, syntactic, semantic, pragmatic, etc.)
regular expression
a search pattern that combines literal characters with wildcards and other special operators, used for complex searches
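A short example using Python's standard `re` module shows how one pattern can retrieve several word forms at once. The sample sentence and the pattern are assumptions chosen for illustration; regular expression syntax varies slightly between tools.

```python
import re

text = "She was parsing the corpus; he parsed it too, and the parser agreed."

# \b marks a word boundary; (?:...) groups the alternative endings without capturing.
pattern = re.compile(r"\bpars(?:e|es|ed|ing)\b", re.IGNORECASE)

print(pattern.findall(text))
```

Here "parser" is not retrieved, because none of the listed endings is followed by a word boundary at that point in the text.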