Glossary

from Using Computers in Linguistics: A Practical Guide
Edited by John M. Lawler and Helen Aristar Dry
Routledge 1998

  • alias: A user-designated synonym for a Unix command or sequence of commands. Differs from a variable in that its value does not change: e.g., if you designate m to be your alias for mailx, then typing m will always run this mail program. Differs from a script in that scripts are normally stored in executable files, while aliases are loaded as part of the shell environment directly (and are thus simpler and faster). Aliases are a facility provided by the C-shell (csh) and its successors, like tcsh.

  • alphanumeric: Of ASCII characters, any string composed of only upper- or lower-case English letters or Arabic numerals.

  • anonymous ftp: Downloading files from a public-access Internet machine, i.e., one which allows a remote user to log in as "anonymous" and transfer files using the ftp protocol even if the user does not have an account on the machine.

  • Archie: An Internet search facility that searches through directory and file names (and in some instances through file descriptions) in order to determine whether a particular string is present. If you ask an Archie server to find the string "phone" it will return the names of files that include this word, whether it refers to a sound or a telephone.

  • argument: As in mathematical or logical usage, specifying a value to be operated on by a function or other command. By default, this is usually interpreted as a file name. In the command cat message, the argument is message, which is subcategorized as a file name by cat.

  • ARPA:
    See DARPA.

  • ATIS: The Air Travel Information System evaluations were a series of evaluations of speech recognition and spoken language understanding systems sponsored by DARPA. These evaluations began in 1990 and ended in 1995. They are responsible for the development of a corpus of approximately 20,000 utterances regarding air travel, grouped by speaker, session, and data collection site. The ATIS corpus is distributed by the Linguistic Data Consortium.

  • ASCII: The American Standard Code for Information Interchange is a standard character set that maps character codes 0 through 127 (low ASCII) onto control functions, punctuation marks, digits, upper case letters, lower case letters, and other symbols.

  • ASCII file: A data file, typically a text file with hard line breaks, that contains only character codes in the numeric range 0 to 127 (low ASCII), and interprets them according to the ASCII standard.

  • ASCII, high: The unstandardized highest half (character codes #128-#255) of the 256 characters in ASCII. While low ASCII is standard worldwide, high ASCII characters vary from one hardware platform to another, or even from one software program to another.

  • attribute: In SGML, a qualifier within the opening tag for an element which specifies a value for some named property of that element. In an object-oriented database, a named property of an object which not only holds information about a particular instance of an object, but also encapsulates behavior (such as integrity constraints and a default value) that is true of all instances of the class of objects.

  • backquote convention: A facility allowing indirect reference in Unix commands, by using the output of one command, enclosed within backquote characters (`, ASCII #96), as an argument to another command. For instance, in the command finger `whoami`, first the whoami program is run, returning the login id of the user; this is in turn passed as the argument for the command finger, which returns information about a user.

  • batch processing: Running a computer program without any interaction with the process as it goes along. Sometimes called background processing.

  • binary, octal, decimal, hexadecimal: Four common arithmetic bases (2, 8, 10, and 16, respectively) widely used in computing. Computers use binary numbers internally, and octal and hexadecimal numbers are easily converted to binary (and vice versa). Decimal numbers are the norm in text, as usual; binary numbers, consisting of only 0 and 1, are easily recognized; octal numbers (now obsolete) use only the decimal digits [0-7]; hexadecimal (also called hex) numbers contain the normal decimal digits [0-9], and add [A-F] to represent ten through fifteen as single "digits". These "digits" are pronounced as (English orthographic) letters, rather than extending conventional morphology; i.e., hex "A5" is pronounced "A-five", not "*eleventy-five".
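As a brief illustration of the four bases, the following Python sketch writes one value in each notation and converts it back. The value is the hex "A5" used in the entry above; the variable names are only illustrative.

```python
# The hex value A5 from the entry above, written in all four bases.
value = 0xA5  # hexadecimal literal; equal to decimal 165

as_binary = format(value, "b")   # base 2
as_octal = format(value, "o")    # base 8
as_decimal = format(value, "d")  # base 10
as_hex = format(value, "X")      # base 16, upper-case digits

print(as_binary, as_octal, as_decimal, as_hex)

# int() with an explicit base converts each string back to the same number.
round_trip = [int(as_binary, 2), int(as_octal, 8), int(as_decimal, 10), int(as_hex, 16)]
print(round_trip)
```

Each hex digit corresponds to exactly four binary digits (A is 1010, 5 is 0101), which is why the two notations convert so easily.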

  • BinHex: More accurately BinHex 4.0. The standard Macintosh format used when a binary file must be converted into an ASCII file so that it may be safely transferred through a network. It preserves the data fork and resource fork which all Macintosh files must have. Do not confuse it with BinHex 5.0, which is not an ASCII format. All BinHex files should by convention carry the extension .hqx.

  • bit, byte: Related terms for small units of information. Bit is an acronym for binary digit, the smallest possible unit of information: i.e., a single yes or no (1 or 0), in context. A byte is a unit consisting of eight bits, in order. There are 2^8 (= 256) possible bytes (combinations of 0 and 1), and thus 256 possible characters in ASCII, each with a unique byte value. Computer memory is normally specified in kilobytes, megabytes, and gigabytes.

  • "black box" evaluation: The evaluation of a complex system by examining only inputs to the system and outputs from the system, ignoring intermediate results and internal states.

  • browser, or web browser: A piece of software which retrieves and displays World Wide Web files. It acts as an interface to Internet protocols like ftp and http. Common browsers include Netscape, Internet Explorer, and Mosaic.

  • BSD, SysV: Two competing dialects of Unix. BSD is an acronym for Berkeley Software Distribution, an academic version developed at the University of California at Berkeley. SysV stands for System V, a commercial version originally developed by AT&T. The two systems are incompatible in some ways, though they are converging in the latest versions.

  • base character: A character to which an overstriking diacritic is added.

  • character: The minimal unit of encoding for text files. A character typically corresponds to a single graphic sign, like a letter of the alphabet or a punctuation mark.

  • character code: A numerical code in a data file which represents a particular character in text.

  • character set: The full set of character codes used for encoding a particular language.

  • client: See server.

  • COCOA: A method of text encoding used by the Oxford Concordance Program and other software.

  • collating sequence: The sorting order for all the characters in a character set.
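A small Python sketch may make the effect of a collating sequence concrete. It sorts the same hypothetical word list twice: once by raw character code, and once with a case-insensitive key, which is only one possible collating sequence among many.

```python
words = ["banana", "Zebra", "apple", "Cherry"]

# Character-code order: every upper-case letter sorts before any lower-case one.
by_code = sorted(words)

# A case-insensitive collating sequence: letters are compared without regard to case.
by_folded = sorted(words, key=str.casefold)

print(by_code)
print(by_folded)
```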

  • command: A unit in a linguistic (i.e., written-language-based) interface to a computer program or operating system; Unix and DOS have command-line interfaces, in which the user types commands which are then executed. Command-line systems are powerful but complex; they can be added to and customized. They are the earlier of the two principal user interfaces (the other is the Graphic User Interface, or GUI).

  • composite character: A single character which is a composite of two or more other characters. For instance, à is a composite of a (the base character) and ` (a diacritic).
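The decomposition of a composite character can be sketched with Python's standard unicodedata module. Note the assumption: this uses Unicode, an encoding this glossary does not otherwise discuss, in which the diacritic is a separate combining accent character rather than the free-standing ` character.

```python
import unicodedata

composite = "\u00e0"  # à stored as a single precomposed character

# NFD normalization splits the composite into base character plus diacritic.
decomposed = unicodedata.normalize("NFD", composite)
print([unicodedata.name(ch) for ch in decomposed])

# NFC normalization recombines the pair into the single composite character.
print(unicodedata.normalize("NFC", decomposed) == composite)
```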

  • concordance: A list of words, normally in alphabetical order, where each occurrence of each word is shown with surrounding context and identified by a reference indicating where it occurs in the text.
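A minimal sketch of a concordance in Python, under simplifying assumptions: words are split on white space, the reference is simply the word's position in the text, and the context is a fixed number of words on either side. The function name and the sample text are hypothetical.

```python
def concordance(text, width=2):
    """Return sorted (keyword, position, context) entries for every word."""
    words = text.split()
    entries = []
    for i, word in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        # Bracket the keyword so it stands out within its context.
        context = f"{left} [{word}] {right}".strip()
        entries.append((word.lower(), i, context))
    return sorted(entries)

sample = "the cat saw the dog"
for keyword, position, context in concordance(sample):
    print(f"{keyword:5} {position:2}  {context}")
```

Real concordance programs offer far more control over word division, references, and sorting than this sketch suggests.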

  • control character, control-shift, Ctrl: The most common and most standard of the ASCII metacharacters. ASCII keyboards contain a Shift key, which produces upper-case characters (# 41H through 5AH) when pressed, instead of lower-case (# 61H through 7AH). The Control-Shift key, by analogy, produces Control characters (# 01H through 1AH). These are non-printing and in principle have standard uses, though in practice they vary greatly. They are often represented by prefixing caret (^, ASCII #94) to the appropriate alphabetic character; thus ^M represents CR or Carriage Return, sent by the Return key on all keyboards, and by the Enter key on most.
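The relationship between a letter and its control character can be sketched in Python: keeping only the low five bits of the letter's ASCII code gives the code of the corresponding control character. The function name control_char is hypothetical.

```python
def control_char(letter):
    """Return the control character conventionally written ^letter."""
    # Keep only the low five bits of the upper-case letter's ASCII code.
    return chr(ord(letter.upper()) & 0x1F)

print(repr(control_char("M")))    # ^M, the carriage return
print(ord(control_char("A")))     # ^A has code 1
```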

  • corpus:
    1. In general, a collection of electronic text, usually compiled on a principled or systematic basis for the purposes of linguistic and other research.
    2. In computational linguistics, a body of linguistic data, either text or speech, intended to support the study of linguistic phenomena. This data may be annotated in some way to enhance its usefulness. Examples of corpora include the Penn TreeBank and the ATIS corpus.

  • daemon (less commonly demon): A pre-activated program that is always ready to perform its task (as opposed to one that must be called by the system activation software in response to a specific need). Web server programs are usually run as daemons, for example.

  • DARPA: The Defense Advanced Research Projects Agency, a branch of the United States Department of Defense responsible for a wide range of research and applications development, and a long-time funder of research in language processing. For a number of years in the mid-1990s (1993 to 1996), this organization was known as ARPA. Its Web site is http://www.darpa.mil/.

  • diacritic: A small mark (such as an accent mark) added above, below, before, or after a base character to modify its pronunciation or significance.

  • digital image: An electronic representation of a page of text or other material which is a picture of the page, rather than a transcription of the text. A fax is a digital image, for instance, while the wordprocessor file that produces the page that is faxed is not, since it can be searched.

  • directory: A collection of files that are notionally "in" the same "place." Every Unix user has a home directory, in which one's files may be stored; it usually has the same name as the login id of the user, and may be referenced as $HOME or by the tilde convention (~ is $HOME, ~jlawler is jlawler's home directory). At any time in a Unix session, a user has a current directory, which may be changed with the cd command. Usually called folder in GUI systems.

  • DNS: Domain Name Server. An Internet machine that knows the names and IP addresses of other machines in its subnet. When you attempt to connect to the Internet, your request goes to a DNS, which translates an address like emunix.emich.edu into an IP number like 35.1.1.42 and forwards your connection request to that IP address.

  • domain model: In computational linguistics and artificial intelligence, a symbolic representation of the objects and relationships in a particular segment (domain) of the world.

  • dot files: In Unix, special ASCII files placed in one's home directory to control various programs and set customized parameters. Their names begin with period ("dot", ASCII # 46) and are by default not shown by the ls program. Examples are .cshrc, which contains commands and definitions for the csh shell; .newsrc, for customizing news readers like trn; and .login, which contains commands executed once at the beginning of each Unix session.

  • DTD: Document Type Definition, the definition of the markup rules for an SGML document.

  • editor: A program that allows one to create, modify, and save text files. Virtually all popular editors (pico, emacs, vi) on Unix are screen editors, like wordprocessors. Early Unix line editors (ed, ex) operate with commands instead of direct typing; i.e., to correct a mistake like fase, you might enter the command replace s with t, rather than just overstriking the s with t.

  • element: In an SGML file, a single entity delimited by a start tag and an end tag. For instance, a title element might be delimited by <title> and </title>.

  • encoding: The manner in which information is represented in computer data files. Character encoding refers specifically to the codes used to represent characters. Text encoding refers specifically to the way in which the structural information in text is encoded.

  • entity: In SGML, a named part of a marked up document. An entity can be used for a string of characters or a whole file of text. Special characters (like "Ê") are normally represented by entities (like "&Ecirc;") in SGML.

  • escape (n): An ASCII control or metacharacter (#27, ^[) with its own key on most keyboards, intended originally to signify escape (v) (sense 1). While it has been put to a number of different uses over the decades, it is still often used to pause or terminate a program or process. Frequently called Meta in some programs, notably emacs, where it is a common command prefix.

  • escape (v):
    1. To pause a running program and return control temporarily to the operating system, usually in order to run some other program. In Unix, the exclamation point (ASCII #33, !, pronounced "bang") is an escape character that can be used in most programs to accomplish this.
    2. To cancel the default (meta-)interpretation of the following character in a string and interpret it literally instead. Thus, while the unescaped (meta)expression "." matches any character, the regular expression "\." matches a literal period or full stop character only, because it is escaped by the preceding "\".
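Sense 2 can be seen in a short Python sketch using the standard re module; the sample string is arbitrary.

```python
import re

text = "a.b"

# Unescaped, the dot is a metacharacter matching any single character.
any_char = re.findall(r".", text)

# Escaped with a backslash, it matches only a literal period.
literal_dot = re.findall(r"\.", text)

print(any_char)
print(literal_dot)
```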

  • executable: A file name that can be used as a command, consisting either of a script of commands to be executed by typing the name, or of true compiled binary program code. In the latter sense (also called binaries), the executable(s) is sometimes used to distinguish compiled binary code from its human-readable programming-language source: "He gave me the executable, but I needed the source files."

  • field: In a database, a subdivision of a record which stores information of a particular type.

  • file: A collection of information encoded in computer-readable form (normally in bytes) and associated with a single name by which the computer's operating system stores and retrieves it.

  • filter: A type of program especially common in Unix in which a file or other data stream (by default, the standard input) is read serially, modified in some regular way, and sent (in modified form) to some other file or stream (by default, the standard output), without any change to the original data source. There are many languages for creating simple text filters in Unix, like sed, awk, and perl.
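A filter of the kind described above might be sketched in Python as follows. The upper-casing transformation and the function name are arbitrary choices for illustration, and in-memory streams stand in for the standard input and output so that the example is self-contained.

```python
import io

def upper_case_filter(source, sink):
    """Copy source to sink line by line, converting each line to upper case."""
    for line in source:
        sink.write(line.upper())

# In real use one would pass sys.stdin and sys.stdout instead of these stand-ins.
source = io.StringIO("first line\nsecond line\n")
sink = io.StringIO()
upper_case_filter(source, sink)

print(sink.getvalue(), end="")
```

The data is read serially, one line at a time, and the original source is left unchanged, as the definition requires.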

  • folder: Synonym for directory (metaphorically, a place to put files), used in Macintosh, NeXT, Windows 95, and some other Graphic User Interfaces. See GUI.

  • font: A collection of bitmaps or outlines which supply the graphic rendering of every character in a character set.

  • font system: A subcomponent of an operating system which gives all programs and data files access to multiple fonts for rendering characters.

  • (file) format: The encoding scheme, often proprietary, in which the information in a file is marked up. Wordprocessing files created by different software are usually incompatible in format to some extent. To read one program's files using a different program requires format translation, which may be built into a full-featured wordprocessor, but is often a separate step requiring separate software. Many formats are in use; a frequent feature of upgrade versions of popular microcomputer software is a different (and usually incompatible) standard file format, and there are different standards and versions for different countries and languages.

  • frequency profile: In a concordance or similar program, a table showing how many words occur once, twice, three times, etc., up to the most frequent word.
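A frequency profile can be sketched in Python by counting twice: first how often each word occurs, then how many words share each count. The function name and sample text are hypothetical.

```python
from collections import Counter

def frequency_profile(words):
    """Map each frequency n to the number of distinct words occurring n times."""
    word_counts = Counter(words)             # word -> number of occurrences
    profile = Counter(word_counts.values())  # occurrences -> number of words
    return dict(sorted(profile.items()))

sample = "a rose is a rose is a rose".split()
print(frequency_profile(sample))
```

In this sample, one word ("is") occurs twice and two words ("a" and "rose") occur three times.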

  • ftp: Internet File Transfer Protocol, a way of sending files from one Internet machine to another.

  • generalized markup: The discipline of using markup codes in a text to describe the function or purpose of the elements in the text, rather than their formatting.

  • glyph: In character-encoding, the shape or form of a printed or displayed character, as opposed to a pairing of form and interpretation.

  • gopher: An Internet search facility, which allows the user to search through a hierarchically organized set of menus in order to find a particular file. Gopher menus categorize files according to content (e.g., "libraries," "phonebooks"), as determined by a human being, not a computer.

  • GUI: A Graphic User Interface is one invoking visual rather than linguistic metaphors, often employing menus, non-text input devices like a mouse or trackball, and icons employing visual symbolism and metaphor, like a desktop with paper files on it. Contrasts with command-line interface.

  • Hidden Markov Model (HMM): A Hidden Markov Model is a statistical model of the distribution of "hidden" features, such as phonemes or part-of-speech tags, based on observable features, such as acoustic segments, or words. The computational models can be automatically trained from data samples, and then used to recognize the "hidden" layer, based on the statistical model derived from the training data.

  • homograph: A word which has the same spelling but different meanings, e.g. lead as a verb "to lead" and as two different nouns: "a leash", and the metal.

  • HTML: Hypertext Markup Language is a method of marking a document that is to be displayed by a web browser. An application of SGML, it consists primarily of formatting tags, like <b><i>boldface italic</i></b> for boldface italic.

  • http: Hypertext transfer protocol. A way of sending hypertext documents over the Internet.

  • hypertext: A non-linear version of text presentation with embedded links to other information. The basis of the World Wide Web and of the Internet protocols employed on the Web.

  • hypothesis: In corpus-based linguistics, an annotation produced by an annotation procedure which can be checked against an annotation key.

  • index: A list of words, normally in alphabetical order, where each word is accompanied by a list of references indicating where that word occurs in the text. Sometimes also called a word index.
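A word index of this kind can be sketched in Python, assuming that references are line numbers and that words are separated by white space. The function name and sample lines are hypothetical.

```python
from collections import defaultdict

def word_index(lines):
    """Map each word to the sorted list of line numbers where it occurs."""
    index = defaultdict(set)
    for line_number, line in enumerate(lines, start=1):
        for word in line.lower().split():
            index[word].add(line_number)
    # Alphabetical order of words, ascending order of references.
    return {word: sorted(refs) for word, refs in sorted(index.items())}

sample_lines = ["the cat sat", "the dog ran", "a cat ran"]
for word, refs in word_index(sample_lines).items():
    print(word, refs)
```

Unlike a concordance, the index records only where each word occurs, not its surrounding context.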

  • information extraction: In computational linguistics, the process by which information in a form suitable for entry into a database is generated automatically from textual documents.

  • input-output (I/O) redirection: Process (and capability) allowing a program (typically a filter program) to take its input from some other program, and/or send its output to another. A characteristic feature of Unix, much copied in other operating systems. The control structure implementing this is called a pipe, and the vertical bar ("|") symbol is used in the Unix command line to represent this.

  • interactive retrieval: The process of searching or querying a text and getting an instant response. The query is performed on an index which has been built previously.

  • IP number, IP address: A four-part number which uniquely identifies an Internet machine, giving the net and subnet to which it belongs. The IP number 35.1.1.42, for example, designates the Domain Name Server of the University of Michigan (at press time -- IP numbers are subject to change without notice), and tells us that it is part of net 35 and subnet 1. Part of the Internet Protocol.

  • key:
    1. An individual button on a keyboard; by extension, the character(s) or command(s) it signals.
    2. In searching, a synonym for search string.
    3. In indexing or database management, the most important field, in the sense that it uniquely identifies an item.
    4. In corpus-based linguistics, a benchmark against which the accuracy of an annotation procedure can be compared.

  • Kleene closure: In regular expressions, the use of asterisk (*, ASCII #42) as a special character to indicate "any number of" the preceding character (including zero, or "none of"). Combined with the use of the special character dot (i.e., period, ASCII #46) to represent "any character", the regular expression idiom ".*" represents "any string". Named after the logician Stephen Kleene.
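The behavior of the asterisk can be checked with a short Python sketch using the standard re module; the test strings are arbitrary. Note that in Python, by default, the dot does not match a newline character.

```python
import re

# "ab*" means an "a" followed by any number of "b"s, including none.
pattern = re.compile(r"ab*")
matches = [bool(pattern.fullmatch(s)) for s in ["a", "ab", "abbb", "b"]]
print(matches)

# ".*" matches any string on a single line, even the empty one.
any_string = re.compile(r".*")
print(bool(any_string.fullmatch("")), bool(any_string.fullmatch("anything")))
```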

  • lemmatization: The process of putting words under their dictionary headings, for example, "go", "going", "gone", "went", under "go".
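In its simplest form, lemmatization is a table lookup, as in this Python sketch. The table LEMMAS is hypothetical and covers only the example above; a real lemmatizer needs a full lexicon or morphological rules.

```python
# A hypothetical lookup table covering only the example in the entry above.
LEMMAS = {"go": "go", "going": "go", "gone": "go", "went": "go"}

def lemmatize(word):
    """Return the dictionary heading for word, or the word itself if unknown."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)

print([lemmatize(w) for w in ["Went", "going", "gone", "walked"]])
```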

  • line: A unit of organization in a text file including all the characters up to and including the line end character (either carriage return, line feed, or both, depending on operating system).

  • link: An embedded URL in a hypertext document. Links are created in HTML using the <a ..> "anchor" tag, and are displayed in a browser as emphasized text (usually blue and underlined). When one clicks on a link, the browser requests the file and displays it.

  • Linguistic Data Consortium: The LDC is an open consortium of universities, companies and government research laboratories which creates, collects and distributes speech and text databases, lexicons, and other resources for research and development in computational linguistics. It is hosted at the University of Pennsylvania. Its Web site is http://www.ldc.upenn.edu.

  • loop: A programmed repetition of a set of instructions, typically with incrementation of some index value. The instructions will then be repeated on each member of the indexed set of values. Implemented by the for, while, or do structures in many computer languages.

  • machine learning: In computational linguistics and artificial intelligence, a set of techniques which allow a computer program to improve its performance iteratively on a chosen task. See training corpus.

  • markup: Codes added to the stream of an encoded text to signal structure, formatting, or processing commands.

  • metacharacter: A character or (shift-)key to be interpreted as modifying the value of the character (or key) following it in a string (or produced simultaneously in typing), either by prefixing a special character ("^X-Q terminates the program"), or by interpreting it literally, thus escaping the default special interpretation of the following character.

  • MIME: Multi-purpose Internet Mail Extensions. A way of sending files of different types (e.g., graphics, sound, or word-processor files) via email without converting them into ASCII, or plain text. None of the original information will be lost, and, if the recipient has a MIME-compliant mailer program, it will call up the proper program needed to display or play the files.

  • MUC: The Message Understanding Conference refers to a series of evaluations of text-based language processing systems sponsored by DARPA. These conferences are responsible for a series of corpora covering increasingly difficult information extraction tasks and subtasks.

  • multi-user, multi-processing: Two independent characteristics of desirable operating systems, both found in Unix. A multi-user system is one that allows several users to run commands simultaneously without having to take turns. A multi-processing system is one that allows any user to run several commands simultaneously without having to wait until each is done (serial processing). Multi-processing is also called parallel processing.

  • news: An Internet utility that allows users to download (notionally, "read") "articles" posted to "newsgroups" by other users interested in the topic the newsgroup was formed to discuss. Also called "Usenet". The newsgroup sci.lang, for example, is dedicated to discussing the science of language. To read news, you need a news client like trn and access to a news server, such as those established at most universities.

  • normalization: The process of organizing a database in such a way that no piece of information occurs more than once in the database.

  • object: The fundamental unit of information modeling in the object-oriented paradigm. In principle, there is a one-to-one correspondence between notional 'objects' in the data model and the actual entities in the real world which are being modeled. (This is not true, in general, of the data structures of conventional programming languages or database systems, and is less true in practice than in theory of official object-oriented languages and databases.) An object stores state information (like the field values of a database record; notionally nouns) and it stores behavioral information (called methods; notionally verbs) about what computations can be performed on an instance of the object. The information stored in an object is encapsulated in that it is not visible directly; it can only be seen by sending a message to the object which asks it to perform one of its methods.

  • object-oriented: A modern paradigm of programming which models information in terms of objects. Computation occurs when one object receives a message from another asking it to perform one of its methods, i.e., special subroutines subcategorized for each type of object. The object-oriented approach, in which the data and the program behavior are encapsulated in the objects, contrasts with the conventional approach to programming, in which a monolithic program operates on data which is completely separate. Object-oriented programming is more amenable to modelling parallel processing.

  • object-oriented database: A database system which models entities in the real world as objects and follows the object-oriented paradigm of programming.

  • open: Of software, especially an operating system, signifying that it conforms to a well-known internal architecture and set of standards, or that it is not restricted to use on a single brand of computer, or that it is manufactured and maintained by many vendors, or some combination of these. Contrasts with proprietary.

  • operating system (OS): The basic software that runs a computer, managing all other software and apportioning computing resources to avoid conflicts. Windows, DOS, and Unix are examples of operating systems.

  • optical character recognition (OCR): A method of creating electronic text by automatically analyzing a digital image of a page of text and converting the characters on that page to ASCII text.

  • Oxford Concordance Program (OCP): A flexible batch processing program for generating concordances, word lists and indexes from many kinds of texts.

  • padding letter: A letter or other character that does not affect the sorting of words.

  • parallel corpus: A text corpus containing the same text in multiple languages. Such corpora are used for training corpus-based machine translation systems, for example. The Rosetta Stone is an example of a parallel corpus.

  • part-of-speech (POS) tagging: The process of assigning lexical categories (that is, part-of-speech tags) to words in linguistic data. This process can be performed automatically with a high degree of accuracy (above 95% in English) without reference to higher-level linguistic information such as syntactic structure.

  • path:
    1. A list of directories in which the operating system looks for files. To put a directory in one's path is to add the directory's name to this list; to put a file in one's path is to store the file in a directory that is on the list.
    2. Used also of the full path or pathname of a file, the sequential list of directories which locates the file on the disk; the reference is parsed recursively, like a linguistic tree, e.g., in Unix, the string /usr/jlawler/bin/aliases specifies a file named aliases, which is further specified as being located in the subdirectory named bin, which is located in the subdirectory named jlawler, which is located in the subdirectory named usr, which is located under the top (root) directory (always called simply "/").
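The parse of a full pathname described in sense 2 can be reproduced with Python's standard pathlib module, using the pathname from the example above.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/usr/jlawler/bin/aliases")

print(path.parts)   # the root first, then each directory, then the file
print(path.name)    # the file itself
print(path.parent)  # the directory that contains it
```

PurePosixPath only analyzes the string; it does not check whether such a file actually exists.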

  • Penn Treebank: A corpus of Wall Street Journal documents annotated with part-of-speech and bracketing information, distributed by the Linguistic Data Consortium. The Penn Treebank also includes a bracketed version of the Brown Corpus. Its web site is http://www.cis.upenn.edu/~treebank.

  • PPP: Point-to-point protocol. A way of accessing the Internet which allows your home machine to act as if it were, itself, an Internet machine. PPP, for example, allows you to retrieve and display Internet graphics files. If you access the Internet through a serial line (formerly the most common type of modem connection), you can not use a graphical browser.

  • precision: In information retrieval or corpus-based linguistics, the number of answers in an answer set hypothesis which are also in the answer key, divided by the size of the answer set hypothesis.

  • proprietary: Of software, especially an operating system, signifying that it is manufactured and maintained by only one vendor, or that it is the only type usable on a particular computer, or that it does not conform to a widely-accepted standard, or that its details are secret, or some combination of these. Contrasts with open.

  • protocol: An agreed-upon way of doing things. Internet protocols have been established for such actions as transmission of information packets (tcp), file transfer (ftp), and hypertext transfer (http). Any machine which does things according to these protocols can be a part of the Internet.

  • recall: In information retrieval or corpus-based linguistics, the number of answers in an answer set hypothesis which are also in the answer key, divided by the size of the answer key.
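Precision and recall, as defined in this entry and the precision entry above, can be computed together in a few lines of Python. The answer sets below are hypothetical, and returning 0.0 for an empty set is an assumption of this sketch rather than part of either definition.

```python
def precision_recall(hypothesis, key):
    """Return (precision, recall) for two collections of answers."""
    hypothesis, key = set(hypothesis), set(key)
    correct = len(hypothesis & key)  # answers found in both sets
    # Assumption: score an empty set as 0.0 rather than dividing by zero.
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall

hypothesis = {"A", "B", "C", "D"}  # hypothetical system output
key = {"A", "B", "E"}              # hypothetical answer key
print(precision_recall(hypothesis, key))
```

Here two of the four hypothesized answers are in the key (precision 2/4), and two of the three key answers were found (recall 2/3).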

  • record (n): In a database, a collection of information about a single entity.

  • regular expression (RE): A formal syntactic specification widely implemented in the Unix language family for reference to strings. For example, the regular expression [A-Za-z0-9]* denotes a string of alphanumerics (i.e., letters or numbers).
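The alphanumeric expression from this entry can be tried out in Python with the standard re module; the candidate strings are arbitrary. Because of the asterisk, the empty string also counts as a match.

```python
import re

alphanumeric = re.compile(r"[A-Za-z0-9]*")

candidates = ["abc123", "", "two words", "naïve"]
results = {c: bool(alphanumeric.fullmatch(c)) for c in candidates}
for candidate, matched in results.items():
    print(repr(candidate), matched)
```

The last two candidates fail because a space and the letter ï fall outside the ranges listed in the brackets.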

  • rendering: The process of converting a stream of encoded characters to their correct graphic appearance on a terminal or printer.

  • reverse alphabetical order: Sorting of words by their endings so that, for example, a word list in reverse alphabetical order begins with words ending in -a. A word list in reverse alphabetic order is also called a speculum.
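Reverse alphabetical order amounts to sorting on each word spelled backwards, as in this Python sketch; the function name and word list are hypothetical.

```python
def reverse_alphabetical(words):
    """Sort words by their spelling read from the last letter backwards."""
    return sorted(words, key=lambda word: word[::-1])

print(reverse_alphabetical(["walking", "sofa", "talked", "sing", "idea"]))
```

Words ending in -a come first, and words sharing an ending such as -ing are grouped together.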

  • router: An Internet machine whose specialized job is finding paths along the net for information packets. It looks for functional, uncongested paths to destinations, and sends data along them.

  • RTF: Rich Text Format is a special interchange file format that can be created and read by most popular wordprocessors. RTF preserves most formatting information, and graphics. Since they use only low ASCII, RTF documents can be usefully transmitted by email.

  • scanning: The process of creating a digital image of a page of text or other material. This term is sometimes also used for optical character recognition.

  • script: A collection of Unix commands, structured together as a program and stored as an executable file. The commands in a script are interpreted by the shell (normally sh) and treated as if they were entered in order by the user at the command line.

  • server: Software that forms part of a server/client pair. Typically, a server resides on a central machine and, when it is contacted by the client software on a user's machine, sends a particular type of information. Web servers, for example, send hypertext documents; news servers send articles posted to newsgroups.

  • SGML: Standard Generalized Markup Language is a method for generalized markup that has been adopted by ISO (the International Organization for Standardization) and is consequently gaining widespread use in the world of computing.

  • sgmls: A shareware Unix and DOS program for validating SGML documents.

  • shell: A kind of tool program that parses, interprets, and executes commands, either interactively from the keyboard, or as a script. DOS uses a shell called COMMAND.COM; there are several shells available in Unix: the most common are the original Bourne shell (sh), used mostly for interpreting scripts, and the C-shell (csh), the standard for interactive commands and aliases.

  • special character: A character that is not available in one of the character sets already supported on a computer system.

  • standard input, standard output: The input and output streams for DOS or Unix tool programs. The operating system associates these streams with each program as it is run. The standard input defaults to the keyboard, and the standard output to the screen, though both are frequently redirected to other programs, or to files.
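
As a rough illustration of the entry above, here is a minimal Python sketch of a tool program that reads one stream and writes another. The name upcase_filter is hypothetical; the in-memory streams stand in for a redirected file or another program's output, and the defaults correspond to the keyboard and the screen.

```python
import io
import sys

def upcase_filter(infile=sys.stdin, outfile=sys.stdout):
    """Copy infile to outfile line by line, converting text to upper case."""
    for line in infile:
        outfile.write(line.upper())

# Simulate redirection: in-memory streams replace the keyboard and the screen.
source = io.StringIO("one\ntwo\n")
sink = io.StringIO()
upcase_filter(source, sink)
result = sink.getvalue()
print(result, end="")
```

Called with no arguments, the same function would read whatever the operating system has attached to standard input and write to standard output.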

  • stream: A (long) string of bytes, which may come from any source, including a file. Streams are operated upon by filters and other programs. "Stream" is often used as an alternative, active metaphor for "file", when considered in terms of sequential (serial) throughput that can be redirected.

  • string: A sequence of bytes. Since bytes are used to encode text, "string" is often used as a synonym for "word" or "phrase" in electronic text-processing environments. Special uses of the term include search string (the string to be matched in a searching operation) and replacement string (the string to be substituted for occurrences of the search string in a replacement operation).

  • style sheet: A separate file that is used with a document containing generalized markup to declare how each generalized text element is to be formatted for display.

  • subdirectory: A directory that is located inside another directory. There can be long chains of subdirectories in a file's full path if it is deeply buried in the file system.

  • switch: One of a number of parameters that may be set for Unix tool programs, each specifying special instructions (e.g., with the sort tool, sort -rn specifies reverse numeric sort). Each program has its own unique array of possible switches, invoked on the command line before arguments, using a switch prefix (normally the minus sign "-") before the individual letters indicating the switch settings, thus resembling clitics on the command verb. May be set by menu or checkbox in a GUI. Also called options or preferences.

  • tag: A string of characters inserted into a text file to represent a markup code. In SGML, each text element of a given type is delimited by an opening tag of the form <type> and a closing tag of the form </type>. In computational linguistics, a part-of-speech tag is a lexical syntactic category associated with a word in a corpus; a coreference tag is an annotation indicating the referential dependency of the tagged phrase on other tagged phrases in the corpus.
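
To make the opening/closing tag forms concrete, the following Python sketch picks out simple tags from a marked-up string. It is only an approximation: the pattern assumes tags without attributes, the sample text and element names are invented for illustration, and this is in no sense a real SGML parser.

```python
import re

# Invented sample text; element names are illustrative only.
text = "<title>Glossary</title><p>A <term>tag</term> marks up text.</p>"

# Matches <type> and </type>; assumes no attributes inside the tag.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")

tags = [("close" if m.group(1) else "open", m.group(2))
        for m in TAG.finditer(text)]

for kind, name in tags:
    print(kind, name)
```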

  • tag set: In computational linguistics, a set of possible tags for a given annotation task. For example, a part-of-speech tag set is a list of lexical syntactic categories which may be associated with lexical items. Cf. paradigm.

  • TCP, or Transmission Control Protocol: A way of transmitting information packets on the Internet so that those belonging to the same body of data can be identified and reassembled into their original order.

  • TEI: The Text Encoding Initiative is a joint effort of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics to develop SGML-based guidelines for the encoding of texts and the analysis of texts.

  • telnet: A way of logging in to a remote machine; also, the name of one of the more common programs that implement this facility.

  • tool: One of a generalized type of small useful modular programs, made to work together in a conceptually unified way so as to provide maximum flexibility, power, and ease of operation. Part of the Software Tools philosophy, instantiated most thoroughly in Unix.

  • test corpus: An annotated corpus set aside for evaluation of the annotation procedure. To ensure the accuracy of the evaluation process, there should be no overlap between training and test corpora.

  • training corpus: An annotated corpus whose contents are consulted in the process of developing a procedure to produce these annotations. To ensure the accuracy of the evaluation process, there should be no overlap between training and test corpora.
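
The no-overlap requirement in the two entries above can be sketched in Python as follows. The function name split_corpus, the 20% test fraction, and the placeholder sentences are all assumptions made for the example, not a prescribed procedure.

```python
import random

def split_corpus(items, test_fraction=0.1, seed=0):
    """Shuffle items and divide them into disjoint training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder "annotated sentences"; a real corpus would supply these.
sentences = [f"sentence-{i}" for i in range(20)]
training, held_out = split_corpus(sentences, test_fraction=0.2)

print(len(training), len(held_out))
print(set(training) & set(held_out))  # empty set: no item is in both portions
```

Because each item lands in exactly one slice of the shuffled list, the two portions cannot share an item as long as the items themselves are distinct.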

  • type/token ratio: A measure of the spread or richness of the vocabulary in a text, calculated by dividing the number of types (different words) by the number of tokens (running words, counting every occurrence of each word).
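
The calculation can be sketched in a few lines of Python. The tokenization rule here (runs of letters or apostrophes, compared without regard to case) is an assumption made for the example; different definitions of "word" will give different ratios.

```python
import re

def type_token_ratio(text):
    """Return the number of distinct words divided by the number of running words."""
    # Assumption: a word is a run of letters or apostrophes, case ignored.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "the cat saw the dog and the dog saw the cat"
ratio = type_token_ratio(sample)
print(f"{ratio:.3f}")
```

The sample has 11 tokens but only 5 types (the, cat, saw, dog, and), so the ratio is 5/11.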

  • Unicode: A character set which attempts to include every character from all the major writing systems of the world. Version 1.0 contained 28,706 characters.

  • variable: In Unix, a special name assigned to substitute for some (usually recondite and un-mnemonic) term that may vary from user to user (and thus cannot be supplied literally in documentation). For instance, $HOME is a first-person indexical variable that refers to the home directory of whatever user types it, while the variable bookmark might be assigned by one user to point to the full pathname of a file containing their Web bookmarks, and by another to a file containing a list of book reviews. $HOME is an example of a global, or system variable, part of Unix and available to all users, while the various uses of bookmark are local, or shell variables interpretable only in the environment of the particular user.

  • WAIS: Wide Area Information Service. An Internet search facility that retrieves filenames labeled with a score based on their probable relevance to the search criteria. Unlike Gopher, WAIS searches indexes of the text inside the files rather than an index categorizing files by content.

  • wildcard: A simplified version of the Kleene closure, usually consisting only of "*" for "any string" and "?" for "any character", used to allow variable pattern specifications. Found in Unix shell dialects, DOS command syntax, and a large number of search languages based on regular expressions.
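
The two wildcard characters can be tried out with Python's standard fnmatch module, which implements shell-style matching. The file names below are invented for the example.

```python
from fnmatch import fnmatchcase

# Invented file names for illustration.
names = ["chap1.txt", "chap2.txt", "chap10.txt", "notes.txt", "chap1.bak"]

# "*" stands for any string, including the empty string.
star_matches = [n for n in names if fnmatchcase(n, "chap*.txt")]

# "?" stands for exactly one character.
question_matches = [n for n in names if fnmatchcase(n, "chap?.txt")]

print(star_matches)
print(question_matches)
```

Note that chap10.txt matches the "*" pattern but not the "?" pattern, since "?" cannot stand for the two characters "10".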

  • word list: A list of words, normally in alphabetical or frequency order, where each word is accompanied by a number indicating how many times that word occurs.
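
A word list of this kind might be produced along the following lines in Python. The function name word_list and the tokenization rule (runs of letters or apostrophes, case ignored) are assumptions made for the sketch.

```python
import re
from collections import Counter

def word_list(text, by_frequency=False):
    """Return (word, count) pairs in alphabetical or descending frequency order."""
    # Assumption: a word is a run of letters or apostrophes, case ignored.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    if by_frequency:
        # Most frequent first; ties are broken alphabetically.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(counts.items())

sample = "the cat saw the dog and the dog saw the cat"
alphabetical = word_list(sample)
by_freq = word_list(sample, by_frequency=True)

for word, count in by_freq:
    print(f"{count:4d} {word}")
```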

  • World Script: A subcomponent of the Macintosh operating system (version 7.1 and later) which gives programs access to script interface systems for multiple non-Roman writing systems.

  • WWW, or World Wide Web: The "web" is a metaphor for the multiplicity of links effected by Web browsers and Web servers, a notional place. It is not, itself, a piece of software or hardware.
