A BASIC DIGITAL LIBRARY VOCABULARY

Following is a very basic vocabulary for use in understanding digitization and representation of materials. This is by no means an exhaustive list and is intended primarily to support the main concepts behind creating and accessing digital libraries.

Acrobat antialiasing Arpanet artificial intelligence
ASCII asymmetric cryptography automatic speech recognition binary search
bitmap Boolean operators browsing CCITT Group IV Fax compression
CD ROM CMYK CIE Standard Colorimetric System clustering
collision chain compression controlled vocabulary copyright
cryptography cryptolopes CSMA/CD database
deskewing digital library dithering DLI
DPI DTD DVD ethernet
FTP gamma GIF HTML
HTTP Internet Internet Browser IP address
Java JPEG keyword searching linear searching
LSI MARC MPEG OCR
PCL PDF pixels per inch PostScript
precision RealAudio/Video recall RGB
scanner SGML speech compression TCP/IP
telnet TIFF URL vector graphic
vector model wav World Wide Web WYSIWYG

 

   
Acrobat   Software developed by Adobe Systems to facilitate cross-platform presentation of text and graphics. Acrobat generates documents in PDF (portable document format).
     
Antialiasing   A technique for smoothing bitonal graphic images by using grayscale pixels to gradually provide transition from the black to the white portions of an image. Without antialiasing, a diagonal line would appear choppy instead of smooth. This is especially useful in text scans where letters may be composed of not only vertical and horizontal strokes but of varying degrees of curves and diagonal lines. The overall effect is a smooth outline for all characters.
     

Arpanet (Advanced Research Projects Agency Network)   Developed in 1969 by Larry Roberts and supported by the United States Department of Defense, the Arpanet can be considered the beginning of today's Internet. At the time, the network was developed to facilitate high-speed communication between institutions doing military research.
     
Artificial Intelligence   A long pursued goal of computing system designers, a system capable of artificial intelligence would actually be able to approximate human thought processes, understand spoken commands, and respond appropriately. Speech synthesis has already progressed considerably, but the goal of producing a truly intelligent system has still not been achieved.
     
ASCII (American Standard Code for Information Interchange)   A text representation standard that stores one alphabetic, numeric, or other character in one computer byte. This is the most common means for representing characters and allows plain text to be shared among multiple applications. Saving a document in ASCII format in a word processor like WordPerfect means that the same document can be used in a different word processor without having to be converted. ASCII does not, however, allow for page layout as word processing and page layout programs do. A document originally created with bold and italic text and with special formatting (tables, indentations, etc.) will lose all special formatting and text characteristics when saved to ASCII format.
     
Asymmetric Cryptography   Encryption that uses one key to encode a message and a different key to decode it. A sender encodes a message with a key that the recipient has made public, and only the recipient, who holds the matching private key, can decode it. Because the key used to encode a message cannot be used to decode it, the transaction is one-way, or asymmetric.
     
Automatic Speech Recognition   The ability of a computer system to recognize and interpret human speech by use of software designed to analyze sounds and interpret them according to a programmed vocabulary. The applications of such a system range from enabling a user to command a computer to perform certain tasks (shut down, open file, etc.), to enabling hands-off dictation of entire texts to a computer system, to identifying individuals by matching voice and pronunciation to known patterns.
     
Binary Search   A means of quickly searching a sorted data list. A search for a match begins in the middle of the list and continues halving the remaining portion until a match is found or until the portion is small enough to be searched sequentially.
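The halving procedure can be sketched in a few lines of Python. This is only an illustration: the function name binary_search and the sample term list are assumptions made for the example, and this version keeps halving until the range is empty rather than switching to a sequential scan.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2          # start in the middle of the range
        if sorted_items[middle] == target:
            return middle
        if sorted_items[middle] < target:
            low = middle + 1                # discard the lower half
        else:
            high = middle - 1               # discard the upper half
    return -1


# The list must already be sorted for the halving step to be valid.
terms = sorted(["bitmap", "dithering", "gamma", "pixel", "scanner", "vector"])
print(binary_search(terms, "gamma"))
print(binary_search(terms, "wavelet"))
```

Each comparison discards half of the remaining items, which is why the list has to be in sorted order before the search begins.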
     
Bitmap   Also called a raster image, a bitmap uses a grid of pixels to reproduce an image. Each pixel is a small square that is assigned a color value and location within the bitmap grid. The combination of pixels can produce anything from a low resolution bitonal image to a high resolution full color image of photographic quality. Most photo-editing software uses the bitmap format as a standard for editing images. Instead of editing an entire image, a photo-editor can edit each individual pixel to produce desired results.
     
Boolean Operators   Logical operators used to facilitate database searching. AND, OR, and NOT are commonly used Boolean operators. Boolean logic is based on the work of British mathematician George Boole, whose work in algebra established the logical principles of set theory.
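As a rough illustration, the three operators correspond to set intersection, union, and difference. The small index below is hypothetical: each term is assumed to map to the set of document numbers in which it appears.

```python
# Hypothetical index: term -> set of document numbers containing that term.
index = {
    "digital": {1, 2, 3, 5},
    "library": {2, 3, 4},
    "scanner": {3, 6},
}

both = index["digital"] & index["library"]      # digital AND library
either = index["digital"] | index["library"]    # digital OR library
without = index["digital"] - index["scanner"]   # digital NOT scanner

print(sorted(both), sorted(either), sorted(without))
```

AND narrows a result set, OR widens it, and NOT removes documents that contain the excluded term.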
     
Browsing   Looking through a collection of materials without a particular goal. People often browse the Internet looking for nothing in particular so much as just for something interesting. Browsing physical library collections is also a common user approach. When a user is familiar with the arrangement of a collection (history in one area, literature in another, etc.), browsing can be a productive means of finding specific materials.
     
CCITT Group IV Fax Compression   An international standard for facsimile transmission developed by the International Telegraph and Telephone Consultative Committee (CCITT, from its French name, Comité Consultatif International Téléphonique et Télégraphique), now the standardization sector of the International Telecommunication Union (ITU-T). The standard supports compression for more efficient image transmission.
     
CD ROM (Compact Disk Read Only Memory)   A hard plastic disk measuring 12 centimeters (about 4 3/4 inches) in diameter, composed of a very thin layer of metal sandwiched between plastic layers. The metallic layer is imprinted with depressions of varying depths that can be interpreted as images, sounds, and text. A single CD ROM can hold around 650 megabytes of information, over 450 times the capacity of a single floppy disk (1.44 megabytes). Even larger storage capacities are possible with the newer DVD technology (Digital Versatile Disk).
     
CIE (Commission Internationale de l'Eclairage) Standard Colorimetric System   International standard for representing color in three dimensions -- lightness, red-green, and yellow-blue. The CIE model represents all colors that can be perceived by the human eye. Colors that can be reproduced on a color monitor or in a color photograph comprise only a subset of the full visual range.
     
Clustering   A technique for retrieving data based on frequency of use. For example, a text that was accessed numerous times and had been scanned in full (rather than being viewed and passed over) would rank higher in a retrieval list than a text that had only been viewed (rather than fully accessed) or that had rarely been accessed at all. The idea behind clustering is that the more useful items in a collection will be more frequently accessed by users.
     
CMYK (Cyan Magenta Yellow Black)   A color model based on the light-absorbing properties of color inks on paper. In theory, the combination of cyan, magenta, and yellow on paper should produce black, but because of impurities in ink the combination instead produces a brownish color, so black ink is added to produce true black on paper. This color model is used in the so-called "four-color" printing process.
     
Collision Chain   A collection of data in a hash table with conflicting values. In other words, because of representing words with numeric values, numerous words may be represented by the same value and are said to collide. Usually a secondary hashing scheme is used to sort through collision chains within a table.
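A minimal sketch of a hash table that keeps collision chains follows. The class name ChainedHashTable and its deliberately weak hash (the sum of character codes) are assumptions chosen to force collisions. Note that this sketch resolves a collision by scanning the chain sequentially, which is a simpler substitute for a secondary hashing scheme.

```python
class ChainedHashTable:
    """Toy hash table in which colliding words share a bucket (a chain)."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket_for(self, word):
        # Weak illustrative hash: sum of character codes.
        # Words made of the same letters always get the same value.
        return sum(ord(ch) for ch in word) % len(self.buckets)

    def put(self, word, value):
        chain = self.buckets[self._bucket_for(word)]
        for i, (existing, _) in enumerate(chain):
            if existing == word:
                chain[i] = (word, value)    # replace an existing entry
                return
        chain.append((word, value))         # extend the collision chain

    def get(self, word):
        for existing, value in self.buckets[self._bucket_for(word)]:
            if existing == word:
                return value
        return None

    def chain_for(self, word):
        return [w for w, _ in self.buckets[self._bucket_for(word)]]


table = ChainedHashTable()
table.put("listen", 1)
table.put("silent", 2)   # same letters, so the same hash value: a collision
print(table.chain_for("listen"))
print(table.get("silent"))
```

Both words land in the same bucket, yet each can still be retrieved because the chain stores the original word alongside its value.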
     
Compression   A scheme for decreasing the size of a digital file in order to speed retrieval or to save on storage space. Compression may be lossless or lossy, depending on the algorithm that is used to compress the file. Where space is not so much an issue, lossless compression is favored.
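Lossless compression can be demonstrated with Python's standard zlib module. The sample text is made up for the example, and highly repetitive input like this compresses far better than typical documents would.

```python
import zlib

# Repetitive sample text (an assumption for the demo) compresses very well.
text = ("A digital library preserves collections of digital works. " * 50).encode("ascii")

packed = zlib.compress(text)        # smaller representation of the same bytes
restored = zlib.decompress(packed)  # lossless: the original comes back exactly

print(len(text), len(packed), restored == text)
```

A lossy scheme, by contrast, would not return the original bytes exactly; it trades some fidelity for a smaller file.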
     
Controlled Vocabulary   A collection of terms used in indexing materials described in a database. For example, the Library of Congress Cataloging System uses a controlled vocabulary to assist catalogers in uniformly describing materials to be included in a library catalog. Most online databases utilize controlled vocabularies to provide access points to the materials indexed in the databases. For each item described in the database, an indexer/cataloger further analyzes the content and assigns specific subject terms to further describe the item. Most systems using controlled vocabularies maintain detailed thesauri that define and cross-reference the terms used in the vocabulary.
     
Copyright   Legal claim to intellectual property. In the United States, the 1976 revision to the copyright law provided a duration of 75 years. The Berne Convention, which the U.S. will be applying, provides for the life of the author plus 50 years as the period of copyright.
     
Cryptography   The process of encoding information using a secret algorithm or key. Cryptography is especially important in electronic communications since it is relatively easy for an outsider to intercept messages being sent over the Internet or even to impersonate someone else when sending messages. People wanting to exchange messages securely can use encryption technology to secure their exchanges.
     
Cryptolope or Secret Envelope   Refers to IBM's encryption technology designed to assist companies in doing business over the Internet. The vendor using cryptolopes can set options which either allow or don't allow users to view, download, copy, print, or otherwise access materials from its Web site. Further information on cryptolopes is available directly from IBM.
     
CSMA/CD (carrier sense multiple access, collision detection)   Standard for transferring data across an Ethernet LAN. Devices needing access to the network will check for available bandwidth. If the network is free, the device will broadcast. If it is busy, the device waits for a random amount of time before trying again.
     
Database   A collection of related files that are managed with a database management system. A database may be text only or may include media of many types (sound, video, graphics, etc.).
     
Deskewing   Realignment of a scanned image to return it to its proper orientation. Skewing can result from improper placement of a page on a scanner or misalignment that can occur when using a sheet feeding scanner.
     
Digital Library   Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities. (Source: Digital Library Federation)
     
Dithering   A process of approximating a color or shade of gray by using dots of varying values. Dithering is used to improve the quality of a compressed digital image by representing subtle changes with varying dot sizes and intensities. Dithering produces more photorealistic images without drastically increasing file size.
     
DLI (Digital Library Initiative)   A program funded by NSF, NASA, and ARPA that focuses on digitizing and providing access to library collections. Six universities are the primary sites for the Initiative: University of California at Berkeley, University of Michigan, University of Illinois, Stanford University, University of California at Santa Barbara, and Carnegie Mellon University.
     
DPI (Dots Per Inch)   A means of describing printed text or image quality based on the number of ink dots that are used to produce a solid image. Standard resolution on paper for black text from an inkjet printer is 300 dpi. Laser printers and some more expensive inkjet printers are capable of 600 dpi or higher. The higher the dpi number, the finer the printed image will appear.
     
DTD (Document Type Definition)   Document formatting definition file used in SGML. SGML makes use of tags to control the presentation of a document. The tags and their functions are defined in the DTD file.
     
DVD (Digital Versatile Disk)   High capacity storage medium based on and the same size as a Compact Disk. Where a standard CD can hold up to 650 MB of information, a DVD can hold anywhere from 4 to 9 Gigabytes on a side or up to 17 Gigabytes on a double sided disk. DVD was developed originally as a medium for storing compressed digital movies which were too large to fit on a single CD.
     
Ethernet   The most commonly used local area network (LAN). Developed by Xerox, Digital, and Intel, Ethernet allows connection of up to 1024 nodes over twisted pair, coax, or fiber and is established as an IEEE standard (IEEE 802.3).
     
FTP (File Transfer Protocol)   An Internet communications standard that facilitates the sharing of files from one computer to another. Many Web sites offer file downloads to users using FTP. Indeed, some repositories of digital texts provide users with the option to download full text via an FTP site.
     
Gamma   A numeric value representing the difference between the input and output of a device. Most commonly, gamma is used in relation to monitors, which may display images either brighter or darker than the original scanned image. Gamma correction is an integral part of photo editing software and can be used to adjust scanned images so that they display and print more like the original.
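Gamma is commonly modeled as the exponent of a power law relating input to output intensity, and correction applies the inverse exponent. The sketch below works under that assumption; the function name gamma_correct and the value 2.2 are illustrative choices, not figures taken from this vocabulary.

```python
def gamma_correct(value, gamma):
    """Brighten or darken one 0-255 intensity using the exponent 1/gamma."""
    normalized = value / 255                 # scale to the range 0.0-1.0
    return round(255 * normalized ** (1 / gamma))


# Black and white are unchanged; mid-tones shift the most.
for level in (0, 64, 128, 255):
    print(level, gamma_correct(level, 2.2))
```

A gamma of 1.0 leaves every value as it was, which is a convenient check that the formula is wired correctly.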
     
GIF (Graphics Interchange Format)   A standard image representation format used extensively to make images available over the Internet. A GIF image can have a maximum of 256 colors or shades, which makes it useful for preserving the sharpness of bitonal or grayscale images. Although not as useful as other formats in preserving the richness of natural color scenes (the GIF compression scheme reduces the number of colors in a photograph from millions of colors to a maximum of 256), it is widely used to preserve color images for presentation over the Internet.
     
Hash Coding   A search technique that eliminates the need to scan an entire database by providing access to key words based on where they appear in a database. Words and their locations may be represented, for example, by numeric values, which are then searchable. Hash coding provides quicker search times than linear searching.
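The idea of looking up a word by where it appears can be sketched with a Python dictionary, which is itself implemented as a hash table. The function name build_index and the sample sentence are assumptions for the example.

```python
from collections import defaultdict


def build_index(text):
    """Map each word to the list of positions at which it occurs."""
    positions = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        positions[word].append(position)
    return positions


index = build_index("the scanner feeds the OCR software")
print(index["the"])       # found by hashing the key, not by scanning the text
print(index["scanner"])
```

Once the index is built, a lookup goes straight to the stored positions instead of reading through the text from the beginning.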
     
HTML (Hypertext Markup Language)   The standard formatting language used on the World Wide Web. Very similar in appearance to SGML, HTML's primary difference is that it supports hypertext links within a document, one of the hallmarks of Web page design. HTML uses tags enclosed in angle brackets (<>) to determine text characteristics, layout, and position of images and sound files.
     
HTTP (Hypertext Transfer Protocol)   Standard protocol that enables computer systems to exchange hypertext materials over the Internet.
     
Internet   The worldwide telecommunications backbone that supports communication among computer systems around the world. The Internet had its roots in the Arpanet in 1969 but has since grown into a vast communications network utilized not only by government, but by educational institutions, private organizations, companies, and individuals.
     
Internet Browser   Software that allows a user to view materials on the Internet. The three best known browsers are Netscape Navigator, Microsoft Internet Explorer, and NCSA Mosaic.
     
IP (Internet Protocol) Address   Typically used to refer to the numeric address of a computer connected to the Internet. IP addresses can be static (permanently assigned) or dynamic (assigned temporarily to enable computer-to-computer communications) and consist of numeric strings separated by periods. For example, the IP address 139.62.208.133 might designate an individual workstation at a specific organization. The first three strings of numbers would normally represent a particular network within an organization, while the final string would locate an individual machine attached to the network at that site.
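The split between the network portion and the machine portion of an address can be explored with Python's standard ipaddress module. The /24 prefix below is an assumption that matches the "first three strings" convention; real networks may divide the address differently.

```python
import ipaddress

# The /24 prefix (first three numbers = network) is an assumption.
interface = ipaddress.ip_interface("139.62.208.133/24")

network = interface.network                                   # network portion
host_number = int(interface.ip) - int(network.network_address)  # machine portion
octets = [int(part) for part in str(interface.ip).split(".")]

print(network, host_number, octets)
```

Changing the prefix length changes where the boundary falls, which is why the example states its assumption explicitly.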
     
Java   Platform-independent programming language created by Sun Microsystems. Java is similar in structure to C++. Java applets can be embedded in HTML documents and enable Web users to run programs on their computers regardless of their operating system (Windows, Apple, UNIX, etc.). Both Netscape Navigator and Microsoft Internet Explorer browsers have built-in support for Java.
     
JPEG (Joint Photographic Experts Group)   A standard image representation format used extensively to make images available over the Internet. A JPEG image preserves natural color more efficiently than the GIF format while still providing image compression sufficient to make it useful as a means of representing images in a computer system. The JPEG compression algorithm selectively discards color information based on the amount of compression desired. JPEGs support RGB, CMYK, and grayscale color modes.
     
Keyword Searching   Search capability that allows querying a database by any chosen words. Keyword searching can accommodate the use of Boolean operators, phrase and proximity searching, and allows for inexperienced searchers to find content within a database without knowing the structure or subject organization.
     
Linear Searching   Accessing a data file by starting at the beginning and going through the entire file looking for a string of data. Linear searching is inherently slower than other means of accessing a file, but it can accommodate searching for multiple strings of data.
     
LSI (Latent Semantic Indexing)   Technique for condensing vector space into fewer dimensions. LSI uses a matrix to facilitate location of words within documents. LSI also facilitates cross-language retrieval of documents.
     
MARC (Machine-Readable Cataloging)   Designed at the Library of Congress, MARC is a format for describing bibliographic items. In a MARC record, each line of the record begins with a code indicating the line's content, followed by the content to be displayed. For example, the 245 tag indicates the title of an item, while the 100 tag indicates the author. MARC records are the foundation for library OPACs (Online Public Access Catalogs) and facilitate describing, searching, and displaying information about bibliographic items.
     
MPEG (Moving Picture Experts Group)   A standard for compressing and storing moving images digitally. MPEG-1 is the most common standard and is useful for videoconferencing. MPEG-2 is more commonly used for storing motion pictures on disk.
     
OCR (Optical Character Recognition)   Conversion of printed text into digitally encoded text. A page of print can be scanned using a scanner and is then converted into a digital representation character by character. OCR converted texts typically require editing since character recognition accuracy can vary depending on the quality of the scanned text. Generally speaking, sans serif fonts are more accurately converted during the OCR process than are serif fonts. The darkness of the original type face can also affect OCR accuracy.
     
PCL (Printer Control Language)   Developed by Hewlett Packard, PCL is a language designed specifically for printing. PCL is not as portable across printer models as PostScript but it does support faster printing of images than the PostScript language.
     
PDF (Portable Document Format)   A display format developed by Adobe Systems based on the PostScript printing language which displays an exact image of an original document with all its layout and type characteristics intact. PDF files can be created using Adobe's Acrobat or PageMaker software and can be read using the free Acrobat Reader, which can be used as a plug-in for Netscape Communicator and Microsoft Internet Explorer to share PDF files over the Internet.
     
Pixels Per Inch   A measure of the dimensions of a graphic image. A pixel is a single small square of color information that can be displayed on a computer monitor or can be printed on a printer. The higher the pixels per inch the finer the printed image will be. Images stored solely for viewing on a computer monitor need not have a high number of pixels since the typical monitor displays at the resolution of 72 pixels per inch.
     
PostScript   A complex programming language developed by Adobe Systems for generating graphic images that may include multiple fonts, colors, and bitmapped images. Although PostScript is usually thought of primarily in terms of printing, it is a full programming language having its own set of commands that facilitate page layout as well as printing. PostScript is widely accepted as a standard language for interfacing with a variety of printers.
     
Precision   The extent to which a search of a database produces results that exactly match a user's query. A search that retrieves 30 documents, 20 of which are considered relevant, has precision of 66.67%. Precision is used along with recall to measure success of an information retrieval system.
     
Public Key/Private Key   The more secure of two encryption methods used for exchanging information over the Internet. Each recipient of a message has both a public and a private encryption key. A sender uses the recipient's public key to encrypt a message. The recipient uses his or her private key to decrypt it.
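The public/private split can be illustrated with a toy version of RSA, one well-known public-key scheme. It is used here only as an example and is not named in this vocabulary. The tiny primes and the helper names encrypt and decrypt are assumptions for demonstration; numbers this small offer no security at all.

```python
# Toy RSA key pair built from two tiny primes (for illustration only).
p, q = 61, 53
n = p * q                      # modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent
d = pow(e, -1, phi)            # private exponent (needs Python 3.8 or later)

public_key, private_key = (e, n), (d, n)


def encrypt(message_number, key):
    exponent, modulus = key
    return pow(message_number, exponent, modulus)


def decrypt(cipher_number, key):
    exponent, modulus = key
    return pow(cipher_number, exponent, modulus)


cipher = encrypt(65, public_key)      # anyone may use the public key
plain = decrypt(cipher, private_key)  # only the key holder can reverse it
print(cipher, plain)
```

The sender never needs the private key, so it never has to travel over the network.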
     
RealAudio/Video   Designed by RealNetworks, RealAudio and RealVideo are generally accepted Internet standards for enabling streaming audio and video over the Internet. Instead of waiting for a sound or video file to download, RealPlayer allows the file to start playing as soon as a user initiates reception. Over a fast Internet connection, RealAudio/Video appears to come across in real time.
     
Recall   The extent to which a search of a database produces relevant documents. A search of a database that contains 40 relevant documents and retrieves 30 of them is said to have 75% recall. Recall is used along with precision to measure success of an information retrieval system.
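Precision and recall can both be computed from two sets: the documents a search retrieved and the documents that are actually relevant. The document numbers below are made up for the example, and the function names are illustrative.

```python
def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


# Hypothetical collection: documents 1-40 are the relevant ones.
relevant = set(range(1, 41))
# The search returns 30 relevant documents plus 15 irrelevant ones.
retrieved = set(range(11, 41)) | set(range(101, 116))

print(round(precision(retrieved, relevant), 4))
print(recall(retrieved, relevant))
```

The two measures pull against each other: retrieving more documents tends to raise recall while lowering precision.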
     
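RGB (Red Green Blue)   A color model that uses three colors – red, green, and blue – to reproduce up to 16.7 million colors on a computer monitor. The RGB model is used by all computer monitors to reproduce images on the screen. In the RGB model, each pixel that composes an image is assigned an intensity value ranging from 0 to 255 for each of the three colors; a pixel with all three values at 0 is black, and one with all three at 255 is white. The 256 × 256 × 256 possible combinations account for the 16.7 million colors.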
     
Scanner   An optical device that can be interfaced with a computer to allow for importing a digital image of a physical object into a computer program. Scanners can be hand-held, flat-bed, or sheet-fed. On a flat-bed scanner, the object to be scanned is placed against a glass panel (usually page size) and is scanned by a scanning head consisting of a linear array of light-sensitive sensors. The image is illuminated by a high-intensity light and the reflected light is picked up by the sensors and transmitted as digital codes to photo-editing or OCR software.
     
SGML (Standard Generalized Markup Language)   A syntax for marking up a document to have certain layout, text display, and image display. SGML embeds tags describing certain aspects of a document in angle brackets (<>) and surrounds all text elements with tags to describe how the text will be displayed. For example, a line of bold text will be enclosed between an opening tag and a matching closing tag, and the resulting display will show the enclosed text in bold. SGML provides standards for laying out tables, displaying graphics, and producing special characters like the pound sign, the ampersand, and the copyright symbol. Unlike PDF documents, SGML documents can be rearranged according to user specifications, producing a document display that may be larger or smaller than the original. This provides considerable flexibility from one machine to another. The downside is that an author wanting control over the appearance of his or her document essentially loses that control to the reader. Current SGML standards include the Association of American Publishers' Electronic Manuscript Standard, the Department of Defense Continuous Acquisition and Life-Cycle Support rules (CALS), and the Text Encoding Initiative standard (TEI).
     
Speech Compression   Use of algorithms to sample voice as it is recorded and reduce the size of the resulting digital file. The most common standard for compressing speech, which can also be used to compress music, is the GSM algorithm used for digital cellular phones. GSM can compress speech by a factor of 5, producing files that occupy approximately 1600 bytes per second of recorded information.
     
TCP/IP (Transmission Control Protocol/Internet Protocol)   The set of protocols that allow different computer systems to exchange information over the Internet. TCP/IP is platform and machine independent and serves as the standard communication language for all computer systems tied into the Internet.
     
Telnet   Standard protocol for logging onto another computer system over the Internet and running programs from a remote terminal or workstation. Telnet was developed as part of the ARPAnet project.
     
TIFF (Tagged Image File Format)   A format used to exchange files between computer applications and platforms (Windows, Mac, etc.). The TIFF format is shared by almost all photo-editing and paint programs and is supported by most scanner software as an option for saving scanned images. TIFFs are bitmapped images that support RGB and CMYK color schemes as well as grayscale.
     
URL (Uniform Resource Locator)   The digital "address" of a resource available over the Internet. URLs are prefaced with the name of the protocol used to access a resource. For example, a hypertext source is prefaced with the http:// protocol designator while a telnet system is prefaced with the protocol designator telnet://. The address may be a combination of letters and numbers but is translated into a numeric value by a domain name server, a computer system assigned to resolve addresses on the Internet.
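The parts of a URL can be pulled apart with Python's standard urllib.parse module. Both addresses below are hypothetical examples on the reserved example.org domain.

```python
from urllib.parse import urlparse

# Hypothetical addresses used only to show the structure of a URL.
web = urlparse("http://www.example.org/library/vocabulary.html")
remote = urlparse("telnet://catalog.example.org")

print(web.scheme, web.netloc, web.path)   # protocol, host name, resource path
print(remote.scheme, remote.hostname)
```

The scheme tells the client which protocol to use, while the host name is what a domain name server resolves to a numeric address.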
     
Vector Graphic   An image format that uses geometrical shapes like lines and curves to define images. Vector graphics are frequently used to define type faces or high definition shapes where the smoothness of the image at all resolution settings is important. Unlike bitmap images, which can lose their definition and look chunky at increased resolutions, vector graphics retain their definition at any resolution. When displayed on a computer monitor, both bitmaps and vector graphics are displayed in pixels.
     
Vector Model   Information retrieval model introduced by Gerard Salton that represents documents using vectors based on word frequency within a document. The more times a word occurs within a document, the stronger the vector representing the document. A search of the database on a word would produce an ordered listing of documents whose vectors most closely match the search query.
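A small sketch of ranking by word-frequency vectors follows. It measures closeness with cosine similarity, which is a common choice but an assumption here, since the entry does not say how the match is computed. The function names and sample documents are also made up for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return names of matching documents, best match first."""
    query_vector = term_vector(query)
    scored = [
        (cosine_similarity(query_vector, term_vector(text)), name)
        for name, text in documents.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


documents = {
    "a": "scanner scanner image scanner",
    "b": "scanner image image image",
    "c": "copyright law",
}
print(rank("scanner", documents))
```

Document "a" ranks first because the query word dominates its vector, while "c" is dropped because it shares no words with the query.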
     
WAV   Sound file format standard in the Windows operating system. WAV files typically use quite a bit of disk space and aren't the most efficient means for transmitting sound digitally.
     
World Wide Web (WWW)   The graphics-rich, sound- and video-enabled, hyperlinked part of the Internet that many people mistakenly treat as synonymous with the Internet. The communication protocol that supports the Web is called Hypertext Transfer Protocol (HTTP). Anyone who has access to the Internet and is running a Web browser on his or her workstation can access materials on the Web and get the full benefit of multimedia content and hypertext links to other resources. The full Internet includes other types of systems, including FTP sites, Gopher servers, and telnet systems, to name a few.
     
WYSIWYG (What You See Is What You Get)   A format for laying out documents where the image displayed on the screen is the same as what a user will ultimately print. Developed by Xerox, this standard for layout and print is used by most word processing programs at this point.
     
     

Comments & Suggestions to Jim Alderman.

Updated 7 December 1998.