REPRESENTATION AND STORAGE OF DATA
==================================

Number Systems

The four number systems we will examine are binary, octal, decimal, and hexadecimal. Binary numbers can be represented directly in hardware, and the decimal number system is the most frequently used by people. People also use other number systems, such as base 60 in time and angular measurements, but we will not consider these.

Often it is more convenient to know a word or byte in binary than in decimal, as its actual value may be less important than which bits are set and which are reset. As binary numbers are difficult to read, a base from which numbers can easily be converted to binary is preferable. Bases which are a power of two are the simplest to convert to and from binary; no arithmetic is necessary, just table lookup. The most widely used such bases are 8 (octal) and 16 (hexadecimal). The PDP-11 assembly language (MACRO-11) and the debugger (ODT) both use octal as the default base, but system software for many other machines uses hexadecimal.

Digits in octal range from 0 through 7; as hexadecimal requires 16 different digits, it is necessary to use letters of the alphabet for the digits valued 10 through 15. A, B, C, D, E and F are used.

    Decimal  Binary  Octal
    -------  ------  -----
       0      000      0
       1      001      1
       2      010      2
       3      011      3
       4      100      4
       5      101      5
       6      110      6
       7      111      7

Conversion from binary to octal: Divide the number into chunks of 3 bits each, starting from the right. For example,

    11001011010101101 --> 11 001 011 010 101 101

Now, pad out the leftmost chunk with zeros, so that it is exactly 3 bits long.

    11 001 011 010 101 101 --> 011 001 011 010 101 101

Now use the conversion table to convert to octal.

    011 001 011 010 101 101 --> 313255

Conversion from octal to binary is the opposite process: First, use the table to convert each octal digit into three binary bits, remove any leading zeros, and then pack the chunks.
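
The table-lookup conversions between binary and octal can be sketched in Python as follows. This is only an illustration and has nothing to do with how MACRO-11 or ODT do it; the names OCTAL_TABLE, BINARY_TABLE, binary_to_octal and octal_to_binary are made up for this sketch, and numbers are handled as strings of digits so that the chunking is visible.

```python
# Lookup table from the text: each octal digit and its 3-bit pattern.
OCTAL_TABLE = {
    "0": "000", "1": "001", "2": "010", "3": "011",
    "4": "100", "5": "101", "6": "110", "7": "111",
}
# Reverse table: 3-bit pattern -> octal digit.
BINARY_TABLE = {bits: digit for digit, bits in OCTAL_TABLE.items()}


def binary_to_octal(bits):
    """Convert a string of binary digits to a string of octal digits."""
    # Pad on the left with zeros so the length is a multiple of 3.
    padded = bits.zfill((len(bits) + 2) // 3 * 3)
    # Cut into 3-bit chunks and look each one up in the table.
    chunks = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    return "".join(BINARY_TABLE[chunk] for chunk in chunks)


def octal_to_binary(digits):
    """Convert a string of octal digits to a string of binary digits."""
    bits = "".join(OCTAL_TABLE[d] for d in digits)
    # Remove leading zeros, but keep a single "0" for the value zero.
    return bits.lstrip("0") or "0"
```

With these definitions, binary_to_octal("11001011010101101") gives "313255", the same result as the worked binary-to-octal example. No arithmetic is done anywhere, only padding, slicing and table lookup.
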
For example,

    1604732 --> 001 110 000 100 111 011 010
            --> 1 110 000 100 111 011 010
            --> 1110000100111011010

    Decimal  Binary  Hexadecimal
    -------  ------  -----------
       0      0000        0
       1      0001        1
       2      0010        2
       3      0011        3
       4      0100        4
       5      0101        5
       6      0110        6
       7      0111        7
       8      1000        8
       9      1001        9
      10      1010        A
      11      1011        B
      12      1100        C
      13      1101        D
      14      1110        E
      15      1111        F

Conversion from binary to hexadecimal is similar, only 4-bit chunks are used instead of 3-bit chunks. For example,

    11001011010101101 --> 1 1001 0110 1010 1101
                      --> 0001 1001 0110 1010 1101
                      --> 196AD

Conversion from hexadecimal to binary is again the opposite process.

    2B50F9 --> 0010 1011 0101 0000 1111 1001
           --> 10 1011 0101 0000 1111 1001
           --> 1010110101000011111001

Conversion from binary to decimal is accomplished by the following algorithm:

    %b is the binary number and n is the number of bits in it.
    %Number each bit from 0 through n - 1 starting from the
    %right.
    sum <- 0;
    for i from 0 to n - 1 do
        if bit[i] of b = 1 then
            sum <- sum + 2 ** i
        fi
    od

and conversion from decimal to binary is accomplished by the following algorithm:

    %n is the decimal number and sum is the binary result.
    %Number each bit of sum from 0 upward starting from the
    %right.
    sum <- 0;
    i <- 0;
    while n > 0 do
        bit[i] of sum <- odd(n);    %Set bit[i] if n is odd.
        n <- n div 2;               %Integer divide, i.e. throw
        i <- i + 1                  %away remainder.
    od

Representation of Non-numeric Data. Character Codes.

Computers are more often used for processing non-numeric data than numeric data. In order to store non-numeric data, it is necessary to have a representation for it in binary numbers. Non-numeric data may consist of words in English, punctuation marks, digits not requiring numeric processing, spaces, and control characters such as carriage return. Alphabets (as distinct from syllabaries and logographic scripts, such as the Japanese and Chinese character sets) have around thirty characters.
A character set based on English requires 7 bits to represent each character, as there are more than 64 symbols (too many for 6 bits) but no more than 128 (the number of codes available in 7 bits) -- 26 upper case letters, 26 lower case letters, 10 digits, about 30 punctuation marks, and several control characters.

How the characters are assigned binary values is arbitrary. A number of different character codes are in use, the most common being ASCII (American Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code).

The ASCII code is a 7-bit code. It is not usual to memorise character codes, as tables can be referred to, but it is convenient to be able to calculate the codes for letters and digits, and to know the values of a few common control characters. For your reference, here is the ASCII character code:

L.S. bits
    !
    !
    V     000  001  010  011  100  101  110  111   <-- M.S. bits
          ---  ---  ---  ---  ---  ---  ---  ---
  0000    NUL  DLE  SP   0    @    P    `    p
  0001    SOH  DC1  !    1    A    Q    a    q
  0010    STX  DC2  "    2    B    R    b    r
  0011    ETX  DC3  #    3    C    S    c    s
  0100    EOT  DC4  $    4    D    T    d    t
  0101    ENQ  NAK  %    5    E    U    e    u
  0110    ACK  SYN  &    6    F    V    f    v
  0111    BEL  ETB  '    7    G    W    g    w
  1000    BS   CAN  (    8    H    X    h    x
  1001    HT   EM   )    9    I    Y    i    y
  1010    LF   SUB  *    :    J    Z    j    z
  1011    VT   ESC  +    ;    K    [    k    {
  1100    FF   FS   ,    <    L    \    l    |
  1101    CR   GS   -    =    M    ]    m    }
  1110    SO   RS   .    >    N    ^    n    ~
  1111    SI   US   /    ?    O    _    o    DEL

The EBCDIC code is an 8-bit code. IBM use it. We will not.

Definition of Bytes, Words, Records and Blocks.

A byte is the number of bits required to store a character. Each character is normally stored in 8 bits, the most significant bit either being ignored, set to zero, or used for parity checking.

Parity checking is used in data communications, where data are being sent along a possibly 'noisy' medium, such as a telephone line. It is essential to know if bits are corrupted (changed from 1 to 0, or 0 to 1). One way of doing this is to set the parity bit according to whether the character has an odd or even number of bits set to 1.
For even parity, the parity bit is set to 0 if there is an even number of 1's, and set to 1 if there is an odd number. This means that the number of 1's in each byte (counting the parity bit) is always even, and if a computer receives a byte with an odd number of bits set to 1, then it can treat it as corrupt, and ask for it to be retransmitted by the sending computer. A byte with two bits corrupted will be treated as correctly received, but this is much less likely. More sophisticated means of data integrity checking can be used, with which it is possible to know which bit has been corrupted.

A word is the number of bits that can be sent simultaneously along the CPU's data bus (a bus is a set of wires used to transmit data in parallel), and so varies from machine to machine. For example, the PDP-11 has a word size of 16 bits, the Z80 (the Sharp Micro's CPU) has a word size of 8 bits, the IBM 370 has a word size of 32 bits, and the Burroughs 6800 has a word size of 48 bits. Generally, more powerful machines have larger word sizes. Microcomputers usually have word sizes of between 8 and 16 bits, minicomputers have word sizes of between 16 and 32 bits, and mainframes have word sizes of between 32 and 64 bits, but this is only a rough guide. The PDP-11 is a minicomputer and the IBM 370 is a mainframe.

There are two types of records: logical records and physical records. From the programmer's point of view, a file consists of a table of logical records. Logical records can be either of fixed length, or of variable length (in which case they are usually separated by a control character such as a carriage return (CR)). In a text file, the logical record is simply a line; for other files, a logical record may contain a set of fields, defined by the programmer.

A physical record is the unit of storage on secondary memory devices, such as disks or magnetic tapes. If the logical records are smaller than the physical records, several logical records can be stored in one physical record.
This is called 'blocking'. In the case of logical records being larger, each may be divided among several physical records ('spanned records').

A block is the amount of data that can be loaded in one disk access (in the case of a tape, a block is the same as a physical record). The larger the block size, the less time is required reading from and writing to the file, but the more space is required in main memory for buffering the data. Block size should always be greater than or equal to the physical record size. If it is not, then at least two disk accesses will be necessary to read or write a physical record.
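
The relationships between logical records, physical records and blocks come down to a little integer arithmetic, sketched here in Python. This is an illustration only: the function names are made up for the sketch, all sizes are taken to be in bytes, and it ignores any per-record overhead (such as separators or length fields) that a real system would add.

```python
def records_per_physical(logical_size, physical_size):
    """Blocking: how many whole logical records fit in one physical record."""
    return physical_size // logical_size


def physical_per_record(logical_size, physical_size):
    """Spanning: how many physical records one logical record occupies."""
    # Ceiling division, done with integers only.
    return (logical_size + physical_size - 1) // physical_size


def accesses_per_physical(physical_size, block_size):
    """Disk accesses needed to read or write one physical record."""
    return (physical_size + block_size - 1) // block_size
```

Taking hypothetical sizes, 80-byte logical records are blocked 6 to a 512-byte physical record, while a 1200-byte logical record spans 3 such physical records. With a 256-byte block, accesses_per_physical(512, 256) is 2, which is the case the text warns against; with a 1024-byte block it is 1.
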