THE WORLD BANK

GROUP A World Free of Poverty
Home

Banner

Technology Broadcast and Computer-based
On-line databases and services

Database Systems

Hugh Robinson

Context:
This selection, one of two by this author describing database systems, explains database concepts, features, and terminology.

Source:
Robinson, Hugh. "Database systems." In Relational database systems, Introduction to database technology. Milton Keynes: Open University, pp. 7-20.

Copyright:
Reproduced with permission. (Permission does not cover identifiable third party material).

2 DATABASE SYSTEMS

This section is about the concepts and features of database systems, and starts with an examination of some basic ideas and terminology.

2.1 Databases and the database approach

2. 1.1 Databases

We shall begin by considering a typical environment in which a database could be used. It is a simplified and fictional example, but is based on a real situation, and will be used extensively during the course. We shall refer to it as the University example.1 Here is a description of the University environment:

An open learning university needs to keep details of staff and students, the courses which are available and the performance of students on courses. Students are initially registered and issued with a student identification number. Students are not required to enrol on any courses at registration. The region in which they are located is recorded along with the year of registration and the student's name. Each member of staff has a staff number and name recorded and also belongs to a region. Each staff member may act as a counsellor to one or more students and/or may tutor one or more students on courses. Each student has one counsellor and has a tutor for each course on which he/she is enrolled. A student is allocated a counsellor at the time they register. However, their allocation of a tutor for a course on which they are enrolled may take place sometime after enrolment. Staff members may only tutor and counsel students who are located in the same region as the member of staff.

Each course which is available for study is given a course code, a title and a value for credit—0.5 for what is termed a half-credit course and 1 for a full-credit course. A course may have a quota—the maximum number of students which can be enrolled on the course in any one year. A course may not (yet) have any students enrolled on it.

Students may not enrol for more than three credits' worth of courses at any one time. There is a need to record details of current enrolments only (a separate subsystem deals with archival data).

A full-credit course may have up to five TMAs (tutor-marked assignments) associated with it and a half-credit course up to three TMAS. Each TMA is numbered in the range 1 to 5. The grade of a student on an assignment for a given course is recorded as a mark out of 100.

The University is an organization which possesses three key features that indicate that a database is an appropriate part of the computer system for that organization.

The first feature is that the University needs to maintain persistent data. The notion of persistence arises because the times at which some item of data is used are often different to the time at which it was created; hence the data needs to persist from its creation to (at least) its final use. Persistent data is 'what the system needs to remember' in order to transform input data into output data via processing ('what the system does'). The University will need to maintain persistent data about:

  • things it is interested in: staff, students, courses, enrolments, etc.
  • properties (of these things): names (of staff and students), titles (of courses), grades (on a TMA for a student enrolled on a course), etc.
  • associations (between things): the association between staff and students termed counselling, the tutoring association between staff and students enrolled on courses, etc.

An important aspect of this need to maintain persistent data is that it must be done in a manner that ensures that only valid data is recorded. To explain this issue we can use the data recorded about a student as an example. Each student is issued with a student identification number when they register. This student identification number is unique in so far as no two students are allocated the same student identification number. Concentrating on this 'uniqueness' aspect of the student identification number, the issue of ensuring 'that only valid data is recorded' involves maintaining the persistent data about students in a way that preserves this uniqueness. Suppose that the persistent data about students is held in a file (named, say, StudentFile), made up of records (named, say, as Student), each of which holds data about a student. Each record is made up of fields named (say) StudentId (representing the student identification number), Name (representing the name of the studen t) and so on. In this situation, preserving the uniqueness aspect can be achieved if StudentId is the primary key of Student within StudentFile. The designation of this primary key2 guarantees that no two Student records may exist in StudentFile with the same value for StudentId. This is an example of what has been termed 'the capture of meaning': the process of taking an aspect of the organization that a data processing system supports or represents and ensuring that the meaning of that aspect is represented in some way in the data processing system. The features of the data processing system that allow meaning to be captured in this way are termed semantic features. So, for the example of student data, the primary key feature is a semantic feature.

It is important to realize what is being claimed here. We are not making the ambitious claim that a data processing system 'understands' students and student identification numbers. Rather, we make the more modest claim that the system will not maintain two occurrences of the record named Student in the file named StudentFile when the two occurrences have the same value for the field named StudentId. The data processing system then behaves in a way that mirrors (part of) the behaviour of student identification numbers in the University. The data processing system behaves in this way because the primary key feature constrains the file StudentFile to contain only valid combinations of the relevant data, i.e. only those combinations of Student records for which the StudentId fields are different—combinations in which the StudentId fields are the same are forbidden. For these reasons the primary key feature is said to be a constraint. As you will see during the course, constraints (such as primary keys) play an important part in relational database technology.

It is worth giving two other examples of this capture of meaning by means of a constraint. The first example concerns the statement in the description of the University that 'Each student has one counsellor ... A student is allocated a counsellor at the time they register.' That is to say, if you are registered as a student then you must have a counsellor. This could be represented by a constraint that ensured that data about a student (however it was represented) was never recorded without that data being associated with data about a counsellor (however it was represented). The constraint forbids the situation of data about a student being recorded without that data also being associated with data about a counsellor. The second example concerns the statement that 'Each TMA is numbered in the range 1 to 5.' This could be represented by a constraint that ensured that a TMA number was never recorded that lay outside that range.3

The second of the three key features is that this persistent data is structured and homogeneous. That is to say, for example, that the data about an individual student is structured into discrete items of data: a student identification number, a name, the year of registration and the region in which they are located. Not only is the data for one individual student structured in this fashion; the data for all students has the same structure—it is homogeneous. The structured and homogeneous nature of the data is typified by the fact that we can represent it in tabular form. Figure 2.1 shows some typical data about students in the form of a table.

StudentId Name Registered CounsellorNo Region
s01 Akeroyd 1988 3158 3
s02 Thompson 1993 5212 4
s05 Ellis 1992 5212 4
s07 Gillies 1991 3158 3
s09 Reeves 1993 5212 4
s10 Urbach 1992 5212 4
Figure 2.1 Data about students as a table

Notice that, as a consequence of the structured and homogeneous nature of the persistent data, the table consists of two parts. First, there is the heading of the table: the column labels StudentId (for student identification number), Name, Registered (year of registration), CounsellorNo (the staff number of a member of staff acting as a counsellor) and Region. This relates to the kind of data that is relevant for all students—a uniform structure that holds for all rows. Second, there are the individual rows of the table; each row containing the data for an individual student. Each row consists of values for the same kinds of data as the other rows—just one value in the StudentId column for each row, just one value in the Name column, and so on. The distinction between the two parts is of importance since the heading is resistant to the mundane business of students registering, studying and (eventually) leaving the University—all s tudents are the same in terms of the structure of their data—whilst, in contrast, the rows do change in rhythm with the routine of university life. The table in Figure 2.1 shows data about a mere six students. If a further six were to register, then the number of rows in the table would increase correspondingly. If the student with a student identification number of s01 moves house from region 3 to region 4, then the data in the first row would change accordingly. The data for other things of interest to the University will also exhibit this structured homogeneity. For example, Figure 2.2 shows the data about staff represented in a tabular fashion.

StaffNo Name Region
8431 French 4
7774 Redhead 4
5212 Heathcote 4
3158 Jennings 3
5324 Lai 4
Figure 2.2 Data about staff as a table

SAQ 2.1 How many individual members of staff are represented by the data in the table in Figure 2.2?

Solution Five.

Figure 2.3 shows, by way of contrast, a table in which some of the columns do not have homogeneous data.

StudentId Name Registered Counsellor Region
s01 Akeroyd 1988 3158 3
s02 Thompson 1993 Heathcote North
s05 Ellis 1992 5212 4
s07 Gillies 1991 3158 3
s09 Reeves 1993 5212 North
s10 Urbach 1992 Heathcote 4
Figure 2.3 An example of a table in which some columns are not homogeneous

The data in the first three columns of Figure 2.3 is homogeneous, since we get the same kind of entry in each column. However, in the other two columns this is not the case, since we get different kinds of entries in each column. In the Counsellor column we sometimes have a number (the staff number of the counsellor) and sometimes a name (the name of the counsellor).

SAQ 2.2 Explain why the Region column values in Figure 2.3 are not homogeneous.

Solution In the Region column we sometimes have a number for the region and sometimes a name.

The last of the three key features is that the University will have a number of different people—users—who need this persistent data for a variety of purposes. The description of the University environment did not mention these people but here are three typical uses (or applications, as they are often termed):

  1. Central clerks, who carry out activities such as registering students and enrolling registered students on courses, as well as amending, where necessary, TMA grades that have already been recorded for students enrolled on courses.

  2. Regional clerks, who do work such as allocating tutors to students enrolled on courses.

  3. Academics (tutors and counsellors), who carry out activities such as recording the grade of a student on a TMA or reviewing details of the students they counsel.

You will be given details of these different uses later in this book, when you carry out work using your computer. However, the important point to be made here is that these different applications imply different uses of the persistent data. By way of illustration, the central clerks' application uses data about students so that, for a given student, details of courses enrolled, tutors and TMA grades can be displayed, along with the name of the student, their region, counsellor and year of registration. Figure 2.4 gives an example. The regional clerks' application uses similar data but it is viewed differently so that, for example, details of enrolments for a student can be displayed. Figure 2.5 (overleaf) illustrates this requirement.

Student Id: s05
Name: Ellis
Registered: 1992
Counsellor No: 5212
Region: 4

Course Code Title Credit Tutor No TMA No Grade
c2 Syntax 0.5 5324 1 78
c2 Syntax 0.5 5324 2 63
x7 Pragmatics 0.5 5324 1 78
Figure 2.4 A view of data for use by the central clerks

Region: 4
Student Id: s05
Name: Ellis
Registered: 1992
Counsellor No: 5212
Counsellor Name: Heathcote

Course Code Course Name
c2 Syntax
c7 Pragmatics
Figure 2.5 A view of data for use by the regional clerks

If you compare Figures 2.4 and 2.5, you can see that they both use data about the same student, but viewed differently. They both list the student's details and they both list details of the courses on which the student is enrolled. In the case of the central clerks' application this is done by giving the course code, title, value for credit, tutor number, TMA number and grade for each TMA submitted by the student on a course for which the student is enrolled. So, in the case of the student displayed in Figure 2.4, the enrolment on the course with a course code value of c2 results in two lines being displayed by virtue of the student having submitted two TMAs for the course. In the case of the regional clerks' application, however, more summary details of the courses on which the student is enrolled are displayed—simply the course code and title, with no TMA detail. Thus there is a line for each of the two enrolments. Note that the values of title in Figure 2.5 are labelled 'Course Name' since t hat is the way in which regional clerks refer to such values (in contrast to central clerks, who use Title to refer to values of the title of a course). In summary, the two applications require similar persistent data, but structured and named in a way that enables it to be viewed and used differently.

SAQ 2.3 Are the details for a student and their counsellor exactly the same for the central clerks' and the regional clerks' data as illustrated in Figures 2.4 and 2.5?

Solution No—both list the counsellor number but the regional clerks' application also lists the counsellor's name.

The ability to support the three key features that we have just discussed (persistence, structured homogeneity and differing user requirements) constitutes a defining characteristic of a database.4 So, as a summary, here is a definition of a database:

A database is a collection of persistent data that is structured and homogeneous, and organized in such a way that all user data requirements can be satisfied by the database.

2.1.2 The database approach and database systems

Here is another definition-a definition of the database approach:

The database approach is a strategy for developing data processing systems, treating data as a shared organizational resource.

By 'data processing systems' we mean, loosely, systems characterized by structured, homogeneous, persistent data. The phrase 'treating data as an organizational resource' is more problematic. It is sometimes used, with high ambition, to argue that data is an organizational resource, since it is essential to the operation of an organization and costs money to create and maintain in a similar fashion to buildings, stock and machinery.5 We shall be more prosaic here and content ourselves with practical software outcomes (although we are still describing something of an ideal approach which has yet to be fully realized in all of its aspects). We can do this by contrasting the use of the database approach with the use of a file-based approach. Figure 2.6 depicts the University's data processing systems using a file-based approach.

The situation depicted is one in which each of the three applications has its own application-oriented file structure to provide its persistent data in the desired fashion. Four aspects of this situation are important for our account of the database approach:

  1. The applications do not share data. This arises from the fact that each application requires a different view of the data. Given the way in which application software interfaces directly with files, this results in three different file structures to meet three different needs. Consequently, data is duplicated across the three files. For example, all three files may record the fact that the student with the student identification number of s05 is named Ellis. This may be wasteful of storage space but, more importantly, it gives rise to the spectre of inconsistent data. Suppose the student changes their name from Ellis to Lopez and this change is recorded in one of the three files but not in the other two. One file records that the name of the student is Lopez but the other two have the name as Ellis. Which is to be believed?

  2. Constraints are expressed as part of the (procedural) application software code. For example, the constraint that each student has one counsellor will be represented in the procedural application code as a validation check to be performed every time the system records a student registration. In addition, the check will need to be performed at other instances involving a change to the association between a student and a counsellor, such as the student moving region and therefore requiring a new counsellor. A simple constraint is expressed as procedural code, potentially replicated in the software. This causes two major problems. First, the constraint is not explicit and comprehensible—it is buried in program code. Second, if the constraint needs to be changed (say, the University decides students need not always have a counsellor) then a costly and error-prone exercise in identifying and modifying all replicas needs to be carried out.

  3. There is an inability to respond to change. We have already seen an example of this with the case of changing a constraint. However, there is a wider dimension. Organizations like the University evolve, changing their requirements for the kinds of persistent data that need to be maintained. For example, the University may decide that data recorded about staff must now include the home telephone number of each member of staff. A change such as this—a change to the structure of data—is not easily achieved in a file-based approach. This arises from the fact that the structure of the data will be part of the application software code, by virtue of the file structure(s) expressed as part of that code. This causes a problem whenever the structure needs to be changed since this requires a change to the code—a process which, experience tells us, can be costly and error-prone.

    We have described this inability to respond to change in terms of a change to the logical structure of data. However, it is also a problem at the more mundane level of a change in the physical storage of data or in access methods. Such changes also result in the costly and error-prone business of changing the code.

  4. There is a lack of central control. The problems discussed above are exacerbated by the fact that there are three applications. Consider, as an example, the constraint that TMA numbers must be in the range 1 to 5. This amounts to a policy for all areas of the University that deal with TMA numbers. Enforcing this policy requires procedural code to be replicated within software within and across three applications, with all the attendant problems of co-ordination and consistency. Similar problems to those with constraints will arise with access control: the control and limitation of the dissemination of data to authorized users. For example, suppose that it is a University decision to restrict access to a student's TMA grades (marks) to the student's counsellor, the student's tutor for the course to which the TMA relates, and to certain central and regional clerks who require access. This amounts to an access control policy for the University in respect of a student's TMA grades. This access control policy will be represented as procedural code in application software, replicated within and across the three applications. As with constraints, this brings all the attendant problems of co-ordination and consistency.

In contrast, Figure 2.7 gives a depiction of the University's data processing systems after the use of the database approach.

Now, the three applications share the persistent data. This sharing is achieved via the use of a piece of general-purpose software known as a database management system (DBMS) that stands between the application software and the shared persistent data. A DBMS carries out a range of functions, which we shall examine in more detail later, but two things need to be emphasized at this stage. One, the DBMS stands between the application software and the storage structures that contain the persistent data (as opposed to the situation before, in which the application software interfaced directly with files). All access to the database is via the DBMS. Two, the DBMS makes use of a definition of the shared persistent data. This definition exists independently of the application software (as opposed to the situation before, in which persistent data structures were firmly associated with application program code).

It is worth going back over the four aspects we considered before to see the consequences of the introduction of the database approach.

  1. Applications can share data. The DBMS can produce the differing views required for the various applications from the same underlying data. The spectre of inconsistent data through duplication is removed. Indeed, controlled redundancy can be introduced. That is to say, if there is a need to replicate data (say, for efficiency reasons or for ease of use) then this may be carried out since the replication can be recorded as part of the definition of the shared data. Extra storage space will be required, but the DBMS is in a position to ensure that all replicas are maintained consistently.

  2. Constraints are firmly under the control of the DBMS. As we shall describe later, constraints can be explicitly expressed as part of the definition of shared data, thus minimizing the reliance on procedural code as a way of expressing semantics. For example, the constraint that represents the rule that each student has one counsellor would be expressed once in the central definition of data and would no longer need to be replicated as procedural code in application software. The overall effect of this aspect is to simplify the tasks that involve application program code, and the result should be that better applications development is possible with the database approach than without it.

  3. There is an ability to respond to change. Changes to the structure of data, for whatever reason, are dealt with by making a change in one central place—the definition of data used by the DBMS. This minimizes the amount of work required and makes it less error-prone. Indeed, as we shall see later, the detailed structure of this definition of data used by the DBMS allows the impact of any change on application program code to be kept to a minimum.

  4. There is central control, by virtue of the central position accorded to the DBMS and its definition of the shared persistent data. Given that the DBMS has mechanisms for implementing access control (say, expressed as part of the definition of data), then access control policies can be centrally enforced.
Exercise 2.1
The lack of central control with the file-based approach created a problem whenever it was necessary to replicate a constraint as a policy within and across applications. How is this problem tackled with the database
approach?

We are now in a position to give a definition of a database system.

A database system is a computer system developed using the database approach for a variety of uses, incorporating a database and a DBMS.

2.2 Database management systems

It is now appropriate to examine the idea of a DBMS in more detail. We have stated that a DBMS is a piece of general-purpose software. Specifically, we mean that any particular DBMS will carry out a range of functions that can be used for any set of applications using shared persistent data for which the database approach is appropriate.

2.2.1 The functions of a DBMS

Here are the functions of this piece of general-purpose software known as a DBMS:

Data definition. A DBMS must be capable of taking a description of persistent data, as specified by a person, and transforming it into an internal form that can be efficiently processed by a DBMS. This description of data constitutes the definition of shared data of Figure 2.7. The detail of the language used to express this description will vary with the kind of DBMS used, but the language win allow for the linguistic expression of data about:

  • things
  • properties of things
  • associations between things.

The language used—and the constructs expressed through that language—will be independent of those employed in programming languages for describing data structures. We shall emphasize this point by using the technical term schema for the description of the persistent data, rather than the (somewhat) contrived 'definition of shared data'. Each database, such as that for the University database system of Figure 2.7, will typically have one schema6 describing the shared persistent data. The language used to describe a schema is known as a data definition language, usually referred to as a DDL. A DBMS will process a source schema (a formal description of data suitable for people) into its internal form, an object schema (a formal description of data suitable for a machine).

Data manipulation. A DBMS must support the manipulation of data in a database, described by a schema, by application software. This is achieved by a data manipulation language, usually referred to as a DML. As a minimum, a DML will support the insertion, deletion, amendment and retrieval of data in the database. Depending on the kind of DBMS used, a DML may also support more sophisticated manipulation of data, such as the production of summaries.

Access control. In order to control the use of data in a database, a DBMS will provide mechanisms to define and enforce access control to persistent data. This is typically achieved via the ability to declare appropriate access control mechanisms as part of a schema. That is to say, the DDL allows the definition of access control, as well as the structure of data. Note that the access control function is particularly important in systems which are shared by many users.7

Constraint definition. A DBMS will provide the ability to define and enforce constraints. Again, this is usually done via the schema. Thus, a DDL allows the definition of constraints, as well as mechanisms for access control and the structure of data.

Restructure and reorganize. The schema for any database is unlikely to remain unchanged. New applications, requiring new types of data, will be developed. Existing applications will evolve, changing their requirements for data. Consequently, a schema will need to be restructured and a DBMS needs to support changes both to a schema and to the data in the associated database. The physical storage of data will also change. Access methods (such as hashing and the use of indexes) will change in response to evolving usage patterns, for example. There will be a need to reorganize the stored database and a DBMS needs to support this task. A DBMS may also support other tasks in this area, such as the monitoring of storage space and the use of access methods.

Transaction support. A DBMS must support the concept of a transaction: a unit of work that either succeeds completely or fails completely. Support for transactions is necessary when an application program needs to carry out work which effects multiple records8 within a database. For example, an application program may need to amend data in one record in order to delete some other, second, record. If, for some reason, the deletion of the second record cannot be carried out, then any amendment to the first record is incorrect and should be reversed. The actions carried out on the two records comprise a logical unit of work that either succeeds completely or fails completely. If a series of database actions are defined within an application program as a transaction, a DBMS guarantees that a situation in which only some of the actions are effected will not occur.

Recovery. In keeping with any other managed organizational software resource, the DBMS will provide facilities to recover from the loss and/or corruption of persistent data which may occur due to hardware or software failures.

Concurrency support. In keeping with any other shared software resource, a DBMS will support concurrent usage. That is to say, it will support the shared use of the database by multiple users in a way that ensures that the database actions of one user do not corrupt the database actions of other users. Support for concurrent usage typically ensures that transactions wishing to access the same data do not interfere with one another.

2.2.2 DBMS approaches

There are hundreds (if not thousands) of DBMSs commercially available. Despite this apparent diversity, DBMSs can be classified according to the general approach they take to the two functions of data definition and data manipulation. Viewed this way, there are only a limited number of DBMS approaches and any given DBMS can be classified as an example of a particular approach. Three historically significant approaches are:

  • the hierarchic approach, typified by IBM's product IMS

  • the network approach, typified by products based on the work of various Codasyl committees9

  • the relational approach, which is the dominant DBMS approach of the 1990s and the subject of this course.

As you might expect from what has been said, the relational approach is supported by a range of DBMSs. For example, the DBQ system used in this course supports the relational DBMS approach. Other DBMSs support the relational approach: for example, DB2, Oracle and Ingres.

The details of the relational approach will be explored throughout this course, but it is useful to give a very brief overview here. In terms of data definition, data is represented as tables10 (and nothing but tables). For example, using the University example we might use the Student table and the Staff table as the means for representing data about students and staff respectively Examples of these two tables are given in Figure 2.8.

Student
StudentId
Name Registered CounsellorNo Region
s01 Akeroyd 1988 3158 3
s02 Thompson 1993 5212 4
s05 Ellis 1992 5212 4
s07 Gillies 1991 3158 3
s09 Reeves 1993 5212 4
Urbach 1992 5212 4

Staff
StaffNo
Name Region
8431 French 4
7774 Redhead 4
5212 Heathcote 4
3158 Jennings 3
5324 Lai 4
Figure 2.8 The Student and Staff tables

In Section 2.1.1, we classified persistent data as data about things, data about properties (of things) and data about associations (between things). The relational approach allows this trinity to be represented in the following fashion.11

Things. Each of the things the University is interested in recording data about is represented by a table. So in Figure 2.8 we have data about staff represented by the Staff table and data about students represented by the Student table, with the distinction between the column headings and the individual rows that we discussed in Section 2.1.1.

Properties. Properties of things are represented by the individual values recorded in each table. For example, the fact that the name of the student with a student identification number of s0l is Akeroyd is represented by the Name value for the row with the StudentId value of s0l is Akeroyd, in the Student table.

Associations. Associations between things are represented by the values in one column of a table referencing the values in a column of another table. Thus, the counselling association is represented by the values in the CounsellorNo column of the Student table referencing the values of StaffNo in the Staff table. For example, the first row of the Student table references the fourth row of the Staff table, by virtue of the CounsellorNo value of 3158 in the Student table referencing the StaffNo value of 3158 in the Stafftable. Similarly, the fourth row of the Student table also references this same Staff row. In terms of the data depicted in Figure 2.8, the member of staff with a StaffNo value of 3158 counsels two students.

SAQ 2.4 In terms of the data depicted in Figure 2.8, how many students are counselled by the member of staff:

  1. with a StaffNo value of 5212?
  2. with a StaftNo value of 8431?

Solution

  1. 4
  2. 0

In terms of data manipulation, a DML supported by the relational approach operates on whole tables at a time and always yields a table as the result of any particular operation. Here is an example in the DML for the SQL language, which you will use in Section 2.4 and study in detail in Books 4 and 5.

SELECT NAME, COUNSELLORNO, REGION FROM STUDENT WHERE REGISTERED > 1992

This DML code produces, as the result, just the NAME, COUNSELLORNO and REGION columns for those rows from the STUDENT table for which the value of the REGISTERED column is greater than 1992. That is, when applied to the STUDENT table of Figure 2.8, the table depicted in Figure 2.9 is produced.

Name CounsellorNo Region
Thompson 5212 4
Reeves 5212 4
Figure 2.9 The resultant table


Online Databases  • Broadcast and Computer-Based  • Technology  • Home  • Top 



The World Bank Site
The World Bank Site

Last Updated: April 1999