Database Design

Information Systems & Conceptual Schema

Introduction

Conceptual Schema are diagrams used in the planning and design of databases.  These diagrams have their own set of symbols and rules.  An understanding of conceptual schema diagrams and how to produce them is very useful to anybody intending to construct complex databases such as those involving relationships between tables in MS Access.  The term conceptual schema is often abbreviated to CS.

Problem Solving With Databases

You will already be familiar with Polya's 4 step process for problem solving from the work you have done when studying the Algorithm & Programming section of the IPT course in Y11. This process can be applied to problem solving with databases when producing an information system.

George Polya was a well known Mathematician at Stanford University in the USA.  Polya was fascinated by the way in which people solve problems and wrote widely on the subject.  His most famous book was called How To Solve It and it was in this work that he described the four common steps taken when solving a problem.  These are described below: -

1. Define the problem

Define the problem so that its is clearly understood and time is not wasted in following false directions.

In database design the problem definition phase can be divided into two further parts.  Firstly, problem identification which is a general description of the problem outlining the benefits of using an information system to solve this specific problem.  Secondly, conceptualisation which is a statement describing the aim of the information systems and any specific objective associated with it.

2. Plan the solution

Design an algorithm to solve the problem.

In database design the solution planning phase is known as formulisation.  The planning for an information system involves the development of a conceptual schema.  This in turn leads to the development of what is called a relational schema.  Conceptual schema will be described in more detail later.

3. Implement the solution

Implement the solution means actioning the planned algorithm.  When working with Visual Basic this is writing, testing and debugging code.

In database design it means realisation of a solution by creating an actual working information system using a database tool such as MS Access.  This process may involve: -

  • defining tables including data types and field lengths
  • adding relationships between tables
  • populating tables by carefully adding data
  • creating appropriate queries to question the database
  • creating necessary forms and reports
  • removing unwanted user components from the Access GUI.

In addition, technical documentation and a user manual is considered advisable.

4. Evaluate the solution

Evaluate the solution to make sure that fulfils its intended purpose as describes in the problem definition phase.

In database design this involves the use of test data to check that the information system works as intended.  A formal evaluation may be produced in written form.  It would be expected to include remarks on performance, reliability and user feedback. 

Conceptual Schema

Just as Polya's four step process outlines the method by which which problems should be effectively solved so there is another process that outlines the most effective way to create conceptual schema and relational schema.  This method is also named after its creator, or at least one of them, and is called Nijssen's Information Analysis Method (NIAM).

NIAM was, in fact jointly developed by two Professors, G M Nijsssen and E D Falkenberg.  Their process for the analysis and representation of information is frequently represented in a number of ways.  However, for practical purposes it can best be summarised in six steps: -

1. State the elementary facts.

2. Sketch an initial draft of the conceptual schema diagram.

3. Eliminate surplus entities and identify derived facts.

4. Add uniqueness constraints.

5. Add mandatory roles and cardinality constraints.

6. Check the conceptual schema by populating it with examples.

Let us look at each of these six stages in turn.

State the elementary facts

Elementary facts are based upon a given specific body of information on a related subject.  This body of information is referred to as the universe of discourse often abbreviated to UoD.  An elementary fact is a simple, unambiguous sentence that makes a statement about the universe of discourse. For example: -

Eric Clapton plays guitar.

This fact is said to be a binary fact because it associates two entities.  An entity is a thing in the information system.  In other words it could have been written: -

The musician with the name Eric Clapton plays the instrument with the name guitar.

In the sentence above the words "musician" and "instrument" define the types of entity are  classified as entity types.  An entity is frequently identified with a label.  A label is a way of identifying an entity.  Common label types include numbers, ID codes and names.  A label instance is a single occurrence of such a label.  In the example above, the word "name", which appears twice, is a label type, whilst the words "Eric Clapton" and "guitar" are label instances.  The final significant component of the elementary fact above is called the predicate or role. A predicate or role (the two terms are interchangeable) defines the action performed by an entity.  In the example above the predicate is the word "plays".

An elementary fact might contain three entities in which case it would be called a ternary fact.  Look at the example below: -

The musician with the name Sting was born in the birth country with the name England in the birth year with the number 1961.

Without any loss of meaning this particular ternary fact can be divided into two binary facts.  This process is called reduction and the ternary fact is said to have been reduced into two binary facts.  Thus: -

The musician with the name Sting was born in the birth country with the name England.

The musician with the name Sting was born in the birth year with the number 1961.

An eliminatory fact containing four entities is called a quaternary fact. An elementary fact containing more than five entities is called an n-ary fact for whatever number of entities it may in theory contain.

Some ternary facts cannot be reduced because to do so would destroy the meaning of the fact.  Look at the table below and in particular the highlighted row: -

Country City Population
Australia Newcastle 428 000
England Newcastle 284 000
USA Newcastle 38 000
Australia Perth 1 240 000
Scotland Perth 42 000

A country with the name Australia contains a city with the name Perth. 

A city with the name Perth has a population with the number 1 240 000. 

The creation of two binary facts by reduction of the ternary fact has led to inaccuracy.  Only specific cities called Perth have a population of 1 450 000 in our example, that in Australia.  In this case the entities cannot be split at all.  To accommodate this the elementary facts are combined to create a nested binary fact.  In the case of the example above this would results in: -

A country with the name Australia contains a city with the name Perth that has a population with the number 1 450 000.

This represents a combination of:-

(A country with the name Australia contains a city with the name Perth) has a population with the number 1 450 000.

Sketch an initial draft of the conceptual schema diagram

Entities are represented by a symbol in the shape of an ellipse.  Predicates are represented by symbols in the shape of a rectangular box.  This predicate box is divided in half with a vertical line and each half represents the role played by the nearest entity.  Usually only one half of the box is filled in as the role performed by the adjoining half can usually be inferred from the completed half.

Labels are represented with an elliptical symbol drawn with a broken line.

An entity linked to an entity represents a fact whilst an entity linked to a label represents a reference.  To save space it is usual to condense references so that they are contained within the entity symbol.  When this is done the label is represented in parentheses underneath the entity type.  The diagram below follows this convention and is made possible because it is clear that the label type has a 1:1 relationship with the entity.  In other words, there is only one musician with a particular name.

Look at the following table containing four columns that can be reduced to two binary facts and a corresponding single reference.

ID Musician Album Year of Release
0101 Michael Jackson Dangerous 1991
0102 Phil Collins Phil Collins Hits 1998
0103 Sheryl Crow Tuesday Night Music Club 1993
0104 Enya The Celts 1992
0105 Shania Twain Come On Over 1998

 

Notice how the ID number has been condensed into the ID entity.  This has been done because the ID number is a unique identifier whilst it is possible that the musician's name might appear more than once if they had released more than one album.

In the next example the information contained in the table can be written as a ternary fact but cannot be reduced into two binary facts without loss of accurate meaning.  In this case it is necessary to produce nested elementary facts.  The resulting combined entity is by convention represented by drawing an entity ellipse around the role box.

Country City Population
Australia Newcastle 428 000
England Newcastle 284 000
South Africa Newcastle 38 000
Australia Perth 1 240 000
Scotland Perth 42 000

The combined entity has a separate relationship with the entity population. 

Eliminate surplus entities and identify derived facts

Surplus entities

Care needs to be taken when constructing CS diagrams to keep the number of entities to a minimum.  The goal is to have as few entities as possible.  The table below contains an example of information that might lead to the construction of surplus entities.

Musician Birth Year Death Year
Jimi Hendrix 1942 1970
Elvis Presley 1935 1969
Michael Hutchins 1961 1999
Buddy Holly 1938 1959

It would be tempting to represent the information contained in this table like this: -

However, birth year and death year could be thought of as a single entity namely year.  This would simplify the above diagram so: -

Derived facts

Derived facts are those that can be calculated using existing information.  For example, in the table below the musician's age at death can be calculated by subtracting the birth year from the death year.

Musician Birth Year Death Year Died Age
Jimi Hendrix 1942 1970 28
Elvis Presley 1935 1977 42
Bob Marley 1945 1981 36
Buddy Holly 1938 1959 21

A derived fact is identified on a CS diagram by the inclusion of an asterisk beside the predicate box placed nearest the derived entity.  

Add uniqueness constraints

Uniqueness constraints are import in CS diagrams because they are used to avoid data redundancy.  Efficient database design involves the removal of unnecessarily duplicated data.  A uniqueness constraint is represented by a bar usually (but not exclusively) placed above the role box.  The bar represents "one" whilst absence of a bar indicates "many".  

In the diagram below we have a one:many relationship in which a single musician might belong to a number of groups.  For example Paul McCartney belonged to both The Beatles and Wings.

In the next diagram we have a many:one relationship in which a number of musicians only belong to a single group.  If there is more than one group in the table the musician's name may appear several times but each group's name only once.

A one:one relationship has bars above each half of the role box that are separated in the middle.  So in the example below a musician's name and a group's name would only appear once in each column.

In a many:many relationship there are no uniqueness constraints and therefore no bars.  This would be the case if, for example, a number of musicians belong to a number of groups.  The musician's names and the group's names appear more than once in each column of a table.  If uniqueness is achieved though a combination, in other words two entities can be combined to form a single entity, then uniqueness is demonstrated by drawing the bar across the role box without a break in the centre (not illustrated). 

CS diagrams appear more complex when uniqueness constraints are applied to nested elementary facts.  Notice the uniqueness constraint is placed inside the entity ellipse.   The uniqueness constraints in the diagram below indicate that there is only one city of a given name associated with a particular country.  Further the city only has a single population.  

Country City Population
Australia Newcastle 428 000
England Newcastle 284 000
USA Newcastle 38 000
Australia Perth 1 240 000
Scotland Perth 42 000

Although in reality we know this to be inaccurate (there is more than one Newcastle in England, Australia and America) we can take it be be true if told to assume that the example contains significant data.   The term significant indicates there is enough information to be able to make a decision about uniqueness and data type.

Add mandatory roles and cardinality constraints

Mandatory roles

A mandatory role is a relationship between a role and an entity that must be executed.  These are represented on CS diagrams by a "man-dot".  This is a small black circle placed on the entity symbol nearest the role that must be performed.  Non mandatory roles are called optional and are represented by the absence of a man-dot.  Optional roles tend to be obvious as in tabular information they cause gaps.  This might be because the data is unavailable perhaps because it is not required, has been lost, or has yet to be generated. It is possible for an entity to have several roles, some of which are mandatory and others which are not.

Year Winner
1995 Germany
1996 Switzerland
1997  
1998 Japan
1998 Japan

Cardinality constraints

Cardinality constraints identify permitted values or limits to specific labels.  If the cardinality (countable number) is quite small then the legal labels are normally listed in braces (curly brackets).

Check the conceptual schema by populating it with examples

This final step in the process involves returning to the elementary facts which were stated and testing to see if each fits into the CS design that has been produced.  Each fact must appear on the CS diagram but importantly, only once.  Also, extra facts that were not identified as elementary facts, should not be shown on a CS diagram.

BackBack

03 September, 2000