Introduction to Databases

A database is a collection of data that is stored, managed, and accessed electronically. It typically contains one or more tables, which resemble Excel or Google sheets, with data organized into rows and columns. As the number of tables grows, managing the database can become complex, making it essential to design a scalable and maintainable strucutre.

Databases can be broadly categorized into two types: Relational database, primarily used for real-time data updates and transactions, and data warehouse, optimized for data analysis and reporting purposes.


Data Modeling

Data modeling is the process of creating a conceptual representation of data structures, relationships, and rule to meet business requirements. It serves as a blueprint for designing and managing a database.

There are three types of data models: conceptual, logical, and physical. The conceptual data model provides a high-level overview, specifying relationships between objects without technical details. The logical data model refine the conceptual data model by adding more detail, such as fields for each object. The physical data model describe the internal schema of the database, including table structures, storage details, and implementation specifics.

---
title: Conceptual Data Model
---
erDiagram
    Customer ||--o{ Order : places
    Order ||--|{ Product : contains
---
title: Logical Data Model
---
erDiagram
    Customer {
        int customerId
        string name
        string email
    }
    Order {
        int orderId
        date orderDate
        float totalAmount
    }
    Product {
        int productId
        string productName
        float price
    }
    Customer ||--o{ Order : places
    Order ||--|{ Product : contains
---
title: Physical Data Model
---
erDiagram
    Customer {
        int customerId PK
        varchar name
        varchar email
    }
    Order {
        int orderId PK
        date orderDate
        float totalAmount
        int customerId FK
    }
    Product {
        int productId PK
        varchar productName
        decimal price
    }
    OrderItem {
        int orderItemId PK
        int orderId FK
        int productId FK
        int quantity
    }
    Customer ||--o{ Order : places
    Order ||--|{ OrderItem : includes
    OrderItem ||--|{ Product : contains

Example: University Enrollment Management System

Entities

In database design, an entity is a distinct object that typically represents real-world objects, events, or concepts. Entities are represented as tables in a relational database, and they are similar to classes in object-oriented design. Each row in the table represents an instance of the entity. Each entity has attributes that describe its characteristics. For example, a Student entity may have attributes like student_id, first_name, last_name, and date_of_birth, among others. Entities can be related to one another, and these relationships are often captured using foreign keys or relationship tables, which link entities together based on logical associations. For example, a Student might be linked to a Course through an enrollment relationship.


Identifying Entities

Identifying entities in database design is a crucial step, as entities form the foundation of a database. A useful approach for identifying entities is to clarify the design objectives and look for nouns in the objective statement. For example, consider the University Enrollment Management System. The objective statement might be:

“The University Enrollment Management System” is designed to help a university track enrolled students, the subjects they are taking, and the teachers assigned to those subjects".

In this sentence, the nouns university, student, subject, and teacher are potential entities, as they represent key objects or concepts that the system needs to track.

---
title: Identifying Entities
---
erDiagram
    UNIVERSITY 
    STUDENT
erDiagram
    TEACHER
    SUBJECT

Defining Attributes

In database design, an attribute describes the characteristics of an entity. In a relational database, attributes are typically represented as columns in a table. Defining entity attributes is a crucial step in database design, as it ensures that the entity captures all the necessary details required by the system. When defining attributes, it’s important to consider what properties or characteristics the entity possesses.

For example, in a University Enrollment Management System, the Student entity might have attributes like first_name, last_name, date_of_birth, and address. After identifying the necessary attributes, the next step is to specify the approapriate data type for each attribute. For instance, attributes like first_name, last_name, and address would typically use the TEXT data type, while the date_of_birth attribute would be defined with the DATE data type.

---
title: Defining Attributes
---
erDiagram
    UNIVERSITY {
        string name
        string address
    }
    STUDENT {
        string first_name
        string last_name
        date date_of_birth
        string address
    }
erDiagram
    SUBJECT {
        string subject_name
        string category
        string students_enrolled
    }

    TEACHER {
        string first_name
        string last_name
        date date_of_birth
        string address
        string subject_taught
    }    

Normalization

Normalization in database design is the process of organizing the attributes of entities and their relationships to reduce data redundency, avoid operational anomalies, and improve data integrity. Normalization involves decomposing a large, complex database into smaller, simpler tables while preserving the relationships between entities. This decomposition follows a set of rules called normal forms. In real-world database design, the three most commonly used normal normal forms are First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF).

First Normal Form

First Normal Form (1NF) is a property of a relational database table that ensures the table is structured properly. It follows the following conditions:

  1. Each column must contain atomic values, meaning that the value in a field should be indivisible and not contain sets or lists of values.
  2. Each row must be unique, which can be achieved by using a primary key to uniquely identify each record.
  3. All values in single column should be of the same data type.

A Primary Key is a column (or a set of columns) in a relational database table that uniquely identifies each row in that table. It satisfies the following conditions:

  1. The value in the primary key must be unique for each row.
  2. The value in the primary key must not be null for any record.
  3. The value in the primary key should generally be immutable, meaning it should not change once assigned.

When designing tables, assess if any combination of columns can be used to uniquely identify a record. If such a combination exists, those columns can form the primary key. If no suitable combination is found, create a new column to serve as the primary key.


A primary key can be a set of columns when a single column is not sufficient to uniquely identify a row. For example, in the following table, neither the StudentID nor the CourseID alone can uniquely identify a row, because a student might enroll mutiple courses, and a course can have multiple students. However, the combination of both columns can uniquely identify each enrollment record.

StudentID CourseID EnrollmentDate
1 CS101 2024-01-01
1 MATH101 2024-01-02
2 CS101 2024-01-03

1NF: Student

In the following Student table, there is no single field that can uniquely identify a row. Therefore, it is necessary to add a new field. such as StudentID, to ensure each row has a unique identifier. Additionally, the Address field may contain multiple values, such as unit number, street number, street name, suburb, city, state, country, and postcode. To comply with the First Normal Form (1NF), the Address firld must be divided into multiple separate columns, ensuring that each column contains only atomic values.

---
title: Violation of 1NF - Student
---
erDiagram
    STUDENT {
        string first_name
        string last_name
        date date_of_birth
        string address
    }    
---
title: Comply with 1NF - Student
---
erDiagram
    STUDENT {
        number student_id PK
        string first_name
        string last_name
        date date_of_birth
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
    }    

1NF: Subject

---
title: Violation of 1NF - Subject
---
erDiagram
    SUBJECT {
        string subject_name
        string category
        string students_enrolled
    }
---
title: Comply with 1NF - Subject
---
erDiagram
    SUBJECT {
        number subject_id PK
        string subject_name
        string category
        string students_enrolled
    }

1NF: Teacher

---
title: Violation of 1NF - Teacher
---
erDiagram
    TEACHER {
        string first_name
        string last_name
        date date_of_birth
        string address
        string subject_taught
    }    
---
title: Comply with 1NF - Teacher
---
erDiagram
    TEACHER {
        number teacher_id PK
        string first_name
        string last_name
        date date_of_birth
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
        string subject_taught
    }    

1NF: University

---
title: Violation of 1NF - University
---
erDiagram
    UNIVERSITY {
        string name
        string address
    }
---
title: Comply with 1NF - University
---
erDiagram
    UNIVERSITY {
        number university_id PK
        string name
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
    }    

Second Normal Form

The Second Normal Form (2NF) in database design is a level of normalization that builds upon the First Normal Form (1NF). A table is in 2NF if every non-key attribute is fully functionally dependent on the entire primary key, not just part of it, particularly in tables with composite primary keys (a key made up of two or more columns).

Relationships

Relationship in database design refers to the association between different entities (tables) in a database. It defines how data in one table is related to data in another table, enabling efficient organization and retrival of data. Relationships are established through a foreign key, which is a column in one table that references the primary key of another table. There are four main types of relationships: One-to-One (1:1), One-to-Many (1:N), Many-to-Many (N:M), and Self-Reference.

In a One to One relationship, each record in one table is related to exactly one record in another table. For example, consider a Person table and a Passport table, where each person has one unique passport, and each passport is associated with only one person. This type of relationship is relatively rare in real-world database design.

The One to Many relationship is the most common type in real-world database design. In a 1:N relationship, one record in a table can be linked to multiple records in another table. For example, in a Department table and an Employee table, one department can have many employees, but each employee blongs to exactly one department. This is a typical relational model for most business applications.

The Many-to-Many relationship is common but often requires an intermediary table to properly represent the association. In an N:M relationship, multiple records in one table can be asoociated with multiple records in another table. For instance, in a Student table and a Course table, a student can enroll in many courses, and each course can have many students enrolled. This relationship is typically handled by a joining table, such as Student_Course.

In a Self-Referencing relationship, a table is related to itself. In other words, a record in the table is associated with another record in the same table. For example, in an Employee table, each employee has a Manager, and that manager may also have a manager. This type of relationship is useful for representing hierarchical structures, such as organizational charts.


In database design, a foreign key is a column that references the primary key of another table. It is used to establish a relationship between two tables, ensuring referential integrity and maintaining consistent data across the database.


2NF: Subject & Category

In the following Subject table, the subject_name is dependent on the primary key subject_id, but the category is not fully dependent on the primary key, which violates 2NF. To resolve this issue, a new Category table should be created with two fields: category_id as the primary key and subject_id as a foreign key. The two tables have a One-to-Many relationship, where each category can have many subjects, while each subject bleongs to exactly one category.

---
title: Violation of 2NF - Subject
---
erDiagram
    SUBJECT {
        number subject_id PK
        string subject_name
        string category
    }
---
title: Comply with 2NF - Subject
---
erDiagram
    SUBJECT {
        number subject_id PK
        string subject_name
        number category_id FK
    }
    CATEGORY {
        number category_id PK
        string category_name
    }

    CATEGORY ||--|{ SUBJECT : ""

2NF: Student & Subject

In the following Subject table, the subject_id and students_enrolled create a Many-to-Many relationship. However, relational databases should not represent Many-to-Many relationships directly in s single table. This is because a single table cannot handle multiple instances of related data (such as multiple students enrolled in multiple subjects) within a single field. To resolve this issue, a junction table is introduced. It breaks down the Many-to-Many relationship into two One-to-Many relationships, ensuring data integrity and normalization.

---
title: Violation of 2NF - Subject
---
erDiagram
    SUBJECT {
        number subject_id PK
        string subject_name
        string students_enrolled
    }
subject_id subject_name students_name
1 Math Alice, Bob, Charlie
2 History David, Emily, Frank, Grace
3 Science Helen, Ian

In the above table, the students_enrolled field violates First Normal Form (1NF), which requires each column to contain only atomic values. Additionally, this design leads to data redundancy. For example, if the student Alice enrolls in multiple courses such as Math, History, and Science. her name will be repeated in each row, causing unnecessary duplication of data.

Furthermore, this design makes data manipulation more complex. For instance, when a student drops a course or a new student enrolls in a course, the list of students in the students_enrolled column must be manually updated, which is error-prone and inefficient.


---
title: Comply with 2NF - Subject
---
erDiagram
    SUBJECT {
        number subject_id PK
        string subject_name
    }
    STUDENT {
        number student_id PK
        string students_enrolled
    }
    ENROLLMENT {
        number subject_id FK
        number student_id FK
    }

    SUBJECT ||--|{ ENROLLMENT : "has"
    STUDENT ||--|{ ENROLLMENT : "enrolled in"

Comply with 2NF: Real-World Example

2NF: Teacher

---
title: Violation of 2NF - Teacher (subject_taught)
---
erDiagram
    TEACHER {
        number teacher_id
        string first_name
        string last_name
        date date_of_birth
        string subject_taught
    }    
---
title: Comply with 2NF - Teacher
---
erDiagram
    TEACHER {
        number teacher_id
        string first_name
        string last_name
        date date_of_birth
    }    

    SUBJECT {
        number subject_id PK
        string subject_name
        number category_id FK        
        number teacher_id FK
    }

    TEACHER ||--|{ SUBJECT : "teaches"

2NF: University

---
title: Comply with 2NF - UNIVERSITY
---
erDiagram
    UNIVERSITY {
        number university_id PK
        string name        
    }

    SUBJECT {
        number subject_id PK
        string subject_name
        number category_id FK        
        number teacher_id FK
        number university_id FK
    }

    UNIVERSITY ||--|{ SUBJECT : ""

Third Normal From

Third Normal Form (3NF) is a level of normalization that satisfies Second Normal Form (2NF) and eliminates transitive dependencies. In simpler terms, a non-key attribute in a table should not depend on another non-key attrubute. For example, if A $\rightarrow$ B and B $\rightarrow$ C, then A $\rightarrow$ C, which means C should be moved to a separate table to avoid redundency. This process helps reduce data redundency and improves the efficiency of relational databases.

3NF: Student

In the following Student table, the non-key attributes (unit_number, street_number, street_name, suburb, city, state, country, and postcode) are not directly dependent on the primary key student_id; instead, they depend on the student’s address. This results in a transitive dependency, which violates 3NF. Furthermore, since multiple students may share the same address, these address attributes will be repeated for each student, leading to data redundency.

---
title: Violation of 3NF - Student
---
erDiagram
    STUDENT {
        number student_id PK
        string first_name
        string last_name
        date date_of_birth
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
    }
---
title: Comply with 3NF - Student
---
erDiagram
    STUDENT {
        number student_id PK
        string first_name
        string last_name
        date date_of_birth
        number address_id FK
    }

    ADDRESS {
        number address_id PK
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
    }

    STUDENT ||--|{ ADDRESS : ""

3NF: Teacher

---
title: Comply with 3NF - Teacher
---
erDiagram
    TEACHER {
        number teacher_id PK
        string first_name
        string last_name
        date date_of_birth
        number address_id FK
    }

    ADDRESS {
        number address_id PK
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
    }

    TEACHER ||--|{ ADDRESS : ""

3NF: University

---
title: Comply with 3NF - University
---
erDiagram
    UNIVERSITY {
        number university_id PK
        string name
        number address_id FK
    }

    ADDRESS {
        number address_id PK
        number unit_number
        number street_number
        string street_name
        string suburb
        string city
        string state
        string country
        string postcode
    }

    UNIVERSITY ||--|{ ADDRESS : ""

Design Considerations

Determine Data Types & Precision

Determining data types and precision in database design is crucial for ensuring optimal storage, performance, and accuracy of data. There are typically three steps to determine the apprpriate data type and precision for a field:

  1. Identifying the nature of the field: Determine the type of data the field will store—whether it’s numeric, string, or data/time. For example, the student_id field might be of type INT, while the first_time field might of type CHAR or VARCHAR.
  2. Choose the smallest sufficient data type: Select the samllest data type that can accommodate the required values. For example, if a field’s values range from -128 to 127, use TINYINT instead of INT to optimize performance and storage efficiency.
  3. Set precision and scale for numeric values: For numeric fields, it’s important to define the precision (total number of digits) and scale (number of digits after the decimal point). For example, a price field should be set to DECIMAL(10,2), which allows for 10 digits in total, with 2 digits after the decimal point.

Integrity Constraints

Integrity Constraints in database design are rules that ensure the accuracy, consistency, and validity of data. There are several types of integrity constraints, each serving a specific purpose:

  1. Entity Integrity

Entity Integrity ensures that each record in a table is unique and identifiable. This is achieved by setting a column in a table as a primary key. For example, the student_id field must be unique for each record in the Student table.

CREATE TABLE Student (
    student_id INT PRIMARY KEY,
    first_name VARCHAR(100),
    last_name VARCHAR(100)
);
  1. Referencial Integrity

Referencial Integrity ensures that relationships between tables are maintained consistently. Foreign keys enforce this constraint by ensuring that a value in one table corresbonds to a valid value in another table. For example, the student_id field in the Enrollment table serves as a foreign key referencing the primary key student_id in the Student table.

CREATE TABLE Enrollment (
    enrollment_id INT PRIMARY KEY,
    student_id INT,
    course_id INT,
    FOREIGN KEY (student_id) REFERENCES Students(student_id)
);
  1. Domain Integrity

Domain Integrity ensures that the data entered in a column matches the defined data type, format, and range. This is enforced using data types. default values. and check constraints. For example, the price field should only store positive values.

CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    price DECIMAL(10, 2) CHECK (price > 0)
);
  1. User-Defined Integrity

User-defined integrity enforces business rules specific to the application or domain. For instance, in the Enrollment table, the combination of student_id and course_id should be unique to prevent a student from enrolling in the same course more than once.

CREATE TABLE Enrollment (
    enrollment_id INT PRIMARY KEY,
    student_id INT,
    course_id INT,
    UNIQUE (student_id, course_id)
);
  1. Null Integrity

Null integrity ensures that certain fields cannot store NULL values. This is enforced by defining the column with the NOT NULL constraint. For example, the student_id field in the Student table must not be NULL because it is used to identify each record.

CREATE TABLE Student (
    student_id INT PRIMARY KEY NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL
);