A database is a collection of data that is stored, managed, and accessed electronically. It typically contains one or more tables, which resemble Excel or Google sheets, with data organized into rows and columns. As the number of tables grows, managing the database can become complex, making it essential to design a scalable and maintainable strucutre.
Databases can be broadly categorized into two types: Relational database, primarily used for real-time data updates and transactions, and data warehouse, optimized for data analysis and reporting purposes.
Data modeling is the process of creating a conceptual representation of data structures, relationships, and rule to meet business requirements. It serves as a blueprint for designing and managing a database.
There are three types of data models: conceptual, logical, and physical. The conceptual data model provides a high-level overview, specifying relationships between objects without technical details. The logical data model refine the conceptual data model by adding more detail, such as fields for each object. The physical data model describe the internal schema of the database, including table structures, storage details, and implementation specifics.
---
title: Conceptual Data Model
---
erDiagram
Customer ||--o{ Order : places
Order ||--|{ Product : contains
---
title: Logical Data Model
---
erDiagram
Customer {
int customerId
string name
string email
}
Order {
int orderId
date orderDate
float totalAmount
}
Product {
int productId
string productName
float price
}
Customer ||--o{ Order : places
Order ||--|{ Product : contains
---
title: Physical Data Model
---
erDiagram
Customer {
int customerId PK
varchar name
varchar email
}
Order {
int orderId PK
date orderDate
float totalAmount
int customerId FK
}
Product {
int productId PK
varchar productName
decimal price
}
OrderItem {
int orderItemId PK
int orderId FK
int productId FK
int quantity
}
Customer ||--o{ Order : places
Order ||--|{ OrderItem : includes
OrderItem ||--|{ Product : contains
In database design, an entity is a distinct object that typically represents real-world objects, events, or concepts. Entities are represented as tables in a relational database, and they are similar to classes in object-oriented design. Each row in the table represents an instance of the entity. Each entity has attributes that describe its characteristics. For example, a Student entity may have attributes like student_id, first_name, last_name, and date_of_birth, among others. Entities can be related to one another, and these relationships are often captured using foreign keys or relationship tables, which link entities together based on logical associations. For example, a Student might be linked to a Course through an enrollment relationship.
Identifying entities in database design is a crucial step, as entities form the foundation of a database. A useful approach for identifying entities is to clarify the design objectives and look for nouns in the objective statement. For example, consider the University Enrollment Management System. The objective statement might be:
“The University Enrollment Management System” is designed to help a university track enrolled students, the subjects they are taking, and the teachers assigned to those subjects".
In this sentence, the nouns university, student, subject, and teacher are potential entities, as they represent key objects or concepts that the system needs to track.
---
title: Identifying Entities
---
erDiagram
UNIVERSITY
STUDENT
In database design, an attribute describes the characteristics of an entity. In a relational database, attributes are typically represented as columns in a table. Defining entity attributes is a crucial step in database design, as it ensures that the entity captures all the necessary details required by the system. When defining attributes, it’s important to consider what properties or characteristics the entity possesses.
For example, in a University Enrollment Management System, the Student entity might have attributes like first_name, last_name, date_of_birth, and address. After identifying the necessary attributes, the next step is to specify the approapriate data type for each attribute. For instance, attributes like first_name, last_name, and address would typically use the TEXT data type, while the date_of_birth attribute would be defined with the DATE data type.
---
title: Defining Attributes
---
erDiagram
UNIVERSITY {
string name
string address
}
STUDENT {
string first_name
string last_name
date date_of_birth
string address
}
Normalization in database design is the process of organizing the attributes of entities and their relationships to reduce data redundency, avoid operational anomalies, and improve data integrity. Normalization involves decomposing a large, complex database into smaller, simpler tables while preserving the relationships between entities. This decomposition follows a set of rules called normal forms. In real-world database design, the three most commonly used normal normal forms are First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF).
First Normal Form (1NF) is a property of a relational database table that ensures the table is structured properly. It follows the following conditions:
Each column must contain atomic values, meaning that the value in a field should be indivisible and not contain sets or lists of values.
Each row must be unique, which can be achieved by using a primary key to uniquely identify each record.
All values in single column should be of the same data type.
A Primary Key is a column (or a set of columns) in a relational database table that uniquely identifies each row in that table. It satisfies the following conditions:
The value in the primary key must be unique for each row.
The value in the primary key must not be null for any record.
The value in the primary key should generally be immutable, meaning it should not change once assigned.
When designing tables, assess if any combination of columns can be used to uniquely identify a record. If such a combination exists, those columns can form the primary key. If no suitable combination is found, create a new column to serve as the primary key.
A primary key can be a set of columns when a single column is not sufficient to uniquely identify a row. For example, in the following table, neither the StudentID nor the CourseID alone can uniquely identify a row, because a student might enroll mutiple courses, and a course can have multiple students. However, the combination of both columns can uniquely identify each enrollment record.
In the following Student table, there is no single field that can uniquely identify a row. Therefore, it is necessary to add a new field. such as StudentID, to ensure each row has a unique identifier. Additionally, the Address field may contain multiple values, such as unit number, street number, street name, suburb, city, state, country, and postcode. To comply with the First Normal Form (1NF), the Address firld must be divided into multiple separate columns, ensuring that each column contains only atomic values.
---
title: Violation of 1NF - Student
---
erDiagram
STUDENT {
string first_name
string last_name
date date_of_birth
string address
}
---
title: Comply with 1NF - Student
---
erDiagram
STUDENT {
number student_id PK
string first_name
string last_name
date date_of_birth
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
}
---
title: Comply with 1NF - Teacher
---
erDiagram
TEACHER {
number teacher_id PK
string first_name
string last_name
date date_of_birth
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
string subject_taught
}
---
title: Violation of 1NF - University
---
erDiagram
UNIVERSITY {
string name
string address
}
---
title: Comply with 1NF - University
---
erDiagram
UNIVERSITY {
number university_id PK
string name
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
}
The Second Normal Form (2NF) in database design is a level of normalization that builds upon the First Normal Form (1NF). A table is in 2NF if every non-key attribute is fully functionally dependent on the entire primary key, not just part of it, particularly in tables with composite primary keys (a key made up of two or more columns).
Relationship in database design refers to the association between different entities (tables) in a database. It defines how data in one table is related to data in another table, enabling efficient organization and retrival of data. Relationships are established through a foreign key, which is a column in one table that references the primary key of another table. There are four main types of relationships: One-to-One (1:1), One-to-Many (1:N), Many-to-Many (N:M), and Self-Reference.
In a One to One relationship, each record in one table is related to exactly one record in another table. For example, consider a Person table and a Passport table, where each person has one unique passport, and each passport is associated with only one person. This type of relationship is relatively rare in real-world database design.
The One to Many relationship is the most common type in real-world database design. In a 1:N relationship, one record in a table can be linked to multiple records in another table. For example, in a Department table and an Employee table, one department can have many employees, but each employee blongs to exactly one department. This is a typical relational model for most business applications.
The Many-to-Many relationship is common but often requires an intermediary table to properly represent the association. In an N:M relationship, multiple records in one table can be asoociated with multiple records in another table. For instance, in a Student table and a Course table, a student can enroll in many courses, and each course can have many students enrolled. This relationship is typically handled by a joining table, such as Student_Course.
In a Self-Referencing relationship, a table is related to itself. In other words, a record in the table is associated with another record in the same table. For example, in an Employee table, each employee has a Manager, and that manager may also have a manager. This type of relationship is useful for representing hierarchical structures, such as organizational charts.
In database design, a foreign key is a column that references the primary key of another table. It is used to establish a relationship between two tables, ensuring referential integrity and maintaining consistent data across the database.
In the following Subject table, the subject_name is dependent on the primary key subject_id, but the category is not fully dependent on the primary key, which violates 2NF. To resolve this issue, a new Category table should be created with two fields: category_id as the primary key and subject_id as a foreign key. The two tables have a One-to-Many relationship, where each category can have many subjects, while each subject bleongs to exactly one category.
---
title: Violation of 2NF - Subject
---
erDiagram
SUBJECT {
number subject_id PK
string subject_name
string category
}
---
title: Comply with 2NF - Subject
---
erDiagram
SUBJECT {
number subject_id PK
string subject_name
number category_id FK
}
CATEGORY {
number category_id PK
string category_name
}
CATEGORY ||--|{ SUBJECT : ""
In the following Subject table, the subject_id and students_enrolled create a Many-to-Many relationship. However, relational databases should not represent Many-to-Many relationships directly in s single table. This is because a single table cannot handle multiple instances of related data (such as multiple students enrolled in multiple subjects) within a single field. To resolve this issue, a junction table is introduced. It breaks down the Many-to-Many relationship into two One-to-Many relationships, ensuring data integrity and normalization.
---
title: Violation of 2NF - Subject
---
erDiagram
SUBJECT {
number subject_id PK
string subject_name
string students_enrolled
}
subject_id
subject_name
students_name
1
Math
Alice, Bob, Charlie
2
History
David, Emily, Frank, Grace
3
Science
Helen, Ian
In the above table, the students_enrolled field violates First Normal Form (1NF), which requires each column to contain only atomic values. Additionally, this design leads to data redundancy. For example, if the student Alice enrolls in multiple courses such as Math, History, and Science. her name will be repeated in each row, causing unnecessary duplication of data.
Furthermore, this design makes data manipulation more complex. For instance, when a student drops a course or a new student enrolls in a course, the list of students in the students_enrolled column must be manually updated, which is error-prone and inefficient.
---
title: Comply with 2NF - Subject
---
erDiagram
SUBJECT {
number subject_id PK
string subject_name
}
STUDENT {
number student_id PK
string students_enrolled
}
ENROLLMENT {
number subject_id FK
number student_id FK
}
SUBJECT ||--|{ ENROLLMENT : "has"
STUDENT ||--|{ ENROLLMENT : "enrolled in"
---
title: Violation of 2NF - Teacher (subject_taught)
---
erDiagram
TEACHER {
number teacher_id
string first_name
string last_name
date date_of_birth
string subject_taught
}
---
title: Comply with 2NF - Teacher
---
erDiagram
TEACHER {
number teacher_id
string first_name
string last_name
date date_of_birth
}
SUBJECT {
number subject_id PK
string subject_name
number category_id FK
number teacher_id FK
}
TEACHER ||--|{ SUBJECT : "teaches"
---
title: Comply with 2NF - UNIVERSITY
---
erDiagram
UNIVERSITY {
number university_id PK
string name
}
SUBJECT {
number subject_id PK
string subject_name
number category_id FK
number teacher_id FK
number university_id FK
}
UNIVERSITY ||--|{ SUBJECT : ""
Third Normal Form (3NF) is a level of normalization that satisfies Second Normal Form (2NF) and eliminates transitive dependencies. In simpler terms, a non-key attribute in a table should not depend on another non-key attrubute. For example, if A $\rightarrow$ B and B $\rightarrow$ C, then A $\rightarrow$ C, which means C should be moved to a separate table to avoid redundency. This process helps reduce data redundency and improves the efficiency of relational databases.
In the following Student table, the non-key attributes (unit_number, street_number, street_name, suburb, city, state, country, and postcode) are not directly dependent on the primary key student_id; instead, they depend on the student’s address. This results in a transitive dependency, which violates 3NF. Furthermore, since multiple students may share the same address, these address attributes will be repeated for each student, leading to data redundency.
---
title: Violation of 3NF - Student
---
erDiagram
STUDENT {
number student_id PK
string first_name
string last_name
date date_of_birth
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
}
---
title: Comply with 3NF - Student
---
erDiagram
STUDENT {
number student_id PK
string first_name
string last_name
date date_of_birth
number address_id FK
}
ADDRESS {
number address_id PK
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
}
STUDENT ||--|{ ADDRESS : ""
---
title: Comply with 3NF - Teacher
---
erDiagram
TEACHER {
number teacher_id PK
string first_name
string last_name
date date_of_birth
number address_id FK
}
ADDRESS {
number address_id PK
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
}
TEACHER ||--|{ ADDRESS : ""
---
title: Comply with 3NF - University
---
erDiagram
UNIVERSITY {
number university_id PK
string name
number address_id FK
}
ADDRESS {
number address_id PK
number unit_number
number street_number
string street_name
string suburb
string city
string state
string country
string postcode
}
UNIVERSITY ||--|{ ADDRESS : ""
Determining data types and precision in database design is crucial for ensuring optimal storage, performance, and accuracy of data. There are typically three steps to determine the apprpriate data type and precision for a field:
Identifying the nature of the field: Determine the type of data the field will store—whether it’s numeric, string, or data/time. For example, the student_id field might be of type INT, while the first_time field might of type CHAR or VARCHAR.
Choose the smallest sufficient data type: Select the samllest data type that can accommodate the required values. For example, if a field’s values range from -128 to 127, use TINYINT instead of INT to optimize performance and storage efficiency.
Set precision and scale for numeric values: For numeric fields, it’s important to define the precision (total number of digits) and scale (number of digits after the decimal point). For example, a price field should be set to DECIMAL(10,2), which allows for 10 digits in total, with 2 digits after the decimal point.
Integrity Constraints in database design are rules that ensure the accuracy, consistency, and validity of data. There are several types of integrity constraints, each serving a specific purpose:
Entity Integrity
Entity Integrity ensures that each record in a table is unique and identifiable. This is achieved by setting a column in a table as a primary key. For example, the student_id field must be unique for each record in the Student table.
Referencial Integrity ensures that relationships between tables are maintained consistently. Foreign keys enforce this constraint by ensuring that a value in one table corresbonds to a valid value in another table. For example, the student_id field in the Enrollment table serves as a foreign key referencing the primary key student_id in the Student table.
Domain Integrity ensures that the data entered in a column matches the defined data type, format, and range. This is enforced using data types. default values. and check constraints. For example, the price field should only store positive values.
User-defined integrity enforces business rules specific to the application or domain. For instance, in the Enrollment table, the combination of student_id and course_id should be unique to prevent a student from enrolling in the same course more than once.
Null integrity ensures that certain fields cannot store NULL values. This is enforced by defining the column with the NOT NULL constraint. For example, the student_id field in the Student table must not be NULL because it is used to identify each record.