7.3 Introduction to Data Models, Normalization, and SQL
In database management, data models define how data is structured and managed, while normalization ensures data integrity and efficiency. SQL is the language used to interact with relational databases, enabling the creation, modification, and querying of data.
1. Data Abstraction and Data Independence
Data Abstraction: The process of hiding the complex implementation details and showing only the essential features of data. There are three levels of abstraction:
Physical Level: Describes how data is stored on disk.
Logical Level: Describes what data is stored and the relationships among the data.
View Level: The way the data is presented to users.
Data Independence: The ability to change the schema at one level without affecting the schema at the next higher level.
Physical Data Independence: The ability to change the physical storage without affecting the logical structure.
Logical Data Independence: The ability to change the logical schema without changing the external schema or application programs.
2. Schema and Instances
Schema: The structure of the database, described by its tables, fields, and relationships. It defines the database's organization and constraints.
Physical Schema: Describes the physical storage of the data.
Logical Schema: Describes the logical structure, like tables, views, indexes, etc.
Instance: A snapshot of the data at a particular point in time. The instance represents the actual data stored in the database.
3. Entity-Relationship (E-R) Model
The E-R Model is a high-level conceptual data model that represents the data and its relationships in the form of entities and their connections (relationships).
Key Components of E-R Model:
Entity: An object or thing in the real world that is distinguishable from other objects. E.g., Student, Course.
Attributes: Properties or characteristics of an entity. E.g., Name, Age, ID.
Entity Sets: A collection of similar types of entities. E.g., all students in a school.
Keys: Attributes that uniquely identify an entity within an entity set. E.g., StudentID.
Types of Entity Sets:
Strong Entity Set: An entity set that can exist independently. It has a primary key.
Weak Entity Set: An entity set that cannot exist without a strong entity set. It doesn't have a primary key but can be identified by a combination of attributes from the strong entity and its own attributes.
E-R Diagram: A visual representation of entities, their attributes, and relationships in a database system.
4. Normalization and Normal Forms
Normalization is the process of organizing data to minimize redundancy and dependency, and to ensure data integrity.
Initial Unnormalized Table (UNF)
Consider a table that stores student, class, and teacher information:
StudentID
StudentName
ClassID
ClassName
TeacherID
TeacherName
Subjects
1
John Doe
101
Science
T1
Mr. Smith
Physics, Chemistry
2
Jane Doe
102
Mathematics
T2
Ms. Johnson
Algebra, Geometry
First Normal Form (1NF):
Rules:
Each column contains atomic (indivisible) values.
No repeating groups or arrays within columns.
Conversion: Split the multi-valued field Subjects
into separate rows.
StudentID
StudentName
ClassID
ClassName
TeacherID
TeacherName
Subject
1
John Doe
101
Science
T1
Mr. Smith
Physics
1
John Doe
101
Science
T1
Mr. Smith
Chemistry
2
Jane Doe
102
Mathematics
T2
Ms. Johnson
Algebra
2
Jane Doe
102
Mathematics
T2
Ms. Johnson
Geometry
Second Normal Form (2NF):
Rules:
Satisfies 1NF.
All non-key attributes must be fully functionally dependent on the primary key.
Remove partial dependencies (where non-key attributes depend only on part of a composite key).
Analysis:
In this table, the composite primary key is
(StudentID, Subject)
.Attributes like
ClassName
,TeacherID
, andTeacherName
depend only onClassID
, not the entire key.
Conversion: Split the table into two:
Table 1: Student_Subject
StudentID
Subject
ClassID
1
Physics
101
1
Chemistry
101
2
Algebra
102
2
Geometry
102
Table 2: Class_Teacher
ClassID
ClassName
TeacherID
TeacherName
101
Science
T1
Mr. Smith
102
Mathematics
T2
Ms. Johnson
Third Normal Form (3NF):
Rules:
Satisfies 2NF.
Remove transitive dependencies (where non-key attributes depend on other non-key attributes).
Analysis:
In
Class_Teacher
,TeacherName
depends onTeacherID
, not directly onClassID
.
Conversion: Split Class_Teacher
into two tables:
Table 2a: Class_Info
ClassID
ClassName
TeacherID
101
Science
T1
102
Mathematics
T2
Table 2b: Teacher_Info
TeacherID
TeacherName
T1
Mr. Smith
T2
Ms. Johnson
Boyce-Codd Normal Form (BCNF):
Rules:
Satisfies 3NF.
For every functional dependency, the left-hand side must be a superkey.
Analysis:
All functional dependencies in the current tables satisfy BCNF since the left-hand side of each dependency is a candidate key.
No further decomposition is needed.
5. Functional Dependencies, Integrity Constraints, and Domain Constraints
Functional Dependency (FD): A relationship between two sets of attributes in a relation where one set (the determinant) uniquely determines the other set.
Example: In a relation (StudentID, StudentName), StudentID → StudentName indicates that the StudentID uniquely determines the StudentName.
Integrity Constraints: Rules that ensure the accuracy and consistency of data.
Entity Integrity: Ensures that the primary key of a table is unique and not null.
Referential Integrity: Ensures that foreign keys match primary keys in related tables.
Domain Constraints: Specifies that the values in an attribute must come from a specific domain (set of allowed values).
6. Relations (Joined, Derived)
Relational Model: Represents data as a set of relations (tables), where each relation consists of tuples (rows) and attributes (columns).
Types of Relations:
Joined Relations: Combining data from two or more tables based on a common attribute (e.g., using SQL
JOIN
).Derived Relations: Relations that are not stored in the database but can be derived from other relations via queries or views.
7. SQL: Data Definition Language (DDL) and Data Manipulation Language (DML)
DDL (Data Definition Language): Used to define the structure of the database, including tables, views, and schema.
Commands:
CREATE
,ALTER
,DROP
,TRUNCATE
.
DML (Data Manipulation Language): Used to manage and manipulate data within the tables.
Commands:
SELECT
,INSERT
,UPDATE
,DELETE
.
Views: Virtual tables that are defined by SQL queries. Views do not store data but can simplify complex queries.
Assertions and Triggering:
Assertions: Used to specify conditions that must hold for all data in the database.
Triggers: Procedural code that is automatically executed in response to certain events on a table or view (e.g.,
AFTER INSERT
,BEFORE DELETE
).
8. Relational Algebra
Relational Algebra is a formal language for querying relational databases. It consists of a set of operations that take one or more relations as input and produce a new relation as output.
Common Operations in Relational Algebra:
Selection (σ): Filters rows based on a specified condition.
Projection (π): Selects columns from a relation.
Union (∪): Combines the tuples from two relations.
Intersection (∩): Returns the common tuples between two relations.
Difference (−): Returns tuples from one relation that are not in another.
Join (⨝): Combines tuples from two relations based on a matching condition.
9. Query Cost Estimation, Optimization, and Decomposition
Query Cost Estimation: The process of estimating the cost (in terms of time, resources, etc.) required to execute a query.
This includes evaluating the cost of different query execution plans.
Query Optimization: The process of improving the efficiency of a query by selecting the best execution plan. This can be done by:
Minimizing the number of joins.
Using indexes.
Rewriting queries.
Query Decomposition: The process of breaking down a complex query into smaller subqueries that are easier to execute.
Conclusion
Data models, normalization, and SQL are the core concepts of database management systems. Understanding these concepts allows for efficient data storage, retrieval, and management. Normalization ensures that the database is free of redundant data, while SQL provides the means to interact with and manipulate the data in a structured way.
Last updated