7.3 Introduction to Data Models, Normalization, and SQL

In database management, data models define how data is structured and managed, while normalization ensures data integrity and efficiency. SQL is the language used to interact with relational databases, enabling the creation, modification, and querying of data.

1. Data Abstraction and Data Independence

Data Abstraction: The process of hiding the complex implementation details and showing only the essential features of data. There are three levels of abstraction:
- Physical Level: Describes how data is stored on disk.
- Logical Level: Describes what data is stored and the relationships among the data.
- View Level: The way the data is presented to users.
Data Independence: The ability to change the schema at one level without affecting the schema at the next higher level.
- Physical Data Independence: The ability to change the physical storage without affecting the logical structure.
- Logical Data Independence: The ability to change the logical schema without changing the external schema or application programs.

2. Schema and Instances

Schema: The structure of the database, described by its tables, fields, and relationships. It defines the database's organization and constraints.
- Physical Schema: Describes the physical storage of the data.
- Logical Schema: Describes the logical structure, like tables, views, indexes, etc.
Instance: A snapshot of the data at a particular point in time. The instance represents the actual data stored in the database.

3. Entity-Relationship (E-R) Model

The E-R Model is a high-level conceptual data model that represents the data and its relationships in the form of entities and their connections (relationships).

Key Components of E-R Model:

Entity: An object or thing in the real world that is distinguishable from other objects. E.g., Student, Course.
Attributes: Properties or characteristics of an entity. E.g., Name, Age, ID.
Entity Sets: A collection of similar types of entities. E.g., all students in a school.
Keys: Attributes that uniquely identify an entity within an entity set. E.g., StudentID.

Types of Entity Sets:

Strong Entity Set: An entity set that can exist independently. It has a primary key.
Weak Entity Set: An entity set that cannot exist without a strong entity set. It doesn't have a primary key but can be identified by a combination of attributes from the strong entity and its own attributes.

E-R Diagram: A visual representation of entities, their attributes, and relationships in a database system.

4. Normalization and Normal Forms

Normalization is the process of organizing data to minimize redundancy and dependency, and to ensure data integrity.

Initial Unnormalized Table (UNF)

Consider a table that stores student, class, and teacher information:

StudentID

StudentName

ClassID

ClassName

TeacherID

TeacherName

Subjects

John Doe

101

Science

Mr. Smith

Physics, Chemistry

Jane Doe

102

Mathematics

Ms. Johnson

Algebra, Geometry

First Normal Form (1NF):

Rules:

Each column contains atomic (indivisible) values.
No repeating groups or arrays within columns.

Conversion: Split the multi-valued field Subjects into separate rows.

StudentID

StudentName

ClassID

ClassName

TeacherID

TeacherName

Subject

John Doe

101

Science

Mr. Smith

Physics

John Doe

101

Science

Mr. Smith

Chemistry

Jane Doe

102

Mathematics

Ms. Johnson

Algebra

Jane Doe

102

Mathematics

Ms. Johnson

Geometry

Second Normal Form (2NF):

Rules:

Satisfies 1NF.
All non-key attributes must be fully functionally dependent on the primary key.
Remove partial dependencies (where non-key attributes depend only on part of a composite key).

Analysis:

In this table, the composite primary key is (StudentID, Subject).
Attributes like ClassName, TeacherID, and TeacherName depend only on ClassID, not the entire key.

Conversion: Split the table into two:

Table 1: Student_Subject

StudentID

Subject

ClassID

Physics

101

Chemistry

101

Algebra

102

Geometry

102

Table 2: Class_Teacher

ClassID

ClassName

TeacherID

TeacherName

101

Science

Mr. Smith

102

Mathematics

Ms. Johnson

Third Normal Form (3NF):

Rules:

Satisfies 2NF.
Remove transitive dependencies (where non-key attributes depend on other non-key attributes).

Analysis:

In Class_Teacher, TeacherName depends on TeacherID, not directly on ClassID.

Conversion: Split Class_Teacher into two tables:

Table 2a: Class_Info

ClassID

ClassName

TeacherID

101

Science

102

Mathematics

Table 2b: Teacher_Info

TeacherID

TeacherName

Mr. Smith

Ms. Johnson

Boyce-Codd Normal Form (BCNF):

Rules:

Satisfies 3NF.
For every functional dependency, the left-hand side must be a superkey.

Analysis:

All functional dependencies in the current tables satisfy BCNF since the left-hand side of each dependency is a candidate key.
No further decomposition is needed.

5. Functional Dependencies, Integrity Constraints, and Domain Constraints

Functional Dependency (FD): A relationship between two sets of attributes in a relation where one set (the determinant) uniquely determines the other set.
- Example: In a relation (StudentID, StudentName), StudentID → StudentName indicates that the StudentID uniquely determines the StudentName.
Integrity Constraints: Rules that ensure the accuracy and consistency of data.
- Entity Integrity: Ensures that the primary key of a table is unique and not null.
- Referential Integrity: Ensures that foreign keys match primary keys in related tables.
Domain Constraints: Specifies that the values in an attribute must come from a specific domain (set of allowed values).

6. Relations (Joined, Derived)

Relational Model: Represents data as a set of relations (tables), where each relation consists of tuples (rows) and attributes (columns).

Types of Relations:

Joined Relations: Combining data from two or more tables based on a common attribute (e.g., using SQL JOIN).
Derived Relations: Relations that are not stored in the database but can be derived from other relations via queries or views.

7. SQL: Data Definition Language (DDL) and Data Manipulation Language (DML)

DDL (Data Definition Language): Used to define the structure of the database, including tables, views, and schema.
- Commands: CREATE, ALTER, DROP, TRUNCATE.
DML (Data Manipulation Language): Used to manage and manipulate data within the tables.
- Commands: SELECT, INSERT, UPDATE, DELETE.

Views: Virtual tables that are defined by SQL queries. Views do not store data but can simplify complex queries.

Assertions and Triggering:

Assertions: Used to specify conditions that must hold for all data in the database.
Triggers: Procedural code that is automatically executed in response to certain events on a table or view (e.g., AFTER INSERT, BEFORE DELETE).

8. Relational Algebra

Relational Algebra is a formal language for querying relational databases. It consists of a set of operations that take one or more relations as input and produce a new relation as output.

Common Operations in Relational Algebra:

Selection (σ): Filters rows based on a specified condition.
Projection (π): Selects columns from a relation.
Union (∪): Combines the tuples from two relations.
Intersection (∩): Returns the common tuples between two relations.
Difference (−): Returns tuples from one relation that are not in another.
Join (⨝): Combines tuples from two relations based on a matching condition.

9. Query Cost Estimation, Optimization, and Decomposition

Query Cost Estimation: The process of estimating the cost (in terms of time, resources, etc.) required to execute a query.
- This includes evaluating the cost of different query execution plans.
Query Optimization: The process of improving the efficiency of a query by selecting the best execution plan. This can be done by:
- Minimizing the number of joins.
- Using indexes.
- Rewriting queries.
Query Decomposition: The process of breaking down a complex query into smaller subqueries that are easier to execute.

Conclusion

Data models, normalization, and SQL are the core concepts of database management systems. Understanding these concepts allows for efficient data storage, retrieval, and management. Normalization ensures that the database is free of redundant data, while SQL provides the means to interact with and manipulate the data in a structured way.

Previous7.2 Sorting, Searching, Hashing and Graphs Next7.4 Transaction Processing, Concurrency Control, and Crash Recovery

Last updated 4 months ago

7.3 Introduction to Data Models, Normalization, and SQL

1. Data Abstraction and Data Independence

Data Abstraction: The process of hiding the complex implementation details and showing only the essential features of data. There are three levels of abstraction:
- Physical Level: Describes how data is stored on disk.
- Logical Level: Describes what data is stored and the relationships among the data.
- View Level: The way the data is presented to users.
Data Independence: The ability to change the schema at one level without affecting the schema at the next higher level.
- Physical Data Independence: The ability to change the physical storage without affecting the logical structure.
- Logical Data Independence: The ability to change the logical schema without changing the external schema or application programs.

2. Schema and Instances

Schema: The structure of the database, described by its tables, fields, and relationships. It defines the database's organization and constraints.
- Physical Schema: Describes the physical storage of the data.
- Logical Schema: Describes the logical structure, like tables, views, indexes, etc.
Instance: A snapshot of the data at a particular point in time. The instance represents the actual data stored in the database.

3. Entity-Relationship (E-R) Model

The E-R Model is a high-level conceptual data model that represents the data and its relationships in the form of entities and their connections (relationships).

Key Components of E-R Model:

Entity: An object or thing in the real world that is distinguishable from other objects. E.g., Student, Course.
Attributes: Properties or characteristics of an entity. E.g., Name, Age, ID.
Entity Sets: A collection of similar types of entities. E.g., all students in a school.
Keys: Attributes that uniquely identify an entity within an entity set. E.g., StudentID.

Types of Entity Sets:

Strong Entity Set: An entity set that can exist independently. It has a primary key.
Weak Entity Set: An entity set that cannot exist without a strong entity set. It doesn't have a primary key but can be identified by a combination of attributes from the strong entity and its own attributes.

E-R Diagram: A visual representation of entities, their attributes, and relationships in a database system.

4. Normalization and Normal Forms

Normalization is the process of organizing data to minimize redundancy and dependency, and to ensure data integrity.

Initial Unnormalized Table (UNF)

Consider a table that stores student, class, and teacher information:

StudentID

StudentName

ClassID

ClassName

TeacherID

TeacherName

Subjects

John Doe

101

Science

Mr. Smith

Physics, Chemistry

Jane Doe

102

Mathematics

Ms. Johnson

Algebra, Geometry

First Normal Form (1NF):

Rules:

Each column contains atomic (indivisible) values.
No repeating groups or arrays within columns.

Conversion: Split the multi-valued field Subjects into separate rows.

StudentID

StudentName

ClassID

ClassName

TeacherID

TeacherName

Subject

John Doe

101

Science

Mr. Smith

Physics

John Doe

101

Science

Mr. Smith

Chemistry

Jane Doe

102

Mathematics

Ms. Johnson

Algebra

Jane Doe

102

Mathematics

Ms. Johnson

Geometry

Second Normal Form (2NF):

Rules:

Satisfies 1NF.
All non-key attributes must be fully functionally dependent on the primary key.
Remove partial dependencies (where non-key attributes depend only on part of a composite key).

Analysis:

In this table, the composite primary key is (StudentID, Subject).
Attributes like ClassName, TeacherID, and TeacherName depend only on ClassID, not the entire key.

Conversion: Split the table into two:

Table 1: Student_Subject

StudentID

Subject

ClassID

Physics

101

Chemistry

101

Algebra

102

Geometry

102

Table 2: Class_Teacher

ClassID

ClassName

TeacherID

TeacherName

101

Science

Mr. Smith

102

Mathematics

Ms. Johnson

Third Normal Form (3NF):

Rules:

Satisfies 2NF.
Remove transitive dependencies (where non-key attributes depend on other non-key attributes).

Analysis:

In Class_Teacher, TeacherName depends on TeacherID, not directly on ClassID.

Conversion: Split Class_Teacher into two tables:

Table 2a: Class_Info

ClassID

ClassName

TeacherID

101

Science

102

Mathematics

Table 2b: Teacher_Info

TeacherID

TeacherName

Mr. Smith

Ms. Johnson

Boyce-Codd Normal Form (BCNF):

Rules:

Satisfies 3NF.
For every functional dependency, the left-hand side must be a superkey.

Analysis:

All functional dependencies in the current tables satisfy BCNF since the left-hand side of each dependency is a candidate key.
No further decomposition is needed.

5. Functional Dependencies, Integrity Constraints, and Domain Constraints

Functional Dependency (FD): A relationship between two sets of attributes in a relation where one set (the determinant) uniquely determines the other set.
- Example: In a relation (StudentID, StudentName), StudentID → StudentName indicates that the StudentID uniquely determines the StudentName.
Integrity Constraints: Rules that ensure the accuracy and consistency of data.
- Entity Integrity: Ensures that the primary key of a table is unique and not null.
- Referential Integrity: Ensures that foreign keys match primary keys in related tables.
Domain Constraints: Specifies that the values in an attribute must come from a specific domain (set of allowed values).

6. Relations (Joined, Derived)

Relational Model: Represents data as a set of relations (tables), where each relation consists of tuples (rows) and attributes (columns).

Types of Relations:

Joined Relations: Combining data from two or more tables based on a common attribute (e.g., using SQL JOIN).
Derived Relations: Relations that are not stored in the database but can be derived from other relations via queries or views.

7. SQL: Data Definition Language (DDL) and Data Manipulation Language (DML)

DDL (Data Definition Language): Used to define the structure of the database, including tables, views, and schema.
- Commands: CREATE, ALTER, DROP, TRUNCATE.
DML (Data Manipulation Language): Used to manage and manipulate data within the tables.
- Commands: SELECT, INSERT, UPDATE, DELETE.

Views: Virtual tables that are defined by SQL queries. Views do not store data but can simplify complex queries.

Assertions and Triggering:

Assertions: Used to specify conditions that must hold for all data in the database.
Triggers: Procedural code that is automatically executed in response to certain events on a table or view (e.g., AFTER INSERT, BEFORE DELETE).

8. Relational Algebra

Relational Algebra is a formal language for querying relational databases. It consists of a set of operations that take one or more relations as input and produce a new relation as output.

Common Operations in Relational Algebra:

Selection (σ): Filters rows based on a specified condition.
Projection (π): Selects columns from a relation.
Union (∪): Combines the tuples from two relations.
Intersection (∩): Returns the common tuples between two relations.
Difference (−): Returns tuples from one relation that are not in another.
Join (⨝): Combines tuples from two relations based on a matching condition.

9. Query Cost Estimation, Optimization, and Decomposition

Query Cost Estimation: The process of estimating the cost (in terms of time, resources, etc.) required to execute a query.
- This includes evaluating the cost of different query execution plans.
Query Optimization: The process of improving the efficiency of a query by selecting the best execution plan. This can be done by:
- Minimizing the number of joins.
- Using indexes.
- Rewriting queries.
Query Decomposition: The process of breaking down a complex query into smaller subqueries that are easier to execute.

Conclusion

Previous7.2 Sorting, Searching, Hashing and Graphs Next7.4 Transaction Processing, Concurrency Control, and Crash Recovery

Last updated 4 months ago