Data Table Relation's Normalization

    Mar 12, 2024

    Introduces functional dependencies, and the database paradigms 1NF, 2NF, 3NF, BCNF, and 4NF. and use examples to explain their concepts and connections.

    Functional Dependencies

    In relational databases, functional dependencies describe the dependency relationships between attributes in a relational schema. If, given a relation RR, the values of attribute set XX uniquely determine (through conditions or function arguments, for example) a value in attribute set YY, then we say that YY is functionally dependent on XX, typically denoted by the symbol XYX \rightarrow Y.

    Specifically, if for every record in relation RR, whenever two records have the same values for attribute set XX, they must also have the same values for attribute set YY, then we say that YY is functionally dependent on XX.

    For example, suppose we have a relation RR containing attribute set {A,B,C}\{ A,B,C \}, where the values of attribute AA uniquely determine the values of attribute BB, i.e., ABA \rightarrow B. Then, we say that attribute BB is functionally dependent on attribute AA.

    Functional dependency is an important concept in relational database design. It helps in understanding the relationships between data, normalizing database designs to reduce redundant data, while ensuring the consistency and integrity of data.

    Functional dependencies can be classified into trivial functional dependencies and non-trivial functional dependencies.

    If XYX \rightarrow Y and YXY \subseteq X, then XYX \rightarrow Y is called a trivial functional dependency. Otherwise, it is a non-trivial functional dependency.

    For functional dependencies, the dependency relationship between XX and YY can also be further classified into partial functional dependency, full functional dependency, and transitive functional dependency.

    • Partial Functional Dependency: Let X,YX,Y be two attribute sets of relation RR. If XYX→Y and XXX' \subsetneq X, where there exists XX' such that XYX'→Y, then YY is said to be partially functionally dependent on XX, denoted as XPYX \overset{P}{\rightarrow} Y. In other words, partial functional dependency means that some attributes can be removed without affecting the dependency.

    For example: In a student basic information table RR (student ID, passport number, name), of course, the student ID attribute values are unique. In relation RR, (student ID, passport number) → (name), (student ID) → (name), (passport number) → (name); therefore, the name is partially functionally dependent on (student ID, passport number).

    • Full Functional Dependency: Let X,YX,Y be two attribute sets of relation RR. If XYX→Y, XXX' \subsetneq X, and for all XX', XYX'→Y, then YY is said to be fully functionally dependent on XX, denoted as XFYX \overset{F}{\rightarrow} Y. In other words, for full functional dependency, no extra attributes can be deleted, otherwise, the property of dependency will not be maintained.

    Example: In a student basic information table RR (student ID, class, name), assuming different classes can have the same student ID, and student IDs within a class cannot be the same. In relation RR, (student ID, class) → (name), but (student ID) → (name) does not hold, and (class) → (name) does not hold either. Therefore, the name is fully functionally dependent on (student ID, class).

    • Transitive Functional Dependency: Let X,Y,ZX,Y,Z be three attribute sets of relation RR, where XY(YX)X→Y (Y \cancel{\rightarrow} X), and for all YY, YZY→Z. Then ZZ is said to be transitively functionally dependent on XX. This means that ZZ indirectly depends on XX.

    Example: In relation RR (student ID, dormitory, fee), (student ID) → (dormitory), dormitory ≠ student ID, (dormitory) → (fee), fee ≠ dormitory. Thus, it meets the requirements of transitive functional dependency.

    Multi-Valued Dependencies

    The previous section introduced functional dependencies, which are actually a special case of multi-valued dependencies. Multi-valued dependencies extend the concept of functional dependencies.

    Let R(U)R(U) be a relational schema over the attribute set UU, and let XX, YY, and ZZ be subsets of UU such that Z=UXYZ = U - X - Y. A multi-valued dependency XYX \twoheadrightarrow Y holds if and only if for any relation rr in RR, each value on (X,Z)(X, Z) corresponds to a set of values on YY, and this set of values depends only on the values of XX and is independent of the values of ZZ.

    Consider a relational schema RR with the attribute set {Student,Course,Textbook}\{Student, Course, Textbook\}, representing students' course enrollments and the textbooks they use. If a student uses multiple textbooks for a single course, and the combination of student and course uniquely determines the textbooks used, while the combination of student and course is independent, then we have a multi-valued dependency.

    Suppose we have the following relation instance:

    StudentCourseTextbook
    AliceMathCalculus
    AliceMathLinear Algebra
    AlicePhysicsMechanics
    BobMathCalculus

    In this example, the choice of textbook depends on the combination of student and course. For instance, Alice uses both Calculus and Linear Algebra textbooks for her Math course, while she uses Mechanics for her Physics course. However, whether Linear Algebra and Calculus are chosen depends solely on whether the course is Math. This scenario demonstrates a multi-valued dependency, where the combination of student and course uniquely determines the set of textbooks used, while the relationship between student and course is independent, yet textbooks are not independent of courses as one course can have multiple textbooks.

    Therefore, in this scenario, there exists a multi-valued dependency relationship CourseTextbookCourse \twoheadrightarrow Textbook.

    Keys

    • Let KK be an attribute or a combination of attributes in R<U,F>R<U, F>, and KFUK \overset{F}{\rightarrow} U, then KK is a candidate key for RR.
    • If there are multiple candidate keys, one of them can be designated as the primary key.
    • Any attribute set that contains a candidate key is called a prime attribute. Otherwise, it is a non-prime attribute.
    • If the entire attribute set constitutes the candidate key, then this attribute set is called a superkey.
    • Both primary keys and candidate keys are commonly referred to as keys.

    Database Normalization

    1NF

    1NF1NF (First Normal Form) is one of the fundamental normal forms in relational databases. It requires that each attribute in a relation schema is atomic, meaning it cannot be further divided. In other words, each attribute in the relation schema should be single-valued, rather than containing multiple values or complex data types.

    Specifically, 1NF1NF requires that each cell in the relation contains only one value, rather than multiple values or complex data types. This helps ensure the atomicity of data, simplifying data processing and querying.

    For example, consider a student table containing names and phone numbers. If phone numbers are stored as a comma-separated list of multiple numbers, then this table does not conform to 1NF1NF. To adhere to 1NF1NF, phone numbers should be split into separate attributes, with each attribute containing only one phone number.

    2NF

    2NF2NF (Second Normal Form) is another normal form in relational databases, built upon the foundation of the first normal form (1NF1NF). 2NF2NF requires that each non-prime attribute in a relation schema is fully functionally dependent on the candidate key, rather than partially dependent on the candidate key.

    Specifically, if a relation schema has multiple attributes forming the candidate key, then each non-prime attribute should depend on all combinations of these attributes, rather than just depending on some of them.

    For example, consider a relation schema containing the following attributes: StudentID, CourseID, CourseName, StudentName, and Grade. In this schema, (StudentID, CourseID) forms the candidate key, representing the enrollment of students in courses.

    StudentIDCourseIDCourseNameStudentNameGrade
    101001MathAliceA
    101002PhysicsAliceB
    102001MathBobA
    102003ChemistryBobC

    In this example, the StudentName attribute is a non-prime attribute and is partially dependent on the candidate key (StudentID), rather than fully dependent on all attributes of the candidate key (StudentID, CourseID). For instance, the name of student Alice is only associated with Student ID 101, irrespective of the course enrollment.

    Therefore, this relation schema does not satisfy the second normal form (2NF2NF). To adhere to 2NF2NF, we should remove the StudentName attribute from this relation schema and place it in a separate table, where the candidate key is StudentIDStudentID. This way, the student name becomes fully dependent on the candidate key, rather than partially dependent on it.

    3NF

    Given a relation schema R<U,F>1NFR<U,F> \in 1NF and no set of attributes XX, attribute group YY, and non-candidate attribute ZZ such that XY,YZ,YXX \rightarrow Y, Y \rightarrow Z, Y \cancel{\rightarrow} X, it belongs to 3NF3NF. In other words, if a relation is in 2NF2NF and each non-candidate attribute does not transitively depend on the candidate key, then it is in 3NF3NF.

    Let's continue with the example of student course enrollment mentioned earlier and attempt to conform to the third normal form (3NF3NF).

    In this example, we have already removed the student name from the main table and placed it in a separate table as follows:

    Student Table:

    StudentIDStudentName
    101Alice
    102Bob

    Enrollment Table:

    StudentIDCourseIDGrade
    101001A
    101002B
    102001A
    102003C

    Now, let's see if it conforms to 3NF3NF. In this relation schema, we have a transitive dependency: Student ID (StudentID) determines Student Name (StudentName), while Course ID (CourseID) determines Student ID (StudentID). Hence, there exists a transitive functional dependency: CourseIDStudentNameCourseID \rightarrow StudentName.

    To conform to 3NF3NF, we need to eliminate this transitive dependency. One way to do this is by creating a new table that maps CourseIDCourseID and StudentIDStudentID to StudentNameStudentName to eliminate this transitive dependency. This way, we can ensure that each non-prime attribute is fully dependent on the primary key.

    Student Enrollment Relation Table:

    StudentIDCourseIDGrade
    101001A
    101002B
    102001A
    102003C

    Student Table:

    StudentIDStudentName
    101Alice
    102Bob

    Course Table:

    CourseIDCourseName
    001Mathematics
    002Physics
    003Chemistry

    Now, each table conforms to the third normal form (3NF3NF). The student table and the course table are associated with the student enrollment relation table without any transitive dependency.

    BCNF

    BCNFBCNF (Boyce-Codd Normal Form) is a more stringent form of normalization in relational databases, resulting from normalizing relation schemas further from the third normal form (3NF3NF). BCNFBCNF requires that every non-trivial functional dependency (non-trivial XYX \rightarrow Y, where YY is not a superset of XX) is determined by the keys of the relation.

    In other words, if RR is a relation schema and XYX \rightarrow Y is a non-trivial functional dependency of RR, and XX does not contain any candidate keys, then RR does not conform to BCNFBCNF, meaning every determinant must include a key.

    For example, consider a scheduling table now, recording student ID, course ID, and teacher ID. A teacher teaches only one course, and a course can be taught by multiple teachers, and each student selects a specific course, which corresponds to a fixed teacher. Assuming the candidate key is student ID, course ID. We can list the following functional relationships:

    • (StudentID,CourseID)TeacherID(StudentID,CourseID) \rightarrow TeacherID
    • (StudentID,TeacherID)CourseID(StudentID,TeacherID) \rightarrow CourseID
    • TeacherIDCourseIDTeacherID \rightarrow CourseID

    This relation does not conform to BCNFBCNF because we find that in the third functional dependency, TeacherIDTeacherID does not contain any keys. While the first and second have keys.

    4NF

    4NF4NF is a form of database normalization that further emphasizes the elimination of multi-valued dependencies on top of BCNFBCNF. A relation schema satisfies 4NF4NF if, in addition to satisfying the requirements of BCNFBCNF, it also has no non-trivial multi-valued dependencies. Non-trivial multi-valued dependencies refer to those where neither the left side nor the right side is a superkey of the relation.

    Let's consider an example that complies with BCNFBCNF but not with 4NF4NF.

    Suppose we have a relation schema RR containing attributes: Student, Course, Grade, and Teacher. And there exist the following functional dependencies:

    1. (Student,Course)Grade(Student, Course) \rightarrow Grade
    2. CourseTeacherCourse \rightarrow Teacher
    3. StudentTeacherStudent \twoheadrightarrow Teacher

    Assume that the candidate key for the relation schema is Student,CourseStudent, Course.

    This schema conforms to BCNFBCNF because every functional dependency is determined by the candidate key.

    However, it does not conform to 4NF4NF due to the presence of the third multi-valued dependency. This is because each student may attend multiple courses, and each course may have different teachers.

    Therefore, although this relation schema satisfies BCNFBCNF, it does not satisfy 4NF4NF.

    Relationship

    Understanding from 1NF1NF to 4NF4NF is progressive. The relationship between them is as follows:

    • 1NF1NF eliminates partial functional dependencies of non-prime attributes on the candidate key to achieve 2NF2NF.
    • 2NF2NF eliminates transitive functional dependencies of non-prime attributes on the candidate key to achieve 3NF3NF.
    • 3NF3NF eliminates partial and transitive functional dependencies of prime attributes on the candidate key to achieve BCNFBCNF.
    • BCNFBCNF eliminates non-trivial and functional dependencies of multi-valued dependencies to achieve 4NF4NF.