Normal Forms in RDBMS

Table of Contents

1. Database normalization

Database normalization (or normalisation) is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy.

Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First normal form (1NF) in 1970. Codd went on to define the Second normal form (2NF) and Third normal form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974.

Informally, a relational database table is often described as "normalized" if it meets Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.

数据库范式的目标——没有插入、更新、删除异常的情况下最大限度地减少数据冗余。

参考:http://www.bkent.net/Doc/simple5.htm

2. Anomalies

数据库操作异常有更新异常、插入异常和删除异常,后面将一一介绍它们。

参考:https://en.wikipedia.org/wiki/Database_normalization#Free_the_database_of_modification_anomalies

2.1. Update Anomaly

可能修改某个元组的信息,但是没有改变其他元组中的相同信息,导致数据库信息不一致。当然,读者可能认为这是用户不小心。但是可以重新对关系模式进行设计,使引起这种错误的风险不存在。

Table 1: 实例:容易引起更新异常的表设计
Employee ID Employee Address Skill
426 87 Sycamore Grove Typing
519 96 Walnut Avenue Public Speaking
519 96 Walnut Avenue Garpentry

比如在表 1 中我们想要修改员工号为 519 的住址(一共要修改两处位置),但由于某种原因只修改了一处,导致数据库不一致,这就是更新异常,表 2 所示。

Table 2: 实例:发生了更新异常
Employee ID Employee Address Skill
426 87 Sycamore Grove Typing
519 94 Chestnut Street Public Speaking
519 96 Walnut Avenue Garpentry

2.2. Insertion Anomaly

如果一个教职工刚刚来到学校,尚未分配教学任务,就无法把这个教职工的信息存入数据库。这称为插入异常。

如表 3 中,无法在下面表中加入新教职工("424 Dr. Newsome 29-Mar-1007"),除非把 Course Code 设置为 null。这就是插入异常。

Table 3: Faculty and There Courses
Faculty ID Faculty Name Faculty Hire Date Course Code
389 Dr. Giddens 10-Feb-1985 ENG-206
407 Dr. Sqperstein 19-Apr-1999 CMP-101
407 Dr. Sqperstein 19-Apr-1999 CMP-201

2.3. Deletion Anomaly

如果一个教职工所教的课程结束,删除信息时把教职工的其它信息也都删除了。这称为删除异常。

接着看上面的例子。现在我们想把教职工 Dr. Giddens 所教课程删除,除非把 Course Code 设置为 null,否则无法做到。这就是删除异常。

3. 1NF(每一列不可再分解)

A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.

 Customer ID   First Name   Surname     Telephone Number 
         123   Robert       Ingram          555-861-2025 
         456 
             
 Jane       
            
 Wright    
           
     555-403-1659 
     555-776-4100 
         789   Maria        Fernandez       555-808-9633 

上表中,第 2 条记录的 Telephone Number 列有两个值,这不满足第一范式的要求。要使其满足第一范式要求,可以改为下面形式:

 Customer ID   First Name   Surname     Telephone Number 
         123   Robert       Ingram          555-861-2025 
         456   Jane         Wright          555-403-1659 
         456   Jane         Wright          555-776-4100 
         789   Maria        Fernandez       555-808-9633 

4. 2NF(非主键列依赖于“整个”candidate key)

A table is in 2NF if it is in 1NF and every non-prime attribute of the table is dependent on the whole of every candidate key.

Table 6: Employees' skills (underlining denotes the key of this table)
Employee Skill Current Work Location
Brown Light Cleaning 73 Industrial Way
Brown Typing 73 Industrial Way
Harrison Light Cleaning 73 Industrial Way
Jones Shorthand 114 Main Street
Jones Typing 114 Main Street
Jones Whittling 114 Main Street

6 不满足第二范式的要求,因为列"Current Work Location"仅仅依赖于 Employee(它只是 candidate key {Employee, Skill}的一部分),而没有依赖于整个 candidate key {Employee, Skill}。

注:候选键(candidate key)是某个关系变量的一组属性所组成的集合,它需要同时满足下列两个条件:(1) 这个属性集合始终能够确保在关系中能“唯一地标识记录”;(2) 在这个属性集合中找不出真子集能够满足条件(1)。

详细解释如下:
Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (the employee might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore, the table is not in 2NF.

6 容易产生更新异常,更新"Current Work Location"时容易出现不一致的情况。把它改变成下面两个表(表 7 和表 8)后,它们能满足第二范式的要求,同时更新异常也消除了。

Table 7: Employees
Employee Current Work Location
Brown 73 Industrial Way
Harrison 73 Industrial Way
Jones 114 Main Street
Table 8: Employees' Skills
Employee Skill
Brown Light Cleaning
Brown Typing
Harrison Light Cleaning
Jones Shorthand
Jones Typing
Jones Whittling

参考:https://en.wikipedia.org/wiki/Second_normal_form

4.1. 更新异常可能出现在第二范式中

尽管前面的例子中,通过把表改造为符合第二范式的形式后,更新异常也消除了。但并不是说满足第二范式的表就没有更新异常。

9 满足第二范式,但也存在更新异常:

Table 9: Tournament Winners (underlining denotes the key of this table)
Tournament Year Winner Winner Date of Birth
Des Moines Masters 1998 Chip Masterson 14 March 1977
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple records. This leads to an update anomaly: if updates are not carried out consistently, a particular winner could be shown as having two different dates of birth.
The underlying problem is the transitive dependency to which the Winner Date of Birth attribute is subject. Winner Date of Birth actually depends on Winner, which in turn depends on the key Tournament / Year.

5. 3NF(非主键列“直接”依赖于整个 candidate key,不存在传递依赖)

要满足 3NF,首先必须是 2NF,另外非主键列必须直接依赖于 candidate key,不能存在传递依赖。如不能存在:非主键列 C 依赖于非主键列 B,非主键列 B 依赖于主键 A 的情况。

10 是一个满足第二范式,但不满足第三范式的例子。前面介绍过它存在更新异常,为方便再次摘抄如下:

Table 10: Tournament Winners (underlining denotes the key of this table)
Tournament Year Winner Winner Date of Birth
Des Moines Masters 1998 Chip Masterson 14 March 1977
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

这个表的列"Winner Date of Birth"依赖于"Winner",而"Winner"依赖于{Tournament, Year},存在传递依赖情况。不满足第三范式要求。

通过把表 10 拆分为下面两个表(表 11 和表 12),可以使它满足第三范式,同时消除了更新异常。

Table 11: Tournament Winners
Tournament Year Winner
Indiana Invitational 1998 Al Fredrickson
Cleveland Open 1999 Bob Albertson
Des Moines Masters 1999 Al Fredrickson
Indiana Invitational 1999 Chip Masterson
Table 12: Winner Dates of Birth
Winner Date of Birth
Chip Masterson 14 March 1977
Al Fredrickson 21 July 1975
Bob Albertson 28 September 1968

参考:https://en.wikipedia.org/wiki/Third_normal_form

6. 数据库范式简单总结

1NF: Only one value per column
2NF: All the non primary key columns in the table should depend on the entire primary key.
3NF: All the non primary key columns in the table should depend DIRECTLY on the entire primary key.

参考:http://stackoverflow.com/questions/723998/what-are-1nf-2nf-and-3nf-in-database-design

Author: cig01

Created: <2012-03-12 Mon>

Last updated: <2017-02-21 Tue>

Creator: Emacs 27.1 (Org mode 9.4)