Normal Forms in RDBMS
Table of Contents
1. Database normalization
Database normalization (or normalisation) is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy.
Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First normal form (1NF) in 1970. Codd went on to define the Second normal form (2NF) and Third normal form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974.
Informally, a relational database table is often described as "normalized" if it meets Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.
数据库范式的目标——没有插入、更新、删除异常的情况下最大限度地减少数据冗余。
2. Anomalies
数据库操作异常有更新异常、插入异常和删除异常,后面将一一介绍它们。
参考:https://en.wikipedia.org/wiki/Database_normalization#Free_the_database_of_modification_anomalies
2.1. Update Anomaly
可能修改某个元组的信息,但是没有改变其他元组中的相同信息,导致数据库信息不一致。当然,读者可能认为这是用户不小心。但是可以重新对关系模式进行设计,使引起这种错误的风险不存在。
Employee ID | Employee Address | Skill |
---|---|---|
426 | 87 Sycamore Grove | Typing |
519 | 96 Walnut Avenue | Public Speaking |
519 | 96 Walnut Avenue | Garpentry |
比如在表 1 中我们想要修改员工号为 519 的住址(一共要修改两处位置),但由于某种原因只修改了一处,导致数据库不一致,这就是更新异常,表 2 所示。
Employee ID | Employee Address | Skill |
---|---|---|
426 | 87 Sycamore Grove | Typing |
519 | 94 Chestnut Street | Public Speaking |
519 | 96 Walnut Avenue | Garpentry |
2.2. Insertion Anomaly
如果一个教职工刚刚来到学校,尚未分配教学任务,就无法把这个教职工的信息存入数据库。这称为插入异常。
如表 3 中,无法在下面表中加入新教职工("424 Dr. Newsome 29-Mar-1007"),除非把 Course Code 设置为 null。这就是插入异常。
Faculty ID | Faculty Name | Faculty Hire Date | Course Code |
---|---|---|---|
389 | Dr. Giddens | 10-Feb-1985 | ENG-206 |
407 | Dr. Sqperstein | 19-Apr-1999 | CMP-101 |
407 | Dr. Sqperstein | 19-Apr-1999 | CMP-201 |
2.3. Deletion Anomaly
如果一个教职工所教的课程结束,删除信息时把教职工的其它信息也都删除了。这称为删除异常。
接着看上面的例子。现在我们想把教职工 Dr. Giddens 所教课程删除,除非把 Course Code 设置为 null,否则无法做到。这就是删除异常。
3. 1NF(每一列不可再分解)
A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.
Customer ID | First Name | Surname | Telephone Number |
123 | Robert | Ingram | 555-861-2025 |
456 |
Jane |
Wright |
555-403-1659 555-776-4100 |
789 | Maria | Fernandez | 555-808-9633 |
上表中,第 2 条记录的 Telephone Number 列有两个值,这不满足第一范式的要求。要使其满足第一范式要求,可以改为下面形式:
Customer ID | First Name | Surname | Telephone Number |
123 | Robert | Ingram | 555-861-2025 |
456 | Jane | Wright | 555-403-1659 |
456 | Jane | Wright | 555-776-4100 |
789 | Maria | Fernandez | 555-808-9633 |
4. 2NF(非主键列依赖于“整个”candidate key)
A table is in 2NF if it is in 1NF and every non-prime attribute of the table is dependent on the whole of every candidate key.
Employee | Skill | Current Work Location |
---|---|---|
Brown | Light Cleaning | 73 Industrial Way |
Brown | Typing | 73 Industrial Way |
Harrison | Light Cleaning | 73 Industrial Way |
Jones | Shorthand | 114 Main Street |
Jones | Typing | 114 Main Street |
Jones | Whittling | 114 Main Street |
表 6 不满足第二范式的要求,因为列"Current Work Location"仅仅依赖于 Employee(它只是 candidate key {Employee, Skill}的一部分),而没有依赖于整个 candidate key {Employee, Skill}。
注:候选键(candidate key)是某个关系变量的一组属性所组成的集合,它需要同时满足下列两个条件:(1) 这个属性集合始终能够确保在关系中能“唯一地标识记录”;(2) 在这个属性集合中找不出真子集能够满足条件(1)。
详细解释如下:
Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (the employee might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.
The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore, the table is not in 2NF.
表 6 容易产生更新异常,更新"Current Work Location"时容易出现不一致的情况。把它改变成下面两个表(表 7 和表 8)后,它们能满足第二范式的要求,同时更新异常也消除了。
Employee | Current Work Location |
---|---|
Brown | 73 Industrial Way |
Harrison | 73 Industrial Way |
Jones | 114 Main Street |
Employee | Skill |
---|---|
Brown | Light Cleaning |
Brown | Typing |
Harrison | Light Cleaning |
Jones | Shorthand |
Jones | Typing |
Jones | Whittling |
4.1. 更新异常可能出现在第二范式中
尽管前面的例子中,通过把表改造为符合第二范式的形式后,更新异常也消除了。但并不是说满足第二范式的表就没有更新异常。
表 9 满足第二范式,但也存在更新异常:
Tournament | Year | Winner | Winner Date of Birth |
---|---|---|---|
Des Moines Masters | 1998 | Chip Masterson | 14 March 1977 |
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple records. This leads to an update anomaly: if updates are not carried out consistently, a particular winner could be shown as having two different dates of birth.
The underlying problem is the transitive dependency to which the Winner Date of Birth attribute is subject. Winner Date of Birth actually depends on Winner, which in turn depends on the key Tournament / Year.
5. 3NF(非主键列“直接”依赖于整个 candidate key,不存在传递依赖)
要满足 3NF,首先必须是 2NF,另外非主键列必须直接依赖于 candidate key,不能存在传递依赖。如不能存在:非主键列 C 依赖于非主键列 B,非主键列 B 依赖于主键 A 的情况。
表 10 是一个满足第二范式,但不满足第三范式的例子。前面介绍过它存在更新异常,为方便再次摘抄如下:
Tournament | Year | Winner | Winner Date of Birth |
---|---|---|---|
Des Moines Masters | 1998 | Chip Masterson | 14 March 1977 |
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
这个表的列"Winner Date of Birth"依赖于"Winner",而"Winner"依赖于{Tournament, Year},存在传递依赖情况。不满足第三范式要求。
通过把表 10 拆分为下面两个表(表 11 和表 12),可以使它满足第三范式,同时消除了更新异常。
Tournament | Year | Winner |
---|---|---|
Indiana Invitational | 1998 | Al Fredrickson |
Cleveland Open | 1999 | Bob Albertson |
Des Moines Masters | 1999 | Al Fredrickson |
Indiana Invitational | 1999 | Chip Masterson |
Winner | Date of Birth |
---|---|
Chip Masterson | 14 March 1977 |
Al Fredrickson | 21 July 1975 |
Bob Albertson | 28 September 1968 |
6. 数据库范式简单总结
1NF: Only one value per column
2NF: All the non primary key columns in the table should depend on the entire primary key.
3NF: All the non primary key columns in the table should depend DIRECTLY on the entire primary key.
参考:http://stackoverflow.com/questions/723998/what-are-1nf-2nf-and-3nf-in-database-design