I also have enormous trouble with 'normalizing' data into separated tables to prevent redundant data.
I'd be doing you no favours if I said normalization is an easy concept to get your teeth around, but on the other hand it's not as difficult as might be imagined by someone looking at the definitions of the normal forms for the first time, particularly if you
concentrate for the time being on the first three normal forms. That's not to say that the higher normal forms are unimportant; if a table needs normalizing beyond 3NF then it should be, but in most cases a table which is normalized to 3NF will be in the
higher normal forms without any further action.
Let's start by looking briefly at the formal definitions of the first three normal forms. Don't worry if these sound like gobbledygook at first:
First Normal Form: A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute.
Second Normal Form: A relvar is in second normal form if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key.
Third Normal Form: A relvar is in 3NF if and only if it is in 2NF and every non-key attribute is nontransitively dependent on the primary key.
Loosely speaking, in the language of the relational model, a relvar (relation variable) equates to a table, a tuple to a row (record) and an attribute to a column (field).
So what does this all mean? To understand this we need to step back a little further and consider just what is meant by the word 'dependent' in these definitions as it is this concept which is at the root of the definitions. The term is really a shorthand
for 'functionally dependent on', or 'functionally determined by'.
A column is functionally determined by another column (or columns) if from the value of the first column(s) the value of the second is always known to be the same.
So in a Contacts table for instance, if there is a row for me, and the value of the ContactID primary key of the table is 42, then from the ContactID value 42, wherever we encounter it, we always know that the value of the FirstName attribute is 'Ken' and that
of the LastName attribute 'Sheridan'. For a table to be in Third Normal Form all non-key columns must be functionally determined solely by the whole of the primary key. So if the Contacts table also includes CityID and CountyID columns, where CityID represents
'Stafford' then CityID is functionally determined solely by the whole of the primary key ContactID because it's where I live and there is no other column in the table from which you can deduce that I live in Stafford.
In this hypothetical Contacts table the value at the CountyID column position of my row represents Staffordshire. Now this is again functionally determined by the whole of the primary key ContactID, because it's the County where I live, but it is not solely
determined by ContactID as from CityID we can also deduce that I live in Staffordshire because that is where Stafford, not surprisingly, is located. So CountyID is transitively functionally determined by ContactID via CityID, which means the table is not
normalized to Third Normal Form (3NF).
So what, you might ask? The reason this is important is that it leaves the table wide open to update anomalies. Say I move to Lancaster, in which case the CityID is updated to whatever is the value which represents Lancaster, but CountyID is not updated to
the value representing Lancashire. My cousin, who is also in the table, has always lived in Lancaster. We now have two inconsistent rows, one which tells us Lancaster is in Staffordshire, one which tells us it's in Lancashire. It's pretty obvious which
is correct to anyone with a passing acquaintance with English geography or the etymology of English county names, but that's beside the point. Redundancy is not merely inefficient, but opens the door to such update anomalies, and as Murphy's Law tells us
"Anything that can go wrong, will go wrong ".
By normalizing the table by the removal of the redundant CountyID column the problem is eliminated, because we know from the one row for Lancaster in the Cities table that it is in Lancashire. To see the county for each contact we simply join the Contacts
and Cities tables on CityID.
However, a column might not be transitively functionally dependent even though at first sight it might be thought to be so. The classic example of this is a UnitPrice attribute of a product in an OrderDetails table, where an Orders table has a UnitPrice column.
It might be thought that to include a UnitPrice column in the OrderDetails table as well as a ProductID column would be redundant as the price is determined by the ProductID. This is not the case, though, as over time the unit price of a product will change,
but the price in the OrderDetails table should be fixed as that at the time the order was created. The UnitPrice column in Products is determined by its key, ProductID, but that in OrderDetails is determined not by the ProductID column in OrderDetails, but
by the key of that table, which is a composite one of OrderID and ProductID. So the UnitPrice column in OrderDetails is determined solely by the whole of the key, and is therefore not transitively functionally dependent on the key. Consequently the table
is normalized to Third Normal Form and the inclusion of UnitPrice columns in both tables is legitimate and necessary. When a new OrderDetails record is inserted the current unit price for the product can be looked up from Products and assigned to the UnitPrice
column in OrderDetails where its value will remain fixed whatever changes are made to the UnitPrice column's value in Products.
So, if we come back to the definitions of the first three normal forms, what these amount to is that any non-key column in a table must be functionally determined solely by the whole of the table's key, bearing in mind that key in this context can be defined
as the table's primary key, or can be a 'candidate key', i.e. a column or set of columns whose values must be distinct even if not defined as the primary key. There is a useful mnemonic for remembering this: 'The key, the whole key and nothing but the key,
so help me Codd'. It was Edgar F. Codd who first introduced the database relational model back in 1971.
So, to create a set of tables normalized to Third Normal Form is a question of decomposing the single table represented by the Excel worksheet into a set of related tables in which each table's columns are determined. Each table should represent a distinct
entity, remembering that some may also represent a relationship type between other tables (sometimes colloquially called 'junction' tables), and each column of each table must represent an attribute of the entity type which is determined solely by the whole
of the table's key. Much of this is intuitive, but a reasonable understanding of the 'rules' as outlined above means you can ask the right questions of the design to determine if a table is properly normalized.
The 'logical model' made up of the tables and the relationships types between them is of course a model of entity types and relational types which exist in the real world. It was Codd's realization that by modelling the reality in this way certain mathematical
operations of relational algebra could be applied to the model. He originally identified eight such operations. Under the skin it is this mathematics which is the basis of how a relational database works, though over the years it has been developed from
Codd's original ideas. Of course, we don't have to understand the mathematics itself as database management systems like Access wrap it up and serve it to us in ways which we mere mortals can handle. What we need to be certain of, however, is that our understanding
of that part of the real world which we are modelling is correct, so that we can correctly translate it into tables and relationship types. For this reason, while we can of course advise, you are in a far better position than we are to put together a solid
model, as you will have a far deeper understanding of the reality being modelled. One thing I would advise strongly, though, is that you draw the model out diagrammatically on paper first, creating an 'entity relationship diagram'. This is similar what you
see in the relationship window in Access if it is well arranged. By asking questions of the model you can then see to what degree it will do what you want , or where there might be holes in it which need fixing. Only when you are fully satisfied with the
model should you create the necessary tables and relationships. The it's a question of appending the data from the Excel worksheet into the tables. This is generally not too difficult a task once the correct sequence is established You'll find a simple
example as DecomposerDemo.zip in my public databases folder at:
https://skydrive.live.com/?cid=44CC60D7FEA42912&id=44CC60D7FEA42912!169
You might have to copy the text of the link into your browser's address bar (not the link location). For some reason it doesn't always seem to work as a hyperlink.
Remember that you should only append 'base data', not values which can be computed from the base data; that would introduce redundancy.