Creating an ERD
What is an ERD? The entity-relationship model or entity-relationship diagram (ERD) is a model or diagram that makes a conceptual data model easy to understand. It is a visual representation of the entities, relationships, and rules that apply to or are present in a logical design. It is part of the design process of a relational database. An ERD contains entities, attributes, and relationships. As a rule, you draw an ERD in at least third normal form. It is based on an analysis and a design.
Why an ERD? There are two reasons for making an ERD: the physical layout of the database is derived from it, and it provides a functional check of whether everything can be realized. An ERD is the translation of a functional design (or class diagram) into the technical implementation in a physical database (e.g. MySQL).
Example ERD
Notations within an ERD
ERD 101 Cardinality: A Gene occurs on exactly one Chromosome. A Chromosome has zero or more Genes.
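The cardinality rule can be tried out in a few lines. This is only a minimal sketch, assuming Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample rows (chr1, chr2, geneB) are made up for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE chromosome (
    chromosome_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE gene (
    gene_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    -- NOT NULL plus a foreign key: every gene is on exactly one chromosome.
    chromosome_id INTEGER NOT NULL REFERENCES chromosome(chromosome_id)
);
""")

# Hypothetical sample data: chr2 deliberately has no genes.
conn.execute("INSERT INTO chromosome (chromosome_id, name) VALUES (1, 'chr1')")
conn.execute("INSERT INTO chromosome (chromosome_id, name) VALUES (2, 'chr2')")
conn.execute("INSERT INTO gene (name, chromosome_id) VALUES ('Mbp1', 1)")


def gene_count(chromosome_id):
    """Number of genes on a chromosome (zero or more)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM gene WHERE chromosome_id = ?", (chromosome_id,)
    ).fetchone()
    return row[0]


def try_insert_gene(name, chromosome_id):
    """Return True if the gene was stored, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO gene (name, chromosome_id) VALUES (?, ?)",
            (name, chromosome_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Reading the constraints aloud gives the two cardinality sentences back: the NOT NULL foreign key is "exactly one Chromosome", and the absence of any constraint on the chromosome side is "zero or more Genes".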
Normalization In this series of examples, we will discuss normalization of a simple data model that relates genes and publications that reference them. In a first attempt, we might define two tables: one contains data describing genes and one contains data describing a literature reference. Since there may be more than one publication referencing a gene, we provide for up to five references. We store the number of references a gene has in a separate column. We also store the organism name and the taxonomy ID in the gene table to identify the gene source. This model, while plausible on the surface, illustrates three common errors that can plague data models.
Normalization The first, and perhaps most obvious, problem is that we have stored several values of exactly the same type in the gene table. What happens if a gene has more than five references? This would break our model. And it is really inefficient to have to store null values whenever a gene has fewer than five references. A more subtle problem arises when we store values that can be derived in an exact, functional way from the data that is stored: can we guarantee that the data is consistent? What happens if the value and the data somehow become inconsistent? A third problem arises from storing the organism name and taxonomy ID in the gene table. Surely, the relationship between these two data items does not depend on which gene they are associated with. All three examples relate to unnecessary and redundant data that is not only inefficient to store but also invites inconsistencies and is a potential source of errors in the database.
First Normal Form The solution for the first problem (several values of exactly the same type stored in the gene table) is to eliminate the duplicated fields and store the related items in a separate table, a so-called join table. In our example, each entry in the join table describes one reference to one particular gene. It is no problem at all to have the same paper discuss more than one gene, and it is easy to store any number of references to a gene in this table.
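A join table of this kind might look as follows. This is a sketch under assumptions, not the schema from the slides: it uses Python's built-in sqlite3 module, and the table names (gene, publication, gene_reference), the surrogate keys, and the sample titles are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    gene_name TEXT NOT NULL
);
CREATE TABLE publication (
    pub_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL
);
-- Join table: each row is one reference to one particular gene.
CREATE TABLE gene_reference (
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    pub_id  INTEGER NOT NULL REFERENCES publication(pub_id),
    PRIMARY KEY (gene_id, pub_id)
);
""")

conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "Mbp1"), (2, "geneB")])
conn.executemany(
    "INSERT INTO publication VALUES (?, ?)",
    [(i, f"Paper {i}") for i in range(1, 8)],
)
# Seven references for Mbp1: more than five fixed columns could hold.
conn.executemany(
    "INSERT INTO gene_reference VALUES (1, ?)", [(i,) for i in range(1, 8)]
)
# The same paper may also discuss a second gene.
conn.execute("INSERT INTO gene_reference VALUES (2, 1)")


def papers_for(gene_name):
    """Titles of all publications that reference the given gene."""
    rows = conn.execute(
        """SELECT p.title
           FROM publication p
           JOIN gene_reference r ON r.pub_id = p.pub_id
           JOIN gene g ON g.gene_id = r.gene_id
           WHERE g.gene_name = ?
           ORDER BY p.pub_id""",
        (gene_name,),
    ).fetchall()
    return [title for (title,) in rows]


def genes_for(pub_id):
    """Names of all genes referenced by the given publication."""
    rows = conn.execute(
        """SELECT g.gene_name
           FROM gene g
           JOIN gene_reference r ON r.gene_id = g.gene_id
           WHERE r.pub_id = ?
           ORDER BY g.gene_id""",
        (pub_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

No row ever holds a null placeholder for a missing reference; a gene with fewer references simply has fewer rows in the join table.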
Second Normal Form In the second form, we ensure a table stores only attributes of an entity that actually depend on the entity. In our example, organism_name and Tax_ID depend on each other, but not on the gene_name. It is sufficient to store either one of them with the gene_name. However, in practice abstract ID keys are usually preferred over keys that have semantics, such as a name or a date, simply because it is easy to keep an internal ID unique and usable as a key, but it may be hard to force the real world to behave in the same way. If two different organisms had the same name, our model would break if we relied on the name being unique. Finally, we eliminate everything that does not need to be stored in the table because it can be computed from data in the table. For instance, the number of references for a gene can easily be obtained by an SQL select statement such as SELECT COUNT(*) FROM Reference WHERE Reference.gene_name = 'Mbp1' AND Reference.Tax_ID = '12345'.
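Both ideas, moving organism_name out of the gene data and computing the reference count instead of storing it, can be sketched together. The sketch assumes Python's built-in sqlite3 module; the Organism table, the pub_id column, the organism name, and the sample rows are illustrative assumptions, while Reference, gene_name, and Tax_ID follow the query quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- organism_name depends only on Tax_ID, so it gets its own table.
CREATE TABLE Organism (
    Tax_ID        INTEGER PRIMARY KEY,
    organism_name TEXT NOT NULL
);
-- One row per reference; no "number of references" column is stored.
CREATE TABLE Reference (
    gene_name TEXT    NOT NULL,
    Tax_ID    INTEGER NOT NULL REFERENCES Organism(Tax_ID),
    pub_id    INTEGER NOT NULL,
    PRIMARY KEY (gene_name, Tax_ID, pub_id)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO Organism VALUES (12345, 'example organism')")
conn.executemany(
    "INSERT INTO Reference VALUES ('Mbp1', 12345, ?)", [(1,), (2,), (3,)]
)


def reference_count(gene_name, tax_id):
    """Derive the number of references on demand instead of storing it."""
    row = conn.execute(
        "SELECT COUNT(*) FROM Reference WHERE gene_name = ? AND Tax_ID = ?",
        (gene_name, tax_id),
    ).fetchone()
    return row[0]


def organism_name(tax_id):
    """Look the name up through the ID; it is stored exactly once."""
    row = conn.execute(
        "SELECT organism_name FROM Organism WHERE Tax_ID = ?", (tax_id,)
    ).fetchone()
    return row[0] if row else None
```

Because the count is computed from the rows themselves, adding a reference can never leave a stale counter behind.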
Third Normal Form These operations finally give us the data model in Third Normal Form. Note that for efficiency reasons, data models may be intentionally denormalized: we may wish to store the results of expensive computations, for example, or we may store views (the results of SELECT operations) if they are complex and take a long time to compute, although the results could be derived from the data. However, such denormalization should never happen unintentionally and should never go undocumented.
ERD checklist: the 10 commandments
1. Has a PK been determined for every entity?
2. Have the FKs been determined?
3. Are the entities connected?
4. Is the cardinality correct? (Read it aloud!)
5. Does it contain an ATAMHA?
6. Does it contain an unnecessary snake (slang)?
7. Can I store the information I need?
8. Can I retrieve the information I need?
9. How am I going to show the information in the application? (SQL!)
10. What does my neighbor say about it? (Please not during a test or exam...)
Database schema
Table Name: Gene
Semantics: A gene, its attributes, and a reference to the Chromosome table.
Column Name | Type | Properties | Semantics
Gene ID | INT | PRIMARY KEY, AUTOINCREMENT, NOT NULL | Artificial key, no semantics outside of this database.
Gene Name | VARCHAR(10) | NOT NULL | The common name of the gene.
Chromosome ID | | Foreign Key, references Chromosome | References the Chromosome the gene is found on. Assumes a gene is found on exactly one Chromosome.
Start | | | Defines the first nucleotide (in chromosome coordinates) that is annotated as belonging to the gene.
End | | | Defines the last nucleotide (in chromosome coordinates) that is annotated as belonging to the gene. If End < Start, the gene is on the (-) strand.
Database schema 2
Table Name: Chromosome
Semantics: A chromosome, some attributes, and its sequence.
Column Name | Type | Properties | Semantics
Chromosome ID | INT | PRIMARY KEY, AUTOINCREMENT, NOT NULL | Artificial key, no semantics outside of this database.
Chromosome Name | VARCHAR(10) | NOT NULL | The number/letter of the chromosome. Not required to be unique.
sequence | LONGTEXT | - | Holds the (+) strand sequence. Sequence defines chromosome coordinates. Gaps are filled with the character "N".
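The two table descriptions could be turned into DDL roughly as follows. This is a sketch with several assumptions, not the definitive schema: it uses Python's built-in sqlite3 module rather than MySQL, so LONGTEXT becomes TEXT; the slides give no types for Start and End, so INTEGER is assumed; the columns are renamed start_pos and end_pos because END is an SQL keyword; and the foreign key is made NOT NULL to reflect "exactly one Chromosome". All sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.executescript("""
CREATE TABLE Chromosome (
    chromosome_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial key
    chromosome_name VARCHAR(10) NOT NULL,               -- need not be unique
    sequence        TEXT                                -- (+) strand, gaps as 'N'
);
CREATE TABLE Gene (
    gene_id       INTEGER PRIMARY KEY AUTOINCREMENT,    -- artificial key
    gene_name     VARCHAR(10) NOT NULL,
    chromosome_id INTEGER NOT NULL
                  REFERENCES Chromosome(chromosome_id), -- assumed NOT NULL
    start_pos     INTEGER,                              -- assumed type
    end_pos       INTEGER                               -- assumed type
);
""")

# Hypothetical sample data.
conn.execute(
    "INSERT INTO Chromosome (chromosome_name, sequence) "
    "VALUES ('chr1', 'ACGTNNNNACGT')"
)
conn.executemany(
    "INSERT INTO Gene (gene_name, chromosome_id, start_pos, end_pos) "
    "VALUES (?, 1, ?, ?)",
    [("Mbp1", 2, 5), ("geneB", 11, 9)],
)


def strand(gene_name):
    """Derive the strand from the coordinates: end < start means (-)."""
    row = conn.execute(
        """SELECT CASE WHEN end_pos < start_pos THEN '-' ELSE '+' END
           FROM Gene WHERE gene_name = ?""",
        (gene_name,),
    ).fetchone()
    return row[0] if row else None
```

The strand is derived from Start and End rather than stored in a column of its own, which keeps the schema consistent with the rule that computable values are not stored.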