Without any context as to where I'm going with this, which of the following do you think is better (purposefully vague):
- Storing data about data.
- Interpreting data about data.
Yes, I'm talking about metadata, in its most unassuming definition.
I can argue either side of this one, and can even be impartial on the grounds of "it depends". Because of that, I'm not here to pick one or the other but to get other developers thinking about this when designing data stores and data-driven applications.
We'll start with a simple scenario:
Design an application to manage attendance information for a conference. Data must be held in a relational database. The application needs to know how many people can attend a conference and when the conference is full.
Based on that, let's assume two tables, Conference and Attendee. Also assume a foreign key from Attendee to Conference on ConferenceID. Something like this:
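A minimal sketch of what such a schema might look like, using Python's built-in sqlite3 module with an in-memory database. Only the table names, ConferenceID and MaxAttendees come from the scenario; AttendeeID and the Name columns are assumptions added for illustration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the Conference and Attendee tables (some columns are assumed)."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE Conference (
            ConferenceID INTEGER PRIMARY KEY,
            Name         TEXT NOT NULL,        -- hypothetical column
            MaxAttendees INTEGER NOT NULL CHECK (MaxAttendees >= 0)
        );
        CREATE TABLE Attendee (
            AttendeeID   INTEGER PRIMARY KEY,  -- hypothetical column
            Name         TEXT NOT NULL,        -- hypothetical column
            ConferenceID INTEGER NOT NULL
                REFERENCES Conference (ConferenceID)
        );
        """
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

With the foreign key in place, an Attendee row cannot point at a conference that does not exist.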
We have the schema laid out but there is another requirement to take into consideration:
The application needs to know how many people can attend a conference and when the conference is full.
I'm sure some of you already know where I'm going with this one. Let's analyze our options:
- Store data about data. Since our application needs to know when to stop registering attendees for any given conference, we need to know when the MaxAttendees attribute on our Conference table has been satisfied. Our first choice is to store this data, in the form of a bit field on the Conference table. While the number of attendees for any given conference is less than the max for that conference, our new Full field will be 0 (false). Once the number of attendees hits our max, we set Full to 1.
- Interpret data about data. Similar to our solution above, but instead of storing this field we can gather this information when it is needed. For example, if someone using our application tries registering for a conference we can determine at that time whether or not the conference is full.
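The two options above can be sketched as two ways of answering the same question. This is only an illustration, assuming SQLite through Python's sqlite3 module; AttendeeID and the function names are hypothetical, and both functions assume the conference row exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Conference (
        ConferenceID INTEGER PRIMARY KEY,
        MaxAttendees INTEGER NOT NULL,
        -- Quoted because FULL is also an SQL keyword.
        -- Only the "store" option reads or writes this column.
        "Full"       INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE Attendee (
        AttendeeID   INTEGER PRIMARY KEY,  -- hypothetical column
        ConferenceID INTEGER NOT NULL REFERENCES Conference (ConferenceID)
    );
    """
)


def is_full_stored(conn: sqlite3.Connection, conference_id: int) -> bool:
    """Store option: trust the flag that other code keeps up to date."""
    row = conn.execute(
        """SELECT "Full" FROM Conference WHERE ConferenceID = ?""",
        (conference_id,),
    ).fetchone()
    return bool(row[0])


def is_full_interpreted(conn: sqlite3.Connection, conference_id: int) -> bool:
    """Interpret option: derive the answer from the attendee rows on demand."""
    count, max_attendees = conn.execute(
        """
        SELECT COUNT(a.AttendeeID), c.MaxAttendees
        FROM Conference AS c
        LEFT JOIN Attendee AS a ON a.ConferenceID = c.ConferenceID
        WHERE c.ConferenceID = ?
        GROUP BY c.ConferenceID, c.MaxAttendees
        """,
        (conference_id,),
    ).fetchone()
    return count >= max_attendees
```

The stored version is a single-row lookup, but it is only as correct as whatever code last wrote the flag. The interpreted version cannot drift out of sync, but it pays for a join and a count on every call.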
This is another debate between maintainability and performance. Let's see why:
- Store data about data. If we add a Full field to our Conference table, we create another line of maintenance. Each time we add an attendee to a given conference we need to determine whether the conference is now full, and if so set our Full field to 1. When we remove an attendee from a conference we can simply set Full to 0 if it isn't already. Since this field requires maintenance through logic, we also open the door to bugs in our application. But since we store this data, we can read the availability of a conference at any time, without any additional logic in our application and without any join between our two tables. Note: you may be thinking that this implementation doesn't actually win on performance, because each attendee insert still needs to do some type of aggregation to determine conference availability. That's OK, since the next implementation needs to do the same thing.
- Interpret data about data. Without a Full field for a conference, we need to determine the availability of a conference by joining our two tables together and counting the attendees on each insert. That can get costly. Also, if we have additional applications touching this data (such as reports), those applications need to do the same thing. But since we don't have an additional field to manage, the potential for bugs decreases and the maintainability of our application increases. By managing complexity we reduce the amount of logic needed for this application. Note: we've managed complexity by removing the need to maintain a Full field. We obviously still need to aggregate our attendee data to determine conference availability.
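To make the maintenance cost concrete, here is a rough sketch of the write path under each approach, again assuming SQLite via Python's sqlite3 module. The function names and AttendeeID are hypothetical, the conference is assumed to exist, and concurrent writers are not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Conference (
        ConferenceID INTEGER PRIMARY KEY,
        MaxAttendees INTEGER NOT NULL,
        "Full"       INTEGER NOT NULL DEFAULT 0  -- used by the "store" option only
    );
    CREATE TABLE Attendee (
        AttendeeID   INTEGER PRIMARY KEY,         -- hypothetical column
        ConferenceID INTEGER NOT NULL REFERENCES Conference (ConferenceID)
    );
    """
)


def register_stored(conn: sqlite3.Connection, conference_id: int) -> bool:
    """Store option: the insert must also recompute and save the Full flag."""
    (full,) = conn.execute(
        """SELECT "Full" FROM Conference WHERE ConferenceID = ?""",
        (conference_id,),
    ).fetchone()
    if full:
        return False
    with conn:  # commits the insert and the flag update together, or neither
        conn.execute(
            "INSERT INTO Attendee (ConferenceID) VALUES (?)", (conference_id,)
        )
        conn.execute(
            """
            UPDATE Conference
            SET "Full" = (SELECT COUNT(*) FROM Attendee
                          WHERE ConferenceID = :id) >= MaxAttendees
            WHERE ConferenceID = :id
            """,
            {"id": conference_id},
        )
    return True


def unregister_stored(conn: sqlite3.Connection, attendee_id: int) -> bool:
    """Store option: removing an attendee means the conference has room again."""
    row = conn.execute(
        "SELECT ConferenceID FROM Attendee WHERE AttendeeID = ?", (attendee_id,)
    ).fetchone()
    if row is None:
        return False  # nothing was removed, so the flag must stay as it is
    with conn:
        conn.execute("DELETE FROM Attendee WHERE AttendeeID = ?", (attendee_id,))
        conn.execute(
            """UPDATE Conference SET "Full" = 0 WHERE ConferenceID = ?""", row
        )
    return True


def register_interpreted(conn: sqlite3.Connection, conference_id: int) -> bool:
    """Interpret option: no flag to maintain; count the attendees every time."""
    count, max_attendees = conn.execute(
        """
        SELECT (SELECT COUNT(*) FROM Attendee
                WHERE ConferenceID = c.ConferenceID), c.MaxAttendees
        FROM Conference AS c
        WHERE c.ConferenceID = ?
        """,
        (conference_id,),
    ).fetchone()
    if count >= max_attendees:
        return False
    with conn:
        conn.execute(
            "INSERT INTO Attendee (ConferenceID) VALUES (?)", (conference_id,)
        )
    return True
```

In this sketch both registration paths run a count, which matches the notes above. The difference is that the stored approach needs extra write logic (including an unregister path that touches the flag) in exchange for cheap reads elsewhere, while the interpreted approach needs no unregister logic beyond the delete itself.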
I hope you see the trade-offs here. Neither solution is a silver bullet; your implementation should depend on your application. Also keep in mind that these considerations do not apply only to database design. These could just as well have been properties in classes or attributes in an XML file. The point here is the title of this post: should we store data about data or interpret it? It depends. There are a lot of considerations to be made when answering that question. Is concurrency an issue? What type of load is anticipated? How many developers will be working on this solution? Those are just the more obvious ones; I'm sure there are many others depending on the actual application.
This is a design consideration and one that should be well thought out. It wouldn't be a lot of fun to change your implementation halfway through your project because you hastily made a decision.
This is not a cut-and-dried post, and it wasn't supposed to be. When advice is given concerning design, the advice isn't always an answer. But since it's advice, it should be enough to drive good design decisions. I hope this was just that.
