One of the first things you learn at university in your first year of computer science is data normalisation. I don’t know about the other people out there, but I found it such an utterly boring course. Mankind has such an obsession with categorising every single piece of data that this behaviour is crammed into the minds of naïve and unknowing computer science students, just fresh from high school.
Why is that? Well, it does make it easier for us mortals to grasp the vast amount of data that is around us. Think about the early days of Yahoo!, which used a directory approach and manual work to build a whole categorised database of web links.
But then cracks start appearing in our approach. Familiar with the “in which folder am I going to put this email?” problem? First you end up with a gazillion folders, and then the problem pops up that an email can belong to two different folders. After a while you start to wonder, “where the hell is that message I filed?”.
Social Networks and Cloud driving the change
As a technologist, I also understand the underlying technical reasons why we want to normalise and categorise data. You start by “objectifying” your view of the world in an object-oriented language like Java, so it is easier to program and maintain those systems. This is persisted in a classic row/column structure in a database, where you create different tables for different data concepts: the student object with its properties stored in a Student table, the course object with its properties in a Course table. Now you can create meaningful relations between the two and develop your software accordingly.
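To make that concrete, here is a minimal sketch of the Student/Course example in SQLite. The table and column names (Student, Course, Enrolment) are hypothetical illustrations, not from any particular system; a join table expresses the relation between the two concepts.

```python
import sqlite3

# In-memory database purely for illustration; the schema below is a
# hypothetical example of the normalised Student/Course structure.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Course  (id INTEGER PRIMARY KEY, title TEXT);
-- A join table expresses the many-to-many relation between the two.
CREATE TABLE Enrolment (
    student_id INTEGER REFERENCES Student(id),
    course_id  INTEGER REFERENCES Course(id)
);
""")
cur.execute("INSERT INTO Student VALUES (1, 'Ada')")
cur.execute("INSERT INTO Course VALUES (10, 'Databases 101')")
cur.execute("INSERT INTO Enrolment VALUES (1, 10)")

# A meaningful relation between the two: which courses is student 1 in?
cur.execute("""
SELECT Course.title FROM Course
JOIN Enrolment ON Enrolment.course_id = Course.id
WHERE Enrolment.student_id = 1
""")
rows = cur.fetchall()
print(rows)  # [('Databases 101',)]
```

Each concept lives in exactly one table, and queries reassemble the pieces with joins; this is the discipline that first-year normalisation courses drill into students.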
When your application starts to get a huge load, you will either spend an awful lot of money on expensive Oracle cluster software, or you will use concepts like sharding to split your databases and keep them performing. But still, you stick to your normalised Student and Course table structures.
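The sharding idea can be sketched in a few lines: route each row to one of several database “shards” based on its key. This is a toy illustration under simplifying assumptions; real sharding schemes also deal with rebalancing, replication and cross-shard queries.

```python
# Toy sharding sketch: each shard stands in for a separate database
# server, here modelled as a plain dict keyed by student id.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(student_id: int) -> dict:
    # Simple modulo routing: the key alone determines the shard.
    return shards[student_id % NUM_SHARDS]

def put_student(student_id: int, row: dict) -> None:
    shard_for(student_id)[student_id] = row

def get_student(student_id: int) -> dict:
    return shard_for(student_id)[student_id]

put_student(42, {"name": "Ada", "course": "Databases 101"})
print(get_student(42)["name"])  # Ada
```

Within each shard the schema is still the normalised one; sharding only decides which machine holds which slice of the rows.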
An approach that started to get traction in some successful internet startups was the concept of key/value pairs. While not a novel concept at all, it became more popular as more and more websites couldn’t cope with their number of users. The idea is that you don’t normalise data into Course and Student tables; you just have a key and an associated value in one huge table. This approach scales extremely well because it can easily be spread over a cluster of machines. Google’s BigTable uses it, as does Facebook’s Cassandra. It became more “mainstream” with Amazon’s introduction of SimpleDB, which is built on this key/value concept, as well as Microsoft’s key/value storage in Azure.
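A minimal sketch of the same student/course data as one big key/value table, assuming a hypothetical key-naming convention (it is not the API of any particular store):

```python
# One big key/value table instead of normalised Student/Course tables.
# The "type:id:field" key convention is a made-up example.
store = {}

def put(key: str, value) -> None:
    store[key] = value

def get(key: str):
    return store.get(key)

# The same data as before, denormalised into key/value pairs:
put("student:1:name", "Ada")
put("course:10:title", "Databases 101")
put("student:1:courses", ["course:10"])

# Every lookup is a single-key read; because each key is independent,
# keys can be spread over a cluster of machines trivially.
titles = [get(key + ":title") for key in get("student:1:courses")]
print(titles)  # ['Databases 101']
```

There are no joins and no schema; the application encodes any relations it needs into the keys and values themselves, which is exactly what makes the model easy to distribute and, as the next section argues, hard for relationally-trained developers to think in.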
SSDs to the rescue!
The only problem is… that developers have been so brainwashed with the relational row/column concept that the average corporate developer has a hard time grasping this key/value concept and building meaningful apps on top of it.
Both Amazon and Microsoft have realised that and have added a relational layer on top of their key/value storage engines. This means that the vendors can use key/value storage to achieve extreme scalability, while letting developers interact through a relational interface. (It does create a performance penalty, but a negligible one for most users, I think.)
But still, even this key/value approach requires that you model your apps and data in a particular way. What if we didn’t have to make ANY effort at all? What if we could just take the vast amount of data we have in our company, dump it on a disc and be able to find meaningful information AS IS? Yes, we do have the Google Search Appliance for that, or Microsoft Enterprise Search, but they merely index in a batch process and store the results in a cache. What if we want a real-time, extremely fast search that gives you on-the-spot information regardless of the data structures?
That’s exactly the gap Oracle wants to fill with their Exadata v2 server. Consider it a beast of a server, crammed with more than a terabyte of solid-state (flash) storage that holds your data in full. It’s fairly comparable to having ALL your data hot in memory, ready to be queried.
Can you grasp the opportunity this presents us? We can achieve real-time search inside our enterprise on vast amounts of unstructured data!
This is again an example that we shouldn’t limit ourselves by the current state of technology. In the past 10 years I’ve seen again and again that eventually, technology will catch up to make things possible that we couldn’t do before.
Most people will quote Einstein and say “imagination is more important than knowledge”; I’d rather tell you to be more like the kitteh on the right. I’m pretty sure that one day it will be able to fax itself to Burger King to get some delicious “cheezburgerz“.
Impact on Enterprise 2.0
So how does this relate to Enterprise 2.0, I hear you asking. Well, I argue that we shouldn’t waste too much time and effort meticulously categorising and tagging information and knowledge across all the different systems we have, because I believe it is almost a lost cause anyway: it doesn’t scale, and the results are often not of the quality you want.
My point is that the tools and technology at our disposal are advancing at such a rapid pace that we are slowly reaching the point where we can rely on software and hardware to do this boring heavy lifting for us.
I shared my view in a previous blog post that knowledge should be a happy by-product of your daily work; however, this creates such a vast amount of information that the overload becomes a problem in itself. Intelligent software that can make sense of it all, automatically tagging and categorising it, automatically making relationships and assumptions, would dramatically increase our efficiency.
At Headshift we’re doing a lot of research on helping companies stay ahead of the pack, and we’ve been looking at software solutions that manage your knowledge more intelligently and automatically, as described. It seems that we finally have the right hardware innovations to make it even juicier…
This post originally appeared on the Headshift blog.