Scalability is unsolved problem #4.
A great deal of work has been done on database scalability. We’ve moved past Third Normal Form, and for giant problems we have Hadoop. But these only address scaling our data access. When we get large data sets with many dimensions, we start seeing huge numbers of combinations and operations, and we hit the wall on computational scalability.
It’s not hard to imagine data with seven dimensions. It’s also not hard to think of some combinations of probability testing that will require a lot of calculations if we use brute force.
In the top equation we have the cube of 7 factorial. That number is on the order of ten to the eleventh power. That’s 100 billion. That is a lot of operations. When I see numbers like that, I ask two questions.
First, how does it compare to the number of Hydrogen atoms in the universe?
That’s about 10 to the 80th power, or 100 quinvigintillion. It makes our 10^11 number look less awful. That’s our second equation. It sits on top of the whole-sky picture. Most of the trillions of stars you can see are made of hydrogen; it’s the most common element in the universe. Think of how long it would take to count those atoms; how many lifetimes. Asking how long it takes to do any computation at that scale is humbling.
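The arithmetic is easy to check. A few lines of Python (a sketch for illustration, not anything from our toolchain) compute the cube of 7 factorial and set it against the commonly cited ~10^80 hydrogen-atom estimate:

```python
import math

# Brute-force operation count from the first equation: the cube of 7 factorial.
ops = math.factorial(7) ** 3
print(f"(7!)^3 = {ops:,}")  # about 1.28 x 10^11, i.e. ~100 billion

# Hydrogen atoms in the observable universe, commonly estimated at ~10^80.
hydrogen = 10 ** 80
print(f"hydrogen estimate is ~10^{round(math.log10(hydrogen / ops))} times larger")
```

Even a hundred billion operations is tiny next to that yardstick, which is the point of the comparison.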
But, another question is “are we doing something worse?”
If we aren’t careful, we’ll start building power towers. The Power Tower is also known as tetration, superexponentiation, or the hyperpower.
We see one of those in the third equation. We say it this way, “the 3rd tetration of 7”.
It’s a big, big number of calculations. The number of hydrogen atoms in the universe is long gone. We are way past the universe of numbers with names. We don’t want this to happen to us.
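Just counting the digits of the 3rd tetration of 7 makes the point. This sketch uses logarithms rather than trying to print the number itself:

```python
import math

# The 3rd tetration of 7 is 7^(7^7). Rather than materialize it,
# count its decimal digits: floor(exponent * log10(7)) + 1.
exponent = 7 ** 7  # 823,543
digits = math.floor(exponent * math.log10(7)) + 1
print(f"7^(7^7) has {digits:,} digits")
```

For comparison, the ~10^80 hydrogen-atom estimate has only 81 digits; the tetration result has hundreds of thousands.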
Yet, this IS what often happens when we plunge into high-dimensional data sets. Even quantum computing is not going to solve this class of problem. And most big data sets have high-dimensional attributes; it is easy to fall into crunching at this scale. It’s called the curse of dimensionality.
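A textbook way to see how fast the curse of dimensionality bites (this grid illustration is a standard one, not specific to our work): covering each axis with just 10 bins requires 10^d cells, so the grid outgrows any realistic data set almost immediately.

```python
# Cells needed to grid a d-dimensional space at a coarse 10 bins per axis.
for d in (1, 3, 7, 20, 80):
    print(f"{d:3d} dimensions -> 10^{d} cells")
```

At seven dimensions you already need ten million cells; by eighty you have matched the hydrogen-atom count with nothing but an empty grid.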
So, what do we do?
Well, most practitioners use dimensionality reduction in some form. When we do that, we lose all sorts of information. Clustering, PCA… all of these are lossy procedures, but we don’t have a good alternative. Otherwise, many forms of analytics… naive Bayes, random forests, big neural nets… all end up facing the 10^80 comparison at some point unless we intervene with some trick like dimensionality reduction.
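To make the “lossy” claim concrete, here is a minimal PCA sketch (built on NumPy’s SVD; the random data and the choice of two components are made up for the example). It projects seven-dimensional data down to two components and measures what the round trip fails to recover:

```python
import numpy as np

# Illustrative data: 100 points in 7 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
Xc = X - X.mean(axis=0)  # center before PCA

# PCA via SVD: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X2 = Xc @ Vt[:k].T   # project onto the top two components
Xr = X2 @ Vt[:k]     # map back into the original 7 dimensions

# Fraction of the data's structure the 2-D projection cannot reproduce.
loss = np.linalg.norm(Xc - Xr) / np.linalg.norm(Xc)
print(f"fraction of structure lost: {loss:.2f}")
```

The projection is cheap to compute and to work with, but the nonzero reconstruction error is exactly the information we gave up to escape the combinatorial explosion.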
At Lone Star, we have some techniques for reducing transmission bandwidth, computational intensity and other scalability problems. We believe our methods are breakthroughs, but no one has completely solved this problem.
Dealing with computational scalability without information loss is our fourth unsolved problem.