GemFire SQLFabric - "NoSQL database" scalability using SQL
presented by Jags Ramnarayan and Gideon Low
Jags started by proposing that ACID semantics, which among other things aim to ensure that every change to a database is safely written to disk, limit the performance a database can achieve because they make it essentially I/O-bound. He also suggested that the desire for these semantics is rooted in a history when storage hardware and networks were slow and unreliable. Though he didn't say it explicitly, I think he was suggesting that this is no longer the case, with the implication being that we no longer need ACID.
He outlined how the "NoSQL" movement actually has little to do with SQL itself - the query language - and everything to do with the data structures - relational tables - that have typically been the target of such queries. His point: you don't have to reinvent data querying just because you're reinventing the data structure.
That, of course, led into the design of GemFire SQLFabric, the data management technology that came to SpringSource/VMware with the GemStone acquisition in May 2010. Jags said that, from an application developer's perspective, using GemFire SQLFabric is mostly identical to using other databases, just with a different JDBC URL and some custom DDL extensions.
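If I understood that claim correctly, switching an application over is largely a matter of pointing the JDBC driver at the cluster instead of at a conventional database. I didn't note down the actual driver class or URL scheme SQLFabric uses, so the second line below is purely my illustration of the idea, not their real syntax:

```
jdbc:derby://localhost:1527/myDB    (a stock Derby network-client URL)
jdbc:sqlfabric://localhost:1527/    (hypothetical SQLFabric-style URL: same application code, different driver)
```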
I didn't jot down Jags' description of GemFire, but here is the spiel from the front page:
GemFire Enterprise is in-memory distributed data management platform that pools memory (and CPU, network and optionally local disk) across multiple processes to manage application objects and behavior
Jags outlined a bit of the structure of a GemFire cluster and then said that, because of the structure, it didn't give great performance for key-based access, or for joins, which left me wondering what it was good for! (Edit: Turns out I misheard him. The performance is better for joins compared to other object data grids.) I think he clarified the join part later, though, when he discussed that data needing to be joined in a query must be co-located on a single node.
The GemFire table creation DDL provides extensions for specifying how a table is to be replicated or partitioned, and how much redundancy should back the partitions. These extensions allow the DBA to ensure that data to be queried together is co-located.
If no partitioning or replication options are specified in the table DDL, GemFire will make decisions about default options for these based on the relationships (i.e. foreign keys) apparent in the table definition.
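To make that concrete, here is a sketch of what such DDL might look like. I'm reconstructing the clause names (PARTITION BY, COLOCATE WITH, REDUNDANCY, REPLICATE) from GemStone's documentation for this product line rather than from the talk, so treat the exact syntax as an assumption:

```sql
-- Partitioned table, with one redundant copy of each partition.
CREATE TABLE customers (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
) PARTITION BY PRIMARY KEY
  REDUNDANCY 1;

-- Orders are partitioned on customer_id and co-located with customers,
-- so a customer and all of their orders live on the same node.
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    amount      DECIMAL(10,2),
    FOREIGN KEY (customer_id) REFERENCES customers (id)
) PARTITION BY COLUMN (customer_id)
  COLOCATE WITH (customers)
  REDUNDANCY 1;

-- Small, rarely-changing reference data is simply replicated to every node.
CREATE TABLE countries (
    code CHAR(2) PRIMARY KEY,
    name VARCHAR(50)
) REPLICATE;
```

If the partitioning clauses were left off the orders table, the foreign key to customers is the kind of relationship GemFire would reportedly use to pick its default strategy.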
He said that GemFire uses the JDBC drivers and query planning and optimisation code from Apache Derby.
While talking about joins, Jags mentioned that co-location of joined data is required in order to achieve linear scaling. He said that this co-location requirement is currently a restriction of GemFire, implying that they intend to remove it, though he didn't mention whether they would also tackle the linear-scaling problem when they do.
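In other words, a join like the one below (over hypothetical customers and orders tables, both partitioned on the customer id and declared co-located) can run as independent node-local joins; without co-location, rows would have to be shuffled between nodes, and the linear scaling is lost:

```sql
-- Each node joins only its own slice of the data, then the
-- partial results are combined - no cross-node row movement.
SELECT c.name, SUM(o.amount)
FROM   customers c
JOIN   orders    o ON o.customer_id = c.id
GROUP  BY c.name;
```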
He talked about the way in which many of the design decisions they make in GemFire are focussed on making the absolute minimum number of disk seeks. I think that's hard-core stuff! I've been coding commercially for over ten years now and I've never once thought about how many disk seeks my code is causing.
Gideon showed some of the networking that occurs to make GemFire work and discussed how there is a central component called a 'Locator' that the cluster nodes use to find each other and which also performs load balancing of client requests. Strangely, this seemed like a classic single-point-of-failure to me, but there was no discussion about that problem.
I came away not really being sure what GemFire could be used for. Jags' comments about ACID at the start seemed to suggest that he thinks we are over-obsessed with reliability in the modern age. However, in my finance day job, we need to be 100% certain that pretty much every piece of data we push to the databases is stored for good when the transaction ends. Even 99.9999% (yes, six nines!) is not good enough: if 1 in every million DB transactions goes missing, money will go missing and we'll have a very angry customer on the phone. Unfortunately, the talk didn't cover how (or whether) GemFire handles reliability requirements like these.
Having said all that, however, I noticed that GemStone have an essay on their site called "The Hardest Problems in Data Management", in which they discuss the demanding needs of financial applications and suggest that while the popular "eventually consistent" distributed databases do not measure up to these demands, their implementation does. Having read a few paragraphs, they certainly seem to know what they're talking about from a theory perspective. If you're seriously looking for a solution in this space, I would suggest you have a good read of their documentation rather than just relying on my scratchy notes here.