A few weeks ago, on January 16, Divya Mehra and I delivered a webinar on using JBoss Data Grid to improve the scalability and performance of web applications. As expected, the webinar elicited a lot of questions, not all of which could be answered in the allotted time. Since they were really interesting, we are answering them here, also for the benefit of a larger audience.
In our application we have WebSphere running and use JSF + RichFaces and Spring with Oracle 11g. Can JBoss Data Grid be integrated even if we do not use JBoss EAP?
Yes it can. You can use it in either mode: library or remote client-server.
Can data grid operations participate in a distributed transaction? If so, does it ensure that the commit occurs throughout a cluster? Will all nodes be updated?
Yes, JBoss Data Grid supports distributed transactions in both library and remote client-server mode. The nodes can be enlisted as XA resources. Changes will be propagated across the cluster, but are guaranteed to be propagated to all nodes by the end of the transaction only if the replication mode is synchronous. If the replication mode is asynchronous, there are no such guarantees, although the changes will be eventually replicated.
What usage statistics are available from JDG (e.g. cache hits / misses)? Does a UI for viewing these come with the product or is it just JMX data that the client application has to process?
The data collected from monitoring the JBoss Data Grid activity is available as JMX MBeans. It can be visualized using the JBoss Operations Network, included in the product, or with any other JMX tools and clients (jconsole, VisualVM).
Eventual consistency will be available only in V7.0? What happens now if remote distributed JDG nodes cannot communicate and replicate an update? Would the client hang in that case?
Eventual consistency is a weaker consistency model than the strong consistency model currently implemented by JBoss Data Grid; it maximizes availability at the expense of consistency (refer to the CAP theorem for details). With the current strong consistency model, the behaviour of the client in the case of a communication failure largely depends on the replication mode chosen. With synchronous replication, the client may block until the operation times out (but never indefinitely). With asynchronous replication, the call returns immediately (even if replication to other nodes isn’t complete yet). Since JBoss Data Grid currently focuses on a strong consistency model, eventual consistency will effectively be a new feature of the framework, letting the framework itself deal with state that is inconsistent across the cluster (for example, a GET operation may involve examining all copies of the data and a quorum-based decision as to which is the correct one).
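To make the quorum idea more concrete, here is a minimal sketch of a majority vote over the copies of one entry. It is only an illustration of the general technique: the class `QuorumRead` and the method `resolve` are hypothetical names, not part of any JBoss Data Grid API, and a real implementation would likely compare versions rather than raw values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of a quorum-based read; not a JBoss Data Grid API. */
final class QuorumRead {

    /**
     * Returns the value held by a strict majority of the replica copies,
     * or an empty Optional if no value reaches a majority.
     * Assumption: copies are non-null and implement equals/hashCode.
     */
    static <V> Optional<V> resolve(List<V> replicaCopies) {
        // Count how many replicas report each distinct value.
        Map<V, Integer> votes = new HashMap<>();
        for (V copy : replicaCopies) {
            votes.merge(copy, 1, Integer::sum);
        }
        // A strict majority: more than half of all copies.
        int quorum = replicaCopies.size() / 2 + 1;
        for (Map.Entry<V, Integer> entry : votes.entrySet()) {
            if (entry.getValue() >= quorum) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Two of three replicas agree, so a majority exists.
        System.out.println(resolve(Arrays.asList("v2", "v2", "v1")));
        // One vote each: no majority, the read cannot be resolved.
        System.out.println(resolve(Arrays.asList("v1", "v2")));
    }
}
```

When no value reaches a majority, the sketch reports that explicitly instead of guessing, which is the availability-versus-consistency trade-off in miniature.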
If I use JBoss Data Grid in Library Mode, how do I add a node concretely?
In Library mode, each application creates its own node, embedded in the application. If multiple deployments of the application exist (for example in a cluster environment), the nodes will communicate with each other.
Is there any way to delete or clear data from the JBoss Data Grid without having to restart it?
Sure, there is. Data can be removed explicitly through remove() or clear() operations in the application, and their equivalents in the command-line interface (CLI) available in JBoss Data Grid 6.1. Eviction and expiration strategies can be set up as well, for removing unused or stale data from the cache.
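As a rough sketch of what explicit removal looks like from application code, the example below uses a plain `java.util.concurrent.ConcurrentHashMap` as a stand-in for the cache. That is an assumption made so the snippet runs without the product on the classpath; the class and method names are hypothetical, and on a real clustered cache the same map-style calls would also affect the other nodes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch: a ConcurrentHashMap stands in for the cache. */
final class CacheRemovalSketch {

    /**
     * Removes a single entry, then clears everything.
     * Returns the cache size after each of the two steps.
     */
    static int[] removeThenClear(ConcurrentMap<String, String> cache, String key) {
        cache.remove(key);              // drop one entry by key
        int afterRemove = cache.size();
        cache.clear();                  // drop all remaining entries
        return new int[] { afterRemove, cache.size() };
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        cache.put("user:1", "Alice");
        cache.put("user:2", "Bob");
        int[] sizes = removeThenClear(cache, "user:1");
        System.out.println("after remove: " + sizes[0] + ", after clear: " + sizes[1]);
    }
}
```

Neither call requires a restart; the data is gone as soon as the operation completes.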
Is there a strategy for distributing data in the grid and somehow coordinating parallel work on that distributed data?
JBoss Data Grid 6.1 will provide a distributed execution framework, which offers a flexible way of coordinating parallel work. Building on top of that, at a higher level, map/reduce capabilities will be available as well.
When you add additional data grid nodes, do you use distributed locking across nodes and, if so, what impact does that have on the performance of the data grid as more nodes are added?
JBoss Data Grid uses lazy remote locking by default, which reduces traffic. But generally speaking, in a data grid scenario, only a subset of the data is available on each node (and there is a limited number of replicas across the cluster), so the addition of new nodes does not necessarily result in an increase of the number of locks. It all depends on how many nodes the data is replicated to. The impact of distributed locking can be further reduced if the grouping API is used, to group keys commonly updated together within a single transaction.
But the demo performance data of 20% improvement is based on an in-memory cache without remote calls compared to remote database calls. It’s not a fair comparison. What about a remote cache compared to a remote database?
We will try to produce more data to illustrate other scenarios. It should be noted, however, that JBoss Data Grid is an in-memory data grid by design. Also, while remote calls add to the overhead, the biggest cost in the type of scenario we envisioned comes from the more expensive IO operations elicited by the database access, as well as the reduced concurrency when locking data in the database.
Does the JBoss Data Grid API have a way to ask for all of the keys (across the cluster) — sort of a master index?
In JBoss Data Grid 6.1, this will be possible through a map-reduce operation.
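Conceptually, collecting all keys this way has two steps: each node reports the keys it holds locally (map), and the per-node results are merged into one set (reduce). The sketch below shows that shape with plain Java collections. The class `AllKeysSketch`, the method `allKeys`, and the use of one in-memory `Map` per node are all assumptions for illustration; this is not the product's map/reduce API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a map/reduce-style key collection; not the JDG API. */
final class AllKeysSketch {

    /**
     * Each element of 'nodes' models the entries stored locally on one node.
     * Map phase: copy the local key set of every node.
     * Reduce phase: union the per-node sets (replicated keys collapse to one).
     */
    static <K> Set<K> allKeys(List<? extends Map<K, ?>> nodes) {
        return nodes.parallelStream()
                .map(node -> new HashSet<K>(node.keySet()))
                .reduce(new HashSet<K>(), (left, right) -> {
                    // Build a fresh set so the identity value is never mutated.
                    HashSet<K> union = new HashSet<K>(left);
                    union.addAll(right);
                    return union;
                });
    }

    public static void main(String[] args) {
        Map<String, Integer> nodeA = new HashMap<>();
        nodeA.put("a", 1);
        nodeA.put("b", 2);
        Map<String, Integer> nodeB = new HashMap<>();
        nodeB.put("b", 2); // "b" has a second copy on this node
        nodeB.put("c", 3);
        System.out.println(allKeys(Arrays.asList(nodeA, nodeB)));
    }
}
```

Because the result is a set, a key that is stored on several nodes appears only once in the merged index.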
I saw data being accessed using get/put. Do you use SQL or NoSQL to access data?
JBoss Data Grid uses a key-value pair model, so access to data is similar to working with a map. However, other methods of searching for data, such as index-based querying and map-reduce, will be available in JBoss Data Grid 6.1.
Does the data grid allow me to specify how many copies of a piece of data are created and distributed? If I have a 20 node cluster, can I request that it be replicated to 2 of the 20 for example?
Yes, the replication and distribution strategies are configurable.
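To illustrate what "2 copies out of 20 nodes" means, here is a deliberately simplified sketch that maps a key to a fixed number of owner nodes. It is not the hashing algorithm the product uses; `OwnerSelectionSketch`, `ownersFor`, and the "primary plus next neighbours" rule are assumptions chosen to keep the example short. In the underlying Infinispan configuration, the number of copies is, as far as I recall, controlled by the `numOwners` setting of distribution mode.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical owner selection; not the hashing scheme used by JBoss Data Grid. */
final class OwnerSelectionSketch {

    /**
     * Picks 'copies' distinct node indexes (0..clusterSize-1) for a key:
     * the primary owner is derived from the key's hash, and the backups
     * are the next nodes in index order, wrapping around at the end.
     */
    static List<Integer> ownersFor(Object key, int clusterSize, int copies) {
        if (copies < 1 || copies > clusterSize) {
            throw new IllegalArgumentException("copies must be between 1 and clusterSize");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int primary = Math.floorMod(key.hashCode(), clusterSize);
        List<Integer> owners = new ArrayList<>(copies);
        for (int i = 0; i < copies; i++) {
            owners.add((primary + i) % clusterSize);
        }
        return owners;
    }

    public static void main(String[] args) {
        // A 20-node cluster keeping 2 copies of each entry.
        System.out.println(ownersFor("user:42", 20, 2));
    }
}
```

The point of the sketch is that only the chosen owners hold the entry, so the other 18 nodes are not involved in storing or locking it.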