The clustered implementation with local indexes is built upon the same strategy: a volatile in-memory index buffer with delayed flushing to persistent storage.
As this implementation is designed for a clustered environment, it has additional mechanisms for data delivery within the cluster. The actual text extraction jobs are done on the same node that performs the content operation (for example, a write operation). Prepared "documents" (a Lucene term meaning a block of data ready for indexing) are replicated to the cluster nodes and processed by their local indexes, so each cluster instance has the same index content. When a new node joins the cluster, it has no initial index, so one must be created. There are several supported ways of doing this. The simplest is to copy the index manually, although this is not the intended approach. If no initial index is found, JCR uses the automated scenarios. They are controlled via configuration (see the index-recovery-mode parameter) and offer either full re-indexing from the database or copying the index from another cluster node.
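The startup decision described above can be sketched as follows. This is only an illustration under stated assumptions: the class, enum and method names are hypothetical and are not part of the JCR API; only the two mode values "from-indexing" and "from-coordinator" come from this document.

```java
import java.util.Locale;

public class IndexBootstrapSketch {

    /** Hypothetical mirror of the two index-recovery-mode values. */
    enum RecoveryMode {
        FROM_INDEXING, FROM_COORDINATOR;

        /** Maps a configuration value such as "from-coordinator" to a constant. */
        static RecoveryMode parse(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
        }
    }

    /** Hypothetical outcomes for a node that is starting up. */
    enum Action { USE_EXISTING_INDEX, REINDEX_FROM_DATABASE, COPY_FROM_COORDINATOR }

    /** Chooses what to do on startup, assuming only these two inputs matter. */
    static Action onStartup(boolean localIndexExists, RecoveryMode mode) {
        if (localIndexExists) {
            return Action.USE_EXISTING_INDEX;
        }
        // No initial index: fall back to one of the automated scenarios.
        return mode == RecoveryMode.FROM_INDEXING
                ? Action.REINDEX_FROM_DATABASE
                : Action.COPY_FROM_COORDINATOR;
    }

    public static void main(String[] args) {
        RecoveryMode mode = RecoveryMode.parse("from-coordinator");
        System.out.println(onStartup(false, mode)); // prints COPY_FROM_COORDINATOR
    }
}
```

The sketch ignores the case of an existing but outdated index, which is what the recovery filters described later address.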
To use the cluster-ready strategy based on local indexes, where each node has its own copy of the index on its local file system,
the following configuration must be applied.
Indexing directory must point to any folder on local file system and "
must be set to
<property name="index-dir" value="/mnt/nfs_drive/index/db1/ws" />
<property name="jbosscache-configuration" value="jbosscache-indexer.xml" />
<property name="jgroups-configuration" value="udp-mux.xml" />
<property name="jgroups-multiplexer-stack" value="true" />
<property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" />
<property name="max-volatile-time" value="60" />
<property name="rdbms-reindexing" value="true" />
<property name="reindexing-page-size" value="1000" />
<property name="index-recovery-mode" value="from-coordinator" />
A common use case for all cluster-ready applications is hot joining and leaving of processing units. All nodes that join the cluster for the first time or after some downtime must be brought into a synchronized state.
When dealing with shared value storages, databases and indexes, cluster nodes are synchronized at all times. However, synchronization is an issue when the local index strategy is used. If a new node joins the cluster with no index, the index is retrieved or recreated. A node can also be restarted, in which case its index is not empty. An existing index is usually assumed to be up to date, but it can be outdated.
JCR offers a mechanism called RecoveryFilters that automatically retrieves the index for a joining node on startup. This feature is a set of filters that can be defined via QueryHandler configuration.
The number of filters is not limited, so they can be combined.
If any one of them fires, the index is re-synchronized. Please take into account that DocNumberRecoveryFilter is used when no filter is configured. So, if resynchronization should be blocked or strictly required on start, ConfigurationPropertyRecoveryFilter can be used.
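The combination rule can be illustrated with a minimal sketch. The RecoveryFilter interface, its accept() method and the needsResync helper below are hypothetical stand-ins written for this example and are not the actual JCR types; the only behavior taken from the text is "any filter firing triggers re-synchronization" and "a default filter applies when none is configured".

```java
import java.util.List;

public class RecoveryFilterChainSketch {

    /** Hypothetical stand-in: returns true when the index must be re-synchronized. */
    interface RecoveryFilter {
        boolean accept();
    }

    /**
     * Returns true if at least one effective filter fires.
     * When nothing is configured, only the default filter is consulted.
     */
    static boolean needsResync(List<RecoveryFilter> configured, RecoveryFilter defaultFilter) {
        List<RecoveryFilter> effective = configured.isEmpty() ? List.of(defaultFilter) : configured;
        for (RecoveryFilter filter : effective) {
            if (filter.accept()) {
                return true; // one firing filter is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RecoveryFilter never = () -> false;
        RecoveryFilter always = () -> true;
        RecoveryFilter defaultFilter = () -> true; // pretend the document counts differ

        System.out.println(needsResync(List.of(never, always), defaultFilter)); // prints true
        System.out.println(needsResync(List.of(never), defaultFilter));         // prints false
        System.out.println(needsResync(List.of(), defaultFilter));              // prints true
    }
}
```

The second call shows why configuring a single filter that returns false can be used to block resynchronization: once any filter is configured, the default is no longer consulted in this sketch.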
This feature uses the standard index recovery mode defined by the previously described index-recovery-mode parameter, which can be "from-indexing" or "from-coordinator" (default value).
<property name="index-recovery-mode" value="from-coordinator" />
There are several filter implementations:
A filter that always returns true, for cases when the index must be forcibly resynchronized (recovered) each time;
A filter that returns the value of the system property "org.exoplatform.jcr.recoveryfilter.forcereindexing", so index recovery can be controlled from the top using system properties, without changing the configuration;
ConfigurationPropertyRecoveryFilter returns the value of the QueryHandler configuration property "index-recovery-filter-forcereindexing", so index recovery can be controlled from the configuration separately for each workspace.
<property name="index-recovery-filter-forcereindexing" value="true" />
DocNumberRecoveryFilter compares the number of documents in the index on the coordinator side and on the local side, and returns true if they differ. The advantage of this filter over the others is that it skips re-indexing for workspaces whose index was not modified. For example, suppose there are 10 repositories with 3 workspaces each, and only one is heavily used in the cluster: frontend/production. With this filter, only the workspaces that have really changed are re-indexed, without affecting the other indexes, which greatly reduces startup time.