Gigaspaces XAP training

Vaibhav Hajela
May 22, 2022

Day 1

Agenda

  • Initial introduction
  • Scope of Training
  • Pre-test
  • Kickoff

Let's understand the need for GigaSpaces XAP.

Why?

  • Challenges
    - Working with DB
    - Latency
    - Resilience
    - Messaging
    - Cross Platform Interaction

What?

Cross language platform

  • Java
  • C# .NET
  • C++

High Availability and Fault tolerance

Async Persistence

Multisite replication

Data Grid Architecture

The GigaSpaces data grid is built from the following sub-systems:

Open Interfacing Layer

Supports any language, any platform, any API — Achieve interoperability, easy migration, reduced learning curves, and faster time to market by leveraging existing assets — such as code and programming expertise — through:

  • Standard API Support: Memcached, SQL, JPA, Spring, REST, a standard Map API and more.
  • Multi-language Interoperability: Java, .NET, and C++
  • Multi-platform Support: Any OS, physical or virtual
  • API Mashup: Easily leverage modern APIs alongside existing standard APIs — enables you to use the right tool for the job at hand.

OpenSpaces

OpenSpaces is the GigaSpaces native programming API. It is an open-source Spring-based application interface designed to make Space-based development easy, reliable, and scalable. In addition, the programming model is non-intrusive, based on a simple POJO programming model and a clean integration point with other development frameworks.

The OpenSpaces API is divided into four parts:

  • Core API
  • Messaging and Events
  • Space-Based Remoting
  • Integrations

Core API

The core package of OpenSpaces provides APIs for direct access to a data grid, internally referred to as a “Space.” The main interface is GigaSpace, which enables basic interaction with the data grid. The core components include basic infrastructure support such as Space construction, a simplified API using the GigaSpace interface, transaction management, and declarative transaction support. Core components also include support for Map/Cache construction and a simplified API using GigaMap.
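To make the core API concrete, here is a minimal sketch in Java of a space class plus a bootstrap class. The class names (Greeting, CoreApiDemo) and the space name demoSpace are placeholders, the two classes are assumed to live in separate source files, and the sketch assumes a reasonably recent XAP release where EmbeddedSpaceConfigurer is available (older versions used UrlSpaceConfigurer instead).

// Greeting.java
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;

@SpaceClass
public class Greeting {
    private String id;
    private String message;

    public Greeting() {
    }

    public Greeting(String id, String message) {
        this.id = id;
        this.message = message;
    }

    @SpaceId(autoGenerate = false)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getMessage() { return message; }
    public void setMessage(String message) { this.message = message; }
}

// CoreApiDemo.java
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.EmbeddedSpaceConfigurer;

public class CoreApiDemo {
    public static void main(String[] args) {
        // Start an embedded space instance in this JVM and wrap it with the GigaSpace API
        GigaSpace gigaSpace = new GigaSpaceConfigurer(new EmbeddedSpaceConfigurer("demoSpace")).gigaSpace();

        gigaSpace.write(new Greeting("1", "hello data grid"));      // insert or update
        Greeting stored = gigaSpace.readById(Greeting.class, "1");  // read by id
        System.out.println(stored.getMessage());
    }
}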

Events

The events package is built on top of the core package, and provides simple object-based event processing components through the event containers, making it roughly equivalent to Java EE’s message-driven beans. (The primary differences outside of API semantics are in the power of OpenSpaces’ selection criteria and routing.) The event package enables simple construction of event-driven applications.

Another alternative for events is JMS 1.1 on top of GigaSpaces, which is supported within the product and is recommended for client applications integrating with SBA applications.

The events module includes components for simplified EDA/Service Bus development. These components allow unified event handling and provide two mechanisms for event generation: a Polling Container, which performs polling receive operations against the Space, and a Notify Container, which uses the Space's built-in notification support.
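As an illustration, the following sketch shows an annotation-driven polling container in Java. The Order class and its processed flag are hypothetical, and the bean is assumed to be wired into the Processing Unit's Spring configuration with event annotation support enabled.

import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.polling.Polling;

@EventDriven
@Polling
public class OrderProcessor {

    // Template describing which objects the container should take from the Space:
    // here, orders that have not been processed yet.
    @EventTemplate
    public Order unprocessedOrders() {
        Order template = new Order();
        template.setProcessed(false);
        return template;
    }

    // Invoked for each matching object taken from the Space.
    // The returned object is written back to the Space.
    @SpaceDataEvent
    public Order process(Order order) {
        order.setProcessed(true);
        return order;
    }
}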

Space Based Remoting

The Remoting package provides capabilities for clients to access remote services. Remoting in GigaSpaces is implemented on top of the data grid's clustering model, which provides location transparency, fail-over, and performance to remote service invocations. GigaSpaces implements remoting using the Space as the transport layer, similar to the Spring remoting components.

Remoting can be viewed as the alternative to Java EE Session Beans, or Java RMI, as it provides all of their capabilities as well as supporting synchronous and asynchronous invocations, and dynamic scripting languages — enabling you to use Groovy or Ruby in your space-based applications.

Elastic Application Container

GigaSpaces is an end-to-end-scalable execution environment with “elastic deployment” to meet extreme throughput requirements.

  • Linear scalability: Elastically deployed/provisioned to cope with extreme demand/throughput during application runtime with no human intervention
  • Flexibility: Running a variety of application module types, from simple web modules to complex event processing modules
  • Simpler to move into production:
  • Smooth, risk-free deployment through identical development and production environments
  • Faster deployment through eliminating silos
  • Continuous deployment with no downtime

Lightweight application containers provide a business logic execution environment at the node level. They also translate SBA semantics and services to the relevant container development framework implementation. For example, space transactions are translated to Spring transactions, when a Spring lightweight container is used.

The Grid Service Container (GSC) is responsible for providing Grid capabilities, whereas the lightweight container implementation is responsible for the single VM level. This architecture is very powerful, as it enables applications to take advantage of familiar programming models and services at the single VM level, while also providing grid capabilities and services.

GigaSpaces provides several default implementations as part of the product, and an additional plugin API, to enable other technology integrations.

More information on the usage of the above integrations can be found in the Developer Guide.

Unified In-Memory Services

Data access, messaging, and parallel processing services that speed up your application's performance.

  • In-memory speed: Delivering unmatched performance by removing all physical I/O bottlenecks from the runtime flow
  • Scalability: Intelligently distribute any data and messaging load across all available resources
  • Capacity: Support terabytes of application data
  • High Availability: Built-in hot backup and self-healing capabilities for zero downtime
  • Consistency: Maintain data integrity with 100% transactional data handling

As an application platform, GigaSpaces provides integrated, memory-based runtime capabilities. The core of these capabilities is backed by the Space technology.

The core middleware capabilities are:

In-Memory Data Grid

An In-Memory Data Grid (IMDG) is a way of storing data across a grid of memory nodes. This service provides the application with:

  1. Data storage capabilities.
  2. Data query capabilities — single object, multiple object and aggregated complex queries.
  3. Caching semantics — the ability to retrieve information from within-memory data structures.
  4. Ability to execute business logic within the data — similar to database stored procedure capabilities.

It is important to note that the data grid, although a memory-based service, is fully transactional, and follows the ACID (Atomicity, Consistency, Isolation and Durability) transactional rules.

The data grid uses the unified clustering layer, to provide a highly available and reliable service.

The main API to access the data grid service is the GigaSpace interface. In addition, one can use the Map API (using the GigaMap interface) to access the data grid. Please refer to the Developer Guide for usage examples.
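For example, a client can query the data grid with the SQLQuery API. This is a minimal sketch: the Order space class and the gigaSpace proxy are assumed to exist already, and readMultiple/takeMultiple are the standard GigaSpace bulk operations.

import com.j_spaces.core.client.SQLQuery;
import org.openspaces.core.GigaSpace;

public class OrderQueries {

    // Read up to 100 new orders above a given amount, then remove them from the Space.
    public static void queryAndDrain(GigaSpace gigaSpace) {
        SQLQuery<Order> query = new SQLQuery<Order>(Order.class, "amount > ? AND status = ?");
        query.setParameters(100.0, "NEW");

        Order[] matches = gigaSpace.readMultiple(query, 100); // select, non-destructive
        System.out.println("Matching orders: " + matches.length);

        Order[] removed = gigaSpace.takeMultiple(query, 100); // destructive read (delete and return)
        System.out.println("Removed orders: " + removed.length);
    }
}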

Messaging Grid

The messaging grid aspect of the Space provides messaging capabilities such as:

  1. Event-Driven capabilities — the ability to build event-driven processing applications. This model enables fast (in-memory-based) asynchronous modular processing, resulting in a very efficient and scalable processing paradigm.
  2. Asynchronous production and consumption of information.
  3. One-to-one, Many-to-One, One-to-Many and Many-to-Many relationships.
  4. FIFO ordering.
  5. Transaction Management.

The core APIs used for messaging are the OpenSpaces Notify and Polling Containers. In addition, a JMS 1.1 implementation API is available to be used with existing JMS based applications. More information can be found in the Messaging and Events section.

Processing Services

Processing services include parallel processing capabilities.

Parallel Processing

Sometimes the scalability bottleneck is within the processing capabilities. This means that more processing power is needed and work must be executed concurrently; in other words, parallel processing is required. When there is no state involved, it is common to spawn many processes on multiple machines, and to assign a different task to each process.

However, the problem becomes much more complex when the tasks for execution require sharing of information. GigaSpaces has built-in services for parallel processing. The master/worker pattern is used, where one process serves as the master and writes objects into the space, and many worker services each take work for execution and share the results. The workers then request a new piece of work, and so on. This pattern is important in practice, since it automatically balances the load.

Compute Grid

The compute grid is a mechanism that allows you to run user code on all/some nodes of the grid, so that the code can run locally with the data.

Compute grids are an efficient solution when a computation requires a large data set to be processed, so that moving the code to where the data is, is much more efficient than moving the data to where the code is.

The efficiency derives from the fact that the processing task is sent to all the desired grid nodes concurrently. A partial result is calculated using the data on that particular node, and then sent back to the client, where all the partial results are reduced to a final result.

The process is widely known as map/reduce, and is used extensively by companies like Google whenever a large data set needs to be processed in a short amount of time.
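A rough map/reduce sketch using the OpenSpaces executor API is shown below. It assumes a hypothetical Order space class; the DistributedTask is broadcast to all primary partitions, each partition counts its local orders collocated with the data, and the partial counts are reduced to a total on the client.

import com.gigaspaces.async.AsyncResult;
import org.openspaces.core.GigaSpace;
import org.openspaces.core.executor.DistributedTask;
import org.openspaces.core.executor.TaskGigaSpace;

import java.util.List;

// Counts Order objects on every partition and reduces the partial counts to a total.
public class CountOrdersTask implements DistributedTask<Integer, Integer> {

    @TaskGigaSpace                       // collocated proxy injected inside each partition
    private transient GigaSpace gigaSpace;

    @Override
    public Integer execute() throws Exception {
        // Runs collocated with the data of a single partition
        return gigaSpace.count(new Order());
    }

    @Override
    public Integer reduce(List<AsyncResult<Integer>> results) throws Exception {
        int total = 0;
        for (AsyncResult<Integer> partial : results) {
            if (partial.getException() != null) {
                throw partial.getException();
            }
            total += partial.getResult();
        }
        return total;
    }
}

// Client side usage:
//   AsyncFuture<Integer> future = gigaSpace.execute(new CountOrdersTask());
//   int totalOrders = future.get();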

Web Container

GigaSpaces can host your Java web modules so your application is entirely managed and scaled on a single platform, providing load balancing and extreme throughput, and ensuring end to end scalability.

GigaSpaces allows you to deploy web applications (packaged as WAR files) onto the GSC. This support provides:

  • Dynamic allocation of several instances of a web application (probably fronted by a load balancer).
  • Management of the instances running (if a GSC fails, the web application instances running on it will be instantiated on a different GSC).
  • SLA-monitor-based dynamic allocation and de-allocation of web application instances.

The deployed WAR is a pure Java EE-based web application. The application can be a completely generic web application, and still automatically make use of the Service Grid features. The web application can easily define a Space (either embedded or remote), with or without Spring.

Service Grid Layer

The basic unit of deployment in the data grid is the Processing Unit.

Once packaged, a processing unit is deployed onto the GigaSpaces runtime environment, which is called the Service Grid. It is responsible for materializing the processing unit’s configuration, provisioning its instances to the runtime infrastructure and making sure they continue to run properly over time.

When developing your processing unit, you can run and debug the processing unit within your IDE. You will typically deploy it to the GigaSpaces runtime environment when it's ready for production, or when you want to run it in the real-life runtime environment.

Architecture

The service grid is composed of a number of components:

Core Components

A processing unit can be deployed to the Service Grid using one of the GigaSpaces deployment tools (UI, CLI, API), which uploads it to the GSM (Grid Service Manager), the component which manages the deployment and life cycle of the processing unit. The GSM analyzes the deployment descriptor and determines how many instances of the processing unit should be created, and which containers should run them. It then ships the processing unit code to the running GSCs (Grid Service Containers) and instructs them to instantiate the processing unit instances. The GSC provides an isolated runtime for the processing unit instance, and exposes its state to the GSM for monitoring. This phase in the deployment process is called provisioning.

Once provisioned, the GSM continuously monitors the processing unit instances to determine if they’re functioning properly or not. When a certain instance fails, the GSM identifies that and re-provisions the failed instance on to another GSC, thus enforcing the processing unit’s SLA.

In order to discover one another in the network, the GSCs and GSMs use a Lookup Service, also called LUS. Each GSM and GSC registers itself in the LUS, and monitors the LUS to discover other GSM and GSC instances.

Finally, the GSA Grid Service Agent component is used to start and manage the other components of the Service Grid (i.e. GSC, GSM, LUS). Typically, the GSA is started with the hosting machine’s startup. Using the agent, you can bootstrap the entire cluster very easily, and start and stop additional GSCs, GSMs and lookup services at will.

All of the above components are fully manageable from the GigaSpaces management interfaces such as the UI, CLI and Admin API.
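For instance, the Admin API can be used to inspect these components programmatically. This is a rough sketch, assuming the standard org.openspaces.admin package; the lookup group name my-group is an illustrative placeholder.

import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.gsc.GridServiceContainer;

public class GridInfo {
    public static void main(String[] args) {
        // Discover the service grid components registered with the given lookup group.
        Admin admin = new AdminFactory().addGroup("my-group").createAdmin();
        try {
            admin.getGridServiceAgents().waitForAtLeastOne();
            System.out.println("GSMs: " + admin.getGridServiceManagers().getSize());
            System.out.println("LUSs: " + admin.getLookupServices().getSize());
            for (GridServiceContainer gsc : admin.getGridServiceContainers()) {
                System.out.println("GSC on " + gsc.getMachine().getHostName()
                        + " hosts " + gsc.getProcessingUnitInstances().length + " PU instance(s)");
            }
        } finally {
            admin.close();
        }
    }
}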

Grid Service Manager (GSM)

The Grid Service Manager is the component which manages the deployment and life cycle of the processing unit.

When a processing unit is uploaded to the GSM (using one of GigaSpaces deployment tools: UI, CLI, API), the GSM analyzes the deployment descriptor and determines how many instances of the processing unit should be created, and which containers should host them. It then ships the processing unit code to the relevant containers and instructs them to instantiate the processing unit instances. This phase in the deployment process is called provisioning.

Once provisioned, the GSM continuously monitors the processing unit instances to determine if they’re functioning properly or not. When a certain instance fails, the GSM identifies that and re-provisions the failed instance on to another GSC, thus enforcing the processing unit’s SLA.

It is common to start two GSM instances in each Service Grid for high-availability reasons: at any given point in time, each deployed processing unit is managed by one GSM instance, and the other GSM(s) serve as its hot standby. If the active GSM fails for some reason, one of the standbys automatically takes over and starts managing and monitoring the processing units that the failed GSM managed.

Grid Service Container (GSC)

The Grid Service Container provides an isolated runtime for one (or more) processing unit instance, and exposes its state to the GSM.

The GSC can be perceived as a node on the grid, which is controlled by the Grid Service Manager. The GSM issues deployment and undeployment commands for Processing Unit instances to the GSC, and the GSC reports its status back to the GSM.

The GSC can host multiple processing unit instances simultaneously. The processing unit instances are isolated from each other using separate Class loaders (in java) or AppDomains (in .NET).

It is common to start several GSCs on the same physical machine, depending on the machine's CPU and memory resources. The deployment of multiple GSCs on a single machine or across multiple machines creates a virtual Service Grid: the GSCs provide a layer of abstraction on top of the physical layer of machines. This concept enables deployment of clusters on various deployment topologies of enterprise data centers and public clouds.

The Lookup Service (LUS)

The Lookup Service provides a mechanism for services to discover each other. Each service can query the Lookup service for other services, and register itself in the Lookup Service so other services may find it. For example, the GSM queries the LUS to find active GSCs.

Note that the Lookup service is primarily used for establishing the initial connection — once service X discovers service Y via the Lookup Service, it usually creates a direct connection to it without further involvement of the Lookup Service.

Service registrations in the LUS are lease-based, and each service periodically renews its lease. That way, if a service hangs or disconnects from the LUS, its registration will be cancelled when the lease expires.

The Lookup Service can be configured for either a multicast or unicast environment (default is multicast).

Another important attribute in that context is the lookup group. The lookup group is a logical grouping of all the components that belong to the same runtime cluster. Using lookup groups, you can run multiple deployments on the same physical infrastructure, without them interfering with one another. For more details please refer to Lookup Service Configuration.
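As an illustration, a client can restrict discovery to a particular lookup group (multicast) or to explicit lookup locators (unicast) when building its space proxy. This is a sketch; the group name, host names and space name are placeholders, and 4174 is the default XAP lookup discovery port.

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.SpaceProxyConfigurer;

public class DiscoveryConfigDemo {
    public static void main(String[] args) {
        // Multicast environment: discover only the components registered with this lookup group.
        GigaSpace byGroup = new GigaSpaceConfigurer(
                new SpaceProxyConfigurer("myDataGrid").lookupGroups("prod-cluster")).gigaSpace();

        // Unicast environment: point the client at explicit lookup service hosts.
        GigaSpace byLocators = new GigaSpaceConfigurer(
                new SpaceProxyConfigurer("myDataGrid").lookupLocators("lookup-host1:4174,lookup-host2:4174")).gigaSpace();
    }
}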

It is common to start at least two LUS instances in each Service Grid for high-availability reasons. Note that the lookup service can run in the same process with a GSM, or in standalone mode using its own process.

The following services use the LUS:

  • GigaSpaces Manager
  • GigaSpaces Agent
  • Processing Unit Instances (actual instances of a deployed Processing Unit)
  • Space Instances (actual instances of a Space that form a topology)

For advanced information on the lookup service architecture, refer to The Lookup Service.

Grid Service Agent (GSA)

The Grid Service Agent (GSA) is a process manager that can spawn and manage Service Grid processes (Operating System level processes) such as The Grid Service Manager, The Grid Service Container, and The Lookup Service. Typically, the GSA is started with the hosting machine’s startup. Using the agent, you can bootstrap the entire cluster very easily, and start and stop additional GSCs, GSMs and lookup services at will.

Usually, a single GSA is run per machine. If you’re setting up multiple Service Grids separated by Lookup Groups or Locators, you’ll probably start a GSA per machine per group.

The GSA exposes the ability to start, restart, and kill a process either using the Administration API or the GigaSpaces Management Center.

Process Management

The GSA manages Operating System processes. There are two types of process management, local and global.

Local processes simply start the process type (for example, a Grid Service Container) without taking into account any other process types run by different GSAs.

Global processes take into account the number of process types (a Grid Service Manager, for example) that are currently run by other GSAs (within the same lookup groups or lookup locators). The GSA will automatically try to run at least X instances of the process type across all the different GSAs (with a maximum of one instance of that process type per GSA). If a GSA running a globally managed process type fails, another GSA will identify the failure and start it, in order to maintain at least X instances of the global process type.

Optional Components

  • The Apache Load Balancer Agent is used when deploying web applications.
  • The Transaction Manager (TXM) is an optional component. When executing transactions that span multiple space partitions, you should use the Jini Transaction Manager or the Distributed Transaction Manager. See the Transaction Management section for details.

In-Memory Data Grid Layer

Architecture Perspectives

This section describes the In-Memory Data Grid (IMDG) architecture, to provide a comprehensive understanding of its functionality, behavior, and accessibility.

  • GigaSpaces Component Perspective — explains the key capabilities of the data grid, namely the Open Spaces framework, the space-based core middleware and the middleware facilities it provides, and the SLA-Driven Container.
  • Runtime Perspective — explains how the GigaSpaces components execute and interact in runtime on multiple physical machines.
  • SOA/EDA Perspective — explains how the data grid and the Space-Based Architecture are actually a special case of SOA/EDA, and can be used to implement a Service Oriented Architecture that supports high-performance, stateful services.
  • Remote Client Perspective — explains how the data grid is viewed and accessed by remote clients, whether they are running inside Processing Units or as independent POJO services.

For a general explanation of the data grid architecture, refer to Data Grid Architecture.

Data Grid Component Perspective

The following diagram shows a component view of GigaSpaces. The main components are described in more detail below.

OpenSpaces

OpenSpaces is the primary framework for developing applications in GigaSpaces. OpenSpaces uses Spring as a POJO-driven development infrastructure, and adds runtime and development components for developing POJO-driven EDA/SOA-based applications, and scaling them out simply across a pool of machines, without dependency on a J2EE container.

To achieve these goals, OpenSpaces adds the following components to the Spring development environment:

  • Processing Unit — the core unit of work. Encapsulates the middleware together with the business logic in a single unit of scaling and failover.
  • SLA-Driven Container — a lightweight container that enables dynamic deployment of Processing Units over a pool of machines, based on machine availability, CPU utilization, and other hardware and software criteria.
  • In-Memory Data Grid — provides in-memory distributed data storage.
  • Declarative Event Containers — for triggering events from the space into POJOs in pull or push mode.
  • Remoting — utilizes the space as the underlying transport for invoking remote methods on the POJO services inside the Processing Unit. This approach allows the client to invoke methods on a service even if it changes physical location, and enables re-routing of requests to available services in case of failover.
  • Declarative transaction support for GigaSpaces In-Memory Data Grid.

Core Middleware

The data grid relies on the JavaSpaces (space-based) model as its core middleware, and provides specialized components, implemented as wrapper facades on top of the space implementations, to deliver specific data or messaging semantics. The data grid exposes both the JavaSpaces API, with different flavors suited to the usage scenario (SQLQuery for data, FIFO for messaging, etc.), and other standard APIs such as JCache/JDBC and JMS.

Middleware virtualization facilities

  • Space-Based Clustering — provides all clustering services necessary to stateful applications. Based on a clustered JavaSpaces implementation.
  • In-Memory Data Grid — provides data caching semantics on top of the GigaSpaces core middleware; addresses the key issues of distributed state sharing. Supports a wide set of APIs including JDBC for SQL/IMDB, hash table through Map/JCache interface, and JavaSpaces. All common caching topologies are supported, including replication and partitioning of data. The key features of this component, and their benefits, are:
    - Extended and standard query based on SQL, and the ability to connect to the data grid using a standard JDBC connector: makes the data grid accessible to standard reporting tools, and makes accessing the data grid just like accessing a JDBC-compatible database, reducing the learning curve.
    - SQL-based continuous query support: brings relevant data close to the local memory of the relevant application instance.
    - GigaSpaces Management Center — central management, monitoring and control of all data grid instances on the network: allows the entire data grid to be controlled and viewed from an administrator's console. (The GigaSpaces Management Center has been deprecated and will be removed in a future release.)
    - Ops Manager — browser-based dashboard for monitoring and troubleshooting your GigaSpaces environment: enables viewing and administering GigaSpaces deployments across cloud, on-premise and hybrid platforms.
    - Mirror Service — transparent persistence of data from the entire data grid to a legacy database or other data source: allows seamless integration with existing reporting and back-office systems.
    - Real-time event notification — application instances can selectively subscribe to specific events: provides capabilities usually offered by messaging systems, including slow-consumer support, FIFO, batching, pub/sub, and content-based routing.
  • Messaging Grid — enables services to communicate and share information across the distributed data grid. Supports a variety of messaging scenarios using the JavaSpaces or JMS API.
  • Parallel Processing — enables parallel execution of low latency, high-throughput business transactions, using the Master-Worker pattern.

SLA-Driven Container

SLA-Driven Containers (known also as Grid Service Containers or GSCs), enable deployment of Processing Units over a dynamic pool of machines, based on SLA definitions. Each container is a Java process which provides a hosting environment for application services bundled in a Processing Unit. The container virtualizes the underlying compute resources, and performs mapping between the application runtime and the underlying resources, based on SLA criteria such as CPU and memory usage, hardware configuration and software resource availability (JVM, DB, etc). It also provides self-healing capabilities for handling failure or scaling events.

SLA-Driven In-Memory Data Grid

Data grid instances are constructed out of Space instances. The data grid can be deployed just like any other service, within a Processing Unit. This provides the option to associate SLA definitions with the data grid. A common use is to relocate data grid instances based on memory utilization; another use is to use SLA definitions to handle deployment of different data grid topologies over the same containers.

The SLA-Driven Containers can also add self-healing characteristics. When one container crashes, the failed data grid instances are automatically relocated to the available containers, providing the application with continuous high-availability.

Runtime Perspective

From a runtime perspective, a data grid cluster is a cluster of machines, each running one or more instances of SLA-Driven Containers. The containers are responsible for exposing the hardware resources to the data grid applications.

An application running with the data grid is built of multiple Processing Units. The Processing Units are packaged as part of a bundle; bundle structure is compliant with Spring/OSGI. Each bundle contains a deployment descriptor named pu.xml, a Spring application context file with OpenSpaces component extensions. This file contains the Processing Unit's SLA definition, as well as associations between the application components, namely the POJO services, the space middleware components, and most commonly, a data grid.

The application is deployed through a GSM (Grid Service Manager) which performs match-making between the SLA definitions of the application’s Processing Unit and the available SLA-Driven Containers. The SLA definitions include the number of instances that need to be deployed, the number of instances that should run per container and across the entire network, and system requirements such as the required JVM version or database version.

Different applications may have one or more instances of their Processing Units running in the same container at the same time. Even though the applications share the same JVM instance, they are kept isolated through application-specific classloaders.

SOA/EDA Perspective

The Space-Based Architecture (SBA) can be viewed as a special case of SOA/EDA, designed specifically for high-performance stateful applications.

A classic SOA is based on the Enterprise Service Bus (ESB) model, as shown in the diagram above. In this model, services become loosely-coupled through the use of a messaging bus. Scaling is done by adding more services into the bus and load-balancing the requests between them.

Most implementations of this model rely on web services to handle message flow between the services. These implementations cannot handle the statefulness of the services. So while the loosely-coupled concept of SOA is promising for simplifying and scaling services over the network, most existing implementations of this model are not suited to handling high-performance applications, especially not in the context of stateful services.

With SBA, a similar model can be implemented using the space. The space functions as an in-memory messaging bus — an ESB for delivering and routing transactions — but also as an In-Memory Data Grid which can support stateful services.

To avoid the I/O overhead associated with the interaction of these services with either the messaging layer or the data layer, SBA introduces the concept of a Processing Unit, which is essentially a deployment/runtime optimization. Instead of having each component of the architecture separate and remote, we bundle together the relevant message queue, its associated services, and the data into a single unit: the Processing Unit, which always runs in a single VM. In this case, the interaction between the messaging, the services, and the data layer is done in-process as well as in-memory, ensuring the lowest possible latency.

The services that reside within a Processing Unit are just like any other services in the web services world. Their lifecycle can be managed individually, and they can be deployed and upgraded dynamically without bringing down the entire Processing Unit (assuming they are implemented as OSGI services).

The services can also be accessed by any other services, whether they reside in the same processing unit or are remote clients. In the case of collocated services, interaction is very efficient, since it is done entirely in-memory. In the case of remote services, Processing Unit services can be accessed in various forms, including the classic remoting topology. The following section describes the different interaction and runtime options for clients interacting with SBA Processing Unit services.

Remote Client Perspective

Applications deployed in the data grid are distributed across multiple machines. In the classic tier-based approach, remote client interactions were mainly RPC-based, or in some cases message-driven. RPC-based communication assumes a direct reference to a remote server. This approach doesn't work in GigaSpaces-based applications, because they span multiple physical machines and change their location during runtime based on the SLA. This leads to the requirement that client interaction with GigaSpaces applications be done through a virtualized remote reference, which can keep track of the different application instances during runtime, and route client requests to the appropriate instance based on the request type, its content, and so on.

Modes of Interaction

GigaSpaces-based applications enable several modes of communication between the client application and the actual server instances, all relying on the space to enable virtualization of the interaction.

Data-Driven Interaction

Data-driven interaction is common in analytics scenarios. It means that the client application interacts primarily with the application data, by performing queries and updates. The business logic is triggered as a result of this interaction, by means of notifications (the equivalent of database triggers) or extended queries (the equivalent of stored procedures).

This mode of interaction can be achieved in two ways:

  • By interacting directly using the space interface. In this case, a write operation on the space is the equivalent of an update/put or insert; a read operation is the equivalent of a select; and a notify operation is the equivalent of a trigger.
  • Using the wrapper facades provided by GigaSpaces, such as the JCache/Map interface or JDBC driver, which perform this mapping implicitly.

Message-Driven Interaction

Message-driven interaction is common in transaction processing scenarios and is based on the Command Pattern (also known as the Master-Worker Pattern). In this pattern, applications interact by sending command messages; services on the server side await these messages and are triggered by their arrival. Ordinarily, the business logic of the services is to interact with the IMDG to retrieve current state information, to reference the data, and finally to synchronize state to enable workflow with other services.

In GigaSpaces, this mode of interaction can be achieved in two ways:

  • By interacting directly with the space interface. In this case, a write operation is the equivalent of a send and a notify or take operation is the equivalent of a receive or subscribe, respectively.
  • Using the JMS interface, which is provided as a wrapper on top of the space API, and maps between the JMS operation and the required space operations.
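A minimal sketch of this direct, space-as-queue style in Java follows. PaymentCommand is a hypothetical command class; the producer's write corresponds to send, and the consumer's blocking take corresponds to receive.

import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;

@SpaceClass
public class PaymentCommand {
    private String id;
    private String payload;

    public PaymentCommand() {
    }

    public PaymentCommand(String payload) {
        this.payload = payload;
    }

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getPayload() { return payload; }
    public void setPayload(String payload) { this.payload = payload; }
}

// Producer side: write is the equivalent of send.
//   gigaSpace.write(new PaymentCommand("charge order 42"));
//
// Consumer side: take is the equivalent of a blocking receive,
// waiting up to 10 seconds for a matching command to arrive.
//   PaymentCommand cmd = gigaSpace.take(new PaymentCommand(), 10000);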

RPC-Based Interaction

RPC (Remote Procedure Call) is used to invoke a business logic method on a remote server. It differs from message-driven interaction in two respects:

  • Synchronous by nature — based on a request-response interaction.
  • Type-safe — ensures type safety of the operation and its arguments at compile time, because it is directly bound to the remote service interface.

This mode of interaction is achieved by space-based remoting. This method leverages the fact that the space is already exposed as a remote entity, and has an existing virtualization mechanism, to enable remote invocation on services that are spread across multiple Processing Units, possibly running on multiple physical machines.

With space-based remoting, a remote stub is generated for the remote service, using dynamic proxies. When a method is invoked on this proxy, the stub implicitly maps it to a command that is written to the space and routed to the appropriate server instance. On the server side, a generic delegator takes these commands and executes the method on the specific bean instance, based on the method name and arguments provided in the command. The result is also returned through the space, received by the dynamic proxy, and returned transparently to the client as the return value of the method.
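The sketch below illustrates this with the OpenSpaces remoting API: a shared service interface, a service implementation deployed in the Processing Unit, and a client-side dynamic proxy built over the space. The PriceService names are placeholders, the types would normally live in separate source files, and remoting support is assumed to be enabled in the Processing Unit's configuration.

import org.openspaces.core.GigaSpace;
import org.openspaces.remoting.ExecutorRemotingProxyConfigurer;
import org.openspaces.remoting.RemotingService;

// Shared contract, packaged with both the Processing Unit and its clients.
public interface PriceService {
    double quote(String symbol);
}

// Server side: deployed inside the Processing Unit and exported as a remote service.
@RemotingService
public class DefaultPriceService implements PriceService {
    public double quote(String symbol) {
        return 42.0; // placeholder business logic
    }
}

// Client side: builds a dynamic proxy that routes invocations through the space.
public class PriceClient {
    public static double quoteFor(GigaSpace gigaSpace, String symbol) {
        PriceService service =
                new ExecutorRemotingProxyConfigurer<PriceService>(gigaSpace, PriceService.class).proxy();
        return service.quote(symbol);
    }
}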

The runtime components of GigaSpaces XAP are:

  • Grid Service Manager (GSM) The Grid Service Manager is the component which manages the deployment and life cycle of the processing unit.
  • Grid Service Container (GSC) The Grid Service Container provides an isolated runtime for one (or more) processing unit instance and exposes its state to the GSM.
  • The Lookup Service (LUS) The Lookup Service provides a mechanism for services to discover each other. Each service can query the lookup service for other services, and register itself in the lookup service so other services may find it. For example, the GSM queries the LUS to find active GSCs.
  • Grid Service Agent (GSA) The GSA is a process manager that can spawn and manage service grid processes (Operating System level processes) such as the Grid Service Manager, The Grid Service Container, and The Lookup Service. Using the agent, you can bootstrap the entire data grid very easily, and start and stop additional GSCs, GSMs and lookup services at will. Usually, a single GSA is run per machine.

Installation

Download the latest XAP release

Installation

  • Unzip the distribution into a working directory; this directory is referred to as GS_HOME
  • Set the JAVA_HOME environment variable to point to the JDK root directory
  • Start your favorite Java IDE
  • Create a new project
  • Include all jar files from the GS_HOME/lib/required in the classpath

Deploying a data grid

To start the service grid locally with a single container, run the bin/gs.(sh|bat) as follows:

  • gs.(sh|bat) host run-agent --auto --containers=1

To deploy a data grid called myDataGrid, run:

  • gs.(sh|bat) space deploy myDataGrid

Data Partitioning

The space can have a single instance that runs on a single JVM, or multiple instances that run on multiple JVMs. When there are multiple instances, the spaces can be set up in one of several topologies. This architecture determines how the data is distributed across the JVMs.

Available topologies:

  • Replicated — data is copied to all of the JVMs in the cluster.
  • Partitioned — data is distributed across all of the JVMs, each containing a different data subset. A partition is a subset of data that is distributed by a routing key.
  • Partitioned with backup — data resides in a partition, and also in one or more backup space instances for this partition.

With a partitioned topology, data or operations on data are routed to one of several space instances (partitions). Each space instance holds a subset of the data, with no overlap. Business logic can be collocated within the partition to allow for fast parallel processing.
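For example, a space class declares its routing property with the @SpaceRouting annotation, and the partition is derived from that property's value. This is a minimal sketch, with hypothetical Trade/accountId names:

import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;

@SpaceClass
public class Trade {
    private String id;
    private String accountId;
    private double amount;

    public Trade() {
    }

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    // Routing key: trades with the same accountId always land in the same partition.
    @SpaceRouting
    public String getAccountId() { return accountId; }
    public void setAccountId(String accountId) { this.accountId = accountId; }

    public double getAmount() { return amount; }
    public void setAmount(double amount) { this.amount = amount; }
}

All trades written with the same accountId are routed to the same partition, so a collocated service in that partition can process them without any remote calls.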

Installation local setup

Running Gigaspaces locally

Overview — Cluster Managers

A data grid requires a cluster manager. The following cluster managers are available:

  • Standalone
  • Service Grid
  • Kubernetes
  • ElasticGrid

The “Service Grid” is recommended for beginners, which is what we’ll show here. If you’re using the open source package, you’ll need to use the “Standalone” cluster manager, which is discussed later in this page.

Tip: The cluster manager includes a web-based UI which is started at http://localhost:8090

Deploying a data grid

To start the service grid locally with a single container, run the bin/gs.(sh|bat) as follows:

  • gs.(sh|bat) host run-agent --auto --containers=1

To deploy a data grid called myDataGrid, run:

  • gs.(sh|bat) space deploy myDataGrid

Now that we have a remote data-grid, we can connect to it.

Using maven:

mvn compile && mvn exec:java -Dexec.mainClass=HelloWorld -Dexec.args="-name myDataGrid -mode remote"
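If you prefer plain Java over the bundled HelloWorld example, a remote proxy to the deployed grid can be created as in the sketch below. The space name must match the one used in the deploy command (myDataGrid), and the count call simply verifies the connection.

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.SpaceProxyConfigurer;

public class ConnectToDataGrid {
    public static void main(String[] args) {
        // Connect to the data grid deployed above ("myDataGrid") as a remote client.
        GigaSpace gigaSpace = new GigaSpaceConfigurer(new SpaceProxyConfigurer("myDataGrid")).gigaSpace();

        // A null template counts all objects currently stored in the grid.
        System.out.println("Connected. Objects in grid: " + gigaSpace.count(null));
    }
}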

Stopping the data grid

To terminate the local service grid agent, run:

  • gs.(sh|bat) host kill-agent

Deploying a data-grid with 2 partitions (optional)

Use the same commands, but specify 2 containers (1 per instance), and add the --partitions parameter:

  • gs.(sh|bat) host run-agent --auto --containers=2
  • gs.(sh|bat) space deploy myDataGrid --partitions=2

Deploying a data-grid with 2 highly-available partitions (optional)

Use the same commands, but specify 4 containers (1 per instance), and add the --ha parameter:

  • gs.(sh|bat) host run-agent --auto --containers=4
  • gs.(sh|bat) space deploy myDataGrid --partitions=2 --ha

Starting standalone data grid instance/s (without Service Grid)

Without the Service Grid, you will need to run the following commands using bin/gs.(sh|bat)

Single data grid instance

  • gs.(sh|bat) space run --lus myDataGrid

Data grid with 2 partitions

  • gs.(sh|bat) space run --lus --partitions=2 myDataGrid

Manually load each instance separately

Each partition instance loads separately, as follows:

  1. Specify --partitions=2 for two partitions
  2. Specify --instances=1_1 or --instances=2_1 for each partition instance

From the ${GS_HOME}/bin directory, run (in 2 separate terminals):

  • gs.(sh|bat) space run --lus --partitions=2 --instances=1_1 myDataGrid
  • gs.(sh|bat) space run --lus --partitions=2 --instances=2_1 myDataGrid

This will simulate a data-grid of 2 partitioned instances (without backups).

Data grid with 2 highly available partitions (with backups for each partition)

  • gs.(sh|bat) space run --lus --partitions=2 --ha myDataGrid

Manually load each instance separately

Each partition instance can be assigned a backup, as follows:

  1. Specify --partitions=2 for two partitions, --ha for high availability meaning a single backup for each partition.
  2. Specify --instances=1_1 to load primary of partition id=1, --instances=1_2 to load the backup instance of partition id=1

First partition:

  • gs.(sh|bat) space run --lus --partitions=2 --ha --instances=1_1 myDataGrid
  • gs.(sh|bat) space run --lus --partitions=2 --ha --instances=1_2 myDataGrid

Second partition:

  • gs.(sh|bat) space run --lus --partitions=2 --ha --instances=2_1 myDataGrid
  • gs.(sh|bat) space run --lus --partitions=2 --ha --instances=2_2 myDataGrid

Connecting to GigaSpaces

Deploying First App

Space Vs Space Instance Vs PU

In GigaSpaces lingo, a data grid is called a Space, and a data grid node is called a Space Instance. The Space is hosted within a Processing Unit (PU), which is the GigaSpaces unit of deployment.
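To make the relationship concrete, here is a minimal sketch of a pu.xml that hosts an embedded Space inside a Processing Unit and exposes it to application beans as a GigaSpace proxy. The space name is illustrative, and the os-core schema location shown is the commonly used unversioned form; adjust it to your product version if needed.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:os-core="http://www.openspaces.org/schema/core"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.openspaces.org/schema/core http://www.openspaces.org/schema/core/openspaces-core.xsd">
<!-- The Space hosted (embedded) inside this Processing Unit -->
<os-core:embedded-space id="space" space-name="myDataGrid"/>
<!-- The GigaSpace proxy that application beans use to access the Space -->
<os-core:giga-space id="gigaSpace" space="space"/>
</beans>

For example, deploying this Processing Unit with two partitions results in two Space Instances of myDataGrid, one hosted in each Processing Unit instance.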

Distributed Processing

Event Driven Processing

Mirroring and Persistence

Deploying JPA app

Service Level Agreement (SLA)

The GigaSpaces runtime environment (also referred to as the Service Grid) provides SLA-driven capabilities via the Grid Service Manager (GSM) and the Grid Service Container (GSC) runtime components. The GSC is responsible for running one or more Processing Units. The GSM is responsible for analyzing the deployment and provisioning the processing unit instances to the available GSCs.

The SLA definitions are only enforced when deploying the Processing Unit to the Service Grid, because this environment actively manages and controls the deployment using the GSM(s). When running within your IDE or in standalone mode, these definitions are ignored.

The SLA definitions can be provided as part of the Processing Unit package, or during the Processing Unit’s deployment process. These definitions determine the number of Processing Unit instances that should run, and deploy-time requirements such as the amount of free memory or CPU, or the clustering topology for Processing Units that contain a Space. The GSM reads the SLA definitions, and deploys the Processing Unit to the available GSCs accordingly.

Defining the SLA for Your Processing Unit

The SLA contains the deployment criteria in terms of clustering topology (if it contains a Space) and deployment-time requirements. It can be defined in multiple ways:

  • Include an sla.xml file that contains the definitions within the Processing Unit's JAR file. This file can be located at the root of the Processing Unit JAR or under the META-INF/spring directory, alongside the Processing Unit's XML file. This is the recommended method.
  • Embed the SLA definitions within the Processing Unit’s pu.xml file.
  • Provide a separate XML file with the SLA definitions to the GSM at deployment time via one of the deployment tools.
  • Use the deployment tools themselves to provide/override the Processing Unit’s SLA (see below). For example, the GUI deployment dialog enables you to type in various SLA definitions, such as the number of instances, number of backups, and Space topology.

The SLA definition, whether it comes in a separate file or embedded inside the pu.xml file, is composed of a single <os-sla:sla> XML element. A sample SLA definition is shown below (you can use plain Spring definitions, or GigaSpaces-specific namespace bindings):

<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:os-sla="http://www.openspaces.org/schema/sla"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.openspaces.org/schema/sla http://www.openspaces.org/schema/16.1.1/sla/openspaces-sla.xsd">
<os-sla:sla cluster-schema="partitioned" number-of-instances="2" number-of-backups="1" max-instances-per-vm="1">
...
</os-sla:sla>
</beans>

The SLA definition shown above creates four instances of a Processing Unit using the partitioned space topology. It defines two partitions (number-of-instances="2"), each with one backup (number-of-backups="1"). In addition, it requires that a primary and a backup instance of the same partition not be provisioned to the same GSC (max-instances-per-vm="1").

It is up to the developer to configure the SLA correctly. Trying to deploy a Processing Unit with a cluster schema that requires backups without specifying numberOfBackups will cause the deployment to fail.

In older releases, the SLA definition also included dynamic runtime policies, such as creating additional Processing Unit instances based on CPU load, relocating a certain instance when the memory becomes saturated, etc. These capabilities are still supported, but have been deprecated in favor of the Administration and Monitoring API which supports the above and much more.

Defining the Space Cluster Topology

When a Processing Unit contains a Space, the SLA definition specifies the clustering topology for that Space. The cluster-schema XML element defines this, and is an attribute of the os-sla:sla XML element, as shown above.

The Space's clustering topology is defined within the SLA definition rather than the Space definition in the pu.xml file, for the following reasons:

  • The clustering topology directly affects the number of instances of the Processing Unit, and is therefore considered part of its SLA. For example, a partitioned Space can have two primaries and two backups, and a replicated Space can have five instances. This means that the Processing Unit containing them will have the same number of instances.
  • Separating the clustering topology from the actual Space definition enables truly implementing the “write once, scale anywhere” principle. You can run the same Processing Unit within your IDE for unit tests using the default single-Space-instance clustering topology, and then deploy it to the runtime environment and run the same Processing Unit with the real clustering topology (for example, with 4 partitions).

You can choose from numerous clustering topologies:

  • default: Single Space instance, no replication or partitioning
  • sync-replicated: Multiple Space instances. When written to one of the Space instances, objects are synchronously replicated to all Space instances. The maximum capacity of this topology is that of the smallest JVM in the cluster.
  • async-replicated: Multiple Space instances. When written to one of the Space instances, objects are asynchronously replicated to all Space instances. The maximum capacity of this topology is that of the smallest JVM in the cluster.
  • partitioned: Multiple Space instances. Objects are distributed across all of the Space instances, so that each instance contains a separate subset of the data and forms a separate partition. The partitioning (distribution) of objects is based on their routing property. Optionally, when using this topology, you can designate one or more backup instances for each of the partitions, so that when an object is written to a certain partition, it is synchronously replicated to the backup copy(ies) of that partition. The maximum capacity of this topology is the overall capacity of all of the JVMs in the cluster, divided by the number of backups+1.

From the client application’s perspective (the one that connects to the Space from another process), the clustering topology is transparent in most cases.

The number-of-backups parameter should be used with the partitioned cluster schema. It is not supported with the sync-replicated or async-replicated cluster schema.

SLA-Based Distribution and Provisioning

When deployed to the service grid, the Processing Unit instances are distributed based on the SLA definitions. These definitions form a set of constraints and requirements that should be met when a Processing Unit instance is provisioned to a specific container (GSC). The SLA definitions are considered during the initial deployment, relocation, and re-provisioning of an instance after failure.

Default SLA Definitions

If no SLA definition is provided either within the Processing Unit XML configuration or during deploy time, the following default SLA is used:

<os-sla:sla number-of-instances="1" />

Maximum Instances per VM/Machine

The SLA definition allows you to define the maximum number of instances for a certain Processing Unit, either per JVM (GSC) or physical machine (regardless of the number of JVMs/GSCs that are running on it).

The max-instances parameter has different semantics when applied to Processing Units that contain a Space with primary-backup semantics (that uses the partitioned cluster schema and defines at least one backup), and when applied to a Processing Unit that doesn't contain an embedded Space, or that contains a Space with no primary-backup semantics.

When applied to a Processing Unit that doesn’t contain an embedded Space, or that contains a Space with no primary-backup semantics, the max-instances parameter defines the total number of instances that can be deployed on a single JVM or on a single machine.

When applied to a Processing Unit that contains a Space with primary-backup semantics, the max-instances parameter defines the total number of instances (which belong to the same primary-backup group or partition) that can be provisioned to a single JVM or a single machine.

The most common use of the max-instances parameter is for a Processing Unit that contains a Space with primary-backup semantics. Setting the value to 1 ensures that a primary and its backup(s) cannot be provisioned to the same JVM (GSC)/physical machine.

max-instances-per-vm
The max-instances-per-vm defines the maximum number of instances per partition for a Processing Unit with an embedded Space. A partition may have a primary or backup instance. max-instances-per-vm=1 means you won't have a primary and a backup of the same partition provisioned to the same GSC. (It is allowed to have multiple partitions with primary or backup instances provisioned to the same GSC.) You cannot limit the amount of instances from different partitions that a GSC may host.

If you have enough GSCs (the number of partitions x2), you will have a single instance per GSC. If you don't have enough GSCs, the primary and backup instances of the different partitions will be distributed across all the existing GSCs, and the system tries to distribute the primary instances evenly across all the GSCs on all the machines. If you increase the number of GSCs after the initial deployment, you must “rebalance” the system, meaning distribute all the primaries across all the GSCs.

You can perform this activity via the API. Rebalancing the instances will increase the capacity of the data grid (because it will result in more GSCs hosting the data grid). The rebalance concept is based on the assumption that there are initially more partitions than GSCs. The ratio of partitions to GSCs is called the scaling factor; it determines how much your data grid can expand without shutdown, and without increasing the number of partitions.

Example: Start with 4 GSCs and 20 partitions with backups (40 instances). Each GSC will initially host 10 instances, and the deployment can end up with 40 GSCs (after one or more rebalancing operations), where each GSC hosts a single instance. In this example, the capacity of the data grid is increased tenfold without downtime, while the number of partitions remains the same. See the capacity planning section for more details.

The following is an example of setting the max-instances-per-vm parameter:

<os-sla:sla max-instances-per-vm="1" />

The following is an example of setting the maximum instances per machine parameter:

<os-sla:sla max-instances-per-machine="1" />

Total Maximum Instances per VM

This refers to the total number of Processing Unit instances that can be instantiated within a GSC. It should not be confused with the max-instances-per-vm parameter, which controls how many instances a partition may have within a GSC but does not limit the total number of instances from different Processing Units, or from other partitions, that can be provisioned into a GSC. To control the total maximum instances per VM, use the com.gigaspaces.grid.gsc.serviceLimit system property and set its value before running the GSC:

set GS_GSC_OPTIONS=-Dcom.gigaspaces.grid.gsc.serviceLimit=2

The default value of the com.gigaspaces.grid.gsc.serviceLimit is 1.

Monitoring the Liveness of Processing Unit Instances

The GSM monitors the liveness of all the Processing Unit instances it provisioned to the GSCs. The GSM pings each instance in the cluster to see whether it is available.

You can control how often a Processing Unit instance is monitored by the GSM, and in case of failure, how many times the GSM will try again to ping the instance and how long it will wait between retry attempts.

This is done using the <os-sla:member-alive-indicator> element, which contains the following attributes:

  • invocation-delay: How often (in milliseconds) an instance is monitored and verified to be alive by the GSM. Default: 5000 (5 seconds).
  • retry-count: After an instance has been determined to not be alive, how many times to check it before giving up on it. Default: 3.
  • retry-timeout: After an instance has been determined to not be alive, the timeout interval between retries (in milliseconds). Default: 500 (0.5 seconds).

When a Processing Unit instance is determined to be unavailable, the GSM that manages its deployment tries to re-provision the instance on another GSC (according to the SLA definitions).

<os-sla:sla>
<os-sla:member-alive-indicator invocation-delay="5000" retry-count="3" retry-timeout="500" />
</os-sla:sla>

Troubleshooting the Liveness Detection Mechanism

For troubleshooting purposes, you can lower the logging threshold of the relevant log category by modifying the log configuration file located under $GS_HOME/config/log/xap_logging.properties on the GSM. The default definition is as follows:

org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler.level = INFO

You can change it to one of the thresholds below for more information:

  • CONFIG: Logs the configurations applied.
  • FINE: Logs when a member is determined to be not alive.
  • FINER: Logs when a member is indicated to be not alive (on each retry).
  • FINEST: Logs every fault detection attempt.

Level FINE is generally sufficient for service-failure troubleshooting.

Failover Considerations

If you are using Kubernetes, all service failures are handled by Kubernetes operations.

GigaSpaces applications provide continuous high-availability even when the infrastructure processes or entire (physical/virtual) machines fail. This capability is provided out of the box but does require some attention to configuration to meet the needs of your specific application.

N+1 and N+2 Configurations

When determining the optimal high-availability configuration for your particular application, you have to balance the cost of additional hardware (or virtual machines) against the risk of downtime. In most cases, it pays to have additional resources available to avoid downtime, which can compromise system health and result in instability, poor reliability, no durability, and incompleteness.

The two most common GigaSpaces deployment configurations are referred to as N+1 and N+2. This refers to the number of machines running GigaSpaces that can fail while the data grid and its applications continue to deliver reasonable performance and remain in good health. In an N+1 configuration, the core N machines have sufficient RAM and CPU power to run the application if one of the N+1 machines fails. In an N+2 configuration, the same is true if two of the machines fail or become unavailable.

In either configuration, the data grid (or any deployed business logic) is distributed across all available machines. Each machine hosts a set of services and there are at least two GSMs and two LUSs running. When deploying a regular (static) PU, you may have spare services available on each machine to accommodate a failure. If a machine becomes unavailable, the backup PU instance corresponding to the primary nodes on that machine becomes the primary, and the GSM provisions a new backup instance in one of the spare services. In this case, you may have to call the rebalance utility to redistribute the primary and backups evenly across all services. This failover is transparent to clients of the application, and to the business logic running within it — see Hosts, Zones & Machine Utilization.

Grid Failure Handling Strategy

When deploying your GigaSpaces-based system in production, you should consider the following failure handling strategies to determine the worst-case scenario that should be taken into account for your specific environment. Your final failover plan should address GSC, machine/VM, and complete data center failure.

Single Service Failure

This is the simplest scenario to plan for, and is generally sufficient for small deployments that are not mission critical and do not require continuous high availability to survive multiple failures. The GSA will manage the service life cycle locally or globally and accommodate service failure.

Multiple Service Failures

Consider using Kubernetes services to perform the orchestration, and to deploy new services when needed. See Rescaling Your Application in GigaSpaces.

To maintain continuous operation when more than one service fails at a time, the GigaSpaces agents (GSAs) will restart the failed services.

If you require horizontal scaling, you can repartition by starting a new service (see On Demand Scale Out/In).

If you require up/down vertical scaling, proceed as follows (see the sketch after this list):

  • Restart the backup service after changing the configuration.
  • Let the restarted service recover its data from the primary partition.
  • Demote the primary.
  • Restart the instance that was previously the primary (which will now be the backup).
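
Below is a minimal Admin API sketch of the first two steps: restarting each backup instance and giving it time to recover from its primary. The lookup group, processing unit name, and wait times are placeholders; the demote and final restart steps follow the same pattern using the demote facility (API or CLI) available in your XAP version.

```java
import java.util.concurrent.TimeUnit;

import com.gigaspaces.cluster.activeelection.SpaceMode;

import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.pu.ProcessingUnit;
import org.openspaces.admin.pu.ProcessingUnitInstance;

public class RollingBackupRestart {
    public static void main(String[] args) throws InterruptedException {
        Admin admin = new AdminFactory().addGroup("my-group").createAdmin();
        try {
            // "mySpace" is a placeholder processing unit name
            ProcessingUnit pu = admin.getProcessingUnits().waitFor("mySpace", 30, TimeUnit.SECONDS);
            if (pu == null) {
                throw new IllegalStateException("Processing unit not found");
            }
            pu.waitFor(pu.getTotalNumberOfInstances());

            for (ProcessingUnitInstance instance : pu.getInstances()) {
                // Step 1: restart only the backup instances
                if (instance.getSpaceInstance() != null
                        && instance.getSpaceInstance().getMode() == SpaceMode.BACKUP) {
                    instance.restart();
                    // Step 2: crude wait for the backup to recover from its primary;
                    // a real script would poll the recovery/replication state instead
                    TimeUnit.MINUTES.sleep(1);
                }
            }
            // Steps 3-4 (demote each primary, then restart the old primary) are
            // performed the same way with the demote API/CLI of your XAP version.
        } finally {
            admin.close();
        }
    }
}
```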

Complete VM/Machine Failure

In a scenario where the failure of an entire machine could be damaging to your system, you must allocate enough capacity on another machine to host the services that become unavailable. This can be done in advance, or by monitoring (for example, with the Admin API) to detect the situation and starting the services only when required. Note that such a standby service will be empty, but it still consumes some system resources.
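
One way to automate this, as a sketch under assumptions (the lookup group, heap size, and the choice of surviving machine are placeholders), is to listen for machine-removed events with the Admin API and start a spare GSC so the GSM has room to re-provision the missing instances:

```java
import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.gsa.GridServiceAgent;
import org.openspaces.admin.gsa.GridServiceContainerOptions;
import org.openspaces.admin.machine.Machine;
import org.openspaces.admin.machine.events.MachineRemovedEventListener;

public class MachineFailureWatcher {
    public static void main(String[] args) {
        Admin admin = new AdminFactory().addGroup("my-group").createAdmin();

        // React when a machine disappears from the grid
        admin.getMachines().getMachineRemoved().add(new MachineRemovedEventListener() {
            public void machineRemoved(Machine machine) {
                System.out.println("Machine lost: " + machine.getHostAddress());

                // Start a spare GSC on one of the surviving machines
                GridServiceAgent gsa = admin.getGridServiceAgents().waitForAtLeastOne();
                gsa.startGridService(new GridServiceContainerOptions()
                        .vmInputArgument("-Xmx2g")); // placeholder heap size
            }
        });

        // Keep the monitor alive; a real implementation would run as a long-lived service
    }
}
```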

The Cloudify platform can be used to automate GigaSpaces installation and configuration of the machine upon recovery.

Complete Data Center Failure

See Multi-Region Replication for Disaster Recovery

The Cloudify platform can be used to automate GigaSpaces installation and configuration of the data center upon recovery.

Guaranteed Notifications

When using notifications (notify container, session messaging API), you should enable Guaranteed Notifications to handle a primary Space failure that occurs while notifications are being sent. This allows the backup (when promoted to primary) to continue notification delivery. Guaranteed Notifications are managed on the client side.

When considering a notify container that is embedded with the Space, note that guaranteed notifications are not supported in this scenario. This means that upon failure, notifications that have been sent might not be fully processed. In this case, a blocking read or asynchronous read should be considered as an alternative to notifications.
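
As an alternative, a client can consume events with a blocking take directly against the Space; the Space proxy fails over to the backup transparently. This is a minimal sketch assuming a placeholder Space name (mySpace) and a placeholder event class:

```java
import com.gigaspaces.annotation.pojo.SpaceId;

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.SpaceProxyConfigurer;

public class BlockingEventReader {

    // Placeholder event class written to the Space by producers
    public static class MyEvent {
        private String id;

        @SpaceId(autoGenerate = true)
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        // Connect to a remote Space ("mySpace" is a placeholder name)
        GigaSpace gigaSpace = new GigaSpaceConfigurer(
                new SpaceProxyConfigurer("mySpace")).gigaSpace();

        MyEvent template = new MyEvent(); // match-all template
        while (!Thread.currentThread().isInterrupted()) {
            // Block for up to 5 seconds waiting for a matching event
            MyEvent event = gigaSpace.take(template, 5000);
            if (event != null) {
                System.out.println("Processed event " + event.getId());
            }
        }
    }
}
```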

Balanced Primary-Backup Provisioning

To plan for failure of the data grid nodes or a machine running GigaSpaces, you should consider having even distribution of the primary and backup instances across all existing machines running GigaSpaces. This will ensure balanced CPU utilization across all GigaSpaces machines.

For this purpose, swap the primary and backup instances after a failover, or see Deterministic Deployment.
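
One way to verify the distribution is to count primaries and backups per machine with the Admin API. This is a minimal sketch; the lookup group and processing unit name are placeholders, and it simply reports all Space instances hosted on each machine:

```java
import java.util.concurrent.TimeUnit;

import com.gigaspaces.cluster.activeelection.SpaceMode;

import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.machine.Machine;
import org.openspaces.admin.pu.ProcessingUnit;
import org.openspaces.admin.pu.ProcessingUnitInstance;

public class PrimaryBackupBalanceReport {
    public static void main(String[] args) {
        Admin admin = new AdminFactory().addGroup("my-group").createAdmin();
        try {
            // Make sure the data grid is fully deployed before reporting
            ProcessingUnit pu = admin.getProcessingUnits().waitFor("mySpace", 30, TimeUnit.SECONDS);
            if (pu != null) {
                pu.waitFor(pu.getTotalNumberOfInstances());
            }

            for (Machine machine : admin.getMachines().getMachines()) {
                int primaries = 0;
                int backups = 0;
                for (ProcessingUnitInstance instance : machine.getProcessingUnitInstances()) {
                    if (instance.getSpaceInstance() == null) {
                        continue; // not a Space-hosting instance
                    }
                    if (instance.getSpaceInstance().getMode() == SpaceMode.PRIMARY) {
                        primaries++;
                    } else if (instance.getSpaceInstance().getMode() == SpaceMode.BACKUP) {
                        backups++;
                    }
                }
                System.out.println(machine.getHostAddress()
                        + " primaries=" + primaries + " backups=" + backups);
            }
        } finally {
            admin.close();
        }
    }
}
```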

CPU Utilization

CPU utilization should not exceed 40% on average, to support complete machine/VM failure. This buffer will enable the machines running GigaSpaces to cope with the additional capacity required when one or more machines running the grid fail or go through maintenance.

Local GigaSpaces Component Failure

LUS

To enable failover on LUS (Lookup Service) failure, at least two LUSs should be started for redundancy. You can use a global or local LUS configuration to ensure that two LUSs will be running.

When using a multicast discovery configuration, it is recommended to use the global LUS configuration to ensure that two LUSs will run on two different machines, even if a machine running a LUS fails. When using a unicast lookup discovery configuration and a LUS fails, clients and service grid components may throw exceptions because they internally keep retrying lookup discovery for the missing lookup service.

You can configure the lookup discovery intervals using the com.gigaspaces.unicast.interval property.

GSA

The GSA acts as a process manager, watching the GSC, LUS and GSM processes. GSA failure is a very rare scenario. If it happens, you should look for unusual hardware, operating system, or JDK failure that may have caused it. To address GSA failure you should run it as a service so that the operating system will restart it automatically if it fails (see how you can run it as a Windows Service or a Linux Service).

GSM

The GSM is responsible for deployment and provisioning of deployed PUs (stateless, stateful). The GSM is utilized during deployment, PU failure, and PU migration from one service to another. To support high availability, you should have two GSMs started per grid. You can use either a global or local GSM configuration to ensure that two GSMs will be running. In most cases, the global GSM configuration is recommended, unless you require hot deploy functionality.
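
A deployment script or health check can assert the two-LUS and two-GSM requirement with the Admin API. This is a minimal sketch; the lookup group and timeouts are placeholder values:

```java
import java.util.concurrent.TimeUnit;

import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;

public class RedundancyCheck {
    public static void main(String[] args) {
        Admin admin = new AdminFactory().addGroup("my-group").createAdmin();
        try {
            // Require at least two Lookup Services and two GSMs for high availability
            boolean lusOk = admin.getLookupServices().waitFor(2, 30, TimeUnit.SECONDS);
            boolean gsmOk = admin.getGridServiceManagers().waitFor(2, 30, TimeUnit.SECONDS);

            System.out.println("LUS redundancy: " + (lusOk ? "OK" : "MISSING"));
            System.out.println("GSM redundancy: " + (gsmOk ? "OK" : "MISSING"));
        } finally {
            admin.close();
        }
    }
}
```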

GigaSpaces Distributed Transactions

GigaSpaces Distributed Transactions involve a remote or local GigaSpaces Distributed Transaction Manager and one or more data grids. With a remote Distributed Transaction Manager, you should consider running at least two remote transaction managers. It is recommended to use a local Distributed Transaction Manager, as it makes the overall architecture simpler.
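
For example, a local (in-process) distributed transaction manager can be wired into the Space proxy so that no separate transaction manager PU needs to be deployed or failed over. This is a minimal sketch using Spring's transaction abstraction; the Space name is a placeholder:

```java
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.SpaceProxyConfigurer;
import org.openspaces.core.transaction.manager.DistributedJiniTxManagerConfigurer;

import org.springframework.transaction.PlatformTransactionManager;

public class LocalTxManagerExample {
    public static void main(String[] args) throws Exception {
        // Local distributed transaction manager, created inside the client process
        PlatformTransactionManager txManager =
                new DistributedJiniTxManagerConfigurer().transactionManager();

        // Space proxy bound to that transaction manager ("mySpace" is a placeholder)
        GigaSpace gigaSpace = new GigaSpaceConfigurer(new SpaceProxyConfigurer("mySpace"))
                .transactionManager(txManager)
                .gigaSpace();

        // Space operations performed through this proxy inside a Spring-managed
        // transaction (TransactionTemplate or @Transactional) will now be
        // coordinated by the local distributed transaction manager.
    }
}
```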

XA Transactions

XA transactions involve the XA transaction manager, GigaSpaces data grid node(s), and additional participant(s). The transaction manager is usually deployed independently. Because the transaction manager can fail, it should be deployed in a high-availability configuration. Client code should handle transaction manager failure by catching the relevant transaction exceptions and retrying the activity: aborting the old transaction, starting a new transaction, executing the relevant operations, and committing. The Atomikos and JBoss transaction managers are GigaSpaces certified and recommended.
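
The retry pattern described above can be sketched generically with Spring's transaction API. The transaction manager wiring, retry count, and business operation are placeholders and depend on the XA transaction manager you choose:

```java
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionException;
import org.springframework.transaction.TransactionStatus;
import org.springframework.transaction.support.DefaultTransactionDefinition;

public class RetryingTxClient {

    private static final int MAX_RETRIES = 3; // placeholder retry policy

    // txManager would typically be an XA-capable manager (e.g. Atomikos) injected by Spring
    public void executeWithRetry(PlatformTransactionManager txManager, Runnable businessOperation) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            TransactionStatus status = txManager.getTransaction(new DefaultTransactionDefinition());
            try {
                businessOperation.run();   // relevant Space / DB / JMS operations
                txManager.commit(status);
                return;                    // success
            } catch (TransactionException e) {
                // Abort the failed transaction and retry with a fresh one
                safeRollback(txManager, status);
                if (attempt == MAX_RETRIES) {
                    throw e;
                }
            } catch (RuntimeException e) {
                // Business failure: roll back and propagate without retrying
                safeRollback(txManager, status);
                throw e;
            }
        }
    }

    private static void safeRollback(PlatformTransactionManager txManager, TransactionStatus status) {
        if (!status.isCompleted()) {
            txManager.rollback(status);
        }
    }
}
```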

WAN Gateway Failure

The WAN Gateway acts as a broker, and is responsible for replicating activities conducted in the local data grid to another (remote) data grid. The WAN Gateway does not hold state, so its failure does not result in any data loss. However, while it is down, data is not replicated between the source and destination data grids. The WAN Gateway does not have to be deployed in a cluster configuration (i.e., primary-backup). By default, GigaSpaces will try to restart the WAN Gateway if it fails. The WAN Gateway is usually configured to use a specific port on specific machine(s), so you should configure the WAN Gateway PU to be provisioned on those specific machine(s).

Mirror Service Failure

The Mirror Service is like the WAN Gateway, acting as a broker. It is responsible for persisting activities conducted in the data grid into external data sources, such as a database.

The Mirror Service does not hold state, so its failure does not result in data loss. However, while the Mirror Service is down, data is not persisted to the external data source. The Mirror Service does not have to be deployed in a cluster configuration (i.e., primary-backup). By default, GigaSpaces will try to restart the Mirror Service if it fails. In many cases the Mirror Service accesses a database, which may be set to accept connections only from specific machines and ports. To address this, configure the database to allow connections from all machines that may run the Mirror Service, which is all GigaSpaces machines by default.

When Zookeeper-based leader election is used, deployment will be performed only if the majority of the Zookeeper instances are functioning (see Consistency Biased).

Multisite Replication (WAN)

Most Common Production Issues
