Transactions are a necessary part of nearly all business interactions. Transactions exist to ensure that all parts of a particular business operation are properly recorded; if any single part fails, then the transaction as a whole should fail in order to maintain data consistency. The advent of service-oriented architectures has added a layer of complexity to transaction management due to the nature of Web-based services; services are often asynchronous, stateless, distributed, and opaque. In order for businesses to gain full value from a service-oriented approach, service developers must understand the mechanics of transaction management, including resource recruitment, business function coordination, concurrency control, and recovery.

This two-part article discusses the nature of, and issues surrounding, the implementation of transactions in complex service-oriented architectures. Here in Part One, I describe transaction management via a transaction coordination service (TCS), which must be capable of organizing and controlling complex business operations including resource recruitment, business function management, concurrency control, and failure recovery. Part Two will present a candidate architecture for a Transaction Control Service.

Automation and business operations

A hundred years ago, business operations were very labor-intensive. Rooms filled with busy clerks were required to track each business interaction and maintain an organized and consistent accounting system for orders, deliveries, and payments. They used manual techniques, such as double-entry bookkeeping, to ensure that errors were detected and resolved before they could materially affect the business. A business "transaction" was a full accounting, from the taking of an order to the final payment from the customer for the product or service.

Today, software automation has permitted a nearly unattended way to conduct business operations. Web orders are taken, charges assessed, and product delivered without the intervention of a single human worker. Consequently, it has become faster and cheaper for businesses to conduct operations, while maintaining a very high level of accounting accuracy. It is also true that transactions have become much more complex, and may involve multiple companies in the final delivery to the client. Consider, for example, a bundled telecommunications service where a customer purchases satellite channel access, broadband Internet, cellular service, and Internet game subscriptions from a telecommunications vendor. The vendor in turn has agreements with other vendors to provide each service and share in part of the eventual payments. Without some form of automated monitoring, it would be quite difficult (and expensive) to manage such a complex business interaction. Thus, business operations may now involve many independent software systems, interacting in a distributed, asynchronous manner.

The introduction of service-oriented architectures (SOA) has taken this interoperability to an even higher level of complexity. In a SOA-based system, services are offered from one system to another (often from multiple competing companies) in a loosely coupled, platform-independent manner. These services can provide any level of business function, from order management, to billing, to inventory, to fulfillment. By collecting a set of services together, it is possible to build an arbitrarily complex business operation, all seamlessly interacting across a distributed network. The tricky part comes when information from one operation is required by another: for instance, an order number is needed for inventory assignment, which in turn generates a record that is used by a billing system to compute the monthly billing statement. Ensuring data consistency in this environment requires some form of transaction management by a transaction coordination service (TCS). A TCS must be capable of organizing and controlling complex business operations including resource recruitment, business function management, concurrency control, and failure recovery.

Coordinating distributed services

Working with distributed services requires a strong understanding of the business functions to be automated. The price for the flexibility of a service-oriented architecture, where services can be swapped in or out rapidly, is a premium on organizing and controlling business activities. Each service may only provide a small part of the overall business transaction, with results of one operation feeding into the processing of another (such as order fulfillment and billing). To be successful in creating a complex, interdependent series of services, the Services Coordinator (i.e., the System Architect) must define each service interface, the work products created and consumed by those services, and the rules for handling exceptional conditions.

The first step is to define the business function, which is a defined part of the business process that is supported by information automation. A specification for declaring the business flows and participants using XML has recently been published by a consortium of business partners (including IBM, BEA Systems, Microsoft, SAP AG, and Siebel Systems). This specification is called the Business Process Execution Language for Web Services (BPEL4WS)1. It permits the definition of business functions in a platform-independent manner, similar to the declaration of Web services by WSDL (Web Services Description Language). Of course, the client system that implements the BPEL definition must parse and interpret the specification's XML tags, but this approach is much more flexible than hard-coding the service descriptions and sequence information. Business transactions are described by another set of standards -- WS-Transaction -- where business functions are tied together (WS-Coordination) to form either independent units of work (WS-AtomicTransaction) or a collection of long-running operations (WS-BusinessActivity)2.

Business function coordination

In SOA terms, a business function is defined as the collection of distributed services that are performed to produce a defined work product. For example, lookupOrder (see Figure 1) would be a business function that returns a defined data structure detailing a customer's order information. A more complex example would be finalizeOrder (see Figure 2), where a series of operations are performed to fulfill the order and update billing information. Each activity in a business workflow is defined by the invoked service(s), the input parameter lists, and the output data.

 

Figure 1. Lookup order service

 

Figure 2. Finalize order service

Activity definition (services and messages)

Most services are defined by a WSDL description, as shown in Figure 3, which includes a description of the supported message types and service binding information.3 A service is defined by six major elements: the data types, input/output messages, message passing portTypes, binding protocols, binding address port, and the named service. The message data types and structures may conform to an existing business data transfer definition, such as a variety of ebXML implementations,4 or they may be defined by each service provider. The binding protocol is often based on SOAP (Simple Object Access Protocol) or HTTP (Hypertext Transfer Protocol), but a service-oriented architecture does not require that a single protocol be used by all services, just that the protocols be defined. Connection port and service information is also typically defined, although if a UDDI (Universal Description, Discovery, and Integration) registry is used, then the activity definition may include information used to identify a valid service from the registry.

 

Figure 3. WSDL service model

Functional organization (work groups)

Once the business function work units and services are defined, the next step is to organize these activities into meaningful work groups. A business function work group is a collection of business functions that have an overall unifying purpose.5 For example, an OrderManagementGroup, as shown in Figure 4, would be responsible for business activities involved in searching, retrieving, creating, modifying, and canceling customer orders. Each of these activities may be performed by an ordered collection of services (such that an order creation business function first checks to see that the order doesn't already exist), involving multiple passed messages. Organization of activities also provides the starting point for transaction demarcation (discussed below), where particular business operations are performed in the context of a transaction. Work groups may be further connected into chained operations, where either a serialized operation (such as order creation, processing, and fulfillment) or parallel operations (such as inventory assignment, bill generation, and account updating) are performed.

 

Figure 4. Service groups (service partition)

Coordination points (transaction demarcation)

Once the business activities are determined, and the set of services is defined and grouped, the next step is to indicate transaction demarcation points. Not all business operations require transaction management, only those containing multiple operations that must all complete successfully. In the example shown in Figure 5, the LookupCustomer and CreateCustomerRecord services are not wrapped in a transaction, but the FinalizeCustomerOrder service enters into a Create Order Transaction demarcated set of services. In this example, the FinalizeCustomerOrder service uses the transaction coordination service (TCS) to control the sequential operations of four other services, and supplies the TCS with the messages necessary for each operation (shown as attachments to the FinalizeCustomerRequest).

The transaction coordination diagram is used to show the constituents of a defined transaction, the order of processing (sequential vs. parallel), and nested transactions. In the example in Figure 5, the GenerateShippingRequest is a nested transaction created by the Assign Inventory Item service. Nested transactions may occur for many reasons, for example if a service is intended to be used independently or as part of a larger transaction context. Note that the same TCS can be used by nested transactions, ensuring that if the nested transaction fails, the parent transaction will also fail.

At the conclusion of the transaction process, the TCS polls all the transaction participants to determine if the transaction should be committed. In a two-phase commit protocol, the TCS first polls the participants to see if they are ready, and then issues the commit to each in turn. If a transaction participant fails to prepare, then all of the other participants are issued a roll-back command. In the case where participating services are not transaction-aware (e.g., stateless), then the TCS will invoke a compensating operation (i.e., "cancel").
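The polling sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Participant class and its prepare, commit, and rollback methods are hypothetical names, not part of any WS-Transaction API, and a real TCS would also handle time-outs and durable logging.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical transaction-aware service (names are assumptions)."""
    name: str
    can_prepare: bool = True
    log: list = field(default_factory=list)

    def prepare(self) -> bool:
        self.log.append("prepare")
        return self.can_prepare

    def commit(self) -> None:
        self.log.append("commit")

    def rollback(self) -> None:
        self.log.append("rollback")


def two_phase_commit(participants) -> bool:
    """Return True if all participants committed, False if they were rolled back."""
    # Phase 1: poll every participant to see whether it is ready.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: everyone is ready, so issue the commit to each in turn.
        for p in participants:
            p.commit()
        return True
    # At least one participant failed to prepare: roll back every participant.
    for p in participants:
        p.rollback()
    return False
```

For example, two_phase_commit([Participant("billing"), Participant("inventory", can_prepare=False)]) returns False and leaves both participants rolled back.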

 

Figure 5. Submit order transaction demarcation

Processing control (serialized/parallel)

As noted above, transaction control may be serial, parallel, or a combination of both. Transactions defined as serialized must occur in a particular order, and are typically needed when the transaction context information from completion of one operation is a required input to another (e.g., when there is an order number noted on an inventory assignment or billing record reference). Parallel operations, on the other hand, are independent of one another and so can be executed simultaneously. For example, a travel itinerary consisting of air travel, hotel stay, and car rental may be executed in parallel, since the results of one operation are not necessary for the completion of another, as shown in Figure 6. Combined processing may occur when one part of a transaction is serialized, while other parts are parallel. An example of this type of transaction would be an order for bundled products, where third-party vendors may be independently provisioned.

 

Figure 6. Parallel transaction processing
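The difference between the two processing modes can be made concrete with a small Python sketch. The booking functions below are hypothetical stand-ins for the air, hotel, and car services in the travel example, and run_parallel and run_serial are illustrative helper names rather than part of any TCS specification.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for independent booking services.
def book_air(trip):
    return f"air:{trip}"


def book_hotel(trip):
    return f"hotel:{trip}"


def book_car(trip):
    return f"car:{trip}"


def run_parallel(operations, trip):
    """Run independent operations concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(operations)) as pool:
        futures = [pool.submit(op, trip) for op in operations]
        return [f.result() for f in futures]


def run_serial(operations, value):
    """Run operations in order, feeding each result into the next operation."""
    for op in operations:
        value = op(value)
    return value
```

A combined transaction would simply place a run_parallel call as one step inside a serial chain.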

Processing context (map-in, map-out)

Finally, the TCS requires transaction context information mappings to be assigned prior to the start of the transaction. This means that the Services Coordinator must understand each service's required information, and ensure that the necessary transaction context is provided with each service call. In the sequential example shown earlier in Figure 5, the CreateOrder service generates an OrderID that is part of the required interface on the other two services. It is up to the Services Coordinator to provide the map-in and map-out information to allow for this context information to be added to the subsequent service calls. As shown in Figure 7, a map-out would show the OrderID produced by the CreateOrder service mapped to the input map-in of the Billing and Inventory operations. This mapping allows the TCS to propagate context information to all elements of the transaction.

 

Figure 7. Transaction context mapping
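One way to picture map-in and map-out is as a pair of dictionaries that translate between a shared transaction context and each service's parameters. The sketch below assumes that representation; the function names, the context keys, and the three stand-in services are all hypothetical.

```python
def call_with_context(service, map_in, context):
    """Map-in: build the service's arguments from the transaction context."""
    kwargs = {param: context[key] for param, key in map_in.items()}
    return service(**kwargs)


def apply_map_out(result, map_out, context):
    """Map-out: copy selected fields of a service result into the context."""
    for field_name, key in map_out.items():
        context[key] = result[field_name]


# Hypothetical services standing in for CreateOrder, Billing, and Inventory.
def create_order(customer):
    return {"order_id": f"ORD-{customer}"}


def bill_order(order_id):
    return {"invoice": f"INV-{order_id}"}


def assign_inventory(order_id):
    return {"reserved_for": order_id}


context = {"customer": "42"}

# CreateOrder runs first; its map-out publishes the OrderID to the context.
order = call_with_context(create_order, {"customer": "customer"}, context)
apply_map_out(order, {"order_id": "OrderID"}, context)

# Billing and Inventory both map the shared OrderID into their own parameter.
invoice = call_with_context(bill_order, {"order_id": "OrderID"}, context)
reservation = call_with_context(assign_inventory, {"order_id": "OrderID"}, context)
```

Keeping the mappings as data rather than code means a service can be swapped out by changing only its map-in and map-out entries.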

Implementing a TCS

A service-oriented architecture consists of a set of services, and each service requires certain system resources to perform its job. A reservation service may need access to scheduling information, while a shipping service may need to call said reservation service to arrange for a particular inventory item to be sent to a customer. Services can declare their capabilities, submit this information to a registry, and provide for secure communications. The TCS can utilize this information during a transaction by locating the correct set of services, discovering their capabilities, and propagating context information to each service in turn.

Resource discovery and registration (UDDI)

There are two ways the TCS can learn which services participate in a transaction: either programmatically, by the Services Coordinator specifying the service details in the registration message to the TCS, or by looking them up in a registry. The UDDI (Universal Description, Discovery, and Integration) registry specification is designed to allow service implementers to register services along with integration information (such as security, transactional awareness, recovery, etc.).

Resource capability declaration (policy)

Services involved in transactions are required to declare their capabilities to the TCS.6 In particular, a service must declare itself capable of handling transactions, or provide a compensating service if it cannot (e.g., a CreateOrder service must have a CancelOrder service). Moreover, the capability declaration notes the security policy so that multiple services can participate in a secure transaction communication.
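As a rough illustration of that rule, a TCS could validate declarations before admitting services to a transaction. The Capability record and the undeclared helper below are assumptions made for this sketch; they do not correspond to any published policy schema, and the security policy portion is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capability:
    """Hypothetical capability declaration a service submits to the TCS."""
    service: str
    transactional: bool
    compensating_service: Optional[str] = None


def undeclared(capabilities):
    """Names of services that are neither transactional nor compensable."""
    return [
        c.service
        for c in capabilities
        if not c.transactional and c.compensating_service is None
    ]
```

A TCS following this sketch would refuse to start a transaction while undeclared(...) returns a non-empty list.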

Security (authentication/authorization)

Services are responsible for implementing security. The security policy7 defines the assertions used to create a secure access channel between the TCS and the service. The policy defines the secure protocol and credential passing required for secure interoperability between the TCS and the service.

Concurrency control in a distributed environment

In addition to the other resource management required by services, the ability to handle concurrency is critical to correct transaction management. Each service that participates in a TCS-mediated transaction must be able to ensure that changes performed during a transaction are not overwritten by other transactions.8 For example, a transaction may be started that updates a customer's order with newly purchased products. While that order is processing, another transaction is started to cancel some of the changes. If these transactions collide, then products that were expected to be cancelled will be provided instead and others that were to be provided might be cancelled -- not at all what the customer wants!

Therefore, services must implement some form of data locking/release strategy when they are participating in a transaction. This is very similar to the operations used by relational databases to maintain consistency across multiple concurrent transactions. These strategies involve checking the time-stamp of each operation and determining the correct order for scheduling, as well as when/how records are locked (e.g., optimistic vs. pessimistic locking). Finally, when resource deadlocks occur (where one service holds a lock that another requires, and vice versa), the service will need to implement some form of deadlock resolution.9
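A minimal sketch of the optimistic approach follows. It swaps the time-stamps mentioned above for simple version counters, which serve the same purpose of detecting a conflicting write, and it is a single-threaded, in-memory illustration only; VersionedStore and StaleWriteError are hypothetical names.

```python
class StaleWriteError(Exception):
    """Raised when a record changed after the caller read it."""


class VersionedStore:
    """In-memory record store using version counters for optimistic locking."""

    def __init__(self):
        self._records = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys are at version 0."""
        return self._records.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Write only if nobody else has written since expected_version was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise StaleWriteError(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._records[key] = (current_version + 1, value)
        return current_version + 1
```

In the order example, the cancelling transaction would receive a StaleWriteError if the update transaction had written first, and could then re-read the order and decide how to proceed instead of silently overwriting it.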

Recovery and retry

A last requirement for the TCS is to implement retry and recovery for when transactions fail. There are two kinds of transaction failure to consider: the first is when all of the services are transaction-capable, and the second is when one or more non-transaction-capable services are involved. In the first case, the TCS can use the two-phase commit protocol10 to manage the transaction steps. In the second, the service must have a compensating service to allow for the cancellation of the first action. A typical example of the first case is where all of the services access standard relational databases (which implement two-phase commit), or are defined with the Open Group X/Open transaction semantics.11 An example of the second (unfortunately far more common) is where an operation is committed as soon as the service completes processing -- such as for a hotel reservation using the hotel's Web service interface.
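The second case can be sketched as a list of action/compensation pairs, where each completed action is undone in reverse order if a later one fails. The function name and the pair representation are assumptions for illustration; the sketch also assumes the compensating calls themselves succeed.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; undo completed steps if one fails.

    Returns True when every action succeeded, False after compensating.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Cancel already-committed work, most recent step first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

With a travel itinerary, a failed hotel booking would trigger the air reservation's cancel operation, since the air service committed its work as soon as it finished processing.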

The TCS may also implement retry semantics. In this case, the TCS stores a long-running transaction to a durable storage device, and attempts to complete the transaction at a later time. For example, if a transaction is established where a travel itinerary is purchased, the air reservation portion may complete prior to the hotel, car, golf, dinner, cruise, etc. portions. The TCS may elect to retry any or all of the remaining transaction elements prior to failing the initial air reservation. This would be an example of a "guaranteed" transaction, where the TCS will attempt to complete the transaction by resubmitting the failed elements a set number of times.
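The resubmission loop might look like the following sketch. It keeps pending elements in memory rather than on durable storage and retries immediately rather than at a later time, so it illustrates only the "set number of times" bookkeeping; retry_failed and its parameters are hypothetical.

```python
def retry_failed(elements, max_attempts=3):
    """Resubmit failed elements up to max_attempts times.

    `elements` maps a name to a zero-argument callable returning True on
    success. Returns the set of names that never succeeded.
    """
    pending = dict(elements)
    for _ in range(max_attempts):
        # Keep only the elements whose latest attempt failed.
        pending = {name: op for name, op in pending.items() if not op()}
        if not pending:
            break
    return set(pending)
```

If the returned set is non-empty after the final attempt, the TCS would fail the transaction and cancel the elements that had already completed, such as the initial air reservation.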

Challenges to implementing transactions in service-oriented architectures

There are a number of challenges unique to implementing transaction management in a service-oriented architecture. Chief among these is the nature of services themselves: services are loosely coupled, so they tend to be stateless, asynchronous, distributed, and opaque. Stateless services are unaware of transactional state; therefore they cannot be requested to "roll-back" a set of changes if a transaction fails. If a service is implemented as a Web service, the current protocols take advantage of the asynchronous nature of the Internet; this means that a service may not respond in a timely manner to a request. For parallel operations this is not a factor, but in a serial transaction, where information from one service call may be needed by another, a late response delays every step that follows it.

Services implemented as Web services are accessible from any location, so they are by definition distributed; this affects transaction management by introducing concerns regarding latency, reliability, and security.

Finally, services are only known by a defined interface to the TCS; there are no details on the internals of the service's processing. This leads to a "black-box" usage model, where a service may utilize other services without the client's knowledge, thereby propagating changes with secondary effects. For transactions, this could mean that a service that is part of a transaction calls another service that is also part of the transaction. This could lead to significant concurrency problems as the first call may change data needed by the second, leading to a very hard problem to trace.

Given all of these challenges, creating a reliable TCS is a difficult undertaking for anyone implementing a service-oriented architecture. A TCS must be capable of managing an arbitrarily complex transaction, with nesting, concurrency, security, scheduling, and all of the other issues discussed in this article. So what is a poor service architect to do? Part Two of this article will address these issues by presenting a candidate architecture for a Transaction Control Service.