Transactions are a necessary part of nearly all business interactions.
Transactions exist to ensure that all parts of a particular business operation
are properly recorded; if any single part fails, then the transaction as a whole
should fail in order to maintain data consistency. The advent of
service-oriented architectures has added a layer of complexity to transaction
management due to the nature of Web-based services; services are often
asynchronous, stateless, distributed, and opaque. In order for businesses to
gain full value from a service-oriented approach, service developers must
understand the mechanics of transaction management, including resource
recruitment, business function coordination, concurrency control, and recovery.
This two-part article discusses the nature of, and issues surrounding, the
implementation of transactions in complex service-oriented architectures. Here
in Part One, I describe transaction management via a transaction coordination
service (TCS), which must be capable of organizing and controlling complex
business operations including resource recruitment, business function
management, concurrency control, and failure recovery. Part Two will present a
candidate architecture for a Transaction Control Service.
Automation and business operations
A hundred years ago business operations were very labor-intensive. Rooms
filled with bustling, busy clerks were required to track each business
interaction and maintain an organized and consistent accounting system for
orders, deliveries, and payments. They used manual techniques, such as
double-entry bookkeeping, to ensure that errors were detected and resolved before
they could materially affect the business. A business "transaction" was a full
accounting, from the taking of an order to the final payment from the
customer for the product or service.
Today, software automation allows business operations to be conducted nearly
unattended. Web orders are taken, charges assessed, and products
delivered without the intervention of a single human worker. Consequently, it
has become faster and cheaper for businesses to conduct operations, while
maintaining a very high level of accounting accuracy. It is also true that
transactions have become much more complex, and may involve multiple companies
in the final delivery to the client. Consider, for example, a bundled
telecommunications service where a customer purchases satellite channel access,
broadband Internet, cellular service, and Internet game subscriptions from a
telecommunications vendor. The vendor in turn has agreements with other vendors
to provide each service and share in part of the eventual payments. Without some
form of automated monitoring, it would be quite difficult (and expensive) to
manage such a complex business interaction. Thus, business operations may now
involve many independent software systems, interacting in a distributed,
asynchronous manner.
The introduction of service-oriented architectures (SOA) has taken this
interoperability to an even higher level of complexity. In a SOA-based system,
services are offered from one system to another (often from multiple competing
companies) in a loosely-coupled, platform-independent manner. These services can
provide any level of business function, from order management, to billing, to
inventory, to fulfillment. By collecting a set of services together, it is
possible to build an arbitrarily complex business operation, all seamlessly
interacting across a distributed network. The tricky part comes when information
from one operation is required by another: for instance, an order number is
needed for inventory assignment, which in turn generates a record that is used
by a billing system to compute the monthly billing statement. Ensuring data
consistency in this environment requires some form of transaction management by
a transaction coordination service (TCS). A TCS must be capable of
organizing and controlling complex business operations including resource
recruitment, business function management, concurrency control, and failure
recovery.
Coordinating distributed services
Working with distributed services requires a strong understanding of the
business functions to be automated. The price for the flexibility of a
service-oriented architecture, where services can be swapped in or out rapidly,
is a premium on organizing and controlling business activities. Each service may
only provide a small part of the overall business transaction, with results of
one operation feeding into the processing of another (such as order fulfillment
and billing). To be successful in creating a complex, interdependent series of
services, the Service Coordinator (i.e., the System Architect) must
define each service interface, the work products created and consumed by those
services, and the rules for handling exceptional conditions.
The first step is to define the business function, which is a defined part of
the business process that is supported by information automation. A
specification for declaring the business flows and participants using XML has
been recently published by a consortium of business partners (including IBM, BEA
Systems, Microsoft, SAP AG, and Siebel Systems). This specification is called
the Business Process Execution Language for Web Services (BPEL4WS)1. This specification permits the definition of
business functions in a platform independent manner, similar to the declaration
of Web services by WSDL (Web Services Description Language). Of course, the
client system that implements the BPEL definition must parse and interpret the
specification's XML tags, but this approach is much more flexible than
hard-coding the service descriptions and sequence information. Business
transactions are described by another set of standards -- WS-Transaction --
where business functions are tied together (WS-Coordination) to form either
independent units of work (WS-AtomicTransaction) or a collection of long-running
operations (WS-BusinessActivity)2.
Business function coordination
In SOA terms, a business function is defined as the collection of distributed
services that are performed to produce a defined work product. For example,
lookupOrder (see Figure 1) would be a business function that returns a defined
data structure detailing a customer's order information. A more complex example
would be finalizeOrder (see Figure 2), where a series of operations is
performed to fulfill the order and update billing information. Each activity in
a business workflow is defined by the invoked service(s), the input parameter
lists, and the output data.
Figure 1. Lookup order service
Figure 2. Finalize order service
Activity definition (services and messages)
Most services are defined by a WSDL description, as shown in Figure 3, which
includes a description of the supported message types and service binding
information.3 A service is defined by six major
elements: the data types, input/output messages, message passing
portTypes, binding protocols, binding address port, and the named
service. The message data types and structures may conform to an existing
business data transfer definition, such as one of the ebXML implementations,4 or they may be defined by each service provider.
The binding protocol is often based on SOAP (Simple Object Access Protocol) or
HTTP (Hypertext Transfer Protocol), but a service-oriented architecture does
not require that all services use a single protocol, only that each
protocol be defined. Connection port and service information is also
typically defined, although if a UDDI (Universal Description, Discovery, and
Integration) registry is used, then the activity definition may include
information used to identify a valid service from the registry.
Figure 3. WSDL service model
Functional organization (work groups)
Once the business function work units and services are defined, the next step
is to organize these activities into meaningful work groups. A business
function work group is a collection of business functions that have an overall
unifying purpose.5 For example, an
OrderManagementGroup, as shown in Figure 4, would be responsible for
business activities involved in searching, retrieving, creating, modifying, and
canceling customer orders. Each of these activities may be performed by an
ordered collection of services (such that an order creation business function
first checks to see that the order doesn't already exist), involving multiple
passed messages. Organization of activities also provides the starting point for
transaction demarcation (discussed below), where particular business operations
are performed in the context of a transaction. Work groups may be further
connected into chained operations, which are performed either serially
(such as order creation, processing, and fulfillment) or in parallel
(such as inventory assignment, bill generation, and account updating).
Figure 4. Service groups (service partition)
Coordination points (transaction demarcation)
Once the business activities are determined, and the set of services is
defined and grouped, the next step is to indicate transaction demarcation
points. Not all business operations require transaction management, only those
containing multiple operations that must be coordinated so that either all of
them complete successfully or none do. In the example shown in Figure 5, the
LookupCustomer and CreateCustomerRecord services are not wrapped in a
transaction, but the FinalizeCustomerOrder service enters into a set of
services demarcated by the Create Order Transaction. In this example, the
FinalizeCustomerOrder service uses the transaction coordination service (TCS)
to control the sequential operations of four other services, and supplies the
TCS with the messages necessary for each operation (shown as attachments to the
FinalizeCustomerRequest).
The transaction coordination diagram is used to show the constituents of a
defined transaction, the order of processing (sequential vs. parallel), and
nested transactions. In the example in Figure 5, the
GenerateShippingRequest is a nested transaction created by the Assign
Inventory Item service. Nested transactions may occur for many reasons; for
example, a service may be designed to be used either independently or as part of a larger
transaction context. Note that the same TCS can be used by nested transactions,
ensuring that if the nested transaction fails, the parent transaction will also
fail.
At the conclusion of the transaction process, the TCS polls all the
transaction participants to determine if the transaction should be committed. In
a two-phase commit protocol, the TCS first polls the participants to see if they
are ready, and then issues the commit to each in turn. If a transaction
participant fails to prepare, then all of the other participants are issued a
roll-back command. Where participating services are not
transaction-aware (e.g., stateless services), the TCS instead invokes a
compensating operation (e.g., "cancel").
Figure 5. Submit order transaction demarcation
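The two-phase commit polling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Participant class and its prepare, commit, and rollback methods are hypothetical stand-ins for transaction-aware services, not part of any WS-Transaction API, and the sketch ignores timeouts and coordinator failure.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical transaction-aware service (illustration only)."""
    name: str
    can_prepare: bool = True              # simulates this service's prepare vote
    log: list = field(default_factory=list)

    def prepare(self) -> bool:
        self.log.append("prepare")
        return self.can_prepare

    def commit(self) -> None:
        self.log.append("commit")

    def rollback(self) -> None:
        self.log.append("rollback")


def two_phase_commit(participants) -> bool:
    """Return True if all participants committed, False if the work was rolled back."""
    # Phase 1: poll every participant to see whether it is ready to commit.
    votes = [(p, p.prepare()) for p in participants]

    if all(ready for _, ready in votes):
        # Phase 2: everyone is ready, so issue the commit to each in turn.
        for p in participants:
            p.commit()
        return True

    # At least one participant failed to prepare: roll back all the others.
    for p, ready in votes:
        if ready:
            p.rollback()
    return False
```

Calling two_phase_commit with one participant whose can_prepare is False returns False and leaves every other participant with a "rollback" entry in its log.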
Processing control (serialized/parallel)
As noted above, transaction control may be serial, parallel, or a combination
of both. Transactions defined as serialized must occur in a particular order,
and are typically needed when the transaction context information from
completion of one operation is a required input to another (e.g., when there is
an order number noted on an inventory assignment or billing record reference).
Parallel operations, on the other hand, are independent of one another and so
can be executed simultaneously. For example, a travel itinerary consisting of
air travel, hotel stay, and car rental may be executed in parallel, since the
results of one operation are not necessary for the completion of another, as
shown in Figure 6. Combined processing may occur when one part of a transaction
is serialized, while other parts are parallel. An example of this type of
transaction would be an order for bundled products, where third-party vendors
may be independently provisioned.
Figure 6. Parallel transaction processing
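The difference between the two processing modes can be illustrated with a short Python sketch. The booking and order functions below are hypothetical placeholders for remote service calls, and a thread pool stands in for whatever dispatch mechanism a real TCS would use.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical independent services: none needs another's result.
def book_air(trip):
    return "air:" + trip


def book_hotel(trip):
    return "hotel:" + trip


def book_car(trip):
    return "car:" + trip


# Hypothetical dependent services: the second needs the first's output.
def create_order(customer):
    return "order-for-" + customer


def assign_inventory(order_id):
    return "inventory-for-" + order_id


def run_parallel(operations, argument):
    """Run independent operations concurrently; results keep submission order."""
    with ThreadPoolExecutor(max_workers=len(operations)) as pool:
        futures = [pool.submit(op, argument) for op in operations]
        return [f.result() for f in futures]


def run_serial(operations, initial):
    """Run operations in order, feeding each result into the next call."""
    value = initial
    for op in operations:
        value = op(value)
    return value
```

A combined transaction would mix the two, for example by calling run_serial for order creation and then run_parallel for the independently provisioned vendors.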
Processing context (map-in, map-out)
Finally, the TCS requires transaction context information mappings to be
assigned prior to the start of the transaction. This means that the Service
Coordinator must understand each service's required information, and ensure that
the necessary transaction context is provided with each service call. In the
sequential example shown earlier in Figure 5, the CreateOrder service
generates an OrderID that is part of the required interface on the other two
services. It is up to the Service Coordinator to provide the map-in and map-out
information to allow for this context information to be added to the subsequent
service calls. As shown in Figure 7, a map-out would show the OrderID produced
by the CreateOrder service mapped to the input map-in of the Billing and
Inventory operations. This mapping allows the TCS to propagate context
information to all elements of the transaction.
Figure 7. Transaction context mapping
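One way to picture map-in and map-out is as a pair of dictionaries attached to each service call. The following Python sketch assumes hypothetical service functions and context key names; only CreateOrder and OrderID come from the example above.

```python
# Hypothetical services; each returns a dictionary of named outputs.
def create_order(customer):
    return {"OrderID": "ORD-1", "Customer": customer}


def bill_order(order_id):
    return {"InvoiceID": "INV-" + order_id}


def assign_inventory(order_id):
    return {"Reserved": True, "OrderRef": order_id}


def run_with_context(steps, context):
    """Run each step, mapping context values in and service outputs out."""
    for service, map_in, map_out in steps:
        # map-in: service parameter name -> context key that supplies it
        arguments = {param: context[key] for param, key in map_in.items()}
        result = service(**arguments)
        # map-out: service output field -> context key that receives it
        for output_field, key in map_out.items():
            context[key] = result[output_field]
    return context


STEPS = [
    (create_order, {"customer": "customer"}, {"OrderID": "order_id"}),
    (bill_order, {"order_id": "order_id"}, {"InvoiceID": "invoice_id"}),
    (assign_inventory, {"order_id": "order_id"}, {"OrderRef": "inventory_order_ref"}),
]

context = run_with_context(STEPS, {"customer": "C-42"})
```

Because the mappings are data rather than code, the coordinator can assign them before the transaction starts, as the text requires.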
Implementing a TCS
A service-oriented architecture consists of a set of services, and each
service requires certain system resources to perform its job. A reservation
service may need access to scheduling information, while a shipping service may
need to call that reservation service to arrange for a particular inventory item
to be sent to a customer. Services can declare their capabilities, submit this
information to a registry, and provide for secure communications. The TCS can
utilize this information during a transaction by locating the correct set of
services, discovering their capabilities, and propagating context information to
each service in turn.
Resource discovery and registration (UDDI)
There are two ways the TCS can learn which services participate in a
transaction: either programmatically, with the Service Coordinator specifying the
service details in the registration message to the TCS, or by looking them up in
a registry. The UDDI (Universal Description, Discovery, and Integration) registry
specification is designed to allow service implementers to register services
along with integration information (such as security, transactional awareness,
and recovery).
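The two discovery paths can be sketched as a single resolution step. In this Python illustration the registry is a plain dictionary standing in for a UDDI lookup, and the function name and endpoint values are assumptions made for the example; letting explicit details take precedence over the registry is also a design choice of the sketch, not something the specifications mandate.

```python
def resolve_participants(names, explicit, registry):
    """Return {service name: endpoint} for every participant in a transaction.

    explicit: details supplied in the registration message (hypothetical shape).
    registry: stand-in for a registry lookup, here just a dictionary.
    """
    resolved = {}
    for name in names:
        if name in explicit:
            # Programmatic path: the coordinator supplied the details directly.
            resolved[name] = explicit[name]
        elif name in registry:
            # Registry path: look the service up by name.
            resolved[name] = registry[name]
        else:
            raise LookupError("no endpoint known for service " + repr(name))
    return resolved
```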
Resource capability declaration (policy)
Services involved in transactions are required to declare their capabilities
to the TCS.6 In particular, a service must either
declare itself capable of handling transactions or provide a compensating
service if it cannot (e.g., a CreateOrder service must have a matching
CancelOrder service). Moreover, the capability declaration states the security
policy so that multiple services can participate in a secure transaction
communication.
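A capability check of this kind might look like the following Python sketch. The Capability record and its field names are hypothetical; a real declaration would be expressed in a policy document rather than a Python object, and security policy is omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capability:
    """Hypothetical capability declaration submitted by a service."""
    service: str
    transactional: bool
    compensating_service: Optional[str] = None


def invalid_participants(capabilities):
    """Return services that are neither transactional nor compensable."""
    return [
        c.service
        for c in capabilities
        if not c.transactional and c.compensating_service is None
    ]


declared = [
    # Not transaction-aware, but declares a compensating service.
    Capability("CreateOrder", transactional=False, compensating_service="CancelOrder"),
    # Transaction-aware, so no compensating service is needed.
    Capability("AssignInventory", transactional=True),
    # Neither: this hypothetical service cannot safely join a transaction.
    Capability("NotifyCustomer", transactional=False),
]
```

Here invalid_participants(declared) returns only the hypothetical NotifyCustomer service, which the TCS could then reject before the transaction starts.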
Security (authentication/authorization)
Services are responsible for implementing security. The security policy7 defines the assertions used to create a secure
access channel between the TCS and the service. The policy defines the secure
protocol and credential passing required for secure interoperability between the
TCS and the service.
Concurrency control in a distributed environment
In addition to other resource management required by services, the ability to
handle concurrency is critical to correct transaction management. Each service
that participates in a TCS-mediated transaction must be able to ensure that
changes performed during a transaction are not overwritten by other
transactions.8 For example, a transaction may be
started that updates a customer's order with newly purchased products. While that
order is processing, another transaction is started to cancel some of the
changes. If these transactions collide, then products that were expected to be
cancelled may be provided instead, and others that were to be provided may be
cancelled -- not at all what the customer wants!
Therefore, services must implement some form of data locking/release strategy
when they are participating in a transaction. This is very similar to the
operations used by relational databases to maintain consistency across multiple
concurrent transactions. These strategies involve checking the time-stamp of
each operation and determining the correct order for scheduling, as well as
when/how records are locked (e.g., optimistic vs. pessimistic locking). Finally,
when resource deadlocks occur (where one service holds a lock that another
requires, and vice versa), the service will need to implement some form of
deadlock resolution.9
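As one illustration of these strategies, optimistic locking can be sketched with a version number on each record: a write succeeds only if the record has not changed since the writer read it. The OrderStore class below is a hypothetical, single-threaded, in-memory stand-in for a service's data store, not a production locking scheme.

```python
class VersionConflict(Exception):
    """Raised when a record changed after the writer last read it."""


class OrderStore:
    """Hypothetical in-memory store using optimistic (version-based) locking."""

    def __init__(self):
        self._records = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys read as version 0."""
        return self._records.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Store value only if the record is still at expected_version."""
        current_version, _ = self._records.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                "%s is at version %d, expected %d"
                % (key, current_version, expected_version)
            )
        new_version = current_version + 1
        self._records[key] = (new_version, value)
        return new_version
```

If an "add products" transaction and a "cancel" transaction both read the same version of an order, the first write succeeds and the second raises VersionConflict, so the colliding transaction must re-read and retry rather than silently overwrite the other's changes.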
Recovery and retry
A last requirement for the TCS is to implement retry and recovery for when
transactions fail. There are two kinds of transaction failure to consider: the
first is when all of the services are transaction-capable, and the second is
when one or more non-transaction-capable services are involved. In the first
case, the TCS can use the two-phase commit protocol10 to manage the transaction steps. In the second,
the service must have a compensating service to allow for the cancellation of
the first action. A typical example of the first case is where all of the
services access standard relational databases (which implement two-phase
commit), or are defined with the Open Group X/Open transaction semantics.11 An example of the second (unfortunately far more
common) is where an operation is committed as soon as the service completes
processing -- such as for a hotel reservation using the hotel's Web-service
interface.
The TCS may also implement retry semantics. In this case, the TCS stores a
long-running transaction to a durable storage device, and attempts to complete
the transaction at a later time. For example, if a transaction is established
where a travel itinerary is purchased, the air reservation portion may complete
prior to the hotel, car, golf, dinner, cruise, etc. portions. The TCS may elect
to retry any or all of the remaining transaction elements prior to failing the
initial air reservation. This would be an example of a "guaranteed" transaction,
where the TCS will attempt to complete the transaction by resubmitting the
failed elements a set number of times.
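The retry behavior, combined with compensation for services that commit immediately, can be sketched as follows. This Python illustration makes several assumptions: each step is a hypothetical (action, compensate) pair of callables, failures surface as RuntimeError, and the durable storage of the long-running transaction is left out entirely.

```python
def run_with_retry(steps, max_attempts=3):
    """Run each (action, compensate) step, retrying failures a set number of times.

    Returns True if every step completed. If a step still fails after
    max_attempts tries, the compensating operations of the steps that already
    completed are invoked in reverse order and False is returned.
    """
    completed = []  # compensating operations for steps that succeeded
    for action, compensate in steps:
        for _ in range(max_attempts):
            try:
                action()
            except RuntimeError:
                continue  # resubmit the failed element
            completed.append(compensate)
            break
        else:
            # Retries exhausted: cancel the earlier work, newest first.
            for undo in reversed(completed):
                undo()
            return False
    return True
```

In the travel example, the air reservation would be the first step; it is cancelled through its compensating operation only after the retries for a later element, such as the hotel, have been used up.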
Challenges to implementing transactions in service-oriented architectures
There are a number of challenges unique to implementing transaction
management in a service-oriented architecture. Chief among these is the nature
of services themselves: services are loosely-coupled, so they tend to be
stateless, asynchronous, distributed, and opaque. Stateless services are unaware
of transactional state; therefore they cannot be requested to "roll-back" a set
of changes if a transaction fails. If a service is implemented as a Web service,
the current protocols take advantage of the asynchronous nature of the Internet;
this means that a service may not respond in a timely manner to a request. For
parallel operations this is not a factor, but in a serial transaction, where
information from one service call may be needed by another, a late response
delays every step that depends on it.
Services implemented as Web-services are accessible from any location, so
they are by definition distributed; this affects transaction management by
introducing concerns regarding latency, reliability, and security.
Finally, services are only known by a defined interface to the TCS; there are
no details on the internals of the service's processing. This leads to a
"black-box" usage model, where a service may utilize other services without the
client's knowledge, thereby propagating changes with secondary effects. For
transactions, this could mean that a service that is part of a transaction calls
another service that is also part of the transaction. This could lead to
significant concurrency problems, because the first call may change data needed
by the second, and the resulting fault is very hard to trace.
Given all of these challenges, creating a reliable TCS is a difficult
undertaking for anyone implementing a service-oriented architecture. A TCS must
be capable of managing an arbitrarily complex transaction, with nesting,
concurrency, security, scheduling, and all of the other issues discussed in this
article. So what is a poor service architect to do? Part Two of this article
will address these issues by presenting a candidate architecture for a
Transaction Control Service.