Review

TRENDS in Biotechnology Vol.19 No.2 February 2001


Multivariate statistical monitoring of batch processes: an industrial case study of fermentation supervision

This article describes the development of Multivariate Statistical Process Control (MSPC) procedures for monitoring batch processes and demonstrates their application to industrial tylosin biosynthesis. Currently, the main fermentation phase is monitored using univariate statistical process control principles implemented within the G2 real-time expert system package. This development addresses the integration of the various process stages into a single monitoring system and the observation of interactions among individual variables through the use of multivariate projection methods. The benefits of this approach are discussed from an industrial perspective.

Sarolta Albert∗ and Robert D. Kinley Eli Lilly and Company Limited, Speke Operations, Fleming Road, Liverpool, UK L24 9LN e-mail: [email protected]∗, [email protected]

The biosynthetic production of secondary metabolites has always posed challenges to scientists and engineers. Reducing unusual variations in process performance can potentially improve performance and quality1, as well as process understanding. The task of lowering variability was addressed through the development and application of advanced techniques, namely EXPERT SYSTEMS (see Glossary) complemented with data-based modelling approaches. The expert system approach aims to replicate the reasoning of the operating staff who have traditionally supervised the process and who, through their individual experience and perception, have developed a decision-making practice that is essential for maintaining the pre-specified conditions. Such knowledge can be expressed in the form of 'if-then rules', which are consciously used in operation (a hypothetical example is sketched below). Recently, it was shown that a correct and complete set of such rules can be reproduced through the use of a new Knowledge Acquisition Technique (KAT; Ref. 2). The development of a fermentation knowledge base is outside the scope of this article and is reported elsewhere3. Although rules are useful for detecting deviations in individual variables, the INTERACTIONS (see Glossary) between measured process variables are usually important, complex and not always fully understood. Simultaneous, combined effects of variables might lead to variation in performance that remains hidden when univariate rule-based approaches are in place. Multivariate techniques are effective in detecting such deviations and in extracting information from the process data itself, thus providing an alternative to knowledge-based approaches. Furthermore, the only requirement for their use is historical data, often a widely available and under-utilized resource in today's industry.
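To make the 'if-then' encoding concrete, the following is a minimal sketch of how one such rule might be written down; the variable names, thresholds and advisory text are hypothetical and are not taken from the plant's actual knowledge base.

# Minimal sketch of an 'if-then' supervisory rule, as discussed above.
# All variable names, thresholds and messages are hypothetical.

def check_dissolved_oxygen(do_percent, agitation_rpm):
    """Return an advisory message if the rule fires, otherwise None."""
    # IF dissolved oxygen is low AND agitation is already high,
    # THEN advise the operator to investigate the air supply.
    if do_percent < 20.0 and agitation_rpm > 180.0:
        return "Low dissolved oxygen at high agitation: check airflow and sparger."
    return None

print(check_dissolved_oxygen(do_percent=15.0, agitation_rpm=200.0))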

Although multivariate methods have recently emerged as a leading-edge technology in the chemical industries, more traditional batch industries have not been able to adopt this approach because of the inherent process dynamics and nonlinearities associated with batch processing. As no commercial package was available for batch process applications, this research involved the development of a prototype, comprehensive Batch MSPC tool capable of turning raw data into information that could potentially lead towards improved processes.

Process description and data availability

Tylosin fermentation was chosen as an example of a complex secondary metabolite production process. Tylosin production, as with most fermentation processes, involves various stages before and after the main fermentation, which has traditionally been the favoured area for improvements. However, deviations that influence final productivity might occur before this final stage, and it is therefore clearly beneficial to also focus on the operations preceding the main fermentation. Tylosin production starts with the mixing of raw materials of natural origin, which provide essential complex substrates. The pH of the medium is adjusted before it is transferred to the previously sterilized seed vessel. Following inoculation with the carefully prepared culture, the seed fermentation is allowed to grow sufficient microorganisms to inoculate the main fermentor vessels. The medium for the main fermentors is prepared, sterilized and inoculated in a similar manner to the seed. Many factors are believed to influence the above process, several of which are recorded off-line or on-line throughout the ~6-day-long fermentation. Productivity, the indicator of successful operation, is not available until the fermentation is terminated. Data from 144 fermentations were collected, covering all stages of the process. Most stages were represented by off-line measurements, with the exception of the main fermentation, where a data historian stores several computer-logged variables throughout the batch duration, such as pH, temperature, respiratory data, pressure, agitation rate, airflow and dissolved oxygen, some of which are controlled around a setpoint.
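To illustrate how such a dataset might be organized for batch-wise analysis, the sketch below stacks the computer-logged variables into a batches × time × variables array; the file layout, column names and common time grid are assumptions made for this example and do not describe the actual data historian.

# Illustrative sketch only: organizing on-line batch records into a
# batches x time x variables array suitable for batch MSPC modelling.
# File names, column names and the common time grid are hypothetical.
import numpy as np
import pandas as pd

ONLINE_VARS = ["pH", "temperature", "pressure", "agitation_rate",
               "airflow", "dissolved_oxygen"]          # logged variables
TIME_GRID = np.arange(0, 144, 1.0)                     # hours, ~6-day batch

def load_batch(csv_path):
    """Read one batch and resample it onto the common time grid."""
    df = pd.read_csv(csv_path)                         # expects a 'time_h' column
    resampled = {
        var: np.interp(TIME_GRID, df["time_h"], df[var]) for var in ONLINE_VARS
    }
    return pd.DataFrame(resampled, index=TIME_GRID)

def build_array(csv_paths):
    """Stack all batches into an array of shape (batches, time, variables)."""
    return np.stack([load_batch(p).to_numpy() for p in csv_paths])

# Example (hypothetical file names):
# X = build_array([f"batch_{i:03d}.csv" for i in range(144)])
# print(X.shape)   # (144, 144, 6)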



Glossary

Expert Systems: A supervisory control system that makes use of expert knowledge in the form of rules to advise operators on process problems and control actions. The simple logic of associative decision rules is easily understood and accepted by humans; therefore 'if-then' rules are often used in this context, in the form 'If X is true and Y is false, conclude class 1'.

Interactions: Chemical and biological systems are often highly complex and present complex, non-linear and dynamic dependence structures amongst the parameters that can be measured in order to observe such systems. Mathematically, these relationships can be approximated via kinetic expressions (if known) or, if no prior knowledge is available, inferred from data. The most commonly used measure of dependence between variables is the correlation coefficient, which measures linear interactions (collinearity). This assumption is adopted by using the covariance matrix of multivariable data when applying PCA.

Artificial neural networks (ANNs): A modelling methodology that attempts to construct approximations of process behaviour by integrating many processing units (neurons), which interact in a certain manner to provide a powerful means of approximation. ANNs 'learn' the approximation (process model) by repeated exposure to process data.

SCREE test: A simple visual method to determine the optimal number of principal components. It involves plotting the eigenvalues against the number of principal components and identifying the point at which the slopes of the lines joining the plotted points are 'steep' to the left and 'not steep' to the right. This point is suggested to be the optimal number of principal components.

Central limit theorem (CLT): The CLT implies that the sum of n independently distributed random variables is approximately normally distributed, regardless of the distributions of the individual variables.

Partial least squares regression (PLS): Conceptually, PLS is similar to PCA, with the difference that the process outputs (Y) are projected onto a reduced space simultaneously with the process data (X). As PLS is primarily a regression technique, the aim is to find linear combinations of the input and output variables that describe the maximum amount of correlation between the inputs and outputs, that is, not only to explain the variation in X but that part of the variation in X which is most predictive of Y.

G2: Recent developments in sophisticated supervisory systems, such as real-time knowledge-based systems (RTKBS), are software tools that provide the opportunity to implement process knowledge in the form of rules and/or algorithmic procedures. They communicate with control systems in real time and provide advice to assist operation. The benefit of RTKBS comes from the ease with which information can be encoded. Major companies involved in fermentation have reported applications of the G2 RTKBS from Gensym, which allows information to be coded in English, greatly easing the problems of implementation and long-term maintenance.
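As a minimal illustration of the SCREE test entry above, the plot can be generated as follows; the eigenvalues are invented purely for the example.

# Minimal sketch of a SCREE plot, as described in the Glossary.
# The eigenvalues below are invented purely for illustration.
import matplotlib.pyplot as plt

eigenvalues = [4.1, 2.3, 1.1, 0.4, 0.2, 0.1, 0.05]   # hypothetical, sorted

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Principal component number")
plt.ylabel("Eigenvalue (variance explained)")
plt.title("SCREE plot: retain components to the left of the 'elbow'")
plt.show()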

Before any modelling, a few outliers were removed and replaced with linearly interpolated values in the case of the on-line logged variables. Interpolation is not a viable option in the case of missing off-line assays, and such missing data were therefore excluded from the database. Noise filtering was not carried out because, as a consequence of Principal Component Analysis (PCA; see next section), MSPC models themselves filter small variations from the data. Productivity data were available for each batch and were used as an indicator of performance when the data were subgrouped before modelling.
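A minimal sketch of this pre-treatment step for a single on-line variable is shown below; because the article does not state how outliers were identified, the detection rule (a robust z-score against the median) and the example trace are assumptions made for illustration.

# Illustrative sketch: replace outlying points in an on-line logged variable
# with linearly interpolated values, as described in the text.
# The detection rule (robust z-score) and the example series are hypothetical.
import pandas as pd

def replace_outliers_with_interpolation(series, threshold=5.0):
    """Flag points far from the median (robust z-score) and interpolate over them."""
    median = series.median()
    mad = (series - median).abs().median()             # median absolute deviation
    robust_z = (series - median) / (1.4826 * mad)
    cleaned = series.mask(robust_z.abs() > threshold)  # set outliers to NaN
    return cleaned.interpolate(method="linear")        # fill by linear interpolation

# Example with a made-up dissolved-oxygen trace containing one spike:
do_trace = pd.Series([55.0, 54.8, 55.2, 250.0, 55.1, 54.9])
print(replace_outliers_with_interpolation(do_trace))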

Principles of MSPC

The principles of MSPC are published widely4–6 and therefore only a brief summary is given here. PCA involves finding the eigenvalues of the sample covariance matrix, which are the variances of the principal components. For a normalized (mean-centred, variance-scaled) sample matrix X [n, m], with n samples and m variables, PCA will find m uncorrelated new variables, the variance of which decreases from first to last. Let the new variables (the principal component scores) be represented by t_i, the i-th of which is given by:

t_i = \sum_{j=1}^{m} X_j \, p_{ji}    (1)

The first principal component, t_1, is found by selecting the loadings p_i so that t_1 has the largest possible variance, subject to the condition shown in Eqn 2:

\sum_{i=1}^{m} p_i^2 = 1    (2)
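The step from this constrained maximization to the eigenvalue statement that follows is standard, although not spelled out in the article; under the stated notation it can be sketched as:

% Standard derivation (not in the original article) linking Eqns 1-2 to the
% eigendecomposition of the covariance matrix C.
\max_{p_1} \operatorname{var}(t_1) = \max_{p_1} \; p_1^{T} C \, p_1
\quad \text{subject to} \quad p_1^{T} p_1 = 1
% Introducing a Lagrange multiplier \lambda and setting the gradient to zero:
\frac{\partial}{\partial p_1} \left( p_1^{T} C \, p_1 - \lambda \left( p_1^{T} p_1 - 1 \right) \right) = 0
\quad \Longrightarrow \quad C \, p_1 = \lambda \, p_1
% Hence p_1 is an eigenvector of C, and var(t_1) = p_1^T C p_1 = \lambda is the
% corresponding (largest) eigenvalue.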

The sample covariance matrix takes the following form:

C = \operatorname{cov}(X) =
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1m} \\
c_{21} & c_{22} & \cdots & c_{2m} \\
\vdots & \vdots &        & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mm}
\end{pmatrix}    (3)

In Eqn 3, c_{ij} is the covariance between variables X_i and X_j, and the diagonal element c_{ii} is the variance of X_i. The variances of the individual principal components are the eigenvalues of the matrix C, and the sum of the eigenvalues is equal to the sum of the variances of the original variables. For m input variables there will be m principal components, some of which might be negligible if the original variables were correlated or collinear. By retaining only the first r principal components, the X matrix is approximated according to the following equation:

X = \sum_{i=1}^{r} t_i \, p_i^{T} + E    (4)

In Eqn 4, E is the residual matrix, p [m, r] are the loadings and t [n, r] are the scores; the rest of this article refers to scores as t [n, r] and loadings as p [m, r]. Ideally, the dimension r is chosen such that no significant information is left in E. The transformation gives the transformed data (scores) several desirable mathematical and statistical properties, enabling the derivation of statistical confidence limits. This is a very significant benefit, addressing the major shortcomings of univariate statistical process control, namely the neglect of interactions between variables and the difficulty of simultaneously interpreting numerous control charts (m > 20). If the original variables are correlated, a reduced number of control charts can be achieved (r < m).
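The following is a minimal numerical sketch of Eqns 1–4, obtained by eigendecomposition of the covariance matrix of normalized data; the random test matrix and the choice of r = 2 are arbitrary and do not represent the fermentation data described in this article.

# Minimal sketch of PCA as summarized in Eqns 1-4: normalize X, form the
# covariance matrix C, take its eigendecomposition, keep r components and
# split X into a reconstruction plus residual E. The data are random and
# purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 6, 2                                    # samples, variables, retained PCs
X_raw = rng.normal(size=(n, m))

# Normalize: mean-centre and variance-scale each variable.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

# Eqn 3: sample covariance matrix of the normalized data.
C = np.cov(X, rowvar=False)

# Eigendecomposition: eigenvalues are the variances of the principal components.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                     # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eqns 1-2: loadings p [m, r] (unit length) and scores t [n, r].
p = eigvecs[:, :r]
t = X @ p

# Eqn 4: X = sum_i t_i p_i' + E, so the residual E is what the r PCs miss.
X_hat = t @ p.T
E = X - X_hat

print("Sum of eigenvalues ~ total variance:", eigvals.sum(), "vs", m)
print("Variance of first score ~ largest eigenvalue:",
      t[:, 0].var(ddof=1), "vs", eigvals[0])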