# Multiple Linear Regression (Multivariate Analysis)

It's a black box. All you know is that you have multiple process inputs (X) and at least one process output (Y) that you care about. Multivariate analysis is the method by which you analyze how Y varies with your multiple inputs (x1, x2, ... xn). There are a lot of ways to go about figuring out how Y relates to those inputs.

One way to go is to turn that black box into a transparent box, where you try to understand the fundamentals from first principles. Say you identify x1 as cell growth and believe that your cells grow exponentially; you can try to apply an equation like Y = Y0·e^(µ·x1).

But this is large-scale manufacturing. You don't have time for that. You have to supply management with an immediate solution followed by a medium-term solution. What you can do is assume that each parameter varies with Y linearly.

Just like we learned in 8th grade. How can we just say that Y relates to X linearly? Well, for one, I can say whatever I want (it's a free country). Secondly, all curves (exponential, polynomial, logarithmic, asymptotic...) are linear over small ranges... you know, like the proven acceptable range in which you ought to be controlling your manufacturing process.

Assuming everything is linear keeps things simple and happens to be rooted in manufacturing reality. What next?

Next you start adding more inputs to your equation... applying a different coefficient for each new input. And if you think that a few of your inputs may interact, you can add their interactions like this:

Y = b + m1·x1 + m2·x2 + ... + m12·(x1·x2) + ...

You achieve interactions by multiplying the inputs and giving that product its own coefficient. So now you - the big nerd - have this humongous equation that needs solving. You don't know:
• Which inputs (x's) to put in the equation
• Which interactions (x1 * x2) to put in the equation
• Which coefficients (m's) to keep

What you're doing with multiple linear regression is picking the right inputs and interactions so that your statistical software package can brute-force the coefficients (m's) and fit an equation to your data with the least error.

Here's the thing: The fewer rows you have in your data table, the fewer inputs you get to throw into your equation. If you have 10 samples, but 92 inputs, you're going to have to be very selective with what you try in your model.
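The fitting step can be sketched in a few lines of Python. Everything here is invented for illustration - the inputs, the "true" coefficients, and the noise - but it shows how a software package brute-forces the m's by least squares, interaction term included:

```python
import numpy as np

# Hypothetical data: 10 batches, two inputs (x1, x2) and one output (Y).
rng = np.random.default_rng(0)
x1 = rng.uniform(30, 40, size=10)    # e.g. a temperature-like input
x2 = rng.uniform(6.8, 7.2, size=10)  # e.g. a pH-like input
Y = 2.0 * x1 - 5.0 * x2 + 0.3 * x1 * x2 + rng.normal(0, 0.1, size=10)

# Design matrix: intercept, main effects, and one interaction (x1*x2),
# each getting its own coefficient.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])

# Least squares finds the coefficients (m's) that minimize the error.
coeffs, residuals, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
b, m1, m2, m12 = coeffs
pred = X @ coeffs
```

With only 10 rows and 4 terms this already illustrates the sample-size squeeze: every extra input or interaction eats a degree of freedom.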

It's a tough job, but someone's got to do it. And when you finally do (i.e. explain the relationship between, say, cell culture titer and your cell culture process inputs), millions of dollars can literally roll into your company's coffers.

Your alternative is to hire Zymergi and skip that learning curve.

# Cornell ChemE Prof Makes Glycoproteins Without CHO Cell Culture

Just out yesterday:

Cornell Chemical Engineering Professor Produces Glycoproteins with E.Coli

The thrust of the story is that a Cornell professor, Matthew DeLisa, has figured out how to make glycoproteins using E.Coli. Readers of my blog know that the reason we humans use Chinese Hamster Ovary (CHO) cells to make human glycoproteins is because only mammalian cells can do post-translational glycosylation while the simpler bacteria cannot.

This is where Dr. DeLisa says, "Not anymore."

You'll have to read his Nature paper (included at the bottom) if you want to get wonkish, but described in lay terms, the Cornell research team added:

• Four enzymes from yeast cells:
1. Uridine diphosphate-N-acetylglucosamine transferases Alg13
2. and Alg14
3. Mannosyltransferases Alg1
4. and Alg2
• One bacterial enzyme from Campylobacter jejuni: oligosaccharyltransferase PglB

to E.coli cultures to get the desired glycan structures. All this, I presume, involves more than cleverness and advanced pipetting skills.

Sounds quite promising, as the tech is being commercialized through a startup called Glycobia. But as science is skeptical, so is this fermentation engineer (of the commercial value of this venture).

The process economics of CHO vs. E.coli do not clearly point in the direction of bacterial cultures. If you look at two nearly identical drugs, Lucentis (made with E.coli) vs. Avastin (made with CHO), you see a drastically higher cost for the E.coli product. Lucentis is the Fab region of the antibody while Avastin is the whole antibody (Fab + Fc), and the costs are \$2,000 vs. \$150 (assuming the same concentration of API gets the job done). The markup has little to do with Genentech being greedy.

A quick Google search of "cho vs e coli process economics" will get you to a book by Ajit Sadana (1998) on Bioseparation of proteins. Starting on page 66, he goes through an example of Activase (tPA) made in CHO vs. E.coli.

CHO had 5 steps while E.coli had 16. CHO had a 47% yield while E.coli had a 2.8% yield... primarily due to the extra recovery steps to remove the endotoxin that E.coli creates that CHO does not. Sure, this example is 1998 technology talking about (likely small scale) purification, but I have yet to see the process economics work in favor of E.coli for biologics.
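The yield gap is what per-step losses do when compounded. A back-of-the-envelope sketch (the per-step yields are my assumption, chosen only because they roughly reproduce the totals Sadana reports):

```python
# Overall yield is the product of per-step yields; small per-step losses
# compound quickly as the step count grows.
def overall_yield(per_step_yield: float, n_steps: int) -> float:
    return per_step_yield ** n_steps

# Illustrative only: ~86% per step over 5 steps gives ~47% overall,
# while ~80% per step over 16 steps collapses to ~2.8%.
print(round(overall_yield(0.86, 5), 3))   # ~0.47
print(round(overall_yield(0.80, 16), 3))  # ~0.028
```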

If contamination concerns (like the Genzyme Allston plant cited) are the main cost avoidance, I'm going to come out and say that this research will remain academic: bioreactor contaminations are easy to prevent when management is committed to contamination reduction.

If replacing the known quantity that is CHO with an unknown quantity that is E.coli + Bottom-Up Glycoengineering (BUG) technology is all we get (i.e. without orders-of-magnitude increases in culture titers or reductions in variability), then my money is with CHO.

# How Manufacturing Sciences Works

The Manufacturing Sciences laboratory and data groups interact like this:

Favorable special cause signals at large scale give us opportunities for finding the significant factors and interactions that produced these special causes. With a significant correlation (for cell culture: adjusted R² > 0.65 and p < 0.05), we are able to justify expending lab resources to test our hypothesis.
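That screening step can be sketched as follows. The factor names, data, and the correlated/uncorrelated split are all made up; the R² cutoff comes from the text, and in practice you would pair it with a p-value from a stats package:

```python
import numpy as np

R2_THRESHOLD = 0.65  # cell culture screening cutoff from the text

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """R^2 of a simple linear fit of y against x."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r * r)

# Hypothetical large-scale data: screen candidate factors against titer.
rng = np.random.default_rng(1)
titer = np.linspace(1.0, 2.0, 20) + rng.normal(0, 0.05, 20)
factors = {
    "seed_density": np.linspace(1.0, 2.0, 20) + rng.normal(0, 0.1, 20),
    "media_lot_age": rng.normal(30, 5, 20),  # pure noise, no relationship
}

candidates = {name: r_squared(x, titer) for name, x in factors.items()}
# Only factors clearing the cutoff justify spending lab resources.
worth_testing = [name for name, r2 in candidates.items() if r2 > R2_THRESHOLD]
print(worth_testing)
```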

Significant actionable factors from the multivariate analysis of large-scale data become the basis for a DOE. Once the experiment design is vetted, documents can be drafted and experiment prepped to test those conditions.

There are a lot of reasons we go to the lab first. Here are a few:
1. You have more N (data samples)
2. You can test beyond the licensed limits
3. You get to isolate variables
4. You get the scientific basis for changing your process.

Should your small-scale experiments confirm your hypothesis, your post-experiment memo becomes the justification for plant trials. Depending on how your organization views setpoint changes within the acceptable limits or license limits, you will need varying degrees of justification to "fix what isn't broken." Usually, the summary of findings attached to the change order is sufficient for within-license changes to process setpoints. If your outside-of-license-limits findings can produce a significant (20 to 50%) increase in yields (or improvements in product quality), you may have to go to the big guns (Process Sciences) to get more N and involve the nice folks in Regulatory Affairs.

From a plant trial perspective, I've seen large-scale process changes run under QA-approved planned deviations for big changes. I've seen on-the-floor, production-supervision-approved changes for within-acceptable-range changes. I've seen managers so panicked by a potentially failing campaign that they shoot first and ask questions later (i.e. initiate the QA discrepancies, address the cGMP concerns later).

Whatever the case, the flow of hypotheses from the plant to the lab is how companies gain process knowledge and process understanding. The flow of plant trials from the lab back to the plant is how we realize continuous improvement.

Credit goes to Jesse Bergevin for inculcating this model under adverse conditions.

# Manufacturing Sciences - Local Lab

The other wing of the Manufacturing Sciences group was a lab group.

Basically, you enter the virtuous cycle thusly:
1. Design an experiment
2. Execute the experiment
3. Analyze the data for clues
4. Go to Step 1.

You're thinking, "Gosh, that looks a lot like Process Sciences (aka Process R&D)." And you'd be right. That's exactly what they do; they run experiments at small scale to figure out something about the process.

Territorial disputes are common when it comes to local Manufacturing Sciences groups having local labs. From the Process Sciences perspective, you have these other groups that may be duplicating work, operating outside of your system, basically doing things out of your control. From the Manufacturing Sciences perspective, you need a local resource that works on the timetable of commercial campaigns to address very specific and targeted issues - people who can sit at a table and update the local plant on findings.

If your cashflow can support it, I recommend developing a local lab and here's why:

The lab counterpart of the Manufacturing Sciences group ran an experiment that definitively proved a physical bioreactor part was the true root cause of poor cell growth... this poor cell growth had delayed licensing of the 400+ million dollar plant by 10 months. The hypothesis was unpopular with the Process Science department at corporate HQ and there was much resistance to testing it. In the end, it was the local lab group that ended the political wrangling and provided the data to put the plant back on the tracks towards FDA licensure.

I do have to say that not everything is adversarial. We received quite a bit of help from Process Sciences when starting up the plant and a lot of our folks hailed from Process Sciences (after all, where do you think we got the know-how?). When new products came to our plant, we liaised with Process Science folk.

My point is: in more cases than not, a local manufacturing sciences group with laboratory capability is crucial to the process support mission.

# Manufacturing Sciences - Local Data

My second job out of college was to be the fermentation engineer at what was then the largest cell culture plant (by volume) in the United States. As it turns out, being "large" isn't the point; but this was 1999 and we didn't know that yet. We were trying to have the lowest per-gram cost of bulk product; but I digress.

I was hired into a group called Manufacturing Sciences, which reported into the local technology department that reported to the plant manager. My job was to observe the large-scale cell culture process and analyze the data.

Our paramount concern was quantifying process variability and trying to reduce it. The reason, of course, was to make the process stable so that manufacturing is predictable. Should special cause variability show up, the job was to look for clues to improve volumetric productivity.

The circle of life (with respect to data) looks like this:

Data and observations come from the large-scale process. We applied statistical process control (SPC) and statistical analyses like control charts and ANOVA. From our analysis, we were able to implement within-license changes to make the process more predictable. And should special cause signals arise, we stood ready with more statistical methods to increase volumetric productivity.
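The control-chart piece of that loop can be sketched in a few lines. The titer values below are invented; the limits follow the standard individuals-chart convention, with sigma estimated from the average moving range:

```python
import numpy as np

def individuals_limits(values):
    """Control limits for an individuals (I) chart.

    Sigma is estimated as MRbar / 1.128, the standard I-MR convention
    (1.128 is the d2 constant for subgroups of size 2).
    """
    x = np.asarray(values, dtype=float)
    mrbar = np.mean(np.abs(np.diff(x)))  # average moving range
    sigma = mrbar / 1.128
    center = x.mean()
    return center - 3 * sigma, center, center + 3 * sigma

# Hypothetical run of batch titers (g/L):
titers = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.1]
lcl, center, ucl = individuals_limits(titers)
print(f"LCL={lcl:.2f}  center={center:.2f}  UCL={ucl:.2f}")
```

A new batch landing outside those limits is a special cause signal worth chasing.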

# Variability Reduction is a core objective

Reducing process variability is a core objective for process improvement initiatives because low variability helps you identify small changes in the process.

Here's a short example to illustrate this concept. Suppose you are measuring osmolality in your buffer solution and the values for the last 10 batches are as follows:

293, 295, 299, 297, 291, 299, 298, 292, 293, 296.

Then the osmolality of the 11th batch of buffer comes back at 301 mOsm/kg. Is this 301 result "anomalous" or "significantly different"?

It's hard to tell, right? I mean, it's the first value greater than 300, so that's something. But it is only 2 mOsm/kg greater than the highest previously observed value, while the measurements range from 291 to 299, an 8 mOsm/kg spread.

Let's try another series of measurements - this time, only 7 measurements:

295, 295, 295, 295, 295, 295, 295.

Then the measurement of the eighth batch is 297 mOsm/kg. Is this result anomalous or significantly different? The answer is yes. Here's why:

The process demonstrates no variability (within measurement error) and all of a sudden, there is a measurable difference. The 297 mOsm/kg is 2 mOsm/kg above the highest measured value, but the range is 0 (all values measure 295). The difference is infinitely greater than the range.

There are far more rigorous data analysis methods to better quantify the statistics comparing differences that will be discussed in the future, but you can see how variability reduction helps you detect differences sooner.

Also, remember that variability (a.k.a. standard deviation) is the denominator of the capability equation:

Cp = (USL - LSL) / 6σ

Reducing process variability increases process capability.

To summarize: reducing process variability helps in 2 ways:

1. Deviations (or differences) in the process can be detected sooner.
2. Capability of the process (a.k.a. robustness) increases.
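A minimal sketch of that second point, using the common Cp definition, (USL - LSL)/(6σ), with a hypothetical osmolality spec of 280-310 mOsm/kg:

```python
def cp(usl: float, lsl: float, sigma: float) -> float:
    """Process capability: spec width over six standard deviations."""
    return (usl - lsl) / (6 * sigma)

# Hypothetical spec limits; sigma values chosen for illustration.
print(cp(310, 280, 3.0))  # ~1.67
print(cp(310, 280, 1.0))  # 5.0: cutting sigma to a third triples Cp
```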

Hitting the aforementioned two birds with the proverbial one stone (variability reduction) is a core objective of any continuous process improvement initiative. Applying the statistical tools to quantify process variability ought to be a weapon in every process engineer's arsenal.

## Tuesday, March 13, 2012

As discussed in a previous post on cGMP, FDA issues Form 483s to companies for inspection observations. Form 483s can turn into warning letters; warning letters can turn into consent decrees, and your plant might get shut down.

So the game is to nip this in the bud at the 483 level. Well, the guys over at FDAzilla have done it again, offering all the freely posted FDA 483s in a Dropbox folder.

You just put in your email, and they'll invite you to share their Dropbox folder.

Genius.

# How Does Someone Go About Stealing My Process (PI) Data?

One savvy prospect asked that question, and it's a great question:

A lot of user requirement specifications (URS) or detailed design specs require data security.

And if keeping your process data secure were not already in your company's best interest, it is also a requirement for compliance with the infamous 21 CFR Part 11 - the regulation governing the conditions under which FDA views electronic data the "same as paper."

21 CFR Part 11, Subpart B mandates, among other controls:

> (k) Use of appropriate controls over systems documentation including: (1) Adequate controls over the distribution of, access to, and use of documentation for system operation and maintenance.

In the name of securing data - and the trade secrets embedded in your company's process data - a lot of time goes into securing the server. We:

• Create Active Directory groups and map them to PI Identities.
• Set up PI Trusts to authorize specific computers.
• Bifurcate our networks and put PI behind Fortinet appliances and firewalls.

All this protects the server - but none of it protects your data from theft if you just back up your PI data to a network drive.

You see, if I wanted to steal your company's data, I wouldn't go anywhere NEAR your PI server. I'd go looking for your backup files.

Your backup files contain all the files needed to restore your PI system. If I get these files, I can recreate your PI system on my own box, where I have admin rights, and mine your data all day long.

As for that prospect... they're now a customer.

# Excdev/Compdev vs. Excdevpercent/Compdevpercent

There are two ways of configuring data compression in PI... and there seems to be a lot of confusion surrounding them:
1. As absolute value
2. As percent of span

#### Absolute Value - excdev, compdev

Most people understand setting excdev and compdev best. pH is a parameter that goes from 1 to 14. If you think increments of 0.01 are significant to your process and you think that your pH probe can measure pH that accurately, then you ought to set excdev = 0.01, compdev = 0.005 (see the whitepaper).

Excdev and compdev hold values measured in the engineering units of the parameter being measured. In the previous example, 0.01 pH units. If your tag is measuring temperature, it'd have engineering units of degC or degF or K.

Because excdev and compdev are expressed in terms that you can relate to, this is the most popular method of specifying the compression settings.

#### Percent of Span - excdevpercent, compdevpercent

The less popular method of setting compression is by specifying the compression settings as a percentage of span.

What does "Percent of span" mean?

Every PI point has a SPAN attribute. The SPAN is the largest value you expect the tag to archive. In our pH example, there is no such thing as a pH greater than 14, so people often set SPAN = 14.

When you specify compression settings with excdevpercent/compdevpercent, you set the deviation as a percentage of the span. For example, suppose we wanted to set the exception to 0.01 pH units. We can do that with this calculation:

excdevpercent = excdev / ( SPAN ) * 100

excdevpercent = 0.01 / 14 * 100

excdevpercent = 0.0714

As a caveat, this number is computed as a percent of SPAN and not as a percent of the range (SPAN - ZERO).

This seems like a lot of trouble - why bother doing this calculation when you can set the actual value? Here's why:

If you have a lot of points to configure, you don't have time to go through each one. Also, more often than not, your user requirements are specified as a percent of the instrument's range. This is why it is sometimes faster and more efficient to configure data compression with the exc-/compdevpercent settings.
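If you are scripting this over many tags, the conversion is a one-liner. A sketch (the function name is mine, not a PI API call):

```python
def dev_to_percent(dev: float, span: float) -> float:
    """Convert an absolute excdev/compdev into a percent-of-span value.

    Note: per the caveat above, PI uses SPAN alone here,
    not the range (SPAN - ZERO).
    """
    return dev / span * 100.0

# The pH example from the text: 0.01 pH units on a span of 14.
print(round(dev_to_percent(0.01, 14), 4))  # 0.0714
```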

#### What happens if I specify both?

Under the hood, if you specify EXCDEV or COMPDEV, the PI Server will compute and store the values as EXCDEVPERCENT and COMPDEVPERCENT. If you specify them as percentages, they'll simply be stored as the percentages.

If you happen to specify both the -dev and -devpercent, the -devpercent will override the -dev settings.