# Data Visualization - Learn By Looking

So the least rigorous form of statistical analysis is simply looking at the data. I've written about this before: you can tell quite a bit about a phenomenon just by looking at the data (no p-values, no alphas... just looking at the data).

Here was that example distribution of test scores:

When you look at the data and there are irregularities or non-smoothness, you're looking at some human intervention... some manual action that does not comport with the natural order.

Have a look at this visualization. It's apparently the average monthly premiums for insurance plans under the Affordable Care Act for each county. Dark blue are plans that cost \$250/mo. Dark red are plans that cost \$1,250/mo... so the more red, the more costly.

What's most interesting to me is that you can see the shapes of Virginia, Wyoming, South Dakota and New Jersey pretty well on this map according to the price of ACA insurance premiums. Some guy living in Montana is paying \$500/month; cross an imaginary line into Wyoming and now it's \$1,000.

When you see something like this, you can infer that a non-natural phenomenon holds the true explanation (e.g. state law). There's a step-function here and step-functions aren't found that often in nature.

Now have a look at New England: here, there's a gradient... the farther northeast you go, the more costly the insurance. Likewise in Wisconsin... the closer you get to Minnesota, the more expensive the premium. Gradual change, or smoothness, is what we can expect from nature. And a lot of information can be inferred by just looking.
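The "step-function versus gradient" idea can be sketched in a few lines. This is only an illustration: the function name `find_steps`, the threshold, and the premium values are hypothetical (the \$500 to \$1,000 jump simply echoes the Montana/Wyoming example above), and real map data would need far more care.

```python
def find_steps(values, jump_threshold):
    """Return each index i where the change from values[i] to values[i + 1]
    is larger than jump_threshold (a crude sign of a step function)."""
    return [
        i
        for i in range(len(values) - 1)
        if abs(values[i + 1] - values[i]) > jump_threshold
    ]


# Hypothetical monthly premiums (USD) for adjacent counties along a line.
border_crossing = [500, 510, 495, 1000, 1010, 990]  # abrupt jump at a state line
gradient = [400, 450, 500, 550, 600, 650]           # smooth, nature-like change

steps_border = find_steps(border_crossing, jump_threshold=200)
steps_gradient = find_steps(gradient, jump_threshold=200)

print("border crossing steps:", steps_border)
print("gradient steps:", steps_gradient)
```

A non-empty result flags where to go looking for a man-made explanation; an empty result is what a gradual trend looks like.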

# Unnecessary Testing in cGMP World

In the world of soaring medical costs, we have unnecessary medical tests and procedures sapping precious healthcare dollars.

The thing about these medical tests is that they are necessary for someone under some circumstances, just not for most people under most circumstances.

So is the case in the world of GMP biologics manufacturing. There are plenty of tests that need to happen to produce a releasable lot. There are in-process tests; there are Certificate of Analysis (CofA) tests. There are analytical tests you perform for engineering runs; there are tests you perform on contaminated lots, but not others.

But regardless of what test you are performing, the litmus test for performing the test or analysis is:

"Does the result of the test help me make a decision?"

Consider the following snippet from a computer program:

```
if ( testResult == PASS )
{
    // Test passed: forward process the batch
    forwardProcessBatch();
}
else
{
    // Test failed: forward process the batch anyway
    forwardProcessBatch();
}
```

In this case, if I run a test and pass, I get to forward process the batch. If the test fails, I still get to forward process the batch. So if in either case, I get to forward process the batch, why should I bother doing the test? The result of the test does not do anything to serve the outcome!

Another good way to approach the question of whether or not to perform a test is to see if you can write down a plan for what to do with the test result. If you can write down a reasonable plan and stick with the plan prior to getting the test results, then there's a good reason to perform the test; otherwise, you're simply on a fishing expedition and making it up as you go along.
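One way to make the "write down the plan first" check concrete is a minimal sketch like the one below. The function name `result_changes_action` and the action strings (including "hold batch and open discrepancy") are hypothetical, chosen only for illustration.

```python
def result_changes_action(plan):
    """plan maps every possible test result to the action committed to
    before the result is known. The test is worth running only if at
    least two results lead to different actions."""
    return len(set(plan.values())) > 1


# The situation from the snippet above: same action either way.
pointless_plan = {
    "PASS": "forward process batch",
    "FAIL": "forward process batch",
}

# A hypothetical plan where the result actually drives the decision.
useful_plan = {
    "PASS": "forward process batch",
    "FAIL": "hold batch and open discrepancy",
}

print(result_changes_action(pointless_plan))
print(result_changes_action(useful_plan))
```

If the plan cannot be filled in before the result arrives, that is the fishing expedition.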

### FIO Samples

There exist "For Information Only" samples that are specified into the process.  For example, concentrations of ammonium (NH4+), sodium (Na+), pO2 and pCO2 are measures of cell culture metabolism that are useful for long-term process understanding.  They will likely never be used to make a forward-process decision, though they can be used for retrospective justification of discrepancies or as variables during multivariate data analysis.

In my experience, these routine FIO samples are contentious.  On one hand, they serve the purpose of long-term, large-scale process understanding as well as sporadic justification for discrepancies.  On the other hand, if FIO samples get used enough to close discrepancies and release lots, over time the FDA and other agencies will pressure you into making these FIO tests into in-process or lot-release tests.

### Defensibility

In the end, your actions in deciding to perform a test need to be defensible.  You need to defend the costs to do the test to management.  You need to defend not doing the test to the FDA.  And your situation may be different than the biologics manufacturer down the street.

That defense ought to rest on whether or not you can do something with the result of the test.

# Oldie, but Doozy - Scientists are not Statisticians

Here's something that just came to my attention: last year (2012), Amgen reported that they attempted to confirm the published findings of 53 "landmark" studies involving new approaches to targeting cancer or alternative clinical uses. Only six (11%) of these studies could be reproduced.

Adding a second clue to this puzzle, Germany's Bayer HealthCare tried to validate published pre-clinical findings and found that only 25% were reproducible... an abysmal rate similar to the Amgen finding.

This past week, UC Davis professor of plant pathology Pamela Ronald issued a retraction of foundational work her lab did in 1995 and wrote of the mistakes in Scientific American. In her case, there were two errors, one of which was simply labeling:

"In this way, new members of my laboratory uncovered two major errors in our previous research. First, we found that one of the bacterial strains we had relied on for key experiments was mislabeled."

Incidentally, Dr. Ronald cites the lack of reproducibility that Amgen found.

It's gotten to the point where the Economist has two articles out on it:

In the second article, on unreliable research, there's a segment on researchers who lack statistical knowledge and design experiments whose results do not pass statistical muster because "scientists are not statisticians." Their conclusion comports with my experience: an epidemic of statistical dunderheads in science. Researchers are choosing N based on the number of slots in the pilot plant or on the capacity of the lab. Scientists do not understand the risks of Type 1/Type 2 error built into their designs.
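For contrast with "N = number of slots in the pilot plant," here is a minimal sketch of choosing N from the error risks instead. It uses the textbook normal-approximation formula for comparing two group means, which is my choice of illustration and not something from the Economist article; the function name `samples_per_group` and the default alpha and power values are assumptions.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate N per group for a two-sided, two-sample comparison of means.

    effect_size: expected difference divided by the standard deviation.
    alpha:       accepted Type 1 error risk (false positive).
    power:       1 minus the accepted Type 2 error risk (false negative).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


n_large_effect = samples_per_group(effect_size=1.0)
n_medium_effect = samples_per_group(effect_size=0.5)

print("N per group, effect size 1.0:", n_large_effect)
print("N per group, effect size 0.5:", n_medium_effect)
```

The point is the direction of the logic: the error risks and the effect you care about set N, and then you find out whether the lab has the capacity, not the other way around.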

The first step is to get some statistical training. After that, it's on-the-job training and getting on the phone with someone who is qualified. But the way it is and where we are headed is simply not acceptable. Data-based decision-making is at the core of science (and for that matter, biologics manufacturing).  Scientists may not be statisticians, but perhaps they ought to be.

p.s. - It's interesting to note that it is publicly-funded academic research that cannot be confirmed by private-sector firms and not vice versa.

# Non-Essential FDA Inspections to Resume!

News reports indicate that the government shutdown is over.  "Non-essential" federal employees that were furloughed included FDA inspectors (euphemistically called, "Consumer Safety Officers").

Apparently, even the government admits that FDA inspections are... non-essential.

Regardless, I'm quite certain that the effect on the GMP operations around the world was that of pure joy. Inspection Readiness Teams could relax a bit.  Regulatory Affairs and QA managers didn't have to be on high alert (especially ones that hadn't been inspected in a while).

FDAzilla reports a significant drop in sales of 483 documents as RegA and Quality managers had no reason for last-minute research on the inspector that didn't show up (but will still collect backpay).

But now that ~~paid vacation~~ the furlough for FDA inspectors is over, it's time for the GMP community to get back on their toes and maintain inspection readiness.

Good luck.

# Who Are You Guys, Anyway?

So, I asked for a report to study Zymergi blog readers, and here's where the biotech/pharma readers are coming from:

This is a veritable who's who of the biotech world.  Obviously, you aren't all customers, but when it comes to large-scale biologics support, cell culture and bioreactor contaminations, readers and customers find themselves in good company.

Note: All logos/trademarks belong to the trademark holder and inclusion on this list is not an endorsement of Zymergi or vice versa.

# MSAT to Automation, MSAT to Automation. Come in, Automation

When I was running cell culture campaign monitoring and we were using PI to review trends to understand physical phenomena, there were times the trends didn't make any sense.

After digging a little, we found out that the data was simply being recorded with too little resolution, either in the Y-direction or the X-direction.

Here's a short blog post describing the words to say to Automation (as well as some perspective) to get some more resolution in your data.

### Compression Settings

If the data seems sparse in the Y-direction (e.g., you expect to see oscillation but only see a straight line), it could be because the compression settings are such that too much data gets filtered out. For OSI PI, there are two types of filter settings: (1) exception and (2) compression.

Exception is responsible for filtering out repeat data between the data interface and PI.

Compression is responsible for filtering out repeat linear data within PI (between the snapshot and the archive).

Every point attribute can be viewed from within PI ProcessBook. And if you find that your exception or compression settings are too wide, view them within PI and make a note of what they ought to be, then go on and tell your Automation team.

In my experience, you'll find a reluctance within Automation to change the individual settings on points. Generally, there is a standard or a rule that is applied uniformly to the set of points. For example, if you're using Broadley-James pH probes in both cell culture and purification and we (cell culture) ask for a 0.005 compdev on bioreactor pH probes, shouldn't the buffer prep pH probes also be set to 0.005 compdev?

Automation has to balance the tension between customer (your) needs as well as defensible system configuration.

Generally speaking, you're going to be asking for changes to the compdev or excdev point attributes, and if you're asking for more data to be collected, you want these numbers to be smaller.
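To see why a smaller deviation setting means more data, here is a deliberately simplified deadband filter. It is a sketch of the general idea only and not PI's actual exception or compression algorithm, which is more involved; the function name `deadband_filter` and the pH readings are hypothetical.

```python
def deadband_filter(samples, deviation):
    """Keep a (time, value) sample only when its value differs from the
    last kept value by more than `deviation`. The first sample is
    always kept."""
    kept = []
    for timestamp, value in samples:
        if not kept or abs(value - kept[-1][1]) > deviation:
            kept.append((timestamp, value))
    return kept


# Hypothetical bioreactor pH readings oscillating by about 0.01 pH units.
ph_samples = [(0, 7.000), (1, 7.010), (2, 7.000),
              (3, 6.990), (4, 7.000), (5, 7.010)]

wide = deadband_filter(ph_samples, deviation=0.05)     # setting too wide
narrow = deadband_filter(ph_samples, deviation=0.005)  # the 0.005 from above

print("kept with 0.05: ", len(wide), "of", len(ph_samples))
print("kept with 0.005:", len(narrow), "of", len(ph_samples))
```

With the wide setting the oscillation never clears the deadband, so the trend shows a straight line; with the narrow setting every swing gets stored.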

### Scan Rate Settings

What if, after improving compression to filter out less data, you still find that there is not sufficient resolution in the data to observe the physical phenomena that you know are happening? Well, the only place left to check is the scan rate of the data... sparseness of data along the X-axis.

A point's scan rate is set based on a list of pre-defined intervals in the data interface. The data interface is a piece of software that transfers data from the source (control system) to the destination (data historian). If the interface is configured well, it will have sensible scan rates:

1. Every second
2. Every 2 seconds
3. Every 5 seconds
4. Every 10 seconds
5. Every 20 seconds
6. Every 30 seconds
7. Every minute
8. Every 5 minutes

It isn't always like this, but very often you'll see these intervals. The scan rates are defined in the interface configuration file and once set, they rarely change. The way it works is this: the first entry in the interval configuration gets assigned 1, the second entry 2, the third entry 3, and so on.

And whatever you set the point's location4 attribute to determines its scan rate.

So suppose 00:00:05 is the third entry. Then a point whose location4=3 has a scan rate of every-5-seconds.

In a lot of cases, you simply tell your PI administrator you want the scan rate to be "every second," after which they're on the hook for looking up that scan rate in the interface. But FYI, if they say they made the change to the point but the location4 attribute is the same before and after, they're just BSing you.
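The location4 lookup described above can be sketched as a simple 1-based index into the interface's interval list. The list below just mirrors the example intervals in this post; your interface's configuration will differ, and the names `SCAN_CLASSES`, `scan_rate_for`, and `scan_rate_changed` are hypothetical.

```python
# Intervals in the order they appear in the (assumed) interface
# configuration. The first entry is scan class 1, the second is 2, etc.
SCAN_CLASSES = [
    "00:00:01", "00:00:02", "00:00:05", "00:00:10",
    "00:00:20", "00:00:30", "00:01:00", "00:05:00",
]


def scan_rate_for(location4):
    """Return the scan interval for a point's location4 attribute."""
    if not 1 <= location4 <= len(SCAN_CLASSES):
        raise ValueError(f"no scan class defined for location4={location4}")
    return SCAN_CLASSES[location4 - 1]  # location4 is 1-based


def scan_rate_changed(location4_before, location4_after):
    """True only if the point now scans at a different interval."""
    return scan_rate_for(location4_before) != scan_rate_for(location4_after)


print(scan_rate_for(3))
print(scan_rate_changed(3, 3))
print(scan_rate_changed(3, 1))
```

If location4 reads the same before and after the "change," the second function tells you what you already suspected.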

There are a lot of considerations that need to get balanced when figuring out this stuff.  What's working against you are the default settings that come out-of-the-box with PI... as well as a generation taking the path of least resistance.
