Read in and process data from the Excel template#

This tutorial looks at how to import data from the Excel template and provides an example of how it can be manipulated readily.

A function is provided (“process_template”) which reads in the Excel file and automatically performs the pseudo-batch conversion. As inputs, it takes the filename of the Excel template to process as well as two optional parameters: 1) “phase” argument (string type, default “all”, allowable “1”,”2”,”all”) 2) “outlier_removal” argument (boolean type, default False) The phase argument allows users to restrict calculations over a particular phase of the data as defined within the template (particularly, growth rate on the bioreactor_data tab and yields on the visualisations tab). This is useful if a user wishes to restrict analysis to a particular period of growth (e.g. exponential growth). The template automatically plots biomass over time in the “Bioreactor_data” tab and the logarithmic transformation of this (where linear regions would represent periods of balanced, exponential growth). These periods can be captured by defining time boundaries in the respective cells (AF4-5 and AN4-5). While all data is still imported and transformed, phase specific boundaries are captured for individual compounds and outlier information, allowing the calculation of rates while eliminating data points with obvious or suspected errors, or the bespoke curation of bounds defining steady production/secretion fluxes which may differ from linear growth periods.

The visualisations tab allows for specification of phase boundaries for the calculation of rates and yields by selecting the drop down cells (lines 40-41 for phase 1 and 90-91 for phase 2). The following 3 lines allow for the specification of outliers (note that the template will automatically suggest potential outlier values based on Cook’s distance, flagging values greater than 1 (where Cook’s distance specifies how influential a data point is).

Specification of “1” or “2” will capture the phase specific annotation from the Excel file for the respective phase, while “all” will ignore phase boundary information but still permit the use of outlier information as entered for phase 1.

The outlier_removal flag enforces whether defined outliers are censored from the data or not.

Note that the template already captures data from multiple feeds so no additional processing has to be done here.

[1]:
#Grab requirements
import pandas as pd
import numpy as np
from pseudobatch import pseudobatch_transform_pandas
from pseudobatch import process_excel_template
from patsy import dmatrices
import statsmodels.api as sm
import pathlib

PROJECT_DIR = pathlib.Path.cwd().parents[2]
/Users/s143838/.virtualenvs/pseudobatch-dev/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
{'stan_version_major': '2', 'stan_version_minor': '29', 'stan_version_patch': '2', 'STAN_THREADS': 'false', 'STAN_MPI': 'false', 'STAN_OPENCL': 'false', 'STAN_NO_RANGE_CHECKS': 'false', 'STAN_CPP_OPTIMS': 'false'}
[2]:
#First read in and convert the data
fedbatch_df_measurement, phase_bounds, pseudo_cpds, fed_cpds = process_excel_template(PROJECT_DIR / 'excel-pseudobatch' / 'Pseudo_batch_template_test2.xlsx', phase='1', outlier_removal = False)
The following compounds are not defined in feeds but are measured. They are assumed to not be fed. Compounds: Biomass (OD)
/Users/s143838/.virtualenvs/pseudobatch-dev/lib/python3.10/site-packages/openpyxl/worksheet/_reader.py:329: UserWarning: Data Validation extension is not supported and will be removed
  warn(msg)

The outputs of the above function are: 1) a dataframe (here fedbatch_df_measurement) which contains the compiled bioreactor data, concentration measurements and converted pseudo concentrations. 2) phase_bounds - either a series object containing just the growth phase bounds or a dataframe that contains growth and yield defined phase bounds, depending on whether phase specific information is to be used. Phase bounds are mapped to Sample IDs. 3) pseudo_cpds - a list containing all column names of adjusted measured compounds using the pseudo-batch transformation methodology. This is to streamline further calculations 4) fed_cpds - a list containing column names of adjusted measured compounds that are also in the feed (in case they wanted to be screened as potential substrates and redefined as consumed substrate quantities, as per previous tutorials).

We will now illustrate for example, simple ways how to further analyse this data within the Python framework. As an example we will calcualte growth rate, yields and specific rates of production/consumption

[3]:
#First we redefine the helper function wrapper for fitting rates and yields as per previous tutorials:

def fit_ols_model(formula_like: str, data: pd.DataFrame) -> sm.regression.linear_model.RegressionResultsWrapper:
    y, X = dmatrices(formula_like, data)
    model = sm.OLS(endog=y, exog=X)
    res = model.fit()
    return res

In previous tutorials, plotting the data and fits has been an important step. Here we assume that this has already been visualised, inspected and quality controlled for outliers using functionalities of the Excel template. Therefore we can skip this step and jump straight to generating results.

[4]:
#First, add a log-trasnformed pseudo-biomass column to calculate growth rate
#We assume here that all time points have a corresponding biomass measurement.
#Rows missing biomass records should be dropped from the dataframe

log_bio_name = "np.log(" + pseudo_cpds[0] + ")" #The first column in pseudo_cpds is always the biomass
fedbatch_df_measurement[log_bio_name] = np.log(fedbatch_df_measurement[pseudo_cpds[0]])

[5]:
# Next let's reduce the fedbatch measurement dataframe to consider only the phase of interest if we are considering phases.
# We will do this by censoring values that fall outside of the phase range

fedbatch_df_measurement_phase = fedbatch_df_measurement.copy() # create a copy of the DF
#Find matching rows to Sample ID
if isinstance(phase_bounds, pd.DataFrame): #we only do this if phase_bounds is a DF, that is we are considering phase specific info
    phase_bounds = phase_bounds.rename(columns = {pseudo_cpds[0]: log_bio_name}) #update name in phase_bounds
    for col in phase_bounds:
        row_locs = fedbatch_df_measurement_phase.index[fedbatch_df_measurement_phase['Sample ID'].isin(phase_bounds[col])].to_list() #indices corresponding to phase boundary
        valid_indices = fedbatch_df_measurement_phase.loc[row_locs[0]:row_locs[1]].index.to_list() #set of indices that can be used to calculate yield/rate
        fedbatch_df_measurement_phase.loc[~fedbatch_df_measurement_phase.index.isin(valid_indices),col] = np.nan #censor data in invalid indices

After restricting available data we will perform the yield and growth rate calculations, as previously demonstrated in tutorial 2. We begin with the growth rate. Because users may opt for human readable column headers in the Excel template rather than Python friendly headers we will utilise patsy so we can accept any header.

[6]:
# Calculate growth rate
model_fit = "Q('" + log_bio_name + "')" + " ~ " + "Q('Time (h)')" #using patsy to handle disgusting variable names
mask = ~pd.isnull(fedbatch_df_measurement_phase[log_bio_name]) #create a mask to select for rows containing non-NaN values
res_mu_hat_pseudo = fit_ols_model(model_fit, fedbatch_df_measurement_phase) #perform the fit using our helper function
growth_rate = res_mu_hat_pseudo.params[1] #retrieve rate

Next we will demonstrate how yields and rates can readily be calculated from these data structures for any number of compounds. Results will be iteratively generated and stored in a dataframe. Note that here we will demonstrate a slightly different way of calculating yields to the previous tutorials (such as Tutorial 2).

In this instance, we will simplify calculations to deal with the potential for large numbers of measured compounds (such as feeding amino acid supplemented media) which could also be produced transiently (or entirely) across the culture. Here we use the direct pseudo-concentrations for the yield calculations, which can give negative yields. We will then calculate specific uptake/consumption rates by multiplying by the growth rate.

[7]:
parameter_df = pd.DataFrame(columns=['Compound','Yield','Rate'])

#First add in growth rate:
parameter_df.loc[0] = [pseudo_cpds[0],1,growth_rate]

for cpd in pseudo_cpds[1:]: #skipping biomass
    mask = ~pd.isnull(fedbatch_df_measurement_phase[cpd]) #create a mask to select for rows containing non NaN values
    model_fit = "Q('" + cpd + "')" + " ~ " + "Q('" + pseudo_cpds[0] +"')" #using patsy to handle disgusting variable names
    res_yield = fit_ols_model(model_fit, fedbatch_df_measurement_phase[mask]) #perform the fit
    #Append to DF. We will apply the convention of yields always being positive here
    parameter_df.loc[len(parameter_df)] = [cpd,abs(res_yield.params[1]),res_yield.params[1]*growth_rate]

parameter_df