Pandas (for data to plot)#

The standard data manipulation framework in Python (many others make use of at least part of its functionalities or are compatible with it)

We are not going in-depth - we are just showing some basic functionalities which are important for plotting.

import os
from pathlib import Path

import numpy as np
import pandas as pd

IN_COLAB = "COLAB_GPU" in os.environ
DATA_DIR = Path("data")

fname = Path("data") / "growth" / "fake_growth_data.csv"
if IN_COLAB:
    fname = (
        "https://raw.githubusercontent.com/biosustain/dsp_workshop_dataviz_python"
        "/refs/heads/main/data/growth/fake_growth_data.csv"
    )

1. DATA I/O (Input/Output)#

1. Read Data#

Let us load in some fake data made to fit some growth data.

df = pd.read_csv(fname)

Inspect the first few lines.

df.head()
condition SFN concentration (µM) replicate time (h) Bacterial growth (OD600)
0 Anaerobic DMSO 1 0.0 0.053628
1 Anaerobic DMSO 1 2.0 0.034735
2 Anaerobic DMSO 1 4.0 0.137467
3 Anaerobic DMSO 1 6.0 0.300051
4 Anaerobic DMSO 1 8.0 0.533996

You could also omit .head() and it will automatically shorten the output based on your rendering settings.

These settings are controlled by pandas display options, such as pd.options.display.max_rows and pd.options.display.max_columns.

df
condition SFN concentration (µM) replicate time (h) Bacterial growth (OD600)
0 Anaerobic DMSO 1 0.0 0.053628
1 Anaerobic DMSO 1 2.0 0.034735
2 Anaerobic DMSO 1 4.0 0.137467
3 Anaerobic DMSO 1 6.0 0.300051
4 Anaerobic DMSO 1 8.0 0.533996
... ... ... ... ... ...
463 Aerobic 2.5 3 16.0 0.943037
464 Aerobic 2.5 3 18.0 0.956014
465 Aerobic 2.5 3 20.0 0.945902
466 Aerobic 2.5 3 22.0 0.908015
467 Aerobic 2.5 3 24.0 0.917409

468 rows × 5 columns

1.2 Write it to a File#

E.g. after some data manipulation you wish to save the data.

Write a dataframe to a .csv file.

df.to_csv("fake_data.csv")

pd.read_csv() and .to_csv() are the general input/output functions (among some more). Meaning: they work for other common files too (.txt, .tsv) and some more domain-specific files that are however in a (somewhat) tabular format (e.g. .vcf). In other words - as long as it looks like a table, you can read /write to it, regardless of the file extension.

To demonstrate, let us save it to a .tsv file. The difference is that it is tab-separated values instead of comma-separated values in a .csv file.

df.to_csv("fake_data.tsv", sep="\t")

It can make sense to omit the index column, as Pandas automatically creates one when you read it, even if it already exists. (Otherwise you specify e.g. index_col=0 when reading it - try it out to see the difference).

df.to_csv("fake_data.tsv", sep="\t", index=False)
pd.read_csv("fake_data.tsv", sep="\t", index_col=None).head()
condition SFN concentration (µM) replicate time (h) Bacterial growth (OD600)
0 Anaerobic DMSO 1 0.0 0.053628
1 Anaerobic DMSO 1 2.0 0.034735
2 Anaerobic DMSO 1 4.0 0.137467
3 Anaerobic DMSO 1 6.0 0.300051
4 Anaerobic DMSO 1 8.0 0.533996

You can also read/write zipped file. Pandas will try to detect it automatically, but you can specify it yourself:

df.to_csv("fake_data.tsv.gz", sep="\t", index=False, compression="gzip")

2. Data Manipulation#

2.1 Rename#

The column names are written in a human-readable format, with spaces and brackets. It can be a good practice however to write them in a more programming-friendly way. We can easily do it by using a dictionary with the syntax {<old-name>: <new-name>}

df_col_renamed = df.rename(
    columns={
        "SFN concentration (µM)": "sfn_conc_mumolar",
        "time (h)": "time_h",
        "Bacterial growth (OD600)": "bact_growth_od600",
    }
)
df_col_renamed
condition sfn_conc_mumolar replicate time_h bact_growth_od600
0 Anaerobic DMSO 1 0.0 0.053628
1 Anaerobic DMSO 1 2.0 0.034735
2 Anaerobic DMSO 1 4.0 0.137467
3 Anaerobic DMSO 1 6.0 0.300051
4 Anaerobic DMSO 1 8.0 0.533996
... ... ... ... ... ...
463 Aerobic 2.5 3 16.0 0.943037
464 Aerobic 2.5 3 18.0 0.956014
465 Aerobic 2.5 3 20.0 0.945902
466 Aerobic 2.5 3 22.0 0.908015
467 Aerobic 2.5 3 24.0 0.917409

468 rows × 5 columns

We can do this - or many other operations - inplace for brevity.

df.rename(
    columns={
        "SFN concentration (µM)": "sfn_conc_mumolar",
        "time (h)": "time_h",
        "Bacterial growth (OD600)": "bact_growth_od600",
    },
    inplace=True,
)
df
condition sfn_conc_mumolar replicate time_h bact_growth_od600
0 Anaerobic DMSO 1 0.0 0.053628
1 Anaerobic DMSO 1 2.0 0.034735
2 Anaerobic DMSO 1 4.0 0.137467
3 Anaerobic DMSO 1 6.0 0.300051
4 Anaerobic DMSO 1 8.0 0.533996
... ... ... ... ... ...
463 Aerobic 2.5 3 16.0 0.943037
464 Aerobic 2.5 3 18.0 0.956014
465 Aerobic 2.5 3 20.0 0.945902
466 Aerobic 2.5 3 22.0 0.908015
467 Aerobic 2.5 3 24.0 0.917409

468 rows × 5 columns

2.2 Filtering by Conditions#

As we can see, we have data for aerobic and anaerobic conditions. If we are only interested in looking at e.g. the aerobic data, we can do it like this:

df_aerobic = df[df["condition"] == "Aerobic"]
df_aerobic
condition sfn_conc_mumolar replicate time_h bact_growth_od600
234 Aerobic DMSO 1 0.0 0.051324
235 Aerobic DMSO 1 2.0 0.012470
236 Aerobic DMSO 1 4.0 0.116809
237 Aerobic DMSO 1 6.0 0.413621
238 Aerobic DMSO 1 8.0 0.710586
... ... ... ... ... ...
463 Aerobic 2.5 3 16.0 0.943037
464 Aerobic 2.5 3 18.0 0.956014
465 Aerobic 2.5 3 20.0 0.945902
466 Aerobic 2.5 3 22.0 0.908015
467 Aerobic 2.5 3 24.0 0.917409

234 rows × 5 columns

We can also filter by multiple conditions. For example, get all anaerobic data from DMSO, where the OD600 is below a value of 0.4:

df_multi_condition = df[
    (df["condition"] == "Anaerobic")
    & (df["sfn_conc_mumolar"] == "DMSO")
    & (df["bact_growth_od600"] < 0.4)
]
df_multi_condition
condition sfn_conc_mumolar replicate time_h bact_growth_od600
0 Anaerobic DMSO 1 0.0 0.053628
1 Anaerobic DMSO 1 2.0 0.034735
2 Anaerobic DMSO 1 4.0 0.137467
3 Anaerobic DMSO 1 6.0 0.300051
13 Anaerobic DMSO 2 0.0 0.048281
14 Anaerobic DMSO 2 2.0 0.023912
15 Anaerobic DMSO 2 4.0 0.083990
16 Anaerobic DMSO 2 6.0 0.268920
26 Anaerobic DMSO 3 0.0 0.020199
27 Anaerobic DMSO 3 2.0 0.038043
28 Anaerobic DMSO 3 4.0 0.126597
29 Anaerobic DMSO 3 6.0 0.367225

2.3 Adding Data By Conditions#

Part of data transformation is adding in new variables based on other columns in the table. This can be also just for plotting. For example, we could wish to color our plot on whether our bacteria are in their stationary phase or not. Then, we would add a variable called is_stationary() that we use to subset parts of our data for plotting (we will show how later with seaborn). We use NumPy for this purpose with the syntax (<condition(s)>, <output-if-true>, <output-if-false>).

df["is_stationary"] = np.where(df["bact_growth_od600"] >= 0.55, True, False)
df
condition sfn_conc_mumolar replicate time_h bact_growth_od600 is_stationary
0 Anaerobic DMSO 1 0.0 0.053628 False
1 Anaerobic DMSO 1 2.0 0.034735 False
2 Anaerobic DMSO 1 4.0 0.137467 False
3 Anaerobic DMSO 1 6.0 0.300051 False
4 Anaerobic DMSO 1 8.0 0.533996 False
... ... ... ... ... ... ...
463 Aerobic 2.5 3 16.0 0.943037 True
464 Aerobic 2.5 3 18.0 0.956014 True
465 Aerobic 2.5 3 20.0 0.945902 True
466 Aerobic 2.5 3 22.0 0.908015 True
467 Aerobic 2.5 3 24.0 0.917409 True

468 rows × 6 columns

2.4 Aggregating Data#

Maybe we are more interested in the maximum or average OD or alike in our experiment. We could just use .max() for example but it would give us the maximum value across all variables. If however we wish to distinguish between different cases (like anaerobic/aerobic), then it makes sense to group our data first. Here, we show how to aggregate across all replicates:

df_rep_agg = (
    df.groupby(["time_h", "condition", "sfn_conc_mumolar"])["bact_growth_od600"]
    .mean()
    .reset_index()
)
df_rep_agg
time_h condition sfn_conc_mumolar bact_growth_od600
0 0.0 Aerobic 10 0.035428
1 0.0 Aerobic 15 0.069004
2 0.0 Aerobic 2.5 0.062911
3 0.0 Aerobic 20 0.046474
4 0.0 Aerobic 5 0.057184
... ... ... ... ...
151 24.0 Anaerobic 15 0.616655
152 24.0 Anaerobic 2.5 0.660576
153 24.0 Anaerobic 20 0.597215
154 24.0 Anaerobic 5 0.642585
155 24.0 Anaerobic DMSO 0.655811

156 rows × 4 columns

We put our command in brackets so we can write it out in multiple lines to show it better. We also reset the index at the end to transform the pd.Series object back into a pd.DataFrame object.

We can also aggregate multiple things at once:

df_multi_agg = (
    df.groupby(["time_h", "condition", "sfn_conc_mumolar"])
    .agg(
        {
            "is_stationary": ["count"],
            "bact_growth_od600": [
                "min",
                "mean",
                "max",
            ],
        }
    )
    .reset_index()
)
df_multi_agg
time_h condition sfn_conc_mumolar is_stationary bact_growth_od600
count min mean max
0 0.0 Aerobic 10 3 0.029107 0.035428 0.043397
1 0.0 Aerobic 15 3 0.066276 0.069004 0.071293
2 0.0 Aerobic 2.5 3 0.034301 0.062911 0.096214
3 0.0 Aerobic 20 3 0.023262 0.046474 0.074124
4 0.0 Aerobic 5 3 0.051937 0.057184 0.061837
... ... ... ... ... ... ... ...
151 24.0 Anaerobic 15 3 0.602570 0.616655 0.624859
152 24.0 Anaerobic 2.5 3 0.633187 0.660576 0.678301
153 24.0 Anaerobic 20 3 0.572835 0.597215 0.617822
154 24.0 Anaerobic 5 3 0.584576 0.642585 0.689661
155 24.0 Anaerobic DMSO 3 0.618712 0.655811 0.696756

156 rows × 7 columns

2.5 Adding Data by Aggregation#

Previously, we just added the is_stationary variable based on a guess. We could also do it more programmatically. We do it here in two steps:

  1. Adding a column indicating the maximum OD600 for that condition-sfn-replicate, minus a tolerance

  2. Conditioning on that new column

df["stationary_od"] = df.groupby(["condition", "sfn_conc_mumolar", "replicate"])[
    "bact_growth_od600"
].transform(lambda col: col.max() - 0.1)
df["is_stationary"] = np.where(
    df["bact_growth_od600"] >= df["stationary_od"], True, False
)
df
condition sfn_conc_mumolar replicate time_h bact_growth_od600 is_stationary stationary_od
0 Anaerobic DMSO 1 0.0 0.053628 False 0.556991
1 Anaerobic DMSO 1 2.0 0.034735 False 0.556991
2 Anaerobic DMSO 1 4.0 0.137467 False 0.556991
3 Anaerobic DMSO 1 6.0 0.300051 False 0.556991
4 Anaerobic DMSO 1 8.0 0.533996 False 0.556991
... ... ... ... ... ... ... ...
463 Aerobic 2.5 3 16.0 0.943037 True 0.856014
464 Aerobic 2.5 3 18.0 0.956014 True 0.856014
465 Aerobic 2.5 3 20.0 0.945902 True 0.856014
466 Aerobic 2.5 3 22.0 0.908015 True 0.856014
467 Aerobic 2.5 3 24.0 0.917409 True 0.856014

468 rows × 7 columns

That tolerance could also be calculated in a similar way, e.g. based on the standard deviation (we leave that as an exercise if you want to try it out yourself).

# Your turn

There are many more useful transformations, but we end here by having shown some that you could already use in more advanced plots.