Pandas (for data to plot)#

The standard data manipulation framework in Python (many others make use of at least part of its functionalities or are compatible with it)

We are not going in-depth - we are just showing some basic functionalities which are important for plotting.

import os
from pathlib import Path

import numpy as np
import pandas as pd

IN_COLAB = "COLAB_GPU" in os.environ

DATA_DIR = Path("data")

fname = Path("data") / "growth" / "fake_growth_data.csv"
if IN_COLAB:
    fname = (
        "https://raw.githubusercontent.com/biosustain/dsp_workshop_dataviz_python"
        "/refs/heads/main/data/growth/fake_growth_data.csv"
    )

1. DATA I/O (Input/Output)#

1. Read Data#

Let us load in some fake data made to fit some growth data.

df = pd.read_csv(fname)

Inspect the first few lines.

df.head()

	condition	SFN concentration (µM)	replicate	time (h)	Bacterial growth (OD600)
0	Anaerobic	DMSO	1	0.0	0.053628
1	Anaerobic	DMSO	1	2.0	0.034735
2	Anaerobic	DMSO	1	4.0	0.137467
3	Anaerobic	DMSO	1	6.0	0.300051
4	Anaerobic	DMSO	1	8.0	0.533996

You could also omit .head() and it will automatically shorten the output based on your rendering settings.

These settings are controlled by pandas display options, such as pd.options.display.max_rows and pd.options.display.max_columns.

df

	condition	SFN concentration (µM)	replicate	time (h)	Bacterial growth (OD600)
0	Anaerobic	DMSO	1	0.0	0.053628
1	Anaerobic	DMSO	1	2.0	0.034735
2	Anaerobic	DMSO	1	4.0	0.137467
3	Anaerobic	DMSO	1	6.0	0.300051
4	Anaerobic	DMSO	1	8.0	0.533996
...	...	...	...	...	...
463	Aerobic	2.5	3	16.0	0.943037
464	Aerobic	2.5	3	18.0	0.956014
465	Aerobic	2.5	3	20.0	0.945902
466	Aerobic	2.5	3	22.0	0.908015
467	Aerobic	2.5	3	24.0	0.917409

468 rows × 5 columns

1.2 Write it to a File#

E.g. after some data manipulation you wish to save the data.

Write a dataframe to a .csv file.

df.to_csv("fake_data.csv")

pd.read_csv() and .to_csv() are the general input/output functions (among some more). Meaning: they work for other common files too (.txt, .tsv) and some more domain-specific files that are however in a (somewhat) tabular format (e.g. .vcf). In other words - as long as it looks like a table, you can read /write to it, regardless of the file extension.

To demonstrate, let us save it to a .tsv file. The difference is that it is tab-separated values instead of comma-separated values in a .csv file.

df.to_csv("fake_data.tsv", sep="\t")

It can make sense to omit the index column, as Pandas automatically creates one when you read it, even if it already exists. (Otherwise you specify e.g. index_col=0 when reading it - try it out to see the difference).

df.to_csv("fake_data.tsv", sep="\t", index=False)
pd.read_csv("fake_data.tsv", sep="\t", index_col=None).head()

	condition	SFN concentration (µM)	replicate	time (h)	Bacterial growth (OD600)
0	Anaerobic	DMSO	1	0.0	0.053628
1	Anaerobic	DMSO	1	2.0	0.034735
2	Anaerobic	DMSO	1	4.0	0.137467
3	Anaerobic	DMSO	1	6.0	0.300051
4	Anaerobic	DMSO	1	8.0	0.533996

You can also read/write zipped file. Pandas will try to detect it automatically, but you can specify it yourself:

df.to_csv("fake_data.tsv.gz", sep="\t", index=False, compression="gzip")

2. Data Manipulation#

2.1 Rename#

The column names are written in a human-readable format, with spaces and brackets. It can be a good practice however to write them in a more programming-friendly way. We can easily do it by using a dictionary with the syntax {<old-name>: <new-name>}

df_col_renamed = df.rename(
    columns={
        "SFN concentration (µM)": "sfn_conc_mumolar",
        "time (h)": "time_h",
        "Bacterial growth (OD600)": "bact_growth_od600",
    }
)
df_col_renamed

	condition	sfn_conc_mumolar	replicate	time_h	bact_growth_od600
0	Anaerobic	DMSO	1	0.0	0.053628
1	Anaerobic	DMSO	1	2.0	0.034735
2	Anaerobic	DMSO	1	4.0	0.137467
3	Anaerobic	DMSO	1	6.0	0.300051
4	Anaerobic	DMSO	1	8.0	0.533996
...	...	...	...	...	...
463	Aerobic	2.5	3	16.0	0.943037
464	Aerobic	2.5	3	18.0	0.956014
465	Aerobic	2.5	3	20.0	0.945902
466	Aerobic	2.5	3	22.0	0.908015
467	Aerobic	2.5	3	24.0	0.917409

468 rows × 5 columns

We can do this - or many other operations - inplace for brevity.

df.rename(
    columns={
        "SFN concentration (µM)": "sfn_conc_mumolar",
        "time (h)": "time_h",
        "Bacterial growth (OD600)": "bact_growth_od600",
    },
    inplace=True,
)
df

	condition	sfn_conc_mumolar	replicate	time_h	bact_growth_od600
0	Anaerobic	DMSO	1	0.0	0.053628
1	Anaerobic	DMSO	1	2.0	0.034735
2	Anaerobic	DMSO	1	4.0	0.137467
3	Anaerobic	DMSO	1	6.0	0.300051
4	Anaerobic	DMSO	1	8.0	0.533996
...	...	...	...	...	...
463	Aerobic	2.5	3	16.0	0.943037
464	Aerobic	2.5	3	18.0	0.956014
465	Aerobic	2.5	3	20.0	0.945902
466	Aerobic	2.5	3	22.0	0.908015
467	Aerobic	2.5	3	24.0	0.917409

468 rows × 5 columns

2.2 Filtering by Conditions#

As we can see, we have data for aerobic and anaerobic conditions. If we are only interested in looking at e.g. the aerobic data, we can do it like this:

df_aerobic = df[df["condition"] == "Aerobic"]
df_aerobic

	condition	sfn_conc_mumolar	replicate	time_h	bact_growth_od600
234	Aerobic	DMSO	1	0.0	0.051324
235	Aerobic	DMSO	1	2.0	0.012470
236	Aerobic	DMSO	1	4.0	0.116809
237	Aerobic	DMSO	1	6.0	0.413621
238	Aerobic	DMSO	1	8.0	0.710586
...	...	...	...	...	...
463	Aerobic	2.5	3	16.0	0.943037
464	Aerobic	2.5	3	18.0	0.956014
465	Aerobic	2.5	3	20.0	0.945902
466	Aerobic	2.5	3	22.0	0.908015
467	Aerobic	2.5	3	24.0	0.917409

234 rows × 5 columns

We can also filter by multiple conditions. For example, get all anaerobic data from DMSO, where the OD600 is below a value of 0.4:

df_multi_condition = df[
    (df["condition"] == "Anaerobic")
    & (df["sfn_conc_mumolar"] == "DMSO")
    & (df["bact_growth_od600"] < 0.4)
]
df_multi_condition

	condition	sfn_conc_mumolar	replicate	time_h	bact_growth_od600
0	Anaerobic	DMSO	1	0.0	0.053628
1	Anaerobic	DMSO	1	2.0	0.034735
2	Anaerobic	DMSO	1	4.0	0.137467
3	Anaerobic	DMSO	1	6.0	0.300051
13	Anaerobic	DMSO	2	0.0	0.048281
14	Anaerobic	DMSO	2	2.0	0.023912
15	Anaerobic	DMSO	2	4.0	0.083990
16	Anaerobic	DMSO	2	6.0	0.268920
26	Anaerobic	DMSO	3	0.0	0.020199
27	Anaerobic	DMSO	3	2.0	0.038043
28	Anaerobic	DMSO	3	4.0	0.126597
29	Anaerobic	DMSO	3	6.0	0.367225

2.3 Adding Data By Conditions#

Part of data transformation is adding in new variables based on other columns in the table. This can be also just for plotting. For example, we could wish to color our plot on whether our bacteria are in their stationary phase or not. Then, we would add a variable called is_stationary() that we use to subset parts of our data for plotting (we will show how later with seaborn). We use NumPy for this purpose with the syntax (<condition(s)>, <output-if-true>, <output-if-false>).

df["is_stationary"] = np.where(df["bact_growth_od600"] >= 0.55, True, False)
df

	condition	sfn_conc_mumolar	replicate	time_h	bact_growth_od600	is_stationary
0	Anaerobic	DMSO	1	0.0	0.053628	False
1	Anaerobic	DMSO	1	2.0	0.034735	False
2	Anaerobic	DMSO	1	4.0	0.137467	False
3	Anaerobic	DMSO	1	6.0	0.300051	False
4	Anaerobic	DMSO	1	8.0	0.533996	False
...	...	...	...	...	...	...
463	Aerobic	2.5	3	16.0	0.943037	True
464	Aerobic	2.5	3	18.0	0.956014	True
465	Aerobic	2.5	3	20.0	0.945902	True
466	Aerobic	2.5	3	22.0	0.908015	True
467	Aerobic	2.5	3	24.0	0.917409	True

468 rows × 6 columns

2.4 Aggregating Data#

Maybe we are more interested in the maximum or average OD or alike in our experiment. We could just use .max() for example but it would give us the maximum value across all variables. If however we wish to distinguish between different cases (like anaerobic/aerobic), then it makes sense to group our data first. Here, we show how to aggregate across all replicates:

df_rep_agg = (
    df.groupby(["time_h", "condition", "sfn_conc_mumolar"])["bact_growth_od600"]
    .mean()
    .reset_index()
)
df_rep_agg

	time_h	condition	sfn_conc_mumolar	bact_growth_od600
0	0.0	Aerobic	10	0.035428
1	0.0	Aerobic	15	0.069004
2	0.0	Aerobic	2.5	0.062911
3	0.0	Aerobic	20	0.046474
4	0.0	Aerobic	5	0.057184
...	...	...	...	...
151	24.0	Anaerobic	15	0.616655
152	24.0	Anaerobic	2.5	0.660576
153	24.0	Anaerobic	20	0.597215
154	24.0	Anaerobic	5	0.642585
155	24.0	Anaerobic	DMSO	0.655811

156 rows × 4 columns

We put our command in brackets so we can write it out in multiple lines to show it better. We also reset the index at the end to transform the pd.Series object back into a pd.DataFrame object.

We can also aggregate multiple things at once:

df_multi_agg = (
    df.groupby(["time_h", "condition", "sfn_conc_mumolar"])
    .agg(
        {
            "is_stationary": ["count"],
            "bact_growth_od600": [
                "min",
                "mean",
                "max",
            ],
        }
    )
    .reset_index()
)
df_multi_agg

	time_h	condition	sfn_conc_mumolar	is_stationary	bact_growth_od600
				count	min	mean	max
0	0.0	Aerobic	10	3	0.029107	0.035428	0.043397
1	0.0	Aerobic	15	3	0.066276	0.069004	0.071293
2	0.0	Aerobic	2.5	3	0.034301	0.062911	0.096214
3	0.0	Aerobic	20	3	0.023262	0.046474	0.074124
4	0.0	Aerobic	5	3	0.051937	0.057184	0.061837
...	...	...	...	...	...	...	...
151	24.0	Anaerobic	15	3	0.602570	0.616655	0.624859
152	24.0	Anaerobic	2.5	3	0.633187	0.660576	0.678301
153	24.0	Anaerobic	20	3	0.572835	0.597215	0.617822
154	24.0	Anaerobic	5	3	0.584576	0.642585	0.689661
155	24.0	Anaerobic	DMSO	3	0.618712	0.655811	0.696756

156 rows × 7 columns

2.5 Adding Data by Aggregation#

Previously, we just added the is_stationary variable based on a guess. We could also do it more programmatically. We do it here in two steps:

Adding a column indicating the maximum OD600 for that condition-sfn-replicate, minus a tolerance
Conditioning on that new column

df["stationary_od"] = df.groupby(["condition", "sfn_conc_mumolar", "replicate"])[
    "bact_growth_od600"
].transform(lambda col: col.max() - 0.1)
df["is_stationary"] = np.where(
    df["bact_growth_od600"] >= df["stationary_od"], True, False
)
df

	condition	sfn_conc_mumolar	replicate	time_h	bact_growth_od600	is_stationary	stationary_od
0	Anaerobic	DMSO	1	0.0	0.053628	False	0.556991
1	Anaerobic	DMSO	1	2.0	0.034735	False	0.556991
2	Anaerobic	DMSO	1	4.0	0.137467	False	0.556991
3	Anaerobic	DMSO	1	6.0	0.300051	False	0.556991
4	Anaerobic	DMSO	1	8.0	0.533996	False	0.556991
...	...	...	...	...	...	...	...
463	Aerobic	2.5	3	16.0	0.943037	True	0.856014
464	Aerobic	2.5	3	18.0	0.956014	True	0.856014
465	Aerobic	2.5	3	20.0	0.945902	True	0.856014
466	Aerobic	2.5	3	22.0	0.908015	True	0.856014
467	Aerobic	2.5	3	24.0	0.917409	True	0.856014

468 rows × 7 columns

That tolerance could also be calculated in a similar way, e.g. based on the standard deviation (we leave that as an exercise if you want to try it out yourself).

# Your turn

There are many more useful transformations, but we end here by having shown some that you could already use in more advanced plots.

Pandas (for data to plot)

Contents

Pandas (for data to plot)#

1. DATA I/O (Input/Output)#

1. Read Data#

1.2 Write it to a File#

2. Data Manipulation#

2.1 Rename#

2.2 Filtering by Conditions#

2.3 Adding Data By Conditions#

2.4 Aggregating Data#

2.5 Adding Data by Aggregation#