Transform cryptic array indexing into intuitive, self-documenting code using labeled data structures. Make your analysis as clear as your scientific thinking.
The Problem Every Scientist Faces
You want to select “high temperature, medium pH samples from the January batch.”
In your head, this takes 2 seconds. In code, it becomes:
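```python
# The painful way
temp_mask = (data[:, 2] > 65) & (data[:, 2] < 85)
pH_mask = (data[:, 3] > 6.5) & (data[:, 3] < 7.5)
date_mask = data[:, 0] == 'Jan'
result = data[temp_mask & pH_mask & date_mask, :]
# Wait... which column was pH again?
```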
Why does selecting scientific data require cryptic array indexing?
The gap between how scientists think and how we’re forced to code is killing productivity. You shouldn’t need to remember that “temperature is column 2” or decode sample_array[mask1 & mask2, :] six months later.
This post shows how to write code that matches your scientific intuition - where selecting data is as natural as thinking about it.
The Numpy Problem: Arrays Without Meaning
Python’s built-in lists are slow, so we use Numpy. But Numpy arrays are just numbers - they don’t know what those numbers mean.
Let’s see this problem in action with real scientific data:
I am going to load data from the paper “Automated Gold Nanorod Spectral Morphology Analysis Pipeline” by Gleason et al. (2024).
```python
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import os

# Keep the working directory as-is since data file is local
print(f"Current working directory: {os.getcwd()}")
```

```
Current working directory: /home/sjoerd/projects/science-and-machines-workspace/website/data-driven-operations/python-for-experiments/write-code-the-way-you-think
```
We load quantum simulations of UV/Vis absorption spectra, obtained from a grid of simulations of gold nanorods. The data is stored in a Numpy array, which is a common format for scientific data.
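Load the data with Numpy:

```python
SIMULATED_PATH = Path("main_profiles.npy")
simulated_np = np.load(SIMULATED_PATH)
print(f"Simulated shape: {simulated_np.shape}")
```

This loads one large array of meaningless numbers without context. The array contains a grid of simulations, but Numpy doesn't know this represents:

```
46 diameter values (5-50 nm)
491 length values (10-500 nm)
1299 wavelength points (401-1699 nm)
```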
Here’s the problem: This critical scientific information was lost when the data was saved to Numpy. I had to contact the authors to reconstruct what the numbers meant.
This is the gap: Your scientific understanding vs. what your code can express. With Numpy it is a lot of work to store meaningful context. We easily lose it.
The Numpy nightmare: Every time we manipulate `simulated_np`, we must manually track coordinates. Miss one step and we lose the scientific meaning forever.
Here's the fragile manual approach:
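```python
simulated_length_coordinates = np.arange(10, 501, 1)        # 491 values
simulated_diameter_coordinates = np.arange(5, 51, 1)        # 46 values
simulated_wavelength_coordinates = np.arange(401, 1700, 1)  # 1299 values

# Cross product of simulated diameter and length
lengths, diameters = np.meshgrid(
    simulated_length_coordinates,
    simulated_diameter_coordinates,
)
print(lengths)
print(diameters)
```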
Now we have three separate arrays to keep synchronized: `simulated_np`, `lengths`, `diameters`.
The problem: In every manipulation, we must manually coordinate all arrays. One mistake = lost scientific context.
The Solution: Code That Maps to Science
For chemistry, biology, and medicine we often need to:

- Represent data on regular and irregular grids
- Group and aggregate data
- Work with labels
- Work with coordinates
Grid structures are useful for representing experimental data because they mirror the structure of our experiments. The same data may be represented in different ways, each highlighting different aspects.
Three Ways to Represent Grid Data
The following grid representations cover most data sets in science and engineering. They occur in chemistry, biology, materials science, physics and medicine. Each has its own strengths and weaknesses.
1. Cartesian Product Data
Organized as products of dimensions, regardless of spacing regularity:
- **Structure**: Data organized as combinations (Cartesian product) of coordinate values
- **Use case**: Simulated data, gridded measurements, image data
- **Advantages**: Intuitive slicing, fast indexing, easy to store, natural for visualization
- **Disadvantages**: Wastes space if many combinations are missing/invalid, forces rectangular structure
**Example**: Consider reaction yield measurements across different conditions:
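- **Temperature**: `[50°C, 75°C, 100°C]` - regular 25° spacing
- **pH**: `[6.0, 6.5, 7.2, 8.0]` - irregular spacing

The Cartesian product means you measure yield at every combination:

```
        pH:   6.0   6.5   7.2   8.0
Temp:
50°C          78%   82%   85%   79%
75°C          85%   89%   92%   86%
100°C         79%   83%   88%   82%
```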
That’s 3×4 = 12 experiments forming a complete rectangular grid. The spacing doesn’t need to be regular - you still measure every possible (temperature, pH) combination.
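As a minimal sketch (variable names are illustrative), this yield grid maps directly onto a labeled 2D array using xarray, which we introduce in detail below:

```python
import numpy as np
import xarray as xr

# Hypothetical names: the 3x4 yield grid from the table above
yield_pct = xr.DataArray(
    np.array([[78, 82, 85, 79],
              [85, 89, 92, 86],
              [79, 83, 88, 82]], dtype=float),
    coords={"temperature": [50, 75, 100], "pH": [6.0, 6.5, 7.2, 8.0]},
    dims=["temperature", "pH"],
    name="yield",
)

# Every (temperature, pH) combination exists, so slicing stays intuitive
print(yield_pct.sel(temperature=75, pH=7.2).item())  # 92.0
```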
2. Hierarchical Data
For data with natural nested structure that you want to group and analyze:
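- **Structure**: Pandas-style hierarchical indexing combining dimensions
- **Use case**: Hierarchical grouping, panel data, experiments with replicates
- **Advantages**: Suitable for high-level manipulation. Compact representation, familiar pandas workflows, efficient groupby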
- **Disadvantages**: Hierarchical indexing is more complex. Imposes a strong model on the data. Data model and data are tightly coupled, which makes it less suitable for storing or re-interpreting.

**Example**: Drug stability study with multiple batches and replicates:
You’re testing drug degradation across different storage conditions. Each condition has 3 batches, each batch has 4 replicates. The MultiIndex structure arises from the grouping.
**Original flat data:**

```
Storage Temp  Batch  Replicate  Stability (%)
4°C           A      1          98.5
4°C           A      2          98.3
4°C           A      3          98.7
4°C           A      4          98.1
4°C           B      1          97.9
4°C           B      2          98.2
25°C          A      1          94.1
25°C          A      2          93.8
25°C          B      1          93.2
```
**After `groupby(['storage_temp', 'batch']).mean()`:**

```
Index (storage_temp, batch)  Mean Stability (%)
(4°C, A)                     98.4
(4°C, B)                     98.0
(25°C, A)                    93.9
(25°C, B)                    93.2
```
The MultiIndex has two levels:

- **Level 0** (name: `'storage_temp'`): the levels are `['4°C', '25°C']`
- **Level 1** (name: `'batch'`): the levels are `['A', 'B']`

Each index entry is a tuple like `('4°C', 'A')`. The codes are integer positions `[0,0], [0,1], [1,0], [1,1]` that map to the levels.
```
Storage Temp  Batch  Replicate  Stability (%)
4°C           A      1          98.5
4°C           A      2          98.3
4°C           A      3          98.7
4°C           A      4          98.1
4°C           B      1          97.9
4°C           B      2          98.2
...           ...    ...        ...
25°C          C      4          89.1
```
Now grouping operations match your scientific thinking:
- `groupby('storage_temp')` → “Compare 4°C vs 25°C performance”
- `groupby('batch')` → “Is batch A consistently different from B and C?”
- `groupby(['storage_temp', 'batch'])` → “Each temperature-batch combination”
Perfect when your experimental design has natural hierarchies and you ask nested questions.
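Here is a minimal pandas sketch of the stability study above (values taken from the tables; the variable and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "storage_temp": ["4°C"] * 6 + ["25°C"] * 3,
    "batch":        ["A", "A", "A", "A", "B", "B", "A", "A", "B"],
    "replicate":    [1, 2, 3, 4, 1, 2, 1, 2, 1],
    "stability":    [98.5, 98.3, 98.7, 98.1, 97.9, 98.2, 94.1, 93.8, 93.2],
})

# Grouping by two columns yields a result indexed by a MultiIndex
mean_stability = df.groupby(["storage_temp", "batch"])["stability"].mean()
print(mean_stability)
print(mean_stability.index.levels)  # the unique values per level
```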
3. Irregular Grids
Most experimental grids cannot be decomposed as a Cartesian product: some combinations do not make physical sense, or measuring every combination would be too expensive.
- **Structure**: Flat arrays with coordinate metadata
- **Use case**: Field measurements, scattered sensor data, arbitrary point clouds
- **Advantages**: NetCDF compatible, simple mental model, maximum flexibility
- **Disadvantages**: Less compact for structured data
**Example**: Enzyme activity measurements from different labs at arbitrary conditions:
You’re collecting enzyme kinetics data from collaborating labs. Each lab tested at whatever conditions were convenient - no coordinated experimental design:
```
Lab        Temperature  pH   Buffer     Activity (U/mL)
Tokyo      37°C         7.4  Tris       245
Berlin     25°C         6.8  Phosphate  189
Austin     42°C         7.2  HEPES      267
Tokyo      30°C         7.0  Tris       198
São Paulo  35°C         8.1  Bicine     156
Austin     37°C         7.4  HEPES      251
Berlin     20°C         7.5  Phosphate  167
...        ...          ...  ...        ...
```
Perfect for scattered, real-world data where you can’t control the experimental grid. Each measurement has its own unique combination of conditions - no systematic structure, just flexible coordinate arrays that preserve all the context.
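As a minimal xarray sketch of the table above (variable and coordinate names are illustrative, not from any real dataset), each measurement becomes one point carrying its own coordinates:

```python
import numpy as np
import xarray as xr

# Hypothetical names: one entry per measurement, conditions as metadata
activity = xr.DataArray(
    np.array([245.0, 189.0, 267.0, 198.0, 156.0, 251.0, 167.0]),
    dims=["measurement"],
    coords={
        "lab": ("measurement", ["Tokyo", "Berlin", "Austin", "Tokyo", "São Paulo", "Austin", "Berlin"]),
        "temperature": ("measurement", [37, 25, 42, 30, 35, 37, 20]),
        "pH": ("measurement", [7.4, 6.8, 7.2, 7.0, 8.1, 7.4, 7.5]),
        "buffer": ("measurement", ["Tris", "Phosphate", "HEPES", "Tris", "Bicine", "HEPES", "Phosphate"]),
    },
    name="activity",
)

# Filter on coordinate metadata instead of positional masks
tokyo = activity.where(activity.lab == "Tokyo", drop=True)
print(tokyo)
```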
XArray - Labeled Arrays for Science
The xarray package transforms how we work with scientific data by bridging the gap between how we think about experiments and how we code them. It solves several problems:
- Offers labeled arrays, so we know what each dimension represents
- Supports all three grid types: Cartesian, hierarchical, and irregular
- Provides intuitive selection and aggregation methods
- Integrates seamlessly with Numpy and Pandas workflows
Instead of managing separate arrays and hoping they stay synchronized, xarray stores data with its coordinates and metadata attached.
**Your arrays should know what they represent.**
Here is a selection example with XArray:
```python
# Numpy code
temp_mask = (data[:, 2] > 65) & (data[:, 2] < 85)
result = data[temp_mask, :]  # Which column was temperature again?

# XArray code
result = experiments.sel(temperature=slice(65, 85))  # Crystal clear
```
Because the array knows what your data represents, manipulation becomes self-documenting. You can select data using labels that match your scientific thinking, not cryptic array indices.
Cartesian grid
The simulated gold nanorod spectra are on a Cartesian grid. We have a regular grid of simulations across three dimensions: length, diameter, and wavelength.
This is how we convert the Numpy array to an XArray DataArray with named coordinates:
```python
import xarray as xr

xr.set_options(display_expand_attrs=False, display_expand_data=False);

# Create XArray DataArray with named coordinates
profiles = xr.DataArray(
    np.array(simulated_np),
    coords=[
        ("length", simulated_length_coordinates),
        ("diameter", simulated_diameter_coordinates),
        ("wavelength", simulated_wavelength_coordinates),
    ],
    dims=["length", "diameter", "wavelength"],
    name="simulated profiles",
)
print(f"XArray shape: {profiles.shape}")
profiles
```
You can click the data symbol to expand the array.
Metadata
The XArray DataArray has a name, which is useful for identification:
```python
profiles.name
```

```
'simulated profiles'
```
We can also add attributes:
```python
profiles.attrs["Subject"] = "Simulated gold nanorod UV/Vis spectra"
profiles.attrs["Medium"] = "Water"
profiles.attrs["Paper"] = "Automated Gold Nanorod Spectral Morphology Analysis Pipeline"
profiles.attrs["Authors"] = (
    "Gleason, Samuel P. and Dahl, Jakob C. and Elzouka, Mahmoud and Wang, Xingzhi "
    "and Byrne, Dana O. and Cho, Hannah and Gababa, Mumtaz and Prasher, Ravi S. "
    "and Lubner, Sean and Chan, Emory M. and Alivisatos, A. Paul"
)
profiles
```
Click on attributes to expand the attributes field.
Data Selection
We can select data based on coordinates with the .sel() method:
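```python
subset_length = profiles.sel(diameter=10, length=[30, 40, 50])
subset_length_diameter = profiles.sel(diameter=[10, 15, 20], length=[30, 40, 50])
subset_length
```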
XArray automatically updates the coordinates, and the metadata is preserved.
If we store this DataArray, the metadata is stored with it; keeping metadata separate is something that often goes wrong.
Plotting
Common tasks like plotting are a breeze with XArray:
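```python
subset_length.plot(col="length")
```

```python
subset_length_diameter.plot(col="length", row="diameter")
```

Aggregation
Aggregation over coordinates is very common, and part of XArray:

```python
max_absorption = profiles.max(dim='wavelength')
max_absorption
```

```python
max_absorption.plot()
```

Boolean expressions
Python inequalities are evaluated element-wise and return a boolean array.

```python
mask = max_absorption > 0
mask
```

Be careful: attributes are lost, as the meaning of the array has changed.

Non-Cartesian grids
Lab data rarely fits on Cartesian grids. It is too expensive to measure every combination, and it may not make physical sense either. This is an example of an irregular grid of measurements:

```python
# Example: scattered measurement points
lengths = [10, 15, 20, 25, 12, 18]  # Irregular
diameters = [5, 8, 12, 6, 9, 11]    # Not all combinations exist

scattered_profiles = xr.DataArray(
    np.random.rand(6, 1299),  # 6 scattered points, 1299 wavelengths
    coords=[
        ("measurement", range(6)),
        ("wavelength", np.arange(401, 1700)),
    ],
    dims=["measurement", "wavelength"],
).assign_coords(
    length=("measurement", lengths),
    diameter=("measurement", diameters),
)
scattered_profiles
```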
Where `profiles` uses a Cartesian grid, `scattered_profiles` uses flat arrays with metadata labels. It is like a key in a database: each measurement can be associated with coordinates.
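```
scattered_profiles - Uses coordinate arrays:
length   (measurement) int64 10 15 20 25 12 18  # <-- Simple array
diameter (measurement) int64 5 8 12 6 9 11      # <-- Simple array
```

When to Group Data
MultiIndex is most valuable when your data has natural hierarchical structure:

1. Panel data: Multiple time series (e.g., stock prices by company and date)
2. Experimental replicates: Multiple measurements per condition
3. Hierarchical sampling: Nested geographic regions
4. Efficient groupby operations: When you frequently group by combined dimensions

I like to see it as an intermediate representation for analysis purposes.

Example: Stacking for Hierarchical Analysis
The `stack` operation creates a MultiIndex for hierarchical operations:

```python
profiles_stacked = profiles.stack(size_combo=("length", "diameter"))
print(f"Stacked shape: {profiles_stacked.shape}")
print(f"MultiIndex levels: {profiles_stacked.coords['size_combo']}")
```

Now we have a Pandas-style multi-index with the two dimensions `length` and `diameter` stacked together.

```python
profiles_stacked
```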
Selecting on the multi-index
Operations become a bit more involved with MultiIndex:
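```python
profiles_stacked.sel(size_combo=(50.0, 15.0))
```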
Filtering non-zero profiles
For the nanorod simulations, some combinations were not computed. These are filled with zeros. We can remove them after stacking the data:
```python
# Take max over wavelength dimension
max_absorption = profiles_stacked.max(dim='wavelength')
print(f"Max absorption shape: {max_absorption.shape}")

# Filter profiles with max > 0
valid_profiles = profiles_stacked.where(profiles_stacked.max(dim='wavelength') > 0, drop=True)
print(f"Original profiles: {profiles_stacked.shape}")
print(f"Valid profiles (max > 0): {valid_profiles.shape}")
```
```
Max absorption shape: (22586,)
Original profiles: (1299, 22586)
Valid profiles (max > 0): (1299, 10747)
```
Now we have a MultiIndex DataArray with only the valid profiles that have a maximum absorption greater than zero.
Irregular Grid
The MultiIndex is only useful for specific analyses. To get rid of it, we flatten the array:
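```python
valid_profiles_flat = valid_profiles.reset_index("size_combo")
valid_profiles_flat
```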
Now we have a non-Cartesian grid with a single coordinate array `size_combo` that combines the `length` and `diameter` coordinates.
Let's slice `valid_profiles_flat`:
```python
# Select by the flat integer identifier size_combo (not by length/diameter)
valid_profiles_flat.sel(size_combo=[50, 51])
```
Selection no longer acts on the coordinates; it acts on the integer identifier `size_combo`. This is a flat array with the coordinates combined, which is useful for storage and serialization.
The coordinates are stored as properties of the DataArray, so we can access them:
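```python
valid_profiles_flat.length
```

The coordinate is itself a DataArray. We can use it for subsetting:

```python
valid_profiles_flat.sel(size_combo=valid_profiles_flat.length == 100)
```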
Note that `sel` expects coordinate values, not indexes or boolean arrays.
We can also use boolean expressions to filter the profiles:
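```python
medium_rods = valid_profiles_flat.where(
    (valid_profiles_flat.length >= 50) & (valid_profiles_flat.length <= 100),
    drop=True,
)
medium_rods
```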
Storage with NetCDF
NetCDF (Network Common Data Form) is the standard format for scientific data storage:
- **Self-describing**: Metadata and data stored together
- **Cross-platform**: Works across different systems and languages
- **Efficient**: Optimized for large multidimensional arrays
- **Standardized**: CF conventions for climate/atmospheric data
NetCDF is perfect for storing scientific data, as it preserves the structure and metadata of your experiments.
```python
# ✅ Regular grids - work fine
profiles.to_netcdf("regular_grid.nc")

# ❌ MultiIndex - not supported
# valid_profiles.to_netcdf("multiindex.nc")  # Would fail!

# ✅ Coordinate arrays - work fine
simulated_xarray_path = "simulated_profiles.nc"
valid_profiles_flat.to_netcdf(simulated_xarray_path)
print("Stored successfully as NetCDF")
```

```
Stored successfully as NetCDF
```
Beware: MultiIndex cannot be serialized to NetCDF. Use reset_index() to convert to coordinate arrays for storage.
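A quick sketch of the round trip, reading back the file written above:

```python
# Read the flat DataArray back; the coordinates come along automatically
restored = xr.open_dataarray("simulated_profiles.nc")
print(restored.dims)
print(restored.coords)
```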
From Cryptic Code to Scientific Intuition
We started with the painful reality: scientific thinking doesn’t match programming reality. Selecting “high temperature samples” becomes cryptic array indexing that nobody understands six months later.
With XArray’s labeled arrays, we transformed this:
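```python
# Before: cryptic and fragile
result = data[temp_mask & pH_mask & date_mask, :]
```

Into this: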
```python
# After: matches scientific thinking
result = experiments.sel(temperature='high', pH='medium', batch='January')
```
Analysis, plotting, and storing and loading data all become simpler with XArray.
The 10x Productivity Gain
When your code thinks like a scientist:
- **Coming up with a hypothesis to test**: 30 seconds
- **Code to write**: 3 minutes (not 30 minutes)
- **Understanding old code**: Instant (not hours of detective work)
- **Sharing with colleagues**: Copy-paste ready
Stop fighting your tools. Start coding the way you think.
Your experiments are too valuable to lose in translation between scientific thinking and programming reality. Make your data as intuitive as your hypotheses.
References
Gleason, Samuel P., Jakob C. Dahl, Mahmoud Elzouka, Xingzhi Wang, Dana O. Byrne, Hannah Cho, Mumtaz Gababa, et al. 2024. “Automated Gold Nanorod Spectral Morphology Analysis Pipeline.” ACS Nano 18 (51): 34646–55. https://doi.org/10.1021/acsnano.4c09753.