Transform cryptic array indexing into intuitive, self-documenting code using labeled data structures. Make your analysis as clear as your scientific thinking.
The Problem Every Scientist Faces
You want to select “high temperature, medium pH samples from the January batch.”
In your head, this takes 2 seconds. In code, it becomes:
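```python
# The painful way
temp_mask = (data[:, 2] > 65) & (data[:, 2] < 85)
pH_mask = (data[:, 3] > 6.5) & (data[:, 3] < 7.5)
date_mask = data[:, 0] == 'Jan'
result = data[temp_mask & pH_mask & date_mask, :]
# Wait... which column was pH again?
```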
Why does selecting scientific data require cryptic array indexing?
The gap between how scientists think and how we’re forced to code is killing productivity. You shouldn’t need to remember that “temperature is column 2” or decode sample_array[mask1 & mask2, :] six months later.
This post shows how to write code that matches your scientific intuition - where selecting data is as natural as thinking about it.
The Numpy Problem: Arrays Without Meaning
Python’s built-in lists are slow, so we use Numpy. But Numpy arrays are just numbers - they don’t know what those numbers mean.
Let’s see this problem in action with real scientific data:
I am going to load data from the paper “Automated Gold Nanorod Spectral Morphology Analysis Pipeline” by Gleason et al. (2024).
```python
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import os

# Keep the working directory as-is since data file is local
print(f"Current working directory: {os.getcwd()}")
```

```
Current working directory: /home/sjoerd/projects/science-and-machines-workspace/website/data-driven-operations/python-for-experiments/write-code-the-way-you-think
```
We load quantum simulations of UV/Vis absorption spectra, obtained from a grid of simulations of gold nanorods. The data is stored in a Numpy array, which is a common format for scientific data.
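Load the data with Numpy:

```python
SIMULATED_PATH = Path("main_profiles.npy")
simulated_np = np.load(SIMULATED_PATH)
print(f"Simulated shape: {simulated_np.shape}")
```

This loads one large array of meaningless numbers without context. The array contains a grid of simulations, but Numpy doesn't know this represents:

```
46 diameter values (5-50 nm)
491 length values (10-500 nm)
1299 wavelength points (401-1699 nm)
```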
Here’s the problem: This critical scientific information was lost when the data was saved to Numpy. I had to contact the authors to reconstruct what the numbers meant.
This is the gap: Your scientific understanding vs. what your code can express. With Numpy it is a lot of work to store meaningful context. We easily lose it.
The Numpy nightmare: Every time we manipulate `simulated_np`, we must manually track coordinates. Miss one step and we lose the scientific meaning forever.
Here's the fragile manual approach:
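```python
simulated_length_coordinates = np.arange(10, 501, 1)        # 491 values
simulated_diameter_coordinates = np.arange(5, 51, 1)        # 46 values
simulated_wavelength_coordinates = np.arange(401, 1700, 1)  # 1299 values

# Cross product of simulated diameter and length
lengths, diameters = np.meshgrid(
    simulated_length_coordinates,
    simulated_diameter_coordinates,
)
print(lengths)
print(diameters)
```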
Now we have three separate arrays to keep synchronized: `simulated_np`, `lengths`, `diameters`.
The problem: In every manipulation, we must manually coordinate all arrays. One mistake = lost scientific context.
The Solution: Code That Maps to Science
For chemistry, biology, and medicine we often need to:

- Represent data on regular and irregular grids
- Group and aggregate data
- Work with labels
- Work with coordinates
Grid structures are useful for representing experimental data because they mirror the structure of our experiments. The same data may be represented in different ways, each highlighting different aspects.
Three Ways to Represent Grid Data
The following grid representations cover most data sets in science and engineering. They occur in chemistry, biology, materials science, physics and medicine. Each has its own strengths and weaknesses.
1. Cartesian Product Data
Organized as products of dimensions, regardless of spacing regularity:
- **Structure**: Data organized as combinations (Cartesian product) of coordinate values
- **Use case**: Simulated data, gridded measurements, image data
- **Advantages**: Intuitive slicing, fast indexing, easy to store, natural for visualization
- **Disadvantages**: Wastes space if many combinations are missing/invalid, forces rectangular structure
**Example**: Consider reaction yield measurements across different conditions:
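- **Temperature**: `[50°C, 75°C, 100°C]` - regular 25° spacing
- **pH**: `[6.0, 6.5, 7.2, 8.0]` - irregular spacing

The Cartesian product means you measure yield at every combination:

```
        pH:   6.0   6.5   7.2   8.0
Temp:
50°C          78%   82%   85%   79%
75°C          85%   89%   92%   86%
100°C         79%   83%   88%   82%
```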
That’s 3×4 = 12 experiments forming a complete rectangular grid. The spacing doesn’t need to be regular - you still measure every possible (temperature, pH) combination.
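As a minimal sketch (variable names are illustrative), this yield grid maps directly onto a labeled 2D array using xarray, which we introduce in detail below:

```python
import numpy as np
import xarray as xr

# Hypothetical names: the 3x4 yield grid from the table above
yield_pct = xr.DataArray(
    np.array([[78, 82, 85, 79],
              [85, 89, 92, 86],
              [79, 83, 88, 82]], dtype=float),
    coords={"temperature": [50, 75, 100], "pH": [6.0, 6.5, 7.2, 8.0]},
    dims=["temperature", "pH"],
    name="yield",
)

# Every (temperature, pH) combination exists, so slicing stays intuitive
print(yield_pct.sel(temperature=75, pH=7.2).item())  # 92.0
```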
2. Hierarchical Data
For data with natural nested structure that you want to group and analyze:
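- **Structure**: Pandas-style hierarchical indexing combining dimensions
- **Use case**: Hierarchical grouping, panel data, experiments with replicates
- **Advantages**: Suitable for high-level manipulation. Compact representation, familiar pandas workflows, efficient groupby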
- **Disadvantages**: Hierarchical indexing is more complex. Imposes a strong model on the data. Data model and data are tightly coupled, which makes it less suitable for storing or re-interpreting.

**Example**: Drug stability study with multiple batches and replicates:
You’re testing drug degradation across different storage conditions. Each condition has 3 batches, each batch has 4 replicates. The MultiIndex structure arises from the grouping.
**Original flat data:**

```
Storage Temp  Batch  Replicate  Stability (%)
4°C           A      1          98.5
4°C           A      2          98.3
4°C           A      3          98.7
4°C           A      4          98.1
4°C           B      1          97.9
4°C           B      2          98.2
25°C          A      1          94.1
25°C          A      2          93.8
25°C          B      1          93.2
```
**After `groupby(['storage_temp', 'batch']).mean()`:**

```
Index (storage_temp, batch)  Mean Stability (%)
(4°C, A)                     98.4
(4°C, B)                     98.0
(25°C, A)                    93.9
(25°C, B)                    93.2
```
The MultiIndex has two levels:

- **Level 0** (name: `'storage_temp'`): the levels are `['4°C', '25°C']`
- **Level 1** (name: `'batch'`): the levels are `['A', 'B']`

Each index entry is a tuple like `('4°C', 'A')`. The codes are integer positions `[0,0], [0,1], [1,0], [1,1]` that map to the levels.
```
Storage Temp  Batch  Replicate  Stability (%)
4°C           A      1          98.5
4°C           A      2          98.3
4°C           A      3          98.7
4°C           A      4          98.1
4°C           B      1          97.9
4°C           B      2          98.2
...           ...    ...        ...
25°C          C      4          89.1
```
Now grouping operations match your scientific thinking:
- `groupby('storage_temp')` → “Compare 4°C vs 25°C performance”
- `groupby('batch')` → “Is batch A consistently different from B and C?”
- `groupby(['storage_temp', 'batch'])` → “Each temperature-batch combination”
Perfect when your experimental design has natural hierarchies and you ask nested questions.
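Here is a minimal pandas sketch of the stability study above (values taken from the tables; the variable and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "storage_temp": ["4°C"] * 6 + ["25°C"] * 3,
    "batch":        ["A", "A", "A", "A", "B", "B", "A", "A", "B"],
    "replicate":    [1, 2, 3, 4, 1, 2, 1, 2, 1],
    "stability":    [98.5, 98.3, 98.7, 98.1, 97.9, 98.2, 94.1, 93.8, 93.2],
})

# Grouping by two columns yields a result indexed by a MultiIndex
mean_stability = df.groupby(["storage_temp", "batch"])["stability"].mean()
print(mean_stability)
print(mean_stability.index.levels)  # the unique values per level
```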
3. Irregular Grids
Most experimental grids cannot be decomposed as a Cartesian product: some combinations do not make physical sense, or measuring every combination would be too expensive.
- **Structure**: Flat arrays with coordinate metadata
- **Use case**: Field measurements, scattered sensor data, arbitrary point clouds
- **Advantages**: NetCDF compatible, simple mental model, maximum flexibility
- **Disadvantages**: Less compact for structured data
**Example**: Enzyme activity measurements from different labs at arbitrary conditions:
You’re collecting enzyme kinetics data from collaborating labs. Each lab tested at whatever conditions were convenient - no coordinated experimental design:
```
Lab        Temperature  pH   Buffer     Activity (U/mL)
Tokyo      37°C         7.4  Tris       245
Berlin     25°C         6.8  Phosphate  189
Austin     42°C         7.2  HEPES      267
Tokyo      30°C         7.0  Tris       198
São Paulo  35°C         8.1  Bicine     156
Austin     37°C         7.4  HEPES      251
Berlin     20°C         7.5  Phosphate  167
...        ...          ...  ...        ...
```
Perfect for scattered, real-world data where you can’t control the experimental grid. Each measurement has its own unique combination of conditions - no systematic structure, just flexible coordinate arrays that preserve all the context.
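As a minimal xarray sketch of the table above (variable and coordinate names are illustrative, not from any real dataset), each measurement becomes one point carrying its own coordinates:

```python
import numpy as np
import xarray as xr

# Hypothetical names: one entry per measurement, conditions as metadata
activity = xr.DataArray(
    np.array([245.0, 189.0, 267.0, 198.0, 156.0, 251.0, 167.0]),
    dims=["measurement"],
    coords={
        "lab": ("measurement", ["Tokyo", "Berlin", "Austin", "Tokyo", "São Paulo", "Austin", "Berlin"]),
        "temperature": ("measurement", [37, 25, 42, 30, 35, 37, 20]),
        "pH": ("measurement", [7.4, 6.8, 7.2, 7.0, 8.1, 7.4, 7.5]),
        "buffer": ("measurement", ["Tris", "Phosphate", "HEPES", "Tris", "Bicine", "HEPES", "Phosphate"]),
    },
    name="activity",
)

# Filter on coordinate metadata instead of positional masks
tokyo = activity.where(activity.lab == "Tokyo", drop=True)
print(tokyo)
```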
XArray - Labeled Arrays for Science
The xarray package transforms how we work with scientific data by bridging the gap between how we think about experiments and how we code them. It solves several problems:
- Offers labeled arrays, so we know what each dimension represents
- Supports all three grid types: Cartesian, hierarchical, and irregular
- Provides intuitive selection and aggregation methods
- Integrates seamlessly with Numpy and Pandas workflows
Instead of managing separate arrays and hoping they stay synchronized, xarray stores data with its coordinates and metadata attached.
**Your arrays should know what they represent.**
Here is a selection example with XArray:
```python
# Numpy code
temp_mask = (data[:, 2] > 65) & (data[:, 2] < 85)
result = data[temp_mask, :]  # Which column was temperature again?

# XArray code
result = experiments.sel(temperature=slice(65, 85))  # Crystal clear
```
Because the array knows what your data represents, manipulation becomes self-documenting. You can select data using labels that match your scientific thinking, not cryptic array indices.
Cartesian grid
The simulated gold nanorod spectra are on a Cartesian grid. We have a regular grid of simulations across three dimensions: length, diameter, and wavelength.
This is how we convert the Numpy array to an XArray DataArray with named coordinates:
```python
import xarray as xr

xr.set_options(display_expand_attrs=False, display_expand_data=False);

# Create XArray DataArray with named coordinates
profiles = xr.DataArray(
    np.array(simulated_np),
    coords=[
        ("length", simulated_length_coordinates),
        ("diameter", simulated_diameter_coordinates),
        ("wavelength", simulated_wavelength_coordinates),
    ],
    dims=["length", "diameter", "wavelength"],
    name="simulated profiles",
)
print(f"XArray shape: {profiles.shape}")
profiles
```
You can click the data symbol to expand the array.
Metadata
The XArray DataArray has a name, which is useful for identification:
```python
profiles.name
```

```
'simulated profiles'
```
We can also add attributes:
```python
profiles.attrs["Subject"] = "Simulated gold nanorod UV/Vis spectra"
profiles.attrs["Medium"] = "Water"
profiles.attrs["Paper"] = "Automated Gold Nanorod Spectral Morphology Analysis Pipeline"
profiles.attrs["Authors"] = (
    "Gleason, Samuel P. and Dahl, Jakob C. and Elzouka, Mahmoud and Wang, Xingzhi "
    "and Byrne, Dana O. and Cho, Hannah and Gababa, Mumtaz and Prasher, Ravi S. "
    "and Lubner, Sean and Chan, Emory M. and Alivisatos, A. Paul"
)
profiles
```
Click on attributes to expand the attributes field.
Data Selection
We can select data based on coordinates with the .sel() method:
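```python
subset_length = profiles.sel(diameter=10, length=[30, 40, 50])
subset_length_diameter = profiles.sel(diameter=[10, 15, 20], length=[30, 40, 50])
subset_length
```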
XArray automatically updates the coordinates, and the metadata is preserved.
If we store this DataArray, the metadata is stored with it; keeping metadata separate is something that often goes wrong.
Plotting
Common tasks like plotting are a breeze with XArray:
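```python
subset_length.plot(col="length")
```

```python
subset_length_diameter.plot(col="length", row="diameter")
```

Aggregation
Aggregation over coordinates is very common, and part of XArray:

```python
max_absorption = profiles.max(dim='wavelength')
max_absorption
```

```python
max_absorption.plot()
```

Boolean expressions
Python inequalities are evaluated element-wise and return a boolean array.

```python
mask = max_absorption > 0
mask
```

Be careful: attributes are lost, as the meaning of the array has changed.

Non-Cartesian grids
Lab data rarely fits on Cartesian grids. It is too expensive to measure every combination, and it may not make physical sense either. This is an example of an irregular grid of measurements:

```python
# Example: scattered measurement points
lengths = [10, 15, 20, 25, 12, 18]  # Irregular
diameters = [5, 8, 12, 6, 9, 11]    # Not all combinations exist

scattered_profiles = xr.DataArray(
    np.random.rand(6, 1299),  # 6 scattered points, 1299 wavelengths
    coords=[
        ("measurement", range(6)),
        ("wavelength", np.arange(401, 1700)),
    ],
    dims=["measurement", "wavelength"],
).assign_coords(
    length=("measurement", lengths),
    diameter=("measurement", diameters),
)
scattered_profiles
```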
Where `profiles` uses a Cartesian grid, `scattered_profiles` uses flat arrays with metadata labels. It is like a key in a database: each measurement can be associated with coordinates.
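```
scattered_profiles - Uses coordinate arrays:
length   (measurement) int64 10 15 20 25 12 18  # <-- Simple array
diameter (measurement) int64 5 8 12 6 9 11      # <-- Simple array
```

When to Group Data
MultiIndex is most valuable when your data has natural hierarchical structure:

1. Panel data: Multiple time series (e.g., stock prices by company and date)
2. Experimental replicates: Multiple measurements per condition
3. Hierarchical sampling: Nested geographic regions
4. Efficient groupby operations: When you frequently group by combined dimensions

I like to see it as an intermediate representation for analysis purposes.

Example: Stacking for Hierarchical Analysis
The `stack` operation creates a MultiIndex for hierarchical operations:

```python
profiles_stacked = profiles.stack(size_combo=("length", "diameter"))
print(f"Stacked shape: {profiles_stacked.shape}")
print(f"MultiIndex levels: {profiles_stacked.coords['size_combo']}")
```

Now we have a Pandas-style multi-index with the two dimensions `length` and `diameter` stacked together.

```python
profiles_stacked
```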
Selecting on the multi-index
Operations become a bit more involved with MultiIndex:
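```python
profiles_stacked.sel(size_combo=(50.0, 15.0))
```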
Filtering non-zero profiles
For the nanorod simulations, some combinations were not computed. These are filled with zeros. We can remove them after stacking the data:
```python
# Take max over wavelength dimension
max_absorption = profiles_stacked.max(dim='wavelength')
print(f"Max absorption shape: {max_absorption.shape}")

# Filter profiles with max > 0
valid_profiles = profiles_stacked.where(profiles_stacked.max(dim='wavelength') > 0, drop=True)
print(f"Original profiles: {profiles_stacked.shape}")
print(f"Valid profiles (max > 0): {valid_profiles.shape}")
```
```
Max absorption shape: (22586,)
Original profiles: (1299, 22586)
Valid profiles (max > 0): (1299, 10747)
```
Now we have a MultiIndex DataArray with only the valid profiles that have a maximum absorption greater than zero.
Irregular Grid
The MultiIndex is only useful for specific analyses. To get rid of it, we flatten the array:
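```python
valid_profiles_flat = valid_profiles.reset_index("size_combo")
valid_profiles_flat
```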
Now we have a non-Cartesian grid with a single coordinate array `size_combo` that combines the `length` and `diameter` coordinates.
Let's slice `valid_profiles_flat`:
```python
# Select by the flat integer identifier size_combo (not by length/diameter)
valid_profiles_flat.sel(size_combo=[50, 51])
```
Selection no longer acts on the coordinates; it acts on the integer identifier `size_combo`. This is a flat array with the coordinates combined, which is useful for storage and serialization.
The coordinates are stored as properties of the DataArray, so we can access them:
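```python
valid_profiles_flat.length
```

The coordinate is itself a DataArray. We can use it for subsetting:

```python
valid_profiles_flat.sel(size_combo=valid_profiles_flat.length == 100)
```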
Note that `sel` expects coordinate values, not indexes or boolean arrays.
We can also use boolean expressions to filter the profiles:
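```python
medium_rods = valid_profiles_flat.where(
    (valid_profiles_flat.length >= 50) & (valid_profiles_flat.length <= 100),
    drop=True,
)
medium_rods
```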
Storage with NetCDF
NetCDF (Network Common Data Form) is the standard format for scientific data storage:
- **Self-describing**: Metadata and data stored together
- **Cross-platform**: Works across different systems and languages
- **Efficient**: Optimized for large multidimensional arrays
- **Standardized**: CF conventions for climate/atmospheric data
NetCDF is perfect for storing scientific data, as it preserves the structure and metadata of your experiments.
```python
# ✅ Regular grids - work fine
profiles.to_netcdf("regular_grid.nc")

# ❌ MultiIndex - not supported
# valid_profiles.to_netcdf("multiindex.nc")  # Would fail!

# ✅ Coordinate arrays - work fine
simulated_xarray_path = "simulated_profiles.nc"
valid_profiles_flat.to_netcdf(simulated_xarray_path)
print("Stored successfully as NetCDF")
```

```
Stored successfully as NetCDF
```
Beware: MultiIndex cannot be serialized to NetCDF. Use reset_index() to convert to coordinate arrays for storage.
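A quick sketch of the round trip, reading back the file written above:

```python
# Read the flat DataArray back; the coordinates come along automatically
restored = xr.open_dataarray("simulated_profiles.nc")
print(restored.dims)
print(restored.coords)
```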
From Cryptic Code to Scientific Intuition
We started with the painful reality: scientific thinking doesn’t match programming reality. Selecting “high temperature samples” becomes cryptic array indexing that nobody understands six months later.
With XArray’s labeled arrays, we transformed this:
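```python
# Before: cryptic and fragile
result = data[temp_mask & pH_mask & date_mask, :]
```

Into this: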
```python
# After: matches scientific thinking
result = experiments.sel(temperature='high', pH='medium', batch='January')
```
Analysis, plotting, and storing and loading data all become simpler with XArray.
The 10x Productivity Gain
When your code thinks like a scientist:
- **Coming up with a hypothesis to test**: 30 seconds
- **Code to write**: 3 minutes (not 30 minutes)
- **Understanding old code**: Instant (not hours of detective work)
- **Sharing with colleagues**: Copy-paste ready
Stop fighting your tools. Start coding the way you think.
Your experiments are too valuable to lose in translation between scientific thinking and programming reality. Make your data as intuitive as your hypotheses.
References
Gleason, Samuel P., Jakob C. Dahl, Mahmoud Elzouka, Xingzhi Wang, Dana O. Byrne, Hannah Cho, Mumtaz Gababa, et al. 2024. “Automated Gold Nanorod Spectral Morphology Analysis Pipeline.” ACS Nano 18 (51): 34646–55. https://doi.org/10.1021/acsnano.4c09753.