Filtering Data
Once you’ve created an EnzymeML document and filled it with proteins, small molecules, reactions, and measurements, you’ll often need to find specific entities within that document. Perhaps you want to find all proteins from a particular organism, or all reversible reactions, or all measurements taken at a specific temperature. Rather than manually looking through lists of entities, PyEnzyme provides filtering capabilities that let you search for entities based on their attributes.
Think of filtering as asking questions about your data: “Which proteins come from humans?” or “Which reactions involve this particular substrate?” PyEnzyme’s filter methods answer these questions by returning just the entities that match your criteria, so you get clean, readable code instead of hand-written loops over lists.
How Filtering Works
PyEnzyme provides filter methods for all major entity collections in an EnzymeMLDocument. The design is consistent and predictable: each type of entity (proteins, small molecules, reactions, measurements, etc.) has a corresponding filter method that follows the same pattern. This consistency makes filtering easy to learn: once you understand how to filter one type of entity, you can apply the same approach to all others.
The general pattern:
filtered_entities = enzmldoc.filter_<collection>(attribute1=value1, attribute2=value2, ...)
Here’s how to read this pattern:
- enzmldoc is your EnzymeML document
- filter_<collection> is the filter method for a specific entity type (e.g., filter_proteins, filter_reactions, filter_measurements)
- You provide keyword arguments where the keyword is an attribute name (like name, organism, or reversible) and the value is what you’re searching for
- The method returns a list containing all entities that match your criteria
Understanding the matching behavior:
When you provide multiple attributes to filter on, PyEnzyme combines them with an “AND” condition: an entity must match all the criteria you specify to be included in the results. For example, if you ask for proteins that are from “Homo sapiens” AND in vessel “v1”, you’ll only get proteins that satisfy both conditions.
Let’s see this in practice:
import pyenzyme as pe

# Load an existing document
enzmldoc = pe.read_enzymeml("experiment.json")

# Filter proteins by name
alcohol_dehydrogenases = enzmldoc.filter_proteins(name="Alcohol dehydrogenase")

# Filter by multiple attributes (AND condition)
human_proteins = enzmldoc.filter_proteins(
    organism="Homo sapiens",
    vessel_id="v1"
)

# Filter reactions by reversibility
reversible_reactions = enzmldoc.filter_reactions(reversible=True)

What each example does:
- Single attribute filtering: filter_proteins(name="Alcohol dehydrogenase") finds all proteins whose name is exactly “Alcohol dehydrogenase”. This returns a list that might contain zero, one, or multiple proteins depending on what’s in your document.
- Multiple attribute filtering: filter_proteins(organism="Homo sapiens", vessel_id="v1") finds proteins that meet both conditions: they must be from humans AND they must be in vessel v1. If a protein is from humans but in a different vessel, it won’t be included.
- Boolean attribute filtering: filter_reactions(reversible=True) finds all reactions where the reversible attribute is set to True. This uses a boolean value (True/False) rather than a string.
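The matching rules described above (exact equality, all criteria combined with AND) can be sketched in plain Python. This is not PyEnzyme's implementation: Protein is a hypothetical stand-in dataclass and filter_entities is an illustrative helper, used only to make the semantics concrete.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class Protein:
    """Hypothetical stand-in for a PyEnzyme protein (illustration only)."""
    name: str
    organism: str
    vessel_id: str


def filter_entities(entities: Iterable[Any], **criteria: Any) -> List[Any]:
    """Return the entities whose attributes equal ALL of the given criteria."""
    return [
        entity
        for entity in entities
        if all(getattr(entity, attr) == value for attr, value in criteria.items())
    ]


proteins = [
    Protein("Alcohol dehydrogenase", "Homo sapiens", "v1"),
    Protein("Alcohol dehydrogenase", "Saccharomyces cerevisiae", "v1"),
    Protein("Catalase", "Homo sapiens", "v2"),
]

# One criterion: exact match on a single attribute
adh = filter_entities(proteins, name="Alcohol dehydrogenase")

# Two criteria: both must hold (AND)
human_v1 = filter_entities(proteins, organism="Homo sapiens", vessel_id="v1")

# No match: an empty list, not an error
nothing = filter_entities(proteins, organism="Escherichia coli")

print(len(adh), len(human_v1), nothing)  # prints: 2 1 []
```

In this sketch, calling filter_entities with no criteria returns every entity, because all() over an empty sequence is true.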
Important characteristics of filter results:
Returns a list: Filter methods always return a list, even if there’s only one match or no matches at all. If nothing matches, you’ll get an empty list [].
Multiple criteria use AND logic: All specified attributes must match. PyEnzyme doesn’t currently support OR logic (where either condition A or condition B could be true), but you can perform multiple separate filters and combine the results if needed.
Exact attribute names required: The attribute names you use in the filter must match the entity’s field names exactly, including capitalization. If you’re not sure what attributes an entity has, you can check by looking at an example entity or consulting the entity’s documentation.
Results reference the original entities: The entities returned by filtering aren’t copies: they’re references to the actual entities in your document. This means if you modify a filtered entity, you’re modifying the entity in the document itself. This is usually what you want, but it’s important to be aware of.
No matches returns an empty list: If your filter criteria don’t match any entities, you’ll get an empty list. This isn’t an error: it simply means no entities met your criteria. You can check if the list is empty with if not filtered_results: or if len(filtered_results) == 0:.
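Because the result is always a list, code that expects exactly one entity has to unpack it explicitly. A small helper like the one below can make that intent visible. Note that exactly_one is a hypothetical name, not part of PyEnzyme; it works on any list, including one returned by a filter method.

```python
from typing import List, TypeVar

T = TypeVar("T")


def exactly_one(matches: List[T], what: str = "entity") -> T:
    """Return the only element of `matches`; raise if there are zero or several."""
    if not matches:
        raise LookupError(f"No {what} matched the filter")
    if len(matches) > 1:
        raise LookupError(f"Expected one {what}, found {len(matches)}")
    return matches[0]


# Plain strings stand in for filtered entities here
print(exactly_one(["ADH1"], what="protein"))

try:
    exactly_one([], what="protein")
except LookupError as err:
    print(err)
```

Raising on an empty or ambiguous result is a design choice: it surfaces a wrong assumption early instead of silently picking the first match.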
Filtering Different Entity Types
PyEnzyme provides filter methods for all major entity types in your document. Let’s explore how to filter each type with practical examples.
Filtering Proteins
Proteins can be filtered by any of their attributes, such as name, organism, EC number, sequence, or vessel ID.
import pyenzyme as pe

enzmldoc = pe.read_enzymeml("experiment.json")

# Find all proteins from a specific organism
human_proteins = enzmldoc.filter_proteins(organism="Homo sapiens")

# Find proteins with a specific EC number
oxidoreductases = enzmldoc.filter_proteins(ecnumber="1.1.1.1")

# Find proteins in a specific vessel
vessel_proteins = enzmldoc.filter_proteins(vessel_id="v1")

# Combine multiple criteria
human_adh = enzmldoc.filter_proteins(
    name="Alcohol dehydrogenase",
    organism="Homo sapiens"
)

Common use cases:
- Finding all enzymes of a certain type (by EC number)
- Grouping proteins by organism
- Locating proteins in specific experimental vessels
- Identifying enzymes by name when you have multiple variants
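For the grouping use case, one option is to group entities in a single pass rather than filtering once per organism. The sketch below assumes only that each entity exposes the attribute being grouped on. Protein is a hypothetical stand-in and group_by_attribute is an illustrative helper, not a PyEnzyme function; in real code the input list would come from your document.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class Protein:
    """Hypothetical stand-in for a PyEnzyme protein (illustration only)."""
    name: str
    organism: str


def group_by_attribute(entities: Iterable[Any], attr: str) -> Dict[Any, List[Any]]:
    """Group entities by the value of one attribute, keeping input order."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for entity in entities:
        groups[getattr(entity, attr)].append(entity)
    return dict(groups)


proteins = [
    Protein("ADH1B", "Homo sapiens"),
    Protein("ADH1", "Saccharomyces cerevisiae"),
    Protein("Catalase", "Homo sapiens"),
]

by_organism = group_by_attribute(proteins, "organism")
for organism, members in by_organism.items():
    print(organism, [p.name for p in members])
```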
Filtering Small Molecules
Small molecules can be filtered by attributes like name, chemical identifiers (SMILES, InChI), or vessel association.
# Find molecules by name
ethanol = enzmldoc.filter_small_molecules(name="ethanol")

# Find molecules in a specific vessel
vessel_molecules = enzmldoc.filter_small_molecules(vessel_id="v1")

# Find by chemical structure (if you know the SMILES)
alcohol = enzmldoc.filter_small_molecules(canonical_smiles="CCO")

Common use cases:
- Locating specific substrates or products by name
- Finding all molecules in a particular vessel
- Identifying molecules by their chemical structure
- Checking if a particular compound is already in your document
Filtering Reactions
Reactions can be filtered by attributes such as name, reversibility, or associated enzyme.
# Find all reversible reactions
reversible = enzmldoc.filter_reactions(reversible=True)

# Find all irreversible reactions
irreversible = enzmldoc.filter_reactions(reversible=False)

# Find reactions by name
specific_reaction = enzmldoc.filter_reactions(name="Ethanol oxidation")

Common use cases:
- Separating reversible from irreversible reactions
- Finding specific reactions by name
- Organizing reactions for different types of analysis
Filtering Measurements
Measurements can be filtered by attributes like name, temperature, pH, or other experimental conditions.
# Find measurements at body temperature
physiological = enzmldoc.filter_measurements(temperature=37.0)

# Find measurements at a specific pH
neutral_ph = enzmldoc.filter_measurements(ph=7.0)

# Find by measurement name
run1 = enzmldoc.filter_measurements(name="Run 1")

# Combine conditions
specific_conditions = enzmldoc.filter_measurements(
    temperature=37.0,
    ph=7.4
)

Common use cases:
- Grouping measurements by experimental conditions
- Finding replicates or related experimental runs
- Filtering by temperature or pH for analysis
- Locating specific measurement series
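If filter values are compared for exact equality, as the name example earlier suggests, floating-point attributes such as temperature or pH can be fragile: a stored value of 36.95 would not match 37.0. Where that matters, one option is to compare with a tolerance in plain Python. The sketch below uses a hypothetical Measurement stand-in and the standard library's math.isclose; it does not call any PyEnzyme API, and the tolerances are arbitrary example values.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    """Hypothetical stand-in for a measurement (illustration only)."""
    name: str
    temperature: Optional[float]
    ph: Optional[float]


def near(value: Optional[float], target: float, tol: float) -> bool:
    """True if `value` is set and within an absolute tolerance of `target`."""
    return value is not None and math.isclose(
        value, target, rel_tol=0.0, abs_tol=tol
    )


measurements = [
    Measurement("Run 1", 37.0, 7.4),
    Measurement("Run 2", 36.95, 7.38),
    Measurement("Run 3", 25.0, 7.4),
    Measurement("Run 4", None, 7.4),  # temperature not recorded
]

# Keep measurements close to 37 degrees and pH 7.4
physiological = [
    m
    for m in measurements
    if near(m.temperature, 37.0, tol=0.1) and near(m.ph, 7.4, tol=0.05)
]
print([m.name for m in physiological])
```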
Filtering Vessels
Vessels can be filtered by attributes like name or volume.
# Find vessels by name
cuvettes = enzmldoc.filter_vessels(name="Cuvette")

# Find vessels by volume (with specific unit)
small_vessels = enzmldoc.filter_vessels(volume=1.0)

Common use cases:
- Locating specific experimental containers
- Grouping experiments by vessel type
- Organizing vessels by size or capacity
Working with Filter Results
Once you’ve filtered entities, you can work with the results in several ways:
Checking Results
# Get filtered proteins
proteins = enzmldoc.filter_proteins(organism="Homo sapiens")

# Check if any matches were found
if proteins:
    print(f"Found {len(proteins)} human proteins")
else:
    print("No human proteins found")

# Iterate through results
for protein in proteins:
    print(f"Protein: {protein.name}, EC: {protein.ecnumber}")
Modifying Filtered Entities
Since filter results reference the original entities, you can modify them directly:
# Find all proteins from a specific organism
proteins = enzmldoc.filter_proteins(organism="Homo sapiens")

# Modify an attribute for all matching proteins
for protein in proteins:
    protein.name = "Human alcohol dehydrogenase"

# The changes are reflected in the document

Important: Be careful when modifying filtered entities, as you’re changing the original entities in your document. Make sure modifications are intentional.
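If you want to experiment with changes without touching the document, one option is to work on copies. The sketch below contrasts the two behaviors using Python's copy.deepcopy on a hypothetical Protein stand-in. The list comprehension imitates a filter result; whether PyEnzyme entities offer their own copy method is not covered here.

```python
import copy
from dataclasses import dataclass


@dataclass
class Protein:
    """Hypothetical stand-in for a PyEnzyme protein (illustration only)."""
    name: str
    organism: str


document_proteins = [Protein("ADH", "Homo sapiens")]

# A filter-style result holds references to the same objects
matches = [p for p in document_proteins if p.organism == "Homo sapiens"]
print(matches[0] is document_proteins[0])  # True: same object

# Editing a deep copy leaves the original untouched
scratch = copy.deepcopy(matches)
scratch[0].name = "Renamed copy"
name_after_copy_edit = document_proteins[0].name
print(name_after_copy_edit)  # ADH

# Editing the reference changes the original
matches[0].name = "Human alcohol dehydrogenase"
name_after_reference_edit = document_proteins[0].name
print(name_after_reference_edit)  # Human alcohol dehydrogenase
```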
Extracting Information
You can extract specific information from filtered entities:
# Get all protein names from human proteins
human_proteins = enzmldoc.filter_proteins(organism="Homo sapiens")
protein_names = [p.name for p in human_proteins]
print(protein_names)

# Get all EC numbers from your enzymes
all_proteins = enzmldoc.filter_proteins()  # No filter = get all
ec_numbers = [p.ecnumber for p in all_proteins if p.ecnumber]
print(ec_numbers)
Combining Filter Results
If you need OR logic (either condition A or condition B), you can perform multiple filters and combine the results:
# Find proteins from either humans or yeast
human_proteins = enzmldoc.filter_proteins(organism="Homo sapiens")
yeast_proteins = enzmldoc.filter_proteins(organism="Saccharomyces cerevisiae")

# Combine the results
combined = human_proteins + yeast_proteins

# Remove duplicates if needed (though usually not necessary).
# Keyed on object identity, since set() needs hashable entities and would lose the order.
unique_proteins = list({id(p): p for p in combined}.values())
Next Steps
After learning how to filter and manage entities in your EnzymeML documents, you can explore:
The Exporting documents guide shows how to save your documents (including filtered subsets) in various formats for sharing and analysis.
The Visualizing data guide demonstrates how to create plots and visualizations from your filtered measurements and data.
For advanced analysis, the Mathematical modeling guide explains how to build and fit kinetic models using your filtered experimental data.
Finally, the Working with units guide provides detailed information about how PyEnzyme handles different unit systems, which is particularly useful when filtering by numerical attributes like temperature or concentration.