Sequence

cladetime.sequence.filter(sequence_ids: set, url_sequence: str, output_path: Path) → Path

Filter a fasta file against a specific set of sequences.

Download a sequence file (in FASTA format) from Nextstrain, filter it against a set of specific strains, and write the filtered sequences to a new file.

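Parameters:
  • sequence_ids (set) – The strains to keep in the filtered sequence file

  • url_sequence (str) – URL of the Nextstrain sequence file to download; the file must have a .zst or .xz extension

  • output_path (pathlib.Path) – Where the filtered sequence file is written

Returns: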

Full path to the filtered sequence file

Return type:

pathlib.Path

Raises:

ValueError – If url_sequence points to a file that doesn’t have a .zst or .xz extension, or if sequence_ids is empty.
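
The filtering step and the two error conditions described above can be sketched with the standard library alone. This is a minimal illustration, not cladetime's implementation: check_inputs and filter_fasta_lines are hypothetical helpers, the URL is a placeholder, downloading and decompression are left out, and the sketch assumes each FASTA header consists only of the strain name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

ALLOWED_SUFFIXES = {".zst", ".xz"}


def check_inputs(sequence_ids, url_sequence):
    """Hypothetical helper mirroring the two documented ValueError cases."""
    if not sequence_ids:
        raise ValueError("sequence_ids must not be empty")
    suffix = PurePosixPath(urlparse(url_sequence).path).suffix
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported extension: {suffix!r}")


def filter_fasta_lines(lines, sequence_ids):
    """Yield only the lines of FASTA records whose ID is in sequence_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumption: the record ID is the whole header after ">".
            keep = line[1:].strip() in sequence_ids
        if keep:
            yield line


records = [
    ">Alabama/SEARCH-202312/2022\n", "ACGT\n", "GGCC\n",
    ">Arizona/SEARCH-201153/2022\n", "TTTT\n",
]
wanted = {"Alabama/SEARCH-202312/2022"}
check_inputs(wanted, "https://example.org/sequences.fasta.xz")  # placeholder URL
filtered = list(filter_fasta_lines(records, wanted))
```

Because the generator processes one line at a time, the same approach would work on a decompressed stream without loading the whole file into memory.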

cladetime.sequence.filter_metadata(metadata: DataFrame | LazyFrame, cols: list | None = None, state_format: StateFormat = StateFormat.ABBR, collection_min_date: datetime | None = None, collection_max_date: datetime | None = None) → DataFrame | LazyFrame

Apply standard filters to Nextstrain’s SARS-CoV-2 sequence metadata.

A helper function to apply commonly-used filters to a Polars DataFrame or LazyFrame that represents Nextstrain’s SARS-CoV-2 sequence metadata. It filters on human sequences from the United States (including Puerto Rico and Washington, DC).

This function also performs small transformations to the metadata, such as casting the collection date to a date type, renaming columns, and returning alternate state formats if requested.

Parameters:
  • metadata (polars.DataFrame or polars.LazyFrame) – A Polars DataFrame or LazyFrame that represents SARS-CoV-2 sequence metadata produced by Nextstrain as an intermediate file in their daily workflow. This parameter is often the cladetime.CladeTime.sequence_metadata attribute of a cladetime.CladeTime object.

  • cols (list) – Optional. A list of columns to include in the filtered metadata. The default columns included in the filtered metadata are: clade_nextstrain, country, date, division, strain, host

  • state_format (cladetime.types.StateFormat) – Optional. The state name format returned in the filtered metadata’s location column. Defaults to StateFormat.ABBR

  • collection_min_date (datetime.datetime | None) – Optional. Return sequences collected on or after this date. Defaults to None (no minimum date filter).

  • collection_max_date (datetime.datetime | None) – Optional. Return sequences collected on or before this date. Defaults to None (no maximum date filter).

Returns:

A Polars object that represents the filtered SARS-CoV-2 sequence metadata. The type of returned object will match the type of the function’s metadata parameter.

Return type:

polars.DataFrame or polars.LazyFrame

Raises:

ValueError – If the state_format parameter is not a valid cladetime.types.StateFormat.

Notes

This function will filter out metadata rows with invalid state names or date strings that cannot be cast to a Polars date format.
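
The row-level rules described above (human hosts, United States only, inclusive date bounds, and dropping rows with invalid dates or state names) can be sketched in plain Python on a list of dicts. This is only an illustration under stated assumptions, not the library's Polars implementation: filter_rows, parse_date, and the STATE_ABBR lookup are hypothetical, the lookup covers only a few entries, the division spellings are assumed, and the bounds are plain date objects rather than datetime.

```python
from datetime import date, datetime

# Hypothetical, partial lookup from full state name to abbreviation.
STATE_ABBR = {"Alabama": "AL", "Arizona": "AZ", "Puerto Rico": "PR"}


def parse_date(value):
    """Return a date for a YYYY-MM-DD string, or None if it cannot be parsed."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def filter_rows(rows, min_date=None, max_date=None):
    """Keep human, US rows with a valid date and a recognised state."""
    kept = []
    for row in rows:
        if row.get("host") != "Homo sapiens" or row.get("country") != "USA":
            continue
        collected = parse_date(row.get("date"))
        location = STATE_ABBR.get(row.get("division"))
        if collected is None or location is None:
            continue  # invalid date string or unrecognised state name
        if min_date is not None and collected < min_date:
            continue  # collected before the inclusive lower bound
        if max_date is not None and collected > max_date:
            continue  # collected after the inclusive upper bound
        kept.append({**row, "date": collected, "location": location})
    return kept


base = {"host": "Homo sapiens", "country": "USA"}
rows = [
    {**base, "strain": "s1", "division": "Alabama", "date": "2022-07-07"},
    {**base, "strain": "s2", "division": "Arizona", "date": "2022-07-20"},
    {**base, "strain": "s3", "division": "Arizona", "date": "2022-07-XX"},
    {**base, "strain": "s4", "division": "Atlantis", "date": "2022-07-10"},
    {**base, "strain": "s5", "division": "Alabama", "date": "2022-07-10",
     "country": "Canada"},
]
everything = filter_rows(rows)
windowed = filter_rows(rows, min_date=date(2022, 7, 7), max_date=date(2022, 7, 19))
```

In this sketch a row collected exactly on the minimum or maximum date is retained, matching the "on or after" and "on or before" wording of the parameters.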

Example

>>> from cladetime import CladeTime
>>> from cladetime.sequence import filter_metadata
>>>
>>> ct = CladeTime(sequence_as_of="2024-10-15")
>>> filtered_metadata = filter_metadata(ct.sequence_metadata)
>>> filtered_metadata.collect().head(5)
shape: (5, 7)
┌───────┬─────────┬────────────┬────────────────────────────┬──────────────┬──────┬
│ clade ┆ country ┆ date       ┆ strain                     ┆ host         ┆ loca │
│       ┆         ┆            ┆                            ┆              ┆ tion │
│ ---   ┆ ---     ┆ ---        ┆ ---                        ┆ ---          ┆ ---  │
│ str   ┆ str     ┆ date       ┆ str                        ┆ str          ┆ str  │
│       ┆         ┆            ┆                            ┆              ┆      │
╞═══════╪═════════╪════════════╪════════════════════════════╪══════════════╪══════╡
│ 22A   ┆ USA     ┆ 2022-07-07 ┆ Alabama/SEARCH-202312/2022 ┆ Homo sapiens ┆ AL   │
│ 22B   ┆ USA     ┆ 2022-07-02 ┆ Arizona/SEARCH-201153/2022 ┆ Homo sapiens ┆ AZ   │
│ 22B   ┆ USA     ┆ 2022-07-19 ┆ Arizona/SEARCH-203528/2022 ┆ Homo sapiens ┆ AZ   │
│ 22B   ┆ USA     ┆ 2022-07-15 ┆ Arizona/SEARCH-203621/2022 ┆ Homo sapiens ┆ AZ   │
│ 22B   ┆ USA     ┆ 2022-07-20 ┆ Arizona/SEARCH-203625/2022 ┆ Homo sapiens ┆ AZ   │
└───────┴─────────┴────────────┴────────────────────────────┴──────────────┴──────┴

cladetime.sequence.get_metadata_ids(sequence_metadata: DataFrame | LazyFrame) → set

Return sequence IDs for a specified set of Nextstrain sequence metadata.

For a given input of GenBank-based SARS-CoV-2 sequence metadata (as published by Nextstrain), return a set of strains. This function is mostly used to filter a sequence file.

Parameters:

sequence_metadata (polars.DataFrame or polars.LazyFrame)

Returns:

A set of strains

Return type:

set

Raises:

ValueError – If the sequence metadata does not contain a strain column
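
The behaviour described above amounts to collecting the distinct values of the strain column. A minimal sketch, assuming the metadata is represented as a plain dict mapping column names to lists rather than a Polars object (metadata_ids is a hypothetical stand-in, not the library function):

```python
def metadata_ids(metadata):
    """Return the distinct strain values from column-oriented metadata."""
    if "strain" not in metadata:
        raise ValueError("sequence metadata must contain a strain column")
    return set(metadata["strain"])


metadata = {
    "strain": [
        "Alabama/SEARCH-202312/2022",
        "Arizona/SEARCH-201153/2022",
        "Arizona/SEARCH-201153/2022",  # duplicates collapse in the set
    ],
    "clade": ["22A", "22B", "22B"],
}
ids = metadata_ids(metadata)
```

The resulting set has the shape expected by the sequence_ids parameter of cladetime.sequence.filter.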