Sequence
- cladetime.sequence.filter(sequence_ids: set, url_sequence: str, output_path: Path) -> Path
Filter a fasta file against a specific set of sequences.
Download a sequence file (in FASTA format) from Nextstrain, filter it against a set of specific strains, and write the filtered sequences to a new file.
- Parameters:
sequence_ids (set) – Strains used to filter the sequence file
url_sequence (str) – The URL to a file of SARS-CoV-2 GenBank sequences published by Nextstrain. The file should be in .fasta format and compressed with lzma (e.g., “https://data.nextstrain.org/files/ncov/open/100k/sequences.fasta.xz”)
output_path (pathlib.Path) – Where to save the filtered sequence file
- Returns:
Full path to the filtered sequence file
- Return type:
pathlib.Path
- Raises:
ValueError – If url_sequence points to a file that doesn’t have a .zst or .xz extension or if sequence_ids is empty
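The filtering step described above can be illustrated without the network download. The following is a minimal sketch using only the Python standard library; it is not cladetime's implementation. The helper name `filter_fasta_xz` is hypothetical, it reads a local .xz file rather than a URL, and it assumes the strain name is the first whitespace-delimited token of each FASTA header.

```python
import lzma
from pathlib import Path


def filter_fasta_xz(sequence_ids: set, fasta_xz: Path, output_path: Path) -> Path:
    """Hypothetical helper: keep only records whose ID is in sequence_ids."""
    if not sequence_ids:
        raise ValueError("sequence_ids must not be empty")
    if fasta_xz.suffix != ".xz":
        raise ValueError("expected an lzma-compressed (.xz) FASTA file")

    keep = False
    with lzma.open(fasta_xz, "rt") as src, open(output_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Header line: assume the record ID is the first token after '>'.
                fields = line[1:].split()
                keep = bool(fields) and fields[0] in sequence_ids
            if keep:
                # Write the header and every sequence line of a kept record.
                dst.write(line)
    return output_path.resolve()
```

Streaming the file line by line keeps memory use flat, which matters for sequence files of this size.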
- cladetime.sequence.filter_metadata(metadata: DataFrame | LazyFrame, cols: list | None = None, state_format: StateFormat = StateFormat.ABBR, collection_min_date: datetime | None = None, collection_max_date: datetime | None = None) -> DataFrame | LazyFrame
Apply standard filters to Nextstrain’s SARS-CoV-2 sequence metadata.
A helper function to apply commonly-used filters to a Polars DataFrame or LazyFrame that represents Nextstrain’s SARS-CoV-2 sequence metadata. It keeps only human sequences from the United States (including Puerto Rico and Washington, DC).
This function also performs small transformations to the metadata, such as casting the collection date to a date type, renaming columns, and returning alternate state formats if requested.
- Parameters:
metadata (polars.DataFrame or polars.LazyFrame) – A Polars DataFrame or LazyFrame that represents SARS-CoV-2 sequence metadata produced by Nextstrain as an intermediate file in their daily workflow. This parameter is often the cladetime.CladeTime.sequence_metadata attribute of a cladetime.CladeTime object
cols (list) – Optional. A list of columns to include in the filtered metadata. The default columns included in the filtered metadata are: clade_nextstrain, country, date, division, strain, host
state_format (cladetime.types.StateFormat) – Optional. The state name format returned in the filtered metadata’s location column. Defaults to StateFormat.ABBR
collection_min_date (datetime.datetime | None) – Optional. Return sequences collected on or after this date. Defaults to None (no minimum date filter).
collection_max_date (datetime.datetime | None) – Optional. Return sequences collected on or before this date. Defaults to None (no maximum date filter).
- Returns:
A Polars object that represents the filtered SARS-CoV-2 sequence metadata. The type of returned object will match the type of the function’s metadata parameter.
- Return type:
polars.DataFrame or polars.LazyFrame
- Raises:
ValueError – If the state_format parameter is not a valid cladetime.types.StateFormat.
Notes
This function will filter out metadata rows with invalid state names or date strings that cannot be cast to a Polars date format.
Example
>>> from cladetime import CladeTime
>>> from cladetime.sequence import filter_metadata
>>>
>>> ct = CladeTime(sequence_as_of="2024-10-15")
>>> filtered_metadata = filter_metadata(ct.sequence_metadata)
>>> filtered_metadata.collect().head(5)
shape: (5, 7)
┌───────┬─────────┬────────────┬────────────────────────────┬──────────────┬──────┬
│ clade ┆ country ┆ date       ┆ strain                     ┆ host         ┆ loca │
│       ┆         ┆            ┆                            ┆              ┆ tion │
│ ---   ┆ ---     ┆ ---        ┆ ---                        ┆ ---          ┆ ---  │
│ str   ┆ str     ┆ date       ┆ str                        ┆ str          ┆ str  │
│       ┆         ┆            ┆                            ┆              ┆      │
╞═══════╪═════════╪════════════╪════════════════════════════╪══════════════╪══════╡
│ 22A   ┆ USA     ┆ 2022-07-07 ┆ Alabama/SEARCH-202312/2022 ┆ Homo sapiens ┆ AL   │
│ 22B   ┆ USA     ┆ 2022-07-02 ┆ Arizona/SEARCH-201153/2022 ┆ Homo sapiens ┆ AZ   │
│ 22B   ┆ USA     ┆ 2022-07-19 ┆ Arizona/SEARCH-203528/2022 ┆ Homo sapiens ┆ AZ   │
│ 22B   ┆ USA     ┆ 2022-07-15 ┆ Arizona/SEARCH-203621/2022 ┆ Homo sapiens ┆ AZ   │
│ 22B   ┆ USA     ┆ 2022-07-20 ┆ Arizona/SEARCH-203625/2022 ┆ Homo sapiens ┆ AZ   │
└───────┴─────────┴────────────┴────────────────────────────┴──────────────┴──────┴
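The filtering rules described for this function (human host, United States, valid state name, parseable date, inclusive date bounds) can be sketched in plain Python on a list of dict rows. This is only an illustration of the logic, not the Polars-based implementation. The names `filter_rows` and `STATE_ABBR` are hypothetical, the state mapping is deliberately partial, the date bounds are assumed to be naive datetime objects, and the literal values "Homo sapiens" and "USA" are taken from the example output above.

```python
from datetime import datetime

# Partial, illustrative mapping; a real lookup would cover every state,
# Puerto Rico, and Washington, DC.
STATE_ABBR = {"Alabama": "AL", "Arizona": "AZ", "Puerto Rico": "PR"}


def filter_rows(rows, collection_min_date=None, collection_max_date=None):
    """Hypothetical stand-in for the filtering rules, applied to dict rows."""
    kept = []
    for row in rows:
        # Keep only human sequences from the United States.
        if row.get("host") != "Homo sapiens" or row.get("country") != "USA":
            continue
        # Drop rows whose state name is not recognized.
        location = STATE_ABBR.get(row.get("division"))
        if location is None:
            continue
        # Drop rows whose date string is not a complete YYYY-MM-DD date.
        try:
            collected = datetime.strptime(row["date"], "%Y-%m-%d")
        except (KeyError, TypeError, ValueError):
            continue
        # Both bounds are inclusive, matching "on or after" / "on or before".
        if collection_min_date is not None and collected < collection_min_date:
            continue
        if collection_max_date is not None and collected > collection_max_date:
            continue
        kept.append({**row, "date": collected.date(), "location": location})
    return kept
```

Calling `filter_rows(rows, collection_min_date=datetime(2022, 7, 8))` would keep only rows collected on or after 8 July 2022.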
- cladetime.sequence.get_metadata_ids(sequence_metadata: DataFrame | LazyFrame) -> set
Return sequence IDs for a specified set of Nextstrain sequence metadata.
For a given input of GenBank-based SARS-CoV-2 sequence metadata (as published by Nextstrain), return a set of strains. This function is mostly used to filter a sequence file.
- Parameters:
sequence_metadata (polars.DataFrame or polars.LazyFrame)
- Returns:
A set of strains
- Return type:
set
- Raises:
ValueError – If the sequence metadata does not contain a strain column
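The behavior described here (collect the unique strain values, fail if the column is missing) can be sketched with the standard library alone. This is an illustration, not the library's Polars-based code: the helper name `strain_ids_from_tsv` is hypothetical, and it assumes the metadata is available as tab-separated text with a header row.

```python
import csv
import io


def strain_ids_from_tsv(tsv_text: str) -> set:
    """Hypothetical helper: collect the unique values of the strain column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if "strain" not in (reader.fieldnames or []):
        raise ValueError("metadata does not contain a strain column")
    # A set removes duplicate strains; empty values are skipped.
    return {row["strain"] for row in reader if row["strain"]}
```

A set built this way is the kind of value the sequence_ids parameter of cladetime.sequence.filter expects.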