This function imports and processes genotyping results from the Resistance Gene Identifier (RGI, https://github.com/arpcard/rgi), extracting antimicrobial resistance determinants and mapping them to standardised drug classes/antibiotics.
Usage
import_rgi(
input_table,
orf_id_col = "ORF_ID",
sample_id_sep = ".fasta.txt:",
model_col = "Model_type",
antibiotic_col = "Antibiotic",
class_col = "Drug Class",
exclude_loose = TRUE,
rgi_short_name = rgi_short_name_table,
rgi_drugs = rgi_drugs_table,
samples_no_amr = NULL
)
Arguments
- input_table
A data frame containing the RGI results, or a character string giving the path to the RGI results table (TSV format).
- orf_id_col
A character string specifying the column that contains the open reading frame ID (default "ORF_ID"). This column includes the sample ID and the contig / genomic location, and is part of the default RGI output.
- sample_id_sep
A character string specifying the separator between the sample ID and the remaining text in ORF_ID (default ".fasta.txt:"). For example, in the ORF_ID value "SAMEA3498968.fasta.txt:1_96 # 109511 # 110635....", the sample ID separator is ".fasta.txt:".
- model_col
A character string specifying the column that contains the model type reported by RGI (default "Model_type").
- antibiotic_col
A character string specifying the antibiotic column (default "Antibiotic").
- class_col
A character string specifying the drug class column (default "Drug Class").
- exclude_loose
Logical indicating whether to exclude Loose hits (AMR markers that fall below a curated bitscore cutoff as defined by CARD/RGI). Default TRUE, which excludes Loose hits.
- rgi_short_name
A tibble containing a reference table mapping model IDs (from CARD/RGI) to shortened model names as provided by CARD (aro_index.tsv, available from https://card.mcmaster.ca/download). Defaults to rgi_short_name_table, which is provided internally.
- rgi_drugs
A tibble containing a reference table mapping CARD drug classes / drug agents to standardised drug classes/names. Defaults to rgi_drugs_table, which is provided internally.
- samples_no_amr
A vector of sample IDs that have no RGI output because no AMR markers were identified, for example c("SampleA", "SampleB") (default NULL).
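To make sample_id_sep and exclude_loose more concrete, the sketch below shows one way the sample ID could be pulled out of ORF_ID and Loose hits dropped, using base R only. It is an illustration, not the internal code of import_rgi: the toy table, the SampleX identifier, and the assumption that Loose hits are identified through the Cut_Off column are all hypothetical.

```r
# Toy RGI-like table (values are illustrative, not real RGI output)
toy <- data.frame(
  ORF_ID = c(
    "SAMEA3498968.fasta.txt:1_96 # 109511 # 110635",
    "SAMEA3498968.fasta.txt:1_210 # 237710 # 238072",
    "SampleX.fasta.txt:3_5 # 2 # 283"
  ),
  Cut_Off = c("Strict", "Perfect", "Loose"),
  stringsAsFactors = FALSE
)
sample_id_sep <- ".fasta.txt:"

# Keep the text before the literal separator as the sample ID.
# fixed = TRUE stops "." from being treated as a regex wildcard.
parts <- strsplit(toy$ORF_ID, sample_id_sep, fixed = TRUE)
toy$id <- vapply(parts, function(p) p[1], character(1))

# Assumed equivalent of exclude_loose = TRUE: drop rows flagged as Loose
toy_strict <- toy[toy$Cut_Off != "Loose", ]

toy$id
unique(toy_strict$id)
```

In this toy table SampleX has only a Loose hit, so it no longer appears once Loose hits are dropped. A sample with no rows in the RGI output at all cannot be recovered from ORF_ID, which is the case samples_no_amr is documented to cover.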
Value
An AMRgen-compatible tibble containing the processed AMR determinants and drug classes. The output retains the original columns from the RGI output along with the newly mapped variables.
Details
The function performs the following steps:
1. Reads the RGI output table.
2. Transforms the RGI output into long form (i.e., one AMR determinant and one drug class / antibiotic per row).
3. Maps CARD drug classes and antibiotics to standardised names.
This processing ensures compatibility with downstream AMRgen analysis workflows.
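The reshaping and mapping steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the function's actual implementation: it assumes that multiple drug classes in the Drug Class column are separated by semicolons, and the gene names and the lookup table are hypothetical stand-ins (the real mapping is held in rgi_drugs_table).

```r
# Toy hits: one row per ORF, several drug classes packed into one cell
hits <- data.frame(
  ORF_ID = c("S1.fasta.txt:1_1", "S1.fasta.txt:1_2"),
  Best_Hit_ARO = c("geneA", "geneB"),
  `Drug Class` = c("macrolide antibiotic; tetracycline antibiotic",
                   "cephalosporin"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Long form: split on ";" (plus optional whitespace) and repeat each hit
# once per drug class it carries
classes <- strsplit(hits[["Drug Class"]], ";\\s*")
row_idx <- rep(seq_len(nrow(hits)), times = lengths(classes))
long <- hits[row_idx, c("ORF_ID", "Best_Hit_ARO")]
long$drug_class_raw <- unlist(classes)
rownames(long) <- NULL

# Hypothetical lookup from CARD class names to standardised names
lookup <- data.frame(
  drug_class_raw = c("macrolide antibiotic", "tetracycline antibiotic",
                     "cephalosporin"),
  drug_class = c("Macrolides", "Tetracyclines", "Cephalosporins"),
  stringsAsFactors = FALSE
)

# match() keeps the row order of `long`; unmatched classes become NA
long$drug_class <- lookup$drug_class[match(long$drug_class_raw,
                                           lookup$drug_class_raw)]
long
```

Using match() rather than merge() keeps the rows in their original order and leaves any class missing from the lookup as NA, which makes unmapped entries easy to spot.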
Examples
# example RGI data (including Perfect, Strict, and Loose hits)
rgi_raw
#> # A tibble: 21,203 × 29
#> ORF_ID Contig Start Stop Orientation Cut_Off Pass_Bitscore Best_Hit_Bitscore
#> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 GCA_0… JASER… 2 283 + Loose 1150 24.3
#> 2 GCA_0… JASER… 367 1662 + Loose 700 172.
#> 3 GCA_0… JASER… 1666 1989 - Loose 500 23.5
#> 4 GCA_0… JASER… 2031 3386 - Loose 1900 29.3
#> 5 GCA_0… JASER… 3507 6158 - Loose 275 30.4
#> 6 GCA_0… JASER… 6961 7386 - Loose 910 25.4
#> 7 GCA_0… JASER… 7590 8675 + Loose 450 45.8
#> 8 GCA_0… JASER… 9735 10118 + Loose 600 25.8
#> 9 GCA_0… JASER… 10164 11495 - Loose 500 26.6
#> 10 GCA_0… JASER… 11627 12364 + Loose 400 37.7
#> # ℹ 21,193 more rows
#> # ℹ 21 more variables: Best_Hit_ARO <chr>, Best_Identities <dbl>, ARO <dbl>,
#> # Model_type <chr>, SNPs_in_Best_Hit_ARO <chr>, Other_SNPs <chr>,
#> # `Drug Class` <chr>, `Resistance Mechanism` <chr>, `AMR Gene Family` <chr>,
#> # Predicted_DNA <chr>, Predicted_Protein <chr>, CARD_Protein_Sequence <chr>,
#> # `Percentage Length of Reference Sequence` <dbl>, ID <chr>, Model_ID <dbl>,
#> # Nudged <lgl>, Note <lgl>, Hit_Start <dbl>, Hit_End <dbl>, …
# import using sample_id_sep=`_genomic.fna.txt:` and include Loose hits
rgi <- import_rgi(rgi_raw, sample_id_sep = "_genomic.fna.txt:", exclude_loose = FALSE)
# example RGI data from EuSCAPE project (including only Perfect and Strict hits)
rgi_EuSCAPE_raw
#> # A tibble: 59,447 × 26
#> ORF_ID Contig Start Stop Orientation Cut_Off Pass_Bitscore
#> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 SAMEA3498968.fasta.tx… 1 109511 110635 - Strict 700
#> 2 SAMEA3498968.fasta.tx… 1 237710 238072 + Perfect 150
#> 3 SAMEA3498968.fasta.tx… 1 238059 238388 + Perfect 150
#> 4 SAMEA3498968.fasta.tx… 1 278325 279185 - Perfect 550
#> 5 SAMEA3498968.fasta.tx… 1 299277 299651 - Strict 230
#> 6 SAMEA3498968.fasta.tx… 2 120780 123893 - Strict 1900
#> 7 SAMEA3498968.fasta.tx… 2 379306 380745 + Perfect 900
#> 8 SAMEA3498968.fasta.tx… 2 437973 441050 - Strict 1800
#> 9 SAMEA3498968.fasta.tx… 2 441051 444173 - Strict 1800
#> 10 SAMEA3498968.fasta.tx… 3 347739 348971 + Strict 700
#> # ℹ 59,437 more rows
#> # ℹ 19 more variables: Best_Hit_Bitscore <dbl>, Best_Hit_ARO <chr>,
#> # Best_Identities <dbl>, ARO <dbl>, Model_type <chr>,
#> # SNPs_in_Best_Hit_ARO <chr>, Other_SNPs <chr>, `Drug Class` <chr>,
#> # `Resistance Mechanism` <chr>, `AMR Gene Family` <chr>,
#> # `Percentage Length of Reference Sequence` <dbl>, ID <chr>, Model_ID <dbl>,
#> # Nudged <lgl>, Note <lgl>, Hit_Start <dbl>, Hit_End <dbl>, …
# import using defaults (sample_id_sep=`.fasta.txt:`, exclude_loose = `TRUE`)
import_rgi(rgi_EuSCAPE_raw)
#> # A tibble: 293,017 × 33
#> id marker mutation drug drug_class `variation type` marker.label ORF_ID
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 SAMEA3… Klebs… NA FOX Cephalosp… Gene presence d… Kpne_OmpK37… SAMEA…
#> 2 SAMEA3… Klebs… NA CTX Cephalosp… Gene presence d… Kpne_OmpK37… SAMEA…
#> 3 SAMEA3… Klebs… NA ERY Macrolides Gene presence d… Kpne_KpnE SAMEA…
#> 4 SAMEA3… Klebs… NA STR1 Aminoglyc… Gene presence d… Kpne_KpnE SAMEA…
#> 5 SAMEA3… Klebs… NA TCY Tetracycl… Gene presence d… Kpne_KpnE SAMEA…
#> 6 SAMEA3… Klebs… NA FEP Cephalosp… Gene presence d… Kpne_KpnE SAMEA…
#> 7 SAMEA3… Klebs… NA CRO Cephalosp… Gene presence d… Kpne_KpnE SAMEA…
#> 8 SAMEA3… Klebs… NA RIF Rifamycins Gene presence d… Kpne_KpnE SAMEA…
#> 9 SAMEA3… Klebs… NA COL Polymyxins Gene presence d… Kpne_KpnE SAMEA…
#> 10 SAMEA3… Klebs… NA COL Polymyxins Gene presence d… Kpne_KpnE SAMEA…
#> # ℹ 293,007 more rows
#> # ℹ 25 more variables: Contig <dbl>, Start <dbl>, Stop <dbl>,
#> # Orientation <chr>, Cut_Off <chr>, Pass_Bitscore <dbl>,
#> # Best_Hit_Bitscore <dbl>, Best_Hit_ARO <chr>, Best_Identities <dbl>,
#> # ARO <dbl>, Model_type <chr>, Other_SNPs <chr>, `Drug Class` <chr>,
#> # `Resistance Mechanism` <chr>, `AMR Gene Family` <chr>,
#> # `Percentage Length of Reference Sequence` <dbl>, ID <chr>, …
