
Example with custom classification of genotype hits
Natacha Couto
2026-03-12
Source:vignettes/StaphAureusClindamycin.Rmd
StaphAureusClindamycin.RmdAnalysing clindamycin resistance in Staphylococcus aureus
Introduction
This is document summarises the analysis of clindamycin resistance in S. aureus. More specifically, we will look at the distribution of clindamycin resistance in S. aureus isolates from a dataset of clinical samples.
Start by loading the package:
Data preparation
For this example, we have collated genotype-phenotype data from NCBI,
EBI through the ESGEM-AMR Staphylococcus subgroup. Phenotypic data were
collated into a single table (one row per isolate). The genotype data
was generated using AMRFinderPlus. This pre-loaded objects
ast_CLI and afp_CLI serve as the
input for AMRgen. First, we will create a
new marker.label column that includes the marker gene and its closest
accession number (this will be named variant from now on).
# Create the new marker.label column
cli_accession <- afp_CLI_public %>%
mutate(obs_id = row_number()) %>%
mutate(exact_match = if_else(
`% Coverage of reference sequence` == 100 &
`% Identity to reference sequence` == 100.00,
"", "x_"
)) %>%
mutate(node_hit = paste0(node, "__", exact_match, `Accession of closest sequence`)) %>%
mutate(marker.label = case_when(
`variation type` == "Inactivating mutation detected" ~ paste0(node_hit, ":-"),
!is.na(mutation) ~ paste0(node_hit, ":", mutation),
TRUE ~ node_hit
)) %>%
select(-obs_id)Genotype-phenotype analysis
Next, we will create a binary matrix for clindamycin resistance. We can do this with the just the markers and with the variants of the markers. This will allow us to see if the variants of the markers have different associations with clindamycin resistance.
# Generate binary matrix with markers
cli_bin <- get_binary_matrix(
afp_CLI_public,
ast_CLI_public,
antibiotic = "Clindamycin",
drug_class_list = c("Lincosamides"),
sir_col = "pheno_eucast",
keep_assay_values = TRUE,
marker_col = "marker.label"
)
# Generate binary matrix with variants
cli_bin_accession <- get_binary_matrix(
cli_accession,
ast_CLI_public,
antibiotic = "Clindamycin",
drug_class_list = c("Lincosamides"),
sir_col = "pheno_eucast",
keep_assay_values = TRUE,
marker_col = "marker.label"
)
# Visualise with UpSet plot (markers)
cli_mic_upset <- amr_upset(
cli_bin,
min_set_size = 2,
assay = "mic",
order = "value",
print_set_size = TRUE,
plot_set_size = TRUE,
print_category_counts = TRUE,
bp_S = 0.25,
bp_R = 0.5
)
# Visualise with UpSet plot (variants)
cli_mic_upset <- amr_upset(
cli_bin_accession,
min_set_size = 2,
assay = "mic",
order = "value",
print_set_size = TRUE,
plot_set_size = TRUE,
print_category_counts = TRUE,
bp_S = 0.25,
bp_R = 0.5
)
Solo PPV analysis
Next, we will calculate the solo positive predictive value (soloPPV) for each marker and variant. This will allow us to see which markers and variants are most predictive of clindamycin resistance.
# soloPPV analysis for markers
PPV_cli <- ppv(cli_bin, upset_grid = TRUE, plot_assay = TRUE)
# soloPPV analysis for variants
PPV_cli <- ppv(cli_bin_accession, upset_grid = TRUE, plot_assay = TRUE)
We can clearly see from this analysis that the variants of the markers have different associations with clindamycin resistance. For example, the erm(C)_WP0012363.1 and erm(C)_WP0012364.1 were not above the 0.5 threshold. This highlights the importance of considering the specific variants of the markers when predicting clindamycin resistance. Note: this discrepancy may be caused by inducible resistance, which is not always captured by a standard AST test. This is because the resistance gene may not be expressed under the conditions of the AST test, but can be induced in the presence of certain antibiotics (e.g., erythromycin). This is a known phenomenon for clindamycin resistance in S. aureus, where the presence of an erm gene can lead to inducible resistance that may not be detected in a standard AST test.
Association of specific variants with sequence types
Finally, we can also look at the association of specific marker variants with sequence types, if this data are available. This can be done by creating a binary matrix for the variants and then visualizing the associations with a bubble plot.
# Identify "High-Frequency" STs because we have many STs with only a few samples, which can make the plot cluttered and less informative. By filtering to include only STs with a certain number of samples (e.g., n >= 100), we can focus on the most prevalent STs in the dataset, which are likely to provide more meaningful insights into the associations between specific variants and sequence types.
high_freq_sts <- ST_data_CLI %>%
count(ST) %>%
filter(n >= 100) %>%
pull(ST)
# Merge with ST data
cli_accession_st <- cli_accession %>%
inner_join(ST_data_CLI, by = "Name") %>%
filter(ST %in% high_freq_sts)
# Calculate prevalence of variant in each ST
balloon_data <- cli_accession_st %>%
group_by(ST, marker.label) %>%
summarise(n_samples = n_distinct(Name), .groups = "drop") %>%
left_join(ST_data_CLI %>% count(ST, name = "total_st_n"), by = "ST") %>%
mutate(percent_prevalence = (n_samples / total_st_n) * 100) %>%
filter(percent_prevalence >= 10)
# Create the Balloon Plot showing only marker-ST combinations with at least 10% prevalence
ggplot(balloon_data, aes(x = ST, y = marker.label)) +
geom_point(aes(size = n_samples, color = percent_prevalence)) +
scale_color_viridis_c(option = "viridis", direction = -1) +
scale_size_continuous(range = c(1, 10)) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16),
legend.title = element_text(size = 16),
legend.text = element_text(size = 14)
) +
labs(
title = "Clindamycin Markers by Major Sequence Types",
subtitle = "Analysis of STs with n >= 100",
x = "Sequence Type (ST)",
y = "Marker variant",
size = "Sample Count",
color = "Prevalence (%)"
)
We can see from this last figure, that while erm(C)_WP0012364.1 is quite rare, erm(C)_WP0012363.1 is prevalent in certain STs, such as ST22. This suggests that there may be a clonal association of this variant with this ST, indicating resistance in this clone might be inducible and therefore more difficult to identify using standard AST.