This Sullivan_FoodMicrobe_readme.txt file was generated on 2022-03-14 by Renato Hohl Orsi

GENERAL INFORMATION

1. Title of Dataset: Data from: Whole genome sequencing-based characterization of Listeria isolates from produce packinghouses and fresh-cut facilities suggests both persistence and re-introduction of fully virulent L. monocytogenes

2. Author Information
   A. Principal Investigator Contact Information
      Name: Martin Wiedmann
      Institution: Cornell University
      Address:
      Email: mw16@cornell.edu

   B. Associate or Co-investigator Contact Information
      Name: Renato H. Orsi
      Institution: Cornell University
      Address:
      Email: rho2@cornell.edu

   C. Alternate Contact Information
      Name:
      Institution:
      Address:
      Email:

3. Date of data collection (range): 2017-03 to 2019-04

4. Geographic location of data collection: USA

SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data:

2. Links to publications that cite or use the data:

3. Recommended citation for this dataset: Genevieve Sullivan, Renato H Orsi, Erika Estrada, Laura Strawn, and Martin Wiedmann. (2022) Data from: Whole genome sequencing-based characterization of Listeria isolates from produce packinghouses and fresh-cut facilities suggests both persistence and re-introduction of fully virulent L. monocytogenes. [dataset] Cornell University eCommons Repository. https://doi.org/10.7298/74sp-fg52

DATA & FILE OVERVIEW

1. File List:

   Sullivan_Metadata.txt
   Short description: Metadata associated with the 276 Listeria isolates for which assembly data are also provided. Each row represents a distinct isolate.

   Sullivan_Assemblies.zip
   Short description: This compressed folder contains the whole genome sequence assemblies for the 276 isolates described in the Sullivan_Metadata.txt file.

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data:

Isolate selection.
Listeria isolates were obtained from two previous studies that investigated Listeria prevalence in 13 produce packinghouses (VT-A, VT-B, VT-C, VT-D, VT-E, VT-F, VT-G, VT-H, VT-I, VT-J, CU-A, CU-B, CU-C) and 3 fresh-cut facilities (CU-D, CU-E, CU-F). These two studies collected a total of 4,152 environmental samples, of which 217 were positive for Listeria. From these 217 Listeria-positive samples, a total of 679 Listeria isolates were collected. Additionally, a third study collected 156 environmental samples from three of the operations that participated in that study (i.e., operations CU-A, CU-C, CU-E) approximately one year after the original study, resulting in 21 Listeria-positive samples and 155 Listeria isolates. Only L. monocytogenes (LM) isolates from this third study were included in the study reported here.

To identify unique representative isolates from each sample, sigB allelic typing had been performed on all isolates as previously described. One isolate of each sigB allelic type present in a given sample was considered “representative”. Therefore, a single positive sample could have more than one representative isolate if these isolates showed distinct sigB allelic types. Additionally, different representative isolates could have the same sigB allelic type as long as they originated from different samples. These “representative” isolates (n = 276) were characterized by whole genome sequencing as detailed below. In certain instances, such as when preliminary sequencing data were unavailable at the time of strain selection, whole genome sequencing of an isolate ultimately designated as representative was not performed.

Operation code designations for each isolate are consistent with the original studies. However, to differentiate between studies, operations studied by Sullivan et al. were given a “CU” prefix (i.e., Cornell University), and operations studied by Estrada et al. were given a “VT” prefix (i.e., Virginia Tech).

2. 
Methods for processing the data:

Whole genome sequencing. Total DNA was extracted from representative isolates for whole genome sequencing. Sequencing was performed using a HiSeq 2500 (Illumina, Inc., San Diego, CA, United States) with a maximum read length of 2 x 150 bp. Trimmomatic (v 0.39) was used to trim the raw reads, which were then evaluated and filtered based on quality using FastQC (v 0.11). Filtered reads were assembled with SPAdes (v 3.13) using k-mer sizes of 21, 33, 55, and 77 bp, as suggested for multi-cell data with 2 x 150 bp reads.

3. Instrument- or software-specific information needed to interpret the data: Assembly files can be opened with any text editor (e.g., Notepad). The metadata file is tab-delimited and can be viewed in a text editor or in a spreadsheet reader (e.g., Microsoft Excel).

4. Environmental/experimental conditions: Prior to DNA isolation, bacteria were grown overnight in brain heart infusion (BHI) medium at 37 °C without shaking.

5. Describe any quality-assurance procedures performed on the data: The quality of the assemblies was verified using the program QUAST. All assemblies had genome lengths and %GC content compatible with Listeria sensu stricto. The sigB gene was extracted from each assembly and compared to the sigB sequence obtained by PCR and Sanger sequencing. All sigB sequences from the assemblies matched the sigB sequences from the PCR.

6. People involved with sample collection, processing, analysis and/or submission: Genevieve Sullivan, Renato H Orsi, Erika Estrada, Laura Strawn, and Martin Wiedmann

DATA-SPECIFIC INFORMATION FOR: Sullivan_Metadata.txt

1. Number of variables: 9

2. Number of rows: 276

3. Variable List:
   Species: Species of the isolates.
   Isolate ID: Accession number (ID) of the isolate as assigned in Food Microbe Tracker (https://www.foodmicrobetracker.net).
   File names: Name of the assembly files for each isolate.
   Operation: Operation code from where the isolate was obtained.
   Month: Month of collection.
   Year: Year of collection.
   Site: Specific location within the operation from which the sample was obtained. The isolate was then obtained from that sample.
   Cluster Number: Indicates whether an isolate was part of a cluster (i.e., a group of genetically related isolates). Isolates with the same cluster number are genetically related to each other.
   Pre-production?: Indicates whether the samples were obtained pre-production or not.

4. Missing data codes: "."

5. Specialized formats or other abbreviations used:
   VT: Virginia Tech
   CU: Cornell University
   Months are indicated by their first three letters.

DATA-SPECIFIC INFORMATION FOR: *.fasta files

1. Number of variables: 1

2. Number of rows: variable

3. Variable List: Each file consists of all the contigs that form the assembly of a given isolate.

4. Missing data codes: No missing data.

5. Specialized formats or other abbreviations used:
   A: adenine
   C: cytosine
   G: guanine
   T: thymine
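
EXAMPLE: READING THE FILES (illustrative only)

The sketch below shows one possible way to read the tab-delimited metadata file and an assembly file using only the Python standard library. It is not part of the dataset or of the authors' workflow. It assumes that the column headers in Sullivan_Metadata.txt match the variable names listed above and that the assembly files use the usual FASTA layout (a ">" header line followed by sequence lines). The function names (read_metadata, read_fasta, assembly_stats) are hypothetical and chosen for illustration.

```python
import csv
import io


def read_metadata(handle, missing="."):
    """Read tab-delimited rows; map the missing-data code "." to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


def read_fasta(handle):
    """Return {contig_header: sequence} from a FASTA-formatted handle."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


def assembly_stats(contigs):
    """Return (total length in bp, %GC) summed over all contigs."""
    total = sum(len(seq) for seq in contigs.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in contigs.values())
    return total, (100.0 * gc / total if total else 0.0)


if __name__ == "__main__":
    # In-memory stand-ins for the real files; the values are made up.
    demo_table = io.StringIO("Species\tCluster Number\nL. monocytogenes\t.\n")
    demo_fasta = io.StringIO(">contig_1\nACGT\nGG\n>contig_2\nAT\n")
    print(read_metadata(demo_table))
    print(assembly_stats(read_fasta(demo_fasta)))
```

To use the real files, open Sullivan_Metadata.txt with open(path, newline="") and pass the handle to read_metadata, then open each assembly named in the "File names" column and pass it to read_fasta. The length and %GC values returned by assembly_stats are only a quick sanity check and do not replace the QUAST-based verification described under quality assurance.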