INTRODUCTION
A novel coronavirus was first documented on December 30, 2019, through the Program for Monitoring Emerging Diseases (ProMED-mail). The report stated that the Medical Administration of the Wuhan Municipal Health Committee had issued “an urgent notice on the treatment of pneumonia of unknown cause in Wuhan, China” (1, 2). On January 8, 2020, the pathogen was identified as a novel coronavirus, later named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (3). The virus spread rapidly worldwide, posing significant threats to public health.
According to Google Trends, “coronavirus” dominated global search activity in 2020, with peak interest observed in Italy between March 15 and 21. The pandemic heightened public awareness of the need for prompt preventive measures. Researchers from multiple disciplines conducted extensive studies on coronavirus disease 2019 (COVID-19) (4, 5), resulting in over 1,000,000 publications. Indeed, COVID-19 accounted for the majority of scientific papers published in 2020 (6).
Predictive modeling is one of the most widely studied applications of data science in infectious disease research. Previous studies explored forecasting outbreaks in specific regions—for example, predicting dengue fever incidence across 20 cities in China using long short-term memory (LSTM) neural networks (7), or anticipating COVID-19 cases in ten Brazilian states with machine learning techniques (8). Other researchers incorporated non-pharmaceutical interventions and cultural metrics (9), or social media data such as tweets (10). As artificial intelligence (AI) continues to expand its scope (11), it is facilitating faster responses across many fields, including infectious disease research. However, the effectiveness of AI fundamentally depends on data acquisition. Prior research has confirmed that the volume and quality of data critically influence the performance of machine learning-based prediction models (12, 13, 14), particularly in time-series forecasting, where training on the most up-to-date data is essential (15, 16, 17, 18).
However, collecting up-to-date structured datasets remains challenging, especially in the early stages of outbreaks, due to time-intensive preprocessing. By contrast, unstructured textual data, such as news articles and press releases, are often available immediately. As noted earlier, ProMED-mail initially reported COVID-19 as “undiagnosed pneumonia” (2), and web-based monitoring systems such as MedISys and the World Health Organization (WHO) issue early alerts about emerging disease X (19). Communicable disease reports are also published regularly by WHO, the European Centre for Disease Prevention and Control (ECDC), the Pan American Health Organization (PAHO), and national health departments.
Several studies have explored using these media sources as datasets for epidemic analysis and forecasting (20, 21, 22). Thus, monitoring both structured and unstructured textual data from diverse sources is essential. However, manually inspecting and collecting such data at scale is impractical. To address this, we investigated the use of named entity recognition (NER) techniques to automatically extract key information from unstructured text.
Among natural language processing (NLP) tasks, NER extracts information from text by identifying entities and classifying them into predefined categories such as person, location, and date (23, 24). Traditional NER methods include deep neural networks such as convolutional neural networks (25), LSTM networks (26), and embeddings from language models such as ELMo (27). More recently, transformer-based language models have become dominant in NER research (28, 29), including bidirectional encoder representations from transformers (BERT) (30) and the generative pre-trained transformer (GPT) (31). With advances in computing power and language models, the application domains of NER have broadened diversely (32, 33).
Biomedical NER has been extensively studied in the context of biomedical literature, including entities such as drugs (34), proteins (35), and genes (36). In recent years, many researchers have adopted BERT-based approaches for biomedical NER tasks (37, 38). One study also developed an NLP pipeline to identify COVID-19 outbreaks from public health interview forms (39), demonstrating the superior performance of the BERT model.
Despite these advances, benchmark NER datasets for infectious disease surveillance remain scarce, even though NLP research in this field is urgently needed (40). For instance, one study created a Spanish event-based surveillance system using recurrent neural networks (41), while another introduced the Global Infectious Diseases Epidemic Information Monitoring System using epidemic websites (42).
To address this gap and enhance applicability in public health contexts, we propose Survice-BERT, a BERT model for biomedical NER in infectious disease surveillance reports. The model leverages a pre-trained BERT architecture and a novel dataset constructed by extracting key information from global surveillance reports in PDF format. Survice-BERT is designed to identify, extract, and classify critical outbreak-related information from unstructured text. To the best of our knowledge, this is among the first attempts at fine-tuning a pre-trained BERT model on datasets derived from infectious disease surveillance reports annotated with NER tags such as disease names, case counts, death counts, and outbreak dates. Moreover, we anticipate that the presented model has the potential to serve as a useful paradigm for surveillance data collection systems.
METHODS
This section describes the architecture, dataset construction, and training pipeline of Survice-BERT, a biomedical domain-specific BERT model fine-tuned for NER on infectious disease surveillance reports.
Model Architecture
Fig. 1 presents the workflow of Survice-BERT. The model receives unstructured surveillance reports as input and first splits the text into sentences based on punctuation. It then tokenizes each sentence, identifies tokens corresponding to predefined named entities, and categorizes the outputs into structured tables. These tables can subsequently be applied in downstream tasks such as epidemic monitoring and prediction.
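The workflow above can be illustrated with a minimal sketch. The entity tagger here is a placeholder regex stub standing in for the fine-tuned BERT model; the function names and the single illustrative tag are assumptions, not the actual implementation.

```python
# Minimal sketch of the Survice-BERT workflow: sentence splitting,
# tokenization, entity tagging, and collection into structured rows.
# The tagger below is a toy placeholder for the real NER model.
import re

def split_sentences(report_text):
    # Split unstructured report text into sentences on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report_text) if s.strip()]

def tag_tokens(sentence):
    # Placeholder tagger: a real system would call the fine-tuned BERT model.
    tokens = sentence.split()
    tags = []
    for tok in tokens:
        if re.fullmatch(r"\d{1,3}(,\d{3})*", tok):
            tags.append("B-CASE")  # treat bare numbers as case counts (toy rule)
        else:
            tags.append("O")
    return list(zip(tokens, tags))

def to_table(report_text):
    # Collect tagged entities into structured rows for downstream monitoring.
    rows = []
    for sent in split_sentences(report_text):
        entities = [(tok, tag) for tok, tag in tag_tokens(sent) if tag != "O"]
        rows.append({"sentence": sent, "entities": entities})
    return rows

table = to_table("Singapore reported 2,857 cases. Five deaths were confirmed.")
```

In the real pipeline, the toy rule inside `tag_tokens` is replaced by the model's token classification head, and the resulting tables feed epidemic monitoring and prediction tasks.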
Pre-trained Language Model: BERT
In 2019, Google introduced BERT, a pre-trained language model (PLM) trained on large general corpora, including BookCorpus and English Wikipedia (30). Unlike GPT (31), which is trained unidirectionally, BERT employs bidirectional self-attention. Its adaptability has made it widely used across domains, leading to the development of domain-specific BERT models, including:
•BioBERT: Pre-trained on biomedical literature such as PubMed abstracts and PubMed Central full-text articles (43). It has achieved state-of-the-art performance in biomedical relation extraction, question answering, and NER using datasets such as NCBI Disease (44), BC5CDR (drug and chemical entities) (45), JNLPBA (genes and protein entities) (46), and LINNAEUS (species name recognition) (47).
•SciBERT: Pre-trained on a large corpus of scientific publications from Semantic Scholar (48). Designed to address the scarcity of high-quality labeled scientific data, SciBERT enhances performance on scientific and biomedical NLP tasks (49) using datasets such as BC5CDR (45), ChemProt (50), and EBM-NLP (51).
•Other models: DNABERT (52), ClinicalBERT (53), MT-Clinical BERT (54), and PubMedBERT (55).
In this study, we fine-tuned BioBERT and SciBERT, two widely used domain-specific BERT models for biomedical NER, using our custom datasets derived from communicable disease reports.
Fine-tuning Framework: NERDA
We used NERDA, a Python-based framework built on PyTorch and Hugging Face Transformers, to fine-tune the models for NER (56). NERDA provides an intuitive interface for fine-tuning transformer models on user-defined datasets with configurable hyperparameter settings.
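A hedged sketch of this setup is shown below. The tag scheme follows the eight entity types defined in Dataset Construction, and the hyperparameter values match the best configuration reported in the Results (10 epochs, batch size 8, learning rate 3e−5); the dataset variable names and the BioBERT checkpoint identifier are illustrative assumptions.

```python
# Tag scheme for the eight outbreak-related entity types, expanded into
# BIO tags (the O tag is configured separately in NERDA via tag_outside).
ENTITY_TYPES = [
    "DATE_YEAR", "DATE_MONTH", "DATE_WEEK", "DATE_DAY",
    "LOCATION", "DISEASE_NAME", "CASE", "DEATH",
]
TAG_SCHEME = [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

# Best-performing configuration from the Results section.
HYPERPARAMETERS = {"epochs": 10, "train_batch_size": 8, "learning_rate": 3e-5}

# Fine-tuning itself would look roughly like this (requires the NERDA
# package, a pre-trained checkpoint, and prepared CoNLL-style datasets):
#
# from NERDA.models import NERDA
# model = NERDA(transformer="dmis-lab/biobert-base-cased-v1.1",
#               tag_scheme=TAG_SCHEME, tag_outside="O",
#               dataset_training=train_set, dataset_validation=valid_set,
#               hyperparameters=HYPERPARAMETERS)
# model.train()
```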
Dataset Construction
A new NER dataset was developed due to the lack of publicly available resources containing the specific target entities required for modeling. The dataset follows the CoNLL-2003 benchmark format (57), consisting of sentence–tag pairs annotated using the BIO tagging scheme.
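A CoNLL-style sentence–tag pair can be sketched as follows; the tokens and BIO labels here are a hand-made illustration of the annotation format, not an entry from the actual dataset.

```python
# One illustrative sentence annotated in the BIO scheme: B- marks the
# beginning of an entity, I- its continuation, and O any non-entity token.
sentence = ["On", "17", "March", "2023", "Tanzania", "reported",
            "seven", "cases", "including", "five", "deaths"]
tags = ["O", "B-DATE_DAY", "B-DATE_MONTH", "B-DATE_YEAR", "B-LOCATION", "O",
        "B-CASE", "O", "O", "B-DEATH", "O"]

# CoNLL-2003-style serialization: one "token tag" pair per line,
# with a blank line separating sentences.
conll_lines = [f"{tok} {tag}" for tok, tag in zip(sentence, tags)]
```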
A Python-based crawler and parser were implemented to automatically download infectious disease surveillance reports in PDF format from three major organizations and convert them into plain text. Examples of extracted sentences include:
•ECDC: Communicable Disease Threats Report (e.g., “On 17 March 2023, the Ministry of Health of Tanzania reported seven people affected by an undiagnosed disease in Kagera, northern Tanzania, including five deaths and two people treated at hospitals”) (58, 59).
•PAHO: COVID-19 Daily Update Report (e.g., “According to the latest Uruguay Ministry of Public Health report, the total notes 10 positive cases reported in the last 24 hours were excluded from the total”) (60, 61).
•WHO: Dengue Situation Report (e.g., “As of epidemiological week 17 of 2023, 129 dengue cases were reported in Singapore, leading to a total of 2,857 cases”) (62, 63).
Tokenization was performed using spaces and punctuation, except for commas in numeric values representing case and death counts. We defined eight NER tags relevant to disease outbreak monitoring: four DATE tags (YEAR, MONTH, WEEK, and DAY), plus LOCATION, DISEASE_NAME, CASE, and DEATH. Each sentence was manually annotated with these tags. For indirect expressions such as “this year,” the token “this” was annotated as a DATE entity. Because the annotations were produced by a single researcher, no inter-annotator agreement score was calculated. Fig. 2 shows the structure and an example annotation of the datasets.

Fig. 2
Structure and annotation example of the proposed datasets. Sentences were derived from three types of reports, and eight named entities were annotated using BIO tagging with B (beginning), I (inside), and O (outside), depending on position.
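The tokenization rule described above (split on whitespace and punctuation, but preserve commas inside numeric values) can be sketched with a single regular expression; the exact pattern used in the actual pipeline may differ.

```python
# Tokenizer sketch: numbers with thousands separators are matched first so
# their commas survive as part of one token; everything else splits into
# words or individual punctuation marks.
import re

TOKEN_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\w+|[^\w\s]")

def tokenize(sentence):
    return TOKEN_RE.findall(sentence)

tokens = tokenize("In total, 2,857 cases were reported.")
# "2,857" stays one token, while the comma after "total" is split off.
```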
To mitigate overfitting and enhance generalizability, the training, validation, and test sets were constructed from different report sources, following approaches reported in prior studies (64). To further diversify the data, augmented sentences were generated by modifying country and disease names. There were no duplicate sentences across the datasets. The final dataset comprised 4,200, 1,400, and 1,400 sentences for training, validation, and testing, respectively (Table 1). Two versions of the dataset were also created to assess the effect of comma usage in numeric values for case and death counts: one with commas and one without. Table 2 summarizes the counts of NER tags in each dataset, and Fig. 3 presents their distribution across the datasets.
Table 1.
Description of our datasets
| Dataset | Name of Report | No. of Sentences^a | Ratio (%) | Source | Report Period |
|---|---|---|---|---|---|
| Training | Communicable Disease Threats Report | 4,200 | 60 | ECDC | Week 6 2012–Week 21 2012 |
| Validation | COVID-19 Daily Update Report | 1,400 | 20 | PAHO | 26 Jan 2020–2 Mar 2020 |
| Test | Dengue Situation Report | 1,400 | 20 | WHO | Report No. 458–462 |
a: Sentences are the constituent elements of the datasets shown in Fig. 2.
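The augmentation step described above (generating new sentences by substituting country and disease names) can be sketched as follows; the template and the substitution lists are illustrative, not the ones used to build the actual dataset.

```python
# Augmentation sketch: fill a sentence template with every
# country/disease combination to diversify the training data.
import itertools

def augment(template, countries, diseases):
    return [template.format(country=c, disease=d)
            for c, d in itertools.product(countries, diseases)]

sentences = augment(
    "{country} reported 12 new cases of {disease} this week.",
    countries=["Singapore", "Niger"],
    diseases=["dengue", "measles"],
)
# 2 countries x 2 diseases -> 4 distinct sentences
```

Because the NER tags attach to the substituted tokens themselves, the BIO annotations of the template carry over to each generated sentence unchanged.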
Model Training
We fine-tuned two biomedical domain-specific pre-trained BERT models, BioBERT and SciBERT, on our custom infectious disease NER dataset. To evaluate format robustness, each model was trained on two dataset versions—one with commas in numeric values and one without—resulting in four model variants (Table 3).
Table 3.
Four model variants used in this study
| | V1 | V2 | V3 | V4 |
|---|---|---|---|---|
| Model | BioBERT | BioBERT | SciBERT | SciBERT |
| Numeric format | Without commas | With commas | Without commas | With commas |
The models were fine-tuned using the NERDA framework. A total of 40 experiments were conducted by varying batch sizes and learning rates as shown in Table 4, following configurations recommended in earlier BERT studies (30, 43, 44). Model training was performed for 10 epochs on the Neuron supercomputing system at the Korea Institute of Science and Technology Information. The experimental settings are listed in Table 5.
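The experiment grid can be sketched as below. The batch sizes and learning rates here are illustrative stand-ins (the actual values are those in Table 4, which is not reproduced here); they are chosen only so that 4 model variants × 10 hyperparameter settings yields the reported 40 runs.

```python
# Sketch of the 40-experiment grid: 4 model variants crossed with
# assumed batch-size and learning-rate grids. Each run fine-tunes
# for 10 epochs, per the Methods text.
import itertools

VARIANTS = ["V1", "V2", "V3", "V4"]   # BioBERT/SciBERT x comma format
BATCH_SIZES = [4, 8, 16, 32, 64]      # assumed, for illustration
LEARNING_RATES = [3e-5, 5e-5]         # assumed, for illustration

experiments = list(itertools.product(VARIANTS, BATCH_SIZES, LEARNING_RATES))
```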
RESULTS
We fine-tuned four model variants, combining two pre-trained BERT models with two dataset formats. Each model was trained for 10 epochs with varying batch sizes and learning rates on the Neuron supercomputer. In addition, we constructed new datasets for detecting outbreak-related entities, including LOCATION, DISEASE_NAME, CASE, DEATH, and DATE.
To evaluate model performance, we used the F1-score, the standard metric for NER. Table 6 presents the micro-averaged F1-scores of the models across datasets. Performance was highest on the training data derived from ECDC reports, followed by the test data (WHO) and the validation data (PAHO). The F1-scores for the training set were all 1.00. Test scores were consistently above 0.90, while validation scores were comparatively lower. Supplementary experiments confirmed that fine-tuning on separate training, validation, and test datasets yielded higher F1-scores than training on an integrated dataset.
Table 6.
F1-scores of the best models by dataset (V1: BioBERT w/o commas, V2: BioBERT w/ commas, V3: SciBERT w/o commas, V4: SciBERT w/ commas)
| Dataset | V1 | V2 | V3 | V4 |
|---|---|---|---|---|
| ECDC (training) | 1.000 | 1.000 | 1.000 | 1.000 |
| PAHO (validation) | 0.967 | 0.956 | 0.941 | 0.963 |
| WHO (test) | 0.991 | 0.992 | 0.990 | 0.993 |
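The micro-averaged F1-score pools true/false positives and negatives across all entity tags before computing precision and recall. The sketch below is a token-level simplification (entity-level evaluation, e.g. with seqeval, is stricter about span boundaries):

```python
# Token-level micro-averaged F1 over NER predictions: counts are pooled
# across all tags; tokens labelled O are not counted as positives.
def micro_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and p != g)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["B-CASE", "O", "B-DEATH", "B-LOCATION"]
pred = ["B-CASE", "O", "B-DEATH", "O"]
score = micro_f1(gold, pred)  # tp=2, fp=0, fn=1 -> P=1.0, R=2/3, F1=0.8
```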
The best configuration for each model was further evaluated on the WHO test dataset, using a learning rate of 3e−5 and a batch size of eight. Table 7 summarizes the F1-scores per NER tag across model variants. The V4 model (SciBERT with commas) achieved the highest overall F1-score (0.993). Although V4 outperformed the other models overall, BioBERT achieved higher scores on certain tags, such as CASE, DEATH, and DATE_WEEK. Conversely, SciBERT performed better on tags such as DATE_MONTH and DATE_DAY.
Table 7.
F1-scores of model variants on WHO reports by entity tag
We also compared results across dataset versions with and without commas in numeric values of cases and deaths. For the DEATH tag, models trained without commas (V1 and V3) achieved higher F1-scores, whereas for the CASE tag, models trained with commas (V2 and V4) performed better.
To validate practical applicability, we tested V2 and V4 on sentences not used during training, sourced from ProMED-mail and WHO Outbreak News (65, 66). Example inputs include:
•A total of 3 cases of MVE virus infection and 2 deaths have been reported in Victoria this mosquito season.
•From 1 November 2022 to 27 January 2023, a total of 559 cases of meningitis (of which 111 are laboratory confirmed), including 18 deaths (overall CFR 3.2%), have been reported from Zinder Region, southeast of Niger, compared to the 231 cases reported during 1 November 2021 to 31 January 2022.
As shown in Table 8, the extracted values matched the ground truth for all non-O tags. The V4 model correctly extracted every entity tag, whereas V2 misclassified “this” and “mosquito” in the first sentence. The error on “this” likely stems from annotation patterns learned during training: because phrases such as “this year” were consistently labeled as DATE entities, the model may be confused by similar tokens in other contexts. The error on “mosquito” is likely due to the relatively long I-LOC sequences in the dataset, which reduced sequence-level precision in NER tasks.
Table 8.
Results of extracting infectious disease outbreak information from ProMED-mail and WHO Outbreak News
DISCUSSION
In this study, we presented Survice-BERT, a BERT-based model fine-tuned for biomedical NER in infectious disease surveillance reports. We also constructed a novel benchmark dataset from global public health reports, annotated with eight predefined NER tags. Experimental results demonstrated robust performance, with the V4 model (SciBERT with commas) achieving an average F1-score of 0.993, surpassing previous studies (41). Survice-BERT effectively extracts critical details such as time, location, and disease name from outbreak reports, providing structured outputs that support sustainable monitoring, analysis, and forecasting in epidemic research.
The COVID-19 pandemic underscored the critical importance of early detection of emerging infectious diseases. However, most countries remain insufficiently prepared, particularly for unfamiliar pathogens (67). In this context, imminent contagion in public health demands rapid and intelligent responses. For example, since May 2022, monkeypox outbreaks have emerged worldwide, including in non-endemic countries (68). Such concurrent outbreaks highlight the need for preemptive surveillance strategies (69). Recently, the rapid advancement of AI technologies has facilitated their integration into policy-making processes to strengthen continuous monitoring and early detection (70).
From a methodological perspective, this study shows that fine-tuning a domain-specific PLM can achieve high performance on a relatively small dataset within a short training time, effectively reducing the burden of manual surveillance. It also demonstrates the advantages of source-specific dataset separation and highlights the utility of our dataset as a benchmark resource. Furthermore, the findings emphasize the potential of adapting BERT-based models to broader biomedical NER tasks.
Nevertheless, several limitations remain. Due to the structured orthography of the dataset, input sentences must be preprocessed to ensure compatibility. Currently, Survice-BERT can extract information only from English texts; extending it to multilingual datasets or incorporating translation during preprocessing would broaden its applicability. In addition, surveillance reports by their nature contain indirect temporal expressions such as “last month” and “this week”; these should be normalized through a post-processing step so that extracted entities are fully grounded. Finally, the minor misclassifications observed during evaluation, such as “this” and “mosquito,” suggest that annotation patterns can influence tagging accuracy in particular contexts.
Future work will focus on developing training datasets for predictive tasks and expanding Survice-BERT with additional NER tags. We also plan to build multilingual versions fine-tuned on datasets in multiple languages or via translation APIs. Furthermore, we aim to develop improved pre- and post-processing techniques for normalization and performance enhancement. We can also provide a web-based or standalone application of Survice-BERT to deliver model outputs to end users, including public health officials, epidemiologists, and other domain researchers. The Survice-BERT model will continue to be freely available on GitHub (71).
Looking ahead, the inevitability of future pandemics, including the unknown “Disease X,” underscores the urgency of strengthening public health surveillance systems. This study contributes to continuous monitoring, timely detection, and predictive modeling for epidemic prevention. In conclusion, Survice-BERT has the potential to enhance public health preparedness and resilience in the face of future pandemics.