The Depositar Team Presented at the 2025 Web Archiving Conference (WAC)

The depositar team recently attended the Web Archiving Conference 2025 (WAC2025) in Oslo, Norway, and gave a presentation in the Discovery & Access (News/Newspapers) session. The title, authors, and an abridged abstract of the presentation follow.

Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks

Tyng-Ruey Chuang, Chia-Hsun Wang, and Hung-Yen Wu

We report on our progress in converting the web archives of a recently orphaned newspaper into accessible article collections in the IPTC (International Press Telecommunications Council) standard format for news representation. After the conversion, old articles extracted from a defunct news website are reincarnated as research datasets that meet the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. Specifically, we focus on Taiwan’s Apple Daily and work on the WARC files built by the Archive Team in September 2022, at a time when the future of the newspaper seemed dim. We convert these WARC files into de-duplicated collections of pure text in ninjs (News in JSON) format. …

Our work in transforming the WARC records into ninjs objects produces a collection of 953,175 unique news articles totaling 4.3 GB. The articles are grouped by publication date (day, month, and year), so it is convenient to look up the news published on a specific day. Metadata about each article (headline(s), subject(s), original URI, and unique ID, among others) are mapped into the corresponding fields of the ninjs object for ready access. …
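As a rough illustration of the mapping and grouping described above, the following Python sketch turns already-extracted articles into ninjs-style objects, drops duplicates, and groups the result by publication date. It is not the team's actual pipeline: the input dictionary keys (`url`, `headline`, `text`, `published`, `subjects`), the function names, and the hash-based de-duplication rule are all assumptions made for this example, and the output uses only a small subset of ninjs properties (`uri`, `type`, `versioncreated`, `headline`, `subject`, `body_text`).

```python
import hashlib
from collections import defaultdict


def to_ninjs(article):
    """Map one extracted article (a plain dict, hypothetical keys) to a ninjs-style object."""
    return {
        "uri": article["url"],
        "type": "text",
        # Assumes `published` is a timezone-aware datetime.
        "versioncreated": article["published"].isoformat(),
        "headline": article["headline"],
        "subject": [{"name": name} for name in article.get("subjects", [])],
        "body_text": article["text"],
    }


def group_unique_by_date(articles):
    """De-duplicate articles by content hash and group them by publication date."""
    seen = set()
    by_date = defaultdict(list)
    for article in articles:
        # Assumed de-duplication rule: same headline and same text means same article.
        content = article["headline"] + "\n" + article["text"]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Group key is the local publication date, e.g. "2022-09-01".
        day = article["published"].date().isoformat()
        by_date[day].append(to_ninjs(article))
    return dict(by_date)
```

With this layout, looking up the news for a given day is a single dictionary access, for example `group_unique_by_date(articles).get("2022-09-01", [])`.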

The slideset and the script used in the presentation can be found at the depositar (ark:37281/k5p3h9k37).

Presentation from the depositar team at the 2025 Web Archiving Conference