
Link rot (also called link death, link breaking, or reference rot) is the tendency of hyperlinks to cease, over time, to point to their originally targeted file, web page, or server because that resource has been relocated or has become permanently unavailable. A link that no longer points to its target, often called a broken or dead link, is a specific form of dangling pointer.

The rate of link rot is a subject of study and research due to its significance to the internet's ability to preserve information. Estimates vary dramatically between different studies.[citation needed]

Prevalence

A number of studies have examined the prevalence of link rot within the World Wide Web, in academic literature that references URLs, and within digital libraries.[1]

A 2003 study found that on the Web, about one link out of every 200 broke each week,[2] suggesting a half-life of 138 weeks. This rate was largely confirmed by a 2016–2017 study of links in Yahoo! Directory (which had stopped updating in 2014 after 21 years of development) that found the half-life of the directory's links to be two years.[3]

A 2004 study showed that subsets of Web links (such as those targeting specific file types or those hosted by academic institutions) could have dramatically different half-lives.[4] The URLs selected for publication appear to have greater longevity than the average URL. A 2015 study by Weblock analyzed more than 180,000 links from references in the full-text corpora of three major open access publishers and found a half-life of about 14 years,[5] generally confirming a 2005 study that found that half of the URLs cited in D-Lib Magazine articles were active 10 years after publication.[6] Other studies have found higher rates of link rot in academic literature, but typically suggest a half-life of four years or greater.[7][8] A 2013 study in BMC Bioinformatics analyzed nearly 15,000 links in abstracts from Thomson Reuters's Web of Science citation index and found that the median lifespan of web pages was 9.3 years, and just 62% were archived.[9]

A 2002 study suggested that link rot within digital libraries is considerably slower than on the web, finding that about 3% of the objects were no longer accessible after one year[10] (equating to a half-life of nearly 23 years).
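The half-life figures quoted in this section can be reproduced from the reported loss rates if one assumes that a constant fraction of the surviving links breaks in every period (simple exponential decay). That assumption, and the function name `half_life`, are illustrative and are not taken from the cited studies. A minimal sketch:

```python
import math

def half_life(loss_fraction_per_period: float) -> float:
    """Number of periods until half of the links are expected to be broken,
    assuming every period loses the same fraction of the links still alive."""
    if not 0.0 < loss_fraction_per_period < 1.0:
        raise ValueError("loss fraction must be strictly between 0 and 1")
    survival = 1.0 - loss_fraction_per_period
    # Solve survival ** t == 0.5 for t.
    return math.log(0.5) / math.log(survival)

weekly = half_life(1 / 200)  # one link in 200 breaking each week
yearly = half_life(0.03)     # 3% of objects inaccessible after one year
print(f"{weekly:.0f} weeks, {yearly:.1f} years")  # prints "138 weeks, 22.8 years"
```

Under this model the weekly rate from the 2003 study gives about 138 weeks, and the 3% annual loss from the 2002 study gives just under 23 years, matching the figures above.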

Causes

Link rot can result from several occurrences. A target web page may be removed. The server that hosts the target page could fail, be removed from service, or relocate to a new domain name. A domain name's registration may lapse or be transferred to another party. Some causes will result in the link failing to find any target and returning an error such as HTTP 404. Other causes will result in a link targeting content other than what was intended by the link's author.

Other reasons for broken links include:

  • the restructuring of websites that causes changes in URLs (e.g. domain.net/pine might be moved to domain.net/tree/pine)
  • relocation of formerly free content to behind a paywall
  • a change in server architecture that results in code such as PHP functioning differently
  • dynamic page content such as search results that changes by design
  • the presence of user-specific information such as a login name within the link
  • deliberate blocking by content filters or firewalls
  • the removal of gTLDs[11]

Prevention and detection

Strategies for preventing link rot can focus on placing content where its likelihood of persisting is strong, authoring links that are less likely to be broken, taking steps to preserve existing links, or repairing links whose targets have been relocated or removed.

The creation of URLs that will not change with time is the fundamental method of preventing link rot. Preventive planning has been championed by Tim Berners-Lee and other web pioneers.[12]

Strategies pertaining to the authorship of links include:

Strategies pertaining to the protection of existing links include:

  • using redirection mechanisms such as HTTP 301 to automatically refer browsers and crawlers to relocated content
  • using content management systems which can automatically update links when content within the same site is relocated or automatically replace links with canonical URLs[15]
  • integrating search resources into HTTP 404 pages.[16]
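The first and last of these strategies can be sketched with Python's standard `http.server` module. The `MOVED` table, the `/search` path, and the names `redirect_target`, `RedirectHandler` and `serve` are hypothetical choices for illustration; the relocated path reuses the domain.net/pine example from the Causes section. This is a sketch under those assumptions, not a production configuration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical record of relocated pages: old path -> new path.
MOVED = {"/pine": "/tree/pine"}

def redirect_target(path):
    """Return the new location of a relocated path, or None if no move is recorded."""
    return MOVED.get(path)

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is not None:
            # 301 Moved Permanently: clients should use the new URL from now on.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
            return
        # Unknown path: answer with a real 404 status plus a pointer to site search.
        body = b'<p>Page not found. Try the <a href="/search">site search</a>.</p>'
        self.send_response(404)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Run the sketch locally until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `serve()` starts the sketch on the local machine. A request for `/pine` is then answered with status 301 and a `Location: /tree/pine` header, while any other path receives a genuine 404 status rather than an error page served with status 200.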

The detection of broken links may be done manually or automatically. Automated methods, including plug-ins for WordPress, Drupal, and other content management systems, can be used to detect the presence of broken URLs. An alternative is using a dedicated broken link checker such as Xenu's Link Sleuth. However, if a URL returns an HTTP 200 (OK) response, it may be accessible, but the contents of the page could have changed and may no longer be relevant, meaning that manual inspection of links is often still necessary. Some web servers also return a soft 404, reporting to computers that the link works even though it does not. In a study published in 2004, a heuristic was developed for detecting soft 404s.[17]
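An automated check of the kind described above might look like the following sketch, which uses only the Python standard library. The soft-404 probe (requesting a randomly named sibling URL and treating a success response as suspicious) is a simplified illustration in the spirit of that idea and should not be read as the heuristic published in the 2004 study. All function names and the verdict labels are assumptions made for this example:

```python
import urllib.error
import urllib.request
import uuid
from urllib.parse import urljoin

def fetch_status(url, timeout=10.0):
    """Return the final HTTP status for url (redirects are followed),
    or 0 if the server could not be reached at all."""
    # HEAD avoids downloading the body; some servers reject it, so a
    # fuller checker might retry with GET.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0

def classify(status):
    """Map a status code to a coarse verdict."""
    if status == 0:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"  # reachable, though the content may still have changed
    if status in (404, 410):
        return "broken"
    return "needs review"

def probe_url(url):
    """Build a sibling URL with a random name that should not exist."""
    return urljoin(url, "linkcheck-" + uuid.uuid4().hex)

def check_link(url, fetch=fetch_status):
    """Classify url, flagging servers that also report success for a
    page that almost certainly does not exist."""
    verdict = classify(fetch(url))
    if verdict == "ok" and classify(fetch(probe_url(url))) == "ok":
        return "possible soft 404"
    return verdict
```

The `fetch` parameter lets the checker be exercised without network access, for example `check_link(url, fetch=lambda u: 200)`. A verdict of "ok" only means the server responded successfully, so manual inspection of the page content remains necessary.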

Web archiving

To combat link rot, web archivists are actively engaged in collecting the Web or particular portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. The goal of the Internet Archive is to maintain an archive of the entire Web, taking periodic snapshots of pages that can then be accessed for free via the Wayback Machine. In January 2013 the organization announced that it had reached the milestone of 240 billion archived URLs.[18] National libraries, national archives and other organizations are also involved in archiving culturally important Web content.

A number of tools exist to archive web resources that may go missing in the future:

  • The Wayback Machine, at the Internet Archive,[19] is a free website that archives old web pages. It does not archive websites whose owners have stated they do not want their website archived.
  • WebCite, a tool specifically for scholarly authors, journal editors and publishers to permanently archive "on-demand" and retrieve cited Internet references.[14]
  • Archive.is, an archive site which stores snapshots of web pages. It retrieves one page at a time, but unlike WebCite, it includes Web 2.0 sites such as Google Maps and Twitter.
  • Perma.cc, which is supported by the Harvard Law School together with a broad coalition of university libraries, takes a snapshot of a URL's content and returns a permanent link.[20]
  • The Hiberlink project, a collaboration between the University of Edinburgh, the Los Alamos National Laboratory and others, is working to measure “reference rot” in online academic articles, and also to what extent Web content has been archived.[21] A related project, Memento, has established a technical standard for accessing online content as it existed in the past.[22]
  • Some social bookmarking websites allow users to make online clones of any web page on the internet, creating a copy at an independent URL which remains online even if the original page goes down.
  • Amber, created by the Harvard Berkman Center, is a tool built to fight link rot by archiving links on WordPress and Drupal sites, in order to prevent web censorship and bolster content preservation.[23]

However, such preservation systems may themselves suffer intermittent service interruptions, during which the preserved URLs are unavailable.[24]
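As an illustration of how an archived copy might be looked up programmatically, the following sketch queries the Internet Archive's Wayback Machine availability endpoint. The endpoint and the JSON fields used here (`archived_snapshots`, `closest`, `available`, `url`) follow the Internet Archive's published description of that interface as understood at the time of writing and should be checked against its current documentation. The function names are hypothetical:

```python
import json
import urllib.request
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

def availability_query(url, timestamp=""):
    """Build the lookup URL; timestamp is an optional YYYYMMDD-style string
    asking for the snapshot closest to that date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)

def closest_snapshot(payload):
    """Extract the archived URL from a decoded response, or None if the
    response lists no available snapshot."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None

def find_archived_copy(url, timestamp=""):
    """Ask the service for the closest snapshot of url (requires network access)."""
    query = availability_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=10) as response:
        return closest_snapshot(json.load(response))
```

Because archiving services can be intermittently unavailable, a caller would normally wrap `find_archived_copy` in error handling and treat a `None` result as "no snapshot found at this time" rather than as proof that the page was never archived.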

See also

Further reading

Link rot on the Web

  • Kille, Leighton Walter (8 November 2014). "The Growing Problem of Internet "Link Rot" and Best Practices for Media and Online Publishers". Journalist's Resource, Harvard Kennedy School. Archived from the original on 12 January 2015. Retrieved 16 January 2015.
  • Eysenbach, Gunther; Trudel, Mathieu (2005). "Going, going, still there: Using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research. 7 (5): e60. doi:10.2196/jmir.7.5.e60. PMC 1550686. PMID 16403724.
  • Bar-Yossef, Ziv; Broder, Andrei Z.; Kumar, Ravi; Tomkins, Andrew (2004). "Sic transit gloria telae: towards an understanding of the Web's decay". Proceedings of the 13th international conference on World Wide Web – WWW '04. pp. 328–337. CiteSeerX 10.1.1.1.9406. doi:10.1145/988672.988716. ISBN 978-1581138443.
  • Fetterly, Dennis; Manasse, Mark; Najork, Marc; Wiener, Janet (2003). "A large-scale study of the evolution of web pages". Proceedings of the 12th international conference on World Wide Web. Retrieved 14 September 2010.
  • Markwell, John; Brooks, David W. (2002). "Broken Links: The Ephemeral Nature of Educational WWW Hyperlinks". Journal of Science Education and Technology. 11 (2): 105–108. doi:10.1023/A:1014627511641.
  • Berners-Lee, Tim (1998). "Cool URIs Don't Change". Archived from the original on 2000-03-02. Retrieved 2019-01-31.

In academic literature

In digital libraries

References

  1. Habibzadeh, P. (2013). "Decay of References to Web sites in Articles Published in General Medical Journals: Mainstream vs Small Journals". Applied Clinical Informatics. 4 (4): 455–464. doi:10.4338/aci-2013-07-ra-0055. PMC 3885908. PMID 24454575.
  2. Fetterly, Dennis; Manasse, Mark; Najork, Marc; Wiener, Janet (2003). "A large-scale study of the evolution of web pages". Proceedings of the 12th international conference on World Wide Web. Retrieved 14 September 2010.
  3. van der Graaf, Hans. "The half-life of a link is two year". ZOMDir's blog. Archived from the original on 2017-10-17. Retrieved 2019-01-31.
  4. Koehler, Wallace (2004). "A longitudinal study of web pages continued: a consideration of document persistence". Information Research. 9 (2). Archived from the original on 2017-09-11. Retrieved 2019-01-31.
  5. "All-Time Weblock Report". August 2015. Archived from the original on 4 March 2016. Retrieved 12 January 2016.
  6. McCown, Frank; Chan, Sheffan; Nelson, Michael L.; Bollen, Johan (2005). "The Availability and Persistence of Web References in D-Lib Magazine" (PDF). Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05).
  7. Spinellis, Diomidis (2003). "The Decay and Failures of Web References". Communications of the ACM. 46 (1): 71–77. CiteSeerX 10.1.1.12.9599. doi:10.1145/602421.602422.
  8. Lawrence, Steve; Pennock, David M.; Flake, Gary William; Krovetz, Robert; Coetzee, Frans M.; Glover, Eric; Nielsen, Finn Arup; Kruger, Andries; Giles, C. Lee (2001). "Persistence of Web References in Scientific Research". Computer. 34 (2): 26–31. CiteSeerX 10.1.1.97.9695. doi:10.1109/2.901164.
  9. Hennessey, Jason; Xijin Ge, Steven (2013). "A Cross Disciplinary Study of Link Decay and the Effectiveness of Mitigation Techniques". BMC Bioinformatics. 14: S5. doi:10.1186/1471-2105-14-S14-S5. PMC 3851533. PMID 24266891. Archived from the original on 21 January 2015. Retrieved 16 January 2015.
  10. Nelson, Michael L.; Allen, B. Danette (2002). "Object Persistence and Availability in Digital Libraries". D-Lib Magazine. 8 (1). doi:10.1045/january2002-nelson.
  11. "The death of a TLD". blog.benjojo.co.uk. Archived from the original on 2018-07-26. Retrieved 2018-07-27.
  12. Berners-Lee, Tim (1998). "Cool URIs Don't Change". Archived from the original on 2000-03-02. Retrieved 2019-01-31.
  13. Kille, Leighton Walter (8 November 2014). "The Growing Problem of Internet "Link Rot" and Best Practices for Media and Online Publishers". Journalist's Resource, Harvard Kennedy School. Archived from the original on 12 January 2015. Retrieved 16 January 2015.
  14. Eysenbach, Gunther; Trudel, Mathieu (2005). "Going, going, still there: Using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research. 7 (5): e60. doi:10.2196/jmir.7.5.e60. PMC 1550686. PMID 16403724.
  15. Rønn-Jensen, Jesper (2007-10-05). "Software Eliminates User Errors And Linkrot". Justaddwater.dk. Archived from the original on 11 October 2007. Retrieved 5 October 2007.
  16. Mueller, John (2007-12-14). "FYI on Google Toolbar's Latest Features". Google Webmaster Central Blog. Archived from the original on 13 September 2008. Retrieved 9 July 2008.
  17. Bar-Yossef, Ziv; Broder, Andrei Z.; Kumar, Ravi; Tomkins, Andrew (2004). "Sic transit gloria telae: towards an understanding of the Web's decay". Proceedings of the 13th international conference on World Wide Web – WWW '04. pp. 328–337. CiteSeerX 10.1.1.1.9406. doi:10.1145/988672.988716. ISBN 978-1581138443.
  18. "Wayback Machine: Now with 240,000,000,000 URLs | Internet Archive Blogs". 2013-01-09. Archived from the original on 2017-09-12. Retrieved 2014-04-16.
  19. "Internet Archive: Digital Library of Free Books, Movies, Music & Wayback Machine". 2001-03-10. Archived from the original on 26 January 1997. Retrieved 7 October 2013.
  20. Zittrain, Jonathan; Albert, Kendra; Lessig, Lawrence (12 June 2014). "Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations". Legal Information Management. 14 (2): 88–99. doi:10.1017/S1472669614000255.
  21. "Hiberlink". Hiberlink.org. Archived from the original on 29 January 2015. Retrieved 15 January 2015.
  22. "Memento: Time Travel for the Web". Memento. Archived from the original on 7 January 2015. Retrieved 15 January 2015.
  23. "Harvard University's Berkman Center Releases Amber, a "Mutual Aid" Tool for Bloggers & Website Owners to Help Keep the Web Available | Berkman Center". cyber.law.harvard.edu. Archived from the original on 2016-02-02. Retrieved 2016-01-28.
  24. Habibzadeh, Parham (2015-07-30). "Are current archiving systems reliable enough?". International Urogynecology Journal. 26 (10): 1553. doi:10.1007/s00192-015-2805-7. ISSN 0937-3462. PMID 26224384.

External links