Internet Researcher and Offline Commander

Internet Researcher and Offline Commander is commercial off-line browsing software for Windows.

GNU Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
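Because Wget is non-interactive, a capture can be scripted. The following is a minimal sketch in Python of how a script might assemble a Wget call for mirroring a site. The function name build_wget_command, the example URL, and the destination directory are illustrative assumptions; the options shown are standard Wget flags, but the right combination depends on the site being captured.

```python
import shlex


def build_wget_command(url, dest_dir, wait_seconds=1):
    """Return an argument list for a polite, recursive Wget capture.

    Hypothetical helper for illustration; adjust the flags to the site.
    """
    return [
        "wget",
        "--mirror",                        # recursive download with timestamping
        "--page-requisites",               # also fetch images, CSS, and scripts
        "--convert-links",                 # rewrite links for offline browsing
        "--no-parent",                     # stay below the starting directory
        f"--wait={wait_seconds}",          # pause between requests
        f"--directory-prefix={dest_dir}",  # where the copy is written
        url,
    ]


if __name__ == "__main__":
    cmd = build_wget_command("http://www.example.org/", "capture")
    # Print the command instead of running it; a script or cron job
    # could pass the same list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when URLs contain characters such as & or ?.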

WERA (Web ARchive Access)

WERA (Web ARchive Access) is a freely available solution for searching and navigating archived web document collections. It works like the Internet Archive's Wayback Machine except it also allows for full-text search of the web archive.

The Web Curator Tool (WCT)

The Web Curator Tool (WCT) is an open-source workflow management application for selective web archiving. It is designed for use in libraries and other collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process. It is integrated with the Heritrix web crawler and supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata.

MaximumSoft

MaximumSoft supports parsing and integrity checking for various formats, crawl scheduling and extraction of links from compressed Flash (.SWF) files.

Web Capture Tools

This page provides a list of Web Capture Tools.

The Stanford WebBase project

Warrick

Warrick is a free utility for reconstructing (or recovering) a website when a backup is not available. Warrick searches the following web repositories for missing resources: Internet Archive, Google, Bing (formerly Live Search), and Yahoo.

Web Archiving Resources

This is a list of resources covering web archiving tools and practices.

Wayfinder

Wayback

Wayback is an open-source Java implementation of the Internet Archive Wayback Machine.

WAXToolbar

WAXToolbar is a Firefox extension that helps users with common tasks encountered when surfing a web archive. The extension depends on the open-source Wayback Machine. Among its features is a search field for querying the Wayback Machine or for searching a full-text NutchWAX index (if one is available). You can also use the toolbar to switch between proxy mode and the regular Internet; in proxy mode you can easily go back and forth in time.

netpreserve.org

netpreserve.org is the website of the International Internet Preservation Consortium.

Tennyson Maxwell Information Systems

Tennyson Maxwell Information Systems offers a variety of features to support multithreaded retrieval, password-protected access, filtering, batch capture, and management of derived databases.

askSam

Sparkleware

Sparkleware is a commercial off-line browser.

Spadix software

Spadix Software can download websites from a starting URL, search engine results or web directories, and is able to follow external links. It also supports filtering and crawling of password-protected sites.

pageVault

pageVault supports the archiving of all unique responses generated by a web server. It allows you to know exactly what information you have published on your web site, whether static pages or dynamically generated content, regardless of format (HTML, XML, PDF, zip, Microsoft Office formats, images, sound) and regardless of rate of change.
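The idea of keeping only unique responses can be illustrated with a small sketch. This is not pageVault's implementation; it is a minimal Python example that assumes responses are compared by a SHA-256 digest of their bodies, and the class name UniqueResponseArchive is hypothetical.

```python
import hashlib


class UniqueResponseArchive:
    """Toy in-memory archive that stores each distinct response body once.

    Illustration only: a real system would also need persistent storage
    and would have to decide which response headers count as "content".
    """

    def __init__(self):
        self._bodies = {}  # digest -> response body
        self._log = []     # (url, digest) for every response seen

    def record(self, url, body):
        """Log a response; return True if its body was not stored before."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self._bodies
        if is_new:
            self._bodies[digest] = body
        self._log.append((url, digest))
        return is_new

    def stored_count(self):
        """Number of distinct response bodies held in the archive."""
        return len(self._bodies)


if __name__ == "__main__":
    archive = UniqueResponseArchive()
    archive.record("/index.html", b"<h1>Hello</h1>")
    archive.record("/index.html", b"<h1>Hello</h1>")        # repeat, not stored again
    archive.record("/index.html", b"<h1>Hello again</h1>")  # changed, stored
    print(archive.stored_count())
```

Keying on a content digest means that a page served thousands of times costs one stored copy, while every change to the page produces a new archived version.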

NutchWAX

NetarchiveSuite

The NetarchiveSuite is the complete web archiving software package developed within the netarchive.dk project from 2004 onwards. The primary function of the NetarchiveSuite is to plan, schedule and run web harvests of parts of the Internet. It scales to a wide range of tasks, from small, thematic harvests (e.g. related to special events, or special domains) to harvesting and archiving the content of an entire national domain.

The Nalanda iVia Focused Crawler

The Nalanda iVia Focused Crawler (NIFC) is a focused Web crawler. It was created by Dr. Soumen Chakrabarti (Indian Institute of Technology Bombay) and developed with the support of IIT Bombay, the iVia Team and the U.S. Institute of Museum and Library Services.

mod_oai

The goal of the mod_oai project is to bring the efficiency of OAI-PMH to everyday web sites.
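As a rough illustration of what an OAI-PMH request looks like, the sketch below builds a ListIdentifiers request URL in Python. The verb, metadataPrefix, and from parameters are defined by the OAI-PMH protocol; the base URL http://www.example.org/modoai and the function name are assumptions made for the example, not details taken from the mod_oai project.

```python
from urllib.parse import urlencode


def oai_list_identifiers_url(base_url, from_date=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListIdentifiers request URL.

    from_date (YYYY-MM-DD) limits the response to items changed on or
    after that date, which is what allows incremental harvesting.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical base URL; a real server publishes its own.
    print(oai_list_identifiers_url("http://www.example.org/modoai", "2005-01-01"))
```

A harvester that remembers the date of its last visit can pass it as from_date and receive only what has changed, instead of re-crawling the whole site.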

Metaproducts

Metaproducts offers several commercial capture and off-line browsing tools.

Evaluation of Open Source Spidering Technology

This is a paper from 2004.

InfoMonitor: Unobtrusively archiving a World Wide Web server

This is a paper from 2005.

Abstract

It is important to provide long-term preservation of digital data even when that data is stored in an unreliable system, such as a filesystem, a legacy database, or even the World Wide Web. In this paper we focus on the problem of archiving the contents of a web site without disrupting users who maintain the site.

HTTrack

HTTrack is a free and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.
