Internet Researcher and Offline Commander - commercial off-line browsing software for Windows

Internet Researcher and Offline Commander is commercial off-line browsing software for Windows.

GNU Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
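Because Wget is non-interactive, a capture can be driven from a short script. The following is a minimal sketch, not a recommended configuration: the function names, the destination directory and the example URL are hypothetical, and the option set is only one plausible choice for a small mirroring job.

```python
import shutil
import subprocess


def build_wget_command(url, dest_dir, wait_seconds=1):
    """Return the argument list for a polite recursive Wget capture.

    The choice of options is an assumption made for illustration.
    """
    return [
        "wget",
        "--mirror",                        # recursive download with timestamping
        "--page-requisites",               # also fetch images, CSS and other page assets
        "--convert-links",                 # rewrite links so the copy can be browsed offline
        "--no-parent",                     # do not ascend above the starting directory
        f"--wait={wait_seconds}",          # pause between requests
        f"--directory-prefix={dest_dir}",  # where the captured files are written
        url,
    ]


def capture(url, dest_dir):
    """Run Wget without user interaction and return its exit code."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    return subprocess.run(build_wget_command(url, dest_dir)).returncode
```

Calling `capture("https://example.org/", "capture")` from a cron job or another script would then run the download unattended. Keeping the command construction separate from the call makes the option list easy to inspect or log before anything is fetched.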

WERA (Web ARchive Access)

WERA (Web ARchive Access) is a freely available solution for searching and navigating archived web document collections. It works like the Internet Archive's Wayback Machine except it also allows for full-text search of the web archive.

The Web Curator Tool (WCT)

The Web Curator Tool (WCT) is an open-source workflow management application for selective web archiving. It is designed for use in libraries and other collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process. It is integrated with the Heritrix web crawler and supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata.

MaximumSoft

MaximumSoft supports parsing and integrity checking for various formats, crawl scheduling and extraction of links from compressed Flash (.SWF) files.

Web Capture Tools

This page provides a list of Web Capture Tools.

The Stanford WebBase project

Warrick

Warrick is a free utility for reconstructing (or recovering) a website when a back-up is not available. Warrick will search the following web repositories for missing resources: Internet Archive, Google, Bing (formerly Live Search), and Yahoo.

Web Archiving Resources

This is a list of resources covering web archiving tools and practices.

Wayfinder

Wayback

Wayback is an open source Java implementation of the Internet Archive Wayback Machine.

WAXToolbar

WAXToolbar is a Firefox extension that helps users with common tasks encountered while surfing a web archive. The extension depends on the open source Wayback Machine. Its features include a search field for querying the Wayback Machine or for searching a full-text NutchWAX index (if one is available). You can also use the toolbar to switch between proxy mode and the regular Internet; in proxy mode you can easily move back and forth in time.

netpreserve.org

netpreserve.org is the website of the International Internet Preservation Consortium.

Tennyson Maxwell Information Systems

Tennyson Maxwell Information Systems offers a variety of features to support multithreaded retrieval, password-protected access, filtering, batch capture, and management of derived databases.

askSam

Sparkleware

Sparkleware is a commercial off-line browser.

Spadix software

Spadix Software can download websites from a starting URL, search engine results or web directories, and can follow external links. It also supports filtering and the crawling of password-protected sites.

pageVault

pageVault supports the archiving of all unique responses generated by a web server. It lets you know exactly what information you have published on your web site, whether static pages or dynamically generated content, regardless of format (HTML, XML, PDF, zip, Microsoft Office formats, images, sound) and regardless of how often the content changes.

NutchWAX

NetarchiveSuite

The NetarchiveSuite is the complete web archiving software package developed within the netarchive.dk project from 2004 onwards. The primary function of the NetarchiveSuite is to plan, schedule and run web harvests of parts of the Internet. It scales to a wide range of tasks, from small, thematic harvests (e.g. related to special events or special domains) to harvesting and archiving the content of an entire national domain.

The Nalanda iVia Focused Crawler

The Nalanda iVia Focused Crawler (NIFC) is a focused Web crawler. It was created by Dr. Soumen Chakrabarti (Indian Institute of Technology Bombay) and developed with the support of IIT Bombay, the iVia Team and the U.S. Institute of Museum and Library Services.

mod_oai

The goal of the mod_oai project is to bring the efficiency of OAI-PMH to everyday web sites.
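As a rough illustration of what an OAI-PMH request looks like, the sketch below builds a `ListRecords` URL that asks only for records changed since a given date. The verb and the `metadataPrefix` and `from` arguments come from the OAI-PMH protocol; the base URL and the function name are hypothetical, and nothing here is taken from mod_oai's own documentation.

```python
from urllib.parse import urlencode


def list_records_url(base_url, from_date=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a hypothetical OAI-PMH endpoint; from_date is an
    optional date (YYYY-MM-DD) used for incremental harvesting.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Ask only for records created or modified since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("http://www.example.org/modoai", from_date="2005-01-01")` produces a single request for everything that changed since that date, which a harvester can use instead of re-crawling every page.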

Metaproducts

Metaproducts offers several commercial capture and off-line browsing tools.

Evaluation of Open Source Spidering Technology

This is a paper from 2004.

InfoMonitor: Unobtrusively archiving a World Wide Web server

This is a paper from 2005.

Abstract

It is important to provide long-term preservation of digital data even when that data is stored in an unreliable system, such as a filesystem, a legacy database, or even the World Wide Web. In this paper we focus on the problem of archiving the contents of a web site without disrupting users who maintain the site.

HTTrack

HTTrack is a free and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.
