[< Table of Contents](README.md) | [Building ArchiveSpark (advanced) >](Building.md)
:---|---:

# Recipes / Examples

To help you get common tasks done quickly, we have prepared a few recipes that you can copy and customize.
In most cases, you only need to change the paths so that they point to your data, or swap in a different Data Specification.
The provided Data Specifications are described in [DataSpecs](DataSpecs.md).

* [Building a corpus with title + text for a selected set of URLs](../notebooks/Selected_Title-and-Text.ipynb)
* [Analyzing term / entity distributions in a dataset](../notebooks/Analyzing_Term-Distributions.ipynb)
* [Extracting hyperlinks from webpages](../notebooks/Link_Extraction.ipynb)
* [Extracting embedded resources from webpages](../notebooks/Extracting_Embeds.ipynb)
* [Loading WARC / Generating CDX (enable more efficient processing)](../notebooks/Generating_CDX.ipynb)
* [Downloading a web archive dataset as WARC/CDX from the Wayback Machine](../notebooks/Downloading_WARC_from_Wayback.ipynb)

These recipes are meant to serve as templates for your own tasks. To tailor them to your needs, feel free to combine elements from different recipes.

More application-specific examples can be found in the related projects, such as:

* Create semantic Web triples from ArchiveSpark records with [ArchiveSpark2Triples](https://github.com/helgeho/ArchiveSpark2Triples).
* Analyze medical journals at the Medical Heritage Library (MHL) with [MHLonArchiveSpark](https://github.com/helgeho/MHLonArchiveSpark).
* Analyze the temporal Web, starting from keywords issued to [Tempas](http://tempas.L3S.de/v2) (Temporal Archive Search), with [Tempas2ArchiveSpark](https://github.com/helgeho/Tempas2ArchiveSpark).

## Interoperability

We have shown that recipes can be reused across different kinds of archival datasets and data sources, e.g., web archives and digital journals.
For more information, please read (and **cite**):

[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. *Universal Distant Reading through Metadata Proxies with ArchiveSpark*. 5th IEEE International Conference on Big Data (BigData). Boston, MA, USA. December 2017.](http://cci.drexel.edu/bigdata/bigdata2017/AcceptedPapers.html) [**Get full-text PDF**](http://www.helgeholzmann.de/papers/BIGDATA_2017.pdf)
