Target release: 1.3.1
Source (e.g. GitHub): https://github.com/ckan/ckanext-harvest
Main features: This extension provides a common harvesting framework for CKAN extensions and adds a command-line interface (CLI) and a web user interface (WUI) to CKAN to manage harvesting sources and jobs.
Prerequisites / Dependencies: CKAN v2.0 or later; Redis or RabbitMQ
License: GNU Affero General Public License (AGPL) v3.0
Installed by:
Document status: COMPLETED

Background and strategic fit

Because we also need to include datasets from other catalogues, we are required to use the harvesting function. This extension provides harvesting of catalogues of several types: CKAN, OGC CSW, etc.

Installation and Requirements

This extension requires CKAN v2.0 or later, both on the CKAN instance it is installed into and on the CKAN instances it harvests.
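
The version requirement can be checked before a remote catalogue is added as a source. The following is a minimal sketch and not part of ckanext-harvest: it assumes the remote site exposes CKAN's `status_show` API action, which reports a `ckan_version` field, and the function names are hypothetical.

```python
import json
import re
from urllib.request import urlopen

MIN_VERSION = (2, 0)  # minimum CKAN version stated in this section


def parse_version(version):
    """Return (major, minor) from a string such as '2.9.5' or '2.10.0b'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised CKAN version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=MIN_VERSION):
    """True if the version string is at least the minimum (major, minor)."""
    return parse_version(version) >= minimum


def remote_ckan_version(base_url, timeout=10):
    """Ask a CKAN site for its version via the status_show action.

    Assumes the site answers at <base_url>/api/3/action/status_show.
    """
    url = base_url.rstrip("/") + "/api/3/action/status_show"
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["result"]["ckan_version"]
```

For a catalogue to be harvested, `meets_minimum(remote_ckan_version(url))` should return `True`, where `url` is the base address of that catalogue. Versions are compared as integer tuples rather than strings so that 2.10 is correctly treated as newer than 2.9.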

  1. The harvest extension can use two different backends. Choose whichever suits your needs; Redis has been found to be more stable and reliable, so it is the recommended one:

    • Redis (recommended): To install it, run:

      sudo apt-get update
      sudo apt-get install redis-server

      In your CKAN configuration file, add the following to the [app:main] section:

      ckan.harvest.mq.type = redis
    • RabbitMQ: To install it, run:

      sudo apt-get update
      sudo apt-get install rabbitmq-server

      In your CKAN configuration file, add the following to the [app:main] section:

      ckan.harvest.mq.type = amqp
  2. Activate your CKAN virtual environment, for example:

    . /usr/lib/ckan/default/bin/activate
  3. Install the ckanext-harvest Python package into your virtual environment:

    pip install -e git+https://github.com/ckan/ckanext-harvest.git#egg=ckanext-harvest
  4. Install the Python modules required by the extension (adjust the path according to where ckanext-harvest was installed in the previous step):

    (pyenv) $ cd /usr/lib/ckan/default/src/ckanext-harvest/
    (pyenv) $ pip install -r pip-requirements.txt
  5. Add the harvest plugins to the ckan.plugins setting in your CKAN config file, alongside any plugins already listed there:

    ckan.plugins = harvest ckan_harvester
  6. If you did not already do so in step 1, define the backend that you are using with the ckan.harvest.mq.type option in the [app:main] section (it defaults to amqp):

    ckan.harvest.mq.type = redis

    There are a number of configuration options available for the backends. They do not need to be modified if you are using the default Redis or RabbitMQ install (step 1). However, you may wish to add them with custom values to the [app:main] section of the CKAN config file. The list below shows the available options and their default values:

    • Redis:

      ckan.harvest.mq.hostname (localhost)
      ckan.harvest.mq.port (6379)
      ckan.harvest.mq.redis_db (0)
      ckan.harvest.mq.password (None)
    • RabbitMQ:

      ckan.harvest.mq.user_id (guest)
      ckan.harvest.mq.password (guest)
      ckan.harvest.mq.hostname (localhost)
      ckan.harvest.mq.port (5672)
      ckan.harvest.mq.virtual_host (/)

    Note: it is safe to use the same backend server (either Redis or RabbitMQ) for different CKAN instances, as long as they have different site ids. The ckan.site_id config option (or default) will be used to namespace the relevant things:

    • On RabbitMQ, it will be used to name the queues, e.g. ckan.harvest.site1.gather and ckan.harvest.site1.fetch.
    • On Redis, it will namespace the keys used, so only the relevant instance gets them, e.g. site1:harvest_job_id, site1:harvest_object__id:804f114a-8f68-4e7c-b124-3eb00f66202f
  7. Restart CKAN. For example, if you've deployed CKAN with Apache on Ubuntu:

    sudo service apache2 reload
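
The settings from steps 5 and 6 can be sanity-checked before restarting. The following is an illustrative sketch and not part of ckanext-harvest: the function names are hypothetical, the defaults are copied from the option lists in step 6, and the queue and key patterns follow the examples in the note above. The fallback site id `default` is an assumption.

```python
from configparser import ConfigParser

# Defaults copied from the option lists in step 6 (kept as strings, as in the ini file).
BACKEND_DEFAULTS = {
    "redis": {"hostname": "localhost", "port": "6379",
              "redis_db": "0", "password": None},
    "amqp": {"user_id": "guest", "password": "guest", "hostname": "localhost",
             "port": "5672", "virtual_host": "/"},
}


def read_app_main(path):
    """Return the [app:main] section of a CKAN config file as a dict."""
    # Interpolation is disabled so that '%' in logging formats is left alone.
    parser = ConfigParser(interpolation=None)
    with open(path) as handle:
        parser.read_file(handle)
    return dict(parser["app:main"])


def effective_backend(settings):
    """Merge ckan.harvest.mq.* settings over the documented defaults."""
    backend = settings.get("ckan.harvest.mq.type", "amqp")  # documented default
    if backend not in BACKEND_DEFAULTS:
        raise ValueError("unknown ckan.harvest.mq.type: %r" % backend)
    merged = {"type": backend}
    for option, default in BACKEND_DEFAULTS[backend].items():
        merged[option] = settings.get("ckan.harvest.mq." + option, default)
    return merged


def namespaced_names(settings):
    """Show the names derived from ckan.site_id, following the note in step 6."""
    site_id = settings.get("ckan.site_id", "default")  # assumed fallback value
    return {
        "amqp_queues": ["ckan.harvest.%s.gather" % site_id,
                        "ckan.harvest.%s.fetch" % site_id],
        "redis_key_prefix": site_id + ":",
    }


def harvest_plugins_enabled(settings):
    """True if both plugins from step 5 appear in ckan.plugins."""
    plugins = settings.get("ckan.plugins", "").split()
    return "harvest" in plugins and "ckan_harvester" in plugins
```

Passing the result of `read_app_main()` for your CKAN config file to the other three functions shows which backend and connection values the settings resolve to, and which queue names or key prefix this instance would use on a shared Redis or RabbitMQ server.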

User interaction and design

For further configuration options and for running harvest jobs automatically, refer to the official page:
https://github.com/ckan/ckanext-harvest#configuration

Questions and Answers

Below is a list of questions and answers from users:

Question | Answer

Further steps