Background and strategic fit
Because we also need to include datasets from other catalogues, we need to use the harvesting function. This extension (ckanext-harvest) provides harvesting from catalogues of several types, including CKAN and OGC CSW.
Installation and Requirements
This extension requires CKAN v2.0 or later on both the CKAN it is installed into and the CKANs it harvests.
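The version requirement above can be checked before setting up a harvest source. The following is a minimal sketch, not part of ckanext-harvest: the helper names (`parse_major_minor`, `meets_minimum`, `remote_ckan_version`) are hypothetical, and it assumes the remote site exposes the standard `status_show` API action with a `ckan_version` field in its result.

```python
import json
import re
import urllib.request

MINIMUM_VERSION = (2, 0)  # taken from the requirement stated above


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '2.9.5' or '2.10.0b0'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised CKAN version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=MINIMUM_VERSION):
    """True if version >= minimum, compared numerically (so 2.10 is newer than 2.9)."""
    return parse_major_minor(version) >= minimum


def remote_ckan_version(base_url, timeout=10):
    """Ask a CKAN site for its version (assumes status_show is reachable; needs network)."""
    url = base_url.rstrip("/") + "/api/3/action/status_show"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["result"]["ckan_version"]


if __name__ == "__main__":
    # Offline examples only; call remote_ckan_version() with a real site URL to test a catalogue.
    for candidate in ("1.8", "2.0", "2.10.1"):
        print(candidate, meets_minimum(candidate))
```

To check a catalogue you intend to harvest, pass its base URL to `remote_ckan_version` and feed the result to `meets_minimum`.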
The harvest extension can use two different backends. You can choose whichever you prefer depending on your needs, but Redis has been found to be more stable and reliable so it is the recommended one:
Redis (recommended): To install it, run:
sudo apt-get update
sudo apt-get install redis-server
In your CKAN configuration file, add in the [app:main] section:
ckan.harvest.mq.type = redis
RabbitMQ: To install it, run:
sudo apt-get update
sudo apt-get install rabbitmq-server
In your CKAN configuration file, add in the [app:main] section:
ckan.harvest.mq.type = amqp
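Before continuing, it can help to confirm that the chosen backend is actually listening. The sketch below only tests whether a TCP connection can be opened; it does not speak the Redis or AMQP protocol. The function name `broker_is_reachable` is hypothetical, and the ports assume an unmodified default install (6379 for Redis, 5672 for RabbitMQ).

```python
import socket


def broker_is_reachable(host="localhost", port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed default ports for a local install of each backend.
    for name, port in (("redis", 6379), ("rabbitmq", 5672)):
        print(name, "reachable:", broker_is_reachable(port=port))
```

A `False` result for the backend you installed usually means the service is not running or is bound to a different host or port.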
Activate your CKAN virtual environment, for example:
. /usr/lib/ckan/default/bin/activate
Install the ckanext-harvest Python package into your virtual environment:
pip install -e git+https://github.com/ckan/ckanext-harvest.git#egg=ckanext-harvest
Install the Python modules required by the extension (adjusting the path according to where ckanext-harvest was installed in the previous step):
(pyenv) $ cd /usr/lib/ckan/default/src/ckanext-harvest/
(pyenv) $ pip install -r pip-requirements.txt
Add the relevant plugins to the ckan.plugins setting in your CKAN config file:
ckan.plugins = harvest ckan_harvester
If you did not already do so in the first step, define the backend that you are using with the ckan.harvest.mq.type option in the [app:main] section (it defaults to amqp):
ckan.harvest.mq.type = redis
There are a number of configuration options available for the backends. These do not need to be modified if you are using the default Redis or RabbitMQ install (step 1). However, you may wish to add them with custom values to the [app:main] section of the CKAN config file. The list below shows the available options and their default values:
Redis:
- ckan.harvest.mq.hostname (localhost)
- ckan.harvest.mq.port (6379)
- ckan.harvest.mq.redis_db (0)
- ckan.harvest.mq.password (None)
RabbitMQ:
- ckan.harvest.mq.user_id (guest)
- ckan.harvest.mq.password (guest)
- ckan.harvest.mq.hostname (localhost)
- ckan.harvest.mq.port (5672)
- ckan.harvest.mq.virtual_host (/)
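To see how these options combine with their defaults, the following sketch reads the [app:main] section of a config file and fills in any option that is missing. It is a standalone illustration using Python's `configparser`, not how ckanext-harvest itself loads its settings; the function name `harvest_backend_settings` and the sample config are hypothetical, and the default values are copied from the list above.

```python
import configparser

PREFIX = "ckan.harvest.mq."

# Defaults copied from the option list above; None means "no password set".
REDIS_DEFAULTS = {
    "hostname": "localhost",
    "port": "6379",
    "redis_db": "0",
    "password": None,
}
RABBITMQ_DEFAULTS = {
    "user_id": "guest",
    "password": "guest",
    "hostname": "localhost",
    "port": "5672",
    "virtual_host": "/",
}


def harvest_backend_settings(ini_text):
    """Return (backend_type, settings) resolved from the [app:main] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["app:main"]
    backend = section.get(PREFIX + "type", "amqp")  # amqp is the documented default
    if backend == "redis":
        defaults = REDIS_DEFAULTS
    elif backend == "amqp":
        defaults = RABBITMQ_DEFAULTS
    else:
        raise ValueError(f"unknown harvest backend: {backend!r}")
    # Explicit options win; anything not set falls back to the default.
    settings = {name: section.get(PREFIX + name, default)
                for name, default in defaults.items()}
    return backend, settings


# Hypothetical config: Redis on a non-default port, everything else left at defaults.
SAMPLE = """
[app:main]
ckan.plugins = harvest ckan_harvester
ckan.harvest.mq.type = redis
ckan.harvest.mq.port = 6380
"""

if __name__ == "__main__":
    print(harvest_backend_settings(SAMPLE))
```

All values are returned as strings, as they appear in the config file; convert the port and database number to integers before passing them to a client library.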
Note: it is safe to use the same backend server (either Redis or RabbitMQ) for different CKAN instances, as long as they have different site ids. The ckan.site_id config option (or default) will be used to namespace the relevant things:
- On RabbitMQ it will be used to name the queues used, e.g. ckan.harvest.site1.gather and ckan.harvest.site1.fetch.
- On Redis, it will namespace the keys used, so only the relevant instance gets them, e.g. site1:harvest_job_id, site1:harvest_object__id:804f114a-8f68-4e7c-b124-3eb00f66202f
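The namespacing described in the note above can be illustrated with a small sketch. The name patterns are taken from the examples in the note; the helper functions `rabbitmq_queue_names` and `redis_key` are hypothetical and exist only to show why two instances with different site ids do not collide on a shared backend.

```python
def rabbitmq_queue_names(site_id="default"):
    """Queue names for one CKAN instance, following the pattern shown in the note."""
    return {
        "gather": f"ckan.harvest.{site_id}.gather",
        "fetch": f"ckan.harvest.{site_id}.fetch",
    }


def redis_key(site_id, name):
    """Prefix a Redis key with the site id so only that instance picks it up."""
    return f"{site_id}:{name}"


if __name__ == "__main__":
    # Two instances sharing one backend server end up with disjoint names.
    for site in ("site1", "site2"):
        print(rabbitmq_queue_names(site), redis_key(site, "harvest_job_id"))
```

If two instances were configured with the same ckan.site_id, these names would be identical and the instances would consume each other's jobs, which is why distinct site ids are required.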
Restart CKAN. For example, if you've deployed CKAN with Apache on Ubuntu:
sudo service apache2 reload
User interaction and design
For further configuration and running automatic jobs refer to the official page:
https://github.com/ckan/ckanext-harvest#configuration
Questions and Answers
Below is a list of questions and answers from users:
Question | Answers |
---|---|
Further steps
- https://docs.ckan.org/projects/ckanext-spatial/en/latest/harvesters.html
- CSW configuration for GDI (example: https://doc-ckan.readthedocs.io/en/c029/admin/harvesting.html)
- CQL (example: https://github.com/ckan/ckanext-spatial/issues/55)
- Writing a harvester (explanation: https://github.com/ckan/ckanext-harvest/issues/292)