galaxyproject/usage-metering
Scripts and queries for inspecting Galaxy usage data.

Environment and data setup

Before getting started, we'll get hold of a subset of Galaxy data so it is easier and faster to work with. This setup only needs to be done once for a given set of data; after that, we can start querying it. Against Galaxy Main, these extraction queries tend to run for about half an hour each.

Take a look at the collect_job_data.sh script to automate this process.

Extracting job data from a Galaxy database

Put the following Job table query in a file called job_query.sql, adjusting the dates as desired:

COPY (
SELECT
  id, create_time, tool_id, tool_version, user_id, destination_id
FROM
  job
WHERE
  create_time >= '2021-08-01'
  AND create_time < '2022-09-01') TO STDOUT;

Run the query from the command shell (where galaxy_main is the name of the database):

psql galaxy_main < job_query.sql > job_table.txt

Next, put the following Metrics table query in a file called metrics_query.sql, matching the dates to the ones used in the job query above:

COPY (
SELECT
    job.id,
    job.destination_id,
    job_metric_numeric.metric_name,
    job_metric_numeric.metric_value
FROM
    job
INNER JOIN job_metric_numeric
    ON job_metric_numeric.job_id = job.id
WHERE
    job.create_time >= '2021-08-01' AND
    job.create_time < '2022-09-01' AND
    (job_metric_numeric.metric_name = 'galaxy_slots' OR
     job_metric_numeric.metric_name = 'memory.max_usage_in_bytes' OR
     job_metric_numeric.metric_name = 'galaxy_memory_mb' OR
     job_metric_numeric.metric_name = 'cpuacct.usage' OR
     job_metric_numeric.metric_name = 'runtime_seconds')) TO STDOUT;

Run the query from the command shell (where galaxy_main is the name of the database):

psql galaxy_main < metrics_query.sql > metrics_table.txt
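Both exports are plain tab-separated dumps, so a quick line count is a cheap sanity check that they completed (exact counts depend on the chosen date range):

wc -l job_table.txt metrics_table.txt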

Importing data into a local database

Once we have the data, we can import it into a local database for easier querying. To begin, start a Postgres server using a Docker container, mapping a volume to a host path (replace <pwd> with an absolute path):

docker run -d --rm --name main-data-postgres \
    -e POSTGRES_PASSWORD=changeThis \
    -e PGDATA=/var/lib/postgresql/data/pg \
    -v <pwd>/data/db-files/:/var/lib/postgresql/data \
    --mount type=tmpfs,destination=/var/lib/postgresql/data/pg_stat_tmp \
    -p 5432:5432 postgres
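Before continuing, it is worth confirming that the server came up; pg_isready ships with the Postgres image and should report that the server is accepting connections:

docker exec main-data-postgres pg_isready -U postgres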

Place the data files (job_table.txt and metrics_table.txt) into ./data/db-files/ so they become available inside the PostgreSQL container, and copy sql/db-files/tables.sql, which defines the structure of the tables for our local database, into the same folder. Note that the data/db-files/ folder already contains a couple of sample files that can be used for testing and developing queries.
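For example, assuming the exported files sit in the current directory alongside a checkout of this repository, staging everything might look like:

mkdir -p data/db-files
cp job_table.txt metrics_table.txt data/db-files/
cp sql/db-files/tables.sql data/db-files/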

Exec into the container and change into the data dir:

docker exec -it main-data-postgres bash
cd /var/lib/postgresql/data/

Next, we'll create the database and necessary tables:

createdb -U postgres -O postgres galaxy
psql -U postgres galaxy < tables.sql
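For reference, here is a minimal sketch of what tables.sql defines; the column names come from the extraction queries above, while the column types are assumptions (the file in the repository is authoritative):

CREATE TABLE job (
    id integer,
    create_time timestamp,
    tool_id text,
    tool_version text,
    user_id integer,
    destination_id text
);

CREATE TABLE job_metric_numeric (
    job_id integer,
    destination_id text,
    metric_name text,
    metric_value numeric
);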

Lastly, we import the data (this process will take about 10 minutes):

psql -U postgres -c "\copy job from 'job_table.txt';" galaxy
psql -U postgres -c "\copy job_metric_numeric from 'metrics_table.txt';" galaxy
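As an optional sanity check, the imported row counts should match the line counts of the source files:

psql -U postgres -c "SELECT count(*) FROM job;" galaxy
psql -U postgres -c "SELECT count(*) FROM job_metric_numeric;" galaxy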

Once the data is imported, refer to the queries in the sql folder. Access the database with:

psql -U postgres galaxy
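As an illustrative example (not one of the curated queries in the sql folder), the imported schema supports ad-hoc questions such as listing the ten most-used tools over the extracted period:

SELECT tool_id, count(*) AS job_count
FROM job
GROUP BY tool_id
ORDER BY job_count DESC
LIMIT 10;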

Container management

To stop the container, run:

docker stop main-data-postgres

To start the container again, reusing the already existing database, run the same docker run command from above. To start with a clean database, remove the ./data/db-files/pg* directories and start a new container.
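For example, a full reset might look like this:

docker stop main-data-postgres
rm -rf ./data/db-files/pg*
# re-run the docker run command from above to initialize a fresh database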
