lustre-collector: Type mismatches with lustre structs #56

@utopiabound

This is a migration of whamcloud/lustre-collector#65

Lustre prints s64 values in places that lustre-collector parses as u64, so the parser throws an error whenever a negative value is returned.

[root@node1 ~]# lctl get_param \*.*.ldlm_canceld.stats
ldlm.services.ldlm_canceld.stats=
snapshot_time             1690228679.404172099 secs.nsecs
start_time                1690208535.652492047 secs.nsecs
elapsed_time              20143.751680052 secs.nsecs
req_waittime              96 samples [usecs] -20 43536 54561 1897709649
req_qdepth                96 samples [reqs] 0 0 0 0
req_active                96 samples [reqs] 1 2 103 117
req_timeout               96 samples [secs] 15 15 1440 21600
reqbuf_avail              199 samples [bufs] 63 64 12688 809008
ldlm_cancel               96 samples [usecs] 5 235 3891 285769

The stats agent then fails while parsing this output:

Jul 24 19:57:39 node1 emf-stats-agent[667612]:  INFO emf_stats_agent: Stats collection is enabled
Jul 24 19:57:39 node1 emf-stats-agent[667612]: Error: LustreCollectorError(CombineEasyError(Errors { position: 11397, errors: [Unexpected(Token('-')), Expected(Static("whitespace")), Expected(Static("digit")), Message(Static("While parsing ldlm_canceld.stats"))] }))
Jul 24 19:57:39 node1 systemd[1]: emf-stats-agent.service: Main process exited, code=exited, status=1/FAILURE
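
For illustration, here is a minimal sketch of parsing one `samples` line with signed fields. It uses only the standard library and is not the collector's combine-based parser. The `StatLine` struct, its field names, and `parse_stat_line` are hypothetical, and the choice of which fields to treat as signed (min, max, sum) is an assumption; only the negative min is confirmed by the output above.

```rust
/// One "samples" line from a Lustre stats file.
/// Hypothetical struct: names and field types are assumptions for this sketch.
#[derive(Debug, PartialEq)]
struct StatLine {
    name: String,
    samples: u64,
    units: String,
    min: i64, // seen as negative in the output above (e.g. -20)
    max: i64,
    sum: i64,
    sumsq: u64,
}

/// Parse a line of the form `name N samples [units] min max sum sumsq`.
/// Returns None for lines that do not match (e.g. snapshot_time).
fn parse_stat_line(line: &str) -> Option<StatLine> {
    let mut it = line.split_whitespace();
    let name = it.next()?.to_string();
    let samples = it.next()?.parse().ok()?;
    if it.next()? != "samples" {
        return None;
    }
    let units = it
        .next()?
        .trim_start_matches('[')
        .trim_end_matches(']')
        .to_string();
    // Parsing as i64 accepts a leading '-', which a u64 parse rejects.
    let min = it.next()?.parse().ok()?;
    let max = it.next()?.parse().ok()?;
    let sum = it.next()?.parse().ok()?;
    let sumsq = it.next()?.parse().ok()?;
    Some(StatLine { name, samples, units, min, max, sum, sumsq })
}

fn main() {
    let line = "req_waittime              96 samples [usecs] -20 43536 54561 1897709649";
    // The same token fails as u64, which mirrors the reported parse error.
    println!("as u64: {:?}", "-20".parse::<u64>());
    println!("as i64: {:?}", parse_stat_line(line));
}
```

Whether the collector should store the negative value or clamp it is a separate design decision; the sketch only shows that a signed type lets the line parse.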

This has been seen on a live system:

ldlm.services.ldlm_canceld.stats=
snapshot_time             1714662722.986857642 secs.nsecs
req_waittime              101358239600 samples [usecs] -36 1855805 5530720965329 21563670935443407
req_qdepth                101358239600 samples [reqs] 0 1164 1893537183 7828947261
req_active                101358239600 samples [reqs] 1 23 152657095033 281200017837
req_timeout               101358239600 samples [secs] 1 218 6892805470378 468801155450006
reqbuf_avail              210996467581 samples [bufs] 0 155 13398749465845 850967459690601
ldlm_cancel               101358239600 samples [usecs] 1 211241571 1436530054018 108103493490971286

Related Lustre Ticket: LU-9683
Related Lustre Ticket: LU-17853

The underlying issue is probably the use of ktime_get_real() in ptlrpc. It returns wall-clock time, which can move backwards due to leap seconds and NTP updates, so a computed wait time can come out negative.
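
As a userspace analogy only (this is not Lustre kernel code), the sketch below shows how an interval computed from two wall-clock timestamps becomes negative when the clock is stepped back between them. The helper `signed_elapsed_usecs` and the timestamps are hypothetical; the backwards step is simulated by subtracting from the start time.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Elapsed microseconds between two wall-clock readings, as a signed value.
/// Hypothetical helper: SystemTime::duration_since returns Err when `end`
/// is earlier than `start`, and the error carries the size of the gap.
fn signed_elapsed_usecs(start: SystemTime, end: SystemTime) -> i64 {
    match end.duration_since(start) {
        Ok(d) => d.as_micros() as i64,
        Err(e) => -(e.duration().as_micros() as i64),
    }
}

fn main() {
    // Request arrival time taken from the wall clock (made-up value).
    let arrival = UNIX_EPOCH + Duration::from_secs(1_690_228_679);
    // Simulate the wall clock being stepped back so that the handling
    // timestamp lands 20 microseconds before the arrival timestamp.
    let handled = arrival - Duration::from_micros(20);
    println!("wait time: {} usecs", signed_elapsed_usecs(arrival, handled));
}
```

A monotonic clock (std::time::Instant in Rust) does not go backwards, which is why intervals are normally measured with one.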
