FS#27141 — Storage Strasburg

Assigned to project: VPS
Emergency incident
Storage
CLOSED
100%
As a result of the problem described at http://prace.ovh.pl/?do=details&id=27123, 2 PGs are inactive again. We are working on resolving the problem.

Date: Monday, 05 December 2016, 08:24
Reason for closing: Done
Comment from OVH - Thursday, 01 December 2016, 13:12

All PGs are active.


Comment from OVH - Friday, 02 December 2016, 09:30

The data is now placed correctly.


Comment from OVH - Monday, 05 December 2016, 08:24

02.12.2016, 18:34
Here is what has been done during the last few days, the root causes, and how we will improve the situation.
Root causes:
Hosts hanging under load, with the CPU at 100% in sys, due to a kernel bug.
Ceph monitors overloaded because they shared CPUs with OSDs and ran on a slow disk, which caused slow peering of PGs and frequent monitor elections.
A Ceph OSD bug in the Hammer version which caused OSDs to fail to start after being stopped (4 OSDs impacted) - http://tracker.ceph.com/issues/17916
A Ceph OSD crash due to a failure to remove an unfound object (1 OSD impacted) - http://tracker.ceph.com/issues/10405
A Ceph PG recovery priority issue which handled recovery_wait PGs (5k+) before the PGs which were inactive (2); see the sketch below.
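
The imbalance in the last root cause shows up directly in the cluster's PG state breakdown. The following minimal Python sketch is an illustration added to this summary, not part of the OVH report: it prints that breakdown from the JSON output of ceph status, and the pgmap / pgs_by_state field names are assumed and may differ between Ceph releases.

    #!/usr/bin/env python
    # Sketch: count placement groups by state from "ceph status" JSON output.
    # Assumes pgmap contains a pgs_by_state list with state_name/count fields;
    # exact field names may vary between Ceph releases.
    import json
    import subprocess

    status = json.loads(subprocess.check_output(
        ["ceph", "status", "--format", "json"]))

    for entry in status["pgmap"].get("pgs_by_state", []):
        print("%8d  %s" % (entry["count"], entry["state_name"]))

In the situation described above it would have shown the 5k+ recovery_wait PGs next to the 2 inactive ones.
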
Actions (times in CET):
2016-11-29 ~22:00: planned intervention to grow the cluster.
2016-11-29 23:00~04:00: 2-3 hosts had a huge load and the CPU was at 100% in sys (cause unknown). After a restart everything continued.
2016-11-30 ~04:40: a host hung and we restarted it.
2016-11-30 ~05:00: an OSD stopped and failed to peer with the other OSDs, causing blocked requests on its PGs.
2016-11-30 ~08:00: restarted all OSDs rack by rack; 2 OSDs failed to start.
2016-11-30 ~09:00: we stopped the same OSD, which resolved the PG blocking.
2016-11-30 10:48: all PGs are active, recovery/rebalance ongoing.
2016-11-30 15:00: rebalance almost stopped, investigating.
2016-11-30 ~22:00: we forced the PG to peer by marking its OSD down (see the sketch after this timeline).
2016-11-30 ~23:30: rebalance is progressing nicely.
2016-12-01 01:16: 1-2 inactive PGs left (the count changes every few minutes).
2016-12-01 ~04:00: host crash, with another OSD not being able to start.
2016-12-01 ~12:00: all PGs are active again, none stuck in the peered state any more.
2016-12-01 15:30: one monitor is still down; it is not able to sync with the two others (syncing puts too big a load on those two).
2016-12-01 ~18:00-20:00: monitors hold elections every few seconds because of a monitor whose disk is overloaded.
2016-12-01 ~22:00: added a monitor with mon_sync_timeout increased to 3000, which allowed it to sync and join the quorum (~5 minutes of quorum downtime in the last sync phase). Monitor load is now OK and the monitors no longer hold elections. Recovery continues.
2016-12-01 ~23:30: deleted the 2 OSD containers which failed to restart; afterwards all OSDs are up and in.
2016-12-02 01:18: the cluster stopped rebalancing with 9K objects misplaced, due to 2 unfound objects stalling recovery of 2 PGs.
2016-12-02 11:18: both PGs fixed, the cluster is rebalancing.
2016-12-02 13:30: cluster fully rebalanced, all OK.
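
The "force PG to peer" step above (2016-11-30 ~22:00) amounts to finding the acting primary OSD of the stuck PG and marking it down, so that the OSD immediately re-asserts itself and the PG goes through peering again. Below is a hedged Python sketch of that single intervention, written for illustration; the example PG id and the JSON field names returned by ceph pg map are assumptions, not values from the report.

    #!/usr/bin/env python
    # Sketch: force one stuck PG to re-peer by marking its acting primary down.
    # Marking an OSD down does not stop the daemon; the OSD re-asserts itself
    # to the monitors, which triggers a new peering round for its PGs.
    import json
    import subprocess
    import sys

    pg_id = sys.argv[1] if len(sys.argv) > 1 else "3.1f"  # hypothetical PG id

    # "ceph pg map" reports the up/acting OSD sets of a PG; the "acting" field
    # name is assumed from typical JSON output.
    pg_map = json.loads(subprocess.check_output(
        ["ceph", "pg", "map", pg_id, "--format", "json"]))
    primary = pg_map["acting"][0]

    subprocess.check_call(["ceph", "osd", "down", str(primary)])
    print("marked osd.%d down to re-peer pg %s" % (primary, pg_id))
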
Quick actions taken to solve the issue:
Created a monitor on an empty host with an optimized XFS filesystem for better performance; after it joined the quorum, monitor load decreased significantly.
Added 2 temporary OSDs with the PGs from a crashed OSD to recover data.
Added 2 temporary OSDs with the same IDs as the crashed OSDs to unblock recovery.
Manually increased osd_max_backfills on OSDs with inactive+peered PGs to make recovery faster.
Wrote a script to sequentially mark OSDs with undersized PGs down (ceph osd down) to unblock recovery on them; a sketch follows this list.
Increased osd_recovery_max_active to 60 to speed up recovery of recovery_wait PGs.
Increased osd_max_backfills to 2 (temporarily) to speed up backfilling.
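
The last three quick actions lend themselves to one small operator script. The sketch below is an illustration written for this summary, not the script OVH actually ran: it lists stuck undersized PGs, marks each acting primary down one at a time with a pause in between, and then injects the recovery throttles mentioned above. The dump_stuck JSON field names and the 30-second pause are assumptions.

    #!/usr/bin/env python
    # Sketch: sequentially mark the acting primaries of stuck undersized PGs
    # down to unblock recovery, then raise the recovery throttles.
    # Field names ("pgid", "acting") are assumed from typical dump_stuck output.
    import json
    import subprocess
    import time

    stuck = json.loads(subprocess.check_output(
        ["ceph", "pg", "dump_stuck", "undersized", "--format", "json"]))

    primaries = []
    for pg in stuck:
        acting = pg.get("acting", [])
        if acting and acting[0] not in primaries:
            primaries.append(acting[0])

    # One OSD at a time, so only a few PGs re-peer simultaneously.
    for osd in primaries:
        subprocess.check_call(["ceph", "osd", "down", str(osd)])
        time.sleep(30)  # arbitrary pause; tune to the cluster's peering speed

    # Temporarily raise the recovery throttles mentioned in the report.
    subprocess.check_call(["ceph", "tell", "osd.*", "injectargs",
                           "--osd_recovery_max_active 60"])
    subprocess.check_call(["ceph", "tell", "osd.*", "injectargs",
                           "--osd_max_backfills 2"])
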
Improvements:
Provide more resources to the monitors.
Run the monitors on optimized storage.
Continue working on the host hangs.
Upgrade the cluster to Jewel.