Computing Centre of the National Institute of Nuclear Physics and Particle Physics

How to make use of 1 billion observables per day

root@ Sysadmin_Days :/ 8 #

Fabien Wernli

@faxm0dem

Large Hadron Collider DAQ

collection

filtering

processing

Great Scott!

1 billion collisions per second = 1 PB/s

Fortunately, there are the triggers: 1 GB/s
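
As a back-of-the-envelope check of these figures, here is a minimal sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and takes the round numbers from the slide at face value; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, round figures taken from the slide.
PB = 10**15
GB = 10**9

collisions_per_second = 10**9   # 1 billion collisions per second
raw_rate = 1 * PB               # bytes per second before the triggers
triggered_rate = 1 * GB         # bytes per second after the triggers

# Implied average size of one collision in the raw stream.
bytes_per_collision = raw_rate // collisions_per_second

# How much the triggers reduce the data rate.
reduction_factor = raw_rate // triggered_rate

print(bytes_per_collision)  # bytes per collision
print(reduction_factor)     # raw rate divided by triggered rate
```

Under these assumptions each collision accounts for about 1 MB of raw data, and the triggers keep roughly one part in a million of the raw rate.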

WLCG
CCIN2P3
What interests us here, though, is the data coming from the data center itself:

metrics and logs

Logs

kernel: Killed process 29959, UID 42046, (hadd) total-vm:202363492kB, anon-rss:13069860kB, file-rss:60kB
ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x0
puppet-agent[16528]: Finished catalog run in 44.06 seconds

Metrics

host                  metric                       state     value
ccsvli64.in2p3.fr     cpu/cpu-idle                 ok        99.53
ccosvms0034.in2p3.fr  interface-eth0/if_octets/rx  ok        172.1
ccwntest14.in2p3.fr   df-var/percent_bytes-free    critical  0.004
samplerr: config
  • samplerr: Elasticsearch timeseries RRDTool-style
    
    (let
      [cfunc    [{:func samplerr/average :name "avg"}]
       archives [{:tf "YYYY.MM.dd" :step (t/seconds 30) :ttl   (t/days 1) :cfunc cfunc}
                 {:tf "YYYY.MM"    :step (t/minutes 40) :ttl (t/months 1) :cfunc cfunc}
                 {:tf "YYYY"       :step (t/hours 8)    :ttl  (t/years 5) :cfunc cfunc}]
       persist  (batch 10000 20 (samplerr/persist {:index-prefix ".samplerr-" :index-type "samplerr" :conn elastic}))]
      (by [:host :service]
        (samplerr/down archives persist)))
                          
    grafana
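
To illustrate the RRDTool-style consolidation that the configuration above describes, here is a minimal Python sketch. It is not samplerr's implementation: it simply assumes that each archive keeps one averaged point per `step` window. The function names, the archive labels (`daily`, `monthly`, `yearly`) and the sample data are hypothetical; the `step` and `ttl` values are copied from the config, assuming a 30-day month and a 365-day year.

```python
from collections import defaultdict


def downsample_average(points, step):
    """Average (timestamp, value) pairs into fixed windows of `step` seconds.

    Returns a list of (window_start, mean_value) sorted by window start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        window_start = ts - (ts % step)  # align the timestamp to its window
        buckets[window_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# step and ttl in seconds, taken from the archives in the config above.
# Assumption: 1 month = 30 days, 1 year = 365 days.
ARCHIVES = [
    {"name": "daily", "step": 30, "ttl": 1 * 86400},
    {"name": "monthly", "step": 40 * 60, "ttl": 30 * 86400},
    {"name": "yearly", "step": 8 * 3600, "ttl": 5 * 365 * 86400},
]


def max_points(archive):
    """Upper bound on the points one series keeps in an archive."""
    return archive["ttl"] // archive["step"]


# Hypothetical raw samples: (timestamp in seconds, value).
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (30, 40.0), (50, 60.0)]
consolidated = downsample_average(raw, 30)
print(consolidated)

for archive in ARCHIVES:
    print(archive["name"], max_points(archive))
```

With these assumptions no archive holds more than a few thousand points per series (2880, 1080 and 5475 respectively), which is the point of the RRD-style layout: resolution drops as data ages, so storage per series stays bounded.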
Search Guard: features
  • HTTPS
  • Node-Node SSL
  • Audit logging
  • REST management API
  • Kibana multi-tenancy
  • Access control levels
    • Index
    • Document
    • Field
demo

...
Lesson no. 2: Monitor the monitoring!
The Large Hadron Collider