Refactor #30820: Drop SHA1 digest and use hash index instead for reports - Foreman

Refactor #30820

Updated by Lukas Zapletal almost 5 years ago

Foreman reports storage is designed so that reports are broken into individual lines and store them as 1:N references. The biggest offender is the messages and sources tables which contain those lines together with SHA1 hash - both are exactly the same, because puppet returls lines in format: "[source] log_content". So what we essentially do is break reports into lines, break lines into two pieces, calculate SHA1 sum from both parts and store that into the database. 

 The reason for all of that is that users can search for a source or for a (whole) report line using scoped search. This is only useful for Puppet users (source), I am unable to find a use case for searching for whole line. Full text search would make sense, or doing SQL LIKE query as well, but non of that Foreman supports. 

 The way this is implemented is to overcome SQL server limit of index because "messages"."value" and "sources"."value" fields are TEXT type. We caculate SHA1 hash and store the hash in a separate field which is then indexed. This is extremely slow, it's wasting of resources, there is no resolution of SHA1 conflicts. 

 This patch gets rid of our own removes ability to search for source and line completely while dropping SHA1 "digest" column fields for both tables and let SQL server do the index directly indices on the text column. Postgres has a hash index this field for that, it is faster than b-tree and efficient both tables effectively removing huge amount of data for comparison only, which is exactly what we need (we do not sort logs most deployments at the cost of either completely dropping the feature, or sources tables - alternatively keeping it makes no sense).  

 The patch  

 * drop digest completely from the code base and let using SQL server tablescan to do hashing via index 
 * drop digest field and index from both tables (2 indices, 2 fields) 
 * create index on value column (migration can get the same results. A combination could be slow used too - testing needed) 

 The only issue could be sqlite3 which we still require remove searching for packaging, from my rough testing I am able exact line but keep possibility to create index on text column just fine. Another struggle could be Ruby on Rails support do tablescan search for psql hash index, SQL command might be needed in a migration to create it. resources. 

 For the record, hash index is supported in postgres from version 8.2 and from version 10 it is transaction safe. Meaning that a crash on pre-10 version would require index rebuild which is a trivial operation. CentOS installations already use Postgres v12 and Debian (Buster) has v11. 

 Also, complete refactoring of reports is planned where searching in reports will be only possible via tablescan anyway: 

 https://community.theforeman.org/t/rfc-optimized-reports-storage/15573

Back

Project

General

Profile

Refactor #30820