DisCopy


Monday, 27 June 2022

PostgreSQL VACUUM!

VACUUM

is the garbage-collector of PostgreSQL Database to discard obselete (updated) / deleted records of tables, materialized views and optionally analyze the statistics of a database/object.

If a table doesn’t get vacuumed, it will get bloated, which wastes disk space and also hampers the performance.

 

PostgreSQL's VACUUM command is a maintenance command needs to be run periodically to cater the following vital tasks:

1.      To recover or reuse disk space occupied by updated or deleted rows.

2.      To analyze/update data statistics used by the PostgreSQL query planner.

3.      To protect against loss of very old data due to transaction ID wraparound.

 

In normal PostgreSQL operation, tuples/records/rows that are deleted or obsoleted by an update are not physically removed from their table until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables as VACUUM reclaims storage occupied by dead tuples.

Normally we don’t need to take care of all that, because the autovacuum daemon does that with some limitations based on the configuration.

The frequency and scope of the VACUUM operations performed for each of the above reasons will vary depending on the needs of each database environment. Some tables with heavy updates and deletes need frequent vacuuming to discard obselete and deleted records. 

 

  • VACUUM processes every table and materialized view in the current database that the current user has permission to vacuum by default i.e. without a table_and_columns list specified, .With a list, VACUUM processes only those specified table(s).
  • VACUUM ANALYZE performs a VACUUM and then an ANALYZE for each selected table. This is a handy combination form for routine maintenance scripts. This option Updates statistics used by the planner to determine the most efficient way to execute a query thus the Optimizer choose right plan including join and indexes.

VACUUM (without FULLsimply reclaims space and makes it available for re-use. This command can operate in parallel with normal reading and writing of the table, as an exclusive lock is not obtained on the table. However, extra space is not returned to the operating system immediately. it's just kept available for re-use within the same table. It also allows us to leverage multiple CPUs in order to process indexes. This feature is known as parallel vacuum. We can use PARALLEL option and specify parallel workers as zero to disable the parallelism.

  • VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system. This form is much slower and requires an ACCESS EXCLUSIVE lock on each table while it is being processed. 

BTW, “full” vacuum, which can reclaim more space, takes much longer and exclusively locks impacts the performance and availability of the specific table. This method also requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete. Usually this should only be used when a significant amount of space needs to be reclaimed from within the table like after a huge purge activity on the table.

 

  • VACUUM FREEZE selects aggressive “freezing” of tuples.

Specifying FREEZE with VACUUM is useful in transaction id wrapping and is equivalent to performing VACUUM with the vacuum_freeze_min_age and vacuum_freeze_table_age   parameters set to zero

 Aggressive freezing is always performed when the table is rewritten, so this option is redundant when FULL is specified.

 

Equivalent Commands in other RDBMS:

  • ·      REORG and UPDATE STATISTICS in Sybase
  • ·      REORG and GATHER STATS in Oracle

 

Sunday, 26 June 2022

PostgreSQL Index types and use cases!

Indexes are database objects that can be created for a table to speed access to specific data rows by validating the existence of the data and pointing to the specific pages. Indexes store the values of the key(s) that were named when the index was created, and logical pointers to the data pages.

Although indexes speed data retrieval, they also can slow down data modifications (Each index creates extra work every time you insert, delete, or update a row) since most DML changes to the data also require updating the indexes.

Optimal indexing demands:

·       The workload on the table i.e. READs vs DMLs

·       An understanding of the behavior of queries that access unindexed heap tables, and tables with indexes

·       An understanding of the mix of queries that run on your server

·       An understanding of the PostgreSQL Query optimizer

Using indexes speeds optimizer access to data and reduces the amount of information read and processed from base tables. Whenever possible, the optimizer attempts index-seek-only retrieval to satisfy a query. With index-only retrieval, the database server uses only the data in the indexes to satisfy the query, and does not need to access rows in the table. The optimizer automatically chooses to use the indexes it determines will lead to the best performance.

Index Types

PostgreSQL provides several types of index: B-tree, Hash, GiST, SP-GiST, GIN and BRIN.

1. B-Tree

2. Hash

3. GiST

4. SP-GiST

5. GIN

6. BRIN

Each index type uses a different algorithm that is best suited to different types of queries. By default, the CREATE INDEX command creates B-tree indexes, which fit the most common situations. The other index types are selected by the keyword USING followed by the index type name.

For example, to create an index:

CREATE INDEX name ON table USING [HASH|GIN](column);

 

Hash Indexes

Hash indexes store a 32-bit hash code derived from the value of the indexed column. Hence, such indexes can only handle simple equality comparisons. The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the equal operator:

=

To create a hash index, you use the CREATE INDEX statement with the HASH index type in the USING clause as follows:

Syntax:

CREATE INDEX index_name ON table_name USING HASH (indexed_column);

GIN indexes

GIN stands for Generalized Inverted Indexes. It is commonly referred to as GIN.

GIN indexes are most useful when you have multiple values stored in a single column, for example, hstorearray, jsonb, and range types.

BRIN Indexes

BRIN stands for Block Range Indexes. BRIN is much smaller and less costly to maintain in comparison with a B-tree index.

BRIN allows the use of an index on a very large table that would previously be impractical using B-tree without horizontal partitioning. BRIN is often used on a column that has a linear sort order, for example, the created date column of the sales order table.

GiST Indexes

GiST stands for Generalized Search Tree. GiST indexes allow a building of general tree structures. GiST indexes are useful in indexing geometric data types and full-text search.

SP-GiST Indexes

SP-GiST stands for space-partitioned GiST. SP-GiST supports partitioned search trees that facilitate the development of a wide range of different non-balanced data structures. SP-GiST indexes are most useful for data that has a natural clustering element to it and is also not an equally balanced tree, for example, GIS, multimedia, phone routing, and IP routing.

 

Implicit Indexes

Implicit indexes are indexes that are automatically created by the database server when an object is created. Indexes are automatically created for primary key constraints and unique constraints.

 

Thursday, 23 June 2022

Interesting Behaviour of PostgreSQL character data types!

Every RDBMS supports multiple data types namely numeric, monetary, character, date(time), binary, Boolean and blobs etc. PostgreSQL supports special datatypes like Geometric, Network-Address, UUID, XML, JSON, Text Search types etc.

Out of these 70% of columns are typically character data types. Character data types are strings of ASCII characters. Upper and Lower case alphabetic characters are accepted literally. There are three kinds of character data types as cited:

1.       fixed-length, blank padded - char

2.       variable-length with limit - varchar                       and

3.       variable unlimited length  - text

The storage requirement for a character datatype for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1 byte. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.

BTW, In any case, the longest possible character string that can be stored is about 1 GB. (The maximum value that will be allowed for n in the data type declaration is less than that. It wouldn't be useful to change this because with multibyte character encodings the number of characters and bytes can be quite different. If you desire to store long strings with no specific upper limit, use text or character varying without a length specifier, rather than making up an arbitrary length limit.)

Most importantly interesting behaviour in PostgreSQL is how it handles these character data types.

There is no performance difference among these three data types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

 

Source: postgresql.org