In HBase, the "memory to disk" flush operation

In HBase, the "memory to disk" flush operation (or "memstore flush") happens when data in the memstore (in-memory storage) is flushed to disk as HFiles in HDFS. This flush can occur based on multiple triggers rather than a strict time interval, including:

  1. Memstore Size Limit: The primary trigger is the memstore reaching a size threshold. Each region flushes its memstore once it grows to hbase.hregion.memstore.flush.size (134217728 bytes, i.e. 128 MB, by default). Separately, hbase.regionserver.global.memstore.size caps the combined size of all memstores on a RegionServer as a fraction of the heap (0.4 by default); when that limit is reached, new updates are blocked and flushes are forced until usage drops again.

  2. Time-Based Flush: Time is not the main flush trigger, but HBase also flushes memstores periodically, so that edits reach disk within a bounded time even if the memstore size limit hasn't been reached. In stock HBase this is controlled by:

    • hbase.regionserver.optionalcacheflushinterval: This sets the maximum time (in milliseconds) that an edit can stay in memory before it is automatically flushed to disk. The default is 3600000 (1 hour). Setting it to 0 disables periodic flushing, so that only memstore size triggers a flush.

    For example:

    hbase.regionserver.optionalcacheflushinterval=3600000  # Flush after at most 1 hour (the default)
    
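  3. Manual Flush: You can also flush on demand, either from the HBase shell (for example, flush 'my_table') or through the Java client API with Admin.flush(TableName). The shell command also accepts a region name in place of a table name and, in newer releases, a RegionServer name. This is useful in certain situations for performance tuning or maintenance, such as before taking a snapshot or shutting a node down.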
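
The way these triggers combine can be pictured with a small model. The Python sketch below is only an illustration, not HBase's actual implementation: the names FlushConfig, MemstoreState and flush_reason are hypothetical, the size defaults mirror the stock values of hbase.hregion.memstore.flush.size (128 MB) and hbase.regionserver.global.memstore.size (0.4), and details such as lower watermarks and flush jitter are left out.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024
GB = 1024 * MB


@dataclass
class FlushConfig:
    # Hypothetical container for the settings discussed above.
    region_flush_size: int = 128 * MB      # hbase.hregion.memstore.flush.size
    global_memstore_fraction: float = 0.4  # hbase.regionserver.global.memstore.size
    periodic_flush_ms: int = 3_600_000     # 1 hour, as in the example above; 0 disables


@dataclass
class MemstoreState:
    region_bytes: int        # memstore size of one region
    server_bytes: int        # memstore size summed over all regions on the server
    heap_bytes: int          # RegionServer heap size
    oldest_edit_age_ms: int  # age of the oldest unflushed edit


def flush_reason(state: MemstoreState, cfg: FlushConfig,
                 manual: bool = False) -> Optional[str]:
    """Return why a flush would be requested in this simplified model, or None."""
    if manual:
        return "manual"
    if state.region_bytes >= cfg.region_flush_size:
        return "region size"
    if state.server_bytes >= cfg.global_memstore_fraction * state.heap_bytes:
        return "global memory pressure"
    if cfg.periodic_flush_ms > 0 and state.oldest_edit_age_ms >= cfg.periodic_flush_ms:
        return "periodic"
    return None


if __name__ == "__main__":
    cfg = FlushConfig()
    # A small memstore whose oldest edit is two hours old.
    small_but_old = MemstoreState(10 * MB, 1 * GB, 16 * GB, 2 * 3_600_000)
    print(flush_reason(small_but_old, cfg))  # prints "periodic"
```

In this model a region that receives few writes is flushed only by the periodic check, which is the case the time-based setting exists for.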

Best Practice

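Leaving the periodic flush interval at its default generally works well: HBase then flushes mainly on memory thresholds, which is usually more efficient because it produces fewer, larger HFiles and less compaction work. Note that writes are also recorded in the write-ahead log (WAL) before they reach the memstore, so a RegionServer crash does not by itself lose unflushed data as long as the WAL is enabled. If you still want data to reach HFiles sooner (for example, to limit how much of the WAL must be replayed after a crash under heavy load), you can lower hbase.regionserver.optionalcacheflushinterval to enforce more frequent periodic flushes.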
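
If you do change the interval, the value goes into hbase-site.xml as a property element with a name and a value, expressed in milliseconds. The following Python sketch shows one way to produce such an entry without arithmetic slips; the helper names minutes_to_ms and hbase_property are hypothetical, and the 30-minute value is only an example. The property name is passed in as an argument, and the usage line assumes the periodic-flush setting hbase.regionserver.optionalcacheflushinterval.

```python
import xml.etree.ElementTree as ET


def minutes_to_ms(minutes: float) -> int:
    """Convert minutes to the millisecond value HBase expects."""
    return int(minutes * 60 * 1000)


def hbase_property(name: str, value) -> str:
    """Render one <property> element in the hbase-site.xml layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Example only: request a periodic flush every 30 minutes.
    print(hbase_property("hbase.regionserver.optionalcacheflushinterval",
                         minutes_to_ms(30)))
```

The printed element belongs inside the configuration root of hbase-site.xml on each RegionServer, and a RegionServer restart is the safe way to make sure the new value is picked up.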
