Cassandra 文件

版本

您正在檢視預發行版本的說明文件。

檢視最新版本

Cassandra 記錄

Cassandra 豐富地支援記錄,並嘗試讓操作員最大程度地深入瞭解資料庫,同時將記錄中的雜訊限制在最低程度。

常見記錄檔案

Cassandra 有三個主要記錄,分別是 system.logdebug.loggc.log,分別儲存一般記錄訊息、除錯記錄訊息和 Java 垃圾回收記錄。

這些記錄預設位於 ${CASSANDRA_HOME}/logs,但大多數 Linux 發行版會將記錄移至 /var/log/cassandra。操作員可以使用提供的 logback.xml 檔案調整此位置以及記錄的層級。

system.log

此記錄是預設的 Cassandra 記錄,是開始任何調查的好地方。記錄在此記錄中的一些活動範例

  • 未捕捉的例外。這些對於除錯錯誤非常有用。

  • GCInspector 訊息表示垃圾回收器暫停時間過長。當暫停時間過長時,Cassandra 會列印暫停時間以及暫停當時系統 (執行緒狀態) 的狀態。這有助於縮小容量問題 (堆積不足或備用 CPU 不足)。

  • 關於節點加入和離開叢集的資訊,以及令牌中繼資料 (資料擁有權) 變更。這對於除錯網路分割、資料移動等很有用。

  • 鍵集/表格建立、修改、刪除。

  • StartupChecks 可確保作業系統最佳設定,以執行 Cassandra

  • 關於某些背景作業任務(例如索引重新分配)的資訊。

與任何應用程式一樣,尋找 ERRORWARN 行可能是第一步

$ # Search for warnings or errors in the latest system.log
$ grep 'WARN\|ERROR' system.log | tail
...

$ # Search for warnings or errors in all rotated system.log
$ zgrep 'WARN\|ERROR' system.log.* | less
...

debug.log

此記錄包含其他除錯資訊,在進行疑難排解時可能很有用,但可能比一般的 system.log 還要吵雜。記錄到此記錄的活動範例

  • 關於壓縮的資訊,包括壓縮開始時間、包含哪些 sstable,以及完成時間。

  • 關於記憶表沖刷到磁碟的資訊,包括發生時間、沖刷大小,以及沖刷影響哪些提交記錄區段。

此記錄可能非常吵雜,因此強烈建議使用 grep 和其他記錄分析工具深入探討。例如

# Search for messages involving a CompactionTask with 5 lines of context
$ grep CompactionTask debug.log -C 5

# Look at the distribution of flush tasks per keyspace
$ grep "Enqueuing flush" debug.log | cut -f 10 -d ' ' | sort | uniq -c
    6 compaction_history:
    1 test_keyspace:
    2 local:
    17 size_estimates:
    17 sstable_activity:

gc.log

GC 記錄是標準的 Java GC 記錄。使用預設的 jvm.options 設定,您可以在此記錄中取得許多有價值的資訊,例如應用程式暫停時間,以及暫停發生的原因。這可能有助於將吞吐量或延遲問題縮小到 JVM 設定錯誤。例如,您可以檢視最後幾次暫停

$ grep stopped gc.log.0.current | tail
2018-08-29T00:19:39.522+0000: 3022663.591: Total time for which application threads were stopped: 0.0332813 seconds, Stopping threads took: 0.0008189 seconds
2018-08-29T00:19:44.369+0000: 3022668.438: Total time for which application threads were stopped: 0.0312507 seconds, Stopping threads took: 0.0007025 seconds
2018-08-29T00:19:49.796+0000: 3022673.865: Total time for which application threads were stopped: 0.0307071 seconds, Stopping threads took: 0.0006662 seconds
2018-08-29T00:19:55.452+0000: 3022679.521: Total time for which application threads were stopped: 0.0309578 seconds, Stopping threads took: 0.0006832 seconds
2018-08-29T00:20:00.127+0000: 3022684.197: Total time for which application threads were stopped: 0.0310082 seconds, Stopping threads took: 0.0007090 seconds
2018-08-29T00:20:06.583+0000: 3022690.653: Total time for which application threads were stopped: 0.0317346 seconds, Stopping threads took: 0.0007106 seconds
2018-08-29T00:20:10.079+0000: 3022694.148: Total time for which application threads were stopped: 0.0299036 seconds, Stopping threads took: 0.0006889 seconds
2018-08-29T00:20:15.739+0000: 3022699.809: Total time for which application threads were stopped: 0.0078283 seconds, Stopping threads took: 0.0006012 seconds
2018-08-29T00:20:15.770+0000: 3022699.839: Total time for which application threads were stopped: 0.0301285 seconds, Stopping threads took: 0.0003789 seconds
2018-08-29T00:20:15.798+0000: 3022699.867: Total time for which application threads were stopped: 0.0279407 seconds, Stopping threads took: 0.0003627 seconds

這顯示許多有價值的資訊,包括應用程式暫停多久(表示在例如 33 毫秒的 JVM 暫停期間,沒有提供任何使用者查詢服務),以及進入安全點花了多久。您可以使用這些原始資料來取得最長的暫停時間

$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n  | tail | xargs -IX grep X gc.log.0.current | sort -k 1
2018-08-28T17:13:40.520-0700: 1.193: Total time for which application threads were stopped: 0.0157914 seconds, Stopping threads took: 0.0000355 seconds
2018-08-28T17:13:41.206-0700: 1.879: Total time for which application threads were stopped: 0.0249811 seconds, Stopping threads took: 0.0000318 seconds
2018-08-28T17:13:41.638-0700: 2.311: Total time for which application threads were stopped: 0.0561130 seconds, Stopping threads took: 0.0000328 seconds
2018-08-28T17:13:41.677-0700: 2.350: Total time for which application threads were stopped: 0.0362129 seconds, Stopping threads took: 0.0000597 seconds
2018-08-28T17:13:41.781-0700: 2.454: Total time for which application threads were stopped: 0.0442846 seconds, Stopping threads took: 0.0000238 seconds
2018-08-28T17:13:41.976-0700: 2.649: Total time for which application threads were stopped: 0.0377115 seconds, Stopping threads took: 0.0000250 seconds
2018-08-28T17:13:42.172-0700: 2.845: Total time for which application threads were stopped: 0.0475415 seconds, Stopping threads took: 0.0001018 seconds
2018-08-28T17:13:42.825-0700: 3.498: Total time for which application threads were stopped: 0.0379155 seconds, Stopping threads took: 0.0000571 seconds
2018-08-28T17:13:43.574-0700: 4.247: Total time for which application threads were stopped: 0.0323812 seconds, Stopping threads took: 0.0000574 seconds
2018-08-28T17:13:44.602-0700: 5.275: Total time for which application threads were stopped: 0.0238975 seconds, Stopping threads took: 0.0000788 seconds

在此情況下,任何等待查詢的客戶端,在 17:13:41 時會遇到 56 毫秒的延遲。

請注意,GC 暫停不只垃圾收集,儘管一般來說,安全點快速的長時間暫停表示缺乏 JVM 堆積或 JVM GC 演算法設定錯誤。安全點緩慢的高暫停時間通常表示 JVM 在進入安全點時遇到問題,這通常表示磁碟機速度慢(Cassandra 大量使用記憶體對應讀取,而 JVM 不知道可能會遇到磁碟延遲,因此 JVM 安全點邏輯無法特別妥善處理阻擋記憶體對應讀取)。

使用這些記錄,您甚至可以使用類似 histogram.py 的工具取得暫停分佈

$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | histogram.py
# NumSamples = 410293; Min = 0.00; Max = 11.49
# Mean = 0.035346; Variance = 0.002216; SD = 0.047078; Median 0.036498
# each ∎ represents a count of 5470
    0.0001 -     1.1496 [410255]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.1496 -     2.2991 [    15]:
    2.2991 -     3.4486 [     5]:
    3.4486 -     4.5981 [     1]:
    4.5981 -     5.7475 [     5]:
    5.7475 -     6.8970 [     9]:
    6.8970 -     8.0465 [     1]:
    8.0465 -     9.1960 [     0]:
    9.1960 -    10.3455 [     0]:
   10.3455 -    11.4949 [     2]:

我們可以在此情況中看到,儘管我們的平均效能非常好,但有些東西導致 JVM 暫停數秒 …​ 在此情況下,主要是由磁碟速度慢導致的安全點暫停

$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | tail | xargs -IX grep X  gc.log.0.current| sort -k 1
2018-07-27T04:52:27.413+0000: 187831.482: Total time for which application threads were stopped: 6.5037022 seconds, Stopping threads took: 0.0005212 seconds
2018-07-30T23:38:18.354+0000: 514582.423: Total time for which application threads were stopped: 6.3262938 seconds, Stopping threads took: 0.0004882 seconds
2018-08-01T02:37:48.380+0000: 611752.450: Total time for which application threads were stopped: 10.3879659 seconds, Stopping threads took: 0.0004475 seconds
2018-08-06T22:04:14.990+0000: 1113739.059: Total time for which application threads were stopped: 6.0917409 seconds, Stopping threads took: 0.0005553 seconds
2018-08-14T00:04:06.091+0000: 1725730.160: Total time for which application threads were stopped: 6.0141054 seconds, Stopping threads took: 0.0004976 seconds
2018-08-17T06:23:06.755+0000: 2007670.824: Total time for which application threads were stopped: 6.0133694 seconds, Stopping threads took: 0.0006011 seconds
2018-08-23T06:35:46.068+0000: 2526830.137: Total time for which application threads were stopped: 6.4767751 seconds, Stopping threads took: 6.4426849 seconds
2018-08-23T06:36:29.018+0000: 2526873.087: Total time for which application threads were stopped: 11.4949489 seconds, Stopping threads took: 11.4638297 seconds
2018-08-23T06:37:12.671+0000: 2526916.741: Total time for which application threads were stopped: 6.3867003 seconds, Stopping threads took: 6.3507166 seconds
2018-08-23T06:37:47.156+0000: 2526951.225: Total time for which application threads were stopped: 7.9528200 seconds, Stopping threads took: 7.9197756 seconds

有時閱讀和瞭解 Java GC 記錄很困難,但您可以取得原始 GC 檔案,並使用 GCViewer 等工具將其視覺化,這些工具會將 Cassandra GC 記錄作為輸入,並向您顯示垃圾收集效能的詳細視覺化資訊。這包括暫停分析和吞吐量資訊。對於穩定的 Cassandra JVM,您可能希望暫停時間小於 200 毫秒,GC 吞吐量大於 99%。

Java GC 暫停是 Cassandra 中尾端延遲的主要原因之一(與磁碟延遲一起),因此有時在除錯尾端延遲問題時,這些資訊可能至關重要。

取得更多資訊

如果預設記錄等級不足,nodetool 可以使用 nodetool setlogginglevel 指令為各種套件和類別設定較高或較低的記錄等級。先從檢視目前等級開始

$ nodetool getlogginglevels

Logger Name                                        Log Level
ROOT                                                    INFO
org.apache.cassandra                                   DEBUG

也許 Gossiper 出了問題,我們希望在 TRACE 等級啟用它以獲得更多深入見解

$ nodetool setlogginglevel org.apache.cassandra.gms.Gossiper TRACE

$ nodetool getlogginglevels

Logger Name                                        Log Level
ROOT                                                    INFO
org.apache.cassandra                                   DEBUG
org.apache.cassandra.gms.Gossiper                      TRACE

$ grep TRACE debug.log | tail -2
TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:1234 - Updating
heartbeat state version to 2344 from 2343 for 127.0.0.2:7000 ...
TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:923 - local
heartbeat version 2341 greater than 2340 for 127.0.0.1:7000

請注意,以這種方式進行的任何變更都會在下次重新啟動 Cassandra 程序時還原。若要使變更永久生效,請將適當的規則新增至 logback.xml

diff --git a/conf/logback.xml b/conf/logback.xml
index b2c5b10..71b0a49 100644
--- a/conf/logback.xml
+++ b/conf/logback.xml
@@ -98,4 +98,5 @@ appender reference in the root level section below.
   </root>

   <logger name="org.apache.cassandra" level="DEBUG"/>
+  <logger name="org.apache.cassandra.gms.Gossiper" level="TRACE"/>
 </configuration>

請注意,如果您想要比此工具提供的更多資訊,還有其他可用的即時擷取選項,例如 packet-capture