A scoring system for the diagnosis of non-alcoholic steatohepatitis from liver biopsy
Abstract
Background
Liver biopsy is the essential method to diagnose non-alcoholic steatohepatitis (NASH), but histological features of NASH are too subjective to achieve reproducible diagnoses in early stages of disease. We aimed to identify the key histological features of NASH and devise a scoring model for diagnosis.
Methods
Thirteen pathologists blindly assessed 12 histological factors and assigned a final histological diagnosis (‘not-NASH,’ ‘borderline,’ or ‘NASH’) to 31 liver biopsies diagnosed as non-alcoholic fatty liver disease (NAFLD) or NASH, once before and once after a consensus meeting. The main histological parameters for diagnosing NASH were selected based on the histological diagnoses, and the diagnostic accuracy and agreement of 12 scoring models were compared with the final diagnosis and the NAFLD Activity Score (NAS) system.
Results
Inter-observer agreement on the final diagnosis was fair before consensus (κ = 0.25) and improved slightly after consensus (κ = 0.33). Steatosis of more than 5% was the essential parameter for diagnosis. The major diagnostic factors were fibrosis (except grade 1C) and the presence of ballooned cells. The minor diagnostic factors were lobular inflammation (≥ 2 foci per × 200 field), microgranuloma, and glycogenated nuclei. All 12 models showed higher inter-observer agreement than NAS and the post-consensus diagnosis (κ = 0.52–0.69 vs. 0.33). Considering the reproducibility of the factors and the practicability of the model, summation of the scores of major (× 2) and minor factors may be used for the practical diagnosis of NASH.
Conclusions
A scoring system for the diagnosis of NAFLD would be helpful as guidelines for pathologists and clinicians by improving the reproducibility of histological diagnosis of NAFLD.
INTRODUCTION
Hepatic steatosis has long been regarded as a general morphological change caused by a variety of etiologies, e.g., alcohol, viral hepatitis, drugs or toxins, or metabolic disease. Alcoholic steatohepatitis is the prototype of fatty liver disease, but excessive alcohol consumption is regarded as a major challenge in studying the disease. Recently, abnormal hepatic steatosis, irrespective of the inducing agent, has been classified as an independent disease that can lead to hepatocellular damage, progress to chronic liver disease, and increase the incidence of liver cancer. Non-alcoholic fatty liver disease (NAFLD) is a disease entity characterized by hepatic steatosis without a history of significant alcohol use or other known liver disease. Metabolic syndrome, obesity, hyperlipidemia, nutritional imbalance associated with gastro-intestinal surgery, and parenteral nutrition are risk factors for NAFLD.
NAFLD is part of a hepatic steatosis spectrum that ranges from simple steatosis without clinical abnormality to steatohepatitis with clinical symptoms. Clinical information, including abnormal liver function tests, radiologic findings, subjective symptoms, other causes of liver disease, and consumption of alcohol or drugs, is critical for diagnosing NAFLD. Histological assessment of a liver biopsy is considered the only means of distinguishing simple steatosis from non-alcoholic steatohepatitis (NASH). The degree of steatosis, evidence of hepatocyte injury, and presence of fibrosis, which implies chronic liver injury or possible progression to chronic liver disease, are the major factors that help to discriminate simple steatosis from steatohepatitis. Several grading systems have been published by US and European pathologists since Brunt et al. [1] published the first grading system in 1999 [2-5]. Common morphologic factors include the degree of steatosis, inflammation, ballooning change of hepatocytes indicating cellular damage, and fibrosis reflecting the chronicity of liver disease. These systems play an important role in providing quantitative assessment criteria for NAFLD, but they generally do not provide diagnostic criteria for judging whether the disease is so-called simple steatosis or NASH [3]. However, clinicians and researchers require pathologists to distinguish simple steatosis from NASH for treatment or clinical study.
Classifications of simple steatosis and NASH differ depending on the researcher, and the histomorphological criteria for the pathological features of NAFLD in liver tissue remain subjective, with low reproducibility. Thus, in this study we divided NAFLD into three diagnostic categories (‘not-NASH,’ ‘borderline,’ and ‘NASH’), evaluated diagnostic agreement, and proposed a diagnostic scoring system that could increase diagnostic consistency and accuracy.
MATERIALS AND METHODS
Case selection and histological review
Thirteen pathologists reviewed 31 liver biopsies that were clinically and pathologically diagnosed as NAFLD from 10 hospitals (Daegu Catholic University Medical Center, Dong-A University Hospital, Samsung Medical Center, Seoul National University Hospital, Inje University Seoul Paik Hospital, Seoul St. Mary’s Hospital, Soon Chun Hyang University Seoul Hospital, Wonju Severance Christian Hospital, Inha University Hospital, Chungnam National University Hospital). The selection criteria were clinical NAFLD (non-alcoholic; serologically negative for viral and autoimmune markers; abnormal levels of liver enzymes such as aspartate aminotransferase and alanine aminotransferase) and age ≥ 19 years. Cases with cirrhosis or with drug or toxic injury were excluded. One hematoxylin and eosin–stained slide and one Masson’s trichrome–stained slide were prepared for each case, then anonymized and randomized by a researcher not involved in the study. Pathologists blindly assessed 12 histological parameters and assigned each of the 31 liver biopsies a final diagnosis from one of three categories: ‘not-NASH,’ ‘borderline,’ or ‘NASH.’ The twelve histological parameters and detailed scoring criteria followed those previously reported [6].
Evaluation of diagnostic agreement, selection of histological parameters, and comparison of diagnostic models
The blinded review was conducted twice, once before and once after the consensus meeting. Pre-consensus and post-consensus diagnostic agreements were compared, and the selection of diagnostic parameters and the modeling were based on the post-consensus results. The gold standard for each case was the diagnosis made by more than half of the participants after consensus. Final diagnosis agreement rates were assessed by the free-marginal multirater kappa (multirater κfree) [7]. Among the 12 histological parameters, those that significantly discriminated ‘not-NASH,’ ‘borderline,’ and ‘NASH’ were selected by the chi-square test and by univariate and multivariate repeated-measures logistic regression analysis. A p-value of < .05 was considered statistically significant. All statistical analyses (except the kappa analysis) were performed using IBM SPSS Statistics ver. 21 (IBM Corp., Armonk, NY, USA). The kappa value was calculated using an online Kappa Calculator [8]. The cut-off value of the weighted model was determined from the receiver operating characteristic (ROC) curve.
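The agreement statistic named above can be illustrated with a short sketch. It uses the standard free-marginal formulation, in which chance agreement is fixed at 1/k for k categories; the function name and the toy ratings are hypothetical, and the study itself used an online calculator [8], not this code.

```python
from collections import Counter

# The three diagnostic categories used in this study.
CATEGORIES = ("not-NASH", "borderline", "NASH")


def free_marginal_kappa(ratings, categories=CATEGORIES):
    """Free-marginal multirater kappa.

    ratings: list of cases; each case is a list with one label per rater.
    Chance agreement is assumed to be 1/k, where k is the number of categories.
    """
    if not ratings:
        raise ValueError("at least one case is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number (>= 2) of raters")

    # Observed agreement: share of agreeing rater pairs, averaged over cases.
    total = 0.0
    for case in ratings:
        counts = Counter(case)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        total += agreeing_pairs / (n_raters * (n_raters - 1))
    p_observed = total / len(ratings)

    p_expected = 1.0 / len(categories)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: two cases, three raters each.
toy = [["NASH", "NASH", "NASH"], ["NASH", "NASH", "borderline"]]
print(round(free_marginal_kappa(toy), 2))
```

Because the expected agreement does not depend on how often each category is used, this variant is suited to designs in which raters are free to assign any number of cases to each category.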
Ethics statement
The Institutional Review Board of Seoul St. Mary’s Hospital approved this study with a waiver of informed consent (KIRB-00562_5-001).
RESULTS
Distribution of diagnoses and diagnostic agreement of NAFLD
The diagnostic frequencies of all 31 cases before (pre) and after (post) consensus are plotted in Fig. 1. The agreement rate for ‘NASH’ or ‘borderline’ in the pre-consensus diagnoses of all 31 cases was 53%–100%, and there was no case in which the majority diagnosis was ‘not-NASH.’ After consensus, five cases (case Nos. 21, 2, 11, 12, and 10) were classified as ‘not-NASH’ by more than 50% of pathologists and 22 cases were classified as ‘borderline’ or ‘NASH’ by more than 50% of pathologists. The remaining four cases (case Nos. 3, 20, 37, and 28) had no dominant diagnosis. Classification was clearer after consensus than before. Kappa values for inter-observer agreement in the pre-consensus and post-consensus diagnoses are summarized in Table 1. Pre-consensus kappa values were in the fair range, below 0.4, in all categories. Post-consensus kappa values were higher than pre-consensus values in all categories but remained fair except in the ‘NASH’ group (n = 22), where kappa increased from 0.35 to 0.41. The agreement rate for NASH after consensus was 60.72%, a slight increase relative to before consensus (overall agreement rate, 56.93%). The increase in agreement rate was more pronounced in the ‘not-NASH’ category, from 33.59% to 49.49%. Representative histologic images of post-consensus ‘not-NASH’ (case 11), ‘borderline’ (case 17), and ‘NASH’ (case 30) cases are shown in Fig. 2.
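The "more than 50% of pathologists" rule used to assign a dominant diagnosis can be sketched as a strict-majority vote. This is only an illustration: the function name and the vote lists are hypothetical, and how the study handled the combined ‘borderline’ or ‘NASH’ grouping is not reproduced here.

```python
from collections import Counter


def majority_diagnosis(votes):
    """Return the label given by more than half of the raters, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority: exactly half of the votes is not enough.
    return label if 2 * count > len(votes) else None


# Hypothetical votes from 13 raters for two cases.
clear_case = ["NASH"] * 7 + ["borderline"] * 6
split_case = ["NASH"] * 6 + ["borderline"] * 6 + ["not-NASH"]
print(majority_diagnosis(clear_case))
print(majority_diagnosis(split_case))
```

A case for which the function returns `None` corresponds to a case with no dominant diagnosis.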
Selection of histological parameters for decision modelling
The twelve histological features assessed by the 13 pathologists in the 31 cases are summarized by final diagnosis in Table 2. The histological parameters that differed significantly among diagnoses (chi-square test, p < .05) were fibrosis, lobular inflammation, microgranuloma, portal inflammation, ballooning change, Mallory body, and glycogenated nuclei. Multivariate logistic regression analysis showed that fibrosis (except 1C), ballooning change, and microgranuloma were significant discriminators among the three groups; lobular inflammation, portal inflammation, Mallory body, and glycogenated nuclei were significant discriminators between ‘NASH’ and ‘not-NASH’ or ‘borderline.’ Parameters with a low incidence, such as portal inflammation and Mallory body, were excluded. Ballooning change and fibrosis (except 1C) were selected as major factors, and lobular inflammation, microgranuloma, and glycogenated nuclei were selected as minor factors, to construct a diagnostic model.
Decision models and accuracy
Nine models were constructed for quantitative diagnosis and are described in Table 3. Models 1–6 were non-weighted models in which diagnosis depended on the presence of major or minor factors and the severity of the factors was not considered (Table 3). Models 7–9 were weighted models that considered the grade of major and minor factors (Table 3). Model 7 used only major factors. Model 8 weighted major factors twice, and minor factors were stratified into two groups to reduce the ambiguity of equivocal findings: none to mild grade was scored as 0, and moderate to severe was scored as 1. Model 9 adds 9 points to the major factors, which corresponds to the total sum of the minor factors, and was the only model that used the degree of steatosis in its calculations (Table 3). Table 4 and Fig. 3 summarize the diagnostic accuracy referenced to the post-consensus diagnosis as the gold standard, the agreement rates, and the area under the curve (AUC) calculated from the ROC curve. Four cases with no consensus diagnosis were excluded. Concordance rates were higher in all scoring models than in the post-consensus diagnoses (κ = 0.52–0.69 vs. 0.33). The sensitivity, rate of borderline cases, kappa values, and overall agreement rates of the quantitative models were superior to those of the NAFLD Activity Score (NAS) system (Table 4). Specificity and false-negative rates were similar to or higher than those of the NAS system. Based on the AUC, model 8 showed the best performance (AUC, 0.88) (Fig. 3). Model 9 had lower false-positive and false-negative rates than the other models.
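The accuracy measures reported in Table 4 and Fig. 3 can be illustrated with a minimal sketch. It assumes the three-category diagnosis is collapsed to ‘NASH’ versus everything else, which may not match how the study computed its figures; the AUC is computed by the rank-based (Mann-Whitney) definition, and all names and data are hypothetical.

```python
def sensitivity_specificity(predicted, reference, positive="NASH"):
    """Sensitivity and specificity of predicted labels against reference labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    tn = sum(p != positive and r != positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def auc(scores, is_positive):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores and gold-standard labels (True = 'NASH').
scores = [1, 3, 2, 4]
labels = [False, False, True, True]
print(auc(scores, labels))
```

Sweeping a cut-off over the score and recomputing sensitivity and specificity at each value traces the ROC curve from which a cut-off can be chosen.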
Recommendation of decision model
Weighted models 8 and 9 were the finalists for recommendation. Overall accuracy was better for model 9 than for model 8; however, model 9 had a higher borderline rate than model 8, and model 8 had a higher AUC than model 9. The scores of model 9 were large, ranging from 0 to 88; therefore, model 8 would be more practical for clinical use. External validation is required to confirm the efficacy of the scoring system for diagnosis.
DISCUSSION
NAFLD is a disease spectrum ranging from simple steatosis to steatohepatitis. A major difference between simple steatosis and steatohepatitis is the presence of cellular injury induced by fat accumulation, which is apparent as ballooning change of hepatocytes, inflammation, and fibrosis. Many scoring systems have been published since Ludwig described NASH in 1980, but the purpose of these systems is to assess the severity of steatohepatitis, not to diagnose it [9]. The NAS system is a scoring system using steatosis, ballooning change, and lobular inflammation, but the diagnosis should be made before scoring. The reference range is 0–2 for not diagnostic of NASH and 5–8 for diagnostic of NASH, but scores of 3–4 are evenly distributed among the not diagnostic, borderline, and positive for NASH groups [2]. Low agreement rates for NASH in histological diagnosis are well known because the evaluation of each diagnostic feature is subjective and has low concordance rates [3,6]. Another limitation of the NAS system as a diagnostic criterion is that the severity of steatosis can obscure other grades, such as those of ballooning change and inflammation.
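The NAS reference ranges described above can be written out as a small sketch. The component ranges (steatosis 0–3, lobular inflammation 0–3, ballooning 0–2) are taken from the published NAS system rather than from this text, and the function names and return labels are hypothetical. As noted above, NAS is meant to follow a diagnosis, not replace it.

```python
def nas_score(steatosis, lobular_inflammation, ballooning):
    """Sum the three NAS components (total range 0-8)."""
    limits = {
        "steatosis": (steatosis, 3),
        "lobular_inflammation": (lobular_inflammation, 3),
        "ballooning": (ballooning, 2),
    }
    for name, (value, upper) in limits.items():
        if not 0 <= value <= upper:
            raise ValueError(f"{name} must be between 0 and {upper}")
    return steatosis + lobular_inflammation + ballooning


def nas_reference_range(nas):
    """Map a NAS total to the reference ranges quoted in the text."""
    if not 0 <= nas <= 8:
        raise ValueError("NAS must be between 0 and 8")
    if nas <= 2:
        return "not diagnostic of NASH"
    if nas >= 5:
        return "diagnostic of NASH"
    return "indeterminate"  # scores of 3-4 fall in any diagnostic group


print(nas_reference_range(nas_score(3, 0, 0)))
```

In the last line, severe steatosis alone yields a score of 3, which shows how the steatosis grade can move a case into the indeterminate range without any ballooning or inflammation.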
In the present study, we attempted to construct a diagnostic scoring system that reduces inter-observer variation, based on the subjective assessment of 31 liver biopsies by 13 pathologists. Concordance rates of subjective assessment were fair before and after consensus, but quantitative scoring increased concordance rates to a moderate to substantial level in all models (κ = 0.33 vs. 0.52–0.69). Decreased inter-observer variation with a semiquantitative scoring system was reported by the Fatty Liver Inhibition of Progression (FLIP) Pathology Consortium in 2014 [3]. They proposed a NASH diagnostic algorithm and the Steatosis, Activity, and Fibrosis score (SAF score) based on the presence of steatosis and the grades of ballooning change and lobular inflammation. Grade 1 or 2 ballooning change and grade 1 or 2 lobular inflammation were the minimum diagnostic criteria used in the FLIP algorithm [3]. Concordance rates increased from 77% to 97% after using the FLIP algorithm, and the kappa value also increased from moderate to substantial (κ = 0.54–0.66) [3].
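The minimum criteria of the FLIP algorithm as summarized above can be sketched as a simple decision rule. This is a simplified reading of the description in this text, not the consortium's full algorithm; the function name, argument names, and return labels are hypothetical.

```python
def flip_diagnosis(steatosis_present, ballooning, lobular_inflammation):
    """Simplified FLIP-style rule based on the criteria quoted in the text.

    ballooning and lobular_inflammation are SAF grades from 0 to 2.
    """
    for name, grade in (("ballooning", ballooning),
                        ("lobular_inflammation", lobular_inflammation)):
        if grade not in (0, 1, 2):
            raise ValueError(f"{name} grade must be 0, 1, or 2")
    if not steatosis_present:
        return "no NAFLD"
    # NASH requires at least grade 1 of both ballooning and lobular inflammation.
    if ballooning >= 1 and lobular_inflammation >= 1:
        return "NASH"
    return "steatosis"


print(flip_diagnosis(True, 1, 1))
print(flip_diagnosis(True, 2, 0))
```

The rule is binary once steatosis is present, so it has no borderline category.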
The diagnostic components of our study were based on the key discriminators of the post-consensus diagnosis, which were selected by multivariate logistic regression analysis and the chi-square test. Ballooning change and lobular inflammation are the same histological factors that other grading systems use to discriminate NASH within NAFLD. The component that differed from other grading systems was fibrosis. Generally, many scoring systems for hepatitis and NAFLD use the concepts of grade and stage. Fibrosis is the key feature of liver injury progression and is assessed separately from necroinflammatory activity. Lobular inflammation, portal inflammation, and the presence of confluent necrosis are examples of activity. The activity grade reflects the current status of hepatic injury, and the stage of fibrosis predicts the progression of liver disease. The FLIP algorithm uses ballooning change and lobular inflammation as diagnostic factors but not fibrosis, which is used to assess the severity of NASH [10].
Our study showed that pathologists considered the presence of fibrosis to be a major histological feature of NASH. Our study enrolled adult NAFLD cases without other causes of hepatitis, such as virus, alcohol, or autoimmune disease. The pathologists were aware of these conditions beforehand and only assigned the diagnosis of NAFLD to one of three categories. Because fibrosis accompanying steatosis represents irreversible hepatic injury caused by steatosis, pathologists readily diagnosed NASH in this situation. Interestingly, grade 1C fibrosis, which is portal fibrosis and is usually observed in pediatric patients, did not affect the diagnosis of ‘not-NASH,’ ‘borderline,’ or ‘NASH.’ As the fibrosis grade increased, the tendency to diagnose NASH increased. The three-tiered scoring system for fibrosis (0, 1A, 1B-4 except 1C) was applied considering practicality, the reproducibility of grade 1A, and the smothering effect of a high fibrosis score on other diagnostic factors. Our previous report on the reproducibility of pathologic features of NAFLD mentioned ambiguity between the normal framework of the perivenular area and obvious pericellular collagen deposition [6]. Ballooning change is a mandatory feature of NASH, but its inter-observer agreement was not high (κ-value after consensus = 0.34); therefore, we adopted three levels for fibrosis grade and ballooning change [6] to prevent ambiguous scores from affecting the NASH diagnosis.
A common feature of our proposed model and the FLIP algorithm is that the amount of fat deposition is disregarded for diagnosis and fat deposition is considered a minimum requirement for NASH. In contrast, the grade of steatosis is a major factor in the NAS system [11]. The differences between our proposed model and the FLIP algorithm are (1) the presence of a borderline category in the diagnostic groups (steatosis vs. NASH in FLIP; ‘not-NASH,’ ‘borderline,’ and ‘NASH’ in our model), (2) the cutoff levels of ballooning and lobular inflammation for definite NASH, and (3) the adoption of fibrosis as a diagnostic component. In the FLIP criteria, grade 1 ballooning with grade 1 lobular inflammation is the minimum requirement for NASH, but such cases might be classified as borderline by our model because the cutoff value for lobular inflammation in our model was higher than that of the FLIP algorithm/SAF score (2–4 foci per × 200 field vs. < 2 foci per lobule) [3]. Borderline cases defined by our model might therefore be defined as NASH by the FLIP algorithm. A relatively low threshold for NASH in FLIP was reported in a comparative validation study of the NAS and SAF score [12]. Rastogi et al. [12] reported concordance of not-NASH and NASH between the NAS system and the SAF algorithm, but 79.4%–94.4% of borderline-NASH cases diagnosed by NAS were diagnosed as NASH by the SAF algorithm.
Fibrosis is a major predictor of the progression of NAFLD; however, the NAS and the FLIP algorithm/SAF score exclude fibrosis from the decision scheme. Excluding fibrosis from the score risks missing fibrotic but inactive NAFLD cases. Rastogi and colleagues reported that 76.39% of cases diagnosed as not-NASH by NAS and 78.63% of those diagnosed as not-NASH by the FLIP algorithm/SAF score showed the presence of fibrosis [12]. Only the fibrosis stage, and no other histological feature, was found to be independently associated with long-term overall mortality, liver transplantation, and liver-related events in a retrospective study of 619 NAFLD patients [13]. Inclusion of fibrosis as a diagnostic criterion may risk narrowing the range of definite NASH; however, considering the low progression rates of simple steatosis without fibrosis and the low inter-observer reproducibility of perivenular fibrosis and ballooning change, a borderline category with equivocal features can serve as a buffer group between not-NASH and definite NASH.
The limitations of our study are that the performance of the model was not verified in external datasets and clinicopathologic analysis was not performed due to the small size of the cohort. Further study including external validation of the model and risk prediction for disease progression of each diagnostic group could provide valuable information.
In summary, a semi-quantitative scoring system increased the diagnostic reproducibility of NASH. Subjective assessment followed by summation of two major factors (× 2; ballooning and fibrosis, each range 0–2) and three minor factors (lobular inflammation, glycogenated nuclei, and microgranuloma, each range 0–1) is proposed as a practical diagnostic criterion for NASH (diagnostic range: 0–3, ‘not-NASH’; 4–5, ‘borderline’; 6–11, ‘NASH’).
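The proposed summation can be sketched as follows, as one possible reading of the summary above rather than the authors' implementation. The function names are hypothetical. Mapping fibrosis grade 1C to tier 0 is an assumption drawn from the statement that 1C is excepted from the fibrosis factor, and the sketch assumes the prerequisite of steatosis of more than 5% has already been met.

```python
def fibrosis_tier(stage):
    """Map a fibrosis stage to the three-tier score (0, 1A, 1B-4 except 1C).

    Assumption: stage 1C is scored 0 because it is excepted from the factor.
    """
    tiers = {"0": 0, "1C": 0, "1A": 1, "1B": 2, "2": 2, "3": 2, "4": 2}
    if stage not in tiers:
        raise ValueError(f"unknown fibrosis stage: {stage!r}")
    return tiers[stage]


def nash_score(ballooning, fibrosis, lobular_inflammation,
               glycogenated_nuclei, microgranuloma):
    """Major factors (0-2 each) weighted x2, plus minor factors (0-1 each)."""
    for name, value, upper in (
        ("ballooning", ballooning, 2),
        ("fibrosis", fibrosis, 2),
        ("lobular_inflammation", lobular_inflammation, 1),
        ("glycogenated_nuclei", glycogenated_nuclei, 1),
        ("microgranuloma", microgranuloma, 1),
    ):
        if not 0 <= value <= upper:
            raise ValueError(f"{name} must be between 0 and {upper}")
    major = 2 * (ballooning + fibrosis)
    minor = lobular_inflammation + glycogenated_nuclei + microgranuloma
    return major + minor


def classify(score):
    """Apply the diagnostic ranges 0-3, 4-5, and 6-11."""
    if not 0 <= score <= 11:
        raise ValueError("score must be between 0 and 11")
    if score <= 3:
        return "not-NASH"
    if score <= 5:
        return "borderline"
    return "NASH"


# Hypothetical case: ballooning level 1, fibrosis stage 1A, no minor factors.
print(classify(nash_score(1, fibrosis_tier("1A"), 0, 0, 0)))
```

With both major factors at their maximum and all three minor factors present, the total is 2 × (2 + 2) + 3 = 11, which matches the upper end of the stated diagnostic range.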
Notes
Author contributions
Conceptualization: SYJ.
Data curation: ESJ.
Formal analysis: KL, ESJ.
Funding acquisition: ESJ, SYJ.
Investigation: KL, ESJ.
Methodology: KL, ESJ, EY, SYJ.
Project administration: SYJ.
Resources: KL, ESJ, EY, YKK, MYC, JMK, WSM, JSJ, CKP, JBP, DYK, JHS, SYJ.
Software: KL.
Supervision: SYJ.
Validation: KL, SYJ.
Visualization: KL.
Writing—original draft: KL.
Writing—review & editing: KL, SYJ.
Conflicts of Interest
The authors declare that they have no potential conflicts of interest.
Funding
This study was supported by the Academic Research Fund from the Korean Society of Pathologists.
Acknowledgements
We are grateful to all members of the Gastrointestinal Pathology Study Group of the Korean Society of Pathologists, particularly Eunsil Yu for scanning the virtual slides.