In the present study, univariate CLR analysis identified 16 significant factors: low T-score, walking difficulty, low BMI, low MMSE score, low milk intake, a significant fall at home, low education, smoking habit, fracture after age 55 years, fecal incontinence, vision impairment, presence of major diseases, ADL difficulty, IADL difficulty, no regular exercise, and coordination abnormality. The first six factors remained statistically significant in stepwise multivariate analysis, with low T-score being the most important among them. In the comparison of ANN and CLR for fracture risk assessment, ANN showed statistically higher discrimination and calibration power in both the modeling and testing datasets in the cross-validation analyses.
In the literature, various clinical risk factors have been reported for hip fractures
, but their combined effects on fracture prediction vary. The present matched case-control study investigated a broad range of potential personal and environmental risk factors. The 16 significant factors retained in univariate analysis were mostly personal and modifiable. This outcome supports the finding that at-home falls among elderly people are mainly due to impaired general health rather than external hazards
, and emphasizes the importance of improving bone strength and general health for fracture prevention. It has been reported that milk supplementation can increase bone density in Chinese women
[17, 18], and low milk intake was accordingly associated with high fracture risk in our study. Low milk intake might also partly explain why low education level was associated with high fracture risk
. Walking difficulty and a low MMSE score could account for vision impairment, poor coordination, and ADL and IADL difficulty. Low BMD, the most significant variable in our analyses, could account for smoking habit, associated diseases, lack of exercise, fecal incontinence, and previous fractures. BMD measurement is an important tool for assessing osteoporosis: it can be used for diagnosis, monitoring of treatment, and fracture risk prediction. Hip fracture risk increased 3.7-fold per SD decrease in femoral neck BMD at the age of 50 years
. The present study supports the finding that combining BMD and clinical risk factors can further improve the prediction of hip fracture, and emphasizes the value of a multifactorial approach for patients at risk.
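The gradient of risk cited above is multiplicative per SD of BMD. As a minimal sketch (the 3.7-fold figure comes from the reference above; the example T-scores are purely illustrative, and a T-score is taken as SDs below the young-adult mean):

```python
# Relative risk under a multiplicative gradient-of-risk model:
# RR = gradient ** (number of SDs below the reference mean).
# The 3.7 figure is the per-SD gradient reported in the cited study.
GRADIENT = 3.7

def relative_risk(sd_below_mean: float) -> float:
    """Relative hip fracture risk for a femoral neck BMD the given
    number of SDs below the reference mean (illustrative only)."""
    return GRADIENT ** sd_below_mean

print(round(relative_risk(1.0), 1))  # 3.7  (one SD below the mean)
print(round(relative_risk(2.5), 1))  # 26.3 (e.g. at the WHO osteoporosis
                                     # threshold of T = -2.5)
```

The key point is that risk compounds multiplicatively, so modest-looking per-SD gradients translate into large risks at low T-scores.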
Logistic regression and ANN are currently the most widely used models for diagnosis and prognosis studies in biomedicine. Logistic regression has the advantages of highly interpretable model parameters and ease of use, but its reliance on linear combinations of variables is not well suited to modeling the highly nonlinear, complex interactions demonstrated in biologic and epidemiologic systems
. ANNs, with their resemblance to the human brain, are appealing as flexible nonlinear systems that show robust performance in dealing with noisy, incomplete, or missing data and have the ability to generalize. They may be better at predicting outcomes when the relationships between variables are multidimensional, as found in complex biological systems. The ANN model allows inclusion of a large number of variables, and few assumptions (such as normality) need to be verified. However, the comparative performance of these two methods has been widely reported in the literature with great controversy. In a review of 28 major studies carried out by Sargent
, ANN was superior in 10 studies (36%), logistic regression was superior in 4 (14%), and performance was similar in the remaining 14. In another review of 72 papers conducted by Dreiseitl and Ohno-Machado
, when statistical tests were applied, the two models performed similarly in 42% of comparisons, ANN was better in 18%, and logistic regression was better in 1%; by contrast, without statistical tests, ANN was better in 33% and logistic regression in 6%. The authors also surveyed methodological quality and found that ANN model-building details were not reported in 49% of papers, statistical testing was lacking in 39%, and calibration information was lacking in 75%. ANN is theoretically more flexible than logistic regression because of its multi-layer networks, but on the other hand it is threatened by over-fitting and instability
. In particular, there are still no established methods for constructing ANN models
, which may lead to the wide variation in the comparative results.
An over-fitted ANN model, trained too closely on the limited available data, loses its ability to generalize; a network that generalizes well can produce reasonable outputs for new, unseen data. A commonly used method to improve generalization in data mining is a three-way data split with cross validation
as in the present study. The modeling datasets were split into training and validation subsets. The error on the validation subset was monitored during the training epochs, and once this error began to increase, training was stopped (early stopping). The network with the lowest validation error was chosen. This approach can yield good generalization without training on all possible data. Another practical problem is ANN instability
which means that changes in the training data may produce very different models and consequently different performance on unseen data. The instability arises from training becoming trapped in different local minima of the error surface. This problem can be mitigated by building ANN ensembles and aggregating the outputs of the individual networks
. The aggregated outputs of diversified individual networks have lower variance and smaller bias than a single network. Furthermore, the 10-fold cross-splitting method used for building the ANN ensembles ensured that each datum was used equally for both training and validation. The present study showed that ANN significantly outperformed CLR in discrimination and calibration in both the 16- and 6-variable models. However, comparisons made on the ANN training or validation subsets could be biased in favor of ANN; we therefore used the cross-validation testing datasets to compare the generalization of ANN and CLR. Besides, as shown in Table
2, comparison of discrimination on a single testing dataset might show no significant difference, or even higher accuracy for CLR. This might explain the marked inconsistency among comparisons of these two classifiers reported in the literature, especially when statistical testing was not performed
[15, 25]. In the present study, nonparametric tests for paired samples across the 10 cross-validation groups could detect significant differences between the two classifiers in datasets with varied patterns.
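The fold-wise paired comparison can be sketched as follows. This is one simple nonparametric variant (an exact sign-flip permutation test on per-fold AUC differences), not necessarily the exact test the study used, and the per-fold AUC values below are hypothetical:

```python
from itertools import product

# Hypothetical per-fold AUCs from 10-fold cross validation
# (NOT the study's actual values), one (ANN, CLR) pair per testing fold.
auc_ann = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.90, 0.91, 0.88]
auc_clr = [0.86, 0.85, 0.88, 0.87, 0.84, 0.89, 0.85, 0.86, 0.87, 0.83]

def paired_permutation_test(a, b):
    """Exact two-sided sign-flip permutation test for paired samples.

    Under H0 (no difference between classifiers), each paired
    difference is equally likely to carry either sign; enumerate all
    2**n sign assignments and count how often the mean difference is
    at least as extreme as the observed one."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / len(diffs)
        if stat >= observed:
            extreme += 1
    return extreme / total

p = paired_permutation_test(auc_ann, auc_clr)
print(f"p = {p:.4f}")  # ANN above CLR in every fold -> small p
```

Because the test pairs the two classifiers fold by fold, it exploits the fact that both were evaluated on the same 10 testing subsets, which is what allows a significant difference to be detected even when any single testing fold, taken alone, would not reach significance.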
Sensitivity, specificity and accuracy determined according to a pre-specified cutoff point are also commonly used for comparing the performance of the classifiers
. Indeed, the risk score computed by a classifier may be affected by disease prevalence, so selection of the cutoff point is important for a fair comparison. In the present study, the Youden index, defined by the point on the ROC curve that minimizes the sum of the false positive and false negative rates, best differentiates between subjects with and without disease when equal weight is given to sensitivity and specificity. Using the Youden index as the cut-off point is independent of disease prevalence and makes the predictive models more applicable to different series of patients
. It has been reported that a cut-off point arbitrarily set at a risk score of 0.5 might lead to biased results and unfair comparisons.
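The cutoff selection described above can be sketched as follows. The risk scores and labels are hypothetical; the sketch scans each observed score as a candidate cutoff and maximizes Youden's J = sensitivity + specificity − 1, which is equivalent to minimizing the sum of the false positive and false negative rates:

```python
# Hypothetical predicted risk scores and true labels (1 = fracture).
scores = [0.10, 0.25, 0.30, 0.45, 0.55, 0.60, 0.70, 0.85, 0.90, 0.95]
labels = [0,    0,    0,    1,    0,    1,    1,    1,    1,    1]

def youden_cutoff(scores, labels):
    """Return (best_cutoff, best_J), where J = sensitivity +
    specificity - 1; maximizing J minimizes FPR + FNR."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_c, best_j = None, -1.0
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j

cutoff, j = youden_cutoff(scores, labels)
print(cutoff, round(j, 2))  # 0.6 0.83
```

In practice the midpoints between adjacent sorted scores are often scanned instead of the scores themselves, but the selected operating point is the same; the essential property is that the chosen cutoff weights sensitivity and specificity equally rather than fixing the score at 0.5.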
The present study had limitations. First, as a matched case-control study, age and sex were not included in the predictive models; this exclusion might lower the performance of the classifiers. Second, some clinical risk factors, such as the geometry of the proximal femur or a maternal history of hip fracture, were not included, because the former is not a routine examination for the elderly and the latter might be subject to information or reporting bias. Third, all continuous variables were converted to binary variables using the Youden index as the cut-off point. This maximized the difference between cases and controls, made the comparison fairer, and eased clinical application; however, some important information might be lost if the distribution of a variable was complex
. Fourth, the comparison might have been unfair to CLR because interaction terms and quadratic functions were not included; however, such terms are not routinely examined in conventional analyses, and no significant interactions between the input variables were found in the present study. Fifth, partitioning participants with the 10-fold cross-validation method might have produced validation and testing samples that were too small, increasing the variance
. Moreover, this sample size was also insufficient for a standard HL analysis, which requires at least 400 cases
. The bootstrap resampling method might be another option for improving the efficiency of validation. Last, although considerable effort, through much trial and error, was made to optimize the design of the neural networks, they could still be further improved in model topology or ensemble method.
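The bootstrap alternative mentioned above can be sketched with an out-of-bag scheme: resample subjects with replacement, and evaluate on the subjects left out of each resample (on average roughly 36.8% of the data). The dataset and the fixed-cutoff "classifier" below are hypothetical stand-ins; in a real application the model would be refit on each bootstrap sample before being scored on the out-of-bag subjects:

```python
import random

random.seed(0)

# Hypothetical dataset: (risk_score, true_label) pairs.
data = [(random.random(), random.randint(0, 1)) for _ in range(100)]

def accuracy(sample, cutoff=0.5):
    """Toy metric: fraction classified correctly at a fixed cutoff.
    Stands in for AUC or any other performance measure."""
    correct = sum(1 for s, y in sample if (s >= cutoff) == (y == 1))
    return correct / len(sample)

def bootstrap_oob(data, n_boot=200):
    """Out-of-bag bootstrap validation: draw n subjects with
    replacement, score the model on the subjects not drawn, and
    average the out-of-bag scores over n_boot resamples."""
    n = len(data)
    oob_scores = []
    for _ in range(n_boot):
        drawn = {random.randrange(n) for _ in range(n)}
        oob = [data[i] for i in range(n) if i not in drawn]
        if oob:  # an empty out-of-bag set is vanishingly rare for n=100
            oob_scores.append(accuracy(oob))
    return sum(oob_scores) / len(oob_scores)

print(round(bootstrap_oob(data), 3))
```

Unlike a single 10-fold split, every resample reuses the full sample size for fitting, which is why bootstrapping is often suggested when cross-validation folds become too small.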