About Me

Geographer from Unicamp (2014), including a year of university exchange at the University of Wisconsin (USA). Experienced in geotechnologies, GIS and urban planning, having interned at Agemcamp, the American Red Cross and, currently, the Grupo de Apoio ao Plano Diretor at Unicamp.

Thursday, March 21, 2013

Chi-Square Testing



Introduction

For a state department of tourism, it is important to understand how tourism potential varies across space. This knowledge allows public incentives to be directed to each area according to its specific potential. In Wisconsin, the region known as Up North is assumed to have high tourism potential. This project analyzes whether the northern counties of Wisconsin genuinely stand out from the southern counties.

A large dataset was provided, from which a few variables were chosen for analysis. As a preliminary examination, the focus was the capacity to house tourists, so the variables selected were the number of campsites, hotel beds and seasonal homes.

To separate the counties, Highway 29 was used as the boundary between north and south. A Chi-Square test on these variables can then indicate whether their distribution is related to a county's position north or south of the highway.

Methodology

The first step was to use ArcGIS to divide Wisconsin into north and south. Two datasets were imported from the ESRI database: counties and highways. To work only with Wisconsin counties, a query was run on the State field, and the selection was exported as a feature class containing only Wisconsin counties. This feature class was re-projected to Wisconsin Transverse Mercator (NAD 1983) to minimize projection distortion.

The U.S. highways were then added on top of this feature class. Another query selected only Highway 29, which was saved as a simple layer to show which counties lie to its north or south. It was not necessary to create a feature class for the highway because it was only needed temporarily.

A new field was created to hold the position of each county (north or south), and an editing session was started to fill it in. The counties north of the highway were selected and given the code “1” all at once, and the counties to the south were coded in the same way with “2”. With that, Wisconsin was divided into North and South (Figure 1).

Figure 1 – Northern and Southern Wisconsin


The database provided contained a large amount of tourism information for each county, so the look-up table was essential to identify which fields were most relevant. As stated before, the decision to deal with housing focused the approach on seasonal homes, hotel beds and campsites. To map these variables, the stand-alone table containing them had to be joined to the counties feature class.
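
The join itself was done in ArcGIS, but the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `join_attributes`, the field names and the county rows are all hypothetical, and the numbers are made up rather than taken from the project's table.

```python
def join_attributes(features, table, key, fields):
    """Copy the chosen fields from table rows onto features sharing the same key."""
    lookup = {row[key]: row for row in table}
    joined = []
    for feature in features:
        match = lookup.get(feature[key], {})
        merged = dict(feature)
        for field in fields:
            # None marks a county with no matching row in the table.
            merged[field] = match.get(field)
        joined.append(merged)
    return joined


# Hypothetical county features (position 1 = north, 2 = south).
counties = [
    {"name": "County A", "position": 1},
    {"name": "County B", "position": 2},
    {"name": "County C", "position": 2},
]

# Hypothetical stand-alone tourism table; County C is deliberately missing.
tourism = [
    {"name": "County A", "campsites": 120, "hotel_beds": 300, "seasonal_homes": 4500},
    {"name": "County B", "campsites": 80, "hotel_beds": 2100, "seasonal_homes": 600},
]

joined = join_attributes(counties, tourism, "name",
                         ["campsites", "hotel_beds", "seasonal_homes"])
for row in joined:
    print(row)
```

Keeping only the three selected fields mirrors the simplified table used in the project; unmatched counties are kept with empty values so they can be spotted before any classification.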

First, since the table contained many unnecessary fields, all of them were hidden in the table's field properties except the county name and the selected variables. This simplified table helped keep the procedure organized. Subsequently, three new fields were created to hold the category each variable would fall into; for simplicity, the codes 1, 2, 3 and 4 were used (Figure 2).

Figure 2 – Simplified table to maintain organization


These categories are necessary for the statistical test. Several classification methods could be used: natural breaks (the default), standard deviation, equal interval and quantile. In this case, equal interval was used to symbolize the features on a map. With these intervals mapped, it was possible to select each category and fill in the new fields, following a procedure similar to the north-south classification described before.
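
To make the difference between two of these methods concrete, here is a minimal Python sketch of equal interval and quantile classification into four codes. It is not how ArcGIS computes its breaks internally; the function names are hypothetical and the hotel-bed counts are invented for illustration only.

```python
def equal_interval_breaks(values, k=4):
    """Upper bounds of k classes that each span the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k=4):
    """Upper bounds of k classes that each hold about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[max(0, min(n - 1, (n * i) // k - 1))] for i in range(1, k + 1)]


def classify(value, breaks):
    """Return the 1-based class code of value, given ascending upper bounds."""
    for code, upper in enumerate(breaks, start=1):
        if value <= upper:
            return code
    return len(breaks)


# Hypothetical hotel-bed counts for eight counties (not the project's data).
beds = [100, 150, 200, 250, 300, 400, 900, 4100]

ei_codes = [classify(v, equal_interval_breaks(beds)) for v in beds]
qt_codes = [classify(v, quantile_breaks(beds)) for v in beds]

print("equal interval:", ei_codes)  # [1, 1, 1, 1, 1, 1, 1, 4]
print("quantile:      ", qt_codes)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With one very large value in the sample, equal interval pushes almost every county into the lowest class, while the quantile method spreads the counties evenly across the four codes.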

After the table was organized and classified, it was exported as a .dbf file so that it could be analyzed in SPSS. Then it was possible to start hypothesis testing with the Chi-Square method. This method compares the observed distribution of a sample with the distribution that would be expected. For that, two hypotheses are stated: the null hypothesis and the alternative hypothesis.

The null hypothesis states that the observed frequencies fit the expected distribution, so there is no significant difference between them and any deviation can be attributed to chance. The alternative hypothesis is the opposite scenario: the observed frequencies do not fit the expected distribution, so the difference is significant and is not attributable to chance.

Therefore, the main procedure in this exercise is to test these hypotheses for each variable. SPSS performs the calculations, and the results are evaluated at a confidence level of 95% (a significance level of 0.05).
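
The calculation SPSS performs on a crosstab can be sketched by hand. The example below assumes the test is a Chi-Square test of independence on a 2 x 4 table (position by category), which gives 3 degrees of freedom; the counts are hypothetical, not the project's, and the function names are invented for this sketch. The p-value uses the closed form of the chi-square upper tail that holds for 3 degrees of freedom only.

```python
import math


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof, expected


def chi_square_p_value_df3(stat):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(stat / 2))
            + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))


# Hypothetical counts of counties per category (columns 1-4).
observed = [
    [15, 8, 5, 2],  # position 1: north
    [35, 4, 2, 1],  # position 2: south
]

stat, dof, expected = chi_square_independence(observed)
p_value = chi_square_p_value_df3(stat)
reject_null = p_value < 0.05  # significance level for 95% confidence

print(f"chi-square = {stat:.3f}, df = {dof}, p = {p_value:.3f}, reject H0: {reject_null}")
```

Comparing the p-value with 0.05 is equivalent to comparing the statistic with the critical value for 3 degrees of freedom, about 7.815. Tables with several small expected counts, as in this toy example, are usually treated with caution.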

Results

The first variable analyzed was the number of hotel beds per county. A preliminary interpretation of a map symbolized by quantities shows that the north is not prominent in the number of hotel beds. On the contrary, the counties with the highest numbers are located in the southern portion of Wisconsin (Figure 3).

Figure 3 – Distribution of hotel beds in Wisconsin


It is important to note that the visualization is affected by the classification method used. The equal interval method commonly places most elements in the lowest categories, rather than spreading them out as the quantile method would, where each category has the same number of occurrences. Hence, the analysis should not rest simply on the interpretation of one map in one classification method. The map works here as a preliminary view of the distribution, whose general pattern would be recognizable with any of the methods. The Chi-Square test (Table 1) can then support the idea seen in the map or reveal a possible distortion.
Table 1 – Chi-Square Test for Hotel Beds


In the Chi-Square results, the p-value of 0.676 is far higher than the significance level of 0.05, so the test fails to reject the null hypothesis: the observed distribution of hotel beds does not differ significantly from the expected distribution. The test therefore gives no evidence that the north-south split is related to this variable; the differences between positions 1 and 2 (north and south) can be attributed to chance. The statistical result thus supports what the map suggested: the north is not prominent in this variable. However, the results also show that the south is not prominent either; statistically, neither region stands out in this respect.

A similar result is observed for the distribution of campsites (Figure 4). The southern counties appear in higher categories than the northern counties, and the same caveat about the classification method applies. In this case, however, more counties in both north and south fall in at least the second category, so the variation is greater than for hotel beds.

Figure 4 – Distribution of Campsites in Wisconsin


The apparent lack of difference between north and south is confirmed by the Chi-Square results (Table 2). The p-value, 0.637, is again far higher than the significance level of 0.05, so the test fails to reject the null hypothesis. Again, there is no significant difference between the observed and expected frequencies of campsites, and the variation can be attributed to chance. Neither southern nor northern Wisconsin has a notable distribution of campsites when compared with the other.
Table 2 – Chi-Square Testing for Campsites in Wisconsin

Finally, the number of seasonal homes does show a difference between north and south when mapped in the four categories (Figure 5). Almost all the southern counties are in the lowest category, while ten northern counties appear in the upper categories. However, as noted before, viewing a variable in one classification method is not enough to establish a real difference.
Figure 5 – Distribution of Seasonal Homes in Wisconsin.


The results of the Chi-Square test are then analyzed to check the interpretation of the mapped distribution (Table 3). Unlike the other two variables analyzed in this project, the p-value here is very low: 0.002. Because it is well below the significance level of 0.05, the null hypothesis can be rejected.
Table 3 – Chi-Square Testing for Seasonal Homes

Consequently, there is a significant difference between the observed and expected distributions of seasonal homes, one that is unlikely to have occurred by chance. This result only means that the north-south classification is related to the number of seasonal homes; it does not say how. Hence, it is necessary to compare the observed and expected counts. For position 1 (North), the expected counts are higher in the lower categories (fewer seasonal homes) and lower in the higher categories (more seasonal homes). The observed counts show the opposite tendency, which is why it can be concluded that the northern region has more counties with many seasonal homes than expected in comparison with the south. For this last variable, the map illustrates the statistical results well: seasonal homes are concentrated in northern Wisconsin.
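
One way to read the direction of a significant result is to look at the residuals, cell by cell. The sketch below computes Pearson residuals, (observed - expected) / sqrt(expected), for a hypothetical 2 x 4 table; the counts and function names are invented for illustration and are not the values reported by SPSS in Table 3.

```python
import math


def expected_counts(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(observed):
    """Signed, scaled gap between observed and expected counts for every cell."""
    expected = expected_counts(observed)
    return [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(observed, expected)]


# Hypothetical counts of counties per seasonal-home category (columns 1-4).
observed = [
    [15, 8, 5, 2],  # position 1: north
    [35, 4, 2, 1],  # position 2: south
]

residuals = pearson_residuals(observed)
for label, row in zip(("North", "South"), residuals):
    print(label, [round(r, 2) for r in row])
```

A negative residual means fewer counties than expected in that category and a positive one means more. In this made-up table the north is below expectation in category 1 and above it in categories 2 to 4, which is the kind of pattern described for seasonal homes.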

Conclusion
Taking all the results together, it is possible to say that, within the housing component of Up North tourism, hotel and camping accommodations are not the strength of the area. For two of the three variables the test failed to reject the null hypothesis, but that does not mean that the Up North lacks tourism potential; the choice of variables needs to be considered. Since the last variable, seasonal homes, departed strongly from the expected frequencies, the accommodation resources of the Up North are not simply unremarkable in comparison with the rest of the state. Rather, within those resources, seasonal homes are much more prominent than alternatives such as hotels and camping.

One possible reason is a predominance of recurring tourism, given that seasonal homes are more permanent than campsites and hotels: it is generally the same family that visits the Up North on specific occasions, rather than tourists from elsewhere who do not necessarily have a connection to the area. It can also be suggested that the lower temperatures of the northern area discourage camping, which would explain why it is not prominent. However, nothing in this project demonstrates these relationships; they are only possible causes for the results found.

It is also important to remember that this project only analyzed variables related to housing for tourists, so a deeper analysis with more diverse variables would be needed to better characterize the whole concept of the Up North.
