About Me

Geographer graduated from Unicamp (2014), including a one-year university exchange at the University of Wisconsin (USA). Experienced in geotechnologies, GIS and urban planning, having interned at Agemcamp, the American Red Cross and, currently, the Grupo de Apoio ao Plano Diretor da Unicamp.

Thursday, March 21, 2013

Chi-Square Testing



Introduction

For a state department of tourism, it is important to understand spatial differences in tourism potential. This knowledge allows public incentives to be directed to different areas according to their specific potential. In Wisconsin, the region known as Up North is assumed to have high tourism potential. This project analyzes whether the northern counties genuinely stand out over the southern counties in Wisconsin.

A large dataset was provided, from which a few variables were chosen for analysis. As a preliminary examination, the capacity to house tourists was the main subject, so the variables selected were the number of campsites, hotel beds and seasonal homes.

To distinguish the counties, Highway 29 was used as the boundary between north and south. With that division, a Chi-Square analysis of these variables across the state can show whether any of them is characteristic of the north.

Methodology

The first step was to use ArcGIS to divide Wisconsin into north and south. For that, two datasets were imported from the ESRI database: counties and highways. To work only with Wisconsin counties, a query on the State field was run, and the selection was exported as a feature class containing only Wisconsin counties. The feature class was re-projected to Wisconsin Transverse Mercator (NAD 1983) to minimize projection distortion.

The U.S. highways were then added over this feature class. Another query selected only Highway 29, which was saved as a simple layer to show which counties lie north or south of it. It was not necessary to create a feature class for this highway because its use is only temporary.

After a field was created to hold the position of each county (north or south), an editing session was needed to fill in the new attribute. By selecting the counties located north of the highway, it was possible to give them the code “1” all at once; the counties to the south received the code “2” in the same way. With that, Wisconsin was divided into North and South (Figure 1).

Figure 1 – Northern and Southern Wisconsin


A large database was provided with extensive tourism information about each county, and its look-up table was essential to identify which fields were most relevant to the question. As said before, the decision to deal with housing focused the approach on seasonal homes, hotel beds and campsites. To map these variables, the stand-alone table had to be joined to the counties feature class.

First, since the table held a lot of unnecessary data, all fields were hidden in the table’s field properties except the county name and the selected variables. This simplified table helped to keep the procedure organized. Subsequently, three new fields were created to hold the category each variable would fall into; for simplicity, the codes 1, 2, 3 and 4 were used (Figure 2).

Figure 2 – Simplified table to maintain organization


These categories are necessary for the statistical test, which works on counts per category. Different classification methods could be used: natural breaks (the default), standard deviation, equal interval and quantile. In this case, equal interval was used to symbolize the features on a map. With these intervals mapped, it was possible to select the counties in each category and fill in the new fields, following a procedure similar to the north-south classification described before.
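
The difference between equal interval and quantile classification can be sketched in a few lines of Python. This is only an illustration: the function names and the eight county counts are hypothetical, and the class breaks computed by ArcGIS may differ in detail.

```python
import math

def equal_interval_breaks(values, k):
    """Upper bounds of k classes of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]

def quantile_breaks(values, k):
    """Upper bounds of k classes holding roughly equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[math.ceil(n * i / k) - 1] for i in range(1, k + 1)]

def classify(value, breaks):
    """Return the 1-based code of the first class whose upper bound fits."""
    for code, upper in enumerate(breaks, start=1):
        if value <= upper:
            return code
    return len(breaks)  # guard against floating-point rounding at the maximum

# Hypothetical counts for eight counties (not taken from the project data).
counts = [10, 20, 30, 40, 50, 60, 70, 400]

equal_codes = [classify(v, equal_interval_breaks(counts, 4)) for v in counts]
quantile_codes = [classify(v, quantile_breaks(counts, 4)) for v in counts]
print(equal_codes)     # [1, 1, 1, 1, 1, 1, 1, 4]
print(quantile_codes)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With one very large value, equal interval pushes almost every county into the lowest class, while the quantile method spreads the counties evenly across the four codes.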

Once the table was organized and classified, it was exported as a .dbf file so it could be analyzed in SPSS. Then it was possible to start the hypothesis testing using the Chi-Square method. This method compares the observed distribution of a sample with the distribution that would be expected. For that, two hypotheses are stated: the null hypothesis and the alternative hypothesis.

The null hypothesis is that the observed frequencies fit the expected distribution, so there is no significant difference between them; any variation in the sample is random and happens by chance. The alternative hypothesis is the opposite scenario: the observed frequencies do not fit the expected distribution, there is a significant difference between these values, and the pattern in the sample is not random and does not happen by chance.
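
As a rough sketch of what the test computes, the Python below builds the expected counts for a north/south by category table and sums the Chi-Square statistic. The counts are hypothetical, not the project data, and the function name is an assumption. SPSS reports a p-value; to stay within the standard library, this sketch instead compares the statistic with the tabulated critical value for 3 degrees of freedom at the 0.05 level (7.815).

```python
def chi_square(observed):
    """Chi-Square statistic, degrees of freedom and expected counts
    for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Hypothetical table: rows are position 1 (north) and 2 (south),
# columns are categories 1 to 4 of one variable.
observed = [
    [10, 10, 10, 10],
    [25, 5, 5, 5],
]

statistic, dof, expected = chi_square(observed)
CRITICAL_0_05_DOF_3 = 7.815  # tabulated Chi-Square critical value

print(round(statistic, 3), dof)         # 11.429 3
print(statistic > CRITICAL_0_05_DOF_3)  # True: reject the null hypothesis
```

A statistic above the critical value corresponds to a p-value below 0.05, which is the same decision rule applied to the SPSS output.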

Therefore, the main procedure in this exercise is to test the hypotheses for each variable. SPSS makes the calculations, and the results are read at the default confidence level of 95% (a significance level of 0.05).

Results

The first variable analyzed was the number of hotel beds per county. A preliminary interpretation of a map symbolized by quantities shows that the north is not prominent in the number of hotel beds. On the contrary, the counties with the highest numbers are located in the southern portion of Wisconsin (Figure 3).

Figure 3 – Distribution of hotel beds in Wisconsin


It is important to notice that the visualization is affected by the classification method used. The equal interval method commonly places more elements in the lowest categories, instead of showing the more varied picture a quantile method would give, with each category holding the same number of occurrences. Hence, the analysis should not be simply the interpretation of one map under one classification method. The map in this case works as a preliminary view of the distribution, which would be recognizable under any of the methods. The Chi-Square test (Table 1) can then support the idea seen in the map or reveal a possible distortion.
Table 1 – Chi-Square Test for Hotel Beds


Analyzing the Chi-Square results, the p-value of 0.676 is much higher than the significance level of 0.05. The test therefore fails to reject the null hypothesis: the observed distribution of hotel beds has no significant difference from the expected distribution. So the north-south split does not explain the number of hotel beds; the variation happens by chance and has no relation with positions 1 and 2 (north and south). The statistical result thus confirms the idea interpreted from the map: the north is not prominent in this variable. However, the results show that the south is not prominent either; statistically, the two regions are similar in this respect.

A similar result is observed in the distribution of campsites (Figure 4). The southern counties appear in higher categories than the northern counties, and the same caveat about the classification method applies. In this case, however, more northern counties fall at least in the second category, and the same happens in the south, so the variation is higher than for hotel beds.

Figure 4 – Distribution of Campsites in Wisconsin


The apparent lack of difference between the distributions in north and south is confirmed by the results of the Chi-Square test (Table 2). The p-value, 0.637, is again much higher than the significance level of 0.05, so the test fails to reject the null hypothesis. Again, the variation happens by chance, showing that there is no significant difference between the observed and expected frequencies of campsites. Neither southern nor northern Wisconsin has a notable distribution of campsites when compared with the other.
Table 2 – Chi-Square Testing for Campsites in Wisconsin

Lastly, the number of seasonal homes finally shows some difference between north and south when mapped in the four categories (Figure 5). Almost all the southern counties are in the lowest category, while ten counties from the north appear in the upper categories. However, as said before, viewing a variable under one specific classification method is not enough to guarantee a real difference.
Figure 5 – Distribution of Seasonal Homes in Wisconsin.


The results of the Chi-Square test are then analyzed to check the interpretation of the mapped distribution (Table 3). Unlike the other two variables in this project, the p-value here is very low: 0.002. This is far below the significance level of 0.05, so it is possible to reject the null hypothesis.
Table 3 – Chi-Square Testing for Seasonal Homes

Consequently, there is a significant difference between the observed and expected distributions of seasonal homes. The pattern does not occur by chance and is not random. This result, however, only means that the north-south classification is related to the number of seasonal homes; it does not say how. Hence, it is necessary to compare the observed counts with the expected counts. For position 1 (North), the expected counts are higher in the lower categories (fewer seasonal homes) and lower in the higher categories (more seasonal homes). The observed counts show the opposite, which is why it can be determined that the northern region has a higher frequency of seasonal homes than expected in comparison with the south. For this last variable, the map illustrates well the result of the statistical test: the concentration of seasonal homes in northern Wisconsin.

Conclusion
Gathering all the results of this project, it is possible to affirm that, in the housing component of Up North tourism, hotel and camping accommodations are not the strength of the area. Two of the three variables failed to reject the null hypothesis, but that does not mean that the Up North lacks tourism potential; the choice of variables needs to be considered. Since the last variable, seasonal homes, departed strongly from the expected frequencies, the accommodation resources of the Up North are not simply standard and unremarkable in comparison with the rest of the state. Within the accommodation resources, seasonal homes are much more prominent there than other forms such as hotels and camping.

The reasons for that may lie in a predominance of recurring tourism, considering that seasonal homes are more permanent than campsites and hotels: it is generally the same family that visits the Up North on specific occasions, rather than random tourists from everywhere who do not necessarily have a relation with the area. It can also be suggested that the lower temperatures of the northern area discourage camping, which is why it is not prominent. However, nothing in this project demonstrates these relationships; they are only possible causes for the results found.

It is also important to remember that this project only analyzed variables related to housing for tourists, so a deeper analysis with more diverse variables would be needed to better characterize the whole concept of the Up North.

Tuesday, March 12, 2013

Lottery Sales in Milwaukee - Mean Center, Standard Distance and Weight


Introduction

Statistical and spatial analyses are widely used to support or refute the arguments of different organizations, being considered an impartial, science-based position, free from individual interests. This is not completely true, since the scientific process includes a number of decisions that can lead to different results, but awareness of those decisions can provide a realistic answer to the problems observed.

In this project, a hypothetical scenario is used to analyze the dynamics of a spatial distribution: in Milwaukee County, civil rights groups claim that lottery ticket sales are concentrated mostly in areas dominated by minorities. The goal of the project is to explore, examine and interpret the related data with geostatistical tools, determining whether these claims are justified.

Methodology

To answer this question, the data available were a table with the addresses of the lottery outlets and their sales amounts, and a feature class of census data on the race of the population, with the percentage of non-white population. For the purposes of this project, the non-white population represents the minorities.

The first step in analyzing this data is to geocode the table of lottery addresses, using the Geocoding Tool in ArcMap. After the point feature was created and the data to be used was gathered, a geodatabase was created to guarantee reliability and organization within the dataset. The projection of the features also needed to be changed to minimize distortion. The choice was Wisconsin State Plane, South Zone, where Milwaukee County is located, since the smaller the area a projection is made for, the less distortion it will have.

With these standards set up, the geostatistical and symbolization tools can be applied to explore the data. For that, it is important to present the concepts of mean center, standard distance and the application of weight.

Mean Center is a parameter derived from the simple mean of a sample. However, instead of using an attribute of each entity, it uses the coordinates: the mean of all the X coordinates gives the X mean, and likewise for the Y coordinates. The combination of the X mean and the Y mean gives a point that represents the mean center.

Standard Distance can be considered a spatial version of the simple standard deviation. The mean, where the standard deviation is 0, is in this case the mean center. Because distance from that center is measured in every direction, there is no negative standard deviation. A circle is drawn around the mean center with a radius of one standard distance, calculated from the differences between the feature coordinates and the mean center coordinates, so the features inside it lie within the first standard deviation.
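
Both parameters can be sketched in Python as follows. The four points are hypothetical, the function names are assumptions, and the standard distance divides by the number of points; this illustrates the concept and is not the code behind the ArcToolbox tools.

```python
import math

def mean_center(points):
    """Mean of the X coordinates and mean of the Y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def standard_distance(points):
    """Root of the mean squared distance from each point to the mean center."""
    cx, cy = mean_center(points)
    n = len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

# Hypothetical outlet locations at the corners of a 4 x 4 square.
outlets = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]

print(mean_center(outlets))                  # (2.0, 2.0)
print(round(standard_distance(outlets), 3))  # 2.828
```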

In both situations, a different analysis can be made by adding a weight to each parameter. That is useful when the features have important attributes to explore. The values of the chosen attribute give each feature a different level of importance in the calculation of the mean and the standard deviation, which is why the parameter is said to be weighted. Without the weight, each feature has equal relevance in the calculation; with the weight, each feature is treated differently, depending on the attribute chosen.
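
A minimal sketch of the weighted versions, assuming two hypothetical outlets where the second sells three times as much as the first (function names are also assumptions):

```python
import math

def weighted_mean_center(points, weights):
    """Mean center where each point counts in proportion to its weight."""
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    return cx, cy

def weighted_standard_distance(points, weights):
    """Standard distance around the weighted mean center."""
    cx, cy = weighted_mean_center(points, weights)
    total = sum(weights)
    spread = sum(
        w * ((x - cx) ** 2 + (y - cy) ** 2)
        for (x, y), w in zip(points, weights)
    )
    return math.sqrt(spread / total)

# Hypothetical data: two outlets 10 units apart, with sales of 1 and 3.
outlets = [(0.0, 0.0), (0.0, 10.0)]
sales = [1.0, 3.0]

print(weighted_mean_center(outlets, sales))  # (0.0, 7.5)
print(round(weighted_standard_distance(outlets, sales), 3))  # 4.33
```

The unweighted mean center of these two points would be (0.0, 5.0); the weight pulls it toward the outlet with higher sales.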

At first, these parameters are applied to the pure location of the lottery outlets, without considering the differences in their sales. Then, with the weight, the amount of sales of each outlet refines the result and allows a deeper interpretation. To apply all these concepts, the Spatial Statistics Tools inside ArcToolbox are used.

Also, from a more general perspective, Z-scores and probabilities will be analyzed for the whole county and three selected tracts. The Z-score is simply the exact number of standard deviations that a specific feature lies from the mean. Standard deviations are usually presented as whole steps: -3, -2, -1, +1, +2 and +3. The Z-score gives the exact position of a feature within them, for example somewhere between 0 and +1.
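
The calculation is z = (value - mean) / standard deviation. A short Python sketch with hypothetical sales figures, assuming the population standard deviation is the one intended:

```python
from statistics import mean, pstdev

def z_score(value, values):
    """Number of standard deviations between value and the mean of values."""
    return (value - mean(values)) / pstdev(values)

# Hypothetical sales per tract (mean 5, population standard deviation 2).
sales = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(9, sales))  # 2.0
print(z_score(4, sales))  # -0.5
```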

Each Z-score corresponds to a probability, which allows analyzing the chances of an attribute being higher or lower than a chosen number in a year, for example. For that, a standard statistical table is used, in which the probability of a given z-score can be looked up. The inverse process is also possible, when the probability is known and the corresponding z-score needs to be found.

However, using the table involves some approximation, since it does not list every possible z-score and probability. Because of that, to guarantee precise results in this project, the NORM.INV function of Microsoft Excel is used to return the value directly, without table rounding.
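
Excel's NORM.INV(probability, mean, standard_dev) returns the value below which the given cumulative probability falls. Python's standard library offers the same inverse through `statistics.NormalDist.inv_cdf`. In the sketch below, the mean and standard deviation are hypothetical placeholders, not the county figures, and the helper name is an assumption.

```python
from statistics import NormalDist

def sales_exceeded_with_probability(p_exceed, mean, std_dev):
    """Sales value that is exceeded with probability p_exceed,
    assuming normally distributed sales."""
    # The inverse works on the cumulative (left-tail) probability, so the
    # value exceeded p_exceed of the time sits at 1 - p_exceed.
    return NormalDist(mu=mean, sigma=std_dev).inv_cdf(1 - p_exceed)

# Hypothetical parameters, for illustration only.
mean_sales, sd_sales = 100.0, 15.0

low = sales_exceeded_with_probability(0.70, mean_sales, sd_sales)
high = sales_exceeded_with_probability(0.20, mean_sales, sd_sales)
print(low < mean_sales < high)  # True
```

A value exceeded 70% of the time lies below the mean, and a value exceeded only 20% of the time lies above it, which matches the ordering of the two results reported for the county.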

Lastly, the symbolization in ArcMap is used to visualize and classify the data so that it can be easily interpreted. Proportional and choropleth maps are used to support the results and reach a conclusion about the spatial question.

Results

First, it is necessary to have a preliminary view of the characteristics of Milwaukee County. In Map 1 it is possible to notice a concentration of non-white population in the north-western region of Milwaukee, which is extremely high in the area outlined in red.

Thus, the analysis will focus on this area, checking whether high sales are located there or close to it. The symbolization of the percentage of non-white people is the background of most of the other maps. That helps to analyze the sales results, and a transparency was applied to keep the maps from being cluttered.

Outlined in pink are the tracts selected in this exercise to have the Z-score calculated. In the northern tract, the z-score of 2.26 shows that this tract has a high amount of sales, far above the mean. The situation is more extreme in the eastern tract, where the z-score is 7.95, which can even be considered an outlier because the sales are extremely high. In the western tract, the situation can be considered normal, since the z-score is 0.39, close to the mean. Also answering the exercise questions, using the formula in Microsoft Excel (Figure 1) for the entire county, lottery sales will exceed US$91,504 70% of the time and will exceed US$628,122 only 20% of the time.

Figure 1 – Use of Excel to find exact results.

Map 1 – Non-white distribution in Milwaukee

In Map 2, it is possible to see how the lottery sales are concentrated in the south, with the exception of some tracts in the north-east and in the center. However, it is also important to consider the size of each tract. The tracts farther from the center are larger, so it is natural that their sales are higher. On the other hand, none of these high-sales tracts coincides with a high concentration of non-white population. Thus, based only on these two maps, the sales appear to be concentrated in tracts with a predominantly white population.

Map 2 – Sales Distribution in Milwaukee

Analyzing the mean centers in 2007 and 2009, Maps 3 and 4 show that when the weight is applied, the mean center shifts to the south, where there is no concentration of non-white population.

Map 3 – Mean Centers in 2007

Map 4 – Mean Centers in 2009



When the standard distance is applied in both years (Map 5 and Map 6), there is no big difference over time. In both cases, the standard distance covers a portion of the “non-white area”, but also a large part of the “white area”. Thus, it does not seem to indicate a concentration in either white or non-white regions.

Map 5 – Sales Distribution in 2007

Map 6 – Sales Distribution in 2009

However, when analyzing Map 7, which compares both years in a single map, a slight shift is noticed: in 2009 the standard distance lies a little more to the north-east than in 2007.

Map 7 – Comparison between 2007 and 2009

The reason for that can be seen by looking more closely at the comparison between the mean centers of 2007 and 2009 (Map 8). Over the two years, both the normal and the weighted mean centers shifted north, toward the area with a high non-white population. It is a small shift that does not even change the tract where the mean center is located; however, it shows an important result about the variation over time.

Map 8 – Comparison of Mean Centers (2007-2009)


Conclusion

Considering the first results, there is apparently no racial pattern in the lottery sales. The high sales do not fall in tracts with a non-white concentration. The simple mean center falls near the center of the county itself, so it also does not suggest a considerable difference. In fact, the opposite idea emerges from the mean center weighted by the amount of sales: it shifts to the south, where the concentration of white population is higher. Everything suggests that the allegations of the civil rights groups are not supported.

However, when the temporal dimension is analyzed, a shift is noticed toward the area where the non-white concentration is higher. It is then possible to affirm that until 2009 there was no big difference in the amount of sales depending on the predominant race in a given region. The results nevertheless suggest that the situation may change in the future, because there is a tendency for the mean center to shift north, where the non-white population is concentrated.

Studies on this matter should be extended to more recent years in order to analyze how the change over time is occurring. It is also important to understand the limitations of the analysis, since some factors were not considered. The absolute amount of sales can be related to the size of the tracts and the total population living in them. Also, the population that lives in a given tract does not necessarily buy lottery tickets only within that tract. This is especially true in an urban area, where the mobility of people is higher. Therefore, the results of this project show that, for now and with this data, the allegations of the civil rights groups are not supported; however, that does not mean that the matter should be ignored. On the contrary, this project encourages more studies, not only with the same concepts but also including other variables.

Thursday, February 14, 2013

Horses and Orchards in Wisconsin

In this section of the exercise, data related to the number of horses and the orchard acres per county in Wisconsin were analyzed. The goal was to find where the best places to invest in these types of farms are. At first, some parameters were calculated statistically, and then some maps were created and spatially studied.

Table 1 - Statistical Analysis

Parameter            Horses     Orchard Acres
Mean                 1666       133.5
Median               1495       39.5
Mode                 0          0
Skewness             1.43       7.18
Kurtosis             3.18       56.45
Standard Deviation   1156.46    389.17

First, a general inquiry was made with the data available, obtaining the results presented in Table 1. In both samples the median is smaller than the mean, although the difference was not too big. Also for both samples, the mode was 0, meaning that the most frequent quantity of horses or orchard acres was 0. That makes perfect sense, considering that it is easier for two or more counties simply not to have the type of farming than to have exactly the same amount of it.
In the skewness and kurtosis parameters, the orchard acres clearly stand out. Most counties have orchard acreages clustered closely together, resulting in a peaked curve (high positive kurtosis), and these values are in general lower than the mean, which is why the positive skewness is also high. For the horses, both values are also positive, but not as high as for the orchards. The skewness is small, so more values lie below the mean than above it, but there is no big discrepancy. The kurtosis, however, is relatively high, showing that most of the values are near the mean.
The standard deviation in both cases suggests little spread among the values below the mean. That is because the standard deviation is almost as high as the mean in the horse sample and even higher than the mean in the orchard sample. Since counts cannot be negative, almost no counties can fall two standard deviations below the mean, so the outliers must be more frequent on the positive side.
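
The parameters discussed above can be reproduced in Python as a rough sketch. The seven acreage values are hypothetical, the function names are assumptions, and skewness and kurtosis use the simple population-moment formulas, so statistical packages that apply sample corrections will return slightly different numbers.

```python
from statistics import mean, median, mode, pstdev

def skewness(values):
    """Third standardized moment: positive when the long tail is on the right."""
    m, s, n = mean(values), pstdev(values), len(values)
    return sum((v - m) ** 3 for v in values) / (n * s ** 3)

def excess_kurtosis(values):
    """Fourth standardized moment minus 3: positive for a peaked, heavy-tailed shape."""
    m, s, n = mean(values), pstdev(values), len(values)
    return sum((v - m) ** 4 for v in values) / (n * s ** 4) - 3

# Hypothetical orchard acres for seven counties, with one large outlier.
acres = [0, 0, 1, 2, 3, 4, 18]

print(mean(acres), median(acres), mode(acres))  # 4 2 0
print(pstdev(acres) > mean(acres))              # True
print(skewness(acres) > 0, excess_kurtosis(acres) > 0)  # True True
```

A single large outlier is enough to pull the mean above the median, make the standard deviation exceed the mean, and produce positive skewness and kurtosis, the same pattern seen in the orchard sample.
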
After that, maps of Wisconsin could better illustrate how these values are distributed spatially. Examining Map 1 and Map 2, the counties with more horses are concentrated in the south-west portion of Wisconsin, especially Clark, Monroe, Vernon, Grant and Dane counties.
All of these counties can be considered prominent in the number of horses. However, by analyzing how far their values are from the mean in Map 3, Dane County stands out as the one with the highest number of horses. In general, a strong pattern is noticed in which the north-east region is weak in this variable and the south-west has the highest concentration of horses. Therefore, the investment would be better applied in locations within this area.
With orchard farming, an opposite pattern was noticed. Almost no county has a relatively high amount of orchard acres. It means that this kind of farming is highly concentrated in a few counties, generally in the west region of Wisconsin (Map 4), but with an extreme prominence of Door County, located in the east. This county has 33.34% of all the orchard acres in Wisconsin (Map 5); it is clearly an outlier and would be a very good place to invest. However, it is only one county in the east region, so the tendency in the west might be interesting as well.
The decision-making in this analysis was based on the idea that the higher the amount of a farming type, the more successful that farming is in that location. However, to guarantee better results, it would be interesting to have other variables about each farming type as well. Especially when dealing with investments, data on the profit made by each kind of farming per county would give a more accurate answer. Still, the available data served well the purpose of giving a general idea of how these kinds of farming are distributed within Wisconsin.