It is interesting to observe how the price of a house changes over time based on its location.
Through its platform FRED, the Federal Reserve Bank of St. Louis provides research data on various economic indicators. One such indicator, sourced from the U.S. Federal Housing Finance Agency, is an index of house prices for the period between 1975 and 2016. This data is provided at the county level, which means that there are thousands of files, each containing a year and a house price index relative to the year 2000. Digging deeper, we find 1488 files, which is fewer than the total number of US counties (over 3,000):
This makes it hard to find which county corresponds to which location on the map. If we tried to color each region with a value based on the index, even a single county that wasn't considered in the study could leave a hole in the map. If we were to compare this map with the map showing the regions where no people live, we may see that smaller counties tend to have more people per square meter.
What makes this longitudinal study interesting is its exhaustiveness and simplicity. Although the amount of data seems overwhelming, it is very low-dimensional, which means that we don't need a lot of computational resources to analyze it. To make the analysis simpler, we could combine all time-series into a single file of size ≈560kB to reduce the amount of disk I/O.
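Merging the files could look roughly like the sketch below. It assumes each per-county file is a CSV with a header row followed by `year,value` rows, and that the file name (minus its extension) identifies the series. Both are assumptions about the download format, and `combine_series` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def combine_series(src_dir, out_path):
    """Merge every per-county CSV in src_dir into one long-format CSV.

    Assumed input layout (not confirmed by the source): a header row,
    then one `year,value` row per observation.
    Returns the number of data rows written.
    """
    out_path = Path(out_path)
    # Collect inputs first and skip the output file in case it lives in src_dir.
    sources = [
        p for p in sorted(Path(src_dir).glob("*.csv"))
        if p.resolve() != out_path.resolve()
    ]
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["series_id", "year", "index"])
        for path in sources:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                next(reader, None)  # skip the assumed header row
                for row in reader:
                    if len(row) < 2:
                        continue  # ignore blank or malformed lines
                    writer.writerow([path.stem, row[0], row[1]])
                    rows_written += 1
    return rows_written


# Example call (paths are placeholders):
# combine_series("county_files", "combined.csv")
```

A long format (one row per series and year) keeps the merged file easy to group by county later, at the cost of repeating the series identifier on every row.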
On FRED we can see that each time series has already been visualized separately, but this means that there is no easy way to compare the indices of the different counties relative to each other. Plotting them together would let us observe wider trends or find counties where house prices have deviated from the norm. Ideally, we want to see this in an interactive way: pointing the mouse at the data should reveal which county it belongs to. The initial overview graphic didn't allow such interactivity.
We must also consider using multiple diagram types to capture the richness of the data from multiple perspectives and offset any disadvantages that a particular diagram type might have. We have seen that when too many lines are involved, even when they are thin or semi-transparent, their overlapping can make it hard to distinguish them. For this reason, we may consider plotting the distribution of each time series as a boxplot, to see whether we can detect some anomalies in the data. This is shown on a separate 250-inch-high plot (≈3.9MB). If we zoom in and scroll, we can see that Richland County (MT) shows very high index values, and its median value (shown in red) is also very high. Other counties that catch our attention are New York County (NY), Los Angeles County (CA), San Bernardino County (CA), Honolulu County (HI), District of Columbia (DC), Mercer County (ND) and Monroe County (FL).
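The numbers behind a single box can also be computed without plotting anything. The sketch below is illustrative only: `box_stats` is a hypothetical helper, the 1.5 × IQR whisker rule is the common convention and not necessarily the one used for the plot above, and the sample values are made up.

```python
from statistics import quantiles


def box_stats(values):
    """Return the quartiles and the 1.5*IQR outliers of one time series."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": med, "q3": q3, "outliers": outliers}


# Made-up index values, not taken from the data set.
stats = box_stats([1, 2, 3, 4, 5, 100])
print(stats["median"], stats["outliers"])
```

Comparing the `median` and `q3` values across counties is a textual counterpart to scanning the tall boxplot for boxes that sit unusually high.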
We can also find the counties with the highest median index value:
201.31 New York County, NY
187.59 Richland County, MT
159.91 Custer County, MT
158.03 Cook County, MN
156.70 Valle County, MT
148.13 Broadwater County, MT
147.40 Pope County, MN
145.81 Mercer County, ND
145.74 St Marys County, MD
144.75 Lincoln County, MT
144.67 Lemhi County, ID
144.64 Dawes County, NE
144.36 Walthall County, MS
143.87 Clay County, KS
143.82 Morgan County, MO
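A ranking like this one can be produced in a few lines. The following is a minimal sketch, assuming the series are already loaded into a dictionary that maps a county label to its list of index values. `top_medians` is a hypothetical helper and the sample data is invented for illustration.

```python
from statistics import median


def top_medians(series_by_county, n=15):
    """Return the n (county, median) pairs with the highest median index."""
    medians = {
        county: median(values)
        for county, values in series_by_county.items()
        if values  # skip counties with no observations
    }
    ranked = sorted(medians.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Invented sample values, not taken from the data set.
toy = {
    "County A, XX": [90.0, 100.0, 110.0],
    "County B, YY": [100.0, 150.0, 200.0],
    "County C, XX": [80.0, 85.0],
}
for county, value in top_medians(toy, n=2):
    print(f"{value:.2f} {county}")
```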
Four of the top six counties are located in MT (Montana), which is an interesting observation. On the other hand, Los Angeles County and District of Columbia, which stood out on the boxplot, do not appear in the top 15.
We could also find the median (among all county medians) at state level:
145.81 ND (1 county)
127.53 MT (29 counties)
116.77 VT (2 counties)
113.72 LA (51 counties)
113.69 WY (2 counties)
112.26 NZ (1 county)
110.33 IA (87 counties)
109.94 KY (88 counties)
109.35 MO (75 counties)
108.81 NE (43 counties)
108.59 ID (38 counties)
107.69 MS (55 counties)
107.27 AR (48 counties)
107.09 ME (15 counties)
106.72 MN (76 counties)
106.52 NM (3 counties)
106.31 AL (55 counties)
106.10 NH (2 counties)
104.92 HI (4 counties)
104.20 KS (55 counties)
104.07 GA (116 counties)
103.66 NC (10 counties)
103.19 IL (89 counties)
102.95 IN (89 counties)
102.86 SD (3 counties)
102.72 WV (5 counties)
102.25 WI (5 counties)
101.36 CO (48 counties)
100.90 AK (8 counties)
95.99 NY (17 counties)
95.21 DE (3 counties)
94.44 CT (8 counties)
94.25 PA (14 counties)
93.37 OK (4 counties)
93.22 MD (25 counties)
93.02 UT (4 counties)
92.04 TN (7 counties)
91.92 TX (21 counties)
91.30 WA (12 counties)
91.18 MI (81 counties)
90.99 AZ (13 counties)
89.99 NV (3 counties)
89.65 FL (56 counties)
89.27 VA (12 counties)
88.70 CA (55 counties)
86.97 OH (5 counties)
85.00 OR (10 counties)
84.66 SC (6 counties)
83.51 NJ (12 counties)
80.62 MA (14 counties)
77.44 DC (1 county)
North Dakota seemingly ranks first on the list, with the highest median house price index during the observed period. However, this is misleading, since some states have relatively few examined counties, which can skew the results. For this reason, the number of counties taken into account is shown in parentheses. In North Dakota itself, only Mercer County has been examined, which, as seen in the previous list, has a very high median value. Possibly for a similar reason, District of Columbia ranks last. As you can see, the fidelity of our analysis still depends on how much data we have.
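The state-level figure is a median of county medians, reported together with the number of counties behind it. A minimal sketch follows, assuming county labels end in `, ST` as in the county list above. `state_medians` is a hypothetical helper, and the input is only a four-county excerpt of that list.

```python
from collections import defaultdict
from statistics import median


def state_medians(county_medians):
    """Group county medians by state and return (median, state, count) rows,
    sorted from the highest state median to the lowest."""
    by_state = defaultdict(list)
    for label, value in county_medians.items():
        state = label.rsplit(", ", 1)[-1]  # "Mercer County, ND" -> "ND"
        by_state[state].append(value)
    rows = [(median(vals), state, len(vals)) for state, vals in by_state.items()]
    return sorted(rows, reverse=True)


# Excerpt of the county ranking; only ND is complete here, so the MT value
# will differ from the full 29-county result.
excerpt = {
    "Richland County, MT": 187.59,
    "Custer County, MT": 159.91,
    "Broadwater County, MT": 148.13,
    "Mercer County, ND": 145.81,
}
for value, state, count in state_medians(excerpt):
    print(f"{value:.2f} {state} ({count} {'county' if count == 1 else 'counties'})")
```

With a single county, the state median is simply that county's median, which is exactly why ND inherits Mercer County's 145.81.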
Less surprisingly, MT once again ranks high on the list, followed by LA and IA if we consider only states with at least 30 examined counties. If this seems like a reasonable criterion, then we can filter once again, treating MT (29 counties) as a corner case and including it in the results:
127.53 MT (29 counties)
113.72 LA (51 counties)
110.33 IA (87 counties)
109.94 KY (88 counties)
109.35 MO (75 counties)
108.81 NE (43 counties)
108.59 ID (38 counties)
107.69 MS (55 counties)
107.27 AR (48 counties)
106.72 MN (76 counties)
106.31 AL (55 counties)
104.20 KS (55 counties)
104.07 GA (116 counties)
103.19 IL (89 counties)
102.95 IN (89 counties)
101.36 CO (48 counties)
91.18 MI (81 counties)
89.65 FL (56 counties)
88.70 CA (55 counties)
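The filter itself is simple: keep a state if it has enough examined counties or if it is explicitly whitelisted. In the sketch below, `filter_states` and its parameter names are hypothetical, and the rows are a small excerpt of the state-level list.

```python
def filter_states(rows, min_counties=30, keep=("MT",)):
    """Keep (median, state, count) rows that have at least min_counties
    counties, plus any state listed in keep as an explicit exception."""
    return [row for row in rows if row[2] >= min_counties or row[1] in keep]


# Excerpt of the state-level list: (median, state, county count).
sample = [
    (145.81, "ND", 1),
    (127.53, "MT", 29),
    (116.77, "VT", 2),
    (113.72, "LA", 51),
    (110.33, "IA", 87),
    (77.44, "DC", 1),
]
for value, state, count in filter_states(sample):
    print(f"{value:.2f} {state} ({count} counties)")
```

Listing the exception by name keeps the threshold honest: MT is included by an explicit decision, not by quietly lowering the cutoff to 29.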
It would be pointless to try to make a statement about an entire state given only a few counties. What appears interesting now is that a high-GDP state like CA has a relatively low house price index. Could that be attracting more people there?