"Don't wait to buy real estate, buy real estate and wait" - Russell Sage
So I did this work a while back and thought I would add it to the Portfolio. Its main interest is that if you're worried about the Pandemic's economic impact, it might convince you the world is not so bad. It also demonstrates some skills in geospatial analysis and map creation.
The data is available on Zoopla: see Sold House Prices - Get historic data only on Zoopla.
Executive Summary
It is easy to believe that the Pandemic would have disrupted the property market, but there seems to be no evidence that this is the case. In trying to establish what drives higher or lower prices, I was only able to conclude that postcode mattered the most; the number of bedrooms did not seem as important to the final price as the postcode. One might hypothesise that this correlation reflects a general market panic, with prices set not by the utility of owning a property but distorted by that panic. I don't believe this to be the case, as I could not see it in the data, though I haven't looked that far back.
Individual sellers might have been desperate to sell, but if so, it is clear this was met by people equally desperate to buy. Therefore, beyond the acute crash in the specific month the pandemic was declared, it appears that the pandemic did not depress house prices and that the industry was robust.
It should be noted that this robustness is perhaps unsurprising: much of buying a house can be done online, without much fuss. Zoopla itself, and the open availability of its data, might have helped avoid a housing market crash, but we will never know.
Key takeaways: as in most cities, the most expensive property is in the city centre, and there is a very large gap between these central properties and those around the edge. There are several voids in the city plan, though, where property does not seem to come up for sale. For those thinking the Pandemic caused a huge hit to the economy, the data is heartwarming: while sales numbers do drop off a cliff in March, they recover within the same quarter, suggesting the people of Manchester were unperturbed by the situation.
The size of the blue circles below is proportional to the size of the transactions taking place in that location. You can see that in the city centre prices are orders of magnitude larger than anything else, while the rest of the city is fairly consistent in property price.

The larger transactions also seem to be handled by different agents. The map below splits out the preeminent agent in the area and puts them in red. Whether you zoom in or out, the blue (and larger) transactions are handled by specialised agents. You can see from the red that no large transactions happen outside the city centre, and that there is a large void near the centre (assumed to be retail or rental land use).

Moving on to the data, I looked at the number of bedrooms, expecting it to predict the house price, but this does not seem to be the case: location seems more important. Of course there is a connection, but especially given the outliers in the city centre, there is reason to think postcode was the more important factor.
You can see the graph of bedrooms against price below. Investigating further, there are some mistakes in the data, where flats with 30 bedrooms were labelled as a single flat rather than the more intuitive "Block Of Flats", and one such went for half a million. So beyond the prices being reported, there is a concern the data might have issues.
The reason to think there is some disconnect between price and number of bedrooms is the 50-bedroom building that went for less. When buying property in Manchester, the mantra of "location, location, location" seems powerfully descriptive.
Later I also found that the type of property seemed to matter. If I were to build a regression or machine learning model I would start with these two variables; while the number of bedrooms might seem the more obvious predictor, it appears skewed in the data, so I think you would also need to remove the property type "flat", for reasons explained below (a sketch of such a model follows the graph below).

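As a rough illustration of that starting point (a minimal sketch only, not a model fitted in this project; scikit-learn is assumed, the column names come from the full code listing, and price is assumed to be numeric):
import pandas as pd
from sklearn.linear_model import LinearRegression

# Drop the suspect "flat" rows, then one-hot encode postcode and property type.
model_data = data[data['property'] != 'flat'].dropna(subset=['price'])
X = pd.get_dummies(model_data[['postcode', 'property']])
y = model_data['price']
model = LinearRegression().fit(X, y)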
The graph below is a count of when properties were listed. The large drop in the middle is March, when the Pandemic initially broke. This is the basis for suggesting the Pandemic did not have the economic impact being reported at the time. It is also notable that listings seem to recover even before any government action, with recovery visible by the end of the first month. Interestingly, I am not sure whether this was before the government had stated it was permissible to move property.

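A count like this can be produced along the following lines (a sketch rather than the exact plot code, using the Date column created in the appendix code below):
# Listings per month, using the parsed first_listed_date.
monthly = data.set_index('Date').resample('M').size()
monthly.plot(kind='bar')
plt.ylabel('Listings per month')
plt.savefig(r'C:\Data\Manchester\listings_per_month.png')
plt.close()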
The maps created are fully interactive HTML maps: you can click in and get specific prices, and they are available upon request. These maps are ready to be uploaded to a website and could be made viewable to users if that were a requirement. Longitude and latitude data was obtained online for the creation of these maps.
I was unable to identify the agent, but I strongly suspect it relates to the ID column, though there is nothing to confirm this. Whether you aggregate by branch, ID, or any other grouping of the estate agents, you will find individuals, branches, and agencies that massively outperform the others. This looks more like a power law than an 80/20 rule, meaning certain estate agencies are likely to have much more experience than others.
The agents also appear territorial, specialising in certain postcodes rather than spreading across the whole city. There is not enough evidence to say for sure, but individual postcodes very likely differ in which agents are most willing to deal with the property. I didn't investigate this further, but it is interesting because it contradicts expectations: given that Zoopla is an online service, you would expect an agent or branch to service the whole city.
It is easy to see why this might be: there is simply enough business for branches to specialise and for agencies to split agents into geographically based specialists, but it was a split I wasn't quite expecting. Despite this, it is clear that individual branches don't focus on single postcodes, so it would seem that it is the agents who specialise in specific locations.
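You can see the skew for yourself with a simple aggregation (a sketch; treating id as the agent identifier is an assumption, as discussed above):
# Transaction counts per branch and per (assumed) agent id, largest first.
per_branch = data.groupby('branch')['price'].count().sort_values(ascending=False)
per_agent = data.groupby('id')['price'].count().sort_values(ascending=False)
print(per_branch.head(10))
print(per_agent.head(10))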
Finally, when investigating the number of bedrooms, type of property, and postcode, I noted there was no group called "hotel", yet there were several "flats" with more than 10 (somewhere around 25) bedrooms being sold. This was odd, as there was a group called "block of flats" which was not being used for this class of transaction. Looking further, sales of properties with many bedrooms cluster in certain postcodes. I concluded there had been a sale of several hotels in the city centre, due to the pandemic and international travel restrictions, that were being mislabelled as land or flats within the data.
This remains a hypothesis, but it seems obvious that you would sell hotels in a pandemic environment, and odd that they were mislabelled: no hotel label was used, and the more obvious labels were not used either.
The bulk of these insights came from data mining. In the data mining code I created, I cycled through all the different columns as measures and automatically created graphs of the data. I then built the maps to highlight key insights.
Appendix and methods used
It runs from the initial data clean through to the actual results. I have always been taught to provide a written executive summary and then a technical appendix covering the methods used, so that someone wanting to recreate the work could hopefully figure it out. I had shown this to someone in the past, and their comment that "your report just appears to end" confused the hell out of me, as the Executive Summary was the first thing they read!
Therefore the above is the broad set of insights about the year. Only read on if you want more technical detail or a guide to the methods used.
Head and Tail
Below is a printout from Python of the head of the data.

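These are the standard pandas inspection calls, as used in the full code listing at the end:
print(data.head())   # first five rows of the data frame
print(data.tail())   # last five rows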
Using data.dtypes will show whether a value in the data frame is an object (and therefore a series of strings); "int64" represents a 64-bit integer.

I can use the following commands to change between data types.
CODE:
pd.to_datetime(df['column name'])
pd.to_timedelta(df['column name'])
pd.to_numeric(df['column name'])
df['amount'].astype('int32')
I note that the first listed date is stored as an object, i.e. a string. I then use the code below to change it to a date value.
data["Date"]=pd.to_datetime(data["first_listed_date"])
print(data.head())
Data Mining
I then used code that cycles through all the headers and attempts to create a scatter plot for every pair of columns. The purpose is to surface insights as quickly as possible.

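The loop itself appears commented out in the full code listing at the end; this is the same code cleaned up and indented:
for x_nd in data.columns:
    for y_nd in data.columns:
        try:
            # Scatter every column against every other column and save the figure.
            sns.scatterplot(data=data, x=x_nd, y=y_nd)
            plt.xlabel(x_nd)
            plt.ylabel(y_nd)
            plt.savefig(r'C:\Data\Manchester\manchester_test_' + str(x_nd) + '_' + str(y_nd) + '.png')
            plt.close()
        except Exception:
            pass
    try:
        # A histogram of each column on its own.
        sns.histplot(data=data, x=str(x_nd))
        plt.savefig(r'C:\Data\Manchester\manchester_test_' + str(x_nd) + '_histogram.png')
        plt.close()
    except Exception:
        pass
    # Record how many values in the column are missing.
    print(x_nd, data[x_nd].isna().sum())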
This is intended to get to the details as quickly as possible. I also have it record the number of values in each column that are not a number, to get a sense of how dirty the data is and where the blanks are. This takes some time to run, and in the meantime I do other things.
For the time being I exclude NaNs, on the assumption that the exclusion doesn't matter, and will return to the finer details once the above mining script has finished.

Cleaning data:
I noted 17 different columns where NaNs are present, which is most columns. County is the worst, and I drop it straight away as I can't see the utility of keeping it when everything is in Manchester. To do this I use a drop command, shown below.
I do the same with id since, as you'll read below, I had come to the conclusion that I wasn't sure what it was an ID of.

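From the full listing:
# county is mostly missing (and always Manchester); id is of unknown meaning.
data = data.drop(columns='county')
data = data.drop(columns='id')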
I then created a script to cycle through all the columns and put filler values wherever there are other NaNs. I know price and bedrooms are numbers, and while I put in a filler value of zero, I make a mental note to be willing to drop those rows later: if you did any machine learning you would want to know you had done this.

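A sketch of that script, using .loc so that only the missing cells in each column are filled:
for x_nd in data.columns:
    if x_nd == 'price' or x_nd == 'bedrooms':
        # Numeric columns get a filler of zero (noted, and droppable later).
        data.loc[data[x_nd].isna(), x_nd] = 0
    else:
        # Everything else gets a placeholder string.
        data.loc[data[x_nd].isna(), x_nd] = 'unknown'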
Things learned from the data mining approach
ID is not unique. A question to confirm is what ID relates to: is it a buyer, seller, or agent ID? There is a likelihood that the id relates to the address or property, though this would need to be confirmed if it became an issue.
If we assume that ID identifies the agent, then you can see there is a large number of agents and a small number who do huge volumes of transactions. This is where it becomes difficult: absent the business knowledge, I cannot make any assumption about what ID means.
If you were a business and this were agent ID, your next step might be to build some KPIs on this data, ranking agents and setting averages based on percentile analysis of your staff. Though I think this has to be done in light of business needs and strategic realities.
It is noted that a few city centre transactions represent the majority of the money. A focus on these city centre transactions would change the KPIs you would create.

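As a starting point, a sketch of such a ranking (again assuming id is an agent identifier and price is numeric):
# Count and total value per agent, plus a percentile rank on total value.
kpi = data.groupby('id')['price'].agg(['count', 'sum'])
kpi['percentile'] = kpi['sum'].rank(pct=True)
print(kpi.sort_values('sum', ascending=False).head())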
Many postcodes have only one id, but there might be clustering in certain neighbourhoods. I consider maps as an option. The fact that ids concentrate in certain locations also deepens my suspicion that ID is an agent ID.

Some generic, and some would say obvious, things about the data
Different branches have different counts of transactions. The counts being larger than for ID is another reason I strongly expect ID to be agent ID: it makes sense that a branch, being a group of agents, would have a larger count than an individual agent.

Did different companies open at different times?
Below is the listing date plotted against branch.
Looking at the branches, there might be a link between opening date and branch, as some seem to have more history than others. I don't think this means different branches opened at different times after the pandemic; the increase in the number of branches would be in line with the industry expanding, and seems to suggest the number of active branches increased.

Looking at branch versus postcode, it seems that branches do not limit themselves to postcodes. There are some gaps, so clearly not all branches service all postcodes, but most do.

There seem to be some properties with more than 10 bedrooms; I will need to clarify whether there are any hotels on the market. It should also be noted that you would not conclude from the below that the reason price followed location rather than the number of bedrooms is that the many-bedroom properties are grouped into one specific area.
You can see that there appear to be several outliers in the data.

I confirmed that, apart from a single outlier, prices were not increasing or decreasing in any visible way over time, so I would not want to base the rest of the report on time frequency (though I feel more meaningful graphs might be extracted using moving averages; a sketch follows below).
You would also note that during the pandemic months you cannot see an impact on the prices that sold houses went for.
Also, if you were of a conspiratorial mindset, the outliers did not happen at the height of the pandemic. (Just saying!)

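On the moving-average point above, a rolling median along these lines (a sketch, not a plot produced for this report; price is assumed numeric) would smooth out the hotel-sized outliers:
# 30-day rolling median of the daily median price.
daily = data.set_index('Date')['price'].resample('D').median()
daily.rolling(window=30, min_periods=1).median().plot()
plt.ylabel('30-day rolling median price')
plt.savefig(r'C:\Data\Manchester\rolling_median_price.png')
plt.close()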
The conclusion that much of the land is similarly priced is reinforced by the outlier being, well, land. Looking at the different prices based on house type, I note that prices are not uniform across the market, but it is interesting how similar terraced houses, bungalows, and new houses are in price, though you can see enough differences.
I am convinced that the flats going for millions are not individual flats but blocks of flats or hotels. I just don't believe a single flat went for that price.
I also think that if I were to build a machine learning price predictor, the house type would be a better predictor than the number of bedrooms, especially alongside postcode.

It can be seen clearly that some flats have more than 10 bedrooms. I am going to call that a data error and assume someone was sloppy in their labelling. I don't believe a flat with 25 rooms went for a million without it secretly being a block of flats or a hotel...

And I don't know why I want to prove this, but below is postcode plotted against number of bedrooms, showing that sales of flats with supposedly 25 rooms tend to happen in certain postcodes. I am going to say they're hotels... they're selling hotels and mislabelling them on Zoopla.
When you stop and think about it, a lot of hotel sales during a pandemic with international travel restricted makes sense. So while it is a hypothesis, it seems a fairly obvious explanation for our various outliers.

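The suspect rows are easy to pull out (a sketch using the column names from the full listing; the exact "flat" label and numeric bedrooms are assumptions):
# Flats with implausibly many bedrooms, counted by postcode.
suspect = data[(data['property'] == 'flat') & (data['bedrooms'] > 10)]
print(suspect['postcode'].value_counts())
print(suspect[['postcode', 'bedrooms', 'price']])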
Making Maps
I realised that I would need longitude and latitude to work with the data, so I downloaded it from the link below. Having downloaded the longitude and latitude data for the UK, I can do a left join on the postcode; the code is below. I used Folium to produce an HTML map of Manchester, and below is an initial example using the shapefile.
Download UK Postcodes with Latitude and Longitude (freemaptools.com)

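From the full listing, the join and the first base map:
import folium

# Left join the sold-price data onto the postcode coordinates file.
data2 = pd.read_csv(r"C:\Data\Manchester\ukpostcodes.csv")
df_inner = pd.merge(clean_data, data2, on='postcode', how='left')

# Base map centred on Manchester.
m = folium.Map(location=[53.483959, -2.244644], tiles="Stamen Toner", zoom_start=12)
m.save(r"C:\Data\Manchester\my_map.html")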
I will also need Manchester geospatial data and shapefiles, so I get them from the link below.
OS Maps: online mapping and walking, running, and cycling routes (ordnancesurvey.co.uk)

I then check that no new NaNs were introduced, and after this I go back to the join and change it to an inner join. This is a join onto official data, and as far as I can see the data I downloaded should be authoritative. I ignore the problem but add it to the issues to raise: some postcodes appear to have been entered incorrectly in the online data.
Once I had worked through this I got the following interactive map, which shows all postcodes sized by price and can be clicked on via the dots. I had set a greyscale, but that could be changed to taste.

We can see that price is very uniform except in what is assumed to be the city centre, where it increases. We can also see that Manchester is "doughnut" shaped, with what appears to be a city centre or commercial area where property is not available.

This code produces an HTML file that could be shared on a website, and the same code could be modified to show different information.
Conclusion
From this data alone it is really hard to argue that the pandemic happened at all outside the single month of collapse. It did not appear to affect house prices, and while there appears to be what I suspect was the sale of several hotels, there is nothing to confirm they were hotels. If they were hotels, the sales being obfuscated as something else would be troubling, though it is more likely a data error.
In the end I made a pretty map, and the conclusion was that the market was the same as it ever was: no market collapse, no panic, just a pause of a month or two. You could use this method to try to predict house prices by removing these outliers, filtering data over several years, and seeing where house prices were growing fastest.
Maybe that is the great advantage of Zoopla and others: with open access to the data we can do our own analysis of house prices. Hopefully, if you've read to the end, you feel a bit empowered to run your own analysis of the housing market.
I also feel nothing but admiration for how fast the market recovered, and think some credit must go to the housing industry for transitioning to an online setup well before the pandemic.
I haven't bothered ranking or analysing the actual branches or agencies, but from this data you could very easily figure out who to approach. You could also buy and sell with more confidence by looking up the actual data.
FULL CODE:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data=pd.read_csv(r"C:\Data\Manchester\manchester-3year-market-newinstructions.csv")
print(data.head())
print(data.tail())
print(data.dtypes)
data["Date"]=pd.to_datetime(data["first_listed_date"])
print(data.head())
#for x_nd in data.columns:
#    for y_nd in data.columns:
#        try:
#            sns.scatterplot(data=data, x=x_nd, y=y_nd)
#            plt.xlabel(x_nd)
#            plt.ylabel(y_nd)
#            plt.savefig(r'C:\Data\Manchester\manchester_test_'+str(x_nd)+'_'+str(y_nd)+'.png')
#            plt.close()
#        except:
#            pass
#    try:
#        sns.histplot(data=data, x=str(x_nd))
#        plt.savefig(r'C:\Data\Manchester\manchester_test_'+str(x_nd)+"_histogram.png")
#        plt.close()
#    except:
#        pass
#    count=0
#    for nan in data[data[x_nd].isna()]:
#        print(nan)
#        count+=1
#    print(x_nd)
#    print(count)
# Count the missing values in each column to gauge how dirty the data is.
for x_nd in data.columns:
    count = data[x_nd].isna().sum()
    print(x_nd, count)
# Drop county (mostly missing, always Manchester) and id (meaning unknown).
data = data.drop(columns='county')
data = data.drop(columns='id')
# Fill the remaining NaNs: zero for the numeric columns, 'unknown' elsewhere.
for x_nd in data.columns:
    if x_nd == 'price' or x_nd == 'bedrooms':
        data.loc[data[x_nd].isna(), x_nd] = 0
    else:
        data.loc[data[x_nd].isna(), x_nd] = 'unknown'
# Print any rows that still contain NaNs.
for idx, row in data[data.isna().any(axis=1)].iterrows():
    print(row)
clean_data=data.dropna()
data2=pd.read_csv(r"C:\Data\Manchester\ukpostcodes.csv")
df_inner = pd.merge(clean_data, data2, on='postcode', how='inner')
print(df_inner.isnull().any())
print(data.size)
print(df_inner.size)
df_inner = pd.merge(clean_data, data2, on='postcode', how='left')
print(df_inner.head())
import folium
m=folium.Map(location=[53.483959, -2.244644 ],tiles="Stamen Toner", zoom_start=12)
m.save(r"C:\Data\Manchester\my_map.html")
center_lat = df_inner['latitude'].mean()
center_long = df_inner['longitude'].mean()
m = folium.Map(location=[center_lat, center_long],tiles="Stamen Toner", zoom_start=4.5)
count = 0
mappy = {}    # bedrooms transacted per postcode
long = {}     # longitude per postcode
lat = {}      # latitude per postcode
strike = {}   # bedrooms transacted by the 'strike' agent per postcode
# Map bounds taken from the data itself.
sw = df_inner[['latitude', 'longitude']].min().values.tolist()
ne = df_inner[['latitude', 'longitude']].max().values.tolist()
for idx, row in df_inner.iterrows():
    if row.price != 'unknown':
        if row.postcode in mappy:
            mappy[row.postcode] += row.bedrooms
        else:
            mappy[row.postcode] = row.bedrooms
        long[row.postcode] = row.longitude
        lat[row.postcode] = row.latitude
        count += row.bedrooms
        # Accumulate the 'strike' agent's share without resetting it each row.
        if row.postcode not in strike:
            strike[row.postcode] = 0
        if row.agent == 'strike':
            strike[row.postcode] += row.bedrooms
# One blue circle per postcode sized by total volume, plus a red circle
# sized by the 'strike' agent's share of that postcode.
for row in mappy:
    try:
        folium.Circle(
            location=[lat[row], long[row]],
            popup=row,
            radius=(float(mappy[row]) / count) * 20000,
            color='blue',
            fill=True,
            fill_color='blue'
        ).add_to(m)
        folium.Circle(
            location=[lat[row], long[row]],
            popup=row,
            radius=(float(strike[row]) / count) * 20000,
            color='red',
            fill=True,
            fill_color='red'
        ).add_to(m)
    except:
        pass
m.fit_bounds([sw, ne])
m.save(r"C:\Data\Manchester\strike_map_bedrooms.html")
del m
# Price over time.
sns.lineplot(x='Date', y='price', data=data)
plt.savefig(r'C:\Data\Manchester\line_plot_postcode.png')
plt.close()
#print(clean_data.dtypes)
# Value per room, used for the final chart.
clean_data['value_per_room'] = pd.to_numeric(clean_data['price']) / pd.to_numeric(clean_data['bedrooms'])
s = sns.barplot(x ="postcode", y = 'price', data = data)
plt.savefig(r'C:\Data\Manchester\postcode_price.png')
plt.close()
s = sns.barplot(x ="postcode", y = 'price', data = clean_data, hue = "agent")
plt.savefig(r'C:\Data\Manchester\postcode_price_by_agent.png')
plt.close()
s = sns.barplot(x ="postcode", y = 'bedrooms', data = clean_data)
plt.savefig(r'C:\Data\Manchester\postcode_bedrooms.png')
plt.close()
s = sns.barplot(x ="postcode", y = 'bedrooms', data = clean_data, hue = "agent")
plt.savefig(r'C:\Data\Manchester\postcode_bedrooms_by_agent.png')
plt.close()
s = sns.barplot(x ="postcode", y = 'property', data = clean_data)
plt.savefig(r'C:\Data\Manchester\postcode_property.png')
plt.close()
s = sns.barplot(x ="postcode", y = 'property', data = clean_data, hue = "agent")
plt.savefig(r'C:\Data\Manchester\postcode_property_by_agent.png')
plt.close()
# Stacked bar chart indexed by value per room.
clean_data.plot(kind='bar', x='value_per_room', stacked=True)
plt.xlabel('Value per room')
plt.title('Value per room')
plt.savefig(r'C:\Data\Manchester\value_per_room.png')
plt.close()