Category: Data analysis and reports

Decision Tree

Use the outline of code we discussed in class to create a decision tree for the IrisDataSet that predicts the Type column from the other attributes. Create three versions of this tree: one using entropy, one using the Gini coefficient, and one using the classification error as the splitting criterion. Use the first half of the data set as the training data and the second half as the test data. Provide the error rate for each tree.
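The three splitting criteria named above can be sketched as plain functions over per-class counts. This is a minimal illustration, not the outline discussed in class: the function names and the `weighted_impurity` and `error_rate` helpers are assumptions made for this example.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node, given the number of samples in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def classification_error(counts):
    """Classification error of a node: 1 minus the largest class proportion."""
    total = sum(counts)
    return 1.0 - max(counts) / total


def weighted_impurity(children, measure):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


def error_rate(predicted, actual):
    """Fraction of test rows whose predicted label differs from the true label."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)
```

A tree builder would pick, at each node, the split with the lowest `weighted_impurity` under the chosen measure, then report `error_rate` on the test half.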

Office Supply Store Data Analysis

An office supply store tests a telemarketing campaign to its existing business customers. The company targeted approximately 16,000 customers for the campaign. Assume you are a consultant brought on board to help the company use the findings from the tests to its advantage. Refer to the accompanying spreadsheet, which contains the results of the tests.
The detailed requirements and expected deliverables are described in Capstone Assignment.docx.
Three sample presentations are attached for reference. The data to be used are in the Excel file.

Python Data Analytics + Tkinter GUI Visualization

I have completed about 80% of the project and need help fulfilling the remaining functional requirements.
1. The algorithm needs to change so that the attributes are read dynamically from the uploaded CSV file. I can share the details later.
2. Once the algorithm has been modified successfully, the final result (graph/table) must be displayed in the Tkinter GUI, inside a canvas frame.
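As a rough sketch of the first requirement, the attribute names can be taken from the header row of the CSV file instead of being hard-coded. The function name `read_attributes` and the sample columns are hypothetical; the actual algorithm and file layout are not described here.

```python
import csv
import io


def read_attributes(csv_file):
    """Return (column names, rows), with the names taken from the header row."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)  # each row is a dict keyed by the header names
    return list(reader.fieldnames or []), rows


# Stand-in for an uploaded file; a real file would be opened with
# open(path, newline="") and passed in the same way.
sample = io.StringIO("height,weight,label\n1.7,65,a\n1.8,80,b\n")
columns, rows = read_attributes(sample)
```

Because the rest of the code only sees `columns` and `rows`, the same algorithm can run on any uploaded file whose header differs from the original one.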

Twitter analysis project

Research Report – Twitter Analysis and Presentation

The aim of this assignment is to collect Twitter data, summarise the data using a spreadsheet or other tool, and then write a report about that data. The purpose of the report is to investigate and discuss the use of Twitter analysis by researchers, brands or journalists (depending on your major). The report is not meant to be written as a public-facing report or feature, but rather as an internal research report that might be used in a professional context or to inform your own practices.

You can choose to follow a group of people or one or more hashtags over a period of time that will yield a reasonably sized data set (at least a few thousand tweets, up to a maximum of about 250 thousand, is the right size for this task; with much more, Excel will struggle to open the file). Suitable targets could be hashtags for a TV show or media event, a new or defective product, a group of journalists attending a conference or the conference hashtag, a brand campaign, or a news event as it happens.

You may have to try a few different scenarios before you get some data you can use. For example, broad hashtags like christmas or happy are bad choices; corbyn (during PMQ) or MUFC (during a game) are probably better ones. Spend a little time exploring how hashtags are used together (co-hashtags), partly to make sure that you have all the relevant tags covered (e.g. ‘Manu’ as well as ‘MUFC’); this can be done with the Twitter advanced search page. You should write about this hashtag research as part of your reflection.

Once you have collected the tweets and profile data, use this data set to discuss the following questions in your report. You can do more analysis than this, but these questions are expected as part of the report.

Required analysis

a) Who were the top tweeters and retweeters?

b) How many of your top tweeters are bots? (remove as many as possible from your data set before performing the rest of the analysis)

c) What was the top retweet, and what was the ratio of tweets to retweets in your data set?

d) What % of tweets/retweets in your data set came from the top 10 tweeters?

e) Use a word cloud or word tree of the most used words in your data set to show the type of language being used. Was the hashtag being used in conjunction with other hashtags?

f) Where do the tweets come from? What % are geocoded, and what % of profiles have a location?

g) Do the tweeters fall into any demographic groupings that you can see (look at follower and friend counts, total number of tweets, etc.)?
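Questions (a), (c) and (d) above reduce to simple counting once the tweets are loaded. The sketch below assumes each tweet is a dict with hypothetical `user` and `is_retweet` fields; the column names in a real export will depend on the collection tool used.

```python
from collections import Counter


def summarise(tweets, top_n=10):
    """Count top authors, the tweet/retweet split, and the share of the top N."""
    authors = Counter(t["user"] for t in tweets)
    retweets = sum(1 for t in tweets if t["is_retweet"])
    originals = len(tweets) - retweets
    top = authors.most_common(top_n)
    # Percentage of all tweets and retweets posted by the top N accounts.
    top_share = 100 * sum(n for _, n in top) / len(tweets)
    return {
        "top_tweeters": top,
        "originals": originals,
        "retweets": retweets,
        "top_share_pct": top_share,
    }


# Tiny made-up sample, for illustration only.
sample_tweets = [
    {"user": "alice", "is_retweet": False},
    {"user": "alice", "is_retweet": True},
    {"user": "bob", "is_retweet": True},
    {"user": "carol", "is_retweet": False},
]
summary = summarise(sample_tweets, top_n=1)
```

Bot removal for question (b) should happen before calling a function like this, so that the shares are computed on the cleaned data set.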

In addition to answering these questions you can perform other types of acquisition and/or analysis and you may be awarded extra marks for doing so.

Visualisations you should also include in your report:

Time series for your tweets on a suitable timeframe unit

Word Cloud or Word Tree of language in popular retweets (or co-hashtag use)

Chart showing the % of tweets to retweets

Chart showing the % tweets geocoded and the % of profiles with locations

Histogram of tweeter volumes (e.g. 1 person tweeted more than 100 times, 5 people tweeted 50-100 times, 50 people tweeted 10-50 times, 1,000 people tweeted 5-10 times, etc.)
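The histogram of tweeter volumes can be prepared by counting tweets per account and then grouping accounts into bands. This is a sketch under assumptions: the function name and the band edges are illustrative, and each band here starts at its lower edge and runs up to (but not including) the next one, which differs slightly from the overlapping ranges in the example above.

```python
from collections import Counter


def volume_histogram(user_names, edges=(1, 5, 10, 50, 100)):
    """Map each lower band edge to the number of accounts in that band."""
    per_user = Counter(user_names)  # tweets per account
    bins = Counter()
    for count in per_user.values():
        # Highest lower edge that this account's tweet count reaches.
        lower = max(e for e in edges if e <= count)
        bins[lower] += 1
    return {e: bins[e] for e in edges}


# One name per tweet: "a" tweeted 6 times, "b" once, "c" twice.
histogram = volume_histogram(["a"] * 6 + ["b"] + ["c"] * 2)
```

The resulting dictionary can be pasted into a spreadsheet or passed to a charting library as the bar heights.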

The report should be ~1500 words, done as a basic but well-styled HTML page that includes some visualisations to help illustrate your data. MSc students should attempt visualisations using a JavaScript library rather than iframe embeds. You should try to use a template system to start your page. As well as answering the questions above in your report, you should do some research on social media analytics and Twitter use in journalism and consider how the types of analysis you have performed can be used in a professional context. Include references to the research material you used in your report. You can also talk about the 5th estate and Twitter more generally and its effect on journalism and society, making reference to your own data where possible.

You should also supply a written ~500 word reflection. The reflection should consider the following points. Why did you settle on the hashtag(s) and timeframe that you did? What issues did you encounter in gathering the tweets and analysing them, and how did you overcome these problems? How would you extend or improve your study given more time and/or resources? Include attributions for any code libraries or images used in your report.

Submission

The submission should consist of a single Word or RTF document that contains your reflection and a link to the online report. If you have used code in the acquisition or analysis of your tweets, you should also provide a link to a GitHub Gist for each script. You should add plenty of comments to this code to demonstrate your understanding of how it works.

Marking scheme

Marks will be allocated based on the following scheme:

5/25 Acquisition – Research and discussion of method used to acquire tweets and data obtained

5/25 Presentation – HTML/CSS, layout, quality of writing, overall quality

5/25 Visualisation – Quality, scope/difficulty, integration with report

5/25 Analysis – Quality, depth, difficulty

5/25 Reflection – Discussion of techniques, self critique, journalistic context

Note that extra marks can be awarded for using acquisition, analysis and presentation techniques beyond those taught in class.

Code is not required for the acquisition stage to pass the assignment unless you are on an MSc award. Note that use of code in this coursework can contribute towards the award of the MSc for journalism students.

Introduction & Explanation

Produce an illustrated report that uses analysis and techniques examined during lectures and practicals to examine the distribution, variation and relationships between at least two variables from the following London data:
UK Census
Air Quality
Roads and Parks
Airbnb
Another dataset for London as agreed with your lecturers
The following specific requirements apply (over and above the official Coursework Submission Requirements):
Students are expected to present and interpret a mix of descriptive statistics, maps, tables (and other visualisations) to provide an evidential base to describe spatial patterns and relationships. Literature should be used to support analysis of the patterns and relationships observed, including a discussion of the possible underlying drivers or causes. Analysis could be at neighbourhood, borough, or city scales.
You are free to develop a topic that speaks to your research and study interests, but some possible topics include: the impact of Airbnb on housing; the relationship between air pollution and deprivation; and the impact of green space and roads on air pollution. The code used to create the supplied data set is available for those who wish to extend it with new data. Feel free to discuss your ideas with the module co-ordinator, especially if you wish to use data not supplied to you.
Your submission should include a balanced assessment of the strengths and limitations of the data (e.g. what is recorded, what is not recorded, what is potentially misleading, etc.), as well as a justification of the methods used in your analysis. The focus of this assessment is a demonstration of judgement and understanding, not mindlessly applying every technique acquired during the term.
The report should be structured using the following sub-headers:
Introduction: to set the context for your analysis, including a brief overview of relevant literature;
Data and Methods: briefly describe the origin of the data and the rationale for any transformation/manipulation of the data;
Results: present an analysis (not simply a summary) of your data using charts, maps and tables (ensure these are embedded in text);
Discussion: reflect on the possible drivers or causes of the Results, including commenting on the weight of evidence provided (e.g. the strengths and weaknesses of analyses and data used);
Summary: briefly wrap up your report with the key conclusions you want the reader to take away.
Figures and summary tables should be used and be well-presented. Use of wider literature to support discussion and analysis is important. Any code used for analysis should be presented in an Appendix (not in the main body of the report).
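For the requirement to examine the relationship between at least two variables, one common starting point is a Pearson correlation coefficient alongside the descriptive statistics. The sketch below is only one possible technique, not one prescribed by the brief; the function name is an assumption, and the two input lists stand in for any pair of numeric columns.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A coefficient on its own says nothing about causes or spatial pattern, so it would normally sit beside maps and a discussion of the data's limitations, and code like this belongs in the Appendix.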

https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz – data to be used