29 Feb 2016

20Minutos Blog Awards 2015 Statistics with Excel and R

#52: 20Minutos Blog Awards 2015 Statistics with Excel and R

NB: If you prefer to read this blog in Spanish, remember that with Google Blogger (the tool used for this blog) it is possible to translate this blog from English to Spanish. The 'Google Translator' is in the menu on the right.

The Spanish newspaper 20Minutos is organizing the 10th edition of its Blog Awards ("Premios 20Blogs"), with 7652 participating blogs from (mostly) Spanish-speaking bloggers. Until the 2nd of March, everybody with a 20Minutos account can vote 20 times, once for every blog category, and the blog with the most votes wins the public prize, although the jury decides who will be the winner of the 1st prize. And like last year, I'm participating with this blog. 20Minutos allows you to promote your own blog to attract votes in the Blog Awards, so if you like my blog, you can vote for it here:

http://lablogoteca.20minutos.es/excel-ejemplos-para-tu-trabajo-deportes-y-mas-50022/0/

To do so, you have to register here:

http://www.20minutos.es/usuarios/registro/

But enough self-promotion, let's start with what this blog is about: MS Excel. For the 2014 blog competition I made an Excel workbook with some statistics, see:
http://worktimesheet2014.blogspot.com/2015/02/excel-statistics-for-20minutos-blog.html

and for this year's edition I did the same, but this time I made different statistics, and not only with Excel but also with R. My idea was to get all the blog records (with data like URL, Author, Title, Description and the Category in which the blog is participating) into one Excel workbook, and to get all this data I used the (free) tool https://www.import.io/ for web-scraping the 20Min. website of the blog contest: http://lablogoteca.20minutos.es/premios-20blogs/

For more info about this great data-extract tool, see e.g.:

http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

https://www.youtube.com/watch?v=V8lHaaAWGXc&feature=youtu.be
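
BTW: If you prefer to stay inside R for the scraping, something along the lines of the sketch below, using the rvest package, could do a similar job. Note that the CSS selectors ".blog-title" and ".blog-url a" are only placeholders, since the exact HTML structure of the 20Min. page may differ (working that out is exactly what import.io does for you):

# Sketch: scrape blog titles and URLs with rvest (the CSS selectors are hypothetical)
library(rvest)
page   <- read_html("http://lablogoteca.20minutos.es/premios-20blogs/")
titles <- page %>% html_nodes(".blog-title") %>% html_text(trim = TRUE)
urls   <- page %>% html_nodes(".blog-url a") %>% html_attr("href")
blogs  <- data.frame(Title = titles, URL = urls, stringsAsFactors = FALSE)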

The result of the import had some 300 duplicate records, but with Excel it's easy to eliminate duplicates, see e.g.:

https://support.office.com/en-us/article/Filter-for-unique-values-or-remove-duplicate-values-ccf664b0-81d6-449b-bbe1-8daaec1e83c2
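
And if you keep the scraped records in R instead of Excel, duplicates can be removed there just as easily; a minimal sketch, assuming the records sit in a data frame called blogs with the blog URL in a column URL:

# Sketch: remove duplicate records in R
blogs <- blogs[!duplicated(blogs$URL), ]   # keep the first occurrence of each URL
# or, to drop only the rows that are identical in every column:
blogs <- unique(blogs)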

To this dataset I added some extra calculated fields, like the country of the blog (only possible if the domain name of the website has a country code like "es" for Spain) and the blog platform (Blogger, Wordpress, etc.), see fig.2a.
NB: I got my (V)lookup table with the mapping country code - country from here:
http://www.web-wise-wizard.com/internet-dns-web/internet-iddd-country-codes.html
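
In R the same two calculated fields can be sketched as follows; the country_lookup data frame (with columns Code and Country) stands for the lookup table mentioned above, and match() plays the role of Excel's VLOOKUP:

# Sketch: derive country code, country and platform from the blog URL
host <- sub("^https?://", "", blogs$URL)      # strip the protocol
host <- sub("/.*$", "", host)                 # keep only the host name
blogs$CountryCode <- sub(".*\\.", "", host)   # last label of the domain, e.g. "es"
blogs$Country     <- country_lookup$Country[match(blogs$CountryCode, country_lookup$Code)]
blogs$Platform    <- ifelse(grepl("blogspot",  host), "Blogger",
                     ifelse(grepl("wordpress", host), "Wordpress", "Other"))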

And with this dataset, I made the statistics which you can see in figures 2b, 3, etc. below, using pivot tables and pivot charts. But before getting into more detail about these stats, I present in fig.1 a Word Cloud for which I used R (the script is included in the Download section below). I used R to make the word cloud (instead of using a site like http://www.wordle.net/) to learn a bit about R, and because of its extensive Text Mining capabilities. Some sites which helped me with this are:

https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/

http://www.webmining.cl/2012/07/text-mining-de-twitter-usando-r/

To make this word cloud, I used the column Description (of a blog), a 'memo field'.
Not surprisingly, the most used word is... 'Blog', followed by 'World', 'Life' and 'Information'.
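
The full script is in the Download section below, but as an illustration, a word cloud like this can be built with the tm and wordcloud packages along the following lines (the column name Description matches the field in my Excel sheet; using the Spanish stop-word list is an assumption, since most blog descriptions are in Spanish):

# Sketch: word cloud of the blog descriptions with tm + wordcloud
library(tm)
library(wordcloud)
corpus <- Corpus(VectorSource(blogs$Description))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)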



fig.1 Word Cloud of Blog-descriptions of all blogs participating in 20Min. Blog-contest.





fig.2a: Data imported into Excel from 20Min. (formula columns that are not of interest to the user are hidden).



fig.2b: Total Blogs per Category.

NB: The totals in this chart correspond almost 100% to the ones on the 20Min. site (http://lablogoteca.20minutos.es/premios-20blogs/), as they should, except for a difference of 1: in my stats the category Viajes has 404 records, not 405 as on the 20Min. site. I wonder whether that is due to an error in my data extraction from the site or perhaps an error by 20Min.?




fig.3: Total Blogs per Platform

NB: Google's Blogger (domain Blogspot) appears more popular than Wordpress.




fig.4: Total Blogs per Country

NB: For those of you who have never heard of Tokelau: this (is)land offers a free .tk domain for 1 year (for everyone, so not only for the people of TK); after that you have to pay, or they close your site, which is what happened to my site, http://www.eigersoftware.tk.





fig.5: Total Blogs per Author


NB:
- this pivot-table report filters authors with >= 7 blogs; for how to do this filtering in Excel, see the link below, and for an R alternative see the sketch after this list:
http://stackoverflow.com/questions/12209841/filtering-pivot-table-columns-only-count-if

- 'Grupo LM' has a total of 28 blogs in the competition, and I found out that this is a company which pays bloggers if they send them a blog post. To see all 28 blogs, double-click on the measure value '28' in the pivot table, and Excel opens a new worksheet with the 28 records.
I wonder why the blog authors who work for LM did not sign up for the blog contest under their own names.
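
As an alternative to the pivot-table filter above, the same top-authors report can be sketched in a few lines of base R (assuming the Author column from the Excel sheet):

# Sketch: authors with 7 or more blogs in the competition
author_counts <- table(blogs$Author)
sort(author_counts[author_counts >= 7], decreasing = TRUE)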

And to conclude this post, I'd like to say good luck ('mucha suerte') to all my fellow bloggers in the "Premios 20Blogs" Blog Awards.
BTW: To decide which blog to vote for, I also searched for 'Excel' in my Excel and found several blogs, e.g. this excellent blog from Argentina (which I voted for):

http://excelmascerca.blogspot.com.es/


Downloads:


#Mirror 1: Google Drive (zip file with Excel and R-script files):

https://goo.gl/ZAf54w


#Mirror 2: Microsoft Onedrive (Excel file):

http://1drv.ms/1REIjjC


#Mirror 3: Scribd.com (PDF file):

https://www.scribd.com/doc/301523553/20minutos-blogcompetition-2015-statistics-with-excel-and-r


