<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Data Scientist</title>
	<atom:link href="http://www.thedatascientist.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.thedatascientist.com</link>
	<description>Mine, Visualize, and Learn</description>
	<lastBuildDate>Fri, 13 Jan 2012 19:51:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>How Turntable Will Save The Music Industry</title>
		<link>http://www.thedatascientist.com/2012/01/10/turntable-saves-music-industry/</link>
		<comments>http://www.thedatascientist.com/2012/01/10/turntable-saves-music-industry/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 09:08:11 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Journal]]></category>
		<category><![CDATA[grooveshark]]></category>
		<category><![CDATA[music]]></category>
		<category><![CDATA[turntable]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=220</guid>
		<description><![CDATA[As I jumped from room to room on Turntable.fm last night my eyes caught a glimpse of a rare room titled &#8220;AOKIxSOLREPUBLIC&#8221; . I clicked it with a fury. &#8220;Sorry due to fire codes you have been escorted out of the building&#8221; was the message I received, which was Turntable&#8217;s cute way of telling me that [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_239" class="wp-caption aligncenter" style="width: 570px"><a href="http://www.thedatascientist.com/wp-content/uploads/2012/01/aoki2.png"><img class=" wp-image-239" title="aoki" src="http://www.thedatascientist.com/wp-content/uploads/2012/01/aoki2.png" alt="" width="560" height="233" /></a><p class="wp-caption-text">DJ Aoki playing a special set and interacting with the audience on Turntable.fm</p></div>
<p style="text-align: left;">As I jumped from room to room on <a href="http://turntable.fm/" target="_blank">Turntable.fm</a> last night my eyes caught a glimpse of a rare room titled &#8220;AOKIxSOLREPUBLIC&#8221;. I clicked it with a fury. &#8220;Sorry due to fire codes you have been escorted out of the building&#8221; was the message I received, which was Turntable&#8217;s cute way of telling me that the room was already full. I reloaded the page. Same message. For the next 5 to 10 minutes I kept refreshing the page until I finally secured my seat. I was in.  I had one of 200 spots available to hear the famed DJ Aoki play some of his favorite tracks.  As I sat there listening to the music, I couldn&#8217;t help but think this was a monumental shift in the way music is being consumed online.  And even more incredibly, the value being delivered could potentially be worth billions to an industry used to losing at least that much.</p>
<p style="text-align: left;">I know, I know. This story has been told many times before.  There have been countless music startups that have been birthed (and subsequently deadpooled) over the years. Even so I stand by my claim, but first you&#8217;ll need a little bit more context.</p>
<div>I love music. I&#8217;m a programmer and an analyst so I&#8217;m often at a desk for 12 hours a day and music keeps me from going crazy.  I&#8217;ve been playing guitar for 12 years. I&#8217;ve surely spent thousands attending concerts and buying music.  For the past several years I have been able to merge my passions of programming and music by having the opportunity to work on a lovely little site called <a href="http://www.grooveshark.com">Grooveshark</a>, now boasting about 35 million users.  When I joined, I think I was employee number 20 or something like that and the 200k&#8217;th user of the service.  I remember our million user party being utterly insane for two reasons (other than the obvious debaucherous ones, of course&#8230;):</div>
<p>&nbsp;</p>
<ul>
<li>It was unthinkable that we had over 1 million people using our service, and</li>
<li>It was unthinkable that we were actually still in business (we had survived lawsuits, and many employees &#8211; including myself &#8211; had worked without pay for 6-10 months).</li>
</ul>
<p>&nbsp;</p>
<div>Sam Tarantino and Josh Greenberg, our CEO and CTO respectively, convinced me (well, all of us to an extent) that the future of music was not in digital sales, but in endorsement-like deals where bands and brands could be matched. This model made sense to me. If you&#8217;ve been paying any amount of attention to this space you&#8217;ve seen artists come out with products that undoubtedly will earn them much more than their record sales, from Miley Cyrus&#8217;s <a href="http://www.walmart.com/ip/Miley-Cyrus-Max-Azria-Juniors-Skinny-Jeans/13255569">jean line</a>, Jennifer Lopez&#8217;s <a href="http://www.thesmokinggun.com/documents/jennifer-lopez-phony-fiat-ad-564812">Fiat/Gucci faux-Bronx endorsements</a>, and Dr. Dre <a href="http://www.beatsbydre.com/">Beats</a> which have been incorporated into everything from headphones to PC speakers.  But the problem with this model is that these types of deals are only available to the uber-famous, where name recognition can go a long way.  So the question remains, how do you (as a music site) make artists money if they aren&#8217;t popular?  Grooveshark addressed this problem by letting unknown artists &#8220;piggyback&#8221; off popular artists.  That is, if a new country artist wanted their music heard, we would insert their music in the recommendations set after Lady Antebellum, a Grooveshark-signed popular artist.  This worked pretty well, but success wasn&#8217;t always guaranteed.  The recommendations had to be good for it to work, and often they weren&#8217;t (I should know&#8211; I worked on the algorithm for 8 months).  Other music services face this challenge as well and all of them seem to tackle the problem the same way, either through advertisements or recommendations.</div>
<p>&nbsp;</p>
<div>But Turntable&#8217;s model is entirely different. Its genius is in allowing users to have a stake in what they play, and more often than not, the platform is used for self-promotion, not unlike Twitter.   This model works amazingly well for music discovery.  For the past 6 or 7 months that I&#8217;ve been using Turntable, I have found close to 100 new songs.  I have developed a love for a genre of music (House) that I never even knew I had.  I have witnessed the rise of some great new artists like 3LAU and Dotcom that have taken the internet by storm.  Last week, I barely made it into another room entitled &#8220;3LAU Exclusive Releases&#8221; because people flocked to hear new music by the 21-year-old fledgling mashup artist.</div>
<p>&nbsp;</p>
<div>Turntable has allowed unknown artists to become popular simply by giving them the opportunity to let their work stand on its own.  All an artist has to do is become a DJ, play some of their music, and get rated by their peers.  They can even talk to their fans directly in the chat dialogue and get real-time feedback, as if the voting alone wasn&#8217;t enough of an indication.  If people like it, their rank increases.  The higher the rank, the more brand recognition they develop.  And because this all happens online, an unknown artist like 3LAU can put on a queue of his music for 15 minutes, go take a shower, and come back to find that he has 200 rabid followers.  And for people like me who can&#8217;t get into those packed rooms, there should be an option to pay for a seat instead of getting kicked out because I showed up late.  I&#8217;d be more than happy to give 3LAU, DJ Aoki, and Turntable $10-$20 to hear a new set.</div>
<p>&nbsp;</p>
<div>And this is what will save the music industry.   The record labels can scout out new artists by just lurking around.  They can arrange to have headlining artists DJ their own rooms and charge an entrance fee. They can simulcast live concerts and enable potentially millions more people to hear their favorite bands.  The artists themselves now have a decent shot at becoming massively popular, even if they choose to forgo working with the major labels.  And it would be possible for them to make a decent online income if Turntable allowed them to collect money for hosting rooms.  Lastly, Turntable could have a revenue source that wouldn&#8217;t be the root of their downfall.  Yes, licensing fees will surely increase as their popularity increases, but so will their bookings revenue.  And as their self-promotion platform becomes more established, licensing may in fact represent a decreasing proportion of their costs.</div>
<p>&nbsp;</p>
<div>If past is prologue, then all this speculation can go down as just another piece about what online music could have been.  But for once, I&#8217;m optimistic about this industry.  I&#8217;m optimistic that I won&#8217;t have to change music services once the majors kill it off in 5 years.  I&#8217;m optimistic that artists will get the respect, and the income, they deserve.  I&#8217;m optimistic that I&#8217;ll get to &#8220;go to concerts&#8221; when I&#8217;m stuck in an office churning through numbers. And lastly, I&#8217;m optimistic that we can finally put this issue behind us.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2012/01/10/turntable-saves-music-industry/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tips For Launching Successful Mobile Apps</title>
		<link>http://www.thedatascientist.com/2012/01/03/mobile-app-tips/</link>
		<comments>http://www.thedatascientist.com/2012/01/03/mobile-app-tips/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 09:16:26 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Journal]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[admob]]></category>
		<category><![CDATA[angry birds]]></category>
		<category><![CDATA[apple]]></category>
		<category><![CDATA[cpc]]></category>
		<category><![CDATA[crowdstar]]></category>
		<category><![CDATA[gree]]></category>
		<category><![CDATA[halo]]></category>
		<category><![CDATA[inmobi]]></category>
		<category><![CDATA[iphone]]></category>
		<category><![CDATA[ltv]]></category>
		<category><![CDATA[minomonsters]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[retention]]></category>
		<category><![CDATA[tapjoy]]></category>
		<category><![CDATA[udid]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=199</guid>
		<description><![CDATA[I recently spent some time looking at a lot of Free-To-Play Games data, mostly through App Data and App Annie, as well as played somewhere around 84 different games from the top games studios. This data is freely accessible but a pain to cobble together. Compiling all this market research took quite a bit of manual labor, but I learned a few things about this competitive market and thought it would be beneficial to my readers as well.  Many of you are developers and might find these types of insights valuable, not only for developing games, but other applications as well.  So here you go!  My Forbes-like list of Top  5 Tips to Make your Game Successful:]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.thedatascientist.com/wp-content/uploads/2012/01/mobile-apps-2.jpg"><img class="aligncenter size-medium wp-image-208" title="mobile-apps-2" src="http://www.thedatascientist.com/wp-content/uploads/2012/01/mobile-apps-2-300x162.jpg" alt="" width="300" height="162" /></a></p>
<p><em>Like this article?  Want to receive updates on data sets I create, other articles, or updates from the community?  Subscribe to the mailing list!</em></p>
<p>Have you ever thumbed through the Games section of the App Store and wondered how some games get to the Top 25&#8211;and stay there?  Or if there were any common elements of gameplay that resonate well with certain groups of users?  Or perhaps you&#8217;re a developer who&#8217;s interested in iPhone games and you&#8217;ve observed some anecdotal clues but have had trouble finding data to support your intuition?  Do you want to make more money with your existing iPhone game?</p>
<p>I recently spent some time looking at a lot of Free-To-Play Games data, mostly through <a href="http://www.appdata.com/">App Data</a> and <a href="http://www.appannie.com/">App Annie</a>, and played somewhere around 84 different titles from the top games studios. This data is freely accessible but a pain to cobble together.  Why would I do such a thing&#8230;you might ask? Well, partly because I love data and love uncovering something interesting from disparate data sets.  Partly because I make my own iPhone apps and am trying to figure out how to navigate the App Store&#8217;s competitive landscape.  And partly because <a href="http://www.linkedin.com/in/physcab">I work</a> at <a href="http://www.gree-corp.com/">GREE</a> as a data analyst and want to fully understand the type of product GREE is bringing to the market, and its potential users.  I also happen to be a gamer myself!! (SC2, Angry Birds, Halo, and MinoMonsters)</p>
<p>Compiling all this market research took quite a bit of manual labor, but I learned a few things about this competitive market and thought it would be beneficial to my readers as well.  Many of you are developers and might find these types of insights valuable, not only for developing games, but other applications as well.  So here you go!  My Forbes-like list of Top  5 Tips to Make your Game Successful:</p>
<h3>Tip #1:  If you want to make money, target women</h3>
<p>The data overwhelmingly supports the notion that women are the biggest spenders in the Top Grossing ranks of the App Store&#8217;s Games section.  At the risk of appearing sexist, I classified all 84 titles as either male or female depending on what I thought was a best-guess target market.  For example, I made the assumption that the games <em>Top Girl</em> and <em>Social Girl</em> by Crowdstar are titles targeted to women, while <em>Blood and Glory</em>, a title by Glu Mobile, would be targeted to men.  And if there were games where I felt too conflicted to make the call, e.g. Angry Birds, then I labeled them as &#8220;Either&#8221;.  The result?  46 games were targeted to women, 34 to men, and 4 either.  The average top grossing rank of a female-targeted game was 113 vs 145 for male-targeted games.  Revenue is tied to rank, and it grows nearly exponentially as rank improves.  From sources that I&#8217;ve gathered from around the web, most notably <a href="http://www.insidemobileapps.com/">Inside Mobile Apps</a>, female-targeted games bring in almost <strong>$100k more per month, on average, across the top 100 ranked games.</strong></p>
<h3>Tip #2:  Offer a grab-bag of virtual currency as an In-App Purchase</h3>
<p>While I&#8217;m still tabulating numbers, I noticed that most (meaning possibly &gt; 70%) of the Top Grossing games offered virtual &#8220;currency&#8221; as an In-App Purchase (IAP).  I use the word currency as a broad term here.  For some games like <em>Cityville</em> by Zynga, the IAPs are quite literally USD for Zynga Dollars, or whatever their equivalent name is.  Probably something like &#8220;Bag O&#8217; Dolla&#8217;s&#8221;.  Other games trade &#8220;energy&#8221; for money, meaning if you want certain events to go faster, you pay more money.  I can think of a few reasons why game designers do this: a) It makes the most sense for users, so there is less friction during the buying process. b) It&#8217;s easy for game developers, because Apple&#8217;s IAP process is a royal pain in the ass to implement, and c) It&#8217;s easy to scale for whales (see Tip #3).</p>
<h3>Tip #3:  Plan BIG for Whales</h3>
<p>It&#8217;s no secret that there is this crazy phenomenon within games where your most loyal users will spend ridiculous amounts of cash to improve their gameplay experience.  This will stun you, but it has been vetted by Zynga&#8217;s S-1 filing: as much as 25-50% of Zynga&#8217;s revenues come from just 1% of its users.  There are even <a href="http://www.businessweek.com/magazine/zyngas-quest-for-bigspending-whales-07072011.html">articles written</a> about it.  These ballin&#8217; users are dubbed &#8220;whales&#8221; and you need to absolutely, without a doubt, target/incorporate/cater/suck-up to them in your game design. They will reward you handsomely.  So how do you plan big??  Take whatever &#8220;currency&#8221; you&#8217;ve decided on for your game, and offer a few top tiers around the $50 to $100 per purchase mark.  Simple as that.</p>
<h3>Tip #4:  You will probably have to buy users, but you won&#8217;t know what good it&#8217;ll do</h3>
<p>Not much of a tip, I know.  But in this competitive landscape, you&#8217;ll have to at least start out by buying users through a network such as AdMob, InMobi, or TapJoy.  Your main objective is to boost the ranking of your app as high as your budget will allow.  Remember, revenue scales nearly exponentially with App Store ranking, at least within the Top Grossing Games category.  The environment right now sucks though, because unlike with Google Ads for the web, your conversion funnel will be crippled by Apple&#8217;s policies and distribution.  For example, tying a download to a CPC creative has been problematic ever since Apple announced the deprecation of the UDID.  Some networks show download conversions, some do not.  It will be your responsibility to A/B test and figure out if the value of your users is worth more than the ad spend.  The good news is that clicks can be bought for as low as $0.04-$0.08.  You just won&#8217;t know whether these users actually spent the most money on your game or referred others who did.</p>
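<p>To make that break-even question concrete, here is a rough sketch of the arithmetic, not a measured result.  The click prices come from the range above; the click-to-install rate and the lifetime value (LTV) per user are made-up placeholder numbers that you would replace with your own A/B test results.</p>

```python
def cost_per_install(cost_per_click, install_rate):
    """Ad spend needed to win one install, given the share of clicks that install."""
    if not 0 < install_rate <= 1:
        raise ValueError("install_rate must be in (0, 1]")
    return cost_per_click / install_rate


def is_worth_buying(cost_per_click, install_rate, ltv_per_user):
    """True when the assumed lifetime value of a user exceeds the cost of acquiring one."""
    return ltv_per_user > cost_per_install(cost_per_click, install_rate)


# Placeholder inputs: 2% of clicks install, and each user is assumed to be worth $3.00.
for cpc in (0.04, 0.08):
    cpi = cost_per_install(cpc, install_rate=0.02)
    worth_it = is_worth_buying(cpc, 0.02, 3.00)
    print(f"CPC ${cpc:.2f} -> cost per install ${cpi:.2f}, worth it: {worth_it}")
```

<p>The point of the sketch is only that a cheap click is not a cheap user: the install rate sits in the denominator, so the same LTV can clear the bar at one click price and miss it at another.</p>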
<h3>Tip #5: Push, Push, Push, and Push some more</h3>
<p>Push notifications are a remarkable retention mechanism on the iPhone.  But use them intelligently because they can create an annoyance for users or can be quite embarrassing.  During my testing (I swear this was testing!) I was repeatedly pinged about the status of my date on <em>Social Girl </em>or that my teepee reward was ready on <em>Alien Family.</em>  Since push notifications give the same sound (often, but sometimes not) and the same vibration as an SMS, I was often fooled into opening the notification at a time that wasn&#8217;t entirely ideal. Like drinking with my friends at a bar. Ouch.  With that said, because these notifications are so uniform between applications, it&#8217;s a viral channel that you must utilize. And surprisingly, many popular apps, like Tetris and Angry Birds, do not.</p>
<p>I hope I fused a little bit of data and a little bit of anecdotal experience to shed some light on what you can expect when developing an application for Apple&#8217;s App Store, particularly a Free to Play game.  While not entirely conclusive, I will continue to firm up these numbers and offer more concrete resources (possibly some open-source datasets!) for the community.  If you&#8217;d like to be notified when these data sets become available, please sign up for my mailing list!</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2012/01/03/mobile-app-tips/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Big Data is Useless Without Science</title>
		<link>http://www.thedatascientist.com/2011/11/12/big-data-is-useless-without-science/</link>
		<comments>http://www.thedatascientist.com/2011/11/12/big-data-is-useless-without-science/#comments</comments>
		<pubDate>Sat, 12 Nov 2011 19:29:31 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Journal]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[cassandra]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[flume]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[mahout]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[voldemort]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=187</guid>
		<description><![CDATA[Today, companies have more customer data than they can handle. Like a digital version of the show Hoarders, companies try to keep every bit of detail for as long as possible with the hope that one day these useless bits can be turned into massive new revenue opportunities. Over the past five years, bright engineers have devised open-sourced solutions to store and process the data deluge. We now even have a “big data stack” — that is, a framework for commoditizing data.
]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post that first appeared on the Kontagent blog, found here:</em></p>
<p><a title="Big Data is Useless Without Science" href="http://kaleidoscope.kontagent.com/2011/11/09/big-data-is-useless-without-science/"><em>http://kaleidoscope.kontagent.com/2011/11/09/big-data-is-useless-without-science/</em></a></p>
<p>&nbsp;</p>
<p>Many years ago I researched explosives by shining a light on them. It was every bit as exciting as it sounds. We would shine a light, take a picture, then study the explosive to see if it changed. I would painstakingly scour thousands of data points, looking for small fluctuations in intensity, signs of discoloration, or any statistically significant feature. We collected immense amounts of data from sensors, but the explosive always looked the same when we took snapshots. Then eventually we found out that if we looked not just at the snapshots, but also at the differences between the snapshots using a mathematical formula, we could see dramatic changes. We found out that every explosive was different, and we could effectively detect an explosive from a distance by just shining a light. Today, that research is being used to scan people before they enter airports for bombs.</p>
<p dir="ltr">Today, companies have more customer data than they can handle. Like a digital version of the show <a href="http://www.aetv.com/hoarders/" target="_blank">Hoarders</a>, companies try to keep every bit of detail for as long as possible with the hope that one day these useless bits can be turned into massive new revenue opportunities. Over the past five years, bright engineers have devised <a href="http://hadoop.apache.org/">open-sourced solutions</a> to store and process the data deluge. We now even have a “big data stack” — that is, a framework for commoditizing data.</p>
<div id="attachment_216" style="text-align: center;"><a href="http://kaleidoscope.kontagent.com/wp-content/uploads/2011/11/CB1.jpg"><img class="aligncenter" title="Big Data is Useless without Science 1" src="http://kaleidoscope.kontagent.com/wp-content/uploads/2011/11/CB1.jpg" alt="" width="152" height="164" /></a><em>A simplified big data framework</em></div>
<p>&nbsp;<br />
At a previous job, I was essentially the housekeeper of our home-built data infrastructure based on <a href="http://hadoop.apache.org/" target="_blank">Hadoop</a>. The idea was to create a monolithic tracking system that would record all anonymous user data, batch it up and store it in a distributed store. We’d then process and query it to find fascinating facts about our users that would help drive product roadmaps. At least that was the idea. Sound familiar? In practice though, we weren’t exactly sure what to track. Which user attributes were meaningful? How would we identify statistically significant behavior? Were we sure of data accuracy enough to influence the product roadmap? It turns out that we–and many other companies it seems–neglect a major component of the “big data stack”: science. So let’s modify our framework.</p>
<div id="attachment_217" style="text-align: center;"><a href="http://kaleidoscope.kontagent.com/wp-content/uploads/2011/11/CB2.jpg"><img class="aligncenter" title="Big Data is Useless without Science" src="http://kaleidoscope.kontagent.com/wp-content/uploads/2011/11/CB2.jpg" alt="" width="139" height="202" /></a><em>An actionable big data framework</em></div>
<p>&nbsp;</p>
<p style="text-align: left;">This diagram may seem a bit strange if you’re a software engineer. Because typically when you talk to engineers about big data, you’ll hear a litany of tool sets that sound like characters out of a Harry Potter novel: Voldemort, Pig, HDFS, Oozie, Zookeeper, Flume, Hive, Cassandra… you get the picture. We have yet to get to a point where science can be commoditized, and perhaps it never will (though <a href="http://mahout.apache.org/" target="_blank">Mahout</a> is a step in the right direction). Despite all these tool sets, scientists will always be needed for their intuition, interpretation, and curiosity. Scientists are needed to analyze the business needs of a customer and ask the right questions to solve critical business problems. Scientists are needed to transform an ugly piece of log data into a beautiful infographic that can spur an organization to launch a new product, bolster existing services, or otherwise remain nimble in a highly competitive economic environment.</p>
<p style="text-align: left;">These scientists are not scientists in the traditional sense. Their domain isn’t necessarily their bachelor’s, master’s, or dissertation topic. On the surface, my background in explosive materials doesn’t sound like it helps me with my day job of helping clients understand user behavior in their mobile and social applications. Instead, I’ve learned that “data scientists,” no matter what their background, specialize in providing insight by using keen analytical and quantitative skills. If needed, they will clean, explore, and model data sets to create new information products and key metrics. These scientists are not in a cubicle doing mundane research towards an elusive goal. They are highly collaborative and “high-touch,” that is, they constantly communicate with a key stakeholder so an end goal is reached.</p>
<p>The scientific process and research I conducted on explosives was instrumental in creating a product that went to market. The organization I was a part of taught me to be curious and to look at data sets in <em>new</em> ways. We had the infrastructure to store, process, and query the data, but ultimately it was our<em> insight</em> that produced a working prototype. In today’s data-driven world, the organizations that best leverage their data and invest in the right people to derive insights from that data will gain large competitive advantages over organizations that fly blind.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/11/12/big-data-is-useless-without-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A data experiment gone wrong</title>
		<link>http://www.thedatascientist.com/2011/08/30/a-data-experiment-gone-wrong/</link>
		<comments>http://www.thedatascientist.com/2011/08/30/a-data-experiment-gone-wrong/#comments</comments>
		<pubDate>Tue, 30 Aug 2011 01:24:18 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Journal]]></category>
		<category><![CDATA[ben fry]]></category>
		<category><![CDATA[choropleth]]></category>
		<category><![CDATA[d3]]></category>
		<category><![CDATA[fips]]></category>
		<category><![CDATA[map]]></category>
		<category><![CDATA[nathan yau]]></category>
		<category><![CDATA[zipcode]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=148</guid>
		<description><![CDATA[Sometimes projects do not go as planned.  Unfortunately, the first project I wanted to showcase on this blog was the one that bit the dust.  But I&#8217;ll share a post mortem on it anyway since the lessons were valuable. The Question:  Do Southern states really listen to more country music? All good experiments start with a question. [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes projects do not go as planned.  Unfortunately, the first project I wanted to showcase on this blog was the one that bit the dust.  But I&#8217;ll share a post mortem on it anyway since the lessons were valuable.</p>
<h2>The Question:  Do Southern states really listen to more country music?</h2>
<p>All good experiments start with a question.  Before every project, I come up with a question I want to answer so I do not lose focus as the work gets more complicated.  For this project I wanted to pick something insightful and relatable.  An excellent example of a model I wanted to follow is <a title="U.S Unemployment Map" href="http://flowingdata.com/2009/11/04/unemployment-2004-to-present-the-country-is-bleeding/">Nathan Yau&#8217;s U.S Unemployment Map</a>.  This is ultimately how I wanted to display the answer to the country music question.  I envisioned in my head that Louisiana, Mississippi, Georgia, and Florida would all be dark-colored and the rest of the United States would be lighter shades of that same color.  Assuming that color represented country music popularity, that would mean that the answer to my question was &#8220;YES&#8221;!  I also envisioned that if I got access to a dataset that was timestamped and collected data for a few weeks, that maybe I&#8217;d be able to put together a fancy little movie showing the transition of certain genres over time.  Seeing this visualization could perhaps point to macro trends like the distribution of a new single by a popular artist, a concert that was making its way across the U.S, and of course the affinity certain counties had for certain genres of music.</p>
<p><span class="Apple-style-span" style="font-size: 20px; font-weight: bold;">The Twitter #NowPlaying Dataset</span></p>
<p>The next step was to get the data.  Data requirements for this experiment were: recency (semi-important), free (important), large (important), location-based (important), and parse-able (semi-important).  I immediately thought of Twitter as an appropriate data supply.  Twitter has an excellent push API that constantly sends new status updates that contain all the important bits about a Tweet you can think of: urls, hashtags, mentions, timestamps, and location. All for free. Requirements, check!  Now, I didn&#8217;t really want just <em>any</em> tweet, I wanted tweets just about music.  As a casual Twitter user, I&#8217;ve often seen how people will use the #nowplaying hashtag to say what they&#8217;re currently listening to on their streaming music platform of choice.  So I decided to filter the API looking only for the #nowplaying hashtag.  Data started to flow in like Niagara, so I wrote a little program to catch the tweets, batch them up, and dump them to disk.  I then promptly forgot about that process and came back a couple weeks later.</p>
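<p>The catch, batch, and dump step can be sketched in a few lines of Python.  This is a minimal illustration rather than the original program: it assumes the tweets arrive as an iterable of dictionaries with a <code>text</code> field, and the function names, file naming scheme, and batch size are all made up for the example.  Connecting to Twitter itself is left out.</p>

```python
import json
from pathlib import Path


def has_nowplaying(tweet):
    """Return True when the tweet text carries the #nowplaying hashtag."""
    return "#nowplaying" in tweet.get("text", "").lower()


def dump_in_batches(tweets, out_dir, batch_size=1000):
    """Filter tweets, group them into batches, and write one JSON-lines file per batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []  # paths of the batch files written so far
    batch = []    # tweets waiting to be flushed to disk

    def flush():
        path = out_dir / f"batch_{len(written):05d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for item in batch:
                fh.write(json.dumps(item) + "\n")
        written.append(path)
        batch.clear()

    for tweet in tweets:
        if has_nowplaying(tweet):
            batch.append(tweet)
            if len(batch) >= batch_size:
                flush()
    if batch:
        flush()  # write the final partial batch
    return written
```

<p>Writing one file per batch keeps memory use flat no matter how long the process runs, which matters if you plan to forget about it for a couple of weeks.</p>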
<h2>Problems</h2>
<p>When I began to sift through the data to put my visualization together, I began to realize that the mountainous amounts of data I was expecting turned out to be really not that much.  The raw data was quite large, but the final form of the data I needed to make the visualization work was &lt;FIPS Code, Value&gt;  where FIPS is the <a title="FIPS (Federal Information Processing Standard) useful for making geographic maps" href="http://en.wikipedia.org/wiki/Federal_Information_Processing_Standard">Federal Information Processing Standard</a> for plotting county-level data and value was the popularity value for genre in that region.  I thought I&#8217;d use raw tallies of the genre for that FIPS code to start off.  However, Twitter doesn&#8217;t give location data in FIPS.  In fact, Twitter gives what users end up writing in the location field, so that could be anywhere from the nicely labeled &#8220;San Francisco, CA&#8221; to the absurd but Twitter-normal &#8220;Somewhere in the middle of nowhere&#8221;.  After filtering out all these &#8220;bad&#8221; locations, I was left with a fraction of the raw data.</p>
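<p>Here is a hedged sketch of what that location filter could look like.  It only accepts strings shaped like &#8220;City, ST&#8221; and looks them up in a tiny hand-made table; the table, the regular expression, and the function name are assumptions for illustration, and a real run would need a full city-to-county mapping.</p>

```python
import re

# Tiny illustrative lookup; a real run would need a full city-to-county table.
CITY_TO_FIPS = {
    ("san francisco", "CA"): "06075",
    ("los angeles", "CA"): "06037",
}

# Matches "City, ST": a city name, a comma, and a two-letter state code.
LOCATION_RE = re.compile(r"^\s*([A-Za-z .'-]+?)\s*,\s*([A-Za-z]{2})\s*$")


def location_to_fips(location):
    """Return a county FIPS code for 'City, ST' strings, or None if unusable."""
    match = LOCATION_RE.match(location or "")
    if not match:
        return None
    city, state = match.group(1).lower(), match.group(2).upper()
    return CITY_TO_FIPS.get((city, state))


for raw in ("San Francisco, CA", "Somewhere in the middle of nowhere"):
    print(repr(raw), "->", location_to_fips(raw))
```

<p>Anything that does not parse, or parses to a city missing from the table, comes back as <code>None</code> and gets dropped, which is exactly how a big raw dataset shrinks to a fraction of itself.</p>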
<p>The next step was to translate the #nowplaying data to genre data.  For example, a typical entry might look like &#8220;#nowplaying Lady Gaga &#8211; Monster&#8221;.  To translate the Artist/ Song combo to Genre, I grabbed a <a title="Discogs" href="http://www.discogs.com/">Discogs</a> dataset with Artist to Genre columns and Song to Genre columns, then matched them to terms within the nowplaying string.  Of course, after this string matching, many entries were eliminated because they either a) weren&#8217;t correctly formatted as Artist &#8211; Song or Song- Artist b) Didn&#8217;t have Discogs data or c) Discogs data didn&#8217;t have any genre data.  After all these filters and matching and distillation, I then reduced the set to FIPS, Value pairs where I could then plot them with the convenient <a title="D3.js Choropleth example" href="http://mbostock.github.com/d3/ex/choropleth.html">D3.js library</a> choropleth.  The result?  Well, not exactly what I had <a title="Excellent U.S Thematic Map Example: U.S Unemployment Visualized" href="http://flowingdata.com/2009/11/12/how-to-make-a-us-county-thematic-map-using-free-tools/">envisioned</a> <img src='http://www.thedatascientist.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/08/choropleth_pop.png"><img class="aligncenter size-full wp-image-149" title="choropleth_pop" src="http://www.thedatascientist.com/wp-content/uploads/2011/08/choropleth_pop.png" alt="" width="588" height="310" /></a>  U.S Choropleth Thematic Map Fail</p>
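<p>The matching and tallying steps that led to this map can be sketched roughly as follows.  This is an illustration under assumptions, not the original code: the one-entry <code>ARTIST_GENRE</code> dictionary stands in for the Discogs dataset, and the function names are invented.</p>

```python
import re
from collections import Counter

# Stand-in for the Discogs artist-to-genre data (hypothetical contents).
ARTIST_GENRE = {"lady gaga": "Pop"}


def parse_nowplaying(text):
    """Split '#nowplaying X - Y' into (X, Y); return None if malformed."""
    body = re.sub(r"#nowplaying", "", text, flags=re.IGNORECASE).strip()
    # Accept a hyphen or an en dash surrounded by whitespace.
    parts = re.split(r"\s+[-\u2013]\s+", body, maxsplit=1)
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0].strip(), parts[1].strip()


def genre_for(text):
    """Look up a genre, trying both Artist - Song and Song - Artist."""
    parsed = parse_nowplaying(text)
    if parsed is None:
        return None
    for term in parsed:
        genre = ARTIST_GENRE.get(term.lower())
        if genre:
            return genre
    return None


def tally(records, genre):
    """Count tweets of `genre` per FIPS code from (fips, text) pairs."""
    counts = Counter()
    for fips, text in records:
        if fips and genre_for(text) == genre:
            counts[fips] += 1
    return counts


records = [
    ("06075", "#nowplaying Lady Gaga - Monster"),
    ("06075", "#nowplaying Monster - Lady Gaga"),
    (None, "#nowplaying Lady Gaga - Monster"),  # location was unusable
]
print(tally(records, "Pop"))
```

<p>The resulting counter is already in the FIPS-to-value shape that the choropleth expects, and every <code>None</code> along the way is another discarded tweet.</p>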
<h2>Lessons Learned</h2>
<p>After all was said and done, I did learn some important lessons.</p>
<p>1. <strong> U.S. thematic maps require loads of data</strong>.  While you don&#8217;t necessarily need to visualize every county in the U.S., to show even a majority of the counties you need a very detailed dataset.  If you scrape the data from a popular service (e.g. Twitter), you&#8217;re going to be bound by that service&#8217;s own popularity in specific areas of the U.S. (or elsewhere).  The reason why visualizing U.S. unemployment works but music genres don&#8217;t is that the U.S. Department of Labor <em>keeps very accurate data for every county in the U.S.</em>  So choropleth maps might make sense for other similarly detailed datasets.</p>
<p>2.  <strong>ZIP codes &gt; FIPS??? </strong>When you visualize data, you have to munge the data into the right units.  The aforementioned U.S. choropleth map takes in a key-value pair of FIPS code to magnitude value.  That didn&#8217;t work for this experiment.  But there are plenty of other options.  For example, I could have used ZIP codes instead and plotted the data using <a title="Ben Fry Zipdecode for visualizing geographic data" href="http://benfry.com/zipdecode/">Ben Fry&#8217;s technique</a>.  Or if I really wanted to be fancy, I could have made my own custom boundaries from GPS data and translated them into shapefiles which could then be filled in.  But that&#8217;s perhaps another tutorial&#8230;</p>
<p>3. <strong>Take into account dataset filtering.</strong>  I&#8217;ll be honest, I completely underestimated how much data I would have to discard.  At every point along the pipeline of shuffling data from one stage to another, I seemed to lose about 50% of the volume.  I failed to account for how much this would affect my final results, especially after I reduced the raw data to its &#8220;plot-able&#8221; form.</p>
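<p>A quick back-of-the-envelope calculation shows how fast those losses compound.  The 50% figure is the rough estimate from above; the three stages and the one million starting tweets are hypothetical numbers chosen only to make the arithmetic concrete.</p>

```python
def surviving_fraction(keep_rates):
    """Multiply per-stage keep rates to get the overall surviving fraction."""
    fraction = 1.0
    for rate in keep_rates:
        fraction *= rate
    return fraction


# Illustrative stages: usable location, parseable artist/song, genre found.
stages = [0.5, 0.5, 0.5]
raw_tweets = 1_000_000  # hypothetical starting volume

print(surviving_fraction(stages))                    # 0.125
print(int(raw_tweets * surviving_fraction(stages)))  # 125000
```

<p>Three filters at 50% each leave only one tweet in eight, and that remainder still has to be spread across every county on the map.</p>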
<p>Hope this breakdown was beneficial. In the future I might put together the code samples I used to conduct this analysis. Until then&#8230;time to start planning the next project!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/08/30/a-data-experiment-gone-wrong/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Today I launched FitLabs, my first web product</title>
		<link>http://www.thedatascientist.com/2011/08/25/today-i-launched-fitlabs-my-first-web-product/</link>
		<comments>http://www.thedatascientist.com/2011/08/25/today-i-launched-fitlabs-my-first-web-product/#comments</comments>
		<pubDate>Thu, 25 Aug 2011 20:00:13 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[CrossFit]]></category>
		<category><![CDATA[FitLabs]]></category>
		<category><![CDATA[fitness]]></category>
		<category><![CDATA[weight lifting]]></category>
		<category><![CDATA[workouts]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=124</guid>
		<description><![CDATA[For the past few weeks, I've been working on a microISV (Independent Software Vendor) business called FitLabs.  The goal of FitLabs is to measure your fitness.  I've always been a workout fanatic. I've done sports all my life from Cross-country to Track and Field, Crew, Soccer, Basketball, and now CrossFit.  When I joined CrossFit, I realized real quickly how data-intensive it was. I was given a binder and every day I wrote down the workout, the times, the reps, and the weights.  But after a while, I really didn't know how to make the best use of all my journaling and data. ]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/08/fitlabsscreenshot.png"><img class="aligncenter size-large wp-image-142" title="fitlabsscreenshot" src="http://www.thedatascientist.com/wp-content/uploads/2011/08/fitlabsscreenshot-1024x499.png" alt="" width="512" height="249" /></a></p>
<p>For the past few weeks, I&#8217;ve been working on a microISV (Independent Software Vendor) business called <a title="FitLabs - Measure your fitness" href="http://www.fitlabs.co">FitLabs</a>.  The goal of FitLabs is to measure your fitness.  I&#8217;ve always been a workout fanatic. I&#8217;ve done sports all my life from Cross-country to Track and Field, Crew, Soccer, Basketball, and now CrossFit.  When I joined CrossFit, I realized real quickly how data-intensive it was. I was given a binder and every day I wrote down the workout, the times, the reps, and the weights.  But after a while, I really didn&#8217;t know how to make the best use of all my journaling and data.  It&#8217;s now been a year, and I&#8217;ve grown tremendously.  I feel stronger and healthier and more flexible.  I&#8217;m also happier, but that may be because I just moved.  But despite all this progress I still have some burning questions about my fitness and health like,</p>
<ul>
<li>Am I really stronger? How has my weightlifting gotten better?</li>
<li>If I attempt to lift more weight, can I handle it?  Am I at a local or global maximum?</li>
<li>Am I completing my workouts faster? Am I doing more reps than I used to?</li>
<li>How is my performance compared to other athletes of my height and weight?</li>
<li>How has my diet influenced my health?  Can I change my diet to have a quantitative impact (increase) on my health?</li>
</ul>
<p>I decided to build an application from scratch that could begin to answer these questions.  It&#8217;s nowhere near perfect, but it&#8217;s starting to illustrate some answers.  For example, after loading in my workout journal, I was able to visualize my Front Squat:</p>
<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/08/frontsquat.png"><img class="aligncenter size-full wp-image-126" title="frontsquat" src="http://www.thedatascientist.com/wp-content/uploads/2011/08/frontsquat.png" alt="" width="578" height="246" /></a></p>
<p>FitLabs was able to show me how much weight I have been able to move in my front squat lift.  It leaves a little to be desired, such as how many reps I did and how much barbell weight I was pulling, but I will improve on that.  A little further down the page, I show some basic statistics like,</p>
<p>&nbsp;</p>
<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/08/frontsquatmarks.png"><img class="aligncenter size-full wp-image-127" title="frontsquatmarks" src="http://www.thedatascientist.com/wp-content/uploads/2011/08/frontsquatmarks.png" alt="" width="579" height="116" /></a></p>
<p>This shows me (ideally) the range of my front squat lifts.  I&#8217;m able to see what my max working weight was in a workout (note, I have excluded strength workouts such as 1-rep maxes), and that record was set back in November.  One month later, I used a drastically lighter weight, which could have been due to a variety of factors.  The point being, when I&#8217;m working out, I don&#8217;t know what these statistics are.  Now when I go to the gym, the visualization is clearly in my head.</p>
<p>FitLabs has some bugs, but I am dedicated to perfecting it because I&#8217;m betting that if these questions are important to me, they must be important to others.  If you find this interesting, and want to check it out (free of charge) please <a href="http://www.thedatascientist.com/contact/">contact me</a>.  Otherwise you can <a title="FitLabs signup" href="https://dashboard.fitlabs.co/signup.php">sign up for FitLabs</a>.</p>
<p>I&#8217;ll be posting about my experiences getting this software business off the ground. I&#8217;m a complete noobie, so I&#8217;m looking forward to hearing from the community!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/08/25/today-i-launched-fitlabs-my-first-web-product/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>30 Years Of Music Industry Change</title>
		<link>http://www.thedatascientist.com/2011/08/18/30-years-of-music-industry-change/</link>
		<comments>http://www.thedatascientist.com/2011/08/18/30-years-of-music-industry-change/#comments</comments>
		<pubDate>Thu, 18 Aug 2011 19:46:37 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Around The Web]]></category>
		<category><![CDATA[music]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=116</guid>
		<description><![CDATA[Digital Music News has an interesting visualization of the changing climate of music sales.  The statistics are astounding (yet not really surprising) to see. In 2002, CD sales were the overwhelming majority of music sales at 95.5% market share.  Last year, they made up just 49.1%, having been eclipsed by the Digital Single (iTunes).  It's [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/08/chart2002.jpg"><img class="size-full wp-image-118 aligncenter" title="chart2002" src="http://www.thedatascientist.com/wp-content/uploads/2011/08/chart2002.jpg" alt="" width="330" height="300" /></a></p>
<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/08/chart2010.jpg"><img class="size-full wp-image-119 aligncenter" style="border-style: initial; border-color: initial;" title="chart2010" src="http://www.thedatascientist.com/wp-content/uploads/2011/08/chart2010.jpg" alt="" width="330" height="300" /></a></p>
<p><a href="http://www.digitalmusicnews.com/stories/081611thirty">Digital Music News</a> has an interesting visualization of the changing climate of music sales.  The statistics are astounding (yet not really surprising) to see. In 2002, CD sales were the overwhelming majority of music sales at 95.5% market share.  Last year, they made up just 49.1%, having been eclipsed by the Digital Single (iTunes).  It&#8217;s clear why the record industry is clinging to its dying business model &#8212; gone are the times when a customer would spend up to $20 per CD.  Now people are much more selective about what music they buy, and the Digital Single presents an opportunity for the consumer to be more choosy.  I bet if we look at a graph of music industry revenue over the years, we will see some serious declines.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/08/18/30-years-of-music-industry-change/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizing Geographic Data</title>
		<link>http://www.thedatascientist.com/2011/07/27/visualizing-geographic-data/</link>
		<comments>http://www.thedatascientist.com/2011/07/27/visualizing-geographic-data/#comments</comments>
		<pubDate>Wed, 27 Jul 2011 00:19:18 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Journal]]></category>
		<category><![CDATA[beautiful soup]]></category>
		<category><![CDATA[css]]></category>
		<category><![CDATA[d3]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[mapshaper]]></category>
		<category><![CDATA[openstreetmap]]></category>
		<category><![CDATA[polymaps]]></category>
		<category><![CDATA[svg]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=110</guid>
		<description><![CDATA[Ask the people behind any popular web service and they will tell you that a large portion of their user base is located outside of the US.  Web applications are increasingly being made with global audiences in mind, and therefore a need has arisen to represent that geospatial data.  Visualizing geographic data can be a [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/07/polymap.png"><img class="aligncenter size-full wp-image-111" title="PolyMaps" src="http://www.thedatascientist.com/wp-content/uploads/2011/07/polymap.png" alt="" width="628" height="328" /></a></p>
<p style="text-align: left;">Ask the people behind any popular web service and they will tell you that a large portion of their user base is located outside of the US.  Web applications are increasingly being made with global audiences in mind, and therefore a need has arisen to represent that geospatial data.  Visualizing geographic data can be a challenge, even when focusing on just the United States.  Shapes need to be found and downloaded, coordinates need to be generated, then the appropriate data must be mapped to them and plotted in a form browsers can recognize.</p>
<p style="text-align: left;">Luckily, the tools and databases for plotting geographic data have gotten much better.  The US Census keeps an accurate record of <a title="Census Boundary Data" href="http://www.census.gov/geo/www/cob/bdy_files.html" target="_blank">boundary data</a>, which you can then plug into <a href="http://mapshaper.com/test/demo.html" target="_blank">MapShaper</a> to create a polygon mapping.  There are also maps freely available at <a title="Open Street Maps" href="http://www.openstreetmap.org/" target="_blank">OpenStreetMap</a> and <a title="Google Map API" href="http://code.google.com/apis/maps/index.html" target="_blank">Google Map API</a>.  You might also find it beneficial to create a GeoJSON file instead of using shapefiles, so the shapes can be used in libraries such as Polymaps.  Use <a href="http://www.gdal.org/ogr2ogr.html" target="_blank">ogr2ogr</a> for the conversion; there is a usage example <a href="http://stackoverflow.com/questions/2223979/convert-a-shapefile-shp-to-xml-json" target="_blank">here</a>.</p>
<p>Using SVG to represent the data presents a clear advantage as you can manipulate the data and its design using simple CSS rules.  The <a title="Polymaps Javascript Library" href="http://polymaps.org/" target="_blank">Polymaps</a> and <a title="Choropleth" href="http://mbostock.github.com/d3/ex/choropleth.html" target="_blank">D3</a> libraries are excellent for visualizations that can be drawn directly onto the DOM.  They render very quickly.  You can also pre-process an SVG using <a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">Beautiful Soup</a> and some <a href="http://en.wikipedia.org/wiki/File:USA_Counties_with_FIPS_and_names.svg" target="_blank">Wikimedia</a> templates.  FlowingData has an <a href="http://flowingdata.com/2009/11/12/how-to-make-a-us-county-thematic-map-using-free-tools/" target="_blank">awesome tutorial</a> for creating a beautiful visualization.</p>
<p>If SVG is not your cup of tea, you can also use Flash.  Another library suite with lots of features is <a href="http://www.ammap.com/" target="_blank">AmMaps</a>.</p>
<p>I&#8217;m currently working on a project which will hopefully pull some of these concepts together in an easy-to-use tutorial.  If you&#8217;d like to be notified when it goes live, feel free to sign up for the newsletter below.</p>
<p><strong>UPDATE:  </strong>Well, so much for wishful thinking.  I completed the project, but it <a title="A data experiment gone wrong" href="http://www.thedatascientist.com/2011/08/30/a-data-experiment-gone-wrong/" target="_blank">didn&#8217;t quite go as planned</a>&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/07/27/visualizing-geographic-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Does your transit time increase as your home price decreases?</title>
		<link>http://www.thedatascientist.com/2011/07/25/does-your-transit-time-increase-as-your-home-price-decreases/</link>
		<comments>http://www.thedatascientist.com/2011/07/25/does-your-transit-time-increase-as-your-home-price-decreases/#comments</comments>
		<pubDate>Mon, 25 Jul 2011 16:18:57 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Around The Web]]></category>
		<category><![CDATA[maps]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=96</guid>
		<description><![CDATA[If you&#8217;re sick of your morning and evening commutes, you may be wondering how close you can move to your work while living affordably.  Of course, there are many other factors that go into choosing a place to live, but what if you can see the relationship between home price and transit time?  One Bay [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/07/onebayarea.png"><img class="aligncenter size-full wp-image-97" title="One Bay Area" src="http://www.thedatascientist.com/wp-content/uploads/2011/07/onebayarea.png" alt="" width="577" height="409" /></a></p>
<p>If you&#8217;re sick of your morning and evening commutes, you may be wondering how close you can move to your work while living affordably.  Of course, there are many other factors that go into choosing a place to live, but what if you can see the relationship between home price and transit time?  <a title="One Bay Area" href="http://www.onebayarea.org" target="_blank">One Bay Area</a> is a website dedicated to just that. Created by a consortium of agencies, the effort is meant to create a more sustainable future by planning for future land use and transportation.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/07/25/does-your-transit-time-increase-as-your-home-price-decreases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google&#8217;s Ngram Viewer Lets You Visualize Interesting Trends</title>
		<link>http://www.thedatascientist.com/2011/07/23/googles-ngram-viewer-unlocks-trends/</link>
		<comments>http://www.thedatascientist.com/2011/07/23/googles-ngram-viewer-unlocks-trends/#comments</comments>
		<pubDate>Sat, 23 Jul 2011 18:25:53 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Around The Web]]></category>
		<category><![CDATA[Markov Models]]></category>
		<category><![CDATA[Ngram]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=80</guid>
		<description><![CDATA[I just discovered Google's Ngram Viewer via David McCandless' Information Is Beautiful blog.  It's amazing. Well, and so is David's blog.  Google makes it easy to type in any time period to examine the ngrams calculated from all Google Books as they are scanned.]]></description>
			<content:encoded><![CDATA[<p>I just discovered Google&#8217;s <a title="Ngram Viewer" href="http://ngrams.googlelabs.com/" target="_blank">Ngram Viewer</a> via David McCandless&#8217; <a title="Information Is Beautiful" href="http://www.informationisbeautiful.net/" target="_blank">Information Is Beautiful</a> blog.  It&#8217;s amazing. Well, and so is David&#8217;s blog.  Google makes it easy to type in any time period to examine the ngrams calculated from all Google Books as they are scanned.</p>
<p>Don&#8217;t know anything about ngrams?  An ngram is a sequence of n consecutive words, and the viewer plots how often each one appears over time.  Ngram counts are used extensively in speech recognition for training Markov Models.  Anyway, Google&#8217;s Ngram Viewer is simple and effective and can unlock some interesting trends, such as the couple I discovered below:</p>
<p>&nbsp;</p>
<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/07/menwomenngram1.png"><img class="aligncenter size-full wp-image-83" title="Men Women Ngram" src="http://www.thedatascientist.com/wp-content/uploads/2011/07/menwomenngram1.png" alt="" width="553" height="212" /></a>The Feminine Mystique</p>
<p style="text-align: center;"><a href="http://www.thedatascientist.com/wp-content/uploads/2011/07/cartrainngram.png"><img class="aligncenter size-full wp-image-82" title="Car Train Ngram" src="http://www.thedatascientist.com/wp-content/uploads/2011/07/cartrainngram.png" alt="" width="557" height="206" /></a>Technological Revolution</p>
<p>&nbsp;</p>
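<p>If you are new to ngrams, counting them yourself takes only a few lines.  This is a toy sketch on a single sentence with a made-up function name, and it says nothing about how Google actually builds its corpus.</p>

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in `text` (case-insensitive)."""
    words = text.lower().split()
    # Slide a window of width n across the word list.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


counts = ngram_counts("the cat sat on the cat", 2)
print(counts[("the", "cat")])  # 2
```

<p>Run the same count over books grouped by publication year and you have the raw material for charts like the ones above.</p>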
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/07/23/googles-ngram-viewer-unlocks-trends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Essential Tools For A Data Scientist</title>
		<link>http://www.thedatascientist.com/2011/07/22/essential-tools-for-a-data-scientist/</link>
		<comments>http://www.thedatascientist.com/2011/07/22/essential-tools-for-a-data-scientist/#comments</comments>
		<pubDate>Fri, 22 Jul 2011 17:49:35 +0000</pubDate>
		<dc:creator>Chris Bates</dc:creator>
				<category><![CDATA[Journal]]></category>
		<category><![CDATA[amcharts]]></category>
		<category><![CDATA[blas/lapack]]></category>
		<category><![CDATA[d3.js]]></category>
		<category><![CDATA[flare]]></category>
		<category><![CDATA[google visualization api]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[highcharts]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[mahout]]></category>
		<category><![CDATA[matlab]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[numpy]]></category>
		<category><![CDATA[octave]]></category>
		<category><![CDATA[processing]]></category>
		<category><![CDATA[protovis]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[raphael.js]]></category>
		<category><![CDATA[scipy]]></category>
		<category><![CDATA[weka]]></category>

		<guid isPermaLink="false">http://www.thedatascientist.com/?p=48</guid>
		<description><![CDATA[We know that the need for data scientists is strong.  But what exactly do you need to become a data scientist?  When I look back over all the years I’ve been working in the sciences, the tools I have needed have varied depending on the data size.]]></description>
			<content:encoded><![CDATA[<p>We know that the need for data scientists is <a title="A Call For Data Scientists" href="http://www.thedatascientist.com/2011/07/21/a-call-for-data-scientists/" target="_blank">strong</a>.  But what exactly do you need to <strong>become</strong> a data scientist?  When I look back over all the years I’ve been working in the sciences, the tools I have needed have varied depending on the data size.</p>
<h3>Sample Data (Lines)</h3>
<p>A whiteboard – I say this somewhat jokingly.  But the truth is that a simple whiteboard sketch of a sample set of the data will help you brainstorm important questions to ask, or uncover complexities of the data you may need to plan out before writing any code.</p>
<h3>Prototype Data (KB – Low MB)</h3>
<h4><strong>Analysis and Visualization:</strong></h4>
<p><a href="http://www.mathworks.com/products/matlab/" target="_blank">Matlab</a> &#8211; Proprietary software that most universities will have licenses for or allow students to buy cheap copies of.  Matlab is basically an IDE with great support for visualizing data in native structures (arrays, matrices, cells).  This comes in handy when working with the many, many wonderful academic toolboxes for things like Neural Networks, Machine Learning, and Machine Vision.</p>
<p><a href="http://www.gnu.org/software/octave/" target="_blank">Octave</a> &#8211; Open source alternative that is largely compatible with Matlab.  Great for when you eventually have to leave the nest of the University but are still hooked on Matlab.  Many Matlab scripts run with little or no change and many of the original functions are preserved.</p>
<p><a href="http://www.r-project.org/" target="_blank">R</a><strong> – </strong>Open source statistical computing environment.  I didn’t come from a statistics background but I’m curious to learn it.  Used by many professionals and academics in the field.  See <a href="http://flowingdata.com/" target="_blank">FlowingData</a> for more practical illustrations.</p>
<p><a href="http://processing.org/" target="_blank">Processing</a><strong> – </strong>Visualization development environment.  Creates beautiful visualizations and comes with an IDE.  Check out <a href="http://benfry.com/" target="_blank">Ben Fry’s</a> <a href="http://www.amazon.com/Visualizing-Data-Explaining-Processing-Environment/dp/0596514557" target="_blank">Visualizing Data</a>.</p>
<h3>Online Data (MB – Low GB)</h3>
<p>This is the point where handling data becomes a bit unwieldy.  Perhaps this is when you’d like to <a href="http://www.getforge.com/landing.php" target="_blank">provide analytics as a web service</a>, and you’re not prototyping anymore on your local workstation.  So assuming you have to start migrating your experiments to production deployments on a server you need to start separating the components needed to do the analysis and visualization.</p>
<h4><strong>Storage:</strong></h4>
<p><a href="http://www.mysql.com/" target="_blank">MySQL (or other DB)</a> – Open source and used by many companies to query and serve data.  Write SQL statements to mine data.  MySQL has support for some statistical functions, but performance may degrade as the data size nears the upper bound of this range.</p>
<h4><strong>Analysis:</strong></h4>
<p><a href="http://numpy.scipy.org/" target="_blank">NumPy and SciPy</a> (Python) – Because of its roots in the scientific community, Python has great support for mathematical functions that may be absent in other languages such as PHP.</p>
<p><a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">Weka</a><strong> – </strong>Free Java-based software for data analysis and machine learning and visualization.  Has database connectivity.</p>
<p><a href="http://www.netlib.org/lapack/" target="_blank">BLAS/LAPACK</a> – Core linear algebra Fortran packages that NumPy and Matlab build on.  Use the C API to write your own custom wrapper in any language that suits you.</p>
<h4><strong>Visualization:</strong></h4>
<p><a href="http://flare.prefuse.org/" target="_blank">Flare (Flash)</a><strong> &#8211; </strong>Build animations with Actionscript using the Prefuse toolkit. Supports rich features in data modeling and contains animations for trees, graphs, and tables.</p>
<p><a href="http://www.highcharts.com/" target="_blank">HighCharts (Javascript)</a> – Easily construct standard pie, bar, and line charts using jQuery.</p>
<p><a href="http://www.amcharts.com/" target="_blank">AmCharts (Flash)</a> – Construct pie, bar, and line charts. Also has map support.</p>
<p><a href="http://mbostock.github.com/protovis/" target="_blank">Protovis (Javascript)</a><strong> [Development stopped for D3.js] &#8211; </strong>Creates SVG images that can be inserted into a page. Great for creating custom visualizations that might become tedious with other frameworks.</p>
<p><a href="http://mbostock.github.com/d3/" target="_blank">D3.js (Javascript)</a><strong> – </strong>Built directly on the CSS3, HTML5, and SVG standards, so unlike Protovis it does not require learning a new graphical representation. Operates on the DOM directly and applies data-driven transformations.</p>
<p><a href="http://code.google.com/apis/chart/" target="_blank">Google Visualization API</a><strong> – </strong>Built from HTML5, SVG, and Google Web Toolkit.  Google makes it easy if you want to develop your own Google Trends by supporting many basic types of visualizations like scatter plots, bar charts, and Google Maps integration.</p>
<p><a href="http://raphaeljs.com/" target="_blank">Raphael.js (Javascript)</a> – A JS library that makes it easy to create vector graphics using SVG and VML (deprecated).</p>
<h3>“Big” Data (GB – TB – PB)</h3>
<p>At this stage, data cannot be stored on a single hard disk, so it must be distributed among many machines.  This is an active area of development for large websites, and so far the open source community has adopted Apache Hadoop as the flagship software framework for handling big data problems.</p>
<h4><strong>Storage:</strong></h4>
<p><a href="http://hadoop.apache.org/hdfs/" target="_blank">Hadoop Distributed Filesystem (HDFS)</a><strong> – </strong>Creates replicas of data blocks and distributes them on compute nodes in a cluster.</p>
<p><a href="http://hbase.apache.org/" target="_blank">HBase</a> – The Hadoop database.  It provides random, realtime read/write access to HDFS.</p>
<h4><strong>Analysis:</strong></h4>
<p><a href="https://cwiki.apache.org/confluence/display/Hive/Home" target="_blank">Hive</a><strong> – </strong>A data warehouse system that can be used for offline batch processing, ad-hoc querying, and statistical analysis of large datasets.  Hive uses a SQL-like syntax that makes it easy to rapidly develop queries.</p>
<p><a href="http://mahout.apache.org/" target="_blank">Mahout</a><strong> –</strong> A framework for deploying many machine learning algorithms on large datasets.  The algorithms have been rewritten with scale in mind.  Currently the main use cases include: recommendations, clustering, classification, and text mining.</p>
<h4><strong>Visualization:</strong></h4>
<p>There is not yet a popular framework that can view the distributed data in its raw form. If you know of one, please leave the suggestion in the comments or contact me.  Typically the industry practice is to summarize the data and collapse the feature-set, then load it into a database such as MySQL for online querying.  From there you can use any of the aforementioned visualization software libraries.</p>
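<p>As a small-scale illustration of that summarize-then-load pattern, here is a sketch that collapses raw events into counts and loads them into a database for querying.  It uses Python&#8217;s built-in SQLite module as a stand-in for MySQL, and the event data, table name, and function name are all invented for the example.</p>

```python
import sqlite3
from collections import Counter

# Made-up raw events: (day, event_type). At real scale this step would be
# a batch job on the cluster rather than an in-memory Counter.
raw_events = [
    ("2011-07-20", "view"),
    ("2011-07-20", "view"),
    ("2011-07-20", "click"),
    ("2011-07-21", "view"),
]
summary = Counter(raw_events)

# Load the collapsed summary into a small table for online querying.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_counts ("
    "day TEXT, event_type TEXT, total INTEGER, "
    "PRIMARY KEY (day, event_type))"
)
conn.executemany(
    "INSERT INTO daily_counts VALUES (?, ?, ?)",
    [(day, event_type, total) for (day, event_type), total in summary.items()],
)
conn.commit()


def top_event(connection, day):
    """Return (event_type, total) for the most frequent event on `day`."""
    return connection.execute(
        "SELECT event_type, total FROM daily_counts "
        "WHERE day = ? ORDER BY total DESC LIMIT 1",
        (day,),
    ).fetchone()


print(top_event(conn, "2011-07-20"))  # ('view', 2)
```

<p>The summary table is tiny compared with the raw events, which is what makes it practical to put a charting library directly on top of it.</p>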
]]></content:encoded>
			<wfw:commentRss>http://www.thedatascientist.com/2011/07/22/essential-tools-for-a-data-scientist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

