
Wednesday, January 16, 2013

Upload apk into Android Market

Source: http://lokeshatandroid.blogspot.in/2012/07/upload-apk-into-android-market.html

Steps to create a certificate for an Android Market apk:
1. If you are using Eclipse for development, just right-click on your project and click Export.
2. Now choose Android and then Export Android Application. In the next step, confirm the project that you want to export.
3. Then click Next, and you should be able to select Create new keystore.
4. Now fill in the required fields, and you should be able to sign your app.
5. Be sure to make a backup of the keystore file and remember your password. Losing these will make it impossible to update your application.
6. If you are using the terminal to create a keystore and you have the Java SDK installed, there should be a program called keytool in /usr/bin (on a Unix-based system).
7. On Windows the SDK should also come with keytool, but the install location may differ; search for keytool.exe on your computer if keytool is not already in your path. With this tool you should be able to create a key in the following way:
keytool -genkey -v -keystore my-release-key.keystore -alias alias_name -keyalg RSA -validity 10000
8. Remember that once you lose your certificate or it expires, you will not be able to sign your application. Make sure that the expiration date is a long, long time in the future.
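Once the keystore exists, the apk itself is signed with the JDK's jarsigner tool. A minimal sketch, reusing the keystore and alias placeholders from the keytool command above (my_application.apk is likewise a placeholder, not a fixed name):

jarsigner -verbose -keystore my-release-key.keystore my_application.apk alias_name
jarsigner -verify my_application.apk

The second command is a quick check that the apk really was signed.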
Publishing Updates on Android Market:
1. At any time after publishing an application on Android Market, you can upload and publish an update to the same application package.
2. When you publish an update to an application, users who have already installed it may receive a notification that an update is available. They can then choose to update the application to the latest version.
3. Before uploading the updated application, be sure that you have incremented the android:versionCode and android:versionName attributes in the <manifest> element of the manifest file. Also, the package name must be the same as the existing version, and the .apk file must be signed with the same private key.
Here is what we need to change. Go to the manifest file and set the version code like this:
 <manifest xmlns:android="http://schemas.android.com/apk/res/android"
 package="your.package.name"
 android:versionCode="2"
 android:versionName="2.0" >
 Here, the old versionCode was 1 and the new versionCode is 2.
4. If the package name and signing certificate do not match those of the existing version, Market will consider it a new application, publish it as such, and will not offer it to existing users as an update.
5. You must have the same keystore file that you used to upload the first version of the application to Android Market. If you have lost this keystore file, then you can't provide updates to this application.
Note: Don't forget to keep a backup of your keystore file.

Securing your Tomcat app with SSL and Spring Security

If you've seen my last blog, you'll know that I listed ten things that you can do with Spring Security. However, before you start using Spring Security in earnest, one of the first things you really must do is ensure that your web app uses the right transport protocol, which in this case is HTTPS - after all, there's no point in having a secure web site if you're going to broadcast your users' passwords all over the internet in plain text. To set up SSL there are three basic steps...

Creating a Key Store

The first thing you need is a private keystore containing a valid certificate and the simplest way to generate one of these is to use Java's keytool utility located in the $JAVA_HOME/bin directory.

keytool -genkey -alias MyKeyAlias -keyalg RSA -keystore /Users/Roger/tmp/roger.keystore

In the above example,
  • -alias is the unique identifier for your key.
  • -keyalg is the algorithm used to generate the key. Most examples you find on the web usually cite 'RSA', but you could also use 'DSA' or 'DES'.
  • -keystore is an optional argument specifying the location of your key store file. If this argument is missing then the default location is your $HOME directory.

RSA stands for Ron Rivest (also the creator of the RC4 algorithm), Adi Shamir and Leonard Adleman
DSA stands for Digital Signature Algorithm
DES stands for Data Encryption Standard

For more information on keytool and its arguments, take a look at this Informit article by Jon Svede.

When you run this program you'll be asked a few questions:

Roger$ keytool -genkey -alias MyKeyAlias -keyalg RSA -keystore /Users/Roger/tmp/roger.keystore
Enter keystore password: 
Re-enter new password:
What is your first and last name?
  [Unknown]:  localhost
What is the name of your organizational unit?
  [Unknown]:  MyDepartmentName
What is the name of your organization?
  [Unknown]:  MyCompanyName
What is the name of your City or Locality?
  [Unknown]:  Stafford
What is the name of your State or Province?
  [Unknown]:  NA
What is the two-letter country code for this unit?
  [Unknown]:  UK
Is CN=localhost, OU=MyDepartmentName, O=MyCompanyName, L=Stafford, ST=UK, C=UK correct?
  [no]:  Y

Enter key password for <MyKeyAlias>
     (RETURN if same as keystore password): 

Most of the fields are self-explanatory; however, for the first and last name value, I generally use the machine name - in this case localhost.
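As a quick sanity check (not part of the original steps), keytool's -list option will print the contents of the store, so you can confirm the key and certificate were created as expected:

keytool -list -v -keystore /Users/Roger/tmp/roger.keystore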

Updating the Tomcat Configuration

The second step in securing your app is to ensure that your tomcat has an SSL connector. To do this you need to find tomcat's server.xml configuration file, which is usually located in the 'conf' directory. Once you've got hold of this, and if you're using plain tomcat, it's a matter of uncommenting:

<Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
               maxThreads="150" scheme="https" secure="true"
               clientAuth="false" sslProtocol="TLS" />

…and making it look something like this:

<Connector SSLEnabled="true" keystoreFile="/Users/Roger/tmp/roger.keystore" keystorePass="password" port="8443" scheme="https" secure="true" sslProtocol="TLS"/> 

Note that the password "password" is in plain text, which isn't very secure. There are ways around this, but that's beyond the scope of this blog.

If you're using Spring's tcServer, then you'll find that it already has an SSL connector that's configured something like this:

<Connector SSLEnabled="true" acceptCount="100" connectionTimeout="20000" executor="tomcatThreadPool" keyAlias="tcserver" keystoreFile="${catalina.base}/conf/tcserver.keystore" keystorePass="changeme" maxKeepAliveRequests="15" port="${bio-ssl.https.port}" protocol="org.apache.coyote.http11.Http11Protocol" redirectPort="${bio-ssl.https.port}" scheme="https" secure="true"/>

…in which case it's just a matter of editing the various fields including keyAlias, keystoreFile and keystorePass.

Configuring your App

If you now start tomcat and run your web application, you'll find that it's accessible using HTTPS. For example, typing https://localhost:8443/my-app will work, but so will http://localhost:8080/my-app. This means that you also need to do some jiggery-pokery on your app to ensure that it only responds to HTTPS, and there are two approaches you can take.

If you're not using Spring Security, then you can simply add the following to your web.xml before the closing web-app tag:

<security-constraint>
    <web-resource-collection>
        <web-resource-name>my-secure-app</web-resource-name>
        <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <user-data-constraint>
        <transport-guarantee>CONFIDENTIAL</transport-guarantee>
    </user-data-constraint>
</security-constraint>

If you are using Spring Security, then there are a few more steps to getting things going. Part of the general Spring Security setup is to add the following to your web.xml file. Firstly you need to add a Spring Security application context file to the contextConfigLocation context-param:

<context-param>
    <param-name>contextConfigLocation</param-name>
    <param-value>/WEB-INF/spring/root-context.xml
        /WEB-INF/spring/appServlet/application-security.xml
    </param-value>
</context-param>

Secondly, you need to add the Spring Security filter and filter-mapping:

<filter>
    <filter-name>springSecurityFilterChain</filter-name>
    <filter-class>org.springframework.web.filter.DelegatingFilterProxy</filter-class>
</filter>
<filter-mapping>
    <filter-name>springSecurityFilterChain</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

Lastly, you need to create, or edit, your application-security.xml as shown in the very minimalistic example below:

<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/security"
  xmlns:beans="http://www.springframework.org/schema/beans"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
           http://www.springframework.org/schema/security
           http://www.springframework.org/schema/security/spring-security-3.1.xsd">
   
       <http auto-config='true' >
          <intercept-url pattern="/**" requires-channel="https" />    
       </http>
 
       <authentication-manager>
       </authentication-manager>

</beans:beans>

In the example above, the intercept-url element has been set up to intercept all URLs and force them to use the https channel.

The configuration details above may give the impression that it's quicker to use the simple web.xml config change, but if you're already using Spring Security, then it's only a matter of adding a requires-channel attribute to your existing configuration.


A sample app called tomcat-ssl demonstrating the above is available on GitHub at: https://github.com/roghughe/captaindebug

Source: http://www.captaindebug.com/2012/12/securing-your-tomcat-app-with-ssl-and.html

Rule of 30 – When is a method, class or subsystem too big?

Source: http://swreflections.blogspot.com/2012/12/rule-of-30-when-is-method-class-or.html

A question that constantly comes up from people who care about writing good code is: what's the right size for a method or function, or a class, or a package, or any other chunk of code? At some point any piece of code can be too big to understand properly – but how big is too big?
It starts at the method or function level.
In Code Complete, Steve McConnell says that the theoretical best maximum limit for a method or function is the number of lines that can fit on one screen (i.e., that a developer can see at one time). He then goes on to reference studies from the 1980s and 1990s which found that the sweet spot for functions is somewhere between 65 lines and 200 lines: routines this size are cheaper to develop and have fewer errors per line of code. However, at some point beyond 200 lines you cross into a danger zone where code quality and understandability will fall apart: code that can’t be tested and can’t be changed safely. Eventually you end up with what Michael Feathers calls “runaway methods”: routines that are several hundreds or thousands of lines long and that are constantly being changed and that continuously get bigger and scarier.
Patrick Duboy looks deeper into this analysis on method length, and points to a more modern study from 2002 that shows that code with shorter routines has fewer defects overall, which matches most people's intuition and experience.

Smaller must be better

Bob Martin takes the idea that “if small is good, then smaller must be better” to an extreme in Clean Code:
The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that. Functions should not be 100 lines long. Functions should hardly ever be 20 lines long.
Martin admits that "This is not an assertion that I can justify. I can't produce any references to research that shows that very small functions are better." So like many other rules or best practices in the software development community, this is a qualitative judgement made by someone based on their personal experience writing code – more of an aesthetic argument, or even an ethical one, than an empirical one. Style over substance. The same "small is better" guidance applies to classes, packages and subsystems – all of the building blocks of a system. In Code Complete, a study from 1996 found that classes with more routines had more defects. Like functions, according to Clean Code, classes should also be "smaller than small". Some people recommend that 200 lines is a good limit for a class – not a method – or as few as 50-60 lines (in Ben Nadel's Object Calisthenics exercise), and that a class should consist of "less than 10" or "not more than 20" methods. The famous C3 project – where Extreme Programming was born – had 12 methods per class on average. And there should be no more than 10 classes per package.
PMD, a static analysis tool that helps to highlight problems in code structure and style, defines some default values for code size limits: 100 lines per method, 1000 lines per class, and 10 methods in a class. Checkstyle, a similar tool, suggests different limits: 50 lines in a method, 1500 lines in a class.
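As a rough sketch of how such limits can be enforced, here is a minimal Checkstyle configuration; the module and property names are Checkstyle's own, but the values are the ones cited above rather than the tools' defaults:

<module name="Checker">
    <!-- Flag classes (files) longer than 1500 lines -->
    <module name="FileLength">
        <property name="max" value="1500"/>
    </module>
    <module name="TreeWalker">
        <!-- Flag methods longer than 50 lines -->
        <module name="MethodLength">
            <property name="max" value="50"/>
        </module>
    </module>
</module>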

Rule of 30

Looking for guidelines like this led me to the “Rule of 30” in Refactoring in Large Software Projects by Martin Lippert and Stephen Roock:
If an element consists of more than 30 subelements, it is highly probable that there is a serious problem:
a) Methods should not have more than an average of 30 code lines (not counting line spaces and comments).
b) A class should contain an average of less than 30 methods, resulting in up to 900 lines of code.
c) A package shouldn’t contain more than 30 classes, thus comprising up to 27,000 code lines.
d) Subsystems with more than 30 packages should be avoided. Such a subsystem would count up to 900 classes with up to 810,000 lines of code.
e) A system with 30 subsystems would thus possess 27,000 classes and 24.3 million code lines.
What does this look like? Take a biggish system of 1 million NCLOC. This should break down into:
  • 30,000+ methods
  • 1,000+ classes
  • 30+ packages
  • Hopefully more than 1 subsystem
How many systems in the real world look like this, or close to this – especially big systems that have been around for a few years?

Are these rules useful? How should you use them?

Using code size as the basis for rules like this is simple: easy to see and understand. Too simple, many people would argue: a better indicator of when code is too big is cyclomatic complexity or some other measure of code quality. But some recent studies show that code size actually is a strong predictor of complexity and quality – that
"complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code".
In "Beyond Lines of Code: Do we Need more Complexity Metrics?" in Making Software, the authors go so far as to say that lines of code should always be considered as the "first and only metric" for defect prediction, development and maintenance models. Recognizing that simple sizing rules are arbitrary, should you use them, and if so, how?
I like the idea of rough and easy-to-understand rules of thumb that you can keep in the back of your mind when writing code or looking at code and deciding whether it should be refactored. The real value of a guideline like the Rule of 30 is when you're reviewing code and identifying risks and costs.
But enforcing these rules in a heavy handed way on every piece of code as it is being written is foolish. You don’t want to stop when you’re about to write the 31st line in a method – it would slow down work to a crawl. And forcing everyone to break code up to fit arbitrary size limits will make the code worse, not better – the structure will be dominated by short-term decisions.
As Jeff Langer points out in his chapter discussing Kent Beck's four rules of Simple Design in Clean Code:
“Our goal is to keep our overall system small while we are also keeping our functions and classes small. Remember however that this rule is the lowest priority of the four rules of Simple Design. So, although it’s important to keep class and function count low, it’s more important to have tests, eliminate duplication, and express yourself.”
Sometimes it will take more than 30 lines (or 20 or 5 or whatever the cut-off is) to get a coherent piece of work done. It’s more important to be careful in coming up with the right abstractions and algorithms and to write clean clear code – if a cut-off guideline on size helps to do that, use it. If it doesn't, then don’t bother.

Monday, January 14, 2013

Comprehending the Mobile Development Landscape

Source: http://thediscoblog.com/blog/2012/12/02/comprehending-the-mobile-development-landscape/





There's no shortage of mobile growth statistics, but here are a few specific ones that paint an overall picture of mobility:
These three facts clearly point out that mobility is a growing, global phenomenon, and that it’s drastically changing how people use the Internet. What’s more, from a technology standpoint, mobile is where the growth is!
But the mobile landscape is as varied as it is big. Unlike a few short years ago, when doing mobile work implied J2ME on a BlackBerry, mobile development now encompasses Android, iOS, HTML5, and even Windows Phone. That's 4 distinct platforms with different development environments and languages – and I haven't even mentioned the myriad hybrid options available!
The key to understanding the mobile landscape is an appreciation for the various development platforms – their strengths and weaknesses, speed of development, distribution, and, if you are looking at the consumer market, their payout.
Android
Android device distribution, as I pointed out earlier, is growing faster than other platforms, and the Android ecosystem has more than one app store: Google Play and Amazon’s store, just to name the two most popular ones. And by most accounts, Google Play has as many or more apps than Apple’s App Store (careful with this statistic though, see details below regarding payouts).
The massive adoption of Android, however, has led to fragmentation, which does present some significant challenges with respect to testing. In fact, the reality for most developers is that it is almost impossible to test an app on all combinations of device-OS version profiles in a cost-effective manner (this is a growing service industry, by the way).
On a positive note, Java, the native language of Android apps, is a fairly ubiquitous language – some estimates peg the number of active developers at as many as 10 million – so there's no shortage of able-bodied Java developers and their associated tools out there.
Thus, with Android, you have a wide audience (both people with Android devices and developers to build apps) and multiple distribution channels. Yet, this large distribution of disparate devices does present some testing challenges; what’s more, it can be more difficult to make money on the Android platform compared to iOS, as you’ll see next.
iOS
iOS, the OS for iPhones and iPads, has a tight ecosystem and an avid user base that is willing to spend money, ultimately translating into more money for developers. That is, even though there are far more Android devices globally than iOS ones, the iTunes App Store generates more money than Google Play, which means more money for developers of popular apps. In many respects, users of iOS devices are also more willing to pay a fee for an app than Android users are.
The development ecosystem for iOS has a higher barrier to entry when compared to something like Java or JavaScript. OSX is a requirement and the cost alone here can be a barrier for a lot of developers; moreover, Objective-C can present some challenges for the faint of heart (manual memory management!). Yet, the tooling provided by Apple is almost universally lauded by the community at large (much like Microsoft’s VisualStudio) – XCode is a slick development tool.
While there isn't a lot of device fragmentation on iOS, developers do have to deal with OS fragmentation. That is, there are only a handful of Apple devices, but quite a lot of different OS versions live in the field at any given time because users lag in upgrading.
The iOS platform certainly offers a direct path to revenue, provided you can build a stellar app; however, compared to Android, this is a closed community, which has a tendency to rub some portion of the development community the wrong way. Provided you can quickly embrace Objective-C and afford the requisite software, iOS is almost always the first platform app developers target.
HTML5
HTML5 is truly universal and its apps are available on all platforms without any need to port them – JavaScript is as ubiquitous as Java; what's more, HTML itself has almost no barrier to entry, making HTML5 and JavaScript a force to contend with when it comes to finding talented developers and mass distribution. Cost isn't really part of the HTML5 equation either – tools and frameworks are free.
Yet, HTML5 apps suffer from a distribution challenge – the major app stores do not carry these apps! Thus, in large part, as an HTML5 app developer, you are relying on a user to type your URL into a browser. I, for one, almost never type in a URL on my iPhone (while I will on my iPad). Lastly, HTML5 is nowhere near parity with respect to UX compared to native apps (and may never be). This, however, is only a disadvantage if you are building an app that requires a strong UX. There are plenty of great HTML5 apps out there!
HTML5 offers an extremely low developmental barrier to entry and the widest support available – all smart devices have browsers (note, they aren’t all created equal!); however, because there isn’t a viable distribution channel, these apps have limited opportunity to make money.
Windows Phone
Windows Phone is still unproven but could be an opportunity to get in early – first movers in Apple's App Store without a doubt made far more money than if they had submitted the same apps today. In this case, if you want a truly native experience you'll build apps on the .NET platform (presumably in C#). Windows machines are far cheaper than OSX ones, so there is little financial barrier other than license fees for VisualStudio and a developer fee for the Windows Phone Marketplace.
Indeed, it appears that Microsoft is modeling their app store and corresponding policies off of Apple’s – thus there is a tightly managed distribution channel, presenting an opportunity to reach a wide audience and earn their money. But, at this point, the wide audience has yet to develop.
That’s 4, but there’s still more!
As I alluded to in the beginning of this post, there are 4 primary platforms and myriad hybrid options, such as PhoneGap and Appcelerator, for example. These hybrid options have various advantages and disadvantages; however, the primary concerns one needs to think through are still speed of development, distribution, and payout.
Before you embark on a mobile development effort, it pays to have the end in mind – that is, before you code, have tangible answers for app distribution, development effort, and potential payout as these points will help guide you through the mobile landscape.

Simply Writing Tests Is Not Test Driven Development

Source: http://spin.atomicobject.com/2012/12/06/writing-tests-is-not-tdd/

There is a common misunderstanding in the software world — that simply writing tests is test driven development. Test driven development (TDD) is about ensuring that your software is functioning, as well as ensuring that the software's internals are well designed, reusable, and decoupled.

What is TDD?

Uncle Bob’s 3 basic rules of TDD are:
  • You are not allowed to write any production code unless it is to make a failing unit test pass.
  • You are not allowed to write any more of a unit test than is sufficient to fail; and compilation failures are failures.
  • You are not allowed to write any more production code than is sufficient to pass the one failing unit test.
To summarize Uncle Bob’s rules:
  • Only write code that is tested.
  • Start your tests small, then work your way up.
  • Only write enough production code to make a test pass.

Basics of TDD

The basic process of TDD has 3 steps: Red, Green, and Refactor.

Red

The red step consists of writing a failing test, only one failing test. If you keep getting ahead of yourself and writing multiple tests, borrow an idea from Getting Things Done and get the ideas out of your head so they don’t get in the way of other thoughts. There are a number of different mechanisms you can use: create a to-do list on paper, make an index card, create a series of TODO comments in your files, etc. I find the physical action of crossing off an item on paper or writing a check mark and folding an index card in half gives me more of a sense of accomplishment.
Your first test for a new object should be simple. TDD focuses on emergent design, the opposite of big design up front. Let the design fall out of your code. The purpose of the first test is not about functionality. It’s about flushing out the usage of what you are about to create.
Start with the inputs: “What do I have to feed into this function?” Next, think about the outputs: “What will this function be spitting out?” Then write an assertion, run your test suite, and verify that the test you just wrote is red.
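As a concrete sketch of such a first red test, here is a JUnit 4 example; the PriceCalculator class and its total method are hypothetical and don't exist yet, which is exactly the point:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class PriceCalculatorTest {

    @Test
    public void totalOfASingleItemIsItsPrice() {
        // Input: one item price. Expected output: the same amount.
        PriceCalculator calculator = new PriceCalculator();
        assertEquals(10.0, calculator.total(10.0), 0.001);
    }
}

This won't even compile yet, and per the second rule above, a compilation failure counts as a failing (red) test.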
All the other test cases in the red stage should be about capturing functionality. Don’t just test the happy path; think about the craziest thing you could do with the function or object. What happens when I pass in a null parameter? What happens when I pass in a negative value? How about when I pass in a string when it’s expecting an integer?

Green

The green step consists of making the failing test pass as quickly as possible. If more than one test is failing, start with making the test you just wrote pass, and then continue working the reds to greens one at a time. Don't worry about how the code looks or how efficient it is. Your concern should be with making the test pass so you can move on to ensuring the next bit of functionality is under test.
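Continuing the hypothetical sketch from the red step, the green step is the most naive code that makes that one test pass:

public class PriceCalculator {

    public double total(double itemPrice) {
        // Deliberately simplistic: just enough to turn the single failing test green
        return itemPrice;
    }
}

Returning the input unmodified feels like cheating, but that's the discipline: the next red test is what forces the implementation to grow.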

Refactor

The next step is refactoring: restructuring and organizing your code. The refactoring step can occur at any time — after 1 red/green cycle, after 4 red/green cycles, etc. Since you have a number of passing green tests, you can refactor with ease and comfort, knowing that your tests will fail if you regress and lose functionality.
Refactoring shouldn’t only be about restructuring your code and making it more easily readable. Tests need refactoring love and attention too, but don’t refactor code and tests at the same time.

Benefits of TDD

Working Code

One of the primary benefits of TDD is that you have functioning and working code at all times. You spend time narrowing in on pieces of functionality and ensuring that they work as intended.

Fearless Changes

TDD allows for fearless changes. I worked on a number of software projects prior to being enlightened by the magic of TDD. A common thread of thought, looking back, is that I was always deathly afraid of making changes. I spent more time using the application than I did writing code, just to make sure I was maintaining functionality and not causing regressions in specific features. With TDD, that fear is removed because functionality is under test, and you’re able to get near-instantaneous feedback about the system or parts of it. The ability to make fearless changes via refactoring causes the internal quality of your software to improve and eventually bleeds through to being external quality.

Living Documentation

TDD also provides you with a living documentation of the code. If you are anything like me, then when exploring a new library, you want to skip all the fluffy documentation and cut to the chase, looking at examples on how to use it. It is important that we keep this in mind when writing and refactoring our tests — it’s our responsibility to make the test readable and easily understandable. Unlike comments or extremely long manuals, tests are executable and will tell you if they are lying.

Designing Through Code

I am always troubled that one of the D’s in TDD doesn’t stand for design. As I touched on briefly before, the practice of TDD is not entirely about writing tests, ensuring coverage and working software. TDD flips software development on its head. It forces you to think about the problem from the outside in, instead of from the inside out.
Writing tests first forces you not to worry about implementation; the primary concern is with using your object or function. Since we spend a fair amount of time directly interacting with the objects and functions we are writing, architecture and design come to the forefront.

When Not to TDD

What if you're using a new library or framework and you don't know how to use it? How can you write tests first if you don't know how to begin? The answer is you can't. Create a new project, and use the new library away from your production code base. This new project is known as a spike. Since this isn't production code, you aren't violating Rule #1 of Uncle Bob's 3 basic rules. Code until you feel comfortable with the library. When you know how your library works, ignore the spike, go back to your production code, and start writing tests.
However, just because you can't TDD doesn't mean you should completely throw away the discipline of writing tests. Your tests will serve as a reference for you and (if your spike is in source control) for those who come behind you. Using these tests, you can quickly recall what you have learned, and you will hopefully be able to look back and see a progression in your technique.

Wednesday, January 2, 2013

MapReduce Algorithms – Order Inversion

Source: http://codingjunkie.net/order-inversion/


This post is another segment in the series presenting MapReduce algorithms as found in the Data-Intensive Text Processing with MapReduce book. Previous installments are Local Aggregation, Local Aggregation Part II and Creating a Co-Occurrence Matrix. This time we will discuss the order inversion pattern. The order inversion pattern exploits the sorting phase of MapReduce to push data needed for calculations to the reducer ahead of the data that will be manipulated. Before you dismiss this as an edge condition for MapReduce, I urge you to read on, as we will discuss how to use sorting to our advantage and cover using a custom partitioner, both of which are useful tools to have available. Although many MapReduce programs are written at a higher level of abstraction, i.e. Hive or Pig, it's still helpful to have an understanding of what's going on at a lower level. The order inversion pattern is found in chapter 3 of the Data-Intensive Text Processing with MapReduce book.

To illustrate the order inversion pattern we will be using the Pairs approach from the co-occurrence matrix pattern. When creating the co-occurrence matrix, we track the total counts of when words appear together. At a high level we take the Pairs approach and add a small twist: in addition to having the mapper emit a word pair such as ("foo","bar"), we will emit an additional word pair of ("foo","*"), and will do so for every word pair, so we can easily achieve a total count for how often the left-most word appears, and use that count to calculate our relative frequencies. This approach raises two specific problems. First, we need to find a way to ensure word pairs ("foo","*") arrive at the reducer first. Second, we need to make sure all word pairs with the same left word arrive at the same reducer. Before we solve those problems, let's take a look at our mapper code.

Mapper Code

First we need to modify our mapper from the Pairs approach. At the bottom of each loop after we have emitted all the word pairs for a particular word, we will emit the special token WordPair(“word”,”*”) along with the count of times the word on the left was found.
public class PairsRelativeOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);
    private IntWritable totalCount = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                tokens[i] = tokens[i].replaceAll("\\W+", "");

                if (tokens[i].equals("")) {
                    continue;
                }

                wordPair.setWord(tokens[i]);

                int start = (i - neighbors < 0) ? 0 : i - neighbors;
                int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j <= end; j++) {
                    if (j == i) continue;
                    wordPair.setNeighbor(tokens[j].replaceAll("\\W", ""));
                    context.write(wordPair, ONE);
                }
                wordPair.setNeighbor("*");
                totalCount.set(end - start);
                context.write(wordPair, totalCount);
            }
        }
    }
}
Now that we've got a way to track the total number of times a particular word has been encountered, we need to make sure those special-character pairs reach the reducer first so a total can be tallied to calculate the relative frequencies. We will have the sorting phase of the MapReduce process handle this for us by modifying the compareTo method on the WordPair object.

Modified Sorting

We modify the compareTo method on the WordPair class so that when a "*" character is encountered on the right, that particular object is pushed to the top.
@Override
public int compareTo(WordPair other) {
    int returnVal = this.word.compareTo(other.getWord());
    if (returnVal != 0) {
        return returnVal;
    }
    if (this.neighbor.toString().equals("*")) {
        return -1;
    } else if (other.getNeighbor().toString().equals("*")) {
        return 1;
    }
    return this.neighbor.compareTo(other.getNeighbor());
}
By modifying the compareTo method we are now guaranteed that any WordPair with the special character will be sorted to the top and arrive at the reducer first. This leads to our second specialization: how can we guarantee that all WordPair objects with a given left word will be sent to the same reducer? The answer is to create a custom partitioner.

Custom Partitioner

Intermediate keys are shuffled to reducers by calculating the hashcode of the key modulo the number of reducers. But our WordPair objects contain two words, so taking the hashcode of the entire object clearly won't work. We need to write a custom Partitioner that takes only the left word into consideration when determining which reducer to send the output to.
public class WordPairPartitioner extends Partitioner<WordPair, IntWritable> {

    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
        // Mask off the sign bit so a negative hashCode can't produce a negative partition
        return (wordPair.getWord().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
Now we are guaranteed that all of the WordPair objects with the same left word are sent to the same reducer. All that is left is to construct a reducer to take advantage of the format of the data being sent.

Reducer

Building the reducer for the order inversion pattern is straightforward. It will involve keeping a counter variable and a "current" word variable. The reducer will check the input key WordPair for the special character "*" on the right. If the word on the left is not equal to the "current" word, we will reset the counter and sum all of the values to obtain a total number of times the given current word was observed. We will then process the next WordPair objects, sum the counts, and divide by our counter variable to obtain a relative frequency. This process will continue until another special character is encountered and the process starts over.
public class PairsRelativeOccurrenceReducer extends Reducer<WordPair, IntWritable, WordPair, DoubleWritable> {
    private DoubleWritable totalCount = new DoubleWritable();
    private DoubleWritable relativeCount = new DoubleWritable();
    private Text currentWord = new Text("NOT_SET");
    private Text flag = new Text("*");

    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        if (key.getNeighbor().equals(flag)) {
            if (key.getWord().equals(currentWord)) {
                totalCount.set(totalCount.get() + getTotalCount(values));
            } else {
                currentWord.set(key.getWord());
                totalCount.set(0);
                totalCount.set(getTotalCount(values));
            }
        } else {
            int count = getTotalCount(values);
            relativeCount.set((double) count / totalCount.get());
            context.write(key, relativeCount);
        }
    }

    private int getTotalCount(Iterable<IntWritable> values) {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        return count;
    }
}
By manipulating the sort order and creating a custom partitioner, we have been able to send the data needed for a calculation to the reducer before the data to be manipulated by that calculation arrives. Although not shown here, a combiner was used to run the MapReduce job. This approach is also a good candidate for the "in-mapper" combining pattern.
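For completeness, a plausible sketch of such a combiner is one that partially sums the counts for each WordPair on the map side; summing integer counts is associative and commutative, so it is combiner-safe. The class name here is my own invention, not from the original post:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class WordPairCountCombiner extends Reducer<WordPair, IntWritable, WordPair, IntWritable> {
    private IntWritable sum = new IntWritable();

    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Partially sum the counts for this WordPair before they cross the network
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        context.write(key, sum);
    }
}

It would be wired into the job driver with job.setCombinerClass(WordPairCountCombiner.class), alongside job.setPartitionerClass(WordPairPartitioner.class).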

Example & Results

Given that the holidays are upon us, I felt it was timely to run an example of the order inversion pattern against the novel “A Christmas Carol” by Charles Dickens. I know it’s corny, but it serves the purpose.
new-host-2:sbin bbejeck$ hdfs dfs -cat relative/part* | grep Humbug
{word=[Humbug] neighbor=[Scrooge]}  0.2222222222222222
{word=[Humbug] neighbor=[creation]} 0.1111111111111111
{word=[Humbug] neighbor=[own]}  0.1111111111111111
{word=[Humbug] neighbor=[said]} 0.2222222222222222
{word=[Humbug] neighbor=[say]}  0.1111111111111111
{word=[Humbug] neighbor=[to]}   0.1111111111111111
{word=[Humbug] neighbor=[with]} 0.1111111111111111
{word=[Scrooge] neighbor=[Humbug]}  0.0020833333333333333
{word=[creation] neighbor=[Humbug]} 0.1
{word=[own] neighbor=[Humbug]}  0.006097560975609756
{word=[said] neighbor=[Humbug]} 0.0026246719160104987
{word=[say] neighbor=[Humbug]}  0.010526315789473684
{word=[to] neighbor=[Humbug]}   3.97456279809221E-4
{word=[with] neighbor=[Humbug]} 9.372071227741331E-4

Conclusion

While calculating relative word occurrence frequencies is probably not a common task, we have been able to demonstrate useful examples of sorting and using a custom partitioner, which are good tools to have at your disposal when building MapReduce programs. As stated before, even if most of your MapReduce is written at a higher level of abstraction like Hive or Pig, it's still instructive to have an understanding of what is going on under the hood. Thanks for your time.