Free IT Books, Study Guides, Practice Exams, Tutorials and Software
Friday, December 19th 2014

Java8: How to implement a custom Collector?

As you may already know, the Stream API is one of the most significant features introduced in Java 8. Alongside lambdas, which reduce the amount of boilerplate in our code base, the Stream API relieves us of the responsibility for how collections are traversed. In pre-Java 8 versions of the platform, we had to use either an Iterator implementation or a loop (for, enhanced for, and so on). With Java 8, we can obtain a stream from the collection and use its operations to manipulate the content. This way, we hand the responsibility for iterating and manipulating over to the Stream, and just pass instructions (in the form of lambdas) describing how the elements have to be processed.

Getting collection items back out of a Stream is usually done by calling one of the overloaded terminal methods Stream#collect(Collector collector) or Stream#collect(Supplier supplier, BiConsumer accumulator, BiConsumer combiner). For some commonly used operations, the Stream API provides several ready-made collectors, which we obtain through the static methods of the Collectors class, such as Collectors#toSet(), Collectors#toList() or Collectors#toMap().
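As a minimal sketch of the two collect overloads mentioned above, the following class collects the same stream of names twice. The class and method names (CollectOverloads, withCollector, withThreeFunctions) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectOverloads {

    // Overload 1: hand over a ready-made Collector from the Collectors class.
    static List<String> withCollector(Stream<String> names) {
        return names.collect(Collectors.toList());
    }

    // Overload 2: pass the supplier, accumulator and combiner functions directly.
    static List<String> withThreeFunctions(Stream<String> names) {
        return names.collect(
                () -> new ArrayList<>(),               // supplier: creates the result container
                (list, name) -> list.add(name),        // accumulator: folds one element in
                (left, right) -> left.addAll(right));  // combiner: merges two partial results
    }

    public static void main(String[] args) {
        System.out.println(withCollector(Stream.of("Steve", "John")));
        System.out.println(withThreeFunctions(Stream.of("Steve", "John")));
    }
}
```

The three functions of the second overload correspond to three of the methods a custom Collector has to provide.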

Sometimes, however, we may get to the point where the pre-defined Collector implementations may not be suitable, or at least, may not transform the elements from the stream to the exact collection type we would desire. This is something that is very likely to happen. For example, Collectors.toList() may produce an ArrayList, while we need a LinkedList. In such situations, knowing and understanding how to write a custom Collector implementation is crucial.

Consider the following use case. We have this data structure (specifically, a TreeMap):

Map<String, List<String>> peopleByCity = new TreeMap<>();

with the following content:

{ "London" : [ "Steve", "John"],
  "Paris"  : [ "Thierry" ],
  "Sofia"  : [ "Peter", "Konstantin", "Ivan"] }

We'd like to implement a Collector that transforms the entry set of the given TreeMap into a List of Map.Entry elements, one per city/name pair. For the example above, the transformation has to result in a List with the following content:

London : Steve
London : John
Paris  : Thierry
Sofia  : Peter
Sofia  : Konstantin
Sofia  : Ivan

Implementing a custom Collector is easy! We just need to implement the java.util.stream.Collector interface. By definition, it's generic and has three type parameters.

  • T - the type of input elements to the reduction operation
  • A - the mutable accumulation type of the reduction operation (often hidden as an implementation detail)
  • R - the result type of the reduction operation

The Collector interface declares five abstract methods, and I will explain the idea behind each of them. A Collector is essentially a reducer that needs a temporary (internal) mutable structure to hold the intermediate state of the transformed items. This structure is often referred to as the accumulator. Here, an ArrayList is perfectly suitable as an accumulator, because for each [City ; Name] pair we will add a new entry to it.

Proceeding to the actual implementation, the class definition would be:

public class KockoCollector<T, V> implements Collector<Entry<T, List<V>>, List<Entry<T, V>>, List<Entry<T,V>>> {
       //Implemented methods
}

The supplier() method returns a function that supplies the accumulator (the mutable result container) for the Collector. Since we picked ArrayList as the type of our accumulator, the method implementation is simply:

public Supplier<List<Entry<T, V>>> supplier() {
    return ArrayList::new;
}

The accumulator() method returns a function that folds the stream element currently being processed into the accumulator. In our case, we stream the person names of each city, and for each name we add a new AbstractMap.SimpleEntry to the accumulator. (Note that this is perfectly valid, because AbstractMap.SimpleEntry implements the Map.Entry interface.)

public BiConsumer<List<Entry<T, V>>, Entry<T, List<V>>> accumulator() {
    return (accum, entry) -> {
        entry.getValue().stream()
                 .forEach(x -> accum.add(new AbstractMap.SimpleEntry<T, V>(entry.getKey(), x)));
    };
}

The third method we have to implement is combiner(). This method matters mostly when working with parallel streams. Its purpose is to combine the internal accumulators of the stream batches that are being processed in parallel. If the Collector is not supposed to work with parallel streams, the implementation can be left as a stub. Otherwise, it's mandatory to describe how the parallel pieces will be merged together. We'd like our custom Collector to work on parallel streams, so we provide an implementation for combiner(). A single call to this method returns a BinaryOperator whose implementation merges the content of two accumulators, x and y, by simply adding the content of one to the other.

public BinaryOperator<List<Entry<T, V>>> combiner() {
    return (x, y) -> {
        x.addAll(y);
        return x;
    };
}

We're almost done implementing our custom collector. We picked the type of our accumulator, and we explained how our collector merges the content of parallel accumulators. The only thing left is to pick our final return type. This is what the finisher() method is for. It returns a Function that takes the collector's internal accumulator and converts it to the type that our collector is supposed to produce once it has finished working with the stream elements. Sometimes, however, returning the accumulator itself from the finisher() is perfectly valid. This is the case when we don't need to convert the accumulator to some other type. Our case is one of these, and therefore the finisher() implementation is pretty simple:

public Function<List<Entry<T, V>>, List<Entry<T, V>>> finisher() {
    return accumulator -> accumulator;
}

The Collector interface declares one more abstract method: characteristics(), which returns a Set<Collector.Characteristics> containing meta-information about the collector. The Collector.Characteristics enum has only three values: CONCURRENT, IDENTITY_FINISH and UNORDERED. The returned Set may also be empty; we only list the characteristics that actually apply. If our Collector were thread-safe (it isn't), we would add the CONCURRENT constant to the Set. Since our finisher() returns the accumulator unchanged, IDENTITY_FINISH would be a valid addition as well. Here we only declare UNORDERED, which means the collector does not commit to preserving the encounter order of the stream.

public Set<java.util.stream.Collector.Characteristics> characteristics() {
    return EnumSet.of(Characteristics.UNORDERED);
}
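To see how the pieces fit together, here is one possible assembly of the five methods into a single class, followed by a small usage example on the sample data. The KockoCollectorDemo class and its helper methods are hypothetical additions for this sketch. Because the collector declares UNORDERED, the order of the printed lines should not be relied upon.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

class KockoCollector<T, V>
        implements Collector<Entry<T, List<V>>, List<Entry<T, V>>, List<Entry<T, V>>> {

    @Override
    public Supplier<List<Entry<T, V>>> supplier() {
        return ArrayList::new; // a fresh accumulator per stream (or per parallel batch)
    }

    @Override
    public BiConsumer<List<Entry<T, V>>, Entry<T, List<V>>> accumulator() {
        // One (key, value) entry is added for every value in the incoming entry's list.
        return (accum, entry) -> entry.getValue()
                .forEach(x -> accum.add(new AbstractMap.SimpleEntry<T, V>(entry.getKey(), x)));
    }

    @Override
    public BinaryOperator<List<Entry<T, V>>> combiner() {
        return (x, y) -> {
            x.addAll(y); // merge two partial results
            return x;
        };
    }

    @Override
    public Function<List<Entry<T, V>>, List<Entry<T, V>>> finisher() {
        return accumulator -> accumulator; // the accumulator already has the result type
    }

    @Override
    public Set<Characteristics> characteristics() {
        return EnumSet.of(Characteristics.UNORDERED);
    }
}

class KockoCollectorDemo {

    // Builds the sample TreeMap from the article.
    static Map<String, List<String>> samplePeopleByCity() {
        Map<String, List<String>> peopleByCity = new TreeMap<>();
        peopleByCity.put("London", Arrays.asList("Steve", "John"));
        peopleByCity.put("Paris", Arrays.asList("Thierry"));
        peopleByCity.put("Sofia", Arrays.asList("Peter", "Konstantin", "Ivan"));
        return peopleByCity;
    }

    static List<Entry<String, String>> flatten(Map<String, List<String>> peopleByCity) {
        return peopleByCity.entrySet().stream().collect(new KockoCollector<String, String>());
    }

    public static void main(String[] args) {
        flatten(samplePeopleByCity())
                .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue()));
    }
}
```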

That's pretty much it. As you've seen, implementing custom collectors is not too difficult and it's actually quite fun.

Think about what kind of collectors you could implement for the project you're currently working on. Can you share your ideas?

Please leave a comment below if you find this article useful or if you have questions, as well.


JavaEE Tip #6 - Resources

The @Resources annotation makes it possible to declare resources that you are going to look up at runtime. The example shows a way to look up a datasource that is declared in the @Resources annotation on the class.

  @Resources({
    @Resource(name="ShoppingCartDB", type=javax.sql.DataSource.class),
    @Resource(name="ShoppingCartMail", type=javax.mail.Session.class)
  })
  public class ShoppingCartBean {
    // Assumption: ctx is an injected context; its declaration is missing from the original snippet.
    @Resource
    private SessionContext ctx;

    public List getItems() throws SQLException {
      DataSource shoppingCartDB = (DataSource) ctx.lookup("ShoppingCartDB");
      Connection conn = shoppingCartDB.getConnection();
      return null;
    }
  }


MongoDB: Text search vs. dedicated text search engines

By Kyle Banker, Peter Bakkum, Shaun Verch, Douglas Garrett, and Tim Hawkins
MongoDB in Action, Second Edition

Save 39% on MongoDB in Action, Second Edition with discount code jnmongdb at manning.com.

Dedicated text search engines can go beyond indexing just web pages to indexing extremely large databases. Text search engines can provide capabilities such as spelling correction, suggestions as to what you’re really looking for, and relevancy measures, things many web search engines can do as well. But in addition, dedicated search engines can provide further improvements such as facets, custom synonym libraries, custom stemming algorithms, and custom stop word dictionaries.

Faceted search in particular is something that you see almost any time you shop on a modern large e-commerce website, where results will be grouped by certain categories that allow the user to further explore. For example, if you go to the Amazon website and search using the term “apple” you’ll see something like the page in figure 1.

Figure 1: Search on Amazon using the term “apple” and illustrating the use of faceted search

#A Show results for different “facets” based on department
#B List of most common facets
#C Show all facets / departments

On the left side of the web page, you’ll see a list of different groupings you might find for Apple-related products and accessories. These are the results of a faceted search. Facets make it easy and efficient to turn almost any field into a type of category. In addition, facets can go beyond groupings based on the different values in a field. For example, in figure 1 you see groupings based on weight ranges instead of exact weight. This approach allows you to narrow the search based on the weight range you want, something that’s pretty important if you’re searching for a portable computer.

Facets allow the user to easily drill down into the results to help narrow their search results based on different criteria of interest to them. Facets in general are a tremendous aid to help you find what you’re looking for, especially in a product database as large as Amazon, which sells more than 200 million products. This is where a faceted search becomes almost a necessity.


Unfortunately, many of the capabilities available in a full-blown text search engine are beyond the capabilities of MongoDB. But there’s good news: MongoDB can still provide you with about 80 percent of what you might want in a catalog search, with less complexity and effort than is needed to establish a full-blown text search engine with faceted search and suggestive terms. What does MongoDB give you?

  • Automatic real-time indexing with stemming
  • Optional assignable weights by field name
  • Multilanguage support
  • Stop word removal
  • Exact phrase or word matches
  • The ability to exclude results with a given phrase or word

NOTE: Unlike more full-featured text search engines, MongoDB doesn’t allow you to edit the list of stop words. There’s a request to add this: https://jira.mongodb.org/browse/SERVER-10062.

All these capabilities are available for the price of simply defining an index, which then gives you access to some decent word-search capabilities without having to copy your entire database to a dedicated search engine. This approach also avoids the additional administrative and management overhead that would go along with a dedicated search engine. Not a bad trade-off if MongoDB gives you enough of the capabilities you need.

Now let’s see the details of how MongoDB provides this support. It’s pretty simple:

  • First, you define the indexes needed for text searching.
  • Then, you’ll use text search in both basic queries and the aggregation framework.

One more critical component you’ll need is MongoDB 2.6 or later. MongoDB 2.4 introduced text search in an experimental stage, but it wasn’t until MongoDB 2.6 that text search became available by default, and text search related functions became fully integrated with the find() and aggregate() functions.


Before taking a detailed look at how MongoDB’s text search works, let’s explore an example using the e-commerce data. The first thing you’ll need to do is define an index; you’ll begin by specifying the fields that you want to index. Here’s a simple example using the e-commerce products collection.

db.products.createIndex(
    {name: 'text',          #A
     description: 'text',   #B
     tags: 'text'}          #C
)
#A Index name field
#B Index description field
#C Index tags field

This index specifies that the text from three fields in the products collection will be searched: name, description, and tags. Now let’s see a search example that looks for gardens in the products collection:

> db.products
    .find({$text: {$search: 'gardens'}},        #A
          {_id:0, name:1,description:1,tags:1})

{
    "name" : "Rubberized Work Glove, Black",
    "description" : "Black Rubberized Work Gloves...",
    "tags" : [
        "gardening"                        #B
    ]
}
{
    "name" : "Extra Large Wheel Barrow",
    "description" : "Heavy duty wheel barrow...",
    "tags" : [
        "gardening",                      #C
        ...
    ]
}

#A Search for text field gardens
#B gardening matches search
#C gardening matches search

Even this simple query illustrates a few key aspects of text search and how it differs from normal text search. In this example, the search for gardens has resulted in a search for the stemmed word garden. That in turn has found two products with the tag gardening, which has been stemmed and indexed under garden.
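The effect of stemming can be illustrated with a deliberately naive sketch in Java. This is not MongoDB's stemming algorithm, only a toy suffix stripper (the ToyStemmer class and its rules are invented for this example) that shows why a search for gardens can match a document tagged gardening once both are reduced to the same stem.

```java
import java.util.Arrays;
import java.util.List;

class ToyStemmer {

    // Naive rule: strip a trailing "ing" or "s" if a reasonably long stem remains.
    // Real stemmers use far more elaborate rules; this is only an illustration.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : Arrays.asList("ing", "s")) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // A search term matches if its stem equals the stem of at least one tag.
    static boolean matches(String searchTerm, List<String> tags) {
        String stemmed = stem(searchTerm);
        return tags.stream().map(ToyStemmer::stem).anyMatch(stemmed::equals);
    }

    public static void main(String[] args) {
        System.out.println(stem("gardens") + " / " + stem("gardening"));
        System.out.println(matches("gardens", Arrays.asList("gardening")));
    }
}
```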


JavaEE Tip #5 - Resource

With the @Resource annotation you define the resource you want to inject. Note this annotation is an older style annotation. Going forward the recommendation is to use the CDI @Inject annotation.

In code

  @Resource
  UserTransaction utx;


Working with geospatial data

By Radu Gheorghe and Matthew Lee Hinman, Elasticsearch in Action

Geospatial data is all about making your search application location aware. For example, to search for events that are close to you, or to find restaurants in a certain area, or to see which park’s area intersects with the area of the city center, you’d work with geospatial data.

We’ll call events and restaurants in this context points, because they’re essentially points on the map. We’ll put areas, such as a country or a rectangle that you draw on a map, under the generic umbrella of shapes. Geospatial search is all about points, shapes, and various relations between them:

  • Distance between a point and another point—If where you are is a point, and swimming pools are other points, you can search for the closest swimming pools. Or you can filter only pools that are reasonably close to you.
  • A shape containing a point—If you select an area on the map, like the area where you work, you can filter only restaurants that are in that area.
  • A shape overlapping with another shape—For example, you want to search for parks in the city center.

This article will show you how to search and sort documents in Elasticsearch, based on their distance from a reference point on the map. You’ll also learn how to search for points that fall into a rectangle and how to search shapes that intersect with a certain area you define on the map.

Points and distances between them

To search for points, you have to index them first. Elasticsearch has a geo point type especially for that. You can see an example on how to use it in the code samples, by looking at mapping.json.

NOTE The code samples for this article, along with instructions on how to use them, can be found at https://github.com/dakrone/elasticsearch-in-action.

Each event has a location field, which is an object that includes the geolocation field as a geo_point type:

"geolocation" : { "type" : "geo_point"}

With the geo point type defined in your mapping, you can index points by giving the latitude and longitude, as you can see in populate.sh:

"geolocation": "39.748477,-104.998852"

TIP You can also provide the latitude and longitude as properties, as an array, or as a geohash. This doesn’t change the way points are indexed; it’s just for your convenience, in case you have a preferred way. You can find more details at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/ma....

Having geo points indexed as part of your event documents enables you to add distance criteria to your searches in the following ways:

  • Sort results by the distance from a given point—This makes the event closest to you appear first.
  • Filter results by distance—This lets you display only events that are within 100 kilometers from you, for example.
  • Count results by distance—This allows you to create buckets of ranges. For example, you can get the number of events within 100 km from you and the number of events from 100 km to 200 km and so on.

Adding distance to your sort criteria

Using a get-together example, let’s say your coordinates are 40,-105 and you need to find the event about Elasticsearch closest to you. To do that, you need to add a sort criteria called _geo_distance, where you specify your current location, as shown in the following listing.

Listing 1 Sorting events by distance

curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {                     #A
    "match": {                   #A
      "title": "elasticsearch"   #A
    }                            #A
  },                             #A
  "sort" : [
    {
      "_geo_distance" : {                   #B
        "location.geolocation" : "40,-105", #C
        "order" : "asc",                    #D
        "unit" : "km"                       #E
      }
    }
  ]
}'

#A The query, looking for “elasticsearch” in the title
#B The _geo_distance sort criteria
#C Your current location
#D Ascending order will give closest events first
#E Each hit will have a sort value, representing the distance from your location in kilometers

Sorting by distance and other criteria at the same time by using scripts

A search like the previous one is useful when distance is your only criteria. If you want to include other criteria in the equation, such as the document’s score, you can use a script. That script can generate a final score based on the initial score from your query plus the distance from your point of interest.

Listing 2 shows such a query. You’ll use the function_score query, which will first run the same match query as listing 1, looking for events about Elasticsearch. Next, the script will take the initial score and divide it by the distance. This way, an event will score higher the closer it is to you. To refer to the distance from a point, you’ll use the arcDistanceInKm() function, where you’ll specify where you are, for example, doc['location.geolocation'].arcDistanceInKm(40.0, -105.0).

Listing 2 Taking distance into account when calculating the score

curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {                   #A
        "match": {                 #A
          "title": "elasticsearch" #A
        }                          #A
      },                           #A
      "script_score": {            #B
        "script": "if (doc['"'location.geolocation'"'].empty){    #C
                     _score                                       #C
                   } else {
                     _score*40000/doc['"'location.geolocation'"'].arcDistanceInKm(40.0, -105.0) #D
                   }"
      }
    }
  }
}'

#A The query looking for “elasticsearch” will return a score
#B The script_score will calculate the final score based on the script you run
#C If there’s no geolocation field, you leave the score untouched
#D Otherwise, you divide the score by the distance, to get higher scores for lower distances, and you multiply everything by 40000, to bump the score of all events with geo information past the ones without it
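The arithmetic of the script in listing 2 can be sketched in plain Java. The DistanceAwareScore class and the use of a null distance to stand for a missing geolocation are assumptions made for this illustration; they are not part of Elasticsearch.

```java
class DistanceAwareScore {

    // Mirrors the script: keep the score if there is no geo information,
    // otherwise scale it by 40000 and divide by the distance in kilometers.
    // Note: a distance of exactly 0 km yields Infinity with double arithmetic.
    static double adjust(double score, Double distanceKm) {
        if (distanceKm == null) {
            return score;
        }
        return score * 40000 / distanceKm;
    }

    public static void main(String[] args) {
        System.out.println(adjust(2.0, null));   // no geolocation: score untouched
        System.out.println(adjust(2.0, 100.0));  // 100 km away
        System.out.println(adjust(2.0, 10.0));   // 10 km away scores higher
    }
}
```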

You might be tempted to think that such scripts bring the best of both worlds: relevance from your query and the geospatial dimension. Although the function_score query is very powerful indeed, there are two aspects to be aware of:

  • Tuning the score—How important is distance compared to relevancy? What should you do with documents that don’t have geo information? These questions are tricky to answer.
  • Performance—Running a script like the one in listing 2 is expensive in terms of speed, especially when you have lots of documents.

If these two aspects bother you, then you may want to search your events as usual and filter only those that are within a certain distance.

Filter based on distance

Let’s say you’re looking for events within a certain range from where you are, like in figure 1.

Figure 1: You can filter only points that fall in a certain range from a specified location.

To filter such events, you’d use the geo distance filter. The parameters it needs are your reference location and the limiting distance, as shown here:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "50km",
          "location.geolocation": "40.0,-105.0"
        }
      }
    }
  }
}'

In this default mode, Elasticsearch will calculate the distance from 40.0,-105.0 to each event’s geolocation and return only those that are under 50 km. You can set the way the distance is calculated via the distance_type parameter, which will go right next to the distance parameter. You have three options:

  • sloppy_arc (default)—It calculates the distance between the two points by doing a faster approximation of an arc of a circle. For most situations, this is a good option.
  • arc—It actually calculates the arc of a circle, making it slower but more precise than sloppy_arc. Note that you don’t get 100% precision here either, because the Earth isn’t perfectly round. Still, if you need precision, this is the best option.
  • plane—This is the fastest but least precise implementation because it assumes the surface between the two points is plane. This option works well when you have many documents and the distance limit is fairly small.
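To make the trade-off between arc and plane more concrete, here is a rough Java sketch of two textbook formulas: the haversine formula for the arc of a circle, and a flat-surface (equirectangular) approximation. These are not Elasticsearch's implementations; the GeoDistance class, the method names and the spherical Earth radius of 6371 km are assumptions for this example.

```java
class GeoDistance {

    static final double EARTH_RADIUS_KM = 6371.0; // mean radius, assuming a perfect sphere

    // "arc": great-circle distance via the haversine formula.
    static double arcKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // "plane": treat the surface between the two points as flat.
    // Cheaper (no asin, fewer trig calls) but less precise over long distances.
    static double planeKm(double lat1, double lon1, double lat2, double lon2) {
        double x = Math.toRadians(lon2 - lon1) * Math.cos(Math.toRadians((lat1 + lat2) / 2));
        double y = Math.toRadians(lat2 - lat1);
        return EARTH_RADIUS_KM * Math.sqrt(x * x + y * y);
    }

    public static void main(String[] args) {
        // Reference point 40,-105 and the sample event location from populate.sh.
        System.out.println(arcKm(40.0, -105.0, 39.748477, -104.998852));
        System.out.println(planeKm(40.0, -105.0, 39.748477, -104.998852));
    }
}
```

For two points this close together, both methods give nearly the same result, which is why the plane option works well when the distance limit is small.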

Performance optimization doesn’t end with distance algorithms. There’s another parameter to the geo distance filter called optimize_bbox. bbox stands for “bounding box,” which is a rectangle that you define on a map that contains all the points and areas of interest.

Using optimize_bbox will first check if events match a square that contains the circle describing the distance range. If they match, Elasticsearch filters further by calculating the distance.

If you ask yourself if the bounding box optimization is actually worth it, then you’ll be happy to know that for most cases it is. Verifying whether a point belongs to a bounding box is much faster than calculating the distance and comparing it to your limit.
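The idea behind the bounding box optimization can be sketched in Java as follows: a cheap coordinate comparison rejects most points, and the more expensive distance calculation only runs for points inside the box. This is an illustration of the principle, not Elasticsearch code. The BoundingBoxPrefilter class is hypothetical, and the sketch assumes a spherical Earth, a small radius, and a location away from the poles and the antimeridian.

```java
class BoundingBoxPrefilter {

    static final double EARTH_RADIUS_KM = 6371.0; // spherical Earth assumption

    // Cheap check: is the point inside the box that encloses the distance circle?
    static boolean inBoundingBox(double centerLat, double centerLon, double radiusKm,
                                 double lat, double lon) {
        double angularRadius = radiusKm / EARTH_RADIUS_KM;
        double latDelta = Math.toDegrees(angularRadius);
        // Longitude extent of the circle grows with latitude; not valid near the poles.
        double lonDelta = Math.toDegrees(
                Math.asin(Math.sin(angularRadius) / Math.cos(Math.toRadians(centerLat))));
        return Math.abs(lat - centerLat) <= latDelta && Math.abs(lon - centerLon) <= lonDelta;
    }

    // Expensive check: great-circle distance via the haversine formula.
    static double arcKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    static boolean withinDistance(double centerLat, double centerLon, double radiusKm,
                                  double lat, double lon) {
        if (!inBoundingBox(centerLat, centerLon, radiusKm, lat, lon)) {
            return false; // rejected without computing any distance
        }
        return arcKm(centerLat, centerLon, lat, lon) <= radiusKm;
    }

    public static void main(String[] args) {
        System.out.println(withinDistance(40.0, -105.0, 50.0, 39.748477, -104.998852));
        System.out.println(withinDistance(40.0, -105.0, 50.0, 41.0, -105.0));
    }
}
```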

It’s also configurable. You can set optimize_bbox to none and check if your query times are faster or slower. The default value is memory and you can set it to indexed.

Are you curious about what the difference between memory and indexed is? We’ll discuss this difference in the beginning of the next section. If you’re not curious and you don’t want to obsess on performance improvements, sticking with the default should be good enough for most cases.

When you index a point, one way to search for it is by calculating the distance to another point, which is what we’ve discussed so far. The second way to search for it is in relation to a shape, which we’ll look at next.

Does a point belong to a shape?

Shapes, especially rectangles, are easy to draw interactively on a map, as you can see in figure 2. It’s also faster to search for points in a shape than to calculate distances, because searching in a shape only requires comparing the coordinates of the point with the coordinates of the shape’s corners.

Figure 2: You can filter points based on whether they fall within a rectangle on the map.

There are three types of shapes on the map that you can match points to:

  • Bounding boxes (rectangles)—These are quite fast and give you the flexibility to draw any rectangle.
  • Polygons—These allow you to draw a more precise shape, but it’s difficult to ask a user to draw a polygon, and the more complex the polygon is, the slower the search.
  • Geohashes (squares defined by a hash)—These are the least flexible, because hashes are fixed. But, as you’ll see later, they’re typically the fastest implementation of the three.

Bounding box filter

To search if a point falls within a rectangle, you’d use the bounding box filter. This is useful if your application allows users to click a point on the map to define a corner of the rectangle and then to click again to define the opposite corner. The result could be a rectangle like the one from figure 2.

To run the bounding box filter, you specify the coordinates for the top-left and bottom-right points that describe the rectangle:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location.geolocation": {
            "top_left": "40, -106",
            "bottom_right": "38, -103"
          }
        }
      }
    }
  }
}'

The default implementation of the bounding box filter is to load the points’ coordinates in memory and compare them with those provided for the bounding box. This is the equivalent of setting the type option under geo_bounding_box to memory.
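Conceptually, the in-memory comparison boils down to four coordinate checks per point. The following Java sketch shows that comparison for the rectangle used above. The GeoBoundingBox class is a hypothetical illustration rather than Elasticsearch code, and it ignores rectangles that cross the antimeridian.

```java
class GeoBoundingBox {

    final double top, left, bottom, right;

    // Built from the top-left and bottom-right corners, as in the filter.
    GeoBoundingBox(double topLeftLat, double topLeftLon,
                   double bottomRightLat, double bottomRightLon) {
        this.top = topLeftLat;
        this.left = topLeftLon;
        this.bottom = bottomRightLat;
        this.right = bottomRightLon;
    }

    // A point is inside if its latitude and longitude fall between the corners.
    boolean contains(double lat, double lon) {
        return lat <= top && lat >= bottom && lon >= left && lon <= right;
    }

    public static void main(String[] args) {
        GeoBoundingBox box = new GeoBoundingBox(40, -106, 38, -103);
        System.out.println(box.contains(39.748477, -104.998852)); // sample event location
        System.out.println(box.contains(41.0, -105.0));           // north of the rectangle
    }
}
```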

Alternatively, you can set type to indexed, and Elasticsearch will do the same comparison using range filters. For this implementation to work, you need to index the point’s latitude and longitude in their own fields, which isn’t enabled by default.

To enable indexing latitude and longitude separately, you have to set lat_lon to true in your mapping, making your geolocation field definition look like this:

"geolocation" : { "type" : "geo_point", "lat_lon": true }

NOTE If you make this change to mapping.json from the code samples, you’ll need to run populate.sh again to re-index the sample dataset and have your changes take effect.

The indexed implementation is faster, but indexing latitude and longitude will make your index bigger. Also, if you have more geo points per document—such as an array of points for a restaurant franchise—the indexed implementation won’t work.

Geohash cell filter

The last point-matches-shape method you can use is by matching geohash cells. They work as suggested in figure 3: the Earth is divided into 32 rectangles/cells (dividing the latitude in 4 and the longitude in 8). Each cell is identified by an alphanumeric character, its hash. Then, each rectangle, for example, d, can be further divided into 32 rectangles of its own, generating d0, d1, and so on. You can repeat the process virtually forever, generating smaller and smaller rectangles with longer and longer hash values.

Figure 3: The world divided in 32 letter-coded cells. Each cell is divided into 32 cells and so on, making longer hashes.

Because of the way geohash cells are defined, each point on the map belongs to an infinite number of such geohash cells, like d, d0, d0b, and so on. Given such a cell, Elasticsearch can tell you which points match with the geohash cell filter:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geohash_cell": {
          "location.geolocation": "9xj"
        }
      }
    }
  }
}'

Even though a geohash cell is a rectangle, this filter works differently than the bounding box filter. First, geo points have to get indexed with a geohash that describes them, for example, 9xj6. Then, you also have to index all the ngrams of that hash, like 9, 9x, 9xj, and 9xj6, which describe all the parent cells. When you run the filter, the hash from the query is matched against the hashes indexed for that point, making a geohash cell filter which is very fast.
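The following Java sketch shows the standard geohash encoding (alternately halving the longitude and latitude ranges, five bits per character) together with the prefix idea described above. The Geohash class and its method names are made up for this example and are not Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.List;

class Geohash {

    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a point by repeatedly halving the longitude and latitude ranges.
    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bits refine longitude, odd bits refine latitude
        int bits = 0;
        int value = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value = value << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value = value << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bits == 5) { // five bits make one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    // All parent cells of a hash, e.g. 9, 9x, 9xj, 9xj6 for 9xj6.
    static List<String> prefixes(String hash) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= hash.length(); i++) {
            result.add(hash.substring(0, i));
        }
        return result;
    }

    // A point belongs to a cell if the cell is one of the indexed prefixes.
    static boolean inCell(String pointHash, String cell) {
        return prefixes(pointHash).contains(cell);
    }

    public static void main(String[] args) {
        System.out.println(encode(40.0, -105.0, 3));
        System.out.println(prefixes("9xj6"));
        System.out.println(inCell("9xj6", "9xj"));
    }
}
```

Once the prefixes are indexed, matching a cell is a plain term lookup rather than a geometric computation.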

To enable indexing the geohash in your geo point, you have to set geohash to true in the mapping. To index that hash’s parents (ngrams), you have to set geohash_prefix to true as well.

TIP Because a cell will never be able to perfectly describe a point, you have to choose how precise (or big) that rectangle needs to be. The default setting for precision is 12, which creates hashes like 9xj64sswpkdq with an accuracy of a few centimeters. Because you’ll also be indexing all the parents, you may want to trade some precision for index size and search performance.

Understanding geohash cells is important even if you’re not going to use the geohash cell filter because in Elasticsearch, geohashes are the default way of representing shapes. We’ll explain how shapes use geohashes in the next section.

Shape intersections

Elasticsearch can index documents with shapes, like polygons showing the area of a park, and filter documents based on whether parks overlap other shapes, such as the city center. It does this by default through the geohashes that we discussed in the previous section. The process is described in figure 4: each shape is approximated to a group of rectangles defined by geohashes. When you search, Elasticsearch will easily find out if at least one geohash of a certain shape overlaps a geohash of another shape.

Figure 4: Shapes represented in geohashes. Searching for shapes matching shape 1 will return shape 2.

Indexing shapes

Let’s say you have a shape of a park that’s a polygon with four corners. To index it, you’d first have to define a mapping of that shape field—let’s call it area—of type geo_shape. With the mapping in place, you can start indexing documents: the area field of each document would have to mention that the shape’s type is polygon and show the array of coordinates for that polygon, as shown in the next listing.

Listing 3 Indexing a shape

curl -XPUT localhost:9200/geo                          #A
curl -XPUT localhost:9200/geo/park/_mapping -d '{      #B
  "properties": {                                      #B
    "area": { "type": "geo_shape"}                     #B
  }                                                    #B
}'                                                     #B
curl -XPUT localhost:9200/geo/park/1 -d '{
  "area": {                                            #C
    "type": "polygon",                                 #C
    "coordinates": [                                   #D
      [[45, 30], [46, 30], [45, 31], [46, 32]]         #E
    ]
  }
}'

#A Creating a new index to index the park areas
#B Put the mapping for parks. geo-shapes will be indexed in the area field.
#C A polygon is indexed in the area field
#D Coordinates for the polygon
#E This first array describes the outer boundary. Optionally, other arrays can be added to define holes in the polygon.

NOTE Polygons aren’t the only shape type Elasticsearch supports. You can have multiple polygons in a single shape (type: multipolygon). There are also the point and multipoint types, one or more chained lines (linestring), and rectangles (envelope).

The amount of space a shape occupies in your index depends heavily on how you index it. Because geohashes can only approximate most shapes, it’s up to you to define how small those geohash rectangles can be. The smaller they are, the better the resolution/approximation, but your index size increases because smaller geohash cells have longer strings and—more importantly—more parent ngrams to index as well. Depending on where you are in this trade-off, you’ll specify a precision parameter in your mapping, which defaults to 50m. This means the worst-case scenario is to get an error of 50m.

Filtering overlapping shapes

With your park documents indexed, let’s say you have another four-cornered shape that represents your city center. To see which parks are at least partly in the city center, you’d use the geo shape filter. You can provide the shape definition of your city center in the filter, as in the following listing.

Listing 4 geo shape filter example

curl 'localhost:9200/geo/park/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geo_shape": {
          "area": {                                              #A
            "shape": {                                           #B
              "type": "polygon",                                 #C
              "coordinates": [                                   #C
                [[45, 30.5], [46, 30.5], [45, 31.5], [46, 32.5]] #C
              ]                                                  #C
            }
          }
        }
      }
    }
  }
}'

#A Field to be searched on
#B You’ll provide a shape in the query
#C Shape provided in the same way as when you index

If you followed listing 3, you should see that the indexed shape matches. Change the query coordinates to something like [[95, 30.5], [96, 30.5], [95, 31.5], [96, 32.5]], and the query won’t return any hits.


JavaEE Tip #4 - PreDestroy

With the @PreDestroy annotation you annotate a single method in your class that you want to be called just before your object (EJB, JSF managed bean, CDI bean) is taken out of service.

In code

  @PreDestroy
  public void destroy() {
    // put your destroy code here.
  }


JavaEE Tip #3 - PostConstruct

The @PostConstruct annotation makes it possible to run initialization code just before something (EJB, JSF managed bean, CDI bean) is put into service.

In code

  @PostConstruct
  public void init() {
    // put your initialization code here.
  }

Note that the method annotated with @PostConstruct runs only once! So if you are exposing a session bean that gets passivated and subsequently activated, this code is NOT run again.


Managing Marketing Campaigns with Magnolia CMS

To change up this blog's format a bit, I'm presenting to you: a video blog post! It's only a couple minutes long and shows you how you can manage your marketing campaigns efficiently with Magnolia. Marketing campaigns take a lot of planning and collaboration, and they're typically made up of a lot of different types of content. That's what we're trying to approach with this app: it lets digital marketing teams gather all the content for a given campaign, send it through a review cycle, and publish it in stages (or all at once, depending on your strategy).

Watch the demo below and let us know what you think!

SBT: Why a build tool? + 39% savings

By Joshua Suereth and Matthew Farwell, SBT in Action

Save 39% on SBT in Action with discount code sbtjn14 at manning.com.

If you've spent any time working in a team of software developers, you'll have noticed one thing. People make mistakes. Have you ever heard of the old joke about asking 10 economists a question and getting 11 different answers? It's like that when you've got a list of things to do: if you have 10 things to do, you'll make 11 mistakes doing them.

Another thing you'll notice is that people are different. Bob isn't the same person as Ted. When we look at Bob’s workstation, everything is organized and tidy. Everything is where it should be. No icons on his desktop. Ted, however, can't actually see his screen background for the icons. Don't know why he bothers having one in the first place.

One thing is absolutely certain. If you ask Bob to do a build of the product, you'll get a different product than if you were to ask Ted. But you can’t tell which one would have fewer bugs.

A few years ago, one of our project teams had an “automated” build. It was 10 or 12 Windows batch files that compiled the Java, built the web application, and deployed it to a server. This took a developer about 1.5 to 2 hours to do correctly; it was a lesson in how not to automate. Each developer had to do this for each change. Then two developers came in over a weekend and rewrote these scripts using a build tool called Apache Ant. The time for a full build dropped from 1.5 hours to (wait for it) 8 minutes. After the experience with the batch scripts, the increase in speed and reliability that Ant brought was very welcome. One of the fundamental principles of Agile development is getting rapid feedback, and this applies to building your code as well. If you have to wait an hour for a deployment, it becomes less and less interesting to experiment and try new approaches. Subsequently, gradual refactoring reduced the time for a full build to 3 minutes, including running more than 1,000 unit tests.

This shorter time really encouraged experimentation. You could try something out really easily. You may say that there isn't much difference between 8 minutes and 3 minutes, but, for fun, we'll do the calculation. Let's say a developer builds 5 times a day. With 6 developers on our team, each day we saved 5 minutes * 5 builds * 6 developers = 150 minutes, or 2.5 person hours. Over a year (200 working days), this is 500 person hours, or about 60 person days.
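The same back-of-the-envelope calculation can be written as a few lines of shell arithmetic. The 8-hour working day used to turn hours into days is an assumption; the text only says "about 60 person days".

```shell
# Figures taken from the paragraph above.
minutes_saved_per_build=$(( 8 - 3 ))
builds_per_day=5
developers=6
working_days=200

minutes_per_day=$(( minutes_saved_per_build * builds_per_day * developers ))
hours_per_year=$(( minutes_per_day * working_days / 60 ))
days_per_year=$(( hours_per_year / 8 ))   # assumption: 8-hour working day

echo "$minutes_per_day minutes/day, $hours_per_year hours/year, about $days_per_year days/year"
# prints: 150 minutes/day, 500 hours/year, about 62 days/year
```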

However, speed wasn't the only reason we wanted a better build system; we also wanted reproducibility. If a build is reproducible, then it doesn't matter if Bob builds the project and deploys it to the integration server, or if Ted does it. It produces the same result. When the build was rewritten as above, it didn't matter who did the final build. It didn't matter what software was installed on the developer's machine, which versions of the libraries they had, which version of Java they happened to have installed, etc. If there was a problem with a given version of the software, then we could reproduce the exact build that had the problem, so at least we had a chance to fix the bug.

Later, at a different company, there was another automated build system. His name was Fred. His job was to build the product, in all of its various combinations. Every day, he would come in, check out all of the sources and enter the numerous commands to build the product. And then do it again, with a slightly different combination for a slightly different version of the build. If any of the builds failed, it was his job to tell the developers that something went wrong. This was his job. Now Fred was an intelligent person, and frankly, he deserved better - and indeed he could have spent his time much better elsewhere. In fact, he ended up writing his own build system to automate the builds. And he could have saved the time taken to write and maintain that as well if he’d used a standard build tool.

So, hopefully you’ll agree that we need build automation. You can perfectly well automate your build with a shell script (or batch script). But why have a specific tool for this purpose? Well, it comes back to a number of factors:

  • Reproducibility - automating your build gives you time to do real work
  • Less to do - a good build tool should give you a framework to work in. For instance, you should have sensible defaults. Maven is a very good example of this: the default is to use src/main/java for Java source code and src/test/java for Java test code. You shouldn't need to specify every last option to every last javac command.
  • Experience - This is a very underrated factor. We can stand on the shoulders of giants. Or at least on the shoulders of people who've done similar things over and over again. A good build tool is a distillation of developer wisdom. You can use the collected experience in a build tool such as Apache Ant, Apache Maven, or indeed sbt to enforce good practices in your project. For instance, Maven and sbt automatically run your tests before you package. If any of the tests fail, then the build doesn't succeed. This helps you increase quality in the project, which is generally seen as a good thing.
  • Portability - A good build tool should protect you from differences between systems. If you develop on a Windows machine, but the nightly build happens on a Linux box, you shouldn't really have to care about that. And (again), it should produce the same result whatever the machine. If you’re using a shell script for the build, then it definitely won't work on both Windows and Linux.

Git and the Distributed Repository System

By Rick Umali, Learn Git in a Month of Lunches

Save 40% on Learn Git in a Month of Lunches with discount code lgitjn14 at manning.com.

Git, the open-source version control system built for speed and efficiency, implements a distributed repository system. Being distributed means that there is no one central repository for your source code. Instead, any repository can be the “official copy” of the source. Let’s break down what it means to be distributed.

We can start by looking at the more traditional model of source control: centralized repositories. Many of the common version control systems have a centralized server that houses the repository. Commits send your changed files to this server. If you wanted to work on the file, you would check it out of the repository, like you would check a book out of a library.

Centralized repositories put the code in a castle. Developers have to be given access to read and write to the repository. For some version control systems, specialized access is required for features such as branching, tagging, or backups. Moreover, the centralized server must be “up and running” in order for developers to do their work.

Figure 1 The “single point of contention” that developers have to deal with: the version control system that houses the repository.

Git inverts this. Git does not require a central server to be installed anywhere. With Git, every developer is given his or her own repository. Everyone is given a castle. This means each developer can get to any part of a source code’s history, compare versions, make branches, and perform any other operation that would normally require network access with a centralized version control system. This is very liberating, and is an idea that takes some time getting used to.

You’ll hear developers brag that with Git you can commit changes to a repository even while you’re flying in an airplane. I remember a few years ago I was doing some work on a flight from Boston to Minneapolis. I realized that I couldn’t connect to my version control system, so I had to wait until I landed before committing the work I had done on the flight. These days, you can find wifi on airplanes, but at what cost? You can't entirely depend on good performance from airplane wifi.

With Git, there is no need to worry about cost or network performance; you can do everything to the repository because the repository is entirely local to you.

Figure 2 Distributed version control systems are very liberating because every developer has a copy of the entire repository.

Being distributed allows source code to be shared very widely. Large open-source projects like Drupal and Linux have thousands of developers in many locations, some with sparse Internet connectivity. With Git, all of these developers can make their contributions with the same ease as the project leader.

This may sound like a free-for-all, and to some extent, a centralized model alleviates this because of its structure. However, since Git doesn’t require a central repository, many projects have self-organized in ideal ways. Some projects have small development populations, so a single project leader can manage all the commits that might be made to a repository, but many projects have multiple people who help with commits. Git supports both models equally well. All large projects need organization and conventions, and projects that use Git are no exception, but since Git is decentralized, each developer has full control over his or her local copy.

Repository backups are free when you use Git. Since everyone working on a project will have the repository, you don’t have to worry about losing your work. You can get a backup by copying someone else’s repository. People can still store the repository in a common location, but no one person’s repository is more important than anyone else’s.

Let’s suppose we had a Git repository named math.git. In order to clone this repository, you would use the git clone command, making the picture that you have here.

Figure 3 Making an initial clone in your directory from the math.git repository

If you had a colleague named Bob, he might make a clone of this repository in his directory, again using git clone math.git. He makes this math clone in his directory (bob/math).

Figure 4 Bob creating a clone of math.git in his directory.

Notice how math.git is acting as a “centralized” copy. But because Git supports cloning from any source, Bob could just as easily clone from your repository directory, if he could access it.

In fact, because that is the case, you could delete the math.git repository entirely, provided that both Bob’s repository and yours are up to date. Each repository is a perfect clone of the others, and this key characteristic is what makes Git distributed.
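The cloning scenario above can be tried out locally with a short shell session. This is only a sketch: the temporary directory, the file math.sh, and the committer identity are invented for the example, and the directory names you/math and bob/math simply mirror the figures.

```shell
set -e
work=$(mktemp -d)

# A bare repository standing in for the shared math.git.
git init --quiet --bare "$work/math.git"

# Your clone: commit one (hypothetical) file and push it back to math.git.
git clone --quiet "$work/math.git" "$work/you/math" 2>/dev/null
echo "echo 1 + 1" > "$work/you/math/math.sh"
git -C "$work/you/math" add math.sh
git -C "$work/you/math" -c user.name="You" -c user.email="you@example.com" \
    commit --quiet -m "Add math.sh"
git -C "$work/you/math" push --quiet origin HEAD

# Bob can clone from math.git ...
git clone --quiet "$work/math.git" "$work/bob/math"
# ... or straight from your clone, because any repository can act as the source.
git clone --quiet "$work/you/math" "$work/bob/math-from-you"

# All three working clones point at the same commit.
you_head=$(git -C "$work/you/math" rev-parse HEAD)
bob_head=$(git -C "$work/bob/math" rev-parse HEAD)
bob2_head=$(git -C "$work/bob/math-from-you" rev-parse HEAD)
echo "$you_head" "$bob_head" "$bob2_head"
```

The three commit ids printed at the end are identical, which is the sense in which math.git is "central" only by convention.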


All brand names, logos and trademarks in this site are property of their respective owners.

Copyright 2001-2006 Gayan Balasooriya.   
All Rights Reserved.