Engineering Copyright & Content Solutions @copyrightdev - Tumblr Blog

We’ve Moved!

We’re now blogging at http://engineering.copyright.com. Please join us there.

Artifactory Java Client API Examples (Part 2)

Picking up where I left off with my previous Artifactory Java Client API Examples post, I now offer a Java example which utilizes the Artifactory Java Client API and the Artifactory Query Language (AQL). Introduced in Artifactory 3.5.0, the AQL provides a flexible and high performance search.

Please note, as written on Stackoverflow.com by fundeldman (to whom I owe my gratitude for giving me the direction I was looking for)…

“The Artifactory Java Client does not support AQL queries natively. You can however use the generic REST call interface it provides to create an ArtifactoryRequest pointing to the AQL API endpoint.”

First, I construct the AQL query (to find all artifacts that match a given name (pattern) and repo and that were created after a given date)…

    String artifactoryRepo = "my-release-local";
    String name = "ivy-*.xml";
    SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    // createDateMargin (a java.util.Date) is assumed to be defined elsewhere
    String createDateMarginStr = simpleDateFormat.format(createDateMargin);

    StringBuilder builder = new StringBuilder(256);
    builder.append("items.find(");
    builder.append("{\"repo\" : {\"$eq\" : \"" + artifactoryRepo + "\"}}");
    builder.append(",{\"name\" : {\"$match\" : \"" + name + "\"}}");
    builder.append(",{\"created\" : {\"$gt\" : \"" + createDateMarginStr + "\"}}");
    /*
     * Please note, users without admin privileges have the
     * following restriction:
     *
     * The following three fields must be included in the include
     * directive: name, repo, and path.
     *
     * Note, however, that once this restriction is met, you may
     * include any other accessible field in the include directive.
     */
    builder.append(").include(\"repo\", \"path\", \"name\")");
    String aqlQuery = builder.toString();
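The same query string can also be assembled by a small helper, which makes the three criteria easier to read and to test in isolation. This is only a sketch: the class and method names (AqlQueryBuilder, buildQuery, criterion) are hypothetical, it uses java.time instead of SimpleDateFormat, and it assumes the values contain no characters that would need JSON escaping.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of the Artifactory client) that builds the
// same AQL text as the StringBuilder code above.
final class AqlQueryBuilder {

    // Same timestamp pattern as the SimpleDateFormat in the post.
    private static final DateTimeFormatter AQL_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");

    static String buildQuery(String repo, String namePattern, LocalDateTime createdAfter) {
        String criteria = String.join(",",
                criterion("repo", "$eq", repo),
                criterion("name", "$match", namePattern),
                criterion("created", "$gt", createdAfter.format(AQL_TIMESTAMP)));
        // repo, path and name are always included, as non-admin users require.
        return "items.find(" + criteria + ").include(\"repo\", \"path\", \"name\")";
    }

    // Renders one criterion, e.g. {"repo" : {"$eq" : "my-release-local"}}.
    // Assumption: value needs no JSON escaping.
    private static String criterion(String field, String operator, String value) {
        return "{\"" + field + "\" : {\"" + operator + "\" : \"" + value + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("my-release-local", "ivy-*.xml",
                LocalDateTime.of(2016, 1, 1, 0, 0, 0)));
    }
}
```

If repository names or patterns could ever contain quotes or backslashes, a JSON library would be a safer way to render the criteria than string concatenation.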

Next, I construct the ArtifactoryRequest…

    ArtifactoryRequest aqlRequest = new ArtifactoryRequestImpl()
        .method(ArtifactoryRequest.Method.POST)
        .apiUrl("api/search/aql")
        .requestType(ArtifactoryRequest.ContentType.TEXT)
        .responseType(ArtifactoryRequest.ContentType.JSON)
        .requestBody(aqlQuery);

Notice the fact that the apiUrl does not begin with a forward slash. This was not immediately obvious as this was not the case in my previous experiences. See Part I.

Next, I create the ArtifactoryClient (with artifactoryUrl, username, and password) and execute the REST request...

    Artifactory artifactory = ArtifactoryClient.create(artifactoryUrl, username, password);
    Map aqlResponse = artifactory.restCall(aqlRequest);

Note, the username and password are required for AQL queries.

And lastly, I parse the results…

    List<String> paths = new ArrayList<>();
    @SuppressWarnings("unchecked")
    List<Map<String, String>> resultsList =
        (List<Map<String, String>>) aqlResponse.get("results");
    resultsList.forEach(resultMap ->
        paths.add(resultMap.get("path") + "/" + resultMap.get("name")));
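The parsing step can be tried without an Artifactory server by feeding it a hand-built map of the same shape. The sketch below assumes, as the snippet above does, that the response has a "results" list whose entries carry "path" and "name" strings. The class name AqlResultPaths and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that turns an AQL response map into "path/name" strings.
final class AqlResultPaths {

    @SuppressWarnings("unchecked")
    static List<String> toPaths(Map<String, Object> aqlResponse) {
        List<String> paths = new ArrayList<>();
        Object results = aqlResponse.get("results");
        // Tolerate a missing "results" key by returning an empty list.
        if (results instanceof List) {
            for (Map<String, String> item : (List<Map<String, String>>) results) {
                paths.add(item.get("path") + "/" + item.get("name"));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Stand-in for the map returned by the REST call; values are invented.
        Map<String, String> item = new HashMap<>();
        item.put("repo", "my-release-local");
        item.put("path", "com/example/1.0");
        item.put("name", "ivy-1.0.xml");

        Map<String, Object> response = new HashMap<>();
        response.put("results", Arrays.asList(item));

        System.out.println(toPaths(response));
    }
}
```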

Gradle Dependencies

To build via Gradle, please note the following dependencies.

    compile("org.jfrog.artifactory.client:artifactory-java-client-api:1.2.2")
    compile("org.jfrog.artifactory.client:artifactory-java-client-services:1.2.2")

References

The Artifactory REST API https://www.jfrog.com/confluence/display/RTF/Artifactory+REST+API

The Artifactory Query Language https://www.jfrog.com/confluence/display/RTF/Artifactory+Query+Language

Looking for Artifactory Query Language example in Java http://stackoverflow.com/questions/35525097/looking-for-artifactory-query-language-example-in-java

Tom Muldoon

Software Architect

#artifactory #java #aql

Overriding Default Spring MBeanExporter Behavior

There are a variety of approaches to integrating JMX with Spring. I have found the easiest way is to introduce an @EnableMBeanExport annotation in a Spring configuration class. It will take care of registering all managed resources in your Spring context with the local JMX Agent. This convention is typical for consumers who do not need any applied customization or overrides to JMX registration, configuration, or behavior. However, interesting but also problematic circumstances arise when managed bean classes extend a common implementation base class or when you want explicit control over which managed beans are registered within your application. The purpose of this post is to present a recent use case scenario and the applied implementation.

We recently introduced a shared library which provides a clean and consolidated implementation approach for both managing in-memory caches and externalizing cache service operations within JMX. The following class diagram captures the base hierarchy and structure:

A consuming application incorporated the shared library within their project and refactored a legacy Country Code Service to extend the AbstractCacheService and renamed it CountryCacheService. An instance of the GenericCacheStatusMBean, named “countryCacheMBean” was added to their Spring context. The cacheService property on the GenericCacheStatusMBean references the new CountryCacheService:

However, upon running a JMX console in a deployed environment, the application group noted two unexpected MBean-related issues:

The JMX-registered “countryCacheMBean” has an incorrect fully qualified path and type name: com.common.caching.impl:name=countryCacheMBean,type=GenericCacheStatusMBean. If other caches were introduced within the application, they would all share the same type and qualified package name. This naming convention could introduce confusion to Application Admins and IT Operations Team members who monitor the application.

Other Managed Beans introduced by internal and external third party library dependencies were also displayed in the JMX console.

Digging into the @EnableMBeanExport annotation class itself and traversing back a bit through the Spring framework code, the underlying reason for each of the two issues can be easily explained. @EnableMBeanExport imports an MBeanExportConfiguration, which in turn registers an AnnotationMBeanExporter. The AnnotationMBeanExporter is a framework-provided subclass of the MBeanExporter and a convenience class which establishes implementation strategies for both autodetection and registration of Managed Beans. The auto-detection strategy is set to a MetadataMBeanInfoAssembler class, which scans the entire Spring Context for any beans with source-level JMX annotations. The MetadataNamingStrategy class is set as the AnnotationMBeanExporter’s naming policy registrar. It will either apply the objectName property if it is set on the bean’s ManagedResource annotation OR use a string which appends the class name to the full package path as the name to register with the JMX Agent.

With an understanding of what was going on underneath the Spring JMX Integration covers, we needed to find a way to override the default autodetection and naming policy for consumers of this new shared library. Here are the steps taken along with the actual code applied:

Step #1: Create a subclass of the MetadataMBeanInfoAssembler and override the includeBean method so that only GenericCacheStatusMBeans are JMX auto-detected.

    public class CacheMetadataMBeanInfoAssembler extends MetadataMBeanInfoAssembler
    …
        @Override
        public boolean includeBean(Class<?> beanClass, String beanName) {
            return GenericCacheStatusMBean.class.isAssignableFrom(beanClass)
                && super.includeBean(beanClass, beanName);
        }
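The first half of that condition is a plain Class.isAssignableFrom check, which can be seen in isolation without Spring. In the sketch below, every type is a hypothetical stand-in: CacheStatus plays the role of GenericCacheStatusMBean, and the other two classes are example beans.

```java
// Stand-alone illustration of the isAssignableFrom filter; no Spring involved.
final class IncludeFilterSketch {

    // Stand-in for GenericCacheStatusMBean.
    interface CacheStatus { }

    // A bean that should be exported: it is a CacheStatus.
    static final class CountryCacheStatus implements CacheStatus { }

    // A bean from some other library that should be skipped.
    static final class UnrelatedBean { }

    // True only when beanClass is CacheStatus itself or a subtype of it.
    static boolean include(Class<?> beanClass) {
        return CacheStatus.class.isAssignableFrom(beanClass);
    }

    public static void main(String[] args) {
        System.out.println(include(CountryCacheStatus.class));
        System.out.println(include(UnrelatedBean.class));
    }
}
```

In the real assembler the result is additionally combined with super.includeBean, so a bean must pass both this type check and the inherited annotation-based check.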

Step #2: Remove the @EnableMBeanExport from our “cache” @Configuration class and create our own MBeanExporter sub-class, CacheMBeanExporter. The CacheMBeanExporter was later added as a @Bean definition in a @Configuration class.

public class CacheMBeanExporter extends MBeanExporter { ... }

Step #3: Call setAutodetectMode(MBeanExporter.AUTODETECT_ASSEMBLER) within the CacheMBeanExporter constructor, and set the CacheMetadataMBeanInfoAssembler as the assembler on this subclass. During MBeanExporter.registerBeans() processing, if the autodetectMode is set to MBeanExporter.AUTODETECT_ASSEMBLER, a call to MBeanExporter.autodetectMBeans() performs an assembler callback to the includeBean method. This callback will be to our CacheMetadataMBeanInfoAssembler.

    /**
     * Default ctor.
     */
    public CacheMBeanExporter() {
        setAutodetectMode(MBeanExporter.AUTODETECT_ASSEMBLER);
        setAssembler(new CacheMetadataMBeanInfoAssembler(new AnnotationJmxAttributeSource()));
    }

Step #4: GenericCacheStatusMBean should also implement the SelfNaming interface. The MBeanExporter.getObjectName is invoked for the purposes of registering the bean name with JMX. By implementing the SelfNaming interface, the GenericCacheStatusMBean can derive the appropriate JMX name to register the bean. We just reused/returned the pre-existing cacheBeanName property!

    public class GenericCacheStatusMBean implements ICacheStatusMBean, SelfNaming {
        ...
        @Override
        public ObjectName getObjectName() throws MalformedObjectNameException {
            return new ObjectName(cacheBeanName);
        }
    }
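For getObjectName() to succeed, the cacheBeanName value has to be a valid JMX object name of the form domain:key=value,key=value. The sketch below shows one way such a name could be composed and inspected using only the JDK's javax.management.ObjectName. The helper class, the domain, and the key values are all hypothetical and are not taken from the shared library.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper that builds one distinct JMX name per cache.
final class CacheObjectNames {

    static ObjectName forCache(String domain, String cacheType, String beanName) {
        try {
            // Canonical form: domain:key=value,key=value
            return new ObjectName(domain + ":type=" + cacheType + ",name=" + beanName);
        } catch (MalformedObjectNameException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("Invalid JMX object name", e);
        }
    }

    public static void main(String[] args) {
        // Example values only; a real application would choose its own.
        ObjectName name = forCache("com.example.country", "CountryCache", "countryCacheMBean");
        System.out.println(name.getDomain());
        System.out.println(name.getKeyProperty("type"));
        System.out.println(name.getKeyProperty("name"));
    }
}
```

Giving each cache its own domain or type in this way avoids the situation described earlier, where every cache would appear under the shared library's package and the GenericCacheStatusMBean type.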

Brett Edminster

Solutions Architect

#java #spring #jmx #mbean

Getting Drools 5.x to Operate Smoothly with Java 8

As mentioned in my earlier post, “Getting JasperReports 5.x to Operate Smoothly with Java 8”, upgrading from Java 7 to 8 can involve a lot of moving parts. A number of versions of common libraries do not work well with Java 8 out of the box, but most of them can be made to work with a little effort.

Here is another library – Drools 5.x – with upgrade issues. The standard suggestion is to upgrade to Drools 6, but there are enough changes involved in that effort that you may not wish to do that upgrade at the same time. This post will allow you to effectively decouple your Java upgrade from your Drools upgrade. Before going into more details of the workaround or proposed solution, let’s see what the out-of-the-box run-time behavior of Drools 5.x is with Java 8 when compiling rule templates / decision tables where data providers are spreadsheets:

    Caused by: Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: wrong class format
        at org.drools.template.parser.DefaultTemplateRuleBase.readRule(DefaultTemplateRuleBase.java:148)
        at org.drools.template.parser.DefaultTemplateRuleBase.<init>(DefaultTemplateRuleBase.java:62)
        at org.drools.template.parser.TemplateDataListener.<init>(TemplateDataListener.java:74)
        at org.drools.decisiontable.ExternalSpreadsheetCompiler.compile(ExternalSpreadsheetCompiler.java:95)
        at org.drools.decisiontable.ExternalSpreadsheetCompiler.compile(ExternalSpreadsheetCompiler.java:81)
        at org.drools
        at com.copyright.rup.apc.manuscript.service.impl.rule.RulesCompiler.compile(RulesCompiler.java:67)
        ... 1 more
    Caused by: org.eclipse.jdt.internal.compiler.classfmt.ClassFormatException
        at org.eclipse.jdt.internal.compiler.classfmt.ClassFileReader.<init>(ClassFileReader.java:372)

If you observe the exception stack trace, the exception message “wrong class format” and the root cause, “org.eclipse.jdt.internal.compiler.classfmt.ClassFormatException”, indicate that the exception is thrown by a class file reader when it fails to decode the information contained in a .class file. To see which component is responsible for this compilation step, compare the package of the exception class with the dependency graph for the Drools libraries:

    +--- org.drools:drools-core:5.5.0.Final
    |    +--- org.mvel:mvel2:2.1.3.Final
    |    +--- org.drools:knowledge-api:5.5.0.Final
    |    |    \--- org.slf4j:slf4j-api:1.6.4 -> 1.7.12
    |    +--- org.drools:knowledge-internal-api:5.5.0.Final
    |    |    +--- org.drools:knowledge-api:5.5.0.Final (*)
    |    |    \--- org.slf4j:slf4j-api:1.6.4 -> 1.7.12
    |    \--- org.slf4j:slf4j-api:1.6.4 -> 1.7.12
    +--- org.drools:drools-compiler:5.5.0.Final
    |    +--- org.drools:drools-core:5.5.0.Final (*)
    |    +--- org.eclipse.jdt.core.compiler:ecj:3.5.1
    |    +--- org.mvel:mvel2:2.1.3.Final
    |    \--- org.slf4j:slf4j-api:1.6.4 -> 1.7.12

There are two libraries with issues in this tree: JDT Core Batch Compiler version 3.5.1 and MVEL version 2.1.3.Final. The issue with JDT Core is more straightforward - Java 8 requires at least version 4.4.

Here is the Gradle script snippet with the fix we tried, which works to some extent.

    compile 'org.eclipse.jdt.core.compiler:ecj:4.4.+'
    compile ("org.drools:drools-core:5.+")
    compile ("org.drools:drools-compiler:5.+")
    compile ("org.drools:drools-decisiontables:5.+")

Technically, we are asking gradle to resolve “org.eclipse.jdt.core.compiler:ecj” to the most current version in the 4.4.x series. You can read the Gradle documentation for more details on dependency resolution.

So, this fixes the first issue with JDT Core, but it just surfaces another.

    Exception in thread "main" java.lang.VerifyError: (class: ASMAccessorImpl_4458843621386333353870, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2650)
        at java.lang.Class.getConstructor0(Class.java:2956)
        at java.lang.Class.newInstance(Class.java:403)
        at org.mvel2.optimizers.impl.asm.ASMAccessorOptimizer._initializeAccessor(ASMAccessorOptimizer.java:725)
        at org.mvel2.optimizers.impl.asm.ASMAccessorOptimizer.compileAccessor(ASMAccessorOptimizer.java:859)
        at org.mvel2.optimizers.impl.asm.ASMAccessorOptimizer.optimizeAccessor(ASMAccessorOptimizer.java:243)

At this point, we considered replacing JDT completely with Janino, but we looked at the bottom of the stack trace, and decided to see what we could fix at the MVEL level. And digging through MVEL code identified the root cause in the lines highlighted below.

Because of this code, when running with Java 8, OPCODES_VERSION is set to Opcodes.V1_2, which causes the above exception. This is handled correctly in very recent versions of MVEL, but we cannot move to the latest MVEL version (2.2.8 or later), as it may not work well with Drools 5.x. So, we updated the code as below. We can do this because the MVEL source code is released under the Apache license, which allows us to modify the code:

If you observe, we check whether the Java version is 1.6, 1.7, or 1.8 and, if so, treat OPCODES_VERSION as 1.6, which is compatible and works in this context; it is not necessary to pull in all of MVEL's Java 8 specific changes. There are already a few requests to MVEL to release this fix as part of the 2.1.x series, but no action has been taken yet. So, we built a patched version of MVEL 2.1.3.Final with this fix, published it as “MVEL-2.1.3.Final-Patch” in our enterprise repository, and started consuming it along with the above-mentioned fix across CCC projects.
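The version check described above can be sketched as a small stand-alone method. This is not the MVEL source: the class name is hypothetical, and the constants are plain class-file major version numbers (46 for Java 1.2, 50 for Java 6) standing in for the corresponding ASM Opcodes values. Only the two outcomes mentioned in the text are modeled.

```java
// Stand-alone sketch of the patched version check; not the actual MVEL code.
final class OpcodesVersionSketch {

    // Class-file major versions, used here as stand-ins for ASM's Opcodes constants.
    static final int V1_2 = 46;
    static final int V1_6 = 50;

    // Maps the "java.version" system property to the bytecode version to emit.
    static int select(String javaVersion) {
        if (javaVersion.startsWith("1.6")
                || javaVersion.startsWith("1.7")
                || javaVersion.startsWith("1.8")) {
            // The patch: Java 8 is treated like Java 6 and 7.
            return V1_6;
        }
        // Fallback that Java 8 hit before the patch.
        return V1_2;
    }

    public static void main(String[] args) {
        System.out.println(select(System.getProperty("java.version")));
    }
}
```

Without the 1.8 branch, a Java 8 runtime drops to the fallback, which matches the behavior the text identifies as the cause of the VerifyError.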

Here is the final Gradle script snippet with above-mentioned approach.

    compile 'org.eclipse.jdt.core.compiler:ecj:4.4.+'
    compile ("org.drools:drools-core:5.+") { exclude group: 'org.mvel' }
    compile ("org.drools:drools-compiler:5.+") { exclude group: 'org.mvel' }
    compile ("org.drools:drools-decisiontables:5.+") { exclude group: 'org.mvel' }
    compile ("org.mvel:mvel2:2.1.3.Final-Patch")

With this approach, we decoupled the Java 8 upgrade from the Drools 6 upgrade. I hope this solution helps people who are stuck with these issues or who have run into challenges upgrading to Drools 6 along with Java 8.

References:

https://github.com/mikebrock/mvel/issues/27

https://github.com/mvel/mvel/pull/84

https://bugzilla.redhat.com/show_bug.cgi?id=1078146

https://bugzilla.redhat.com/show_bug.cgi?id=1199965

Mohan Kornipati

Software Architect

#java 8 #drools 5 #mvel #jdt #ecj #gradle

Getting JasperReports 5.x to Operate Smoothly with Java 8

If you search for this topic on the Internet, you may find a few related solutions, but I would like to share details on how we solved this issue. Before going into more details of the workaround or proposed solution, let’s see what the out-of-the-box run-time behavior of JasperReports 5.x is with Java 8 during report compilation.

    Caused by: net.sf.jasperreports.engine.JRException: Errors were encountered when compiling report expressions class file:
    1. The type java.lang.CharSequence cannot be resolved. It is indirectly referenced from required .class files
    value = String.format(str("label.event_owner"), ((java.lang.String)parameter_userName.getValue())) + ", " + //$JR_EXPR_ID=9$
    <---------------------------------------------------------------------------------------->
    1 errors
        at net.sf.jasperreports.engine.design.JRAbstractCompiler.compileReport(JRAbstractCompiler.java:204)
        at net.sf.jasperreports.engine.JasperCompileManager.compile(JasperCompileManager.java:240)
        at net.sf.jasperreports.engine.JasperCompileManager.compile(JasperCompileManager.java:226)
        at net.sf.jasperreports.engine.JasperCompileManager.compileReport(JasperCompileManager.java:481)

If we observe the exception stack trace, it clearly states that it fails to compile the Jasper report template. Now, let’s talk about how this compilation process works. The JasperReports API offers a façade class, net.sf.jasperreports.engine.JasperCompileManager, which has a set of static methods for compiling Jasper report templates. The documentation for this class says:

This class exposes all the library's report compilation functionality. It has various methods that allow the users to compile JRXML report templates found in files on disk or that come from input streams. It also lets people compile in-memory report templates by directly passing a JasperDesign object and receiving the corresponding JasperReport object. Other utility methods include report template verification and JRXML report template generation for in-memory constructed JasperDesign class instances. These instances are especially useful in GUI tools that simplify report design work. The facade class relies on the report template language to determine an appropriate report compiler. The report compilation facade first reads a configuration property called net.sf.jasperreports.compiler.<language> to determine whether a compiler implementation has been configured for the specific report language. If such a property is found, its value is used as compiler implementation class name and the facade instantiates a compiler object and delegates the report compilation to it. By default, JasperReports includes configuration properties that map the Groovy, JavaScript and BeanShell report compilers to the groovy, javascript and bsh report languages, respectively. If the report uses Java as language and no specific compiler has been set for this language, the report compilation facade employs a built-in fall back mechanism that picks the best Java-based report compiler available in the environment in which the report compilation process takes place.

If you look at “default.jasperreports.properties” or any other configuration properties of JasperReports 5.x, you won't find any specific configuration for a Java compiler, so it uses whichever Java compilers are available in the environment. Now, if we look at the dependency graph for JasperReports:

You can find “eclipse:jdtcore:3.1.0”, the Eclipse Incremental Java Compiler. After validating our observations against the issue reported as #3498-0 and its proposed workarounds, we tried upgrading the version of jdtcore specified by Jasper by overriding the dependency in our Gradle build, and it worked.

Here is a Gradle script snippet with the fix we tried that works like a charm!

    compile 'org.eclipse.jdt:core:3.1.+'
    compile('net.sf.jasperreports:jasperreports:5.+') {
        exclude group: 'eclipse', name: 'jdtcore'
    }

Technically, this means “exclude the version of jdtcore provided by Jasper, and replace it with the most current version of the 3.1.x series”

Or you can even use “org.eclipse.jdt.core.compiler:ecj:4.4” and follow the same approach for Maven too. Probably the difference between ecj and jdtcore is a topic for another post. Once we are done with complete migration to Java 8, we are planning to upgrade the version of JasperReports to 6.x. I look forward to subsequent posts to share a few more Java 8 migration experiences.

References:

http://community.jaspersoft.com/jasperreports-library/issues/3498-0

http://community.jaspersoft.com/questions/844403/how-run-jasperreports-java-8

Mohan Kornipati

Software Architect

#java 8 #jasperreports #jdt core #JasperCompileManager

Tips & Tricks: Most Common Values in a Table Column (PostgreSQL)

Recently I was looking for an easy way to get a sense about the values stored in each column of a PostgreSQL table, for some early exploratory analysis. I wanted a way to get this information quickly without having to scan the table, because this particular table had millions of rows. From having worked with other relational databases, I knew that statistics about tables and rows are typically stored in the data dictionary. All modern relational database systems have cost-based query optimizers and basic statistical information is the lifeblood of a cost-based optimizer.

I discovered the pg_stats view for PostgreSQL.

With a little experimentation, I came up with the following query, to pull just the columns that I was interested in, and for just the table I wanted to explore:

    select tablename as "Table",
           attname as "Column",
           most_common_vals as "Most Common Values"
      from pg_stats
     where schemaname = 'public'
       and tablename = 'call_center'
     order by tablename, attname

Which returns the following results (results truncated):

There are some things to keep in mind before using pg_stats to view this kind of information.

The statistics are populated by the ANALYZE command. If this has never been run on a particular table, there won’t be any information for you. Also, if the command hasn’t been run in some time, the statistics will be out-of-date.

Don’t count on this information being completely accurate. For example, with a large table, ANALYZE uses a random sample of rows in the table to generate the statistics.

There are some things to keep in mind when using ANALYZE manually, which you may be tempted to do to get “up-to-date” statistics.

The autovacuum daemon will run periodically to keep such statistical information reasonably current (assuming that your DBA hasn’t disabled the daemon), so you may have no need to run ANALYZE manually.

If the autovacuum daemon has been disabled, a good rule of thumb is to run ANALYZE on tables once a day. However, if a given table has had a lot of CRUD activity on it, you may choose to run ANALYZE more often.

You can run ANALYZE manually and it requires only a read-lock on the table in question. There could be some negative impact on overall system performance, though, so if you can, run ANALYZE during non-peak times.

There’s an excellent discussion of ANALYZE in the always-excellent PostgreSQL manual.

While I was using the pg_stats view for exploring some characteristics of a new database, you can no doubt envision other situations for looking into the information contained in this view. For example, if you wanted to derive a realistic subset (see my previous post “Subsets Through the Window”), having information about distinct values is extremely valuable in deriving subset criteria. Information from the pg_stats view, like average column width can help with capacity planning exercises. No doubt you can think of even more use cases.

Glenn Street

Data Architect

#postgresql

Subsets Through the Window

Many of you are no doubt familiar with SQL window functions. These functions provide capabilities similar to those provided by the GROUP BY clause in a SQL statement, but instead return result sets that preserve each row, rather than collapsing them into a single summary row. Window functions have been part of the SQL Standard since SQL:2003. This page of the PostgreSQL manual describes the rich set of these functions that PostgreSQL supports. Many relational database systems support window functions. I recently came up with what was to me, at least, a novel approach to generating a subset of rows from a database table by using the row_number() function.

The idea of using a subset of data is a powerful one in the realm of Test Data Management. One of the techniques described very well in the blog post "5 Best Practices for Test Data Management" (as well as in other sources) is to extract a subset of production data to build a testing database. There are many ways that such extracts could be created, but the row_number() window function provides a quick and easy one for accomplishing this task.

I'll refer to the well-known "Adventure Works" database, which I've ported to PostgreSQL. The version of "Adventure Works" I used is an older iteration of the database and was originally migrated to MySQL. I brought it to our database platform of choice, PostgreSQL.

For this example, I'll be working with the DIM_CUSTOMER table. Here is the structure of that table.

The requirement we have is to grab a subset of this data (18,484 total rows), in a way that provides data from each geographic group ("GeographyKey") in the original table. We want our subset to include data from each GeographyKey to more closely resemble production. We'd like to have no more than five rows from each group, though.

Here is the query I came up with to achieve this objective:

    select *
      from (
            select *,
                   row_number() over (partition by "GeographyKey")
              from dim_customer
           ) as records
     where row_number <= 5

As you can see from the output snippet below, this query returns exactly what we are looking for. For each GeographyKey, there are no more than five rows returned, which is enabled by using the row_number() function and then limiting on it in the outer query.
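For readers who think more naturally in application code, the effect of partitioning by a key, numbering the rows in each partition, and keeping row numbers up to five can be mimicked in memory. The Java sketch below is only an illustration of the semantics, not of how PostgreSQL executes the query. The Customer class and its two fields are hypothetical stand-ins for rows of DIM_CUSTOMER.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of "at most n rows per partition".
final class FirstNPerGroup {

    // Hypothetical stand-in for a DIM_CUSTOMER row.
    static final class Customer {
        final int customerKey;
        final int geographyKey;

        Customer(int customerKey, int geographyKey) {
            this.customerKey = customerKey;
            this.geographyKey = geographyKey;
        }
    }

    static List<Customer> firstNPerGroup(List<Customer> rows, int n) {
        // Running row number per geography key, like row_number() per partition.
        Map<Integer, Integer> rowNumbers = new HashMap<>();
        List<Customer> subset = new ArrayList<>();
        for (Customer row : rows) {
            int rowNumber = rowNumbers.merge(row.geographyKey, 1, Integer::sum);
            // Equivalent of the outer "where row_number <= n" filter.
            if (rowNumber <= n) {
                subset.add(row);
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        List<Customer> rows = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            rows.add(new Customer(i, 1)); // seven customers in geography 1
        }
        for (int i = 7; i < 10; i++) {
            rows.add(new Customer(i, 2)); // three customers in geography 2
        }
        System.out.println(firstNPerGroup(rows, 5).size());
    }
}
```

As in the SQL version, which rows end up being the "first five" depends entirely on the order in which rows arrive, since no ordering within a group is specified.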

We can get insight into how PostgreSQL executes this query by looking at the query plan:

In order to return the partitioning by GeographyKey we requested, PostgreSQL performed a full scan of the DIM_CUSTOMER table, then sorted the data by GeographyKey; it then aggregated that sorted data using the window function row_number(), and then finally limited the result in each “bucket” to five rows.

Here’s the query plan (using explain analyze to actually execute the query and get true running times):

                                QUERY PLAN
    ------------------------------------------------------------------
     Subquery Scan on records  (cost=4312.80..4867.32 rows=6161 width=254) (actual time=60.658..77.614 rows=1496 loops=1)
       Filter: (records.row_number <= 5)
       Rows Removed by Filter: 16988
       ->  WindowAgg  (cost=4312.80..4636.27 rows=18484 width=246) (actual time=60.649..74.576 rows=18484 loops=1)
             ->  Sort  (cost=4312.80..4359.01 rows=18484 width=246) (actual time=60.628..64.211 rows=18484 loops=1)
                   Sort Key: dim_customer."GeographyKey"
                   Sort Method: external merge  Disk: 4808kB
                   ->  Seq Scan on dim_customer  (cost=0.00..853.84 rows=18484 width=246) (actual time=0.008..11.447 rows=18484 loops=1)
     Planning time: 0.108 ms
     Execution time: 103.503 ms

So, in my test database, this query executes in about 103 ms.

It’s often extremely useful to add an index on the “partition by” column, in this case GeographyKey. After creating the index on “GeographyKey”, the query plan now looks like this:

Now the query plan is a bit more efficient, because the full-table scan of table DIM_CUSTOMER is gone. Instead, PostgreSQL performed an index scan, followed by aggregating the data for the row_number() function, and finally limited the result set in each GeographyKey partition to only five rows. Note that not only is the full-table scan gone, the sort operation that was required before we added the index IX_GEOGRAPHY_KEY is no longer needed.

The corresponding explain analyze output shows a marked performance improvement:

                                QUERY PLAN
    ---------------------------------------------------------------------
     Subquery Scan on records  (cost=0.29..3673.80 rows=6161 width=254) (actual time=0.047..36.024 rows=1496 loops=1)
       Filter: (records.row_number <= 5)
       Rows Removed by Filter: 16988
       ->  WindowAgg  (cost=0.29..3442.75 rows=18484 width=246) (actual time=0.043..32.793 rows=18484 loops=1)
             ->  Index Scan using ix_geography_key on dim_customer  (cost=0.29..3165.49 rows=18484 width=246) (actual time=0.037..14.337 rows=18484 loops=1)
     Planning time: 0.309 ms
     Execution time: 36.150 ms

Instead of an execution of 103 ms, the query now executes in a total of about 36 ms.

Perhaps you’d like to have a random sample of rows within each geographic region, still limited to five rows. You can achieve this by using the order by random() construct in the inner query, like so:

    select *
      from (
            select *,
                   row_number() over (partition by "GeographyKey"
                                      order by "GeographyKey", random())
              from dim_customer
           ) as records
     where row_number <= 5

Here are snippets of the output from two different runs of this query, showing that the result set differs each time.

First run:

Second run:

If you are using PostgreSQL 9.5+, you have the ability to work with a sample of the table before applying your window function. The TABLESAMPLE qualifier for an SQL FROM clause is part of the SQL Standard, as of SQL:2003. PostgreSQL has implemented a version of it, starting with 9.5.0. An excellent discussion is available on the PostgreSQL Wiki page TABLESAMPLE Implementation.

To apply our windowing criterion over a random 5% sample of the data, the query looks like this:

    select *
      from (
            select *,
                   row_number() over (partition by "GeographyKey")
              from dim_customer tablesample system(5)
           ) as records
     where row_number <= 5

This query will generate the sample of data (using a “sample scan”) before doing any other work, meaning that it will be very fast.

                                QUERY PLAN
    ---------------------------------------------------------------------
     Subquery Scan on records  (cost=186.76..214.48 rows=308 width=254) (actual time=1.366..2.179 rows=715 loops=1)
       Filter: (records.row_number <= 5)
       Rows Removed by Filter: 199
       ->  WindowAgg  (cost=186.76..202.93 rows=924 width=246) (actual time=1.360..1.959 rows=914 loops=1)
             ->  Sort  (cost=186.76..189.07 rows=924 width=246) (actual time=1.342..1.390 rows=914 loops=1)
                   Sort Key: dim_customer."GeographyKey"
                   Sort Method: quicksort  Memory: 464kB
                   ->  Sample Scan on dim_customer  (cost=0.00..141.24 rows=924 width=246) (actual time=0.009..0.575 rows=914 loops=1)
                         Sampling: system ('5'::real)
     Planning time: 0.658 ms
     Execution time: 2.302 ms

Because we worked only with a subset of the data in the original table, this version is by far the fastest of the queries that we’ve looked at in this article. Note that, in this case, the index we created earlier doesn’t help us, because the number of rows in the subset is so small (an estimated 924, of which 914 were actually sampled).

Window functions are a very powerful way to manipulate data when you want to perform some kind of summary calculations, but wish to retain details of each row. We've shown here that they can also help build a realistic testing environment. One thing to keep in mind is that, because our window definitions either have no ORDER BY or order by random(), the numbering that row_number() assigns within each partition is non-deterministic, meaning that you are not guaranteed always to receive the same result set from the query. For our purposes, this was not a problem.

Glenn Street

Data Architect

#postgresql #subset #test data

A Train to SAFety: An ongoing series; Part 2: Types of trains

In the first post of this series, we discussed the concept of a train, a collection of teams working together on related goals that collaborate around understanding goal dependencies and planning activities. A secondary characteristic of trains is that all the teams on a train are working in a similar fashion and their progress can be measured using similar metrics. These trains are a key part of the program management view of the SAFe framework. However, while trains are a useful metaphor for a program, not all trains are the same. This post gives examples of types of trains, how they operate, and the factors that drove us to develop these kinds of trains.

Scrum Trains are familiar to most people doing agile software development. Cross-functional teams work in short time boxes, estimating effort, writing code and tests simultaneously and demonstrating functionality to product owners. The major metrics are team velocity, stories punted from one sprint to the next, defects opened and closed, and code quality. This is the kind of train most agilists are most familiar with, and all other kinds of trains are mostly understood in how they differ from Scrum Trains.

Implementation Trains are used for teams that do setup, configuration and maintenance of customers on software. These teams support everything from setting up a new customer with user accounts, integration configurations, etc. to minor changes to customer configurations. So, the amount of work per request is highly variable. Customers also make requests with variable levels of urgency and amounts of lead time, which means that first-in/first-out processing of work items won’t fit.

Therefore, a time-boxed approach, where work is broken down into pieces small enough to be estimated accurately and new functionality is regularly demonstrated, doesn’t fit well. However, almost all of the work follows a predictable set of steps. So we’re moving these sorts of teams to a Kanban approach. Performance is measured by throughput, accuracy of estimation, defects opened and closed, and frequency of hitting work-in-progress constraints.

This arrangement removes sprint-related time-box constraints that don’t necessarily align to customer implementation schedules. It also gives feedback about relative team sizes, and warnings about impending work crunches. The presence of early review states in a Kanban workflow may even give better advance warning of quality issues at the program level than Scrum.

Maintenance Trains are used for teams that work on legacy code bases that are in a low-effort, maintenance portion of their lifecycle. There are usually no major additions to functionality, and not enough work to support even a single person dedicated to the application. What work is done is a slow accumulation of code and configuration changes made to fix defects or adjust to changes in shared services or infrastructure. Typically, no other teams depend on work done by maintenance teams, and their level of dependence on other teams is minimal. So, the principal questions in measuring team performance are: Were all the issues raised in a given release cycle closed? Was it done effectively? What was the quality of the work?

These are the slow freight trains of the ecosystem. When a product owner feels that sufficient work has accumulated to warrant a release to production, that team identifies its release date, based on outstanding work and our release calendar. Then the work is completed, regression testing is performed, and the release goes to production.

There are definitely weaknesses in these trains. The fact that requirements can sit on a list for a significant period of time, with knowledge of the drivers getting stale, is a source of concern. Also, the fact that testing and acceptance are deferred until late in a release cycle is decidedly anti-agile. However, the total amount of effort on these trains is usually small relative to the total amount of work and is well-understood by the teams who have been working on it for years. So, trading the risks for the flexibility of working on a system in an intermittent, on-demand fashion can be a net positive.

How to best measure these teams? More by looking at trends than by comparison to an absolute standard. Looking at time between releases, level of effort between releases, testing effort, and defects opened and closed gives visibility into the health of the code base and can tell when it might be time to initiate a major technical debt paydown or re-write.

Platform Trains hold teams that serve multiple functions. They build and maintain the SAFe Architectural Runway. Depending on an organization’s work, this may involve development of shared libraries, frameworks and services, proofs of concept, continuous integration and delivery, orchestration and provisioning. The common thread linking them is that they’re all at least one step removed from immediate business value. The work can operate in both pull (e.g., a product team requests a new feature to be added to a shared library) and push (e.g., a new Gradle plugin is released for adoption) modes. This makes delivery a function of a complicated interplay across teams. The teams also tend to be cross-functional, and staffed with people who have multiple responsibilities. The amount of effort involved in a given piece of work is also highly variable. This makes maintaining a predictable velocity challenging.

The best fit we’ve found for these trains so far is Kanban. The variability in the size and nature of work items isn’t a great fit, but having work queues with common, high-level steps, having people pull work items from queues, and measuring how long they stay in the queue does seem to give us some ability to improve resource allocation.

Dependencies between families of trains tend to be of manageable complexity. Requirements flow down from Implementation trains to Scrum trains and Maintenance trains, which feed requirements to Platform trains. Work products tend to flow in the opposite direction: from Platform trains up to Scrum and Maintenance trains, and from there to Implementation trains. Communication of these requirements and products can be done through the program increment/release planning process.

Matt Kleiderman

Director of Architecture

#SaFE #scaled agile #agile #agile development

Artifactory Java Client API Examples

For its Artifactory product, JFrog offers a Java client API as a convenience to using its REST API but as far as I can tell, all of the REST API examples online are written in Groovy...

def T get(String path, Map query, ContentType responseContentType = ANY, def responseClass = null, Map headers = null) { rest(GET, path, query, responseContentType, responseClass, ANY, null, headers) }

So, here are a couple of quick examples written in Java (using version 0.17 of the Java client API) which leverage the Artifact and GAVC searches, respectively.

Example 1: Artifact Search (Quick Search)

First, I instantiate an instance of the ArtifactoryImpl class. The cast exists because, for some reason, the get, post, put, and delete methods are not declared on the Artifactory interface (which the class implements). Odd? I think so.

ArtifactoryImpl artifactoryImpl = (ArtifactoryImpl) ArtifactoryClient.create("http://artifacts.company.com");

I then set the query path…

String path = "/api/search/artifact";

And the query parameters...

Map queryMap = new HashMap();
queryMap.put("name", pArtifactName);
queryMap.put("repos", pArtifactoryRepo);

And the responseContentType…

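// ContentType.JSON is an enum constant, not a String, so the variable is declared as ContentType.
groovyx.net.http.ContentType responseContentType = groovyx.net.http.ContentType.JSON;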

And the headers…

String usernamePassword = pUsername + ":" + pPassword;
String authorizationHeader = "Basic " + Base64.getEncoder().encodeToString(usernamePassword.getBytes("iso-8859-1"));
Map headersMap = new HashMap();
headersMap.put("Authorization", authorizationHeader);

And finally, I invoke the ArtifactoryImpl.get method…

Object results = artifactoryImpl.get(path, queryMap, responseContentType, null, headersMap);

Example 2: GAVC Search

The Artifactory API also allows you to search by GroupId, ArtifactId, Version, and Classifier (GAVC) as well as using Artifact Search.

First, I instantiate an instance of the ArtifactoryImpl class. As in Example 1, the cast exists because the get, post, put, and delete methods are not declared on the Artifactory interface (which the class implements).

ArtifactoryImpl artifactoryImpl = (ArtifactoryImpl) ArtifactoryClient.create("http://artifacts.company.com");

I then set the query path…

String path = "/api/search/gavc";

And the query parameters...

Map queryMap = new HashMap();
queryMap.put("g", pArtifactGroup);
queryMap.put("a", pArtifactName);
queryMap.put("v", pArtifactVersion);
queryMap.put("c", pClassifier);
queryMap.put("repos", pArtifactoryRepo);

And the responseContentType…

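// ContentType.JSON is an enum constant, not a String, so the variable is declared as ContentType.
groovyx.net.http.ContentType responseContentType = groovyx.net.http.ContentType.JSON;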

And the headers…

String usernamePassword = pUsername + ":" + pPassword;
String authorizationHeader = "Basic " + Base64.getEncoder().encodeToString(usernamePassword.getBytes("iso-8859-1"));
Map headersMap = new HashMap();
headersMap.put("Authorization", authorizationHeader);

And finally, I invoke the ArtifactoryImpl.get method…

Object results = artifactoryImpl.get(path, queryMap, responseContentType, null, headersMap);

References

For information regarding the Artifactory REST API and, in particular, the Artifact Search and GAVC Search:

Artifact Search

GAVC Search

Update 2016-05-02 JFrog has now updated the artifactory-client-java documentation with examples in Java. See the README for more information.

Tom Muldoon

Software Architect

#artifactory #java #rest api

MongoDB to PostgreSQL JSONB via Talend

In a post last year, I described using Talend Big Data to move documents from a MongoDB database to a PostgreSQL database. The aim of that exercise was to pick apart the JSON document as stored in MongoDB and store certain keys into individual PostgreSQL columns. But what if you want to store the entire JSON document from MongoDB in PostgreSQL for possible future processing? Since release 9.2, first available in September 2012, PostgreSQL has offered a JSON data type. In the subsequent 9.4 release (December 2014) PostgreSQL added a more advanced JSONB type, "a more capable and efficient data type for storing JSON data" than the original JSON data type.

Both JSON data types have strong advantages over storing a MongoDB document as a simple text field. First, PostgreSQL will ensure that the JSON you're trying to store is valid. In addition, PostgreSQL offers a number of operators to work with JSON documents, including ones that allow you to traverse the document to find individual keys and their values. As opposed to the JSON data type, the JSONB data type has the advantage of being stored in a binary format that eliminates reparsing the document on retrieval. This can mean faster reads compared to the original JSON data type.

The PostgreSQL manual recommends:

"In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys."

So, let's say that you are convinced that you want to store data that you formerly kept in MongoDB in a spiffy new JSONB column in your PostgreSQL database. Talend can do that, can't it?

The answer is a resounding yes, but it's a bit trickier than you might expect because of the way the JSON document can be interpreted by PostgreSQL on input.

The basic idea of the Talend job to move data from MongoDB to PostgreSQL is very simple. First, set up a tMongoDBInput component to read the document from your MongoDB database. Next, add a tPostgreSQLOutput component to your job. Finally, link the two components with a row (Main) connector. Here's an example of what this would look like in Talend Open Studio for Big Data.

Unfortunately, this job will fail, because on insert, the JSON document isn't recognized as a specialized text type. Therefore by default, PostgreSQL's JDBC driver treats it as a VARCHAR. In fact, the PostgreSQL JDBC driver doesn't support JSONB directly (the JDBC standard doesn't support a JSON data type yet). What you'd like to do is simply pass the JSON as a Java String and have PostgreSQL recognize it on input, automatically casting the String to a JSONB data type.

There is a way around this problem, though, by using the JDBC connection parameter stringtype. This is, in fact, how you can work with the PostgreSQL JSONB data type from your own Java programs. By setting stringtype=unspecified on your JDBC URL, PostgreSQL will silently cast a String to JSONB ("parameters will be sent to the server as untyped values, and the server will attempt to infer an appropriate type"). So, all we have to do is add the parameter to the URL in the Talend tPostgreSQLOutput component, correct? Unfortunately, the Talend tPostgreSQLOutput component doesn't allow you to modify the JDBC URL that it generates. But, because Talend offers so many different components, including some more general ones, we can use a tJDBCOutput component in place of the specific tPostgreSQLOutput component.
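For readers who want to try the stringtype approach from their own Java code, here is a minimal sketch. It assumes the PostgreSQL JDBC driver is on the classpath; the table name documents, the jsonb column doc, and the class and method names are hypothetical and exist only for this illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Sketch: storing a Java String in a JSONB column by way of stringtype=unspecified. */
final class JsonbInsertSketch {

    /** Appends stringtype=unspecified to a JDBC URL, respecting an existing query string. */
    static String withUnspecifiedStringType(String jdbcUrl) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "stringtype=unspecified";
    }

    /**
     * Inserts one JSON document. The table "documents" and its jsonb column "doc"
     * are hypothetical names used only for this example.
     */
    static void insertDocument(String jdbcUrl, String user, String password, String json)
            throws SQLException {
        String sql = "INSERT INTO documents (doc) VALUES (?)";
        try (Connection conn = DriverManager.getConnection(
                     withUnspecifiedStringType(jdbcUrl), user, password);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            // With stringtype=unspecified the String is sent untyped,
            // so the server can infer jsonb from the target column.
            stmt.setString(1, json);
            stmt.executeUpdate();
        }
    }
}
```

A call such as JsonbInsertSketch.insertDocument("jdbc:postgresql://localhost/postgres", user, password, "{\"title\": \"example\"}") would then rely on the server-side cast rather than on any JSON support in the driver.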

The job would now look like this:

The tJDBCOutput component allows you to connect to any database that supports JDBC and for which you have the appropriate driver jar file. Of greater interest for this article is that the tJDBCOutput component allows you to specify your own JDBC URL, including any specific connection parameters that the driver supports. In our case, we want a URL like this: "jdbc:postgresql://localhost/postgres?stringtype=unspecified". Here's an example of the full configuration of the tJDBCOutput component:

The results of running this revised job are exactly what we were looking for. Here's a snippet of the resulting PostgreSQL table with our JSON data neatly stored in a JSONB column:

Glenn Street

Data Architect

#postgresql #talend #json #json data #mongodb


Full Text Searching in PostgreSQL--Can it Measure Up?

At CCC, we have many different uses in our applications for full text searching and have had excellent results with both Solr and, more recently, Elasticsearch. However, the operational overhead of deploying sharded Solr or Elasticsearch clusters for internal-only applications sent us looking for a simpler solution.

We have used full text search functionality that is built into a commercial relational database in some of our legacy products, but that was in the days before the Cloud, before our thorough embrace of open source software, and before other technology changes. As we have switched all new development to using PostgreSQL for relational database storage, the question we set out to answer was “Is full text search in PostgreSQL good enough in certain cases?”

PostgreSQL offers a robust solution for full text search. It has support for generating lexemes, stemming, GIN indexes (more about those later), as well as built-in operators for working with and manipulating full text data. GIN stands for “Generalized Inverted Index”. I refer folks interested in these details to the always-excellent PostgreSQL manual pages on full text search.

Some Background

A full text document in PostgreSQL is a string column or combination of columns that has been converted to the PostgreSQL data type tsvector. A tsvector looks like this:

Note some of the transformations that have happened to the original string:

The stop words “It”, “was”, “the”, and “of” have been removed.

Words have been normalized into lexemes; thus “times” is represented as “time”.

The numbers next to the lexemes are the positions, counted in words rather than characters, at which the corresponding word occurs in the original string.

Since “times” occurs twice in the string, its position is noted by a comma-separated list of locations in the string (6,12).

The building block of text search in PostgreSQL is a tsquery that matches a tsvector, using the match operator, @@.

PostgreSQL Configuration

We did our research using AWS EC2, so that we could experiment easily with different host characteristics. Memory is an important factor in the speed of full text searching, as the more data that can fit into memory, the faster the response time for searches.

We chose an r3.4xlarge instance, which has 122GB of RAM and 16 virtual CPUs. We ran PostgreSQL 9.4.5, which was the most recent production release at the time of writing this post.

We made some modifications to the basic postgresql.conf file, as follows:

Following the advice of the PostgreSQL manual, we sized our shared_buffers to no more than 40% of the total system memory.

PostgreSQL offers a couple of different ways to support full text search with database objects. We tried both, to see if there were significant differences between the two. These two approaches are:

Simply create a GIN index on any of the string columns, or combination of columns, against which you want to perform full text search. This approach is the simplest and requires the least amount of additional storage, as well as not requiring additional database objects.

Create a new column that contains the tsvector representation of the “document” you want to search. Then, add a GIN index to this column. Note that in order to keep the tsvector column in sync with the original column when updates and additions occur, you will also require a trigger. This is a drawback compared with the first approach, where the GIN index alone automatically stays in sync with the data in its underlying column.

Optimization

It would be ideal if we could keep the GIN index in memory for fastest access. By default, there is no mechanism to instruct PostgreSQL to “pin” objects into the buffer cache. However, there is an optional PostgreSQL module that allows you to do this, pg_prewarm.

Once you install this module, adding an object to the buffer cache is as simple as the following:

Unlike similar functionality in some databases, the object pushed into the buffer cache is not guaranteed to remain there over time. PostgreSQL’s normal management of the cache may cause the object to “age out”. However, if you are planning to execute many similar queries, it’s much less likely that PostgreSQL will eject the index or table from the cache with the passage of time.

How can we verify that the GIN index is in memory? Use another optional module, pg_buffercache, to examine the current contents of the buffer cache. Then, to verify that your index is cached, issue this query:

This query returns the list of relations (e.g., tables and indexes) in the buffer cache, along with the number of buffers stored in the cache. Here is a snippet of the output, which shows that the GIN index we want is stored in memory for faster access.

Since the size on disk of the index gin_work_metadata_main_title is about 25GB, we can see that the index entirely fits into the buffer cache (3338241 buffers * 8K blocks ≅ 25GB).

Testing

To test the performance of our PostgreSQL full text search configuration, we created a JMeter test plan.

We used 5 threads to simulate simultaneous users submitting search queries, but added a random pause of between 4.5 and 5.5 seconds between requests (using a Gaussian Random Timer), to introduce a more realistic delay interval (real users aren’t likely to be searching constantly within a tiny interval). In order to add more realism, we parameterized the SQL queries to read search terms from a CSV file, rather than repeating the same search each time. The searches returned a maximum of 10,000 records each time.

Approach 1: Using only a GIN Index

Our first set of queries executed a search against a VARCHAR column, which had a GIN index added to it to support full text searches.

For each row that satisfies the query criteria, PostgreSQL must convert the VARCHAR column into its tsvector representation.

This table shows the results of our tests:

Approach 2: Using a TSVECTOR Column with a GIN Index

To enable our second test, we created a tsvector column and set it equal to the tsvector representation of the original column (main_title in this case). We then created a GIN index on that new tsvector column and queried the tsvector column.

The results, for the same test configuration are shown in the next table.

It’s not surprising that the second test shows results that are slightly faster, on average, than the first test. Having the tsvector column saves PostgreSQL from converting a string to tsvector for each row that satisfies the query criteria.

We plotted the % of peak density (using a probability density function) for both approaches. As the chart below shows, adding the tsvector column with a GIN index not only makes the response faster, but also makes it more consistent - the red line is underneath the blue line on both ends of the distribution.

The question thus becomes whether the tradeoff of additional storage and database objects, including a trigger to keep the primary column(s) in sync with the tsvector column(s), is worth the performance improvement, as compared to simply querying a string column that has a GIN index created on it. The answer, of course, will depend on your particular use case and the expectations of application users.

Conclusion

Obviously there are limits to PostgreSQL’s full text searching which make it unlikely that you would want to use it for large-scale externally-facing search applications. However, for an internal-only application, PostgreSQL’s full text search is more than up to the job.

Glenn Street

Data Architect

#postgresql #search #full-text search #jmeter

Small Steps to Test Reliability

More than once, I have evaluated a large body of automated test code. The usual customer complaint was lack of reliability: the tests sometimes ran, sometimes reported (non-existent) defects and sometimes crashed. So much for their sizable investment in automated testing… Most commonly, after examining a lot of code, I decided that it was lack of attention to the details of test structure that caused so many problems; the tiniest steps in the tests were the least reliable.

An automated test is just like any other coding project; one must understand the fundamentals, which in this case are the fundamentals of building a test, before using code to solve a problem.

Let’s start with the simplest component of testing which can declare a test status of PASS or FAIL. I call it the Probe and Verify pair.

The Probe and Verify Pair

The Probe operation instigates some predictable behavior on the part of the Software Under Test (SUT). The Verify operation examines available state and artifacts which can be used to verify the correct behavior of the SUT. If the Verify operation does not recognize the state and/or artifacts that it expects, a FAIL or ERROR is declared. Otherwise, a PASS is declared.

It’s simple, right? Well, yes and no. A ‘real’ test has many Probe and Verify steps strung together in an order pre-determined by the author of the test, with intervening actions to move the SUT through the operations targeted by the test. The hard part of writing a test is all in those actions that move you from one Probe and Verify pair to the next. In fact, the bulk of most tests is made up of those intervening actions. Since they are such a huge part of the test, making them reliable, robust, repeatable and predictable contributes in equal measure to making the test as a whole reliable, robust, repeatable and predictable.

Knowing, not Assuming

When a test moves from one Probe and Verify to the next, there are intervening activities that prepare the SUT for that next Probe operation. We can call those intervening activities Probe Setup, or just setup. Setup for a test is not going to surprise you, but here we’re looking at the finest-grained setup: the individual actions which set up an individual probe, and how they contribute to reliability, repeatability, predictability and robustness.

Here’s an example. You want to test the newly fixed Change Password Page for accepting valid and invalid passwords (I know, there are better ways, but bear with me). You must enter the user’s current account information, including the current password, click a check box “I Accept Terms and Conditions”, then fill in the new password (twice).

It sounds simple. This example is imagined, slightly contrived and very much condensed. Let’s take a look at the test steps:

1. Navigate to the Change Password Page
2. Click the “Terms and Conditions” check box
3. Fill in user Name, old Password
4. Fill in the New Password box with a non-conforming password (too short, too long, etc.)
5. Verify that the “Password not Changed” dialog box appears.
6. Dismiss “Password not Changed” dialog
7. Fill in the New Password box with a conforming password
8. Verify “Password Changed” dialog

The “Password not Changed” dialog appears for the non-conforming password. That seems appropriate, until you get to the part where you supply a conforming new password: the “Password Not Changed” dialog appears again. It must be a defect. A defect report is filed indicating that the Change Password Page does not accept valid passwords.

Full disclosure: the problem was that the “Terms and Conditions” check box was read-only when the page came up and its starting state was unchecked. The Click() method could not change it to checked. The “Password Not Changed” dialog box was presented because the “Terms and Conditions” were not accepted. The new contents of the New Password text box were irrelevant.

The result is that we have posted a defect report for the product, the SUT, instead of realizing that the test was doing something wrong.

The correctness of the test was never challenged because it was used in the past. What went wrong?

This is what happened: The latest changes to the Change Password Page included a fix for a defect where a user could change their password without accepting the terms and conditions. That was fixed by requiring the “Terms and Conditions” check box to be checked before a password can be changed. However, there was a secondary ‘improvement’ by the developer where the check box was disabled until the user name and current password were filled in. After that, the check box was enabled and could be checked. However, the order of operations in the test code checked the check box first then filled in the other fields. The test no longer matched the behavior of the SUT.

We could have avoided all that pain with the principle of “Know, don’t Assume”. The test writer assumed that a simple Click() operation would be successful and never checked the return code, which would have indicated the error. The test could have stopped at step (2). Instead, because the Click() was assumed to be successful, the rest of the test was executed and the error dialog showed when a valid password was presented.

If the test writer used the principle of “Know, don’t Assume”, s/he would have checked the result of the Click() and the test would have flagged the failed Click() of the check box instead of a defect in the password validation. Further, the discrepancy between the changed product code and the test code would have been immediately apparent as a test problem, not a product problem and would have been fixed at the first execution of the test with very little trouble.
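As a rough illustration of checking a setup step rather than assuming it, the sketch below uses a hypothetical CheckBox interface whose click() returns a success flag. None of these names come from a real UI automation library; they are stand-ins for whatever page-object layer a test suite uses.

```java
/** Sketch of "Know, don't Assume" for a single setup step (all names are hypothetical). */
final class KnowDontAssumeSketch {

    /** Minimal stand-in for a page-object check box; click() reports whether it took effect. */
    interface CheckBox {
        boolean click();
        boolean isChecked();
    }

    /** Clicks the box and confirms the outcome instead of assuming it. */
    static void checkOrStop(CheckBox box, String name) {
        boolean clicked = box.click();
        if (!clicked || !box.isChecked()) {
            // Stop at the setup step that failed, so a later Verify cannot blame the SUT.
            throw new IllegalStateException("Test setup failed: could not check '" + name + "'");
        }
    }

    /** A disabled check box like the one in the example: clicks have no effect. */
    static final class DisabledCheckBox implements CheckBox {
        public boolean click() { return false; }
        public boolean isChecked() { return false; }
    }

    /** A normal check box that becomes checked when clicked. */
    static final class EnabledCheckBox implements CheckBox {
        private boolean checked;
        public boolean click() { checked = true; return true; }
        public boolean isChecked() { return checked; }
    }
}
```

With the disabled check box, checkOrStop fails at step (2) with a message that points at the test setup, long before any password dialog is examined.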

Real Code: Know vs. Assume

Now let’s look at an example from real code written by an experienced software engineer rather recently. I’ll use pseudo code with simplified operations. This test writes to a field in an interactive web page and verifies that the SUT works by verifying that the data written to the field (eventually) got to a database.

1. Navigate to Data Entry Page
2. Access Data Entry Field
3. Enter traceable data into Data Entry Field
4. Submit Page
5. Connect to Database, return appropriate field from appropriate record
6. If (field shows entered data) : Data Entry Page Test PASS else FAIL

This all seems fairly trim and direct: if you enter data at one place, you should see the predicted result, in this case a particular value in a database record. If not, declare that Data Entry Page has a defect.

However, what if step (5) sees a transient infrastructure error, perhaps network, storage or configuration and the database access was unsuccessful? The test logic will provide an empty or defective result to step (6) and the test will declare Data Entry Page as defective. The worst thing a test can do is to declare an SUT failure when the problem was an internal error in the test. Transient or not, likely or not, a database access problem will trigger the defect cycle for Data Entry Page. If the whole process is automated, then a defect would be filed in the defect tool, a Developer would be assigned and work done to figure out that nothing is wrong with the SUT. The Developer would close the defect report with No Problem Found (NPF). After that, the testing would have to be restarted, this time making sure that there are no failures that are not SUT failures, usually by having a tester manually invoke and manually monitor the test. The tester would have to manually inspect the logs and results and manually post the PASS or FAIL as required. This really defeats the goals of automated testing and costs more in time and people than straight manual testing.

Certainly we would hope that test internal errors are few, but I used this example to show that the test developer assumed that the database access would always work and that database accesses need not be checked for proper operation. That assumption was invalid. If the database operation had been checked, the test writer would know whether the database access was successful and would declare a FAIL in the SUT only if there was truly a failure in the SUT. Even if the database access throws its own specialized exception, the test would lose control, which in automated environments is almost as bad as declaring a false SUT defect because it forces human intervention and stops or interferes with subsequent automated tests.

In testing, the difference between assuming and knowing can cause a lot of problems as well as reducing the value and effectiveness of automated testing and increasing the over-all time and cost of testing.

What Should Happen for Test Internal Errors

The most common reason that test writers don’t always check their assumptions is that they don’t know what to do if some internal failure happens. The problem looks like this: tests are expected to return SUT PASS or SUT FAIL, but neither of those is true when there is an internal test error. In fact, you don’t actually know anything about the SUT for that test. The result of that test activity is void, because the test lost control and can’t really declare either that the SUT PASSed or FAILed. If your test environment accepts only PASS or FAIL, as happens with many CI environments, then you need to recognize and deal with test FAIL codes that may not be SUT problems, but internal test failures.

For this particular situation, I like to use a third return code called TEST_INTERNAL_ERROR, and test writers need to be able to return it. Accommodating TEST_INTERNAL_ERROR can be done by a direct addition of that third result code, or it can be done by qualifying a FAIL with extra conditional information. A FAIL with a “test_internal_error” result modifier could be used to avoid triggering any unwanted defect cycle activities and instead trigger review of that test.
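One way this could look in Java is sketched below, applied to the database check from the earlier example. The enum, the method names and the "FAIL [test_internal_error]" status string are assumptions made for illustration, not an established convention of any test framework.

```java
import java.util.concurrent.Callable;

/** Sketch of a three-valued test outcome (all names are hypothetical). */
final class ThreeWayResultSketch {

    enum TestResult { PASS, FAIL, TEST_INTERNAL_ERROR }

    /**
     * Verify step for the data entry example: the lookup stands in for the
     * database access, which may itself fail for reasons unrelated to the SUT.
     */
    static TestResult verifyStoredValue(Callable<String> databaseLookup, String expected) {
        String actual;
        try {
            actual = databaseLookup.call();
        } catch (Exception e) {
            // The lookup itself failed, so nothing is known about the SUT.
            return TestResult.TEST_INTERNAL_ERROR;
        }
        return expected.equals(actual) ? TestResult.PASS : TestResult.FAIL;
    }

    /** For environments that accept only PASS or FAIL: qualify the FAIL with a modifier. */
    static String toCiStatus(TestResult result) {
        switch (result) {
            case PASS: return "PASS";
            case FAIL: return "FAIL";
            default:   return "FAIL [test_internal_error]";
        }
    }
}
```

A reporting layer could route any status carrying the modifier to a review of the test instead of opening a defect against the SUT.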

Handling Internal Test Errors in Your Code

It’s actually quite simple when you realize that once you have an internal error that isn’t handled or retried, your test has lost control of the state of the test activity and simply and clearly cannot proceed. That’s it: the test is broken and must stop (or let the next test run after reporting this internal failure).

Below is a snippet from real code which, for obvious reasons, has been retired. It shows a utility method which sets up a search operation with a variable search term. Note carefully that userClicks() can fail for any number of page-rendering problems where the target object is not available or not ready or otherwise just wrong. userClicks() returns a success/fail result code to the caller but, as you can see, the result is not checked. Even the selectValueFromDropDown(…) method relies on userClicks(), so it too can fail silently. In this code, every line could fail silently and the caller would never know. The test writer assumed that every call, every time, would work correctly. That assumption was not true in practice.

If userClicks() threw a TestInternalErrorException, then this code would be put under control and we would be able to do the right thing when the SUT works but the test environment doesn’t:

Here I’ve put all of the formerly silent failures under control with a minimum of extra code by having userClicks(…) throw an exception on error. The try/catch/finally lets the test writer cluster a group of statements in a meaningful way while keeping the whole operation under control. This construct lets the test writer identify the true cause of the failure and avoid incorrectly declaring a FAIL for the SUT.

The catch block then logs the error. The best place to log the problem is where it occurred, because that’s where the reason for the true failure, and the information about it, is available. In any case, the exception interrupts the code at the true point of failure, so the problem is recognized there and not where some subsequent verification reports a SUT failure. Then, the ‘finally’ block gives you a place to assert control over the situation, whether there is an exception or not, and to clean up any system changes that may have occurred prior to the failure.

It is highly recommended that you incorporate the ‘finally’ block into your thinking and into your code. In testing, knowing the state of things is critical, and the ‘finally’ block gives you what you need to assert control over the situation whether there is an exception or not.
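As a minimal sketch of the try/catch/finally pattern described above (this is not the retired production code): FakePage, its element ids, and resetState() are hypothetical stand-ins for a real page driver, and here the catch block logs and then rethrows so the caller cannot carry on as if setup had worked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Signals that the test itself, not the SUT, has broken. */
class TestInternalErrorException extends RuntimeException {
    TestInternalErrorException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for a page driver; only what the sketch needs. */
class FakePage {
    private final List<String> availableElements;
    final List<String> log = new ArrayList<>();
    boolean cleanedUp = false;

    FakePage(List<String> availableElements) {
        this.availableElements = availableElements;
    }

    /** Throws on failure instead of returning a result code callers may ignore. */
    void userClicks(String elementId) {
        if (!availableElements.contains(elementId)) {
            throw new TestInternalErrorException("Element not available: " + elementId);
        }
    }

    /** Puts the page back into a known state. */
    void resetState() {
        cleanedUp = true;
    }
}

class SearchSetup {
    /** Sets up a search; any failed click surfaces as an internal test error. */
    static void setUpSearch(FakePage page, String searchTerm) {
        try {
            page.userClicks("searchTab");
            page.userClicks("searchTypeDropDown");
            page.userClicks("searchButton");
        } catch (TestInternalErrorException e) {
            // Log at the point of failure, where the real cause is known.
            page.log.add("TEST_INTERNAL_ERROR during search setup for '"
                    + searchTerm + "': " + e.getMessage());
            throw e; // the test has lost control and must not continue
        } finally {
            // Runs whether or not an exception occurred.
            page.resetState();
        }
    }

    public static void main(String[] args) {
        FakePage broken = new FakePage(Arrays.asList("searchTab"));
        try {
            setUpSearch(broken, "copyright");
        } catch (TestInternalErrorException e) {
            System.out.println(broken.log.get(0));
        }
    }
}
```

The caller decides what a TestInternalErrorException means for the run; the point of the sketch is only that the failure can no longer pass silently and that cleanup always happens.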

The test writer is no longer assuming that all of this test probe setup code always works flawlessly. The test writer can know when this sequence happens flawlessly as hoped, and when it doesn’t. This mechanism puts the test code under control and allows the test writer the opportunity to end the test cleanly with a clear indication that there was a problem not in the SUT, but in the test or its environment. There will be no false defect reports for the SUT, but there may be a valid defect report against the test code or an investigation into the reliability of the test environment.

Telling the World: the Test Failed, Not the SUT

This is the hard part! Many test running environments accept only PASS or NOT PASS as a binary condition. Such an environment can make no distinction between a product (SUT) failure and a test failure. In this case, failures need a post-processing step that examines some artifact of the testing, usually a log or console output, and recognizes a TestInternalErrorException and/or its standard log line. Slight post-processing is pretty common in test environments, and this can really raise the reliability and predictability of hands-off, lights-out, no-human-involved automated testing.
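A post-processing step of this kind might look like the following sketch. It assumes the tests write a standard marker string to the log when they hit an internal error; the marker text, the Verdict enum, and the classify method are illustrative names, not part of any particular CI tool.

```java
import java.util.Arrays;
import java.util.List;

/** The three outcomes discussed above. */
enum Verdict { SUT_PASS, SUT_FAIL, TEST_INTERNAL_ERROR }

class ResultPostProcessor {
    // Assumed marker; use whatever standard log line your tests emit.
    static final String MARKER = "TEST_INTERNAL_ERROR";

    /**
     * Refines the runner's binary result using the test's log output.
     * A failed run whose log carries the marker is blamed on the test,
     * not on the SUT.
     */
    static Verdict classify(boolean runnerPassed, List<String> logLines) {
        if (runnerPassed) {
            return Verdict.SUT_PASS;
        }
        for (String line : logLines) {
            if (line.contains(MARKER)) {
                return Verdict.TEST_INTERNAL_ERROR;
            }
        }
        return Verdict.SUT_FAIL;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "Starting search test",
                "TEST_INTERNAL_ERROR: Element not available: searchButton");
        System.out.println(classify(false, log));
    }
}
```

A TEST_INTERNAL_ERROR verdict can then be routed to a review of the test or its environment instead of opening a defect against the product.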

After all, that is the goal: have your automated tests so reliable, robust, repeatable and predictable that you can run them anytime without human intervention and get the right result, even if the test environment fails you. This technique of “Knowing, not Assuming” for your tiniest test steps will go a long way to helping you get there.

Kevin Whitney

Automation Architect

#automation #testing #software testing

The Difference between Search and Text Mining

Our CTO and VP of Engineering & Product Development, Haralambos “Babis” Marmanis, recently published a post on the blog of The Association of Learned & Professional Society Publishers on Why Publishers Need to Know the Difference between Search and Text Mining.

#search engine #text mining

Search Relevancy

If you had asked me two years ago to describe the difference between a structured search and a full-text search, you would have received a novice answer. However, when I inherited the technical stewardship of an application whose primary function was to surface publication title and article search results, I had to become a quick study in this arena. My resource bible was (and still is) Manning's "Taming Text", which proved invaluable. During this same time frame we were also undertaking a search technology leap, so to speak, migrating from Solr to Elasticsearch. The project initiative was bold, with an aggressive time frame. Just as I was about to introduce tagging and exclusion filters within our existing Solr facet implementation, I was immersed in an Elasticsearch ramp-up on aggregations, multi-tenancy, schemaless types, and the rich query DSL. All in all, a great proverbial baptism by fire. The end product, RightFind Professional™, delivers on its promise of an integrated workflow solution and a one-stop shop for researchers to find articles within their local holdings and subscriptions or purchase additional content. However, we have to ask ourselves how relevant and meaningful the search results are that we currently deliver to our varied end-users across the gamut of industries and academic institutions that use RightFind Professional™.

Relevancy is the black box within that big search engine. Matching is relatively easy, and most everyone understands the underlying boolean logic. Just skimming the theoretical surface of relevancy isn’t too intimidating either. Relevance can be simply defined “as the numerical output of an algorithm that determines which documents are most textually similar to the query.” [1] Term Frequency/Inverse Document Frequency (TF/IDF) for single-term queries and the Vector Space Model for multi-term queries are conceptually pretty straightforward. But the practical application and the deeper dive can be quite daunting.

Naturally, the first question you must ask yourself is when to apply boosting. You have two choices at your disposal: index-time or query-time boosting. The consensus in the search relevancy world is that query-time boosting should be favored over index-time boosting in most situations. Index-time boosting is applied when new documents are inserted into the index. If searching across multiple indices, you could also apply boosts at the index level when creating new indices within your cluster. Index-time boosting usually involves applying custom logic or rules. Within our index builder code base, my preference was not to incorporate any such business rules or logic. Invariably they would be subject to change and could be rather haphazardly or arbitrarily defined. Plus, any change to this logic would entail a full re-index of 140+ million works. In addition, some boosts must be applied at query time because there simply isn’t enough information at index time to calculate the boost.

Borrowing from another Copyright Clearance Center product’s Solr implementation, we actually employed a hybrid index- and query-time approach to boost documents based on the existence of a category rank field. Articles sourced and loaded from select publishers or aggregators were stamped with a category ranking of “1” within our works metadata database tables, and that ranking was later propagated during index building. Within our Elasticsearch technology migration we effectively carried over the same hybrid solution. The query-time implementation involved applying a boost on the should clause that matches work documents with a category ranking of “1”. Obviously this is not a comprehensive boosting/relevancy strategy, but it addresses one core business requirement. Time now to evaluate some other query-time approaches....
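To make the query-time half of that hybrid approach concrete, here is a rough sketch of what such a request body could look like in the Elasticsearch query DSL, built as a plain string so it does not depend on any client library version. The field names title and categoryRank and the boost value are assumptions for illustration, not our actual mapping.

```java
class CategoryBoostQuery {
    /** Minimal escaping for quotes and backslashes only; a real client should use a JSON library. */
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds a bool query: the title match is required, while the should
     * clause only adds score (scaled by the boost) when categoryRank is 1.
     */
    static String build(String titleText, double categoryBoost) {
        return "{\"query\":{\"bool\":{"
                + "\"must\":[{\"match\":{\"title\":\"" + escapeJson(titleText) + "\"}}],"
                + "\"should\":[{\"term\":{\"categoryRank\":"
                + "{\"value\":1,\"boost\":" + categoryBoost + "}}}]"
                + "}}}";
    }

    public static void main(String[] args) {
        System.out.println(build("text mining", 2.0));
    }
}
```

Because the should clause is optional, documents without the category ranking still match; they simply score lower than otherwise comparable documents that have it.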

In our back-end search implementation we introduced our own Elasticsearch client library wrapper. It’s a jar dependency that insulates our search-index-consuming applications from having to know the intricacies of Elasticsearch and from having to programmatically craft all those complex query builders and aggregations. Within our Elasticsearch client library we expose the boost parameter on all our query type classes, but only one product team utilizes it (refer to the preceding paragraph and the category ranking match boost). And currently we do not have wrapper support for what is a foundational relevancy building block, the FunctionScoreQueryBuilder class. We have already proven pretty good at two important pieces of the FunctionScoreQuery: the query and the applied filter components. Now it’s time to define which function(s) we could use to calculate a score, incorporating the score_mode and boost_mode properties to combine the output of these functions effectively. The appropriate function(s) would ultimately be driven by our business use cases and by surveying the various needs within our end-user communities.

Maybe we employ a FunctionScoreQuery to boost articles that have been published within the past six months or year? In combination with a decay function, accounting for some pre-defined origin/time threshold and scale, we could add a lot of value here for end-users who tend to favor the latest and greatest. Another compelling use case has recently surfaced in discussions with our business: boosting articles that are already in a particular organization’s digital library (aka holdings) or bundled within a widely used subscription held by a given organization. In essence, boosting by popularity. It’s not your traditional article page hit or view count, but it is definitely in the same ballpark. However, there is a catch: if we simply boost the score using a popularity-type metric, we could completely swamp the effect of our full-text article “title” scores. To mitigate this problem, it’s strongly recommended that, in the popularity or likes boosting scenario, one use a logarithm to temper the effect. [2]
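The arithmetic behind both ideas is small enough to sketch in plain Java. The snippet below is an illustration rather than our wrapper code or the engine's internals: it tempers a popularity count with log10(1 + x) and computes a Gaussian decay that returns 1.0 at the origin and the chosen decay value at the chosen scale. The combining rule (text score multiplied by the two factors) and all parameter values are assumptions.

```java
class ScoreTempering {
    /** log10(1 + popularity): grows slowly, so large counts cannot swamp the text score. */
    static double temperedPopularity(double popularity) {
        return Math.log10(1.0 + popularity);
    }

    /**
     * Gaussian decay on document age: 1.0 for a brand-new document and
     * exactly `decay` (between 0 and 1) once the age reaches `scaleDays`.
     */
    static double gaussDecay(double ageDays, double scaleDays, double decay) {
        double sigmaSquared = -(scaleDays * scaleDays) / (2.0 * Math.log(decay));
        return Math.exp(-(ageDays * ageDays) / (2.0 * sigmaSquared));
    }

    /** One possible combination: multiply the text score by both factors. */
    static double combined(double textScore, double popularity, double ageDays) {
        return textScore
                * (1.0 + temperedPopularity(popularity))
                * gaussDecay(ageDays, 180.0, 0.5);
    }

    public static void main(String[] args) {
        // Raw popularity differs by a factor of 1000; the tempered values differ far less.
        System.out.println(temperedPopularity(10));
        System.out.println(temperedPopularity(10000));
        System.out.println(combined(3.2, 250, 30));
    }
}
```

With the raw counts, a document held by 10,000 organizations would outweigh any title match; after tempering, popularity nudges the ranking without overriding the text score.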

I have already alluded to Term Frequency/Inverse Document Frequency (TF/IDF), which serves as the default text relevance score in both Solr and Elasticsearch. But reliance on this default can prove problematic and produce result rankings that are way off the mark. I recently came across a new term, signal modeling, which is the data science/analysis behind turning “relevance scores into smarter, domain specific signals that quantifiably measure important criteria to you and your data.” [3] Signal modeling might encompass analysis that ultimately drives the definition and enablement of domain-specific synonyms and stopwords for the terms within your index, or tweaking the default TF/IDF scoring features, or amalgamating fields into larger fields leveraging copyFields. There are a slew of other signal modeling techniques at your disposal. But the prevailing message in my research has been that fields are not just for storage and retrieval; they are also containers for enabling scoring. [4]
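For readers who want to see the default in numbers, here is a small worked sketch of a TF/IDF-style score. It uses a common textbook-style variant (square root of the term frequency times the natural log of the inverse document frequency); the engines' actual formulas add further factors such as norms, so treat this as an approximation of the idea only.

```java
class TfIdf {
    /**
     * Score of one term in one document.
     * termFreqInDoc: how often the term occurs in this document.
     * docsWithTerm:  how many documents in the index contain the term.
     * totalDocs:     how many documents the index holds.
     */
    static double tfIdf(int termFreqInDoc, int docsWithTerm, int totalDocs) {
        if (termFreqInDoc == 0 || docsWithTerm == 0) {
            return 0.0;
        }
        double tf = Math.sqrt(termFreqInDoc);                      // dampened term frequency
        double idf = Math.log((double) totalDocs / docsWithTerm);  // rare terms weigh more
        return tf * idf;
    }

    public static void main(String[] args) {
        // A rare term outscores a common one at the same frequency.
        System.out.println(tfIdf(1, 10, 1000));
        System.out.println(tfIdf(1, 900, 1000));
    }
}
```

The sketch also shows why the default can mislead: a document that merely repeats a term scores higher, whether or not it is more useful to the person searching.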

On the subject of relevancy best practices and signal modeling, title fields are often portrayed as the example problem child or statistical outlier. TF/IDF is not always your friend here. Term frequency does not always matter in a title, and aboutness has no correlation to it. A good title search uses phrase matches. Phrase matches are predicated on term positions being enabled on the title field mapping. But the Catch-22 is that term positions are only available when term frequencies are enabled. Possible solutions I came across included writing a custom Lucene Similarity plug-in. Seriously? I don’t like that one. Or better yet, define two title fields: one with term frequencies and positions disabled, and the other with both features enabled. A given query could conceivably include a match on the disabled flavor and a match phrase on the field with both features enabled. The same query could apply boosts/weights to both clauses, in effect tuning the influence of term frequencies versus phrases.

There are suggested strategies for handling other aspects of TF/IDF scoring, including norms and the IDF ratio itself. This introduces an interesting entry into the lexicon of relevant search: “Pantheon”. A Pantheon is defined as a “list of topical areas and or subjects in a specific domain, professionally curated by domain experts.” [5] I think of it as a domain- or industry-specific term dictionary. Compiling and updating these domain-specific lists could itself be problematic. But that’s countered by the argument that there are legitimate sources from which we could pull in this information. A perfect example, and one particularly relevant to both medicine and Copyright, is PubMed’s MeSH, the official NLM controlled thesaurus for indexing articles on PubMed.

It is an inexact science, and there is neither a silver bullet nor a comprehensive go-to resource. This is why I am very excited to dig into what looks like a very promising book release in Manning's Taming Search (aka Relevant Search). I downloaded the MEAP first chapter and quickly found comfort in the fact that others in the search space have been either confronted or confounded by what it means to deliver relevant results and how not to simply turn a blind eye. A blind eye or, better yet, blind faith in the search engine’s default settings. As I peruse the first chapter of "Taming Search", I see a consistent theme emerging: a solution will invariably incorporate a mix of technology and human and/or business domain factors. Determining what constitutes relevancy is an ongoing, collaborative, and continuous feedback loop. There is also a dramatic shift from being a search engineer to being what the book defines as a “relevance engineer”. There are business rules to account for, and the workflow within the system itself establishes and detects patterns which in turn could be harnessed into relevancy factors and criteria. All of these things contribute to the overall domain-specific relevance model and relevance strategy employed.

In many respects, the modern search engine must have a human-like quality in being able to interpret what we are really asking for. This may sound like something out of a science fiction novel, but it is the here and now, and part of the very evolution of machine learning and the Information Retrieval sciences. All in all, the subject of relevancy forces us all to revisit the application’s overall product strategy and re-assess the user communities. There’s a lot of analysis involved here, as our team (business and engineering) must collectively identify the most important pieces of data to focus on for relevancy. And for those data elements identified, how do we inform the search engine about them? And finally, we need to balance the weights of each piece of data against all others within the context of the end-user’s query. Within this final step, decisions are largely driven by a combination of Machine Learning and Classification. There are aspects of these data elements, called features, which are incorporated within the very algorithms that provide the relevancy decisions and determinations.

For me, there is still much to learn, and there are abstractions that need to become more concrete and fully understood. However, unlike a few years ago, when I was more or less a search apprentice, I now have a more solid foundation and several iterative search engine implementations under my belt. I look forward to sharing, in subsequent posts, the insights and practical knowledge I have gained within the realm of search relevancy.

Brett Edminster

Architect for Search and Enterprise Integrations

Other References:

"Theory Behind Relevance Scoring"

"Advanced Scoring in elasticsearch"

"Optimizing Search Results in Elasticsearch with Scoring and Boosting"

[1] "Scoring Algorithms - Qbox." 2015. 8 Oct. 2015 (https://qbox.io/blog/tag/scoring-algorithms)

[2] "Advanced Scoring in Elasticsearch | Voxxed." 2014. 8 Oct. 2015 (https://www.voxxed.com/blog/2014/12/advanced-scoring-elasticsearch/)

[3] "OSC — Data Modeling For Search Relevance -- Signals ..." 2015. 8 Oct. 2015 (http://opensourceconnections.com/blog/2015/05/15/relevance-data-modeling/)

[4] "Solr & Elasticsearch -- Modeling Signals to Build Real ..." 2015. 8 Oct. 2015 (https://dzone.com/articles/solr-elasticsearch-semantic-search-101-crafting-si)

[5] "OSC — Title Search: when relevancy is only skin deep." 2015. 8 Oct. 2015 (http://opensourceconnections.com/2014/12/08/title-search-when-relevancy-is-only-skin-deep/)

A Train to SAFety: An Ongoing Series

At CCC we have been building custom software to support our business for over 20 years. As our mission has evolved, software and services delivered via software have proliferated and become a more significant part of our business. This has led to a number of different development projects started at different times, under different methodologies and different business climates. We adopted Agile as a company-wide practice for new projects, but now even some projects started as agile have reached a phase in their lifecycles where they don’t require a dedicated scrum team, but they are still supported and occasionally released – which means yet another style of project.

When we adopted agile, we also adopted a product platform approach, and built program management around the creation of the various parts of the product platform. This gave us some consistency in technology as well as in project management. However, when teams needed to co-ordinate, special-purpose scrums of scrums tended to proliferate, and took on lives of their own. Also, so much activity was taking place within the program that information that should have reached an audience outside an individual scrum team could get lost in the details of status reports. This still left some long-running legacy development efforts outside of the program, and integrating them into the program became even more difficult as we started getting towards the long tail of one-off projects, where the benefits of technology standardization were expensive to achieve and customer demand for changes in delivery models was weak.

When the product platform program started showing growing pains, we started looking around for alternate models that could handle the complexity of our development efforts, and we started looking into the Scaled Agile Framework (SAFe) and saw a lot to like in its concept of an Agile Release Train – which represents a collection of teams working together on related goals, that are aligned to a source of value to the enterprise. But what is the impact for team members working on stories on teams?

Train Essentials

Having your team put on a train shouldn’t be a traumatic event. The goal is to reduce the pain points associated with our polyglot approach to development by streamlining communication paths and improving consistency across projects – not to introduce another layer of complexity.

Because the teams on a train are working towards similar goals, it’s common for one team to require something from another team on the train. So, teams on the same train regularly scrum together as part of their process. They can negotiate contracts between themselves quickly and verify their work via continuous integration and delivery. Conversely, teams on different trains don’t have the same kind of mutual dependencies and contracts between trains tend to involve more than two parties and may require more planning and preparation. The key here is in identifying which teams should share a train, and which ones shouldn’t. We’ve done this based on surveying our teams and looking at the patterns of shared library usage in our build files. It’s also important to allow for, and provide a mechanism for teams to “jump trains” and switch at appropriate times based on the project’s life cycle.

One common reason for jumping trains is that a project has moved to a new phase of its life cycle: from proof-of-concept to working towards MVP, then to maintenance, and finally to retirement. Consistency in planning and processes within a train is important to facilitate communication and collaboration on the train, but different trains can operate differently. We’re experimenting with trains that do time-boxed iterations for projects that are doing regular releases to production. For projects that are doing proofs of concept for new products, or working on retiring legacy projects, we’re planning on shifting to a flexible, shared work-queue arrangement to better handle shared resources and unpredictable requirements.

This concept of allowing teams to jump trains raises a different challenge with which we’re still grappling: How can we measure progress, investment and quality consistently between trains? Future updates in this series will present our thinking and results as they become available, along with other developments as we further adapt and flesh out SAFe for our use.

Matt Kleiderman

Director of Architecture

#agile development #agile


Automated Testing in Startup Mode

One of my friends joined an early-stage startup and we had a brief talk about testing his sophisticated and distributed software product as it grew beyond unit testing. These are some points that we came up with for the startup in a hurry.

The framework is not your goal - testing is your goal

You will quickly get beyond unit tests for higher-order testing. Beware of spending too much time on the "framework". People get caught up in building or even extending a fancy framework long before they really know the nature of their testing and then they spend way too much time tending the framework instead of writing tests.

Moral: getting the testing done is first priority. Write your tests first. Then provide the minimum framework to get that testing done. When you have enough test experience, you'll know what you need from a framework.

Confidence in doing enough of, and the right, testing

In a distributed test environment for a complicated product, you have huge numbers of possible test scenarios. Get your testers trained on combinatorial testing really early. It provides good training on how to think about effective testing. The NIST document listed below provides a readily available tool to identify a reasonable and valuable subset of possible tests. Millions of potential test combinations can be reasonably reduced to appropriate working combinations. See http://csrc.nist.gov/groups/SNS/acts/documents/SP800-142-101006.pdf
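To give a feel for the reduction, here is a minimal sketch of pairwise (2-way) test selection using a simple greedy strategy. It is not the algorithm used by the NIST tooling, and it enumerates every full combination first, so it only suits small parameter spaces; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PairwiseSketch {
    /** Every full combination, as one value index per parameter. */
    static List<int[]> allCombinations(int[] valueCounts) {
        int total = 1;
        for (int count : valueCounts) {
            total *= count;
        }
        List<int[]> result = new ArrayList<>();
        for (int n = 0; n < total; n++) {
            int[] combo = new int[valueCounts.length];
            int rest = n;
            for (int p = 0; p < valueCounts.length; p++) {
                combo[p] = rest % valueCounts[p];
                rest /= valueCounts[p];
            }
            result.add(combo);
        }
        return result;
    }

    /** The parameter-value pairs that a single combination exercises. */
    static Set<String> pairsOf(int[] combo) {
        Set<String> pairs = new HashSet<>();
        for (int a = 0; a < combo.length; a++) {
            for (int b = a + 1; b < combo.length; b++) {
                pairs.add(a + "=" + combo[a] + "," + b + "=" + combo[b]);
            }
        }
        return pairs;
    }

    /** Greedy selection: keep picking the combination that covers the most uncovered pairs. */
    static List<int[]> pairwiseSuite(int[] valueCounts) {
        List<int[]> all = allCombinations(valueCounts);
        Set<String> uncovered = new HashSet<>();
        for (int[] combo : all) {
            uncovered.addAll(pairsOf(combo));
        }
        List<int[]> suite = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            int[] best = null;
            int bestGain = -1;
            for (int[] combo : all) {
                Set<String> gain = pairsOf(combo);
                gain.retainAll(uncovered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = combo;
                }
            }
            uncovered.removeAll(pairsOf(best));
            suite.add(best);
        }
        return suite;
    }

    public static void main(String[] args) {
        int[] valueCounts = {3, 3, 3, 3}; // four parameters with three values each
        System.out.println("Exhaustive tests: " + allCombinations(valueCounts).size());
        System.out.println("Pairwise tests:   " + pairwiseSuite(valueCounts).size());
    }
}
```

Every pair of parameter values still appears together in at least one selected test, which is the property combinatorial testing relies on to catch interaction faults with far fewer runs than the full cross product.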

A large number of automated tests can quickly become a burden

Start-ups tend to simply run all tests because they don't have that many. However, very quickly that startup will develop a large body of tests where the names or directory names of the tests are the only indicators of what the test code actually tests.

Eventually running this body of tests in a CI environment contributes enough friction that you want to be selective about the testing that is done. You will need a directory (human-readable list) of your tests and what they actually test so that you (being a tester, a developer or a manager) can trade off the run-time cost versus the coverage. Remember the higher-order tests, such as functional, integration and system tests, will cost more in terms of development and run time and benefit accordingly from proper documentation as to what and how they test.

The best means of keeping an up-to-date directory of tests and their test activity is simply internal code documentation that is exposed via Javadoc, pydoc (or better, Epydoc), etc.
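As an illustration of what such internal documentation could look like, here is a hypothetical higher-order test class whose Javadoc states what it tests, what it leaves out, and roughly what it costs to run. The class, the fields in the comment, and the scenario are invented for the example; the useful part is that the generated Javadoc then doubles as the human-readable directory.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Functional test: articles already in an organization's holdings are listed
 * ahead of other matches in title search results.
 *
 * <p><b>Level:</b> functional (higher-order; needs a populated search index).
 * <br><b>Covers:</b> title search, holdings ranking.
 * <br><b>Does not cover:</b> purchase workflow, subscription bundles.
 * <br><b>Typical run time:</b> minutes rather than seconds.
 */
class HoldingsRankingTest {
    /**
     * Checks that the first search result is the article the organization holds.
     *
     * @param resultIds     article ids in the order the search returned them
     * @param heldArticleId id of the article known to be in holdings
     * @return true when the held article is ranked first
     */
    static boolean heldArticleRankedFirst(List<String> resultIds, String heldArticleId) {
        return !resultIds.isEmpty() && resultIds.get(0).equals(heldArticleId);
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("article-7", "article-2", "article-9");
        System.out.println(heldArticleRankedFirst(results, "article-7"));
    }
}
```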

Kevin Whitney

Automation Architect

Full-Stack Automated Deployments

When we set out to create a standard development and delivery platform four years ago, providing Continuous Delivery-style automated application deployment was a key design requirement. This was for all the usual reasons: greater efficiency, increased velocity, reduced risk, lower cost, improved reliability.

Achieving automated deployment has been a multifaceted effort, and the first visible results came from employing Chef cookbooks and Jenkins jobs to manage deployment and configuration of our portfolio of custom Java applications. We presented on this topic at the 2014 Jenkins User Conference in Boston.

In parallel, we were laying the groundwork for automated database migrations with tools such as Liquibase and Gradle. Management of database schema changes is a source of difficulty for many organizations, and including it in our “push-button” deployment process has provided significant savings of time and effort. Recently we shared the database side of our automated deployment story at the 2015 Gradle Summit in Santa Clara. The slides and video recording are available.

Now we have full-stack automated deployments and are reaping the expected benefits:

Empowerment of individual product development teams via self-service deployments

Significant reduction in DEV-to-TEST time delay

Reduced workload for IT Operations team

More reliable and frequent software releases

Time to invest in additional high-value initiatives

Dan Stine

CCC Platform Engineering

#gradle #liquibase #jenkins #continuous_delivery #chef
