NoSQL discussion

NoSQL databases are not new, but they have definitely gained traction in the last few years. Designed with distribution and scalability in mind, NoSQL technologies fill the gaps where traditional RDBMSs fall short.

Traditional RDBMSs generally support vertical scaling: to improve the performance of the database, the host machine must be upgraded – a faster CPU, faster hard disks and so on. While these changes may be simple to implement for a small organisation, the associated costs rise steeply as more powerful hardware and licences are purchased. Vertical scaling is also not elastic: it is rarely worthwhile downgrading a server after a period of high load, and adding or upgrading hardware often requires planned downtime.

NoSQL databases offer the ability to scale horizontally: database performance can be increased by adding machines to a cluster. Load and data can be distributed between the nodes, reducing the demand on any single machine.

Both NoSQL databases and traditional RDBMSs support replication (data hosted on multiple nodes) and sharding (data split between nodes). So what makes NoSQL special? Relational database servers were typically built to optimise data consistency and availability, with distribution added as an afterthought, whereas NoSQL databases were built with distribution in mind from the ground up.

This leads nicely onto the CAP theorem, coined by Eric Brewer in 2000: a distributed system cannot simultaneously provide all three of the following quality attributes:

  • Consistency – A read sees all previously completed writes
  • Availability – Reads and Writes always succeed
  • Partition Tolerance – Guaranteed properties are maintained even when network failures prevent some machines from communicating

RDBMSs implement both consistency and availability; as they do not need to tolerate partitions, they can readily achieve both. Checking for network partitions is an expensive task, and requiring it for every read and write would add significant latency to requests. NoSQL databases are therefore built on the assumption that the network will suffer partitions and that tolerating them is required. Given this constraint, the database must favour either consistency or availability during a network partition. These choices are not binary, however, and trade-offs can be made between consistency and availability.

In an AP database (availability and partition tolerance selected), when a network partition occurs, servers on both sides of the partition can still serve reads and writes over the data. Writes are uncoordinated, meaning that the one database effectively acts as two disjoint databases until the network partition is resolved.

Choosing consistency does not mean that data is not available when the network is partitioned. Reads may be permitted. However, a strictly consistent database would have to ensure data on both sides of a partition are consistent, meaning that writes may be denied.

So you might think: I want my data to be consistent, so why would I choose an available database?

Imagine a hotel booking website hosted in two regions, US and EU, that suffers a network partition. With a consistent database, no customer could make a booking until the partition is resolved: an error would have to be presented stating that a reservation is not possible at the moment, and the business would lose revenue. With an available database, customers on both sides of the partition could still book rooms. However, there is a chance that the hotel will be overbooked once the network is restored and the changes are merged. This can be acceptable behaviour: hotels typically operate over-booking policies, so it reflects the real-world domain model and keeps the hotel booked towards its capacity.
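
Below is a minimal sketch (with a made-up data model and capacity) of why the overbooking happens: during the partition each region accepts writes independently, and the merge is a simple union of bookings that may exceed capacity.

```python
# Hypothetical example: two regions accept bookings for the same hotel and date
# while partitioned; merging after the partition heals can exceed capacity.

CAPACITY = 10

bookings_us = {f"us-{i}" for i in range(6)}   # accepted via the US region
bookings_eu = {f"eu-{i}" for i in range(6)}   # accepted via the EU region

merged = bookings_us | bookings_eu            # uncoordinated writes are unioned

overbooked_by = max(0, len(merged) - CAPACITY)
print(f"{len(merged)} bookings against {CAPACITY} rooms; overbooked by {overbooked_by}")
```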

Several types of NoSQL store exist, ordered below by model simplicity and scalability (simplest and most scalable first):

  • Key-value stores, e.g. Project Voldemort: a large, distributed, persistent hash table. High availability at the cost of convenience; in-database joins are not possible. The usual usage pattern is as an application cache (a toy sketch of this pattern follows the list).
  • Column stores, e.g. HBase and Cassandra: data is stored in column families rather than rows, and is typically sparse.
  • Document stores, e.g. MongoDB and OrientDB: documents are sets of key-value pairs but can have a more complex, nested structure. These databases better support aggregate/structured information.
  • Graph stores, e.g. Neo4j: data is stored as an object graph. It is easier to traverse relations and explore aggregates local to an object, but harder to compute aggregate results over the entire database.
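
A toy in-process stand-in for a key-value store (hypothetical class and keys, not a real client) illustrates the cache usage pattern mentioned above: the store only understands get/put by key, so any joins or aggregation happen in application code.

```python
# Toy key-value store and cache-aside lookup; names and data are illustrative.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

cache = KeyValueStore()

def load_user_profile(user_id):
    """Cache-aside pattern: check the store first, fall back to the slow source."""
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return cached
    profile = {"id": user_id, "name": "..."}  # stand-in for an expensive lookup
    cache.put(f"user:{user_id}", profile)
    return profile

print(load_user_profile(42))
```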

One of the advantages of NoSQL is that there is no longer a data impedance mismatch when developing applications: a DAL or ORM does not need to be generated to convert a schema into object code. NoSQL offers development agility by allowing object types to be created or modified without migrating all of the existing objects stored in the database. However, there are still challenges: with no supervised schema, developers are now responsible for managing different versions of objects.
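
As a concrete illustration of that responsibility, here is a minimal sketch (hypothetical customer schema) of the upgrade-on-read handling an application might need when old and new document shapes coexist in the same store.

```python
# Hypothetical versioned documents: v1 stored a single "name" string,
# v2 splits it into first/last. Reads upgrade old documents before use.

def upgrade_customer(doc):
    version = doc.get("version", 1)
    if version == 1:
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["version"] = 2
    return doc

old_doc = {"version": 1, "name": "Ada Lovelace"}
new_doc = {"version": 2, "first_name": "Grace", "last_name": "Hopper"}
print([upgrade_customer(d) for d in (old_doc, new_doc)])
```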

Although the new NoSQL technologies support replication and are more scalable than relational databases, they are not a be-all-and-end-all replacement. Relational databases offer a more structured model that is easier to query, and as relational technologies are very mature, behaviour is strongly consistent across the most common offerings such as Oracle, Microsoft SQL Server and MySQL.

One may wish to migrate to NoSQL to increase development agility or to better handle larger loads or volumes of data. Migrating an existing application wholesale to one of these technologies can be a costly venture, so the costs, benefits and trade-offs need to be considered. It may be more suitable to augment an application with one of these technologies to improve a high-load part of it, choosing whichever technology best meets our needs.

Comparing subgraph isomorphisms using correlation matrix memories: Relaxation by Elimination

In a previous blog post, I discussed a text n-gram search engine using correlation matrix memories. This simple yet useful technique can also be applied to comparing subgraph isomorphisms, an NP-complete problem.

Graphs are a useful datatype for describing relational models: a car is made by a certain manufacturer and has a specific colour. In a graph data store, nodes represent attributes and arcs represent the relations between them, as in the following example:

[Figure: example graph relating a car to its manufacturer and colour]

With the advent of graph databases like Neo4j and of social networks, there is a clear need to be able to match graphs. However, graphs in the real world will rarely yield an exact match, generating the need to compare subgraph isomorphisms – evaluating whether subsets of the graphs match.

As this problem is NP-complete, running times can be very high for large graphs. Using a heuristic approach may reduce the time taken to find a solution.

Relaxation by Elimination is a technique developed by Jim Austin to run specifically on binary correlation matrix memories.

By inspection, it is clear that the two social networks listed below are equal. However, a computer comparing these two graphs would have to compare every node and every arc.

[Figure: two equivalent social network graphs]

To mitigate ambiguity arising from nodes with the same name, each node is given a unique label before being stored in the database. To query whether the left graph matches the right, a list of possible matches is generated. Here, two nodes are labelled ‘James’, so Relaxation by Elimination must be used to discover which of the two is the appropriate match:

[Figure: candidate matches for each query node]

Each node in the query graph is compared with the target graph, and the count of matching arcs is recorded. The top node ‘James’ matches A with two arcs (AB and AC), whereas it matches C with only one arc (AC).

[Figure: arc-match counts for the candidate matches]

The weakest matches can be individually excluded using thresholds (e.g. set the threshold to 1) until only the most probable selection remains.
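
A small sketch of this elimination step (not Austin's full algorithm, just the thresholding described above): each candidate match keeps a count of supporting arcs, and candidates whose support does not exceed the threshold are discarded.

```python
def eliminate(candidates, threshold):
    """candidates: dict mapping target node -> number of matching arcs."""
    return {node: support for node, support in candidates.items()
            if support > threshold}

# Support counts for the query node 'James' from the example above:
# it matches A via two arcs (AB and AC) and C via only one arc (AC).
james_candidates = {"A": 2, "C": 1}
print(eliminate(james_candidates, threshold=1))   # {'A': 2}
```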

To use correlation matrix memories, node values and arc distances must be encoded as binary vectors. A typical encoding for arc lengths is a fuzzy binning method (a small encoder sketch follows the table):

Value Range   Vector            Fuzzy Vector
0-1           [1 0 0 0 0 0 0]   [1 1 0 0 0 0 0]
2-3           [0 1 0 0 0 0 0]   [1 1 1 0 0 0 0]
4-5           [0 0 1 0 0 0 0]   [0 1 1 1 0 0 0]
6-7           [0 0 0 1 0 0 0]   [0 0 1 1 1 0 0]
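
A short sketch of this binning (assuming seven bins of width two, with the fuzzy encoding also setting the neighbouring bins, as in the table):

```python
def encode_length(value, n_bins=7, bin_width=2, fuzzy=True):
    """Encode an arc length as a (fuzzy) binary bin vector."""
    vector = [0] * n_bins
    bin_index = min(value // bin_width, n_bins - 1)
    targets = {bin_index - 1, bin_index, bin_index + 1} if fuzzy else {bin_index}
    for b in targets:
        if 0 <= b < n_bins:
            vector[b] = 1
    return vector

print(encode_length(3))               # [1, 1, 1, 0, 0, 0, 0]
print(encode_length(3, fuzzy=False))  # [0, 1, 0, 0, 0, 0, 0]
```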

Rather than examining how many similar arcs exist for matching nodes, we can use a binary encoding instead, stating that ‘James’ has a chance of being either node A or node C. This can be encoded as a support vector: [A B C] = [1 0 1].

Computing the outer product of the edge-length vector and the support vector produces a matrix which can be used as an input to the correlation matrix memory.

For the fuzzy edge-length vector [1 1 1 0 0 0 0]′ and the support vector [1 0 1], the outer product is the 7×3 matrix whose columns are [1 1 1 0 0 0 0]′, [0 0 0 0 0 0 0]′ and [1 1 1 0 0 0 0]′.

This matrix is then pulled into a single column vector by concatenating all columns: [1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]’.

To obtain the counts of matching edges for each node, the sums of the correlation matrix memory outputs are used. For example, for node C, two CMMs are produced and trained using the support matrices for all nodes. With the query graph, the support matrix for each node is then presented as a query, and the resulting binary support vectors ([A B C]) are summed ([1 0 1] + [0 0 1] = [1 0 2]). The sum corresponds to that node’s entry in the graph’s matching adjacency table and can be used as input for the relaxation by elimination algorithm.
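
The steps above can be sketched with NumPy: take the outer product of the fuzzy edge-length vector and the support vector, flatten it column by column to form the CMM input, and sum the recalled support vectors to obtain the per-node match counts. (The vectors are the ones from the worked example; the CMM training itself is elided.)

```python
import numpy as np

edge_length = np.array([1, 1, 1, 0, 0, 0, 0])   # fuzzy encoding of the arc length
support = np.array([1, 0, 1])                    # 'James' could be node A or C

cmm_input = np.outer(edge_length, support)       # 7 x 3 association matrix
flattened = cmm_input.flatten(order="F")         # concatenate the columns
print(flattened)   # [1 1 1 0 0 0 0  0 0 0 0 0 0 0  1 1 1 0 0 0 0]

# Recalled support vectors ([A B C]) for each arc of the query node are summed
# to give the match counts fed into relaxation by elimination.
recalled = [np.array([1, 0, 1]), np.array([0, 0, 1])]
print(sum(recalled))                             # [1 0 2]
```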

Typical uses for this method include social network matching (do two users have the same group of friends?) and molecular model matching (are two molecules made of the same structure of atoms?).

Comparing the Zachman Framework, TOGAF and MoDAF

Enterprise Architecture was introduced to address system complexity and poor business alignment. Typically:

  • IT systems have become unmanageably complex or too costly to maintain
  • IT systems are hindering an organisation’s ability to respond to market conditions in a timely and cost-effective manner
  • Mission-critical information is out of date or incorrect

Zachman Framework

In 1987, Zachman introduced a ‘framework for information systems architecture’ to address and manage the complexity of distributed systems. By looking at issues holistically and from different perspectives, Zachman’s vision was to increase a business’s agility and increase the value gained from implementing a system.

The Zachman Framework is better described as a taxonomy for organising architectural artefacts (e.g. design documents, models and specifications) by target audience and issue, rather than as a framework, a term which better reflects principles and practices for developing and maintaining an enterprise architecture repository.

The framework is best visualised as a grid of concerns for each stakeholder within the business: what (data), how (function), where (location), who (people), when (time) and why (motivation) are listed along the top. Levels of abstraction for each concern are listed down the side, describing a refinement from a plan to a functioning system: scope (contextual model for the planner), enterprise (conceptual model for the business owner), system model (logical model for the designer), technology model (physical model for the implementer), detailed representation and, finally, the functioning system. In each cell, a model describes information at the level of abstraction suitable for the target audience: e.g. the system model in the ‘who’ column may contain the human interface architecture, while at the technology model level the same column contains a presentation architecture.

Columns (concerns): What | How | Where | Who | When | Why
Rows (levels of abstraction): Scope | Enterprise | System Model | Technology Model | Detailed Representation | Implementation

Each model in the framework is additive and complementary. Together, the collection of models forms a holistic view of the organisation that no single model can provide on its own, due to limits on the level of abstraction or expressiveness of any one type of model.

An artefact should reside in only one cell of the Zachman Framework. If an artefact can be described by more than one cell of the taxonomy, questions should perhaps be raised about the quality or level of detail of that artefact.

Completing the Zachman Framework requires 36 models to be generated, which together describe the system from the perspective of every stakeholder.

The Zachman grid improves the quality of the enterprise architecture by:

  • Ensuring every stakeholder’s perspective has been considered
  • Ensuring each artefact has a specific focus point
  • Ensuring traceability of each business requirement to its implementation

The Zachman Framework, however, is not a complete solution. There are many issues that it does not discuss: it does not describe the process for creating the architecture, or for evaluating the fitness for purpose of the proposed architecture.

TOGAF Framework: The Open Group Architecture Framework

The Open Group Architecture Framework divides enterprise architecture into four categories:

  • Business Architecture – describes the processes that a business uses to meet its goals
  • Applications Architecture – describes how applications interact
  • Data Architecture – describes how enterprise data is stored and accessed
  • Technical Architecture – hardware and software that supports applications

Naturally, the applications architecture can be designed to meet the requirements of the business architecture.

One of the most important parts of the TOGAF framework is the Architecture Development Method (ADM): the process that describes how the enterprise architecture can be captured and maintained.

Models in TOGAF range from generic to specific: TOGAF describes these as lying on an Enterprise Continuum, and the ADM describes how generic models can be refined, with appropriate specificity added to meet the needs of the target stakeholder. The most generic architectures are called “Foundation Architectures” and can be used by any enterprise. These are progressively refined into Common Systems Architectures, which may only be relevant to a subset of organisations; Industry Architectures, which describe patterns relevant to a domain; and finally Organisational Architectures, which are specific to a single organisation.

TOGAF ADM describes a preliminary phase and a cycle of processes:

  • Phase A: Architecture Vision
  • Phase B: Business Architecture
  • Phase C: Information Systems Architecture
  • Phase D: Technology Architecture
  • Phase E: Opportunities and Solutions
  • Phase F: Migration planning
  • Phase G: Implementation Governance
  • Phase H: Architecture change management

The preliminary phase ensures buy-in from the organisation’s stakeholders and evaluates the organisation’s suitability to create and digest the architecture being created. This may involve adapting TOGAF to meet an organisation’s needs. TOGAF is non-prescriptive and purposefully allows steps or phases to be skipped, partially completed or altered.

MoDAF

MoDAF is an architecture framework developed by the British Ministry of Defence that captures information and presents it in standard viewpoints. The viewpoints are used by decision makers to help understand and document complex issues.

Viewpoints describe:

  • Strategic Viewpoint (StV) – desired business outcome and capabilities required to achieve it
  • Operational Viewpoint (OV) – the processes, information and entities needed to fulfil the capabilities.
  • Service Oriented Viewpoint (SOV) – services (units of work supplied by providers to consumers) to support the processes described in OV.
  • Systems Viewpoint (SV) – implementation of Operational and Service Oriented Viewpoints; defining the solution
  • Acquisition Viewpoint (AcV) – dependencies and timelines to deliver solution
  • Technical Viewpoint (TV) – standards applied to the solution
  • All Viewpoint (AV) – definition and glossary of the architecture.

MoDAF describes a linear process, from establishing the intended use of the project through to documenting the results, in a similar vein to the TOGAF ADM.

Is cloud computing a buzzword?

John McCarthy’s vision of computation being sold as a public utility was presented in a speech at MIT in 1961. The advent of time-sharing allowed compute resources to be provisioned to business users across a shared resource pool within an organisation. Now that internet connections have become stable, fast and affordable, the central compute resource is no longer constrained to a business’s own site. This post evaluates the impact of migrating to public cloud providers in comparison to in-house ownership and management of computing resources.

Renting services from a public cloud provider allows businesses to grow their compute and storage capabilities in a scalable manner, reducing the need for the large capital outlays and risks associated with purchasing hardware. This benefits companies which are either growing rapidly or lack an accurate growth forecast, and which need a system that can scale with them and consistently meet their computational needs. The pay-as-you-go model makes public cloud a viable choice for small and large businesses alike, as money is only spent on services which are used. The potential savings are greatest for small businesses (who would otherwise not benefit from the economies of scale provided by large data centres) and infrequent users (whose average usage is vastly lower than their peak usage).

While the pay-as-you-go model does reduce initial up-front costs, the costs associated with cloud computing are ongoing and will exist for the entire lifespan of the business. Hardware costs, by contrast, can be amortised over a finite period, and assets can be sold once they are no longer required. It is true that costs for power and HVAC apply to owned assets, whereas these are included in the price of public cloud; however, while such costs may grow considerably for large data centres, they may be negligible in larger organisations compared with the number of workstations and terminals serviced by servers held in house.

Hosting internal business applications on a server held offsite may present issues, both in migration and in operation. When migrating data to a cloud provider, completing and verifying the initial upload may saturate the company’s internet connection, causing issues for the rest of the office. Putting a suitable migration plan in place to prevent loss of data over the transition period may also be difficult, due to ongoing demands to read and write the data during the migration.

The use of cloud computing poses similar risks for the uptime of a business’s mission critical infrastructure; while an onsite compute resource pool can be subject to connection failures and server downtime, cloud computing users are also subject to issues that the use of third party providers creates, with potential scheduled (as well as unscheduled) downtime at inopportune times to the user. While both Amazon AWS and Azure do provide multiple availability zones for virtual machines backed with an SLA guaranteeing that no more than one zone will undergo maintenance at any time, this requires purchasing additional services at an increased expense from the providers. Although scheduled maintenance is given with plenty of warning, there have been a number of high profile incidents where a cloud provider’s failure has caused downtime for numerous large brands. Due to the nature of metered billing from cloud providers, hosting data offsite also incurs additional costs which would not otherwise be present when hosting internally.

It is clear that the cloud does provide elasticity in its scale, which is well suited to public internet-facing applications. Migrating existing web sites to the public cloud, however, may be a significant challenge. Although Amazon and Microsoft provide templated virtual machines and suitable runtime environments to host applications, middleware and databases, runtime environments may not be formally captured in a software’s configuration management, and some network software may be sensitive to the additional latency and unpredictability associated with the cloud environment.

Some business applications have a more constant and predictable load; these do not need the on-demand scaling that the cloud provides and can be hosted internally, either on hardware or in a virtual environment.

User Centred Design with Personas – Bringing People into the Picture

Alan Cooper’s publication The Inmates Are Running the Asylum, published in 1998, introduced personas as a design tool for interactive applications. Personas gained popularity across the software industry due to their effectiveness in eliciting the requirements and needs of a system’s users.

Personas are fictional characters that represent a segment of the target market for an application. These have names, a backstory, a set of goals relevant to the system, and a set of needs that the system should be able to address.

Although personas are fictional, the characters are usually developed through interviews with, or observation of, the target market. Generation of these artefacts is an essential part of user-centred design and enables product managers and developers to stay familiar with the needs of users while eliciting requirements and developing the application.

Personas can seem a counter-intuitive, even counter-logical, part of design though. Being rooted in the user community, a large number of features or requests may arise from users that add little value to the application or even run contrary to the application model. Ending up with a set of personas that are the sum of all desired features may be a sign that the target market is not familiar with the scope or goals of the product. Although it may seem like the users are “back-seat drivers” generating irrelevant feature requests, it is still useful to evaluate the gap between the current project/brief and the target market’s expectations, to identify where additional value can be added to the application.

Generating a persona

Personas can be tricky to write: most of the time they are either bloated with irrelevant details about the user, or missing important needs of a specific user group.

Naturally, it can be difficult to capture all the relevant information about the target market in personas in the first attempt. However, as with all agile methods, these personas can be maintained, added to and adjusted throughout the product lifecycle as new needs arise.

Typically, earlier persona-like documents suffered from four major problems:

  1. The characters were not believable, or had been designed by committee, and their relationship to the data was not clear.
  2. The characters were not communicated well. Typically they were recorded in a resume-like document blown up into a poster.
  3. There was no understanding of how the characters should be used. There was nothing that spoke to all disciplines or all stages of the development cycle.
  4. The project was a grass-roots effort with no support from the business. Typically there was no budget to support the development of personas.

Microsoft Research published a strategy for persona generation as an extension of what Cooper discussed in his Inmates book.

  • Personas originate from market segmentation studies. The highest-priority segments are fleshed out from user studies, focus groups, interviews and market research. Metrics around market size, revenue and competitive placement determine which market segments are enriched into personas.
  • Using existing related market research (from internal or external sources) helps augment the personas with information backed by qualitative or quantitative data to support design choices.
  • International market information and accessibility information is included within each persona rather than creating separate disability-focused personas. Microsoft typically create only one “anti-persona” – a persona for a user outside of the intended application market.
  • Persona generation is a multi-disciplinary team effort. A persona creation team would generally include product planners, interaction designers, usability engineers, market researchers and technical writers. The team would have two or more people dedicated to maintaining persona profiles from market research. Where lighter efforts rely solely on existing user research, the resulting personas are considerably less detailed.
  • Common facts extracted from research are grouped together and form priority information for the generation of the personas. The groups of findings are used in writing narratives that tell the story of the data.
  • In the user stories, qualitative data and observed anecdotes are used. One of the goals of Microsoft Research’s personas is to back every statement in the persona with user observation or collected data.
  • Microsoft use a central document for each persona as a storehouse for key information about the persona (data, attributes, photos, reference materials). These documents do not contain all of the feature scenarios; they typically contain the goals, features and typical activities that motivate and justify the scenarios that appear in the feature spec.
  • Links between persona characteristics and supporting data are made explicit in the persona documents.

A persona document typically contains:

  • A day in the life – following the user throughout a typical day
  • Work activities – a look at the user’s job description and their role at work
  • Home and leisure activities – information about what the user does outside of work
  • Goals, fears and aspirations – using the above information to understand the user’s concerns in their life, career and business
  • Computer skills, knowledge and abilities – learning about the user’s computer experience
  • Market size and influence – understanding the impact that users similar to this one have on the business
  • Demographic attributes – key demographic information about the user, their family and friends
  • Technology attributes – reviewing the user’s past, present and future perspectives on technology
  • Quotes – hearing what the user has to say
  • References – documentation to support this persona
  • Once a basic persona is written, Microsoft add photos of models in some of the described situations to illustrate and communicate the persona. Stock photos are not used, as these typically offer only one or two photos of a person, usually in a context which is not described in the persona.
  • Cooper describes personas as a discussion tool (“would Dave use this feature?”). This is valuable, but does not really provide a quantitative description of how valuable a feature is. Microsoft use a spreadsheet that documents how the subjective importance of features/requirements to the personas is mapped into the requirements document.
  • Once personas are generated, their use is not limited to the generation of features/requirements. The product and feature specifications utilize the personas, along with vision documents, story boards, demos, test cases, QA testing and documentation writing.
  • The usage and relevance of each persona is also monitored through the product lifecycle. To prevent a persona which reflects an insignificant or irrelevant segment of the target market from being overused, persona use is typically tracked within a spreadsheet.

This approach from Microsoft appears thorough in both recording the information about the user and how the captured information is being used throughout the application lifecycle to support design decisions for the product.

Roman Pichler discusses a slimmed-down persona template that he uses in his requirements elicitation process. Typically he captures:

  • Name
  • Looks
  • Characteristics and behaviours that are relevant to the product
  • Demographics, job, lifestyle, spare time activities, common tasks
  • Why the persona would buy or use the product
  • What problems should the product solve for the user
  • What benefits the persona wants to achieve

I believe that these attributes about the user’s goals could be used to augment personas generated by Microsoft’s method. Once these personas have been generated, the goals can be used to build a “big picture” of the product, which can then be used to create scenarios, storyboards, workflows, epics and high-level design mock-ups.

 

How Royal Mail can (automatically) read misspelt addresses and deliver to the correct address

Royal Mail, the largest postal service in the UK, delivers over 1.8 billion letters each year. With addresses that are handwritten, sorting the mail in a fast and efficient manner relies on computer technology which can scan and sort the mail without any human intervention.

Scanning the mail and extracting an address is quite an easy and well documented task. However, matching a handwritten address to a real postal address presents a number of challenges which can be solved with machine learning.

Problems arise from both incorrect labelling and incorrect scanning of the envelope. The words in the address can be spelt incorrectly, omitted or given in the wrong order. The address may also be illegible, causing incorrect characters to be read when the address is scanned.

Using a CMM, these addresses can be matched against a database of UK addresses (the Postal Address File) with relative ease.

Correlation Matrix Memory (CMM) neural networks are mathematical models applied to real-world pattern association problems – just like our address problem.

There are three different types of data that can be mapped onto CMMs: symbols (text, individual letters), used for search problems; numbers (continuous or real values), used for signal processing; and graph data (relational data between items).

The scanned address is split into groups of letters using an n-gram encoding. Each possible n-gram is mapped to a number which is used as the input for the correlation matrix memory.

The CMM is a binary matrix of associations between input patterns and output patterns (as indicated by the diagonal lines in the image below). When an input pattern is presented to the matrix, a number of associated output patterns are activated. A single output pattern is chosen by summing the number of elements of the input pattern which activate each output and selecting the highest-scoring pattern.

[Figure: a correlation matrix memory associating input and output patterns]

A given postal address is split into 3-letter n-grams. The city SOUTHAMPTON can be split into: SOU, OUT, UTH, THA, HAM, AMP, MPT, PTO and TON. Over an entire address there may be between 60 and 100 n-grams, which together form a pattern that can be associated with a single real-world address.
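
A small sketch of this split:

```python
def ngrams(text, n=3):
    """Return the overlapping n-letter groups of a string (spaces removed)."""
    text = text.upper().replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("SOUTHAMPTON"))
# ['SOU', 'OUT', 'UTH', 'THA', 'HAM', 'AMP', 'MPT', 'PTO', 'TON']
```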

The weights in the CMM are mappings from each n-gram to one of the UK addresses. Given that there are 27 million addresses in the UK and 17,576 possible 3-letter n-grams, the size of this mapping could grow to roughly 475 Gbits (17,576 × 27 million). However, as only the weights that are set are stored, rather than all weights, the storage overhead is significantly reduced.

Assuming that only 100 n-grams on average are associated with each address, the storage requirement falls to 2.7 Gbits, which is small enough nowadays to be held entirely in memory.
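
A toy sketch of this sparse storage and recall (a tiny made-up address list rather than the real Postal Address File): only the set weights are kept, here as a mapping from n-gram to the set of addresses it appears in, and recall simply counts how many of the query's n-grams vote for each stored address.

```python
from collections import defaultdict

def ngrams(text, n=3):
    text = text.upper().replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

addresses = ["10 DOWNING STREET LONDON", "1 MAIN STREET SOUTHAMPTON"]

weights = defaultdict(set)                 # sparse CMM: only set weights stored
for address_id, address in enumerate(addresses):
    for gram in ngrams(address):
        weights[gram].add(address_id)

def recall(query):
    scores = defaultdict(int)
    for gram in ngrams(query):
        for address_id in weights.get(gram, ()):
            scores[address_id] += 1
    return max(scores, key=scores.get)     # highest-scoring address wins

print(addresses[recall("1 MIAN STREET SOUTHHAMPTON")])   # misspelt query still matches
```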

As addresses in close geospatial proximity also tend to have close lexical proximity, similar nearby addresses may generate false positive results from the CMM. However, as the post is hand-delivered by a human at the end of the chain, items can still be delivered appropriately.

The effects of the fall in oil price on the US shale market

Since June 2014, the price of oil has fallen over 60% to about $50 per barrel of Brent Crude. This reduction in price, to below the commodity’s pre-2008 value, has had a number of political, financial and economic effects – especially on the US shale market.

The classic model of supply and demand provides an interesting starting point when analysing the price. If the devaluation is a result of a fall in demand, this would be a worrying indicator of the state of the global economy. Oil is the most traded commodity and contract in the world, and if the reduction in price is due to a change in demand, we should consider whether this presents an advantage, lowering the cost of production and imports, or a threat, either as an indicator of a reduction in production or as a financial pressure on the debt that is leveraged against the oil market.

Goldman Sachs analysts state that there is actually an excess in supply, caused by a change in the reaction function used by OPEC to moderate the introduction of new oil into the market. As demand falls, the usual course of action is to reduce production to maintain prices; however, the Saudis appear not to be reducing supply, shocking the market. Unless there are knock-on effects, this presents a benefit to the market; however, there are two sides to the story, and the reduction in demand may play a stronger role in the setting of this price.

Even with volatility, oil traded at around $35 per barrel for a good 25 years, from the mid 1980s to the end of 2007 – about one third of the $110 level from which the price has now fallen. In real terms, the recent period has seen the highest sustained price level (excluding wars and conflicts), driven by a gross increase in demand from China and other developing countries shortly after 2008, a demand shock that oil producers have taken a number of years to catch up with.

With oil producers unable to increase supply in the short term, the price roughly tripled shortly after the financial crisis. One could speculate that this increased price was a factor contributing to the miserable growth of western economies up until this point. Having said that, the new increase in supply points to one positive factor, highlighting the medium-term elasticity of oil producers, owing to a 90% increase in US shale output.

For the last three or four years, producers have been working to increase supply; however, concerns could be raised about how quickly the change in price occurred, given that it has taken the US four years to increase output from 5 mb/d to 9 mb/d, and only six months for the price to fall from $110 to $50.

The increase in US production has been masked by reductions in North African oil. In 2011, the Libyan civil war thwarted production, with output falling from 2 mb/d to next to nothing at two points in the space of three years, and fighting in the Niger Delta in Nigeria has reduced the rate of export due to the number of companies needing to declare force majeure.

The high price of oil led to an increase in supply throughout the world – including from OPEC, Libya and Nigeria – but it wasn’t until August 2014 that supply surged, with OPEC increasing supply by 891 kb/d and Nigeria increasing exports by 380 kb/d, leading to an excess supply situation.

The current surplus of supply appears to mimic parts of the 1980s oil glut, when, after a sustained period of abnormally high oil prices, rates fell to a nominal low of below $8. There are five parallels that can be drawn. Firstly, in the 1980s, the high price per barrel introduced a range of new suppliers into the market, as is the case today with new shale production. Secondly, there was aggregate demand weakness in the West. Thirdly, Saudi Arabia refused to cut supply in both cases, causing OPEC to come under strain. Fourthly, it is likely that Russia will suffer most as a result of the oil price correction. And finally, developed nations with strong import markets stand to gain the most from the correction in the oil price.

In the 80s, the oil price was deliberately allowed to fall to its all-time low through an excess in supply from the Saudis, to bring into line members of OPEC who were refusing to cut production in response to weaker demand following fears of peak oil. Once oil production was cut, the price levelled out at $18 per barrel.

The difference this time is that the Saudis are using their dominant position to put a squeeze on the US shale oil industry. The decline in the oil price has not only added financial pressure to “stripper” oil wells – small wells with very little output – but has also reduced the attractiveness of investing in new locations. As shale oil wells typically have a very short lifetime, a lack of new wells could cause the US shale industry to dry up in the mid-term; a subtle, tactical move from the Saudis.

The typical break-even for shale oil is in the range of $40-$70 per barrel, depending on location. The short-term financial outlook is very bad, with parts of the market teetering on the edge of bankruptcy. This wouldn’t have an effect on short-term production – if anything it would extend production, as creditors, in the event of bankruptcy, would seize the assets and try to maximise their value. The problem is that, in the mid-term, new shale wells will not be financed if oil remains at its current price.

Although it may not currently be as financially viable as it used to be, shale oil will still be around for a while to come. The elasticity it provides, where production can be scaled, is much more powerful than that of conventional oil wells. We saw how, post-2008, the US was able to support its economy through a near doubling of shale output over three years in response to high prices following demand from China. There is no question that, as the US tries to reduce its dependency on foreign oil, the elastic supply that shale provides will guarantee its own future.

Addendum: 

Drawing on the parallel with the effect of the falling oil price on Russia, the largest unknown – and possibly a negative surprise – will be the political, financial and economic effects that would arise from a crippled Russian oil economy. Russia needs the oil price to hover around $100 per barrel to balance its budget. At current prices, the economy will come under strain, and Putin will have to respond in some shape or form soon: reserves will have to be drawn down and budget cuts will have to follow.

Russia could use this as an opportunity to cleanly extricate itself from the mess it created in Ukraine; however, this appears unlikely, for the simple reason that Putin seems to have wagered his political future on foreign policy bravado. It appears more likely that Putin will have to clamp down even harder domestically to entrench his position. A sustained drop in the oil price would lead to financial hardship and an increase in the level of despotism as the government struggles to maintain the political status quo.

Raspberry Jam York – National STEM Centre, November 2014

Last weekend was spent at the National STEM Centre in York at the Raspberry Pi Jam – a fascinating weekend event celebrating all things Raspberry Pi.

I held a stand with a number of computer science students from the University of York where simple projects that could be used as teaching aids in schools were showcased to the public.

[Photo: a view of the York stand]

Encouraging to see interest in technology starting at such a young age

Hundreds of children and parents visited our stand, asking questions about how the projects worked and were made. It was encouraging to see that children as young as 12 were able to read Python code and even suggest improvements to my project. Those who were already familiar with the Raspberry Pi and had been experimenting with it were very able to apply their knowledge and skills to understanding how the projects at the university stand worked – something that I hadn’t expected at such a young age.

[Photo: a view from the centre atrium]

The Halloween Game

My offering at the University of York stand was a simple ball-toss game with three cups and a display showing the score and the cup to aim for. The project is simple, but covers a lot of material that can be found on GCSE Electronics and Computing syllabuses.

[Photo: the ball-toss game]

With a sensor in each cup, the project provides a suitable introduction to electronics for school children. The electronics are linked to the Raspberry Pi using the Pibrella interface board, which makes things easier and safer than using the GPIO pins on the Pi directly.

The second half of the project was a Python program which captured the score and drew it on a simple display. For this I used Python’s turtle graphics, as it is included on the Raspberry Pi by default and also links in nicely to the syllabus.
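
A stripped-down sketch of the game loop (the cup sensors are stubbed out with a random read here – the real build read them through the Pibrella board – and the cup scores are made up):

```python
import random
import turtle

CUP_SCORES = [10, 20, 50]          # illustrative points for the three cups

def read_cup_sensor(cup):
    """Stand-in for a Pibrella input read: randomly reports a ball landing."""
    return random.random() < 0.1

screen = turtle.Screen()
screen.title("Halloween Game")
pen = turtle.Turtle(visible=False)
pen.penup()

def draw_score(score):
    pen.clear()
    pen.write(f"Score: {score}", align="center", font=("Arial", 48, "bold"))

score = 0
draw_score(score)
for _ in range(100):               # a short demo round instead of a real timer
    for cup, points in enumerate(CUP_SCORES):
        if read_cup_sensor(cup):
            score += points
            draw_score(score)
turtle.done()
```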

8am start

Getting down to the STEM Centre to set up was an interesting experience. Hours of preparation went into the display, however despite the well-rehearsed assembly, some of the projects required hacking and tinkering right up until the doors opened!

[Photo: setting up the stand]

Moving away from eBay. How it makes things… cheaper

There’s no doubting eBay’s dominant market position for trading second-hand goods across Europe and North America. But from my own experience, I’ve found that eBay is no longer a viable option for selling my unwanted goods, no longer suitable for selling as a trade, and no longer the cheapest place for buying new goods.

Let’s rewind a few weeks to the bank holiday weekend at the end of summer where, after a long clear-out, I was left with stacks of items that I could either sell online or at my local car boot sale. Not being one to want to waste my time, I thought eBay would be the place I’d want to sell my wares. So within the day, I had 24 listings uploaded with high-res pictures, accurate descriptions and a friendly note as to why I was selling. As I wanted a quick buck on a few commodity items, I listed 7 of my listings as Buy It Now sales, and the remainder as auction-style sales.

The first few sales went without a hitch and items were out in the post on the same day that I received payment. Just as I had my hopes up, the problems started rolling in.

Items lost in the post

The first issue I had with my eBay sales was an item getting lost in the post. As the item I was selling was relatively low cost, I posted it untracked, but that meant there was no way of proving that it had arrived at the buyer’s end. After a week of negotiating with the buyer and asking him to wait for Royal Mail’s extended deadline (15 working days), a case was opened against me. Being in this position meant that I was forced to give a refund, as my PayPal funds from my other sales were frozen until the case was closed. Not a particularly nice position to be in. eBay’s ratings and protection seem to rule in favour of the buyer in most circumstances.

After refunding the payment and waiting a further week, I submitted the claim form to Royal Mail. This itself is quite a drawn-out procedure, having to attach the proof of postage, the eBay page, the PayPal transaction and the original receipt for the item being posted – justifying the value of the item you’re claiming for. I’ve posted this to their Freepost address and am still waiting for a response – which could take between 30 and 60 days. Great.

Not wanting to be stung again with postage issues, I opted to send the remainder of the items using tracked services, and ended up adding tracking to all 6 of these items. Tracking costs at least £1, and for the remaining 20 items the costs quickly added up to something that I wasn’t expecting to pay.

Buyer didn’t have PayPal

The next headache was a new buyer on eBay who didn’t appear to have their own PayPal account. I wouldn’t have minded cash on collection, but from what I had received from them, I could see that this looked like someone who was either incredibly stupid or about to scam me. I listed the item on GumTree at the price it had sold for on eBay, selling it within a day. That seemed like an easy resolution.

I’ll pay you next week

After my laptop sold, I didn’t receive payment from the buyer. A week later, I opened an unpaid item case, but that didn’t really help: I just received a message asking if I’d accept payment in a week’s time. I offered the item up as a second chance listing, but no one really wanted it at the price the laptop had sold for.

Frustratingly, I had to go through the whole auction process again. It would have been really frustrating if the laptop had sold for less than the finishing price of the first auction. Luckily, however, it got an extra £30 out of the ordeal, so I didn’t mind so much. But having not received my payment for two and a half weeks, this was a stressful time.

The insertion fees, the seller fees and the PayPal fees

When you sell on eBay, you end up paying 3 times to do so. Firstly, there’s a low fixed cost to post a listing. That’s fine.

The thing that gets you, though, is the final value fee, which for my auctions seemed to work out at between 5% and 10%. For high-value items, such as the laptop I sold, the cost of selling was higher than the cost of the postage.

And if that wasn’t enough, PayPal also charge 20p + 3.4%. All these fees add up, and the costs become almost prohibitively expensive. eBay seem to pride themselves on the service and protection provided to buyers, and from what I have experienced, this has been done to the detriment of the services provided to private sellers.

Having seen that items sold on eBay by businesses are generally sold at a higher price, all this makes me think that the current fee structure is causing businesses to increase sale prices to recoup some of the costs of selling on eBay.

I think for now at least, I’ll stick to conventional shopping and selling on GumTree. It seems to be a lot less of a hassle.

Introduction to Map Reduce

Making algorithms and solutions to problems scalable allows increases in processing performance that can grow faster than Moore’s law. Even if the speed of a CPU doubles every 18 months, a single machine may not be able to grow to meet the needs of the current environment; multicore and multi-server architectures are now employed by some of the largest companies online to munch through billions of pages of information.

Sadly, performance speedup isn’t a linear function of the number of processing cores; this is in part due to scheduling and communication limitations. The trick is to solve big problems on multiprocessor systems and smaller problems on fast single-processor systems. When a large problem is faced, it is important that a scalable architecture or algorithm is used to process the data or calculate the result.

Parallel programming is more difficult than serial programming. There are a couple of paradigms:

Shared memory

A worker locks part of the data, does the work and then releases the lock when done. There is an overhead in locking and unlocking the data.

Message passing

The dataset is partitioned and the partitions are assigned to workers. Workers may need to pass the results of their work to other processors – this is much more scalable but slightly more difficult to program.

The map reduce programming paradigm is an abstraction of the message-passing paradigm where work is pipelined at a higher level.

Map Reduce

Map Reduce has two phases, which are specified by the programmer: map, where data is taken as input and divided into sub-problems, and reduce, where the results are assembled.

The programmer doesn’t have to worry about the number of processing nodes or the communication between them; this is handled by a map reduce framework. All the programmer needs to do is define the problem in terms of a mapping step and a reducing step, which is what makes the model so powerful for parallel computing.

The most common example of map reduce is calculating word frequency across a set of documents. The map reduce system distributes documents to a number of mappers. A mapper doesn’t care which documents it gets; it simply acts on the data passed to it and returns a list of words with their frequencies. The reduce phase in this example sums the frequency of each word occurrence – something the map nodes can’t do, as each mapper only acts on a subset of the data. The reduce nodes summarise the data given to them by the mappers.

Formally speaking:

Mappers:

  • Take in (k1, v1) pairs
  • Emit (k2, v2) pairs
  • (k2, v2) <- map(k1, v1)

Reducers:

  • Receive all pairs for some k2
  • Combine these in some manner
  • (k2, f(…v2…)) <- reduce(k2, …v2…)

The map reduce platform is responsible for routing pairs to reducers such that all the key-value pairs for a given key are routed to a single reducer.
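
A minimal in-process sketch of the word-count example in the shapes given above (no real framework: the platform's routing of k2 pairs to a single reducer is simulated with a dictionary grouping step):

```python
from collections import defaultdict

def mapper(doc_id, text):
    """(k1, v1) -> [(k2, v2)]: emit (word, 1) for every word in the document."""
    return [(word, 1) for word in text.lower().split()]

def reducer(word, counts):
    """(k2, [v2, ...]) -> (k2, f(...v2...)): sum the counts for one word."""
    return word, sum(counts)

documents = {1: "the quick brown fox", 2: "the lazy dog and the quick fox"}

# Map phase: each document can be handed to any mapper.
intermediate = [pair for doc_id, text in documents.items()
                for pair in mapper(doc_id, text)]

# Shuffle: route all pairs for a given k2 to the same reducer.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase.
print(dict(reducer(word, counts) for word, counts in grouped.items()))
```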

It’s that simple. I’ll go into some examples in subsequent posts.