Towards integrating sample surveys in India

Large-scale household surveys in India are mainly undertaken by the NSSO and the NCAER. In this article, Pronab Sen, Chairman of the National Statistical Commission, highlights the need for systematic convergence between the two organisations, as well as other smaller ones, in the conduct of surveys. He discusses the requirements and challenges of such an integration process, including issues of data sharing and dissemination.

Historically, there are two organisations that have had the capacity to conduct large-scale sample surveys in the country – the National Council of Applied Economic Research (NCAER), and the National Sample Survey Organisation (NSSO)¹. Their origins and objectives have been very different, which is reflected in the types of surveys that they conduct.

Large-scale sample surveys in India: NSSO vs. NCAER

The NSSO, although originally a unit of the Indian Statistical Institute (ISI), was essentially set up to meet the data needs of the government. Until fairly recently, NSSO’s survey activities were driven primarily by the requirements of the Planning Commission for its planning models, and by the Central Statistical Organisation (CSO) for the National Accounts. Subsequently, its ambit has widened, and it has been providing survey data to other government ministries as well. Nevertheless, the information and data it generates are generally aimed at meeting two primary requirements of policymakers: (a) to be able to identify or characterise a problem; and (b) to be able to track certain parameters over time. In both cases, the requirements are usually met by cross-sectional², descriptive statistics. Consequently, there has been little or no effort to track qualitative variables or behaviour, except in so far as these can be estimated through econometric analysis using the auxiliary variables available in the surveys.

The NCAER was set up with government support to undertake qualitative research on the Indian economy for both public and private users. It has always had a more diversified clientele, including wings of the government whose requirements could not be met by the NSSO. As a result, although the reach and variety of its surveys have been more limited than those of the NSSO, it has addressed types of data that are of interest to researchers. This has been driven essentially by an important institutional difference between the two: the NCAER is primarily a research body which collects data for its research needs, whereas the NSSO is purely a data collection agency. This distinction has a number of implications for the functioning of the two organisations.

First and foremost, there is the issue of domain knowledge. As a research body, the NCAER not only has domain knowledge in the area of the survey, it has the possibility of retaining the learnings of each survey and building upon them. The NSSO, however, does not in itself have domain knowledge in any particular area. While the NSSO has had a long tradition of ‘working groups’ made up of both survey statisticians and domain experts, the problem is the lack of retention of the learning, except in terms of the survey methodology. Thus, a new NSSO survey on an issue that has already been covered will require the setting up of a new working group with new domain experts, which also means that the two surveys may not always be strictly comparable.

As a data collection agency, the NSSO has permanent field staff at both the supervisory and investigator levels, which enables it to retain and build upon its field experience. As a result, it can draw upon a wealth of knowledge on what is feasible in the field and what is not. This enables it to fine-tune its survey instruments without necessarily undertaking extensive and expensive pre-testing. The NCAER, however, has to undertake pre-testing for every new survey, and possibly go through an extended process of training and retraining its field personnel.

The second issue pertains to the ability to experiment. As a publicly-funded body, the NSSO does not necessarily have to justify its expenditures on the basis of the immediate returns from its surveys. Justification based on potential future utility is acceptable. As a consequence, the NSSO can, in principle, continuously evolve its survey methodologies and instruments in keeping with changes in the milieu where the surveys are conducted. The NCAER does not enjoy this advantage, and has to justify every survey on the basis of its results. Pure experimentation is therefore much more difficult, and its surveys perforce have to follow methodologies that are already well-established, although these may not be ideal for the purposes of the particular survey.

Third, the NSSO, along with other arms of the Ministry of Statistics and Programme Implementation (MoSPI), can invest in developing sampling frames³ of various types. The Economic Census, the Urban Frame Survey and the Annual Survey of Industries (ASI) frame are examples of such activities. These are well beyond the financial and manpower resources of NCAER. Although the development of frames is everywhere the responsibility of governmental or public organisations, and non-government agencies piggy-back on such efforts, the sampling frames invariably reflect the interests and predilections of the developers, especially with regard to the indicators that would be used for stratification⁴ purposes. Thus, the possibility exists that considerations underlying the follow-up surveys by the NSSO could guide the structure of the sampling frame data, while NCAER necessarily has to shoe-horn its surveys into the publicly-generated frames.
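To make the role of the stratification indicator concrete, the following is a minimal sketch in Python of drawing a stratified sample from a frame. The frame, the `size_class` indicator and the function name are all hypothetical and are not drawn from any NSSO or NCAER procedure; the point is only that the sample a survey can draw depends on which indicators the developer of the frame chose to record.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_key, take_one_in, seed=0):
    """Group frame units by a stratification indicator, then draw a
    simple random sample of roughly one unit in `take_one_in` from each
    stratum (at least one unit per stratum)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for unit in frame:
        strata[unit[stratum_key]].append(unit)

    sample = []
    for name in sorted(strata):
        units = strata[name]
        n = max(1, len(units) // take_one_in)
        sample.extend(rng.sample(units, n))
    return sample


# Hypothetical frame: 100 villages, each tagged with a size class
# (20 "large" villages and 80 "small" ones).
frame = [
    {"village_id": i, "size_class": "large" if i % 5 == 0 else "small"}
    for i in range(100)
]

sample = stratified_sample(frame, "size_class", take_one_in=10)
```

A survey that needed to stratify on some other characteristic, say distance from the nearest town, could not do so with this frame, since only `size_class` was recorded when the frame was built.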

Fourth, NSSO is constrained in undertaking surveys which could be politically sensitive and thereby lend themselves to charges of sample-selection bias⁵, such as surveys on institutional capacities, opinion polls, expectations surveys, etc. As a result, the NSSO has never in its history carried out any survey which is even remotely qualitative in nature, and has confined itself to purely quantitative indicators. This is a major lacuna, since much of modern research on behavioural issues requires qualitative information along with the usual quantitative variables. The NCAER, on the other hand, has much greater flexibility in this regard, which potentially can enhance the value of its surveys considerably.

Finally, there is a fundamental difference in the way the two organisations share their data. While the NSSO has in principle always had an open data policy, in practice it had to limit data access due to its own organisational and logistical constraints. It is only fairly recently that it has taken the path-breaking step of releasing all its unit-level records with suitable anonymisation⁶. This openness has had a fundamental impact on India-centric research as well as on attitudes within the organisation. NCAER, on the other hand, by and large does not freely make available the data from its surveys⁷. It quite correctly treats its data as proprietary and disseminates its reports on a commercial basis. This difference in data sharing policies has certain implications which should be noted.

The most important implication is that since the NSSO data is subjected to intense scrutiny by a broad and open research community, it enjoys a high degree of credibility. Its strengths and weaknesses are laid out clearly and appropriate adjustments can be made while undertaking analytical work. The non-transparency of the NCAER data essentially means that its credibility rests on the reputation of the in-house researchers and not on the quality of the data itself. A related implication is that the NSSO can get feedback and suggestions for improvement from a wide variety of users, which NCAER cannot. While this may not be of much consequence for one-off surveys, it makes considerable difference in repeat surveys.

It should be clear, therefore, that although NCAER and NSSO both carry out large-scale sample surveys, they have very different strengths and weaknesses. Indeed, although they are commonly seen as competitors, and in fact see each other in the same light, their respective attributes really suggest strong complementarities between them. Traditionally, these complementarities have not been utilised in a manner beneficial to both. In recent times things have begun to change. There has been some linkage for several years between the NSSO and NCAER, but the relationship has been asymmetric. By and large, the NSSO has assisted NCAER in sample design and little else, but has not drawn on NCAER’s expertise to any degree. The prospects of two-way cooperation are much larger, provided that both organisations recognise the complementarities and are able to shed their mutual inhibitions.

Given its specific characteristics, the NSSO is the only organisation that can provide the overall context within which other data collection and survey efforts take place. This is not just about the development of frames, but also about the manner in which diverse surveys are coordinated and data is shared. Nor is this only about the relationship between NSSO and NCAER; several other agencies which carry out micro-studies and large sample surveys exist, but they do not mesh with the larger context in which the NCAER/NSSO-type surveys are conducted. Thus, at present, there is a lot of duplication and overlapping of effort. This brings to the fore the importance of systematic convergence in the conduct of large-scale surveys in the country, both between these two agencies and with others which do similar work.

Data sharing and data dissemination policies

Greater synergy with other data-gathering efforts calls for not only an understanding of the relative strengths of the different organisations, but also the limitations under which they function today. At its heart lies the issue of data sharing. At present, the Government of India has a data dissemination policy for public data that is binding on governmental agencies, including the NSSO. This policy states that unless any data is specifically classified as confidential or sensitive, it must be placed in the public domain with whatever measures are necessary to preserve confidentiality of the respondents. NSSO’s practice of releasing anonymised unit-level records is compliant with this policy. However, there are two issues which need to be taken into account in this context.

First, the open data policy of the central government is not binding on non-governmental agencies, and indeed not even on state governments. There is, however, a way around this which the government can use if necessary. The Collection of Statistics Act 2008 empowers the government and its nominated agencies to compel sharing of any data, including non-anonymised data, from any person or institution in the country⁸. On the other side, the Act enjoins strict limitations on the manner of public release of such data by the government in order to protect the identity of the respondents, and provides stringent punishment for violation of the confidentiality provisions. Thus it is entirely possible for the NSSO to use the provisions of this Act to access the raw data from any survey or micro-study and integrate it with NSSO’s own database, if such a correspondence exists.

It is, however, difficult to see what purpose such a procedure would serve in view of NSSO’s lack of domain expertise, unless the NSSO carries out such integration, and releases the integrated dataset into the public domain with appropriate anonymisation. More importantly, this would be a fairly draconian measure and would undoubtedly raise the hackles of all agencies, not just the non-governmental, but governmental as well. The end result could be most counterproductive for the future of sample surveys in India.

Second, there is clearly a distinction between data dissemination and data sharing. In India, at present, there is no policy whatsoever on data sharing. This is a major lacuna indeed, and it is most surprising that no effort has been made in articulating a data sharing policy, at least for government agencies. In the absence of such a data sharing policy, there are serious problems not only in getting the most out of the large number of data collection efforts that go on in the country at any given point in time, but also with optimally using the frames generated through the use of public funds. Both the government agencies which generate frames – the Registrar General of India (RGI) and the MoSPI – have their own procedures for specific cases, including drawing of samples for third parties and setting up temporary data rooms⁹. The choice of partners too is discretionary and depends largely upon relationships.
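The anonymised release of unit-level records discussed in this section can be illustrated with a minimal Python sketch. It assumes hypothetical field names (`name`, `address`, `phone`, `household_id`) and a hypothetical salted-hash record key; it is not the NSSO's actual procedure. Real anonymisation also has to deal with indirect identifiers, such as fine-grained geography, which this sketch ignores.

```python
import hashlib

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def anonymise(record, salt):
    """Return a copy of a unit-level record with direct identifiers
    removed and the household ID replaced by a salted hash, so that
    records for the same household can still be linked to each other."""
    released = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "household_id"
    }
    raw = (salt + str(record["household_id"])).encode("utf-8")
    released["record_key"] = hashlib.sha256(raw).hexdigest()[:12]
    return released


# Hypothetical unit-level record as held by the collecting agency.
raw_record = {
    "household_id": 1042,
    "name": "A. Respondent",
    "address": "Village X",
    "phone": "0000000000",
    "household_size": 5,
    "monthly_expenditure": 7200,
}

public_record = anonymise(raw_record, salt="agency-secret")
```

The salt stays with the custodian of the data: without it, a user of the released file cannot recompute the key from a known household ID.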

The integration process: What it would take

With a proper data sharing policy in place, it is possible to visualise a situation in which considerable synergy can be obtained between the various surveys that are undertaken in the country by diverse agencies. At the heart of this system would have to be the NSSO, playing five key roles:

Developing a set of sampling frames for common use by all survey organisations in consultation with others.
Undertaking large-scale, usually cross-sectional, surveys to meet the essential needs of government.
Assisting other survey organisations in sample design and sample selection.
Undertaking experiments on survey design and methodology in consultation with other organisations.
Providing assistance in integrating surveys from various sources including its own.

The NCAER, and other survey organisations, in their turn will need to contribute to this process of integration by:

Agreeing to share data collected by them, perhaps with some pre-specified time lag.
Situating their samples in the frames/surveys of the NSSO in order to facilitate integration.
Focusing their attention on types of surveys that the NSSO does not do, or has difficulties with.
Being actively involved in designing experiments, especially in terms of domain knowledge and research issues.

However, for the NSSO to play such a role, there has to be a reorientation, along with a corresponding reorganisation, of the organisation. Some of its systems have not kept pace with changes that have taken place over the years and with the requirements being envisioned. For a start, the central role in data sharing and data integration will have to be played by the Data Processing Division (DPD) of the NSSO, which is the custodian of frames and of the NSS survey data. It is also responsible for sample selection and for data integration, which, as of now, is limited to pooling of central and state samples. Along with these key functions, the DPD also does all the data processing work of the NSSO, including validation. The DPD will have to be strengthened significantly, since it will have to cater to the needs of other survey partners along with all the functions it performs today. It will also have to develop and maintain systems for researchers to work with non-anonymised data while maintaining confidentiality. To play these roles, the DPD will also have to evolve technologically, since survey techniques and data management systems are moving apace.

The other major component in the process is the Survey Design and Research Division (SDRD). The SDRD currently spends most of its time (perhaps as much as 70%) writing reports, and less than 30% on survey design and research. This was necessary in the past when the NSSO did not disseminate its data, but in the current open-data scenario, the division should focus on research and experimentation, and on survey design and techniques. The data interpretation and analysis can be left to the users and other researchers who receive the data in the form of unit-level records.

However, doing research and experimentation effectively will require a vast amount of domain knowledge among the researchers in all the areas and subjects covered, which could be a challenge in a data collection agency. Thus, what is needed is to marry the SDRD’s expertise in survey techniques with domain knowledge sourced from outside. The challenge, thus, is to introduce domain knowledge into a system that generates data and information to meet the demands of the governmental system. Equally, if not more, important is to find ways to provide this domain knowledge in an ongoing manner.

In this context, three issues are relevant. The first is one of experimentation. If NSSO survey statisticians can be released from the task of report writing after a survey has been conducted, they would have more time to work on pilots and end-research projects. The NSSO has the capacity to do so, but expectations and incentives will have to be redesigned. However, experiments involve knowledge of both survey methods and the application domain, and a process of learning. As mentioned above, the NSSO can contribute to and learn from its experiments on issues of survey design, but cannot do so for the application domain. The issue, thus, is to design a method whereby the experiments can be translated into retainable knowledge within the larger survey and research system in the country.

The second is also a design issue. Today’s research questions are predominantly those that can be answered by panel surveys¹⁰. While the NSSO is trying to undertake panel surveys, there are limits to the extent to which a government system can conduct these consistently over an extended period; this may be a good point for convergence and division of labour with other agencies. Rather than every agency undertaking both cross-sectional and panel studies and trying to defend its turf, it would be preferable to introduce a system where cross-sectional, descriptive information is delivered with a high degree of accuracy by cooperating agencies using well-established, smaller panels to actually inform views on behaviour. Something similar is also true of surveys focused on qualitative indicators and on indicators which are politically sensitive. These design issues need to be discussed and resolved.

The third and most demanding issue is the dire shortage of survey statisticians in the country today. While India has historically produced a very large number of survey statisticians, at present there are very few working in the field. (The average age of survey statisticians in the NSS working groups is over 60, and in another 15 years there may be very few left.) The question is how to encourage younger people to enter the survey field and how to offer them exciting opportunities. At present, neither the NSSO nor NCAER inspires young people to become dedicated survey statisticians; it is up to the larger research community to create the right conditions.

This article is based on a paper prepared for a symposium on “Leapfrogging Methodology and Technology in Household Survey Research: Lessons from the US and India” organised by the National Council of Applied Economic Research (NCAER) in New Delhi in November 2013.

Notes:

1. The Registrar General of India (RGI) also carries out surveys, the most notable of which is the Sample Registration System (SRS), but there are others as well. In recent years, there has been a proliferation of smaller organisations undertaking sample surveys for commercial and research purposes, and a few institutions which network smaller-survey entities to carry out fairly large surveys, such as the International Institute for Population Sciences (IIPS).
2. Cross-sectional statistics refers to data collected on subjects (individuals, firms, etc.) at a given point of time, or without regard to differences in time.
3. A sampling frame is a complete listing of all entities which are sought to be surveyed, such as a list of all villages or urban blocks or factories, etc.
4. In statistics, stratification refers to the process of grouping members of the population into relatively homogeneous sub-groups before sampling. Sampling is the selection of a sub-set of individuals from within the population to estimate characteristics of the entire population.
5. Sample-selection bias occurs when the sample of individuals selected from the larger population for the purpose of statistical analysis is not representative of the population.
6. This means that while sharing data with a third party, identifiers such as names of individuals are removed; only information necessary for statistical analysis is provided in the dataset.
7. Lately the NCAER has made unit-level records available for some of its surveys, but these too have essentially been those which are funded from public resources.
8. The Act overrides whatever confidentiality provision exists in any other legislation or is imposed by any Ethics Committee.
9. Data rooms are restricted-access facilities where people can work on the datasets but cannot take them out in any manner.
10. Panel surveys involve collection of data on the same set of subjects (individuals, firms, etc.) over multiple time periods.