Challenges for the Modern Economist

Lars Vilhuber
2016-05-18


Caveats

  • I don't run my own surveys
  • I occasionally create official statistics (bias!)
  • I did my Ph.D. in the 1990s

Big data

or rather…

Alternate data sources

What is "big" data?

Big data is:

  • 8 GB?
  • 7.5 TB?
  • 309.3 million people, measured once?
  • 150 million people, measured 80 times?
  • 16,721,787,543 tables?
  • 100 countries, 5-15 times?
  • 19.3 billion records / 50TB?
  • 112 weekly data points?

Representative big data is:

  • I can't run it on my laptop
  • 2 years worth of stock trades?
  • 10 questions / population / 1 country?
  • 3 variables for 98% of one country's workforce?
  • 30+ variables for same?
  • 1% samples of 100 countries' censuses?
  • 10% of tweets?
  • 1 variable for 10% of Twitter users?

What is "big" data?

This brings up the question: How do we collect data?

How do we collect data?

Surveys

  • Aim: Informational
  • Who: Trained professionals designing, fielding, and analyzing surveys
  • Core: Well-established science of defining the population and frame
  • Stats: Primary purpose is to create statistics

Administrative

  • Aim: Administer programs
  • Who: Trained professionals running a bureaucracy, collecting necessary data
  • Core: Definition of population and frame critical, but ex post
  • Stats: Statistics about populations is a secondary purpose

Organic Data

  • Aim: … something else (Twitter?)
  • Who: Trained professionals optimizing revenue
  • Core: Population and frame often unclear
  • Stats: Public statistics at best incidental, possibly self-serving

New challenges

  • treating admin/organic as a noisy data source, different from surveys
  • designing administrative data collection with statistics in mind
  • handling large data flows in commonly accepted ways
  • novel confidentiality issues

Data collection in surveys

“Respondent load should always be considered when planning a statistical collection and there should be policies and practices in place to manage relationships with respondents. The aim should always be to keep reporting load to the minimum and to maintain the high quality of collections.”

Australian National Statistical Service

Data collection in administrative data

Incentives

  • Clients cannot get service without filling form
  • Coercion

[Image: Monsters]

Data collection discrepancies

Survey

  • where did you work (precise lat/long) in the past 10 years?
  • who did you work for in the past 10 years?

Administrative

  • IRS Form W4, line 8
  • CRA-ARC T4, box 54

Data collection discrepancies

Eliminating discrepancies:

[Image: Census3]

“In order to reduce the number of questions […] Statistics Canada will use your income data […]” *

Multiplicity of sources

Migration sources

ACS Migration (5-year)

[Image: ACS Migration]

IRS Migration (1 year)

[Image: IRS Migration]

OnTheMap for Burleigh County

[Image: OnTheMap county]

OnTheMap for North Dakota

[Image: OnTheMap county]

Job-to-job flows

[Image: J2J Migration]

Challenges

  • Underlying population is the same
  • Discrete data products
  • Reconciling differences

Potential organic migration sources

What about using organic data?

Potential organic migration sources

Twitter

  • Representativeness?
  • Reconciling differences is a challenge

Private sector sources

Google Flu

Private sector sources

BPP

ADP

I want some!

You can.

Example: Timeliness of Census 2016

A Metadata Question

[Image: consent]

Availability of Consent?

Census 2016 release

February 8, 2017

An estimate using GCS

As of 2016-05-17, 11:18

[Image: GCS-20160517]

An estimate using GCS

Google Consumer Survey

  • Question: On the Canadian Census 2016, did you consent to making your information available in 2108 (in 92 years)?
  • Goal: 500 responses
  • Fielded: 2016-05-16
  • As of: 2016-05-17 11:18, 229 responses live (estimate sketched below)
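
Not part of the deck: a minimal Python sketch of turning the responses above into an estimate, as a sample proportion with a normal-approximation 95% confidence interval. The yes-count is a hypothetical placeholder (the deck does not report the split), and any reweighting by inferred demographics is ignored here.

import math

n = 229    # responses received as of 2016-05-17 11:18
yes = 60   # hypothetical "yes" count (placeholder, not the survey result)

p_hat = yes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"consent rate: {p_hat:.1%} (95% CI: {low:.1%} to {high:.1%})")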

An Aside...

Live download from https://www.googleapis.com/consumersurveys/v2/surveys/qht3vffpx6peusl5jopzboksqm/results

{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "required",
    "message": "Login Required",
    "locationType": "header",
    "location": "Authorization"
   }
  ],
  "code": 401,
  "message": "Login Required"
 }
}
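
The 401 above is the API demanding credentials. A minimal sketch of what the authenticated call would look like, assuming the requests library and an OAuth 2.0 access token obtained separately (the token below is a placeholder):

import requests

URL = ("https://www.googleapis.com/consumersurveys/v2/"
       "surveys/qht3vffpx6peusl5jopzboksqm/results")
ACCESS_TOKEN = "ya29.placeholder-oauth-token"  # obtained via Google's OAuth flow

resp = requests.get(URL, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()  # without a valid token, this raises the 401 shown above
print(resp.json())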

The Power of...

  • immediate estimate
  • timeliness
  • “enhancement” possible
  • Cost: $50…

Other tools

Google

  • Google Trends
  • Google Correlate

Challenges to Data Availability

Data Availability

Two challenges:

  • computation and storage
  • access

Size and Computation

  • “A billion hours ago, modern homo sapiens emerged.”
  • “A billion minutes ago, Christianity began.”
  • “A billion seconds ago, the IBM PC was released.”
  • “A billion Google searches ago … was this morning.”

Hal Varian

Example: Predicting House Sales with Search Data
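
One way to reproduce this kind of exercise is to pull a search-interest series yourself. The sketch below uses the unofficial pytrends package (pip install pytrends), which wraps Google Trends but is not a Google-supported API, so treat the query and the interface as illustrative:

from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(["houses for sale"],
                       timeframe="2014-01-01 2016-05-01", geo="US")
interest = pytrends.interest_over_time()  # weekly index, scaled 0-100
print(interest.head())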

Example: Michigan Twitter feed

Benefits of using Twitter

D. Antenucci et al. (2014)

  • compared to the weekly official (US) initial UI claims series (high frequency, and without sampling error)
  • the high-frequency social media index tracks events that affect the job market in real time
  • the index has a greater signal-to-noise ratio than the official initial claims series (see the sketch below)
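
Not from the original slides: a toy Python sketch of this kind of comparison, aligning a weekly social-media index with official initial claims and computing a crude signal-to-noise ratio. All data below are synthetic placeholders, not the Antenucci et al. series.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = pd.date_range("2011-07-01", periods=112, freq="W")

# Synthetic stand-ins: a slowly improving labor market plus noise.
trend = np.linspace(400_000, 350_000, len(weeks))
claims = pd.Series(trend + rng.normal(0, 15_000, len(weeks)), index=weeks)
social = pd.Series(trend / 4_000 + rng.normal(0, 1.0, len(weeks)), index=weeks)

def snr(series, window=4):
    # Crude signal-to-noise ratio: variance of a centered moving
    # average relative to the variance of the residual around it.
    signal = series.rolling(window, center=True).mean()
    noise = series - signal
    return signal.var() / noise.var()

print("SNR, official claims:", round(snr(claims), 1))
print("SNR, social index:  ", round(snr(social), 1))
print("co-movement:        ", round(claims.corr(social), 2))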

Cost of using Twitter or other sources

  • cost of storing 60-70+ TB of data
  • scraping
  • maintaining software

How to Validate?

See later

Example: Quarterly Workforce Indicators

Based on quarterly wage reports from 50 states and DC:

  • research file (LEHD Snapshot through 2012Q1) has information on
    • 1,579,393,000 jobs for
    • 262,106,000 people in
    • 21,793,000 firms
  • statistics re-computed quarterly (full-information model)

Challenge: Consensus Sampling of Big Data

Example: Michigan Twitter feed

[Image: Social Index]

Critical Sampling by the Author Team

Clearly explained in the article:

  • 10 percent sample of all tweets (first 28 months = 43.8 TB)
  • analyze k-grams (k ≤ 4), aggregated first to days and then to weeks (see the sketch below)
  • narrow the feature space further using domain knowledge

Question: how can we second-guess?
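
Not the paper's code: a minimal Python sketch of the pipeline as described, extracting word k-grams (k ≤ 4) per tweet, counting them by day, and rolling days up to weeks. The two tweets are placeholder data.

from collections import Counter
from datetime import date, timedelta

def kgrams(text, kmax=4):
    # all word k-grams of the tweet, k = 1..kmax
    words = text.lower().split()
    for k in range(1, kmax + 1):
        for i in range(len(words) - k + 1):
            yield " ".join(words[i:i + k])

tweets = [  # (day, text) pairs; placeholder data
    (date(2011, 7, 4), "lost my job today"),
    (date(2011, 7, 5), "first day at my new job"),
]

daily = {}  # day -> Counter of k-gram counts
for day, text in tweets:
    daily.setdefault(day, Counter()).update(kgrams(text))

weekly = {}  # Monday of the week -> Counter, aggregated from days
for day, counts in daily.items():
    week = day - timedelta(days=day.weekday())
    weekly.setdefault(week, Counter()).update(counts)

for week, counts in sorted(weekly.items()):
    print(week, counts.most_common(3))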

In physics...

CERN

Data processing in physics

  • Arecibo radio telescope (since 1963)
  • Large Hadron Collider (since 2008)

Data processing in physics

  • raw data: 600 million collisions every second
  • theory says 1 in a million of interest
  • HARDWARE pre-selects (0.01%) [CERN animation]
  • SOFTWARE selects (1%) using 15,000 processors
  • STORE 25 Petabytes/year to 11 centers
  • ANALYZE in 170 centers (see the arithmetic check below)
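
A back-of-the-envelope check (mine, not CERN's) of the cascade above, using the approximate rates quoted on the slides:

RAW_RATE = 1e15                 # ~1 PB/s of raw collision data
after_hw = RAW_RATE * 1e-4      # hardware trigger keeps ~0.01%
after_sw = after_hw * 1e-2      # software trigger keeps ~1% of that
per_year = after_sw * 3600 * 24 * 365

print(f"after hardware trigger: {after_hw / 1e9:,.0f} GB/s")
print(f"after software trigger: {after_sw / 1e9:,.0f} GB/s")
# ~32 PB at a 100% duty cycle; the quoted 25 PB/year reflects downtime
print(f"stored per year:        {per_year / 1e15:,.0f} PB")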

Raw data

600 million collisions every second = 1 PB/s

Theory

1 in a million of interest

Consensus in Domain

HARDWARE pre-selects (0.01%) - throws away 99.99% of the data! Forever!

Possibility to reassess

SOFTWARE selects (1%) using 15,000 processors


For social scientists

Computing gap

  • Census Bureau vs. XSEDE
  • Statistics Canada vs. Compute Canada

Challenges in

  • moving data to compute resource, or
  • compute resource to data

For social scientists

Administrative data and organic data increase the challenge

  • Google and Twitter
    • provide storage resources
    • implement data reduction
    • … without scientific consensus

Challenges: Access

Data Collection Procedures

How are administrative data collected and stored?

  • often complex and … organic
  • sometimes obfuscated
  • very heterogeneous

Example: Quarterly Workforce Indicators

[Image: QWI dates]

Historical availability already compromised

  • cost
  • legal

Data Collection Procedures

How are organic data collected and stored?

  • incidental
  • selectively?

Example: Michigan Twitter feed

[Image: Social Index]

Access to the raw data of the Ringtail system?

Example: Twitter again (sorry)

Example: Quarterly Workforce Indicators Again

How to validate:

  • Request access from the 14 (out of 50) states that have given access to non-government researchers

Snapshot as of 2012Q1

Administrative data silos

Many of the examples above are “siloed” because of computational constraints.

Administrative data is often “siloed” for a combination of (perceived and real) confidentiality constraints and legal barriers.

Example: Quarterly Workforce Indicators (Again)

LEHD succeeded in bringing together 51 (+) state administrations, which share their data with a trusted party.

Success: QWI

Less so: Researcher access (14 out of 51)

Example: European Data

Mostly in confidential silos

Broad consensus on legal framework

[Image: Fight]

Still some issues

Example: European Data

28 countries

  • Eurostat
  • very little common data
  • no common access to administrative data

Example: Canadian Data

10 provinces + 3 territories + 1 federal government

  • 14 silos

[Image: Fight]

Challenges: Privacy

Challenges: Protection gap

  • More detailed data implies more data on specific individuals and firms
  • Are protection methods sufficient?
    • Statistics offices are concerned
    • Administrations are concerned
    • Private providers are taking action, because they are concerned.

One Solution:

Lock up the researcher

[Image: Scottish RDC]

Access methods

  • apply protection methods, produce more public-use statistics (move the data to the researcher)
  • where such methods fail, provide controlled direct access to the data (move the researcher to the data; contracts/RDCs/etc.)

Privacy and Confidentiality

One view

  • Privacy is not asking
  • Confidentiality is not revealing

Example: Census 2016

[Image: Consent1]

[Image: Consent3]

Example: Census 2016

Statistics Canada

  • asks the respondents about the data on the Census form (compulsory response!) = confidentiality
  • makes the decision to use the respondents' tax data (also compulsory response!) = privacy (or not)

An Indication of the Desire for Privacy

[Image: decline]

What level of protection?

  • older methods break down as published data become denser
  • newer methods are still being developed
    • synthetic data
    • noise infusion
  • more robust methods
    • differential privacy (sketched below)
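
As a minimal, illustrative sketch of the differential-privacy idea (not any agency's actual implementation): the Laplace mechanism releases a count only after adding noise scaled to sensitivity/epsilon, where epsilon is the knob in the utility-protection tradeoff discussed next.

import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    # Laplace mechanism: noise scale = sensitivity / epsilon;
    # smaller epsilon means more noise and stronger protection.
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

true_count = 412  # e.g., a small employment cell (hypothetical)
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: released count = {laplace_count(true_count, eps):.1f}")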

How much protection?

  • tradeoff between utility and protection
  • how much protection do data providers require (people, firms, etc.)?
  • how much utility do stakeholders request (people, firms, government, etc.)?
  • what technology is available to implement it?


Challenges: Replication

Where are we?

  • Data are more complex
  • Data are more difficult to share
  • Data are more siloed

But we like the results!

[Image: citations]

Kingi, Stanchi, Vilhuber (unpublished)

Replicating the Results

[Image: Social Index]

Difficult to replicate

Kingi, Stanchi, Vilhuber

[Image: access]

A significant number of papers are not reproducible because of access difficulties

Kingi, Stanchi, Vilhuber

[Image: type of access]

Should We Just Trust These Guys?

[Image: Car salesman]

Conclusion

Big data

Or maybe: Pervasive data

Data availability and accessibility

  • the era of public-use data is declining
    • because other data are more interesting, richer
  • data sources multiply…
    • … but also hide in new silos

Confidentiality and privacy

  • Confidentiality by obscurity is dead
  • the era of public-use data is declining
    • because confidentiality is too easy to compromise
  • data sources multiply…
    • … making the risk of confidentiality breach more obvious
    • … enforcing the creation of silos

Confidentiality and privacy

Challenge is to

  • fuse the silos (e.g., provincial data available in a national network)
  • develop new methods that are robust to confidentiality protection methods

Big data

  • new “big data” methods will become part of the canon of applied economics
  • these will include methods absorbed from computer science, biology, and physics
  • we will need to iterate towards a consensus on what data to keep (à la physics)

Replicability

  • all of the above constitute a challenge to the scientific integrity of our findings
  • for old-style data, greater replicability will be required
  • for new-style data, new methods to allow for replication need to be developed
    • building big data infrastructure may require building an access mechanism as well (synthetic data, validation server)
    • social science archives need to make an order-of-magnitude leap in archiving capability
  • funding

Thank you