



Why Datetimes Need Units: Avoiding a Y2262 Problem & Harnessing the Power of NumPy's datetime64




================================================================================

Brief Summary
maximum 400 characters.


This talk will introduce the NumPy `datetime64` datatype, describing its features and performance in comparison to Python's `date` and `datetime` objects. Practical examples of working with, and converting between, these types will be provided. The usage of `datetime64` with time series data in Pandas and StaticFrame will be compared, illustrating the value of using units with `datetime64`.

# useful discussion here:
https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
And quote of Wes saying: Welcome to hell

#-------------------------------------------------------------------------------

Brief Bullet Point Outline
Brief outline.

I. Introduction
I.A. Motivation
I.B. Confusing examples
I.C. Y2262 illustrates datatype misuse.

II. Fundamentals of ``datetime64`` and Python ``datetime``
II.A. NumPy v. Python: boxed values v. contiguous bytes
II.B. Datetime data representations
II.C. Datetime units
II.D. Representation differences between ``datetime64`` and Python ``datetime``
II.E. ``timedelta64`` and ``timedelta``
II.F. Comparing constructors

III. NumPy dtype and ``datetime64``
III.A. dtype basics
III.B. dtype with units
III.C. Comparing dtype

IV. Comparing ``datetime64`` and Python ``datetime``
IV.A. Resolution
IV.B. Range
IV.C. Converting Units
IV.D. Comparing
IV.E. Shifting
IV.F. Missing Time

V. Performance Comparisons
V.A. The problem of Python ``datetime`` in object arrays
V.B. DirectFromString, ParseFromString
V.C. DeriveYear
V.D. TrueOnMonday
V.E. TrueLessThan
V.F. ShiftDay

VI. Converting between ``datetime64`` and Python ``datetime``
VI.A. ``datetime64`` to Python ``datetime``
VI.B. Python ``datetime`` to ``datetime64``

VII. Labeling Data with Datetime
VII.A. Pandas exclusive usage of nanoseconds in DatetimeIndex and Timestamp objects
VII.B. Pandas retyping any ``datetime64`` to nanoseconds
VII.C. StaticFrame datetime Index family permits usage of units

VIII. Advantages of Units in Datetime
VIII.A. Unambiguous resolution specification
VIII.B. Represent a larger range of dates and times
VIII.C. More explicit typing leads to more-maintainable code



Description
Detailed outline.


NumPy supports a datetime array datatype called ``datetime64``. Unlike Python's standard library types (``datetime`` and ``date``), ``datetime64`` supports an extensive range of time units, from year to attosecond. This specification of unit permits unambiguous resolution specification, more narrow typing of time information, and taking full advantage of time ranges that fit within the underlying representation (a 64-bit signed integer).

This talk will introduce ``datetime64`` arrays and describe their features and performance in comparison to Python's ``date`` and ``datetime`` types. Practical examples of working with, and converting between, these formats will be provided. As date and time information is particularly useful for labeled time-series data, the usage of ``datetime64`` in Pandas and StaticFrame indices will be examined. Pandas exclusive and coercive use of only a single unit (nanosecond) will be shown to lead to a "Y2262" problem and offer other disadvantages compared to StaticFrame's full support for ``datetime64`` units.

The audience for this talk is anyone working with NumPy ``datetime64`` or Pandas ``DatetimeIndex`` or ``Timestamp`` types, or those wanting to better understand the limitations of Python's ``date`` and ``datetime`` objects, particularly when used in NumPy arrays. Basic familiarity with these types is helpful but not required. This will be an informative presentation with concise code examples and practical tips for working with these types. Audience members will come away with a firm understanding of the limits and opportunities of these types, relevant for anyone working with time series data.




================================================================================

# min and max of nanosecond representation

>>> np.datetime64(np.iinfo(np.int64).max, 'ns')
numpy.datetime64('2262-04-11T23:47:16.854775807')

>>> np.datetime64(np.iinfo(np.int64).min+1, 'ns')
numpy.datetime64('1677-09-21T00:12:43.145224193')


# pd.Timestamp internal representation uses same ints as ns

>>> pd.Timestamp('1990-01-01').value
631152000000000000

>>> np.datetime64('1990-01-01', 'ns').astype(int)
631152000000000000


#-------------------------------------------------------------------------------

https://en.wikipedia.org/wiki/Year_2038_problem

relates to representing time in many digital systems as the number of seconds passed since 00:00:00 UTC on 1 January 1970 and storing it as a signed 32-bit integer. Such implementations cannot encode times after 03:14:07 UTC on 19 January 2038. Similar to the Y2K problem, the Year 2038 problem is caused by insufficient capacity used to represent time.

https://en.wikipedia.org/wiki/Time_formatting_and_storage_bugs#Year_2262



#-------------------------------------------------------------------------------

Pandas models all date or timestamp values as NumPy ``datetime64[ns]`` (nanosecond) arrays, regardless of if nanosecond-level resolution is practical or appropriate. This creates a "Y2262 problem" for Pandas: dates beyond 2262-04-11 cannot be expressed. While I can create a ``pd.DatetimeIndex`` up to 2262-04-11, one day further and Pandas raises an error.

>>> pd.date_range('1980', '2262-04-11')
DatetimeIndex(['1980-01-01', '1980-01-02', '1980-01-03', '1980-01-04',
               '1980-01-05', '1980-01-06', '1980-01-07', '1980-01-08',
               '1980-01-09', '1980-01-10',
               ...
               '2262-04-02', '2262-04-03', '2262-04-04', '2262-04-05',
               '2262-04-06', '2262-04-07', '2262-04-08', '2262-04-09',
               '2262-04-10', '2262-04-11'],
              dtype='datetime64[ns]', length=103100, freq='D')
>>> pd.date_range('1980', '2262-04-12')
Traceback (most recent call last):
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-04-12 00:00:00


As indices are often used for date-time values far less granular than nanoseconds (such as dates, months, or years), StaticFrame offers the full range of NumPy typed ``datetime64`` indices. This permits exact date-time type specification, and avoids the limits of nanosecond-based units.

While not possible with Pandas, creating an index of years or dates extending to 3000 is simple with StaticFrame.


#-------------------------------------------------------------------------------

